The simple keyword table index uses a very simple keyword extractor:
def simple_extract_keywords(
    text_chunk: str, max_keywords: Optional[int] = None, filter_stopwords: bool = True
) -> Set[str]:
    """Extract keywords with simple algorithm."""
    tokens = [t.strip().lower() for t in re.findall(r"\w+", text_chunk)]
    if filter_stopwords:
        tokens = [t for t in tokens if t not in globals_helper.stopwords]
    value_counts = pd.Series(tokens).value_counts()
    keywords = value_counts.index.tolist()[:max_keywords]
    return set(keywords)
The max keywords limit is probably cutting off some of your keywords: only the most frequent tokens in each chunk are kept, and the rest are dropped.
You can set max_keywords_per_chunk to change this. The default is 10.
index = SimpleKeywordTableIndex.from_documents(documents, max_keywords_per_chunk=20)
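To see how the cap drops keywords, here is a rough standalone sketch. It is not the library's code: it swaps pandas for collections.Counter, and the function name extract_keywords_sketch and the tiny STOPWORDS set are made up for illustration. The idea is the same, though: rank tokens by frequency and keep only the first max_keywords.

```python
import re
from collections import Counter
from typing import Optional, Set

# Tiny illustrative stopword list (an assumption; the library uses a much larger one).
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}


def extract_keywords_sketch(text_chunk: str, max_keywords: Optional[int] = None) -> Set[str]:
    """Keep the max_keywords most frequent non-stopword tokens."""
    tokens = [t.lower() for t in re.findall(r"\w+", text_chunk)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # most_common(None) returns every token, ordered by descending count
    ranked = [tok for tok, _ in Counter(tokens).most_common(max_keywords)]
    return set(ranked)


chunk = "cats cats cats dogs dogs birds"
top_two = extract_keywords_sketch(chunk, max_keywords=2)
everything = extract_keywords_sketch(chunk)
print(top_two)     # "birds" is missing: it is the least frequent token
print(everything)  # no cap, so all three tokens survive
```

With a cap of 2, a query for "birds" would find nothing for this chunk, which is the same kind of miss you are probably seeing. Raising max_keywords_per_chunk lets rarer terms into the table, at the cost of a larger keyword table.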