Updated 5 months ago

Does anyone have any insight into how

At a glance

The community member is having an issue with the SimpleKeywordTableIndex, where it is not indexing all the important words from the text they are providing. Another community member suggests that the issue might be related to the max_keywords_per_chunk parameter, which limits the number of keywords extracted per text chunk. They recommend increasing this parameter to allow more keywords to be indexed. The main contributor to the library confirms this and points the community member to the API documentation, which mentions this parameter.

Does anyone have any insight into how SimpleKeywordTableIndex chooses which words it indexes? I'm trying to build something with it where I give it several text nodes (each about 200 tokens long), but when I inspect the _index_struct.table of keywords, many important words are completely missing. Does anybody know about this?
4 comments
simple keyword table index is very simple

def simple_extract_keywords(
    text_chunk: str, max_keywords: Optional[int] = None, filter_stopwords: bool = True
) -> Set[str]:
    """Extract keywords with simple algorithm."""
    tokens = [t.strip().lower() for t in re.findall(r"\w+", text_chunk)]
    if filter_stopwords:
        tokens = [t for t in tokens if t not in globals_helper.stopwords]
    value_counts = pd.Series(tokens).value_counts()
    keywords = value_counts.index.tolist()[:max_keywords]
    return set(keywords)


Probably max_keywords is cutting off some keywords, so they never make it into the table.

You can set max_keywords_per_chunk to change this. Default is 10

index = SimpleKeywordTableIndex.from_documents(documents, max_keywords_per_chunk=20)
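
To make the truncation concrete, here is a minimal standalone sketch of the same idea. It is not the library's code: it assumes `collections.Counter` in place of pandas and a tiny hypothetical `STOPWORDS` set in place of `globals_helper.stopwords`, and the name `extract_keywords_sketch` is made up for illustration. It only shows how a keyword cap drops the less frequent words in a chunk.

```python
import re
from collections import Counter
from typing import Optional, Set

# Hypothetical, tiny stopword list for illustration only;
# the real library uses a much larger list.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def extract_keywords_sketch(text: str, max_keywords: Optional[int] = None) -> Set[str]:
    """Rank non-stopword tokens by frequency and keep at most max_keywords."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # most_common(None) returns every token; most_common(n) keeps the n most frequent.
    ranked = [word for word, _ in Counter(tokens).most_common(max_keywords)]
    return set(ranked)


text = (
    "The index stores keywords. The index maps keywords to nodes, "
    "and nodes hold the text of the index."
)

capped = extract_keywords_sketch(text, max_keywords=3)
uncapped = extract_keywords_sketch(text)

# Words that appear in the chunk but are dropped by the cap.
print(sorted(uncapped - capped))
```

Words that occur only once in a chunk are the first to be cut, which would explain why important but infrequent terms go missing from the keyword table when the cap is low.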
ahhh thank you so much! How did you get to this answer? Did you search the source code for SimpleKeywordTableIndex?

I've been looking around in the public docs for a while but didn't see anything in there at all about max_keywords_per_chunk
yea I have the code open right now (It might help that I'm the main contributor to the library lol)

Yea the docs on the keyword index are probably not ideal

I do see it in the API docs (but our API docs are 💩 so I don't blame anyone for missing it)
https://docs.llamaindex.ai/en/stable/api_reference/indices/table.html#llama_index.indices.keyword_table.SimpleKeywordTableIndex
haha being the main contributor definitely helps 😉

and ah ok I see it now. appreciate the pointer!