My data is a collection of documents

My data is a collection of documents containing long lists of items, along with descriptions of both the items and the sets that contain them (not organized in any particular way, since the data was scraped). When indexing, I always get the error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1050 > 1024). Running this sequence through the model will result in indexing errors
However, even though the index is created successfully, I often get incorrect responses, most commonly something like "X is not in set Y" even when X appears in the list for that set. My assumption is that the documents are longer than the maximum chunk size and are being split into multiple chunks (with a little bit of overlap), so I end up with situations where X appears in the second chunk of a list, but that chunk has no context for what the name of the set is? Sorry if any part of this is confusing; I am trying to verify that I understand why I am getting bad responses from my data.

Would the solution to this problem be to manipulate the data so that every document is under 1024 tokens? Or, instead of having lists formatted like "Set Y: -x1 -x2 ...", to have something like "x1 is in Set Y. x2 is in Set Y. ..."? A sketch of that second idea follows.
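For illustration, here is a minimal preprocessing sketch of that second approach: rewriting each list so every item line repeats its set name, so that any chunk a text splitter produces still carries the set context. The "Set Y: -x1 -x2" input format and the names are hypothetical, taken from the example above; a real scrape would likely need a more robust parser.

```python
import re

def expand_set_lists(text: str) -> str:
    """Rewrite 'Set Y: -x1 -x2 ...' style lists as one
    'x1 is in Set Y.' sentence per item, so each chunk
    produced by a text splitter keeps the set name."""
    lines = []
    for raw in text.splitlines():
        # Hypothetical format: 'SetName: -item1 -item2 ...'
        match = re.match(r"^(.+?):\s*(-.+)$", raw)
        if match:
            set_name, items_part = match.groups()
            items = [i.strip() for i in items_part.split("-") if i.strip()]
            lines.extend(f"{item} is in {set_name}." for item in items)
        else:
            lines.append(raw)
    return "\n".join(lines)

print(expand_set_lists("Set Y: -x1 -x2 -x3"))
# x1 is in Set Y.
# x2 is in Set Y.
# x3 is in Set Y.
```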
2 comments
That warning is actually unrelated/harmless

In older versions of llama-index, Python 3.8 would default to a GPT-2 tokenizer from Hugging Face. This was used for token counting.

Newer versions of llama-index should not have this issue/warning

The bad responses are likely related to other settings? Tbh, the first thing I would do is try updating (though there may be some breaking changes if your version/code is quite old).
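For reference, on newer llama-index versions you can set the tokenizer used for token counting explicitly, and also shrink the chunk size so long lists are split more predictably. A minimal sketch, assuming a recent llama_index.core install; the model name and the specific chunk values here are just example choices:

```python
import tiktoken
from llama_index.core import Settings

# Use a tiktoken tokenizer for token counting instead of the
# old GPT-2 fallback (the source of the 1024-token warning).
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

# Smaller chunks with some overlap, so long item lists are
# split into pieces the embedding model can handle.
Settings.chunk_size = 512
Settings.chunk_overlap = 50
```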
Thanks, I will try that.