My data is a collection of documents

My data is a collection of documents containing long lists of items, along with descriptions of both the items and the sets that contain them (not organized in any particular way, since the data was scraped). When indexing, I always get the error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1050 > 1024). Running this sequence through the model will result in indexing errors
However, even though the index is created successfully, I often get incorrect responses, most commonly something like "X is not in set Y" even when X appears in the list for that set. My assumption is that the documents are longer than the maximum chunk size and are being split into multiple chunks (with a little bit of overlap), so I end up with situations where X appears in the second chunk of a list, but that chunk has no context for what the name of the set is? Sorry if any part of this is confusing; I am trying to verify that I understand why I am getting bad responses from my data.

Would the solution to this problem be to manipulate the data so that every document is under 1024 tokens? Or, instead of having lists formatted like "Set Y: -x1 -x2 ...", to have something like "x1 is in Set Y. x2 is in Set Y. ..."? A sketch of that second idea follows.
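For illustration, here is a minimal preprocessing sketch of that second approach: rewriting each list so every item line repeats its set name, so that any chunk a text splitter produces still carries the set context. The "Set Y: -x1 -x2" input format and the names are hypothetical, taken from the example above; a real scrape would likely need a more robust parser.

```python
import re

def expand_set_lists(text: str) -> str:
    """Rewrite 'Set Y: -x1 -x2 ...' style lists as one
    'x1 is in Set Y.' sentence per item, so each chunk
    produced by a text splitter keeps the set name."""
    lines = []
    for raw in text.splitlines():
        # Hypothetical format: 'SetName: -item1 -item2 ...'
        match = re.match(r"^(.+?):\s*(-.+)$", raw)
        if match:
            set_name, items_part = match.groups()
            items = [i.strip() for i in items_part.split("-") if i.strip()]
            lines.extend(f"{item} is in {set_name}." for item in items)
        else:
            lines.append(raw)
    return "\n".join(lines)

print(expand_set_lists("Set Y: -x1 -x2 -x3"))
# x1 is in Set Y.
# x2 is in Set Y.
# x3 is in Set Y.
```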
2 comments
That warning is actually unrelated/harmless

In older versions of llama-index, Python 3.8 would default to a GPT-2 tokenizer from Hugging Face. This was used for token counting.

Newer versions of llama-index should not have this issue/warning

The bad responses are likely related to other settings? Tbh, the first thing I would do is try updating (though there may be some breaking changes if your version/code is quite old).
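For reference, on newer llama-index versions you can set the tokenizer used for token counting explicitly, and also shrink the chunk size so long lists are split more predictably. A minimal sketch, assuming a recent llama_index.core install; the model name and the specific chunk values here are just example choices:

```python
import tiktoken
from llama_index.core import Settings

# Use a tiktoken tokenizer for token counting instead of the
# old GPT-2 fallback (the source of the 1024-token warning).
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

# Smaller chunks with some overlap, so long item lists are
# split into pieces the embedding model can handle.
Settings.chunk_size = 512
Settings.chunk_overlap = 50
```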
Thanks, I will try that.