Updated 3 months ago

Hi, new here and learning to use the GPTSimpleVectorIndex.
I am attempting to index a larger documentation file (consists of multiple files).

When using SimpleDirectoryReader to get the documents array, GPTSimpleVectorIndex(documents) throws this error:
Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model

I know it's too much content for the model, but what I don't understand is how to index the entire documentation.

I attempted to break the information into chunks and index them separately, but indexing in the following way throws the same warning:
Plain Text
for doc in documents:
    indexes.append(GPTSimpleVectorIndex([doc]))


Been trying to debug this with GPT4 but I think I need some human intellect here πŸ™‚

Any tips on how to index a larger dataset?
4 comments
hi @KasparTr ! this kind of sounds like a bug with our text splitter. could you print the full stack trace?

in the meantime you could try manually setting the chunk size (index = GPTSimpleVectorIndex(documents, ..., chunk_size_limit=512))
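For intuition, here's a rough sketch of what a chunk-size limit does under the hood. This is an illustration only, not llama_index's actual text splitter (which counts real model tokens, not whitespace-separated words):

```python
# Illustration: split a document's text into pieces of at most
# `chunk_size` "tokens", approximating tokens by whitespace words.
def split_into_chunks(text, chunk_size=512):
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# A 1300-word document becomes 3 chunks (512 + 512 + 276 words),
# each short enough for the embedding model.
chunks = split_into_chunks("lorem " * 1300, chunk_size=512)
```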
Sure thing. Here is the code and stack trace.

main.py
Plain Text
documents = SimpleDirectoryReader('docs').load_data()
# docs is a directory consisting of 10 .txt files
index = GPTSimpleVectorIndex(documents)


Stack trace
Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model (1644 > 1024). Running this sequence through the model will result in indexing errors
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 15910 tokens


Thank you, when I add the param chunk_size_limit=1024, I no longer get the warning. Though I'm not sure what this means: will some data not be indexed?
ooh i see. This is a warning from the tokenization model i believe. as long as the openai api doesn't throw errors you should be ok
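To illustrate what's happening: the warning in the stack trace comes from a local tokenizer with a 1024-token maximum (hence the `1644 > 1024` in the message), which is a separate limit from the one the OpenAI embedding endpoint enforces. A minimal sketch of that check, assuming the 1024 limit from the warning text:

```python
# Illustration only: mimics the tokenizer warning seen in the stack trace.
# The local tokenizer's limit (1024, per the warning) is distinct from the
# OpenAI embedding API's own, larger token limit.
MAX_SEQ_LEN = 1024  # assumed local tokenizer limit, taken from the warning

def sequence_warning(token_ids, max_len=MAX_SEQ_LEN):
    """Return the warning text if the token sequence is too long, else None."""
    if len(token_ids) > max_len:
        return (
            "Token indices sequence length is longer than the specified "
            f"maximum sequence length for this model ({len(token_ids)} > {max_len})"
        )
    return None
```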