Updated 3 months ago

Hi, new here and learning to use the GPTSimpleVectorIndex.
I am attempting to index a larger documentation file (consists of multiple files).

When using SimpleDirectoryReader to get the documents array, GPTSimpleVectorIndex(documents) throws this error:
Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model

I know it's too much content for the model, but what I don't understand is how to index the entire documentation.

I attempted to break the information into chunks and index them separately, but indexing in the following way throws the same warning:
Plain Text
for doc in documents:
    indexes.append(GPTSimpleVectorIndex([doc]))


Been trying to debug this with GPT4 but I think I need some human intellect here πŸ™‚

Any tips on how to index a larger dataset?
4 comments
hi @KasparTr ! this kind of sounds like a bug with our text splitter. could you print the full stack trace?

in the meantime you could try manually setting the chunk size (index = GPTSimpleVectorIndex(documents, ..., chunk_size_limit=512))
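For intuition, here's a rough sketch of what a chunk-size limit does under the hood. This is an illustration only, not llama_index's actual text splitter (which counts real model tokens, not whitespace-separated words):

```python
# Illustration: split a document's text into pieces of at most
# `chunk_size` "tokens", approximating tokens by whitespace words.
def split_into_chunks(text, chunk_size=512):
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# A 1300-word document becomes 3 chunks (512 + 512 + 276 words),
# each short enough for the embedding model.
chunks = split_into_chunks("lorem " * 1300, chunk_size=512)
```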
Sure thing. Here is the code and stack trace.

main.py
Plain Text
documents = SimpleDirectoryReader('docs').load_data()
# docs is a directory consisting of 10 .txt files
index = GPTSimpleVectorIndex(documents)


Stack trace
Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model (1644 > 1024). Running this sequence through the model will result in indexing errors
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 15910 tokens


Thank you, when I add the param chunk_size_limit=1024, I no longer get the warning. Though I'm not sure what this means: will some data not be indexed?
ooh i see. This is a warning from the tokenization model i believe. as long as the openai api doesn't throw errors you should be ok
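To illustrate what's happening: the warning in the stack trace comes from a local tokenizer with a 1024-token maximum (hence the `1644 > 1024` in the message), which is a separate limit from the one the OpenAI embedding endpoint enforces. A minimal sketch of that check, assuming the 1024 limit from the warning text:

```python
# Illustration only: mimics the tokenizer warning seen in the stack trace.
# The local tokenizer's limit (1024, per the warning) is distinct from the
# OpenAI embedding API's own, larger token limit.
MAX_SEQ_LEN = 1024  # assumed local tokenizer limit, taken from the warning

def sequence_warning(token_ids, max_len=MAX_SEQ_LEN):
    """Return the warning text if the token sequence is too long, else None."""
    if len(token_ids) > max_len:
        return (
            "Token indices sequence length is longer than the specified "
            f"maximum sequence length for this model ({len(token_ids)} > {max_len})"
        )
    return None
```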