Large document error

Hey, new here! Would someone kindly point me to some docs or a tutorial about building a robust index? I used this tutorial (https://bootcamp.uxdesign.cc/a-step-by-step-guide-to-building-a-chatbot-based-on-your-own-documents-with-gpt-2d550534eea5), which was super helpful to get started, but it seems to choke on larger files.

Specifically, I'm getting the error Token indices sequence length is longer.... I searched through this Discord and found a few others hitting the same thing, but couldn't work out the issue.
What's the full error/stack trace?

PDF parsing isn't always perfect. I suspect there is a huge line with no spaces between words maybe?
Here's an example.

Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model (1908 > 1024). Running this sequence through the model will result in indexing errors
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 6321 tokens

Literally just one text doc, though.
Also, it still builds the index, but it seems to struggle when I query it
Before you build the index, try printing the document object and see if the text looks correct πŸ€”
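Something like this, as a rough sketch ("data" is just a placeholder for wherever the file lives):

Python
# Sketch: eyeball the parsed text before building the index.
# "data" is a hypothetical folder containing the document.
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
print(documents[0].text[:1000])  # first 1000 characters of the first doc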
FWIW, here are the warnings when I run the query.

Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model (1943 > 1024). Running this sequence through the model will result in indexing errors
INFO:root:> [query] Total LLM token usage: 2011 tokens
INFO:root:> [query] Total embedding token usage: 7 tokens
ok, let me print the doc and see
Looks right. There are a few odd characters represented by solid squares or circles - could that be the problem?
Hmmm maybe? I kind of doubt it though

I'm not 100% sure since it's not printing a full traceback (and I can't seem to find this string in the code), but I think the error is coming from the text splitter?

You can try passing in a different text splitter from langchain when constructing the index:
https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html

index = GPTSimpleVectorIndex(documents, text_splitter=text_splitter)
Am I literally using text_splitter or is that a placeholder? If it's a placeholder, do you have a suggestion for what to use there?
Oh, I think I see. So I can break it up whenever there's a double line break, for example
Yea, you just have to create your own text splitter object and pass it in; that was just an example of what it might look like

Hope it works! Working with PDFs can be a pain sometimes πŸ˜…
index = GPTSimpleVectorIndex(documents, text_splitter=CharacterTextSplitter(separator: str = '\n\n', **kwargs: Any))

Limited Python skills - what am I doing wrong? πŸ™‚
index = GPTSimpleVectorIndex(documents, text_splitter=CharacterTextSplitter(separator='\n\n'))
NameError: name 'CharacterTextSplitter' is not defined
doesn't that come from importing langchain, though?
You'll need to import it specifically, from langchain.text_splitter import CharacterTextSplitter at the top
yeah, I tried that, but it keeps barfing
Plain Text
ImportError: cannot import name 'CharacterTextSplitter' from 'langchain' (/home/dh/.local/lib/python3.8/site-packages/langchain/__init__.py)
Hmmm one sec, I'll check their source code lol
thanks, I really appreciate all the help!
Hmm, it worked for me in a quick test.

Maybe make sure langchain is updated: pip install --upgrade langchain llama_index
[Attachment: image.png]
updated to:
Successfully installed PyYAML-6.0 langchain-0.0.121 llama-index-0.4.36

same error
If you open a python shell like my screenshot and type the command, still an error? Can you send a screenshot of what that looks like?
I'm an idiot, I found it
I had added it to another import, commented it out, then added it back in and forgot to clean it up
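For anyone following along, the working setup at this point looks roughly like this (a sketch assuming the llama_index 0.4.x / langchain 0.0.x APIs from this thread; the "data" path is a placeholder):

Python
# Sketch: build the index with a custom langchain text splitter.
from langchain.text_splitter import CharacterTextSplitter
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # hypothetical path
text_splitter = CharacterTextSplitter(separator="\n\n")
index = GPTSimpleVectorIndex(documents, text_splitter=text_splitter)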
I'm still getting this error:

Plain Text
Token indices sequence length is longer than the specified maximum sequence length for this model (1066 > 1024). Running this sequence through the model will result in indexing errors
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 5809 tokens
I'll try to print the docs and crawl through to find the issue
Yea, there must be a line that is super long πŸ€”πŸ€”
I.e. with no spaces
That's my best guess
got it, will dig in
thanks again for all the help!
Actually, now I'm really confused, lol. I tried simply changing the text splitter to be based off of \n instead of \n\n, but now the error says (1244 > 1024), implying that there is a larger chunk than when I split by \n\n. How is that possible?
(Because it was 1066 when I split on \n\n)
When I open up index.json, there are many places where the text contains \n, which I would've expected to be removed by the splitting process. Am I missing something obvious here?
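One thing that might explain both surprises, if I'm reading langchain's splitter right: CharacterTextSplitter splits on the separator and then merges pieces back together (rejoined with the separator, which is why \n still shows up in index.json) up to a chunk_size measured in characters, while the 1024 warning counts GPT-2 tokens, so a different separator can easily produce a chunk with more tokens. You can check directly; my_doc.txt is a placeholder:

Python
# Sketch: compare the longest chunk (in GPT-2 tokens) for each separator.
# chunk_size below counts characters, not tokens -- hence the mismatch.
from langchain.text_splitter import CharacterTextSplitter
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = open("my_doc.txt").read()  # hypothetical file

for sep in ["\n\n", "\n"]:
    splitter = CharacterTextSplitter(separator=sep, chunk_size=4000, chunk_overlap=200)
    chunks = splitter.split_text(text)
    longest = max(len(tokenizer.encode(chunk)) for chunk in chunks)
    print(repr(sep), len(chunks), "chunks, longest =", longest, "tokens")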
😡
Is the PDF proprietary? I can run a test or two on my side if you can share it
(I'm confused a bit as well haha)
It's just a text file, but unfortunately it is. :/
Dang... OK, one last question: when you ask a query with response = index.query(...), you can check response.source_nodes to see the nodes that built the response

Do the response nodes look good?
response.source_nodes[0].text is the one, I think (this will print the text from the first source node)
This will let you check what we showed the LLM to generate an answer to the query
Plain Text
    print(response.source_nodes[0].text)
AttributeError: 'SourceNode' object has no attribute 'text'
whoops, looked at the wrong object
response.source_nodes[0].source_text
yeah, I dropped the .text and I see it
it looks like it gets cut off at the end
Yea, that's fine. We split documents into chunks with some overlap
Different text splitters might split on sentences instead, but usually the default works fine
Is the source node relevant to the query? If it is, I'm not sure why the LLM response would be bad πŸ€”
it is, but tbf, the response on this one is good
I can try to recreate the bad one, but it was from yesterday, so maybe I'll just keep working on it and let you know if it happens again
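For reference, the inspection snippet that ended up working, as one block (the query string is a placeholder):

Python
# Sketch: check which source nodes were shown to the LLM for a query.
response = index.query("your question here")  # hypothetical query
for i, node in enumerate(response.source_nodes):
    print(f"--- source node {i} ---")
    print(node.source_text)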
is there a way to trim down what's being sent to the gpt api?
Oh, I see, maybe I can use the text splitter to decrease the chunk size
Yup! You can also set the chunk size with the default splitter like this

GPTSimpleVectorIndex(documents, chunk_size_limit=512)
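Putting the two options side by side, as a sketch (the splitter's chunk_size counts characters by default, while chunk_size_limit is in tokens, if I have the 0.4.x behavior right):

Python
# Sketch: two ways to send less text per LLM call.
from langchain.text_splitter import CharacterTextSplitter
from llama_index import GPTSimpleVectorIndex

# 1) Cap chunk size on the default splitter:
index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)

# 2) Or pass a splitter with a smaller chunk_size (characters by default):
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=512, chunk_overlap=20)
index = GPTSimpleVectorIndex(documents, text_splitter=splitter)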
nice, maybe I'll try that instead
@Logan M looks like it's working - thanks so much! Last (hopefully, right?!) question: can we use Chat instead of Instruct?
You mean use chatgpt instead of davinci?

Yes! You can basically use any model you want

Sometimes, ChatGPT doesn't quite follow internal instructions in prompts (it's a stubborn bugger), but for the most part should work and is way cheaper

Check out this notebook https://github.com/jerryjliu/llama_index/blob/main/examples/vector_indices/SimpleIndexDemo-ChatGPT.ipynb
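The gist of that notebook, as a sketch (assuming llama_index 0.4.x and a langchain recent enough to have ChatOpenAI; the query string is a placeholder):

Python
# Sketch: swap davinci for gpt-3.5-turbo via an LLMPredictor.
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
response = index.query("your question here")  # hypothetical query
print(response)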
You are a scholar and a gentleman, sir
Anytime, godspeed sir @curator.sol