What's the full error/stack trace?
PDF parsing isn't always perfect. I suspect there's a huge line with no spaces between words, maybe?
Here's an example.
Token indices sequence length is longer than the specified maximum sequence length for this model (1908 > 1024). Running this sequence through the model will result in indexing errors
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 6321 tokens
Literally just one text doc, though.
Also, it still builds the index, but it seems to struggle when I query it
Before you build the index, try printing the document object and see if the text looks correct 🤔
FWIW, here's the warnings when I run the query.
Token indices sequence length is longer than the specified maximum sequence length for this model (1943 > 1024). Running this sequence through the model will result in indexing errors
INFO:root:> [query] Total LLM token usage: 2011 tokens
INFO:root:> [query] Total embedding token usage: 7 tokens
ok, let me print the doc and see
Looks right. There are a few odd characters represented by solid squares or circles - could that be the problem?
Hmmm maybe? I kind of doubt it though
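One way to check whether those solid squares are real content or PDF-extraction debris is to scan for characters in Unicode categories that usually signal leftover font glyphs. A minimal sketch (the sample string is made up; in practice you'd pass in the document's text):

```python
import unicodedata

def find_odd_chars(text):
    """Return (index, codepoint) pairs for characters in categories that
    usually signal PDF-extraction debris: private use (Co), control (Cc),
    and unassigned (Cn). Newlines and tabs are allowed through."""
    odd = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in ("Co", "Cc", "Cn") and ch not in "\n\t\r":
            odd.append((i, f"U+{ord(ch):04X}"))
    return odd

# Hypothetical sample: a private-use bullet (U+F0B7) and a form feed (U+000C)
sample = "Item one\uf0b7 Item two\x0c Item three"
print(find_odd_chars(sample))  # -> [(8, 'U+F0B7'), (18, 'U+000C')]
```

Private-use glyphs like U+F0B7 are a common source of those squares; they're usually harmless to embeddings but worth knowing about.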
I'm not 100% sure since it's not printing a full traceback (and I can't seem to find this string in the code), but I thiiink the error is coming from the text splitter?
You can try passing in a different text splitter from langchain when constructing the index:
https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html
index = GPTSimpleVectorIndex(documents, text_splitter=text_splitter)
Am I literally using text_splitter or is that a placeholder? If it's a placeholder, do you have a suggestion for what to use there?
Oh, I think I see. So I can break it up whenever there's a double line break, for example
Yea, just have to create your own text splitter object and pass it in, just an example of what it might look like
Hope it works! Working with PDFs can be a pain sometimes 😅
index = GPTSimpleVectorIndex(documents, text_splitter=CharacterTextSplitter(separator: str = '\n\n', **kwargs: Any))
Limited Python skills - what am I doing wrong? 😅
index = GPTSimpleVectorIndex(documents, text_splitter=CharacterTextSplitter(separator='\n\n'))
NameError: name 'CharacterTextSplitter' is not defined
doesn't that come from importing langchain, though?
You'll need to import it specifically, from langchain.text_splitter import CharacterTextSplitter
at the top
yeah, I tried that, but it keeps barfing
ImportError: cannot import name 'CharacterTextSplitter' from 'langchain' (/home/dh/.local/lib/python3.8/site-packages/langchain/__init__.py)
Hmmm one sec, I'll check their source code lol
thanks, I really appreciate all the help!
Hmm, it worked for me in a quick test.
Maybe make sure langchain is updated: pip install --upgrade langchain llama_index
updated to:
Successfully installed PyYAML-6.0 langchain-0.0.121 llama-index-0.4.36
same error
If you open a python shell like my screenshot and type the command, still an error? Can you send a screenshot of what that looks like?
I had added it to another import, commented it out, then added it back in and forgot to clean it up
I'm still getting this error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1066 > 1024). Running this sequence through the model will result in indexing errors
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 5809 tokens
I'll try to print the docs and crawl through to find the issue
Yea, there must be a line that is super long 🤔🤔
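To hunt for a runaway line, you can scan the text for the longest piece between separators before indexing. A quick sketch (assuming you can get the document text as a plain string; the sample here is made up):

```python
def longest_piece(text, separator="\n\n"):
    """Split on the separator and return (length, preview) of the longest
    piece, so an oversized chunk is easy to spot before indexing."""
    best = max(text.split(separator), key=len)
    return len(best), best[:80]

# Hypothetical document text with one runaway paragraph
sample = "short para\n\n" + "word " * 100 + "\n\nanother short para"
length, preview = longest_piece(sample)
print(length)  # -> 500
```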
thanks again for all the help!
Actually, now I'm really confused, lol. I tried simply changing the text splitter to be based off of \n instead of \n\n, but now the error says (1244 > 1024), implying that there is a larger chunk than when I split by \n\n. How is that possible? (Because it was 1066 when I split on \n\n)
When I open up index.json, there are many places where the text contains \n, which I would've expected to be removed by the splitting process. Am I missing something obvious here?
😵
Is the PDF proprietary? I can run a test or two on my side if you can share it
(I'm confused a bit as well haha)
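One plausible explanation (treat this as a hypothesis; the toy code below illustrates the idea and is not langchain's real implementation): separator-based splitters like CharacterTextSplitter split on the separator and then greedily merge pieces back together up to a chunk-size limit. A finer separator like \n yields smaller pieces that can pack right up to the limit, so the biggest merged chunk can come out larger than any natural \n\n paragraph, and the separator itself survives inside merged chunks (which would also explain the \n in index.json):

```python
def toy_split(text, separator, chunk_size):
    """Toy separator-based splitter: split on the separator, then greedily
    merge adjacent pieces back into chunks of at most chunk_size characters
    (an oversized single piece is kept whole)."""
    chunks, current = [], ""
    for piece in (p for p in text.split(separator) if p):
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Three 17-char paragraphs, each made of two 8-char lines
text = "aaaaaaaa\nbbbbbbbb\n\ncccccccc\ndddddddd\n\neeeeeeee\nffffffff"
by_para = toy_split(text, "\n\n", chunk_size=30)  # paragraphs too big to merge
by_line = toy_split(text, "\n", chunk_size=30)    # lines pack up to the limit
print(max(len(c) for c in by_para), max(len(c) for c in by_line))  # -> 17 26
```

So a finer separator doesn't guarantee smaller chunks, only finer-grained split points.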
It's just a text file, but unfortunately it is. :/
Dang... OK, one last question: when you ask a query with response = index.query(...), you can check response.source_nodes to see the nodes that built the response
Do the response nodes look good?
response.source_nodes[0].text
I think is the one (this will print the text from the first source node)
This will let you check what we showed the LLM to generate an answer to the query
print(response.source_nodes[0].text)
AttributeError: 'SourceNode' object has no attribute 'text'
whoops, looked at the wrong object
response.source_nodes[0].source_text
yeah, I dropped the .text
and I see it
it looks like it gets cut off at the end
Yea, that's fine. We split documents into chunks with some overlap
Different text splitters might split on sentences instead, but usually the default works fine
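For reference, a sentence-level split can be approximated with a regex. A deliberately naive sketch (real sentence splitters, e.g. NLTK's, handle abbreviations and quotes much better):

```python
import re

def naive_sentence_split(text):
    """Split after ., !, or ? followed by whitespace. Deliberately naive:
    it will mis-split around abbreviations like 'e.g. this'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("First sentence. Second one! A third?"))
# -> ['First sentence.', 'Second one!', 'A third?']
```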
Is the source node relevant to the query? If it is, I'm not sure why the LLM response would be bad π€
it is, but tbf, the response on this one is good
I can try to recreate the bad one, but it was from yesterday, so maybe I'll just keep working on it and let you know if it happens again
is there a way to trim down what's being sent to the gpt api?
Oh, I see, maybe I can use the text splitter to decrease the chunk size
Yup! You can also set the chunk size with the default splitter like this
GPTSimpleVectorIndex(documents, chunk_size_limit=512)
nice, maybe I'll try that instead
@Logan M looks like it's working - thanks so much! Last (hopefully, right?!) question: can we use Chat instead of Instruct?
You are a scholar and a gentleman, sir
Anytime, godspeed sir @curator.sol :dotsCATJAM: