What's the full error/stack trace?
PDF parsing isn't always perfect. I suspect there's a huge line with no spaces between words, maybe?
Here's an example.
Token indices sequence length is longer than the specified maximum sequence length for this model (1908 > 1024). Running this sequence through the model will result in indexing errors
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 6321 tokens
Literally just one text doc, though.
Also, it still builds the index, but it seems to struggle when I query it
Before you build the index, try printing the document object and see if the text looks correct 🤔
FWIW, here's the warnings when I run the query.
Token indices sequence length is longer than the specified maximum sequence length for this model (1943 > 1024). Running this sequence through the model will result in indexing errors
INFO:root:> [query] Total LLM token usage: 2011 tokens
INFO:root:> [query] Total embedding token usage: 7 tokens
ok, let me print the doc and see
Looks right. There are a few odd characters represented by solid squares or circles - could that be the problem?
Hmmm maybe? I kind of doubt it though
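One way to check whether those solid squares are real content or PDF-extraction debris is to scan for characters in Unicode categories that usually signal leftover font glyphs. A minimal sketch (the sample string is made up; in practice you'd pass in the document's text):

```python
import unicodedata

def find_odd_chars(text):
    """Return (index, codepoint) pairs for characters in categories that
    usually signal PDF-extraction debris: private use (Co), control (Cc),
    and unassigned (Cn). Newlines and tabs are allowed through."""
    odd = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in ("Co", "Cc", "Cn") and ch not in "\n\t\r":
            odd.append((i, f"U+{ord(ch):04X}"))
    return odd

# Hypothetical sample: a private-use bullet (U+F0B7) and a form feed (U+000C)
sample = "Item one\uf0b7 Item two\x0c Item three"
print(find_odd_chars(sample))  # -> [(8, 'U+F0B7'), (18, 'U+000C')]
```

Private-use glyphs like U+F0B7 are a common source of those squares; they're usually harmless to embeddings but worth knowing about.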
I'm not 100% sure since it's not printing a full traceback (and I can't seem to find this string in the code), but I thiiink the error is coming from the text splitter?
You can try passing in a different text splitter from langchain when constructing the index:
https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html
index = GPTSimpleVectorIndex(documents, text_splitter=text_splitter)
Am I literally using text_splitter or is that a placeholder? If it's a placeholder, do you have a suggestion for what to use there?
Oh, I think I see. So I can break it up whenever there's a double line break, for example
Yea, just have to create your own text splitter object and pass it in, just an example of what it might look like
Hope it works! Working with PDFs can be a pain sometimes 😅
index = GPTSimpleVectorIndex(documents, text_splitter=CharacterTextSplitter(separator: str = '\n\n', **kwargs: Any))
Limited Python skills - what am I doing wrong? 😅
index = GPTSimpleVectorIndex(documents, text_splitter=CharacterTextSplitter(separator='\n\n'))
NameError: name 'CharacterTextSplitter' is not defined
doesn't that come from importing langchain, though?
You'll need to import it specifically, from langchain.text_splitter import CharacterTextSplitter
at the top
yeah, I tried that, but it keeps barfing
ImportError: cannot import name 'CharacterTextSplitter' from 'langchain' (/home/dh/.local/lib/python3.8/site-packages/langchain/__init__.py)
Hmmm one sec, I'll check their source code lol
thanks, I really appreciate all the help!
Hmm, it worked for me in a quick test.
Maybe make sure langchain is updated: pip install --upgrade langchain llama_index
updated to:
Successfully installed PyYAML-6.0 langchain-0.0.121 llama-index-0.4.36
same error
If you open a python shell like my screenshot and type the command, still an error? Can you send a screenshot of what that looks like?
I had added it to another import, commented it out, then added it back in and forgot to clean it up
I'm still getting this error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1066 > 1024). Running this sequence through the model will result in indexing errors
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 5809 tokens
I'll try to print the docs and crawl through to find the issue
Yea, there must be a line that is super long 🤔🤔
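To hunt for a runaway line, you can scan the text for the longest piece between separators before indexing. A quick sketch (assuming you can get the document text as a plain string; the sample here is made up):

```python
def longest_piece(text, separator="\n\n"):
    """Split on the separator and return (length, preview) of the longest
    piece, so an oversized chunk is easy to spot before indexing."""
    best = max(text.split(separator), key=len)
    return len(best), best[:80]

# Hypothetical document text with one runaway paragraph
sample = "short para\n\n" + "word " * 100 + "\n\nanother short para"
length, preview = longest_piece(sample)
print(length)  # -> 500
```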
thanks again for all the help!
Actually, now I'm really confused, lol. I tried simply changing the text splitter to be based off of \n instead of \n\n, but now the error says (1244 > 1024), implying that there is a larger chunk than when I split by \n\n. How is that possible? (Because it was 1066 when I split on \n\n)
When I open up index.json, there are many places where the text contains \n, which I would've expected to be removed by the splitting process. Am I missing something obvious here?
😵
Is the PDF proprietary? I can run a test or two on my side if you can share it
(I'm confused a bit as well haha)
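One plausible explanation (treat this as a hypothesis; the toy code below illustrates the idea and is not langchain's real implementation): separator-based splitters like CharacterTextSplitter split on the separator and then greedily merge pieces back together up to a chunk-size limit. A finer separator like \n yields smaller pieces that can pack right up to the limit, so the biggest merged chunk can come out larger than any natural \n\n paragraph, and the separator itself survives inside merged chunks (which would also explain the \n in index.json):

```python
def toy_split(text, separator, chunk_size):
    """Toy separator-based splitter: split on the separator, then greedily
    merge adjacent pieces back into chunks of at most chunk_size characters
    (an oversized single piece is kept whole)."""
    chunks, current = [], ""
    for piece in (p for p in text.split(separator) if p):
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Three 17-char paragraphs, each made of two 8-char lines
text = "aaaaaaaa\nbbbbbbbb\n\ncccccccc\ndddddddd\n\neeeeeeee\nffffffff"
by_para = toy_split(text, "\n\n", chunk_size=30)  # paragraphs too big to merge
by_line = toy_split(text, "\n", chunk_size=30)    # lines pack up to the limit
print(max(len(c) for c in by_para), max(len(c) for c in by_line))  # -> 17 26
```

So a finer separator doesn't guarantee smaller chunks, only finer-grained split points.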
It's just a text file, but unfortunately it is. :/
Dang... OK, one last question: when you ask a query with response = index.query(...), you can check response.source_nodes to see the nodes that built the response
Do the response nodes look good?
response.source_nodes[0].text
I think is the one (this will print the text from the first source node)
This will let you check what we showed the LLM to generate an answer to the query
print(response.source_nodes[0].text)
AttributeError: 'SourceNode' object has no attribute 'text'
whoops, looked at the wrong object
response.source_nodes[0].source_text
yeah, I dropped the .text
and I see it
it looks like it gets cut off at the end
Yea, that's fine. We split documents into chunks with some overlap
Different text splitters might split on sentences instead, but usually the default works fine
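For reference, a sentence-level split can be approximated with a regex. A deliberately naive sketch (real sentence splitters, e.g. NLTK's, handle abbreviations and quotes much better):

```python
import re

def naive_sentence_split(text):
    """Split after ., !, or ? followed by whitespace. Deliberately naive:
    it will mis-split around abbreviations like 'e.g. this'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("First sentence. Second one! A third?"))
# -> ['First sentence.', 'Second one!', 'A third?']
```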
Is the source node relevant to the query? If it is, I'm not sure why the LLM response would be bad π€
it is, but tbf, the response on this one is good
I can try to recreate the bad one, but it was from yesterday, so maybe I'll just keep working on it and let you know if it happens again
is there a way to trim down what's being sent to the gpt api?
Oh, I see, maybe I can use the text splitter to decrease the chunk size
Yup! You can also set the chunk size with the default splitter like this
GPTSimpleVectorIndex(documents, chunk_size_limit=512)
nice, maybe I'll try that instead
@Logan M looks like it's working - thanks so much! Last (hopefully, right?!) question: can we use Chat instead of Instruct?
You are a scholar and a gentleman, sir
Anytime, godspeed sir @curator.sol :dotsCATJAM: