Another question

We're noticing that some of our documents are getting split into multiple entries in the database...
For example, we've got a post that runs for ~650 words in one entry, stopping 51 words from the end of the post, and then another entry contains the last 67 words of the post.
Is there a reason for this? And is there a setting for this?
Yeah, this is due to the (relatively simple) default node-parsing approach

By default, documents get split into chunks of 1024 tokens with some overlap (1024 tokens is about 650 words it seems haha)

You can adjust the chunk size in the service context before building the index
service_context = ServiceContext.from_defaults(chunk_size=1024)

You could also pre-chunk the documents into nodes yourself if you'd prefer
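Roughly like this, for example (a sketch assuming the same legacy llama_index API used in the rest of this thread; the SimpleNodeParser import path and the from_defaults arguments can differ across versions):
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader("./data").load_data()

# pre-chunk the documents into nodes yourself, with a bigger chunk size
# (chunk_size and chunk_overlap are measured in tokens)
parser = SimpleNodeParser.from_defaults(chunk_size=2048, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# build the index directly from the pre-chunked nodes
index = VectorStoreIndex(nodes)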
Uh, it's that way in the db, so I think it has to do with the document-parsing portion
Yeah, that's what I was referring to. I can give a more complete example
Well, I just don't know where/how to specify the chunk size during that
Plain Text
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()

# here you can adjust the chunk size
service_context = ServiceContext.from_defaults(chunk_size=1024)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
Uh, so I don't need to adjust:
Plain Text
SimpleNodeParser().get_nodes_from_documents(documents)
?
Oh, I assumed you weren't using the node parser directly lol
It's a bit more complicated to customize that, but here goes
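Something along these lines (a sketch only; the TokenTextSplitter/SimpleNodeParser arguments here are assumptions and differ across llama_index versions):
Plain Text
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

# build a node parser around a splitter with your own chunk settings
text_splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=20)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

# either parse the nodes yourself...
nodes = node_parser.get_nodes_from_documents(documents)

# ...or hand the parser to the service context and let from_documents() use it
service_context = ServiceContext.from_defaults(node_parser=node_parser)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)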
Well, I'm also down for avoiding doing that
I just don't know how to do that either ;p
The example I gave above sets the chunk size in the node parser for you
So to avoid the problem you are seeing, you can increase it slightly
OR you can manually create node objects, so that they are chunked however you want
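For the manual route, a minimal sketch (assuming TextNode lives in llama_index.schema, which depends on your llama_index version):
Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# split the post wherever you want and wrap each piece in a node
text_chunks = ["first part of the post...", "the rest of the post..."]
nodes = [TextNode(text=chunk) for chunk in text_chunks]

index = VectorStoreIndex(nodes)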
Oh I see -- so by passing in the service_context, it'll convert the documents to nodes for me?
Actually, it's from_documents() that transforms documents into nodes. The service context just slightly modifies how it does that
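In other words, from_documents() is doing roughly this under the hood (a simplified sketch, not the exact internals):
Plain Text
# simplified view of what from_documents() does with the service context
nodes = service_context.node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes, service_context=service_context)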
Okay I follow. ty so much