Another question

We're noticing that some of our documents are getting split into multiple entries in the database...
For example, we've got a post that runs for ~650 words in one entry, stopping 51 words from the end of the post, and then another entry contains the last 67 words of the post.
Is there a reason for this? And is there a setting for this?
Yeah, this is due to the (relatively simple) default node-parsing approach

By default, documents get split into chunks of 1024 tokens with some overlap (1024 tokens is about 650 words it seems haha)

You can adjust the chunk size in the service context before building the index
service_context = ServiceContext.from_defaults(chunk_size=1024)

You could also pre-chunk the documents into nodes yourself if you'd prefer
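Roughly like this, for example (a sketch assuming the same legacy llama_index API used in the rest of this thread; the SimpleNodeParser import path and the from_defaults arguments can differ across versions):
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader("./data").load_data()

# pre-chunk the documents into nodes yourself, with a bigger chunk size
# (chunk_size and chunk_overlap are measured in tokens)
parser = SimpleNodeParser.from_defaults(chunk_size=2048, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)

# build the index directly from the pre-chunked nodes
index = VectorStoreIndex(nodes)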
Uh, it's that way in the db, so I think it has to do with the document-parsing portion
Yeah, that's what I was referring to. I can give a more complete example
Well, I just don't know where/how to specify the chunk size during that
Plain Text
from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()

# here you can adjust the chunk size
service_context = ServiceContext.from_defaults(chunk_size=1024)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
Uh, so I don't need to adjust:
Plain Text
SimpleNodeParser().get_nodes_from_documents(documents)
?
Oh, I assumed you weren't using the node parser directly lol
It's a bit more complicated to customize that, but here goes
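Something along these lines (a sketch only; the TokenTextSplitter/SimpleNodeParser arguments here are assumptions and differ across llama_index versions):
Plain Text
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

# build a node parser around a splitter with your own chunk settings
text_splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=20)
node_parser = SimpleNodeParser(text_splitter=text_splitter)

# either parse the nodes yourself...
nodes = node_parser.get_nodes_from_documents(documents)

# ...or hand the parser to the service context and let from_documents() use it
service_context = ServiceContext.from_defaults(node_parser=node_parser)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)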
Well, I'm also down for avoiding doing that
I just don't know how to do that either ;p
The example I gave above sets the chunk size in the node parser for you
So to avoid the problem you are seeing, you can increase it slightly
OR you can manually create node objects, so that they are chunked however you want
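For the manual route, a minimal sketch (assuming TextNode lives in llama_index.schema, which depends on your llama_index version):
Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# split the post wherever you want and wrap each piece in a node
text_chunks = ["first part of the post...", "the rest of the post..."]
nodes = [TextNode(text=chunk) for chunk in text_chunks]

index = VectorStoreIndex(nodes)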
Oh I see -- so by passing in the service_context, it'll convert the documents to nodes for me?
Actually, it's from_documents() that transforms documents into nodes. The service context just slightly modifies how it does that
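In other words, from_documents() is doing roughly this under the hood (a simplified sketch, not the exact internals):
Plain Text
# simplified view of what from_documents() does with the service context
nodes = service_context.node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes, service_context=service_context)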
Okay I follow. ty so much