do you have a sense of optimum chunk size and overlap, for technical documents, like ReadTheDocs or a github repo?
I used a tree index with 300 chunk size limit, and 200 overlap, and it was pretty good, but a bit slow. Just wondering what others use to query technical information.
wowza, that is a large overlap with a small chunk size. Interesting though!
Personally, I haven't played with the settings too much. Usually setting the chunk size limit is enough tweaking to get good results
The best improvement usually comes from document pre-processing. If you can split your documents into clear logical sections/chunks ahead of time, this will help
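For example, something roughly like this for markdown docs (just a sketch: the header-splitting rule and the old Document(text) constructor are assumptions, adapt it to your source format):

from llama_index import Document

def split_on_headers(markdown_text: str) -> list[Document]:
    # naive pre-processing: start a new Document at every "## " header so each
    # chunk is one logical section instead of an arbitrary token window
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("## ") and current:
            sections.append(Document("\n".join(current)))
            current = []
        current.append(line)
    if current:
        sections.append(Document("\n".join(current)))
    return sections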
well, that chunk overlap is just what I saw in the error when I reduced the chunk size to 100. so 200 is a default somewhere.
Yup, it's the default in the initial text splitter
My goal is to index a bunch of ReadTheDocs, and make the indexes available for others.
In addition to changing the prompt helper (which only runs at query time), you can customize the node parser like this (it's what gets used in .from_documents()):
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index import ServiceContext
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=XX, chunk_overlap=XX))
service_context = ServiceContext.from_defaults(..., node_parser=node_parser)
hopefully I got all those imports/attributes right, just wrote that looking at the source code lol
ah, ok. I will look at that.
I'm using the github loader, from llama-hub
Yea for sure. Just need to pass in the service context object when you call from_documents (or load from disk)
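Something like this (just a sketch, assuming GPTSimpleVectorIndex; swap in whatever index class you're using):

from llama_index import GPTSimpleVectorIndex

# build time: documents get chunked by the node_parser inside the service context
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

# and again when loading from disk, so queries pick up the same settings
index = GPTSimpleVectorIndex.load_from_disk("index.json", service_context=service_context)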
thanks for helping.
I'm out of the loop here, let's recap please: the prompt helper also takes a max_chunk_overlap, which the service context doesn't... but this is needed at index construction time too, right?
I'm debugging an issue where I'm getting a huge overlap with small chunk sizes, and I'm probably parameterizing it wrong..
Right, so there are two places where documents are chunked:
- During index construction, using the node_parser
- During queries, using the prompt_helper
If you set the chunk_size_limit to be very small, you might want to adjust the default chunk_overlap in the node_parser example I gave above (the default is 200)
I see, thanks! So the ServiceContext can take chunk_size_limit directly, but not the overlap
correct. This is because for queries vs. index construction, you need different sizes of overlap
That leaves me wondering why the service context can accept only one of the highly related params
Which parameter is the service context missing? Since the queries vs. index construction use different processes, either can have the chunk overlap specified using the node parser (index construction) or prompt helper (query time)
Both of these can be set in the service context
it has chunk_size_limit, but not overlap, when constructing using .from_defaults
I can set both through the text splitter of the node parser
Right. And this all comes down to generally you want a smaller overlap at query time, but larger during index construction with the node parser
A very janky example, showing where/how to set everything
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=3900, chunk_overlap=200))
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(..., prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=chunk_size_limit)
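And a filled-in version with imports and example numbers (values are placeholders, tune them for your model's context window):

from llama_index import PromptHelper, ServiceContext
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# bigger chunks + overlap for index construction
node_parser = SimpleNodeParser(
    text_splitter=TokenTextSplitter(chunk_size=1024, chunk_overlap=200)
)

# smaller overlap at query time: (max_input_size, num_output, max_chunk_overlap)
prompt_helper = PromptHelper(4096, 256, 20)

service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    node_parser=node_parser,
)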
Definitely, the user experience here could be improved π
Very open to any PRs around this
It is starting to make sense!
I just updated some code that I wrote when there were no service context or node parsers, and everything is on fire now
Ah I see, I see. LlamaIndex introduced some big changes around v0.5.0 to support all this a little differently. The docs and notebooks should have all the updated usages
one more thing.. in vector index query, similarity_top_k is deprecated as well?
Nope, that should still work
hmm okay, is there some hidden threshold for embedding similarities?
Nope, but the default top k is 1
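e.g. something like this (a sketch using the index.query interface, adjust to however you're querying):

# the default is similarity_top_k=1, ask for more nodes explicitly
response = index.query(
    "How do I customize the node parser?",  # example query
    similarity_top_k=3,
)
print(response.source_nodes)  # the retrieved chunks (and their similarity scores)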
well then what can make a vector index return zero nodes for some queries and k nodes for others
..if not a hidden threshold
never had a problem with the reliability of vector stores before
Did it actually return zero nodes? How did you check that?
I agree, something seems fishy haha
response.source_nodes is empty
did it report any LLM token usage when you ran the query?
but I can slightly change the query and get a normal response
Hmmmm what would be causing that... imma check some source code
I'm parameterizing something wrong
also, here if I set chunk_size_limit on the service_context, it uses 100k tokens for embedding, which leads me to think it doesn't really respect the splitting.. it should really only need like 10k tokens to embed that data
lol I should pass in the node parser..
Let's go back to your example:
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(..., prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=chunk_size_limit)
- I have the node parser controlling index construction
- I have the prompt helper controlling overlap (and it allows chunk size) for queries
- what does the lone chunk_size_limit in the service context achieve?
The chunk_size_limit there is used when building the default prompt helper and node parser, but since you pass in both, it looks like it's actually not doing anything
prompt_helper = prompt_helper or PromptHelper.from_llm_predictor(...
node_parser = node_parser or _get_default_node_parser(...
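So roughly (my paraphrase, not the real library code), chunk_size_limit only gets consulted when a default has to be built, i.e. when you did not pass your own prompt_helper / node_parser:

from llama_index import PromptHelper
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# illustrative only: a simplified picture of the from_defaults fallback
def build_defaults(prompt_helper=None, node_parser=None, chunk_size_limit=None):
    if prompt_helper is None:
        # chunk_size_limit only matters when the default prompt helper is built
        prompt_helper = PromptHelper(4096, 256, 20, chunk_size_limit=chunk_size_limit)
    if node_parser is None:
        # same story for the default node parser
        node_parser = SimpleNodeParser(
            text_splitter=TokenTextSplitter(chunk_size=chunk_size_limit or 1024, chunk_overlap=200)
        )
    return prompt_helper, node_parser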
there's a bug in the qdrant store I think, or at least undocumented behavior
it's passing the query string as filter to the qdrant search, when it should just search by embedding
or is that controlled by some query mode perhaps?
I'm pretty unfamiliar with the qdrant code... Not too sure on that one
def _build_query_filter(self, query: VectorStoreQuery) -> Optional[Any]:
    if not query.doc_ids and not query.query_str:
        return None
    from qdrant_client.http.models import (
        FieldCondition,
        Filter,
        MatchAny,
        MatchText,
    )

    must_conditions = []
    if query.doc_ids:
        must_conditions.append(
            FieldCondition(
                key="doc_id",
                match=MatchAny(any=[doc_id for doc_id in query.doc_ids]),
            )
        )
    if query.query_str:
        must_conditions.append(
            FieldCondition(
                key="text",
                match=MatchText(text=query.query_str),
            )
        )
    import IPython; IPython.embed()
    return Filter(must=must_conditions)
this is qdrant index code
it is adding the query string as a filter which qdrant MUST match..
now before I open an issue I'd like to know if some of the query modes control that
at least it is very unexpected behavior
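what I'd expect instead is a plain vector search with no text filter, something like this (a sketch against qdrant_client directly, collection name and variables are placeholders):

from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
query_embedding = [0.0] * 1536  # placeholder, use the real embedding of the query string
top_k = 2
hits = client.search(
    collection_name="my_docs",       # placeholder collection
    query_vector=query_embedding,    # search purely by embedding
    query_filter=None,               # no MUST-match filter on the raw query text
    limit=top_k,
)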
Hmmm yea seems pretty weird. It looks like it is also sending the embedding. I don't see any option that would control this behavior though
Just took a peek at the code myself
thanks for your help
there at least used to be a feature to let users filter query nodes by keywords, but this is not how it should be implemented imo
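i.e. something like this, applied after retrieval (totally illustrative, not an existing API):

# filter the retrieved node text on an explicit keyword list AFTER the vector
# search, instead of pushing the raw query string into the store as a MUST filter
def filter_by_keywords(node_texts: list[str], required_keywords: list[str]) -> list[str]:
    return [
        text for text in node_texts
        if all(kw.lower() in text.lower() for kw in required_keywords)
    ]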
I agree! Thanks for helping debug this too