Chunk overlap

How do I reduce chunk overlap?
I'm experimenting with different chunk size limits to find an optimum size for processing documentation.

Plain Text
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=100)
There are two levels of splitting/overlap

One happens at query time. You can use this guide to control that
https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-fine-grained-control-over-all-parameters

Another happens when inserting documents. You can change the node parser in the service context object to use a token splitter with different chunk overlap. I can't find a quick example of that haha but maybe start with the prompt helper settings
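
The query-time knobs from that guide look roughly like this (a minimal sketch based on the linked docs; the numbers are just placeholders):

Plain Text
from llama_index import PromptHelper, ServiceContext

# query-time settings: max LLM input size, reserved output tokens, chunk overlap
max_input_size = 4096
num_output = 256
max_chunk_overlap = 20

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(prompt_helper=prompt_helper)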
do you have a sense of an optimum chunk size and overlap for technical documents, like ReadTheDocs or a github repo?

I used a tree index with 300 chunk size limit, and 200 overlap, and it was pretty good, but a bit slow. Just wondering what others use to query technical information.
wowza, that is a large overlap with a small chunk size. Interesting though!

Personally, I haven't played with the settings too much. Usually setting the chunk size limit is enough tweaking to get good results

The best improvement usually comes from document pre-processing. If you can split your documents into clear logical sections/chunks ahead of time, this will help
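
For example, something like this (a rough sketch, assuming plain markdown files and the top-level Document class; the path and heading level are just placeholders):

Plain Text
from llama_index import Document

# split a markdown file into one Document per "## " section before indexing
with open("docs/getting_started.md") as f:  # placeholder path
    text = f.read()

sections = [s.strip() for s in text.split("\n## ") if s.strip()]
documents = [Document(section) for section in sections]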
well, that chunk overlap is just what I saw in the error when I reduced the chunk size to 100. So 200 is a default somewhere.
Yup, it's the default in the initial text splitter
My goal is to index a bunch of ReadTheDocs, and make the indexes available for others.
In addition to changing the prompt helper (which only runs during query time), you can customize the node parser like this (which is used in .from_documents())

Plain Text
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index import ServiceContext
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=XX, chunk_overlap=xx))
service_context = ServiceContext.from_defaults(..., node_parser=node_parser)
hopefully I got all those imports/attributes right, just wrote that looking at the source code lol
ah, ok. I will look at that.
I'm using the github loader from llama-hub
Yea for sure. Just need to pass in the service context object when you call from_documents (or load from disk) 👍
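
Something like this (a sketch, assuming the 0.5.x GPTSimpleVectorIndex API and documents already loaded with the github loader):

Plain Text
from llama_index import GPTSimpleVectorIndex

# build the index with the customized service context
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
index.save_to_disk("index.json")

# pass the same service context again when loading from disk
index = GPTSimpleVectorIndex.load_from_disk("index.json", service_context=service_context)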
🙏🏼 thanks for helping.
I'm out of the loop here, let's recap please 😄 The prompt helper also takes a max_chunk_overlap, which the service context doesn't... but this is also needed at index construction time, right?
I'm debugging an issue where I'm getting a huge overlap with small chunk sizes, and I'm probably parameterizing something wrong...
Right, so there are two places where documents are chunked:

  1. During index construction, using the node_parser
  2. During queries, using the prompt_helper
If you set the chunk_size_limit to be very small, you might want to adjust the default chunk_overlap in the example I gave above for the node_parser (the default is 200)
I see, thanks! So the ServiceContext can take chunk_size_limit directly, but not the overlap
correct 👍 This is because for queries vs. index construction, you need different sizes of overlap
That leaves me wondering why the service context can accept only one of the highly related params
Which parameter is the service context missing? Since the queries vs. index construction use different processes, either can have the chunk overlap specified using the node parser (index construction) or prompt helper (query time)

Both of these can be set in the service context
it has chunk_size_limit, but not overlap, when constructing using .from_defaults
I can set both through the text splitter of the node parser
Right. And this all comes down to generally you want a smaller overlap at query time, but larger during index construction with the node parser

A very janky example, showing where/how to set everything
Plain Text
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=3900, chunk_overlap=200))
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(..., prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=chunk_size_limit)
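
Filled in with concrete (placeholder) values, and assuming the same 0.5.x imports as above, that might look like:

Plain Text
from llama_index import PromptHelper, ServiceContext
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# index construction: bigger chunks, bigger overlap
node_parser = SimpleNodeParser(
    text_splitter=TokenTextSplitter(chunk_size=1024, chunk_overlap=200)
)

# query time: max input size, reserved output tokens, smaller overlap
prompt_helper = PromptHelper(max_input_size=4096, num_output=256, max_chunk_overlap=20)

# chunk_size_limit is omitted here, since node_parser and prompt_helper
# are both passed in explicitly
service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    node_parser=node_parser,
)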
Definitely, the user experience here could be improved 😅 Very open to any PRs around this
It is starting to make sense!
I just updated some code that I wrote when there were no service context or node parsers, and everything is on fire now
Ah I see, I see. Llama index introduced some big changes around v0.5.0 to support all this a little differently. The docs and notebooks should have all the updated usages
one more thing... in a vector index query, is similarity_top_k deprecated as well?
Nope, that should still work 👀
hmm okay, is there some hidden threshold for embedding similarities?
Nope, but the default top k is 1
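
i.e. you can raise it explicitly (a quick sketch, assuming the pre-0.6 query API; the query text is just an example):

Plain Text
# default similarity_top_k is 1; raise it to retrieve more source nodes
response = index.query("How do I configure chunk overlap?", similarity_top_k=3)
print(len(response.source_nodes))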
good 😄
well then, what can make a vector index return zero nodes for some queries and k nodes for others?
...if not a hidden threshold
never had a problem with the reliability of vector stores before 😄
Did it actually return zero nodes? How did you check that?

I agree, something seems fishy haha
response.source_nodes is empty
🤔🤔 did it report any LLM token usage when you ran the query?
but I can slightly change the query and get a normal response
Hmmmm what would be causing that 🤔🤔🤔 imma check some source code
I'm parameterizing something wrong
also here, if I set chunk_size_limit on the service_context, it will use 100k tokens for embedding, which leads me to think it doesn't really respect the splitting... it really only needs like 10k tokens to embed that data
lol I should pass in the node parser..
good catch! 😅
Let's go back to your example:

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(..., prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=chunk_size_limit)

  • I have the node parser controlling index construction
  • I have the prompt helper controlling overlap (and it allows chunk size) for queries
  • what does the lone chunk_size_limit in the service context achieve?
So, both the node_parser and prompt_helper use a text splitter. The lone chunk_size_limit sets that limit in both: https://github.com/jerryjliu/llama_index/blob/main/gpt_index/indices/service_context.py#L72
But since you pass in both, it looks like it's actually not doing that 😅
prompt_helper = prompt_helper or PromptHelper.from_llm_predictor(...

node_parser = node_parser or _get_default_node_parser(...
there's a bug in the qdrant store I think, or at least non-documented behavior
it's passing the query string as a filter to the qdrant search, when it should just search by embedding
or is that controlled by some query mode perhaps?
I'm pretty unfamiliar with the qdrant code 🤔 Not too sure on that one
I know who to ping 🙂
but look at this
Plain Text
def _build_query_filter(self, query: VectorStoreQuery) -> Optional[Any]:
    if not query.doc_ids and not query.query_str:
        return None
    from qdrant_client.http.models import (
        FieldCondition,
        Filter,
        MatchAny,
        MatchText,
    )

    must_conditions = []
    if query.doc_ids:
        must_conditions.append(
            FieldCondition(
                key="doc_id",
                match=MatchAny(any=[doc_id for doc_id in query.doc_ids]),
            )
        )
    if query.query_str:
        must_conditions.append(
            FieldCondition(
                key="text",
                match=MatchText(text=query.query_str),
            )
        )
    import IPython; IPython.embed()
    return Filter(must=must_conditions)
this is qdrant index code
it is adding the query string as a filter which qdrant MUST match..
now before I open an issue, I'd like to know if some of the query modes control that 😅
at least it is very unexpected behavior
Hmmm yea seems pretty weird. It looks like it is also sending the embedding. I don't see any option that would control this behavior though
Just took a peek at the code myself
thanks for your help 🙂
there at least used to be a feature to let users filter query nodes by keywords, but this is not how it should be implemented imo
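
Just to illustrate what I mean (a hypothetical sketch, not the actual fix in the library): the filter could be built from doc_ids only, leaving the query text to the embedding search:

Plain Text
from typing import Any, Optional

from qdrant_client.http.models import FieldCondition, Filter, MatchAny


def _build_query_filter(self, query: "VectorStoreQuery") -> Optional[Any]:
    # only filter on explicit doc_ids; the query string is handled by the embedding search
    if not query.doc_ids:
        return None
    return Filter(
        must=[
            FieldCondition(
                key="doc_id",
                match=MatchAny(any=list(query.doc_ids)),
            )
        ]
    )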
I agree! Thanks for helping debug this too 🙏