do you have a sense of optimum chunk size and overlap, for technical documents, like ReadTheDocs or a github repo?
I used a tree index with 300 chunk size limit, and 200 overlap, and it was pretty good, but a bit slow. Just wondering what others use to query technical information.
wowza, that is a large overlap with a small chunk size. Interesting though!
Personally, I haven't played with the settings too much. Usually setting the chunk size limit is enough tweaking to get good results
The best improvement usually comes from document pre-processing. If you can split your documents into clear logical sections/chunks ahead of time, this will help
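For example, something roughly like this for markdown docs (just a sketch: the header-splitting rule and the old Document(text) constructor are assumptions, adapt it to your source format):

from llama_index import Document

def split_on_headers(markdown_text: str) -> list[Document]:
    # naive pre-processing: start a new Document at every "## " header so each
    # chunk is one logical section instead of an arbitrary token window
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("## ") and current:
            sections.append(Document("\n".join(current)))
            current = []
        current.append(line)
    if current:
        sections.append(Document("\n".join(current)))
    return sections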
well, that chunk overlap is just what I saw in the error when I reduced the chunk size to 100. so 200 is a default somewhere.
Yup, it's the default in the initial text splitter
My goal is to index a bunch of ReadTheDocs, and make the indexes available for others.
In addition to changing the prompt helper (which only runs at query time), you can customize the node parser like this (it's what gets used in .from_documents()):
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index import ServiceContext
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=XX, chunk_overlap=XX))
service_context = ServiceContext.from_defaults(..., node_parser=node_parser)
hopefully I got all those imports/attributes right, just wrote that looking at the source code lol
ah, ok. I will look at that.
I'm using the github loader, from llama-hub
Yea for sure. Just need to pass in the service context object when you call from_documents (or load from disk)
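Something like this (just a sketch, assuming GPTSimpleVectorIndex; swap in whatever index class you're using):

from llama_index import GPTSimpleVectorIndex

# build time: documents get chunked by the node_parser inside the service context
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

# and again when loading from disk, so queries pick up the same settings
index = GPTSimpleVectorIndex.load_from_disk("index.json", service_context=service_context)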
thanks for helping.
I'm out of the loop here, let's recap please: the prompt helper also takes a max_chunk_overlap, which the service context doesn't... but this is needed at index construction time too, right?
I'm debugging an issue where I'm getting a huge overlap with small chunk sizes, and I'm probably parameterizing it wrong..
Right, so there are two places where documents are chunked:
- During index construction, using the node_parser
- During queries, using the prompt_helper
If you set the chunk_size_limit to be very small, you might want to adjust the default chunk_overlap in the node_parser example I gave above (the default is 200)
I see, thanks! So the ServiceContext can take chunk_size_limit directly, but not the overlap
correct. This is because for queries vs. index construction, you need different sizes of overlap
That leaves me wondering why the service context can accept only one of the highly related params
Which parameter is the service context missing? Since the queries vs. index construction use different processes, either can have the chunk overlap specified using the node parser (index construction) or prompt helper (query time)
Both of these can be set in the service context
it has chunk_size_limit, but not overlap, when constructing using .from_defaults
I can set both through the text splitter of the node parser
Right. And this all comes down to generally you want a smaller overlap at query time, but larger during index construction with the node parser
A very janky example, showing where/how to set everything
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=3900, chunk_overlap=200))
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(..., prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=chunk_size_limit)
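And a filled-in version with imports and example numbers (values are placeholders, tune them for your model's context window):

from llama_index import PromptHelper, ServiceContext
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# bigger chunks + overlap for index construction
node_parser = SimpleNodeParser(
    text_splitter=TokenTextSplitter(chunk_size=1024, chunk_overlap=200)
)

# smaller overlap at query time: (max_input_size, num_output, max_chunk_overlap)
prompt_helper = PromptHelper(4096, 256, 20)

service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    node_parser=node_parser,
)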
Definitely, the user experience here could be improved π
Very open to any PRs around this
It is starting to make sense!
I just updated some code that I wrote when there were no service context or node parsers, and everything is on fire now
Ah I see, I see. LlamaIndex introduced some big changes around v0.5.0 to support all this a little differently. The docs and notebooks should have all the updated usages
one more thing.. in vector index query, similarity_top_k is deprecated as well?
Nope, that should still work
hmm okay, is there some hidden threshold for embedding similarities?
Nope, but the default top k is 1
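e.g. something like this (a sketch using the index.query interface, adjust to however you're querying):

# the default is similarity_top_k=1, ask for more nodes explicitly
response = index.query(
    "How do I customize the node parser?",  # example query
    similarity_top_k=3,
)
print(response.source_nodes)  # the retrieved chunks (and their similarity scores)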
well then what can make a vector index return zero nodes for some queries and k nodes for others
..if not a hidden threshold
never had a problem with the reliability of vector stores before
Did it actually return zero nodes? How did you check that?
I agree, something seems fishy haha
response.source_nodes is empty
did it report any LLM token usage when you ran the query?
but I can slightly change the query and get a normal response
Hmmmm what would be causing that... imma check some source code
I'm parameterizing something wrong
also, here if I set chunk_size_limit on the service_context, it uses 100k tokens for embedding, which leads me to think it doesn't really respect the splitting.. it should really only need like 10k tokens to embed that data
lol I should pass in the node parser..
Let's go back to your example:
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
service_context = ServiceContext.from_defaults(..., prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=chunk_size_limit)
- I have the node parser controlling index construction
- I have the prompt helper controlling overlap (and it allows chunk size) for queries
- what does the lone chunk_size_limit in the service context achieve?
The chunk_size_limit there is used when building the default prompt helper and node parser, but since you pass in both, it looks like it's actually not doing anything
prompt_helper = prompt_helper or PromptHelper.from_llm_predictor(...
node_parser = node_parser or _get_default_node_parser(...
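So roughly (my paraphrase, not the real library code), chunk_size_limit only gets consulted when a default has to be built, i.e. when you did not pass your own prompt_helper / node_parser:

from llama_index import PromptHelper
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# illustrative only: a simplified picture of the from_defaults fallback
def build_defaults(prompt_helper=None, node_parser=None, chunk_size_limit=None):
    if prompt_helper is None:
        # chunk_size_limit only matters when the default prompt helper is built
        prompt_helper = PromptHelper(4096, 256, 20, chunk_size_limit=chunk_size_limit)
    if node_parser is None:
        # same story for the default node parser
        node_parser = SimpleNodeParser(
            text_splitter=TokenTextSplitter(chunk_size=chunk_size_limit or 1024, chunk_overlap=200)
        )
    return prompt_helper, node_parser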
there's a bug in the qdrant store I think, or at least undocumented behavior
it's passing the query string as filter to the qdrant search, when it should just search by embedding
or is that controlled by some query mode perhaps?
I'm pretty unfamiliar with the qdrant code... Not too sure on that one
def _build_query_filter(self, query: VectorStoreQuery) -> Optional[Any]:
    if not query.doc_ids and not query.query_str:
        return None
    from qdrant_client.http.models import (
        FieldCondition,
        Filter,
        MatchAny,
        MatchText,
    )

    must_conditions = []
    if query.doc_ids:
        must_conditions.append(
            FieldCondition(
                key="doc_id",
                match=MatchAny(any=[doc_id for doc_id in query.doc_ids]),
            )
        )
    if query.query_str:
        must_conditions.append(
            FieldCondition(
                key="text",
                match=MatchText(text=query.query_str),
            )
        )
    import IPython; IPython.embed()
    return Filter(must=must_conditions)
this is qdrant index code
it is adding the query string as a filter which qdrant MUST match..
now before I open an issue I'd like to know if some of the query modes control that
at least it is very unexpected behavior
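what I'd expect instead is a plain vector search with no text filter, something like this (a sketch against qdrant_client directly, collection name and variables are placeholders):

from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
query_embedding = [0.0] * 1536  # placeholder, use the real embedding of the query string
top_k = 2
hits = client.search(
    collection_name="my_docs",       # placeholder collection
    query_vector=query_embedding,    # search purely by embedding
    query_filter=None,               # no MUST-match filter on the raw query text
    limit=top_k,
)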
Hmmm yea seems pretty weird. It looks like it is also sending the embedding. I don't see any option that would control this behavior though
Just took a peek at the code myself
thanks for your help
there at least used to be a feature to let users filter query nodes by keywords, but this is not how it should be implemented imo
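i.e. something like this, applied after retrieval (totally illustrative, not an existing API):

# filter the retrieved node text on an explicit keyword list AFTER the vector
# search, instead of pushing the raw query string into the store as a MUST filter
def filter_by_keywords(node_texts: list[str], required_keywords: list[str]) -> list[str]:
    return [
        text for text in node_texts
        if all(kw.lower() in text.lower() for kw in required_keywords)
    ]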
I agree! Thanks for helping debug this too