
Updated 5 months ago

Is there notebook regarding source


Is there a notebook on source retrieval for chunks? For example, if my chunks are 512 tokens and my query engine returns the top 3 chunks, I can't show those to the user as sources, because 512 tokens is several paragraphs.

12 comments

BioHacker

I can see two options: chunk small and, when doing retrieval, auto-merge the small chunks into larger chunks, but return the small chunks to the user.
OR
Keep the chunks large, but after detecting the right nodes, perform a second retrieval to look for the specific sentences in each chunk that are most relevant.
Any thoughts?
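
For the second option, here is a minimal sketch of what that second pass could look like. It is plain Python, not a LlamaIndex API: the helper names (`split_sentences`, `top_sentences`) are made up for illustration, the sentence splitter is naive, and bag-of-words cosine similarity stands in for whatever embedding model you would actually use.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_sentences(query, chunk_text, k=2):
    # Score every sentence of an already-retrieved chunk against the query
    # and keep the k best ones that share at least one token with it.
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(sentence))), sentence)
        for sentence in split_sentences(chunk_text)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for score, sentence in scored[:k] if score > 0]
```

You would call `top_sentences(query, node_text)` on the text of each retrieved node and show the user only the returned sentences, while the full chunk still goes to the LLM.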

Logan M

Why is returning those 3 chunks an issue in this example?

Logan M

You are trying to identify more specific pieces of text?

BioHacker

Well, if each chunk is around 512 tokens, then it will return a page of text, and if we have 3 of those, that's three pages. Think back to the SEC PDF example LlamaIndex had. When you asked a question, it would highlight a few sentences as its source.

BioHacker

Yes, basically identify the most relevant pieces in the selected node.

BioHacker

@Logan M Any thoughts on this?

Logan M

It was highlighting based on the source node text lol

You could use fuzzy matching to identify smaller pieces of source text though

For example
https://llamahub.ai/l/llama-packs/llama-index-packs-fuzzy-citation?from=

https://github.com/run-llama/llama_index/blob/3e5d0a146fcda01a984818d381f31a19287aead8/llama-index-packs/llama-index-packs-fuzzy-citation/llama_index/packs/fuzzy_citation/base.py#L29
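
To give a rough idea of the fuzzy-matching approach without installing anything, here is a small sketch using the standard library's `difflib`. This is not the code from the linked pack; `find_cited_spans` and its `min_chars` threshold are hypothetical names picked for illustration, and the matching is case-sensitive.

```python
from difflib import SequenceMatcher


def find_cited_spans(response, source_text, min_chars=20):
    # Compare the LLM response with the source node text and return the
    # stretches of the source that appear (almost) verbatim in the response.
    # autojunk=False disables difflib's popularity heuristic on long texts.
    matcher = SequenceMatcher(None, response, source_text, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        # block.b is the start offset in source_text; the final block is a
        # zero-size sentinel and is dropped by the length check.
        if block.size >= min_chars:
            start, end = block.b, block.b + block.size
            spans.append((start, end, source_text[start:end]))
    return spans
```

Each returned tuple is `(start, end, text)` with character offsets into the source node text, which is the kind of information you would need to highlight a few sentences instead of the whole chunk.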

BioHacker

This is sooo cool.

BioHacker

Thank you so much @Logan M

yoelk

Hey @Logan M, just ran into this thread. Do you know of a Streamlit example that demonstrates PDF highlighting of citations (i.e., by using fuzzy matching)?

Logan M

I'm not aware of anything like that for Streamlit, no.

yoelk

Thank you @Logan M
