Updated 2 years ago

Smart chunks

At a glance

Community members discuss keeping "complete" ideas in each chunk (a paragraph, section, or chapter) so that embeddings represent the text better. Some suggest that enforcing completeness at the node level, not only at the document level, would be more effective, while acknowledging that this is a harder problem. One member proposes using a large language model to pre-process the text before creating nodes, at least for small documents; another raises concerns about the speed and accuracy of that approach. The discussion also covers the difficulty of extracting text from PDFs, including lost formatting, stray line breaks, and the effect on downstream processing.

I think so! Having "complete" ideas in each chunk (whether a paragraph, a section, or a chapter) usually helps the embeddings better represent the text. Normally I would do this at the document level and let the actual nodes fall where they may lol
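To make the "complete ideas per chunk" point concrete, here is a minimal sketch that groups whole paragraphs into chunks instead of cutting at a fixed length. It is plain Python, not LlamaIndex code; the function name `chunk_by_paragraph`, the blank-line paragraph rule, and the word-count budget (a rough stand-in for a token limit) are all assumptions for illustration.

```python
def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Group whole paragraphs into chunks of roughly max_words words.

    Assumption: paragraphs are separated by blank lines, and word count
    is an acceptable proxy for the real token limit.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk rather than splitting a paragraph in the middle.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than `max_words` is kept whole and ends up as its own oversized chunk, so a real pipeline would still need a fallback splitter for that case.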
13 comments
but wouldn't “complete” nodes be more effective as well?
Instead of doing it only at the document level?
Definitely -- just a harder problem to tackle 🙂
Do you have any ideas about it?

I was thinking about calling an LLM to pre-process the text before creating the nodes
For small documents of course haha
It may work! My only concerns would be speed and accuracy of the data it writes 👀
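A rough sketch of that pre-processing idea, limited to small documents to keep the speed cost down. Nothing here is a real LLM client: `rewrite` is a hypothetical callable you would supply yourself (for example a wrapper around whatever model you use), and the `max_chars` cutoff is an arbitrary assumption.

```python
from typing import Callable


def preprocess_small_docs(
    docs: list[str],
    rewrite: Callable[[str], str],  # hypothetical: your own LLM wrapper
    max_chars: int = 4000,          # assumed cutoff for a "small" document
) -> list[str]:
    """Run rewrite() on small documents only; pass large ones through unchanged."""
    cleaned = []
    for doc in docs:
        if len(doc) <= max_chars:
            cleaned.append(rewrite(doc))
        else:
            # Skip the LLM call for large documents to limit latency and cost.
            cleaned.append(doc)
    return cleaned
```

This does nothing about the accuracy concern: the rewritten text would still need to be checked against the original before it is indexed.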
@Logan M PDFs are first chunked into "documents" page by page, then each page is chunked into "nodes" that comply with the max token limit, no?

That means chapters in a PDF are chunked page by page, and sentences are cut in half if they continue on the next page, etc. Am I right? (This is what I noticed with the PDF I tried.) ...
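If pages really are treated as separate documents, one possible workaround is to join the page texts before chunking so that a sentence running across a page break stays in one piece. The sketch below is plain Python with a deliberately naive sentence splitter; `join_pages` and `split_sentences` are hypothetical helper names, not part of any library.

```python
import re


def join_pages(pages: list[str]) -> str:
    """Concatenate page texts into one string, separated by single spaces."""
    return " ".join(p.strip() for p in pages if p.strip())


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


pages = ["The dog ran. It chased", "the ball. Then it slept."]
sentences = split_sentences(join_pages(pages))
# "It chased the ball." is now one sentence instead of two page fragments.
```

This ignores words hyphenated across a page break as well as repeated page headers and footers, both of which would need separate handling.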

Also, PDF to plain text conversion is horrible (not LlamaIndex's fault): there is almost no notion of headings, no notion of columns in PDF, etc. (unless the file was specifically made for accessibility reasons) ... everything is just treated as (stupid) text ... and I've also noticed that many PDF-to-text conversions end up with one word, then a new line, then one word, then a new line.

Thus a sentence like "I like my black dog" becomes
I
like
my
black
dog

And then, for a reason that eludes me, the \n characters are removed from the text sent to be vectorized and we end up with "ilikemyblackdog" without any spaces ...

It seems that OpenAI does not care that the spaces are missing, but it's hard to believe it doesn't have any impact ...
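For the one-word-per-line problem, a small cleanup step before embedding avoids the glued-together text: replace each run of whitespace (newlines included) with a single space instead of deleting the newlines. This is a generic sketch in plain Python, not what LlamaIndex does internally; `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace, including newlines, into one space."""
    return re.sub(r"\s+", " ", text).strip()


broken = "I\nlike\nmy\nblack\ndog"

# Deleting the newlines glues the words together:
print(broken.replace("\n", ""))      # Ilikemyblackdog

# Replacing them with spaces keeps the sentence readable:
print(normalize_whitespace(broken))  # I like my black dog
```

The trade-off is that paragraph breaks are flattened too, so if they matter for chunking, split on blank lines first and normalize each paragraph separately.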
Yeaaa extracting text from a PDF is such a complicated issue under the hood lol. You are almost better off converting the PDF to an image and running OCR LOL
I was thinking about that too
Attachment
image0.jpg
I used it one time, not sure how it would perform but it sounds like a nice test