@Logan M PDFs are first chunked into "documents" page by page, then each page is chunked into "nodes" that comply with the max-tokens limit, no?
That means chapters in a PDF get split page by page, sentences are cut in half when they continue on the next page, etc. Am I right? (This is what I noticed with the PDF I tried.)
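If I've read the defaults correctly, the flow is roughly this (a minimal sketch assuming the llama_index.core imports; the file name and chunk_size are just placeholders):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# For PDFs the reader emits one Document per page, not per chapter/section
docs = SimpleDirectoryReader(input_files=["my.pdf"]).load_data()
print(len(docs))  # == number of pages

# The splitter then chunks each page-Document independently, so a
# sentence that straddles a page break ends up split across two nodes
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)
```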
Also, PDF-to-plain-text conversion is horrible (not llamaindex's fault): a PDF has almost no notion of headings, no notion of columns, etc. (unless it was specifically authored for accessibility), so everything is treated as (dumb) flat text. I've also noticed that many PDF-to-text conversions end up producing one word, then a newline, then one word, then a newline.
Thus a sentence like "I like my black dog" becomes
I
like
my
black
dog
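You can see this for yourself by dumping the raw extraction; here's a quick sketch assuming pypdf (which I believe the default PDF reader uses under the hood); repr() makes the stray newlines visible:

```python
from pypdf import PdfReader

reader = PdfReader("my.pdf")  # placeholder file name
for page in reader.pages:
    # repr() shows the embedded '\n' between words, e.g. 'I\nlike\nmy\nblack\ndog'
    print(repr(page.extract_text()))
```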
And then, for a reason that eludes me, the \n characters are stripped from the text sent to be vectorized, and we end up with "ilikemyblackdog" without any spaces...
It seems that OpenAI doesn't care that the spaces are missing, but it's hard to believe it has no impact...
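Here's a toy repro of what I think is happening, plus a quick tokenizer check (a sketch only: cl100k_base is my assumption about the encoding, and the variable names are mine):

```python
import tiktoken

raw = "I\nlike\nmy\nblack\ndog"
fused = raw.replace("\n", "")    # -> 'Ilikemyblackdog', what I'm seeing
spaced = raw.replace("\n", " ")  # -> 'I like my black dog', what I'd expect

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(fused)), len(enc.encode(spaced)))
# The fused string breaks into different subword pieces than the spaced
# one, so I'd expect the resulting embedding to shift as well
```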