Parsing pdfs

Hi folks,
I'm using SimpleDirectoryReader + pdfs to chunk up files into some fixed size, querying with GPTSimpleVectorIndex. One issue I have is that it seems quite arbitrary where the chunking happens, and that can create some very unpredictable results. If the split happens to fall in the middle of a paragraph, the embedding quality drops and the query doesn't give the right answer. Adding top_k=2 (or more) doesn't help, as the paragraph itself is already broken.

I was wondering if there are any recommended ways of splitting PDFs into more logical chunks (pages, paragraphs), or at least of introducing a much bigger overlap between chunks. I haven't been able to do this with max_chunk_overlap so far, and am considering writing my own pdf->json parser instead - but I'd love to hear if anyone else has encountered this?
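For reference, here's roughly my current setup (legacy gpt_index/llama_index API - exact constructor names vary between releases):
Plain Text
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load every pdf in the folder and build a vector index over fixed-size chunks
documents = SimpleDirectoryReader("./pdfs").load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

# similarity_top_k retrieves more chunks, but it can't repair a paragraph
# that was already split mid-sentence at indexing time
response = index.query("my question here", similarity_top_k=2)
print(response)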
I know unstructured.io can parse into more logical elements for you, but I haven't checked it out too much.

There is also a sentence splitter that you can use instead of the token splitter now 💪
@Runonthespot I'm having a similar experience parsing PDF. Feel free to dm me if you'd like to collaborate on understanding how to index. It's not very straightforward to me atm. Maybe spitballing ideas might help?
what's a good resources on using splitters?
Seems like the docs haven't quite caught up with this yet.

Here's an example I made just now after reading the source code though lol
Plain Text
from llama_index import ServiceContext
from llama_index.langchain_helpers.text_splitter import SentenceSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# Swap the default token splitter for sentence-aware splitting
node_parser = SimpleNodeParser(text_splitter=SentenceSplitter())
service_context = ServiceContext.from_defaults(node_parser=node_parser)


There are a few settings to the splitter you can set too, here's the class def https://github.com/jerryjliu/llama_index/blob/main/gpt_index/langchain_helpers/text_splitter.py#L239
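For example (illustrative values only - check the linked class def for the exact defaults and full parameter list):
Plain Text
# Assumes the imports from the snippet above; values here are just examples
splitter = SentenceSplitter(
    chunk_size=512,     # target chunk size, in tokens
    chunk_overlap=64,   # tokens shared between adjacent chunks
)
node_parser = SimpleNodeParser(text_splitter=splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)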
There's also a (very rough) notebook here... that notebook needs to be cleaned up lol https://github.com/jerryjliu/llama_index/blob/main/examples/paul_graham_essay/SentenceSplittingDemo.ipynb
these are good places to start
thanks @Logan M
@conic I'm working on something similar if you wanna collab
I do want to collab
I'll add you as a friend on discord, if you want we can do a private VC to figure out what we understand/don't understand about this.
for sure. Have you started playing around with a pdf yet?
Not yet. It's on the list of things that I'm trying to understand. I'm reading through notebooks. I'm also trying to overcome some issues with indexing data for GPT-3.5-turbo.
I think if anything we can/should probably just share/sync understandings to see if we understand how indexing works in the first place
I have the same issue... we need the splitter to be smarter and to break up text by section, paragraph, and sentence vs. a fixed character length like it does now
Really struggling to find this in the documentation, but do you know if there's a way to do token-aware sentence splitting?
Or is the function already doing this?
Looking at the code, it already does this, cool!
Great tip, thank you
Definitely up for collaborating. Right now I have a few ideas… One idea is to use pymupdf or unstructured and then store the output as json annotated with doc name, page, paragraph number etc., as the json reader does some cool stuff to include fields at higher levels. I'd really like the source nodes to have enough information to zero in on the paragraphs being used too, for highlighting, but this needs to be traded off against embedding quality, which drops if the chunk size is small.
@Runonthespot I'm working on unstructured today. Hit me up if you wanna share some code.
I've been trying out pymupdf which is also looking promising
Plain Text
import json
import fitz

def get_block_type(block):
    # page.get_text("blocks") returns tuples of the form
    # (x0, y0, x1, y1, text, block_no, block_type),
    # where block_type is 0 for text and 1 for images
    if block[6] == 0:
        return "text"
    elif block[6] == 1:
        return "image"
    else:
        return "unknown"

def get_page_blocks(page):
    blocks = []
    for block in page.get_text("blocks"):
        block_type = get_block_type(block)
        text = block[4] if block_type == "text" else ""
        blocks.append({
            "type": block_type,
            "text": text,
            "coordinates": block[:4]  # bounding box: (x0, y0, x1, y1)
        })
    return blocks

def get_pdf_content(pdf_path):
    doc = fitz.open(pdf_path)
    content = {
        "document_metadata": {
            "title": doc.metadata["title"],
            "author": doc.metadata["author"],
            "creation_date": doc.metadata["creationDate"],
            "modification_date": doc.metadata["modDate"]
        },
        "pages": []
    }
    for page in doc:
        content["pages"].append({
            "page_number": page.number,
            "page_metadata": {
                "width": page.rect.width,
                "height": page.rect.height,
                
            },
            "blocks": get_page_blocks(page)
        })
    doc.close()
    return content

if __name__ == '__main__':
    pdf_content = get_pdf_content('test.pdf')
    json_content = json.dumps(pdf_content, indent=4)
    #write json content to output.json file
    with open('output.json', 'w') as f:
        f.write(json_content)
^^ just a basic example - I'm grabbing the doc/page/block structure from a pdf. My thinking was it could be used as a basis for embedding, maybe via the existing JSON reader; if not, then I'd want to pass the doc/page-level info with enough blocks to fill the embedding window. This way the embedding also gets some useful context like page number, doc title etc. Another nice thing this does is provide bounding box coordinates for each block/paragraph, which could be useful for showing, at a more granular level, which exact part of a PDF was used to answer the question.
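For the embedding side, something like this is what I had in mind (a rough sketch against the legacy llama_index API - in older releases you'd pass the documents straight to the GPTSimpleVectorIndex constructor instead of from_documents):
Plain Text
from llama_index import Document, GPTSimpleVectorIndex

# One Document per page, carrying doc/page context alongside the text
pdf_content = get_pdf_content("test.pdf")
documents = []
for page in pdf_content["pages"]:
    page_text = "\n\n".join(
        block["text"] for block in page["blocks"] if block["type"] == "text"
    )
    documents.append(Document(
        page_text,
        extra_info={
            "title": pdf_content["document_metadata"]["title"],
            "page_number": page["page_number"],
        },
    ))

index = GPTSimpleVectorIndex.from_documents(documents)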
One other thing worth being aware of that may limit us using this: pymupdf licensing is AGPL or commercial 😕
I had a go with Unstructured too - much more complicated to install, and it does a more detailed job, but it splits into much tinier slices.
Not sure how this would work with embeddings
Also note unstructured has an open issue parsing PDFs with columns in the correct order (arxiv pdfs are an example) - I feel it will be the better solution in the long run, but pymupdf may be more mature
Yeah I decided to not use pymu due to their licensing issues.
Unstructured is much more powerful, as it uses visual cues to deconstruct the pdf. Especially useful for research papers.
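The basic flow is just a couple of lines (a sketch - element types and available kwargs vary by unstructured version):
Plain Text
from unstructured.partition.pdf import partition_pdf

# Partition the pdf into typed elements (Title, NarrativeText, ListItem, ...)
elements = partition_pdf(filename="test.pdf")
for element in elements:
    print(type(element).__name__, "-", str(element)[:80])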
I will be submitting a PR tonight to address the page number and doc title in our internal pdf loader.
I think making the time investment to get unstructured up and running and provide simple to follow docs for others would be more effective in the long term.
As I mentioned in the issue and help thread, future iterations of LLM apps will be able to perform summarization and QA. If the user clicks on any sentence in the output, the app can then load the pdf, take them to the page, and highlight the text used for that sentence.
So getting bounding boxes will definitely be crucial! But again, pymu does not have a permissive open-source license, so we've gotta find another approach.
@Runonthespot let me know your thoughts.
I agree - I think unstructured is the way to go. I'm just a bit worried about the two-column thing, and seeing every list item appear as a single sentence etc. makes me think we need to think carefully about how we group stuff up. Short sentences make poor embeddings in my experience.
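Something naive like this is what I'm imagining for the grouping (plain Python over whatever text elements the parser returns; the 1000-character target is an arbitrary choice):
Plain Text
def group_elements(texts, max_chars=1000):
    # Merge short elements (list items, lone sentences) into larger chunks
    # so each embedding gets enough context to be meaningful
    chunks, current = [], ""
    for text in texts:
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = current + "\n" + text if current else text
    if current:
        chunks.append(current)
    return chunks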
Certainly. Would you be open to discussing these two issues today? We can do some testing to see how bad those bugs/embeddings are. @Runonthespot
Unstructured is now super simple to use. Pretty exciting
It is… although I think this sort of thing can make it a bit trickier in an enterprise setting - I'm okay with running inference locally, but I need to get to a sort of airgapped solution. I'm keen to collaborate, but should add that I'm in London, UK, so we'll need to time it carefully
We have our own k8s cluster though, so a docker solution is super cool
Yeah they announced docker version last week. So you choose your medicine haha.
I’m now just worried about the bugs you mentioned. Hopefully they’re not a big deal.
Well to be fair, they’re raised as an issue, that’s good!
@Runonthespot What is your next step with unstructured? I plan to dedicate today to learning, implementing, and customizing it.
@BioHacker I am interested in the highlighting part, where can I find the PR you are talking about in this thread? Thank you.