Parsing pdfs

Hi folks,
I'm using SimpleDirectoryReader + pdfs to chunk up files into some fixed size, querying with GPTSimpleVectorIndex. One issue I have is that it seems quite arbitrary where the chunking happens, and that can create some very unpredictable results. If the split happens to fall in the middle of a paragraph, the embedding quality drops and the query doesn't give the right answer. Adding top_k=2 (or more) doesn't help, as the paragraph itself is already broken.

I was wondering if there are any recommended ways of splitting PDFs into more logical chunks (pages, paragraphs), or at least of introducing a much bigger overlap between chunks. I haven't been able to do this with max_chunk_overlap so far, and am considering writing my own pdf->json parser instead - but I'd love to hear if anyone else has encountered this?
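For reference, here's roughly my current setup (legacy gpt_index/llama_index API - exact constructor names vary between releases):
Plain Text
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load every pdf in the folder and build a vector index over fixed-size chunks
documents = SimpleDirectoryReader("./pdfs").load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

# similarity_top_k retrieves more chunks, but it can't repair a paragraph
# that was already split mid-sentence at indexing time
response = index.query("my question here", similarity_top_k=2)
print(response)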
I know unstructured.io can parse into more logical elements for you, but I haven't checked it out too much.

There is also a sentence splitter that you can use instead of the token splitter now 💪
@Runonthespot I'm having a similar experience parsing PDF. Feel free to dm me if you'd like to collaborate on understanding how to index. It's not very straightforward to me atm. Maybe spitballing ideas might help?
what's a good resources on using splitters?
Seems like the docs haven't quite caught up with this yet.

Here's an example I made just now after reading the source code though lol
Plain Text
from llama_index import ServiceContext
from llama_index.langchain_helpers.text_splitter import SentenceSplitter
from llama_index.node_parser.simple import SimpleNodeParser

# Swap the default token splitter for sentence-aware splitting
node_parser = SimpleNodeParser(text_splitter=SentenceSplitter())
service_context = ServiceContext.from_defaults(node_parser=node_parser)


There are a few settings to the splitter you can set too, here's the class def https://github.com/jerryjliu/llama_index/blob/main/gpt_index/langchain_helpers/text_splitter.py#L239
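For example (illustrative values only - check the linked class def for the exact defaults and full parameter list):
Plain Text
# Assumes the imports from the snippet above; values here are just examples
splitter = SentenceSplitter(
    chunk_size=512,     # target chunk size, in tokens
    chunk_overlap=64,   # tokens shared between adjacent chunks
)
node_parser = SimpleNodeParser(text_splitter=splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)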
There's also a (very rough) notebook here... that notebook needs to be cleaned up lol https://github.com/jerryjliu/llama_index/blob/main/examples/paul_graham_essay/SentenceSplittingDemo.ipynb
these are good places to start
thanks @Logan M
@conic I'm working on something similar if you wanna collab
I do want to collab
I'll add you as a friend on discord, if you want we can do a private VC to figure out what we understand/don't understand about this.
for sure. Have you started playing around with a pdf yet?
Not yet. It's on the list of things that I'm trying to understand. I'm reading through notebooks. I'm also trying to overcome some issues with indexing data for GPT-3.5-turbo.
I think if anything we can/should probably just share/sync understandings to see if we understand how indexing works in the first place
I have the same issue... we need the splitter to be smarter and to break up text by section, paragraph, and sentence vs. a fixed character length like it does now
Really struggling to find this in the documentation, but do you know if there's a way to do token-aware sentence splitting?
Or is the function already doing this?
Looking at the code, it already does this, cool!
Great tip, thank you
Definitely up for collaborating. Right now I have a few ideas… One idea is to use pymupdf or unstructured and then store the output as json annotated with doc name, page, paragraph number etc., as the json reader does some cool stuff to include fields at higher levels. I'd really like the source nodes to have enough information to zero in on the paragraphs being used too, for highlighting, but this needs to be traded off against embedding quality, which drops if the chunk size is small.
@Runonthespot I'm working on unstructured today. Hit me up if you wanna share some code.
I've been trying out pymupdf which is also looking promising
Plain Text
import json
import fitz

def get_block_type(block):
    # page.get_text("blocks") returns tuples of the form
    # (x0, y0, x1, y1, text, block_no, block_type),
    # where block_type is 0 for text and 1 for images
    if block[6] == 0:
        return "text"
    elif block[6] == 1:
        return "image"
    else:
        return "unknown"

def get_page_blocks(page):
    blocks = []
    for block in page.get_text("blocks"):
        block_type = get_block_type(block)
        text = block[4] if block_type == "text" else ""
        blocks.append({
            "type": block_type,
            "text": text,
            "coordinates": block[:4]  # bounding box: (x0, y0, x1, y1)
        })
    return blocks

def get_pdf_content(pdf_path):
    doc = fitz.open(pdf_path)
    content = {
        "document_metadata": {
            "title": doc.metadata["title"],
            "author": doc.metadata["author"],
            "creation_date": doc.metadata["creationDate"],
            "modification_date": doc.metadata["modDate"]
        },
        "pages": []
    }
    for page in doc:
        content["pages"].append({
            "page_number": page.number,
            "page_metadata": {
                "width": page.rect.width,
                "height": page.rect.height,
                
            },
            "blocks": get_page_blocks(page)
        })
    doc.close()
    return content

if __name__ == '__main__':
    pdf_content = get_pdf_content('test.pdf')
    json_content = json.dumps(pdf_content, indent=4)
    #write json content to output.json file
    with open('output.json', 'w') as f:
        f.write(json_content)
^^ just a basic example - I'm grabbing the doc/page/block structure from a pdf. My thinking was it could be used as a basis for embedding, maybe via the existing JSON reader; if not, then I'd want to pass the doc/page-level info with enough blocks to fill the embedding window. This way the embedding also gets some useful context like page number, doc title etc. Another nice thing this does is provide bounding box coordinates for each block/paragraph, which could be useful for showing, at a more granular level, which exact part of a PDF was used to answer the question.
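For the embedding side, something like this is what I had in mind (a rough sketch against the legacy llama_index API - in older releases you'd pass the documents straight to the GPTSimpleVectorIndex constructor instead of from_documents):
Plain Text
from llama_index import Document, GPTSimpleVectorIndex

# One Document per page, carrying doc/page context alongside the text
pdf_content = get_pdf_content("test.pdf")
documents = []
for page in pdf_content["pages"]:
    page_text = "\n\n".join(
        block["text"] for block in page["blocks"] if block["type"] == "text"
    )
    documents.append(Document(
        page_text,
        extra_info={
            "title": pdf_content["document_metadata"]["title"],
            "page_number": page["page_number"],
        },
    ))

index = GPTSimpleVectorIndex.from_documents(documents)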
One other thing worth being aware of that may limit us using this: pymupdf licensing is AGPL or commercial 😕
I had a go with Unstructured too - much more complicated to install, and it does a more detailed job, but it splits into much tinier slices.
Not sure how this would work with embeddings
Also note unstructured has an open issue parsing PDFs with columns in the correct order (arxiv pdfs are an example) - I feel it will be the better solution in the long run, but pymupdf may be more mature
Yeah I decided to not use pymu due to their licensing issues.
Unstructured is much more powerful, as it uses visual cues to deconstruct the pdf. Especially useful for research papers.
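The basic flow is just a couple of lines (a sketch - element types and available kwargs vary by unstructured version):
Plain Text
from unstructured.partition.pdf import partition_pdf

# Partition the pdf into typed elements (Title, NarrativeText, ListItem, ...)
elements = partition_pdf(filename="test.pdf")
for element in elements:
    print(type(element).__name__, "-", str(element)[:80])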
I will be submitting a PR tonight to address the page number and doc title in our internal pdf loader.
I think making the time investment to get unstructured up and running and provide simple to follow docs for others would be more effective in the long term.
As I mentioned in the issue and help thread, future iterations of LLM apps will be able to perform summarization and QA. If the user clicks on any sentence in the output, the app can then load the pdf, take them to the page, and highlight the text used for that sentence.
So getting bounding boxes will definitely be crucial! But again, pymu does not have a permissive open-source license, so we've gotta find another approach.
@Runonthespot let me know your thoughts.
I agree - I think unstructured is the way to go. I'm just a bit worried about the two-column thing, and seeing every list item appear as a single sentence etc. makes me think we need to think carefully about how we group stuff up. Short sentences make poor embeddings in my experience.
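Something naive like this is what I'm imagining for the grouping (plain Python over whatever text elements the parser returns; the 1000-character target is an arbitrary choice):
Plain Text
def group_elements(texts, max_chars=1000):
    # Merge short elements (list items, lone sentences) into larger chunks
    # so each embedding gets enough context to be meaningful
    chunks, current = [], ""
    for text in texts:
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = current + "\n" + text if current else text
    if current:
        chunks.append(current)
    return chunks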
Certainly. Would you be open to discussing these two issues today? We can do some testing to see how bad those bugs/embeddings are. @Runonthespot
Unstructured is now super simple to use. Pretty exciting
It is… although I think this sort of thing can make it a bit trickier in an enterprise setting - I'm okay with running inference locally, but I need to get to a sort of airgapped solution. I'm keen to collaborate, but should add that I'm in London, UK, so we'll need to time it carefully
We have our own k8s cluster though, so a docker solution is super cool
Yeah they announced docker version last week. So you choose your medicine haha.
I’m now just worried about the bugs you mentioned. Hopefully they’re not a big deal.
Well to be fair, they’re raised as an issue, that’s good!
@Runonthespot What is your next step with unstructured? I plan to dedicate today to learning, implementing, and customizing it.
@BioHacker I am interested in the highlighting part, where can I find the PR you are talking about in this thread? Thank you.