what is the best index for a large pdf

what is the best index for a large pdf? GPTSimpleVectorIndex is pretty hit and miss unfortunately

Hi Nilu, what issues are you running into? I'd recommend starting with the GPTSimpleVectorIndex

It could be a matter of data cleaning? If our pdf parser doesn't work you could check out our unstructured.io parser https://llamahub.ai/l/file-unstructured

Nilu

the search is just off sometimes. For example, when I search for "What happened at the Battle of Valrain Fields" it gives me no information, despite that clearly being in the pdf

Nilu

will try with unstructured parser

jerryjliu0

yeah try it and let me know! also try decreasing chunk_size_limit when building the index and increasing similarity_top_k during query time
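
To illustrate why those two knobs matter, here is a minimal, library-free sketch. It is not how GPTSimpleVectorIndex is implemented: the function names are hypothetical, chunks are measured in words rather than tokens, and a bag-of-words cosine similarity stands in for real embeddings. The idea it shows is the same, though: smaller chunks keep a short passage from being diluted by surrounding text, and a larger k returns more candidate chunks per query.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most chunk_size words (toy stand-in for chunk_size_limit)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bag_of_words(text):
    """Lowercased word counts; a crude substitute for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k):
    """Return the k chunks most similar to the query (toy stand-in for similarity_top_k)."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True)
    return ranked[:k]
```

With a small chunk size, the sentence that answers the question ends up in its own chunk and ranks first; with one huge chunk, the same sentence is buried among unrelated words and its similarity score drops.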

Nilu

well the unstructured parser failed on an M1 Mac, took 2 hours and couldn't parse anything, going to try the chunk_size_limit and similarity_top_k suggestions

Nilu

@jerryjliu0 reducing chunk_size_limit to 512 and keeping similarity_top_k at 1 works perfectly, thanks!

Nilu

now time to get it working on cohere

auser1234

@Nilu, can you share what pdf(s) you were attempting to parse?

auser1234

thanks!

auser1234

@Nilu, on a Mac M1 this took about 10 minutes. It used a bit over 4 GB of memory, so I wonder if you ran into swap?
Here is what the structured outputs look like:
https://gist.github.com/cragwolfe/17a0eafaf2f8ababe1cee5042b051a4a

auser1234

we're also aware ~10 minutes is not great, we're working on that!

Nilu

yeah probably, kept getting these errors

auser1234

good info! can you tell me what version of unstructured-inference is pip installed?

Nilu

Name: unstructured-inference
Version: 0.2.8
Summary: A library for performing inference using trained models.
Home-page: https://github.com/Unstructured-IO/unstructured-inference
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: /Users/niluk/mambaforge/lib/python3.10/site-packages
Requires: fastapi, huggingface-hub, layoutparser, onnxruntime, opencv-python, python-multipart, transformers, uvicorn
Required-by

auser1234

🙏

auser1234

I'm pretty sure pip install unstructured-inference==0.2.7 would fix the issue. A similar (but seemingly rare) issue with 0.2.8 was also observed recently; we'll release a new version of the unstructured package shortly that pins the dependency appropriately.

Nilu

awesome, will try it out once I have time

auser1234

one last follow up: i was able to repro it taking extra long with unstructured-inference==0.2.8 (i killed it after 30 mins).
in a fresh pyenv,

pip install "unstructured[local-inference]"
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

correctly has unstructured-inference==0.2.7 pinned, and another test confirms the doc processes in ~10 mins.
install details: mac m1, python3.8.

Nilu

can confirm 0.2.7 works

Nilu

what were the settings you used to get the "narrative text" etc

Nilu

since that doesn't appear on mine

auser1234

You can convert an individual Element using .to_dict() to get all its fields, including type. E.g.: https://gist.github.com/cragwolfe/1d432c6c8597d007efc10bf29f09bed1
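
As a rough sketch of that suggestion, the helpers below collect the type of each parsed element. They assume only what the reply states, that each element has a .to_dict() method whose result includes a "type" field; the "text" key, the helper names, and the file name in the usage comment are assumptions for illustration.

```python
from collections import Counter


def element_types(elements):
    """Return (type, text) pairs for any objects that expose .to_dict()."""
    rows = []
    for element in elements:
        fields = element.to_dict()
        # "type" is mentioned in the thread; "text" is an assumed key.
        rows.append((fields.get("type"), fields.get("text")))
    return rows


def count_types(elements):
    """Count how many elements of each type were produced."""
    return Counter(kind for kind, _ in element_types(elements))


# Hypothetical usage, assuming unstructured is installed and the file exists:
#   from unstructured.partition.auto import partition
#   elements = partition(filename="my_book.pdf")
#   print(count_types(elements))
```

If the counts show no narrative-text entries at all, the parse itself is probably the problem rather than the way the output is being printed.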
