Updated 3 months ago

what is the best index for a large pdf

what is the best index for a large pdf? GPTSimpleVectorIndex is pretty hit and miss unfortunately
22 comments
Hi Nilu, what issues are you running into? I'd recommend starting with the GPTSimpleVectorIndex

It could be a matter of data cleaning? If our pdf parser doesn't work you could check out our unstructured.io parser https://llamahub.ai/l/file-unstructured
the search is just off sometimes - for example when I search for "What happened at the Battle of Valrain Fields" it gives me no information despite that clearly being in the pdf
will try with unstructured parser
yeah try it and let me know! also try decreasing chunk_size_limit when building the index and increasing similarity_top_k during query time
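To make the two knobs above concrete, here is a toy sketch of chunked top-k retrieval. It is not how GPTSimpleVectorIndex works internally: it assumes word-count vectors instead of real embeddings, the helper names (split_into_chunks, top_k_chunks) are made up for illustration, and the sample document is invented filler. chunk_size here plays the role of chunk_size_limit and k the role of similarity_top_k.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real embedding (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k):
    """Return the k chunks most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Invented sample text: one relevant sentence buried in repetitive filler.
filler = "The harvest was stored in the granary for the winter. " * 40
document = filler + "The Battle of Valrain Fields ended in a truce. " + filler
query = "What happened at the Battle of Valrain Fields"

for chunk_size in (400, 16):
    chunks = split_into_chunks(document, chunk_size)
    best = top_k_chunks(query, chunks, k=1)[0]
    print(chunk_size, len(chunks), len(best.split()), "Valrain" in best)
```

Smaller chunks mean the relevant sentence makes up more of whichever chunk contains it, so it is easier to rank first. Raising k instead hands more candidate chunks to the LLM at the cost of a longer prompt.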
well the unstructured parser failed on an M1 Mac, took 2 hours and couldn't parse anything, going to try the chunk_size_limit and similarity_top_k suggestions
@jerryjliu0 reducing chunk_size_limit to 512 and keeping similarity_top_k at 1 works perfectly, thanks!
now time to get it working on cohere
@Nilu , can you share what pdf(s) you were attempting to parse?
@Nilu , on a mac m1, this took about 10 minutes. It took a bit over 4gb in memory, wonder if you ran into swap?
Here is what the structured outputs look like:
https://gist.github.com/cragwolfe/17a0eafaf2f8ababe1cee5042b051a4a
we're also aware ~10 minutes is not great, we're working on that!
yeah probably, kept getting these errors
good info! can you tell me what version of unstructured-inference is pip installed?
Name: unstructured-inference
Version: 0.2.8
Summary: A library for performing inference using trained models.
Home-page: https://github.com/Unstructured-IO/unstructured-inference
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: /Users/niluk/mambaforge/lib/python3.10/site-packages
Requires: fastapi, huggingface-hub, layoutparser, onnxruntime, opencv-python, python-multipart, transformers, uvicorn
Required-by:
i'm pretty sure pip install unstructured-inference==0.2.7 would fix the issue. A similar issue (but seemingly rare) for 0.2.8 was also recently observed -- we'll release a new version of the unstructured package shortly that pins the dependency appropriately.
awesome, will try it out once I have time
one last follow up: i was able to repro it taking extra long with unstructured-inference==0.2.8 (i killed it after 30 mins).
in a fresh pyenv,
pip install "unstructured[local-inference]"
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

correctly has unstructured-inference==0.2.7 pinned, and another test confirms the doc processes in ~10 mins.
install details: mac m1, python3.8.
can confirm 0.2.7 works
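For anyone wanting to check which version they ended up with before kicking off a long parse, a minimal sketch using only the standard library (Python 3.8+). The helper names installed_version and matches_pin are made up for illustration; the package name and the 0.2.7 pin come from this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package, wanted):
    """True only if the package is installed at exactly the wanted version."""
    return installed_version(package) == wanted


found = installed_version("unstructured-inference")
if found is None:
    print("unstructured-inference is not installed")
elif not matches_pin("unstructured-inference", "0.2.7"):
    print(f"found {found}; this thread reports 0.2.7 as the working version")
else:
    print("unstructured-inference 0.2.7 is installed")
```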
what were the settings you used to get the "narrative text" etc
since that doesn't appear on mine
You can convert an individual Element using .to_dict() to get all its fields, including type. E.g.: https://gist.github.com/cragwolfe/1d432c6c8597d007efc10bf29f09bed1
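A rough sketch of how that could be used to see which element types a parse produced. It relies only on each element having a .to_dict() that includes a "type" field, as described above. The "text" key and the "NarrativeText" type name are assumptions about the output, and FakeElement is a hypothetical stand-in so the snippet runs without unstructured installed; with the real library the elements would come from something like partition_pdf(filename=...).

```python
from collections import Counter


def element_types(elements):
    """Count elements by the 'type' field of their to_dict() output."""
    return Counter(el.to_dict().get("type", "unknown") for el in elements)


def texts_of_type(elements, wanted_type):
    """Collect the text of every element whose type matches wanted_type."""
    dicts = (el.to_dict() for el in elements)
    # "text" is an assumed key name; adjust to whatever to_dict() returns.
    return [d.get("text", "") for d in dicts if d.get("type") == wanted_type]


class FakeElement:
    """Hypothetical stand-in for a parsed element, for demonstration only."""

    def __init__(self, type_name, text):
        self.type_name = type_name
        self.text = text

    def to_dict(self):
        return {"type": self.type_name, "text": self.text}


sample = [
    FakeElement("Title", "Chapter 1"),
    FakeElement("NarrativeText", "It was a long march."),
    FakeElement("NarrativeText", "The army rested."),
]

print(element_types(sample))
print(texts_of_type(sample, "NarrativeText"))
```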