
Updated 3 months ago

When using unstructured data reader page

When using the Unstructured reader, the page number doesn't get extracted from the documents into the metadata. Is there any way to do this?
9 comments
I think the page number only gets added automatically for PDFs. You could use PDFReader to extract content from PDF files and UnstructuredReader for the rest.

Sample code would look like this:

from llama_index import SimpleDirectoryReader, download_loader
from llama_index.readers.file.docs_reader import PDFReader

# Fetch the UnstructuredReader loader class from LlamaHub
UnstructuredReader = download_loader('UnstructuredReader')

# Map each file extension to the reader that should handle it:
# PDFReader for PDFs (adds the page label), UnstructuredReader for the rest
dir_reader = SimpleDirectoryReader('./data', file_extractor={
  ".pdf": PDFReader(),
  ".html": UnstructuredReader(),
  ".eml": UnstructuredReader(),
})
documents = dir_reader.load_data()
Yes, the PDF reader adds a page label, but when I use the Unstructured reader the metadata is different. Is there any way to control the metadata extraction in the Unstructured reader? I intend to use the Unstructured reader because it gives comparatively better results.
The official Unstructured documentation says that they provide page_number metadata:
https://unstructured-io.github.io/unstructured/metadata.html#additional-metadata-fields-by-document-type



There are two ways to extract content via Unstructured: through the API or locally.


I haven't used it, so I can't be fully sure whether page_number is provided only through the API or not.

https://github.com/run-llama/llama-hub/blob/afaf94d964452f865d54b56e125bfa469e672450/llama_hub/file/unstructured/base.py#L58
Thanks for sharing. I changed the default value of split_documents to True.
It gave the page number label.
But now it's splitting the document on its own.
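One possible way around the automatic splitting is to merge the split pieces back together by page. The sketch below is plain Python and only illustrates the idea: it assumes each split piece carries a page_number key in its metadata, and the Piece class and merge_by_page function are hypothetical stand-ins, not part of LlamaIndex or Unstructured.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Piece:
    """Hypothetical stand-in for one split document returned by the reader."""
    text: str
    metadata: dict = field(default_factory=dict)


def merge_by_page(pieces):
    """Join the texts of pieces that share a page_number into one piece per page."""
    pages = defaultdict(list)
    for piece in pieces:
        # Pieces without a page number are collected under the key None.
        pages[piece.metadata.get("page_number")].append(piece.text)
    # Numbered pages come first in ascending order, unnumbered pieces last.
    ordered = sorted(pages, key=lambda p: (p is None, p or 0))
    return [Piece("\n".join(pages[p]), {"page_number": p}) for p in ordered]


pieces = [
    Piece("Title", {"page_number": 1}),
    Piece("Intro paragraph.", {"page_number": 1}),
    Piece("Second page text.", {"page_number": 2}),
]
merged = merge_by_page(pieces)
```

With real reader output you would read the text and metadata from each returned document instead of building Piece objects by hand.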
I am also interested in splitting the data myself while still using the Unstructured loader. Do you have any insights?
In my case, I split the documents into multiple pages before loading them, which helped me control the metadata.
If you only want to extract the text, you can parse the PDF, keep only the relevant data, and later chunk it as you need.
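As a rough illustration of chunking page-wise text yourself while keeping the page number, here is a minimal sketch in plain Python. It assumes you already have the text of each page as (page_number, text) pairs; the chunk_pages function, its parameters, and the start_char key are made up for this example and are not part of any library.

```python
def chunk_pages(pages, chunk_size=200, overlap=20):
    """Split each page's text into overlapping character windows.

    `pages` is a list of (page_number, text) pairs. Every chunk keeps the
    page number it came from, so the metadata stays under your control.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, text in pages:
        for start in range(0, len(text), step):
            chunks.append({
                "text": text[start:start + chunk_size],
                "metadata": {"page_number": page_number, "start_char": start},
            })
    return chunks


pages = [(1, "a" * 25), (2, "b" * 10)]
chunks = chunk_pages(pages, chunk_size=10, overlap=2)
```

Chunks never cross a page boundary here, which keeps the page number unambiguous; a sentence-aware splitter could replace the character windows if that matters for your data.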