When using unstructured data reader page

ssarath_sk

When using the Unstructured data reader, the page number doesn't get extracted from the documents into the metadata. Is there any way to do this?

9 comments

WhiteFang_Jr

As far as I know, the page number only gets added automatically for PDFs. You could use PDFReader to extract content from PDF files and UnstructuredReader for the rest.

Sample code would look like this:


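from llama_index import download_loader
from llama_index import SimpleDirectoryReader
from llama_index.readers.file.docs_reader import PDFReader

# Fetch the UnstructuredReader loader class from LlamaHub
UnstructuredReader = download_loader('UnstructuredReader')

# Route each file extension to the reader that should handle it:
# PDFReader for PDFs (it adds the page label), UnstructuredReader for the rest
dir_reader = SimpleDirectoryReader('./data', file_extractor={
  ".pdf": PDFReader(),
  ".html": UnstructuredReader(),
  ".eml": UnstructuredReader(),
})
documents = dir_reader.load_data()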

ssarath_sk

Yes, the PDF reader adds a page label, but when I use the Unstructured reader the metadata is different. Is there any way to control the metadata extraction in the Unstructured reader? I intend to use it because it gives comparatively better results.

WhiteFang_Jr

The official Unstructured documentation says that they provide page_number metadata:
https://unstructured-io.github.io/unstructured/metadata.html#additional-metadata-fields-by-document-type

There are two ways to extract content with Unstructured: through the API or locally.

I haven't used it, so I can't be sure whether page_number is only provided through the API.

https://github.com/run-llama/llama-hub/blob/afaf94d964452f865d54b56e125bfa469e672450/llama_hub/file/unstructured/base.py#L58

ssarath_sk

Thanks for sharing. I changed the default value of split_documents to True.

ssarath_sk

That gave me the page number label.

ssarath_sk

But it's splitting the document on its own.
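
If the automatic splitting is finer than you want, one option is to merge the pieces back together per page after loading. A minimal sketch in plain Python, assuming each split piece is available as a dict with a "text" and a "page_number" key. That dict shape and the name merge_by_page are hypothetical, for illustration only, and not the reader's actual output format:

```python
from collections import defaultdict


def merge_by_page(pieces):
    """Join split pieces into one record per page, in page order."""
    pages = defaultdict(list)
    for piece in pieces:
        # Collect the text of every piece under its page number
        pages[piece["page_number"]].append(piece["text"])
    return [
        {"text": "\n".join(texts), "metadata": {"page_number": page}}
        for page, texts in sorted(pages.items())
    ]


# Hypothetical pieces, standing in for whatever the loader returned
pieces = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second page text.", "page_number": 2},
]
merged = merge_by_page(pieces)
```

This keeps the page number in the metadata while leaving the final granularity up to you.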

Fried cheese

I am also interested in splitting the data myself while still using the Unstructured loader. Do you have any insights?

ssarath_sk

In my case, I split the documents into individual pages and loaded those, which helped me control the metadata.

ssarath_sk

If you only want to extract the text, you can parse the PDF and keep only the relevant data, then chunk it later as you need.
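
The "chunk it later" step could look roughly like this. It is a minimal sketch in plain Python, assuming you already have the text of each page as a list of strings. The function names chunk_text and chunk_pages and the character-based window sizes are illustrative choices, not part of llama_index or Unstructured:

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character windows that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def chunk_pages(pages, chunk_size=200, overlap=20):
    """Chunk each page separately so every chunk keeps its page number."""
    records = []
    for page_number, page_text in enumerate(pages, start=1):
        for chunk in chunk_text(page_text, chunk_size, overlap):
            records.append(
                {"text": chunk, "metadata": {"page_number": page_number}}
            )
    return records


# Hypothetical page texts, numbered from 1 in list order
records = chunk_pages(["abcdef", "xy"], chunk_size=4, overlap=0)
```

Chunking page by page means a chunk never spans two pages, so the page number attached to it stays unambiguous.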
