Updated 3 months ago

Hi, I am using SimpleDirectoryReader

Hi, I am using SimpleDirectoryReader which works great, but I am having trouble with having just a simple identifier on documents which get chunked up (because they're too large). I notice it just creates new ids and links it with Relationships, but I wish to have a simple id on each part of a document which is the same. Is there such a thing?

I have also added an identifier to the metadata_fn, but it feels like something LlamaIndex should be supporting in some way or another. Am I overlooking something?

8 comments

RRohan

If I understood the question correctly, I think what you're looking for is ref_doc_id.
After chunking, each node has a ref_doc_id, which is the ID of the parent document it was chunked from:

node.ref_doc_id

Do correct me if this is not what you're looking for.
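As a rough sketch of how ref_doc_id could be used to collect all chunks of one document: the snippet below groups nodes by that attribute. The nodes here are stand-in objects built with SimpleNamespace (an assumption made so the snippet runs without LlamaIndex installed); with real nodes, only the `ref_doc_id` attribute mentioned above is relied on. `group_by_ref_doc_id` is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_ref_doc_id(nodes):
    """Map each ref_doc_id to the list of nodes chunked from that document."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.ref_doc_id].append(node)
    return dict(groups)


# Stand-in nodes: only the attributes used here are modelled.
nodes = [
    SimpleNamespace(ref_doc_id="doc-a", text="chunk 1"),
    SimpleNamespace(ref_doc_id="doc-a", text="chunk 2"),
    SimpleNamespace(ref_doc_id="doc-b", text="chunk 3"),
]

groups = group_by_ref_doc_id(nodes)
for doc_id, members in groups.items():
    print(doc_id, [n.text for n in members])
```

Each group can then be transformed on its own, as long as the reader produced a single document per source file.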

RRohan

And for the second question: just out of curiosity, how do you plan to use the ID in the metadata?

yyschermer

Yes, I actually did use ref_doc_id, but for a large .pdf, for example, it still outputs multiple parts of the same PDF with different ref_doc_id values. I expected it to stay consistent for the whole PDF.

yyschermer

I want to use it so I can do local transformations over nodes of every pdf.

yyschermer

I also noticed there is file_path, but it feels strange to use this as an identifier.

yyschermer

Especially since not all documents have a file_path (e.g. webpages).
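One possible workaround, sketched under assumptions: derive a short, stable ID from whichever source key a document's metadata happens to carry. `source_id` is a hypothetical helper, and the `"url"` key is an assumed name for what a web reader might store; only `file_path` comes from this thread. Hashing keeps the ID opaque, so it does not look like a path.

```python
import hashlib


def source_id(metadata, keys=("file_path", "url")):
    """Return a short stable ID derived from the first source key present.

    The key names are assumptions: "file_path" for files, "url" for web pages.
    """
    for key in keys:
        value = metadata.get(key)
        if value:
            digest = hashlib.sha1(str(value).encode("utf-8")).hexdigest()
            return digest[:16]
    raise ValueError("no source key found in metadata")


# Two pages of the same file share an ID; a web page gets a different one.
page_1 = {"file_path": "docs/huge2.pdf", "page_label": "1"}
page_7 = {"file_path": "docs/huge2.pdf", "page_label": "7"}
web = {"url": "https://example.com/page"}

print(source_id(page_1) == source_id(page_7))
print(source_id(page_1) == source_id(web))
```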

RRohan

I see. By default the PDFReader creates a document for each page of the pdf.

If you want you can change that behavior like this:

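from pathlib import Path

from llama_index.readers import PDFReader

# return_full_document=True yields one Document for the whole PDF instead of one per page
pdf_reader = PDFReader(return_full_document=True)
documents = pdf_reader.load_data(Path('huge2.pdf'))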

If you want to keep one document per page, you can use the filename as the ID or set up something else as the identifier in the metadata.
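A minimal sketch of the metadata route, assuming the `file_metadata` argument of SimpleDirectoryReader, which takes a callable that receives the file path and returns a dict. The `source_id` key is a made-up name for illustration; any key would do. The reader call itself is left as a comment so the snippet runs without LlamaIndex installed.

```python
from pathlib import Path


def file_metadata(file_path: str) -> dict:
    """Attach the same identifier to every page loaded from one file."""
    path = Path(file_path)
    # "source_id" is a hypothetical key name; the file stem serves as the shared ID.
    return {"source_id": path.stem, "file_name": path.name}


# Assumed usage with the reader:
# documents = SimpleDirectoryReader('./', file_metadata=file_metadata).load_data()

print(file_metadata("reports/huge2.pdf"))
```

Every page-level document from the same PDF would then carry the same `source_id`, and the nodes chunked from those pages should inherit it through their metadata.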

RRohan

To use it with SimpleDirectoryReader:

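from llama_index import SimpleDirectoryReader
from llama_index.readers import PDFReader

full_pdf_reader = PDFReader(return_full_document=True)

# Route .pdf files through the full-document reader
documents = SimpleDirectoryReader(
    './',
    input_files=['huge2.pdf'],
    file_extractor={'.pdf': full_pdf_reader},
).load_data()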
