
Updated 5 months ago

RAG

At a glance

The community member is building a RAG pipeline that processes PDF, DOCX, and TXT files. They are using the SimpleDirectoryReader to load the files, but the metadata is inconsistent - PDFs have file name and page label, DOCXs have only file name, and TXTs have no metadata. The community member wants to have consistent metadata across all file types, with file names for DOCXs and TXTs, and file names and page labels for PDFs.

In the comments, one community member suggests reading the TXT files manually, setting the required metadata values, and then adding the resulting documents to the PDF and DOCX documents. Another community member shares how they currently handle this: a custom metadata function for the TXT files, after which the document list is extended with the TXT documents.

There is no explicitly marked answer in the post or comments.

Hi,

I am building a RAG pipeline over a set of documents. For now, I am only allowing PDF, DOCX and TXT files. I am using the SimpleDirectoryReader to load the files. By default, PDFs have file name and page label as metadata, DOCX files have just the file name, and TXT files have no metadata. I want all 3 file types to have consistent metadata.

After some research, I realized it's not easy (or maybe not possible) to get page labels for .txt and .docx files. I still want DOCX and TXT files to have file names in their metadata. PDFs can keep the default file name and page label. What's the best way to achieve this without having to make changes in readers/file/base.py or readers/file/docs_reader.py?

Thanks!
2 comments
DOCX already has the file name in its metadata. You can read the TXT files manually on your end, set the required metadata values, and then add them to the PDF + DOCX documents.
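A minimal sketch of the manual approach suggested above, using only the standard library: each .txt file is read directly and paired with a `file_name` entry. The helper name `load_txt_with_metadata` and the `(text, metadata)` tuple shape are assumptions made for illustration; they are not part of LlamaIndex.

```python
from pathlib import Path
from typing import List, Tuple


def load_txt_with_metadata(folder_path: str) -> List[Tuple[str, dict]]:
    """Read every .txt file under folder_path (recursively) and attach its file name."""
    records = []
    # sorted() keeps the order stable across runs
    for path in sorted(Path(folder_path).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Use the same key the post says PDF and DOCX documents already carry
        records.append((text, {"file_name": path.name}))
    return records
```

Each pair could then be wrapped in a LlamaIndex `Document` (for example `Document(text=text, metadata=metadata)`, assuming a release whose constructor accepts `metadata`) and appended to the PDF and DOCX documents.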
This is how I am doing it right now but I am open to better alternatives:

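import os
from llama_index.core import SimpleDirectoryReader  # older releases: from llama_index import SimpleDirectoryReader
# PDF and DOCX files keep whatever metadata their readers attach by default.
documents = SimpleDirectoryReader(folder_path, recursive=True, required_exts=['.pdf', '.docx']).load_data()
# file_metadata is called with the file's path, so keep only the base name as file_name.
txt_filename_metadata_fn = lambda filename: {'file_name': os.path.basename(filename)}
txt_documents = SimpleDirectoryReader(folder_path, recursive=True, required_exts=['.txt'], file_metadata=txt_filename_metadata_fn).load_data()
documents.extend(txt_documents)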
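To confirm that the merged list really is consistent, a small check over the metadata dictionaries may help. This is only a sketch on plain dicts; `find_missing_metadata` and its `required_keys` parameter are hypothetical names, not a LlamaIndex API.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def find_missing_metadata(
    metadatas: Iterable[Dict[str, object]],
    required_keys: Sequence[str] = ("file_name",),
) -> List[Tuple[int, List[str]]]:
    """Return (index, missing_keys) for every metadata dict lacking a required key."""
    problems = []
    for index, metadata in enumerate(metadatas):
        # A key counts as missing when it is absent or mapped to an empty value
        missing = [key for key in required_keys if not metadata.get(key)]
        if missing:
            problems.append((index, missing))
    return problems
```

With the merged list from the snippet above, this might be called as `find_missing_metadata(doc.metadata for doc in documents)`; an empty result would mean every document carries a `file_name`.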