
Updated 2 months ago

RAG

Hi,

I am building a RAG pipeline over a set of documents. For now, I am only allowing pdf, docx and txt files. I am using the SimpleDirectoryReader to load the files. By default, pdfs have the file name and page label as metadata, docx files have just the file name, and txt files have no metadata. I want all 3 file types to have consistent metadata.

After some research, I realized it's not easy/possible to get page labels for .txt and .docx files. I still want docx and txts to have file names in metadata. pdfs can have default file name and page label in metadata. What's the best way to achieve it without having to make changes in readers/file/base.py or readers/file/docs_reader.py?

Thanks!
2 comments
Docx files already have the file name in their metadata. You can read the txt files manually on your end, set the required metadata values, and then add them to the PDF + Docx documents.
This is how I am doing it right now but I am open to better alternatives:

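from llama_index.core import SimpleDirectoryReader  # older releases: from llama_index import SimpleDirectoryReader

# pdf and docx files keep the metadata their readers set by default
documents = SimpleDirectoryReader(folder_path, recursive=True, required_exts=['.pdf', '.docx']).load_data()
# txt files get no metadata by default, so attach the file name explicitly
txt_filename_metadata_fn = lambda filename: {'file_name': filename}
txt_documents = SimpleDirectoryReader(folder_path, recursive=True, required_exts=['.txt'], file_metadata=txt_filename_metadata_fn).load_data()
documents.extend(txt_documents)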
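
One possible simplification, as a rough sketch rather than a confirmed recipe: use a single reader for all three extensions and pass one `file_metadata` callable. This assumes the callable receives the file path as a string, so `Path(...).name` keeps only the base name and the txt entries match what the pdf and docx readers store. It also assumes the pdf reader still adds its page label on top of whatever the callable returns, which is worth checking on your installed version. The names `file_name_metadata`, `load_documents` and `ALLOWED_EXTS` are made up for illustration.

```python
from pathlib import Path

# Hypothetical constant: the three extensions allowed in the pipeline.
ALLOWED_EXTS = [".pdf", ".docx", ".txt"]


def file_name_metadata(file_path: str) -> dict:
    """Return the metadata every document should carry, whatever its type."""
    # Keep only the base name, not the full path handed over by the reader.
    return {"file_name": Path(file_path).name}


def load_documents(folder_path: str):
    """Load pdf, docx and txt files in one pass with a shared metadata function."""
    # Imported here so the helper above works without llama_index installed.
    # Assumption: llama_index >= 0.10; older releases import from `llama_index`.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(
        folder_path,
        recursive=True,
        required_exts=ALLOWED_EXTS,
        file_metadata=file_name_metadata,
    )
    return reader.load_data()
```

After `documents = load_documents(folder_path)`, printing `doc.metadata` for one document of each type is a quick way to confirm the keys line up before indexing.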