I am building a RAG pipeline over a set of the documents. For now, I am only allowing pdf, docx and txt files. I am using the SimpleDirectoryReader to load the files. By default, pdfs have file name and page label as metadata. docx have just file name. txts have no metadata. I want all 3 file types to have consistent metadata.
After some research, I realized it's not easy/possible to get page labels for .txt and .docx files. I still want docx and txts to have file names in metadata. pdfs can have default file name and page label in metadata. What's the best way to achieve it without having to make changes in readers/file/base.py or readers/file/docs_reader.py?