
Updated 2 months ago

Document loading

that seems feasible as well though it assumes the docs have been paginated into single pages.
9 comments
Here's the source code for SimpleDirectoryReader, and from there it uses several parsers depending on file types

https://github.com/jerryjliu/llama_index/blob/main/gpt_index/readers/file/base.py

It would be nice if every loader kept track of the file name 🤔 and only a select few loaders would consider page numbers I think

Tbh I've been meaning to look into this. I would do it right now too but my wife said I'm not allowed to code this weekend 🤣 been going a little hard lately
bro you've been killing it. Don't worry, I can code no problem. So the code we are looking for is here. Here is a screenshot
It seems like it's able to get the page number no problem. It just does not append it to the node. hmmm
Hmmm yea it just creates a giant string right now.

Could modify it to return a list of strings instead of calling join at the end? Would probably need to change the code that calls this parser too
I wonder if paged documents could have an extra option to merge/not merge text 🤔
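
A rough sketch of that merge/not-merge idea, not the library's actual parser. It assumes the per-page text has already been extracted; the function name `parse_pages`, the `merge` flag, and the dict keys are all made up for illustration.

```python
from typing import Dict, List, Union


def parse_pages(
    page_texts: List[str], merge: bool = True
) -> Union[str, List[Dict[str, object]]]:
    """Hypothetical helper: merge already-extracted page text or keep it per page."""
    if merge:
        # Behaviour described in the thread: everything joined into one string.
        return "\n".join(page_texts)
    # Alternative: one record per page, with a 1-based page number attached.
    return [
        {"page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]
```

With `merge=False`, whatever calls the parser would need to handle a list instead of a single string, which is the caller-side change mentioned above.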
yeah the SimpleDirectoryReader has a file_metadata argument that can return file names, but each individual loader doesn't track the file name by default
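
For reference, a minimal sketch of what such a callback might look like, assuming `file_metadata` accepts a function that takes a file path and returns a dict. The name `filename_metadata` and the `"file_name"` key are illustrative choices, and the commented usage line is an assumption that depends on the installed version.

```python
import os
from typing import Dict


def filename_metadata(file_path: str) -> Dict[str, str]:
    """Hypothetical callback: record the file name for each loaded file."""
    return {"file_name": os.path.basename(file_path)}


# Assumed usage (not run here; the import path depends on the installed version):
# documents = SimpleDirectoryReader("data", file_metadata=filename_metadata).load_data()
```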