
Updated 5 months ago

Document loading

At a glance

That seems feasible as well, though it assumes the docs have already been split into single pages.

9 comments

Here's the source code for SimpleDirectoryReader. From there it uses several parsers, depending on the file type:

https://github.com/jerryjliu/llama_index/blob/main/gpt_index/readers/file/base.py

It would be nice if every loader kept track of the file name 🤔 and only a select few loaders would consider page numbers I think


Logan M

Tbh I've been meaning to look into this. I would do it right now too but my wife said I'm not allowed to code this weekend 🤣 been going a little hard lately

BioHacker

Bro, you've been killing it. Don't worry, I can code no problem. So the code we are looking for is here. Here is a screenshot

BioHacker

https://github.com/jerryjliu/llama_index/blob/22dbdedfaaf92e0f3a435fdf7e73c41ccb75ca21/gpt_index/readers/file/docs_parser.py#L12

BioHacker

Attachment

BioHacker

It seems like it's able to get the page number no problem. It just doesn't append it to the node. Hmmm

Logan M

Hmmm yea it just creates a giant string right now.

Could modify it to return a list of strings instead of calling join at the end? Would probably need to change the code that calls this parser too

Logan M

I wonder if paged documents could have an extra option to merge/not merge text 🤔
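Rough sketch of what I mean, not the actual parser code. `pages_to_chunks`, the `merge` flag and the `page_number` key are all made-up names for illustration, and it assumes the parser already has one string per page before it joins them:

```python
from typing import Dict, List, Tuple


def pages_to_chunks(
    page_texts: List[str], merge: bool = False
) -> List[Tuple[str, Dict[str, int]]]:
    """Return (text, extra_info) pairs (hypothetical helper, not library code)."""
    if merge:
        # Behaviour described above: one giant string, page info is lost.
        return [("\n".join(page_texts), {})]
    # Keep pages separate and record a 1-based page number next to each one.
    return [
        (text, {"page_number": number})
        for number, text in enumerate(page_texts, start=1)
    ]
```

The calling code would then need to build one document per returned pair instead of one per file, which is the part that would have to change outside the parser.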

jerryjliu0

Yeah, the SimpleDirectoryReader has a file_metadata argument that can return file names, but each individual loader doesn't track the file name by default
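As a minimal sketch, assuming file_metadata takes a function that receives a file path and returns a dict (check the signature in your installed version), something like this would attach the file name. `filename_metadata` and the `file_name` key are hypothetical names:

```python
import os
from typing import Dict


def filename_metadata(file_path: str) -> Dict[str, str]:
    """Build per-file metadata from a path (hypothetical helper)."""
    return {"file_name": os.path.basename(file_path)}


# Assumed usage, not verified against a specific release:
# documents = SimpleDirectoryReader(
#     "./data", file_metadata=filename_metadata
# ).load_data()
```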
