
Updated 2 months ago

@Logan M Can we cache the loading of the

At a glance

A community member asked @Logan M whether the loading of a PDF can be cached, since each new run appears to re-parse the file. The replies note that the llama-parse library already caches results for 48 hours, so a second run simply loads the cached results and should be much faster. Alternatively, the results can be cached locally.

The discussion also touches on caching the index nodes specifically, since their text is generated during the run. One community member provides a code snippet that saves a map of document to nodes and pickles it to disk as a cache.

Additionally, a community member is stumped on getting the recursive retriever to respond with the page number of the document after parsing it through LlamaParse, and asks if there is a straightforward way to do that. Another community member suggests using the JSON result to add metadata to the nodes/documents before indexing.

Useful resources
@Logan M Can we cache the loading of the PDF? https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-8/Advanced_RAG_with_LlamaParse/

Every new run it seems to start parsing the file--
Plain Text
Started parsing the file under job_id d2f528c8-7756-4b7a-53d3a771
5it [00:00, 49461.13it/s]
2it [00:00, 41734.37it/s]
0it [00:00, ?it/s]
4it [00:00, 58254.22it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 19878.22it/s]
1it [00:00, 11491.24it/s]
4it [00:00, 62368.83it/s]
30it [00:00, 528693.78it/s]
9 comments
llama-parse already caches it for 48hrs. On the second run, it's just loading the cached results (you should see it's much faster on re-runs for the same file)
Alternatively you can cache the results locally
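
Caching locally can be as simple as pickling the parsed documents to a fixed path and loading them on later runs. A minimal sketch (the helper and cache path are illustrative, not part of llama-parse), assuming `parser` is a `LlamaParse` instance whose `load_data(...)` returns the parsed documents:

```python
import pickle
from pathlib import Path

# Illustrative cache location -- use whatever path suits your project
CACHE_PATH = Path("parsed_docs.pkl")

def load_documents(parser, file_path):
    """Return parsed documents, reading from the local cache if present."""
    if CACHE_PATH.exists():
        with CACHE_PATH.open("rb") as f:
            return pickle.load(f)
    # Cache miss: parse once, then persist the result for future runs
    documents = parser.load_data(file_path)
    with CACHE_PATH.open("wb") as f:
        pickle.dump(documents, f)
    return documents
```

On the second run the pickle is read back directly, so the parse step is skipped entirely.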
I think I was referring to the parsing part, where it parses the index and the text nodes
Is there a way to simply cache the index nodes? I believe the text for them is generated during the run?
This part?

Plain Text
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

nodes = node_parser.get_nodes_from_documents(documents)


You could save a map of document -> nodes to disk as a cache

Plain Text
document_to_nodes = {}
for node in nodes:
    # Group serialized nodes by their source document id
    document_to_nodes.setdefault(node.ref_doc_id, []).append(node.model_dump())


And then just pickle that to disk to cache
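
A minimal sketch of that pickle round-trip (the helper names are illustrative): save the map built above, then load it back on the next run and rebuild node objects from the dicts, e.g. with `TextNode.model_validate(d)` if your parser produced `TextNode`s.

```python
import pickle

def save_node_cache(document_to_nodes, path="nodes_cache.pkl"):
    """Pickle the document -> serialized-nodes map to disk."""
    with open(path, "wb") as f:
        pickle.dump(document_to_nodes, f)

def load_node_cache(path="nodes_cache.pkl"):
    """Load the cached map; values are the dicts from node.model_dump().

    Rebuild node objects afterwards, e.g. TextNode.model_validate(d)
    from llama_index.core.schema (assuming the nodes were TextNodes).
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```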
Gotcha. I'm also stumped on getting the recursive retriever to respond with the page number of the document after parsing it through LlamaParse, is there a straightforward way to do that?
Use the JSON result to add metadata to the nodes/documents before indexing
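
That route can be sketched with a small helper (hypothetical, not part of LlamaParse) that walks the `pages` list returned by `parser.get_json_result(file_path)` and attaches each page number as metadata before you build nodes. The `md` and `page` field names follow LlamaParse's JSON output; `make_node` stands in for whatever node constructor you use.

```python
def nodes_with_page_metadata(json_results, make_node):
    """Attach page numbers from LlamaParse's JSON result as node metadata.

    `json_results` is the list returned by parser.get_json_result(file_path);
    each entry has a "pages" list whose items carry "md" text and a "page"
    number. `make_node` is your node constructor, e.g.
    lambda text, meta: TextNode(text=text, metadata=meta).
    """
    nodes = []
    for result in json_results:
        for page in result["pages"]:
            # One node per page, with the page number riding along as metadata
            nodes.append(make_node(page["md"], {"page_number": page["page"]}))
    return nodes
```

Index these nodes as usual; the retriever's source nodes will then carry `page_number` in their metadata, so responses can cite it.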
Thank you @Logan M