
Updated 2 months ago

@Logan M Can we cache the loading of the

At a glance

A community member asked @Logan M whether the loading of a PDF can be cached, since each new run appears to re-parse the file. The replies note that the llama-parse library already caches results for 48 hours, so a second run simply loads the cached results and should be much faster. Alternatively, the results can be cached locally.

The discussion also touches on caching the index nodes specifically, since their text is generated during the run. One community member provides a code snippet that saves a map of document to nodes and pickles it to disk as a cache.

Additionally, a community member is stumped on getting the recursive retriever to respond with the page number of the document after parsing it through LlamaParse, and asks if there is a straightforward way to do that. Another community member suggests using the JSON result to add metadata to the nodes/documents before indexing.

Useful resources
@Logan M Can we cache the loading of the PDF? https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-8/Advanced_RAG_with_LlamaParse/

Every new run it seems to start parsing the file--
Plain Text
Started parsing the file under job_id d2f528c8-7756-4b7a-53d3a771
5it [00:00, 49461.13it/s]
2it [00:00, 41734.37it/s]
0it [00:00, ?it/s]
4it [00:00, 58254.22it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 19878.22it/s]
1it [00:00, 11491.24it/s]
4it [00:00, 62368.83it/s]
30it [00:00, 528693.78it/s]
9 comments
llama-parse already caches it for 48hrs. On the second run, it's just loading the cached results (you should see it's much faster on re-runs for the same file)
Alternatively you can cache the results locally
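
Caching locally can be as simple as pickling the parsed documents to a fixed path and loading them on later runs. A minimal sketch (the helper and cache path are illustrative, not part of llama-parse), assuming `parser` is a `LlamaParse` instance whose `load_data(...)` returns the parsed documents:

```python
import pickle
from pathlib import Path

# Illustrative cache location -- use whatever path suits your project
CACHE_PATH = Path("parsed_docs.pkl")

def load_documents(parser, file_path):
    """Return parsed documents, reading from the local cache if present."""
    if CACHE_PATH.exists():
        with CACHE_PATH.open("rb") as f:
            return pickle.load(f)
    # Cache miss: parse once, then persist the result for future runs
    documents = parser.load_data(file_path)
    with CACHE_PATH.open("wb") as f:
        pickle.dump(documents, f)
    return documents
```

On the second run the pickle is read back directly, so the parse step is skipped entirely.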
I think I was referring to the parsing part, where it parses the index and the text nodes
Is there a way to simply cache the index nodes? I believe the text for them is generated during the run?
This part?

Plain Text
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

nodes = node_parser.get_nodes_from_documents(documents)


You could save a map of document -> nodes to disk as a cache

Plain Text
document_to_nodes = {}
for node in nodes:
    # Group serialized nodes by their source document id
    document_to_nodes.setdefault(node.ref_doc_id, []).append(node.model_dump())


And then just pickle that to disk to cache
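
A minimal sketch of that pickle round-trip (the helper names are illustrative): save the map built above, then load it back on the next run and rebuild node objects from the dicts, e.g. with `TextNode.model_validate(d)` if your parser produced `TextNode`s.

```python
import pickle

def save_node_cache(document_to_nodes, path="nodes_cache.pkl"):
    """Pickle the document -> serialized-nodes map to disk."""
    with open(path, "wb") as f:
        pickle.dump(document_to_nodes, f)

def load_node_cache(path="nodes_cache.pkl"):
    """Load the cached map; values are the dicts from node.model_dump().

    Rebuild node objects afterwards, e.g. TextNode.model_validate(d)
    from llama_index.core.schema (assuming the nodes were TextNodes).
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```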
Gotcha. I'm also stumped on getting the recursive retriever to respond with the page number of the document after parsing it through LlamaParse, is there a straightforward way to do that?
Use the JSON result to add metadata to the nodes/documents before indexing
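
That route can be sketched with a small helper (hypothetical, not part of LlamaParse) that walks the `pages` list returned by `parser.get_json_result(file_path)` and attaches each page number as metadata before you build nodes. The `md` and `page` field names follow LlamaParse's JSON output; `make_node` stands in for whatever node constructor you use.

```python
def nodes_with_page_metadata(json_results, make_node):
    """Attach page numbers from LlamaParse's JSON result as node metadata.

    `json_results` is the list returned by parser.get_json_result(file_path);
    each entry has a "pages" list whose items carry "md" text and a "page"
    number. `make_node` is your node constructor, e.g.
    lambda text, meta: TextNode(text=text, metadata=meta).
    """
    nodes = []
    for result in json_results:
        for page in result["pages"]:
            # One node per page, with the page number riding along as metadata
            nodes.append(make_node(page["md"], {"page_number": page["page"]}))
    return nodes
```

Index these nodes as usual; the retriever's source nodes will then carry `page_number` in their metadata, so responses can cite it.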
Thank you @Logan M