@Logan M Can we cache the loading of the PDF? https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-8/Advanced_RAG_with_LlamaParse/

Every new run, it seems to start parsing the file again:
Plain Text
Started parsing the file under job_id d2f528c8-7756-4b7a-53d3a771
5it [00:00, 49461.13it/s]
2it [00:00, 41734.37it/s]
0it [00:00, ?it/s]
4it [00:00, 58254.22it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
1it [00:00, 19878.22it/s]
1it [00:00, 11491.24it/s]
4it [00:00, 62368.83it/s]
30it [00:00, 528693.78it/s]
llama-parse already caches it for 48 hours. On the second run, it's just loading the cached results (you should see it's much faster on re-runs of the same file).
Alternatively, you can cache the results locally.
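For example, a minimal local cache might look like this (a sketch; the cache path, input file, and LlamaParse settings are placeholders):

Plain Text
import pickle
from pathlib import Path

from llama_parse import LlamaParse

CACHE_PATH = Path("parsed_docs.pkl")  # hypothetical cache file

if CACHE_PATH.exists():
    # Reuse the local copy instead of hitting the parsing API again
    with CACHE_PATH.open("rb") as f:
        documents = pickle.load(f)
else:
    parser = LlamaParse(result_type="markdown")
    documents = parser.load_data("file.pdf")  # placeholder path
    with CACHE_PATH.open("wb") as f:
        pickle.dump(documents, f)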
I think I was referring to the parsing step that builds the index and the text nodes.
Is there a way to simply cache the index nodes? I believe the text for them is generated at runtime?
This part?

Plain Text
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

nodes = node_parser.get_nodes_from_documents(documents)


You could save a map of document -> nodes to disk as a cache:

Plain Text
# Group node dicts by their source document ID
document_to_nodes = {}
for node in nodes:
    if node.ref_doc_id not in document_to_nodes:
        document_to_nodes[node.ref_doc_id] = []
    # model_dump() serializes the node to a plain dict
    document_to_nodes[node.ref_doc_id].append(node.model_dump())


And then just pickle that dict to disk as the cache.
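For example, a sketch of the round trip (the cache path is a placeholder; TextNode.model_validate assumes a pydantic-v2-style llama-index, and rebuilding everything as TextNode is a simplification, since MarkdownElementNodeParser also emits IndexNode objects):

Plain Text
import pickle

from llama_index.core.schema import TextNode

# Write the cache; document_to_nodes holds plain dicts from model_dump()
with open("node_cache.pkl", "wb") as f:
    pickle.dump(document_to_nodes, f)

# On a later run, load the dicts and rebuild node objects
with open("node_cache.pkl", "rb") as f:
    cached = pickle.load(f)

nodes = [
    TextNode.model_validate(d)
    for node_dicts in cached.values()
    for d in node_dicts
]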
Gotcha. I'm also stumped on getting the recursive retriever to respond with the page number of the document after parsing it through LlamaParse. Is there a straightforward way to do that?
Use the JSON result to add page metadata to the nodes/documents before indexing.
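A minimal sketch of that, assuming LlamaParse's get_json_result and a "pages" list whose entries carry "page" and "md" keys (check the JSON shape for your version):

Plain Text
from llama_index.core import Document
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")
json_results = parser.get_json_result("file.pdf")  # one entry per input file

documents = []
for result in json_results:
    for page in result["pages"]:
        documents.append(
            Document(
                text=page["md"],  # markdown text for this page
                metadata={"page_number": page["page"]},
            )
        )

Document metadata propagates by default to the nodes derived from each document, so the retriever can surface page_number in its responses.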
Thank you @Logan M