Might be a niche question, but maybe

Might be a niche question, but maybe someone can give some insights / ideas. We would like ingested documents to be stored in a 'staging' environment, where they are not instantly linked to an index.

Basically, I'd like to persist Document objects before they are added to an index. The use case is that we want to upload a large volume of documents to our application that are ready for use, but that do not need to be added to an index immediately, as this should happen 'on the fly'. Does anyone know of a way to realize this with LlamaIndex functionality? I haven't been up to date with the last few months of developments, so there's a possibility that I missed some things. Thanks in advance!
I think you can create the docs and persist them using the docstore, and when you want to load them you can insert them into the index

Python
from llama_index.node_parser import SimpleNodeParser
from llama_index.storage.docstore import SimpleDocumentStore

# parse the documents into nodes
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(documents)

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

# save the created nodes locally, either at the default path or at your desired path via persist_path
docstore.persist()

# load it when needed; from_persist_path is a classmethod that returns a new docstore
docstore = SimpleDocumentStore.from_persist_path("./storage/docstore.json")
nodes = list(docstore.docs.values())

# add the nodes to the index
index.insert_nodes(nodes)

This should work πŸ˜…
This looks ideal! Really matches the use case that we had in mind :) Thanks for the quick response β™₯️
@OverclockedClock also, v0.9 is launching later today, there's a super helpful concept of an IngestionPipeline that is pretty much made for this exact purpose too!
Preview blog post πŸ‘
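For reference, a minimal sketch of what an IngestionPipeline looks like, assuming the v0.9 import paths (the transformations shown are placeholders, not a prescribed setup):

Python
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import SentenceSplitter

# a pipeline is an ordered list of transformations applied to documents
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding(),
    ]
)

# produces nodes directly, without ever touching an index
nodes = pipeline.run(documents=documents)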
You guys are amazingπŸ™πŸ™πŸ™ right on time for us
Sorry to bother @Logan M, but just to check if I understand the IngestionPipeline correctly.

This is a pipeline that automatically processes Document objects and turns them into Nodes, and can subsequently feed them into a VectorStore as well, on the fly. I could use this with the docstore: I retrieve the relevant documents and put them through the IngestionPipeline, which will automatically recognize Document objects that have been through this pipeline before and skip them during the creation of my VectorStore, only transforming Documents that have not been transformed before.
I still need to use the docstore to store uploaded documents in this 'staging' state, where they are in the LlamaIndex ecosystem but not yet added to an Index
Hmmm I think maybe a slight misunderstanding

It "skips" the processing of already seen data, but it will still return it (its just returning the cached version)

If you need the docstore, you could run the pipeline and then throw the nodes into the docstore + wherever else you need them
definitely have plans to make this smarter as we go though πŸ™‚
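A rough sketch of that flow, assuming the v0.9 APIs (the persist path is a placeholder):

Python
from llama_index.ingestion import IngestionPipeline
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.text_splitter import SentenceSplitter

pipeline = IngestionPipeline(transformations=[SentenceSplitter()])

# already-seen inputs come back from the pipeline's cache instead of being re-processed
nodes = pipeline.run(documents=documents)

# stage the resulting nodes in a docstore, with no index involved yet
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
docstore.persist("./storage/docstore.json")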
Ah I probably worded it wrong, I was expecting it to return the cached data too.
It seems like the IngestionPipeline is mostly there for setting up indices 'on the fly', correct? Not necessarily for the persistence of documents / nodes that are not linked to any index
Yea, not necessarily for persistence, just for the processing (i.e. you can process data without ever needing an actual index)
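In other words, the staged nodes can be turned into an index only at the moment one is needed; a one-line sketch, assuming a VectorStoreIndex:

Python
from llama_index import VectorStoreIndex

# build the index 'on the fly' from previously staged nodes
index = VectorStoreIndex(nodes=nodes)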
Great! I think I'm on the same page then
Thank you very much as usual
awesome, sounds good!