Might be a niche question, but maybe

Might be a niche question, but maybe someone can give some insights / ideas. We would like ingested documents to be stored in a 'staging' environment, where they are not instantly linked to an index.

Basically, I'd like to persist Document objects before they are added to an index. The use case is that we want to upload a large volume of documents to our application that are ready for use, but that do not need to be added to an index immediately, as this should happen 'on the fly'. Does anyone know of a way to realize this with LlamaIndex functionality? I haven't been up to date with the last few months of developments, so there's a possibility that I missed some things. Thanks in advance!
I think you can create the docs and persist them using the docstore, and when you want to load them you can insert them into the index

Python
from llama_index.node_parser import SimpleNodeParser
from llama_index.storage.docstore import SimpleDocumentStore

# parse the documents into nodes
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(documents)

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

# save the created nodes locally, either at the default path or at your desired path via persist_path
docstore.persist()

# load it when needed; from_persist_path is a classmethod that returns a new docstore
docstore = SimpleDocumentStore.from_persist_path("./storage/docstore.json")
nodes = list(docstore.docs.values())

# add the nodes to the index
index.insert_nodes(nodes)

This should work πŸ˜…
This looks ideal! Really matches the use case that we had in mind :) Thanks for the quick response β™₯️
@OverclockedClock also, v0.9 is launching later today, there's a super helpful concept of an IngestionPipeline that is pretty much made for this exact purpose too!
Preview blog post πŸ‘
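For reference, a minimal sketch of what an IngestionPipeline looks like, assuming the v0.9 import paths (the transformations shown are placeholders, not a prescribed setup):

Python
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import SentenceSplitter

# a pipeline is an ordered list of transformations applied to documents
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding(),
    ]
)

# produces nodes directly, without ever touching an index
nodes = pipeline.run(documents=documents)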
You guys are amazingπŸ™πŸ™πŸ™ right on time for us
Sorry to bother @Logan M, but just to check if I understand the IngestionPipeline correctly.

This is a pipeline that automatically processes Document objects and turns them into Nodes, and can subsequently feed them into a VectorStore as well, on the fly. I could use this with the docstore: I retrieve the relevant documents and put them through the IngestionPipeline, which will automatically recognize Document objects that have been through this pipeline before and skip them during the creation of my VectorStore, only transforming Documents that have not been transformed before.
I still need to use the docstore to store uploaded documents in this 'staging' state, where they are in the LlamaIndex ecosystem but not yet added to an Index
Hmmm I think maybe a slight misunderstanding

It "skips" the processing of already seen data, but it will still return it (its just returning the cached version)

If you need the docstore, you could run the pipeline and then throw the nodes into the docstore + wherever else you need them
definitely have plans to make this smarter as we go though πŸ™‚
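A rough sketch of that flow, assuming the v0.9 APIs (the persist path is a placeholder):

Python
from llama_index.ingestion import IngestionPipeline
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.text_splitter import SentenceSplitter

pipeline = IngestionPipeline(transformations=[SentenceSplitter()])

# already-seen inputs come back from the pipeline's cache instead of being re-processed
nodes = pipeline.run(documents=documents)

# stage the resulting nodes in a docstore, with no index involved yet
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
docstore.persist("./storage/docstore.json")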
Ah I probably worded it wrong, I was expecting it to return the cached data too.
It seems like the IngestionPipeline is mostly there for setting up indices 'on the fly', correct? Not necessarily for the persistence of documents / nodes that are not linked to any index
Yea, not necessarily for persistence, just for the processing (i.e. you can process data without ever needing an actual index)
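In other words, the staged nodes can be turned into an index only at the moment one is needed; a one-line sketch, assuming a VectorStoreIndex:

Python
from llama_index import VectorStoreIndex

# build the index 'on the fly' from previously staged nodes
index = VectorStoreIndex(nodes=nodes)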
Great! I think I'm on the same page then
Thank you very much as usual
awesome, sounds good!