persist/load pipeline with vector store & docstore

i think i need some help making sure i understand how to persist/load using the pipeline workflow.

is this a correct flow?
Plain Text
# Generate an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

# restore the pipeline
pipeline.load("pipeline_storage")

# Ingest directly into a vector db
pipeline.run(documents)

# save the pipeline
pipeline.persist("pipeline_storage")


my assumption is that if there is no diff, then pipeline.run won't actually re-embed the docs. if there is, it will run and persist for next time.
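roughly, the cache idea (a simplified pure-python sketch of the concept, not the actual IngestionPipeline internals) is to key each transformation's output by a hash of its input, so an unchanged input skips recomputation:

```python
import hashlib

cache = {}

def cached_transform(text, transform):
    # Key on the transform's name plus the input text (simplified).
    key = hashlib.sha256((transform.__name__ + text).encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = transform(text)
    return cache[key]

calls = []

def embed(text):
    calls.append(text)          # track how often real work happens
    return [float(len(text))]   # toy stand-in for an embedding

cached_transform("hello world", embed)
cached_transform("hello world", embed)  # cache hit: embed() not called again
print(len(calls))  # -> 1
```

persisting the pipeline saves this cache to disk so the second run of the *process* (not just the second call) can also skip the work.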
...but maybe i don't understand the order of operations (or what the pipeline cache even does...)
pretty sure this is not correct, because when i add a document, the cache doesn't change and it doesn't seem to be generating new embeddings....
maybe i'm not understanding the difference between persist/load and the cache?
so i added a docstore...
Plain Text
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
)
what am i doing wrong that there are no nodes?

43 Docs => 0 nodes

Plain Text
# Load documents
documents = SimpleDirectoryReader("data", recursive=True, filename_as_id=True).load_data()

# this returns "Found 43 Documents"
print(f"Found {len(documents)} Documents")

# Generate an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    docstore=SimpleDocumentStore(),
)

# run the pipeline
nodes = pipeline.run(documents)

# this returns "Ingested 0 Nodes"
print(f"Ingested {len(nodes)} Nodes")
(same result with a vectorstore)
Persisting an ingestion pipeline only applies when you have a cache and/or docstore attached to it (and assumes those aren't hosted remotely, like in redis)
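to make the cache-vs-docstore distinction concrete, here's a simplified sketch of the docstore dedup idea (an illustration of the concept, not LlamaIndex's actual implementation): each document id maps to a content hash, and only new or changed docs get re-processed:

```python
import hashlib

def dedup_docs(documents, docstore):
    """Return only new or changed documents, updating the docstore.

    `documents` is a list of (doc_id, text) pairs; `docstore` is a plain
    dict mapping doc_id -> content hash (a stand-in for SimpleDocumentStore).
    """
    to_process = []
    for doc_id, text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if docstore.get(doc_id) != digest:   # new or modified document
            docstore[doc_id] = digest
            to_process.append((doc_id, text))
    return to_process

docstore = {}
docs = [("a.txt", "hello"), ("b.txt", "world")]
print(len(dedup_docs(docs, docstore)))  # first run: both are new -> 2
print(len(dedup_docs(docs, docstore)))  # second run: nothing changed -> 0
docs[1] = ("b.txt", "world!")
print(len(dedup_docs(docs, docstore)))  # only b.txt changed -> 1
```

note this is why `filename_as_id=True` matters: each file keeps a stable id across runs, which is what makes the hash comparison possible.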
thanks @Logan M -
that part makes sense after reading the document management pipeline docs more carefully...

...what doesn't make sense is why i have 43 documents that get converted to 0 nodes... (with or without a vectorstore)
Try .run(documents=documents)
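(for anyone hitting this later: i believe the cause is that `documents` isn't the first positional parameter of `IngestionPipeline.run` — the exact signature may vary by version — so a bare positional list silently binds to a different argument. a generic illustration of the pitfall:)

```python
def run(show_progress=False, documents=None):
    # Stand-in for a run(...) whose first positional parameter is NOT documents.
    return list(documents or [])

print(len(run(["doc1", "doc2"])))            # list binds to show_progress -> 0
print(len(run(documents=["doc1", "doc2"])))  # keyword argument -> 2
```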
ok so i think i'm finally up and running with docstore & persist/load!

before i jinx it...sharing with others what seems to work for me, with some debugging lines to help troubleshoot

Plain Text
# imports (assumes the llama_index >= 0.10 package layout; adjust for older versions)
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding

# vector_store is assumed to be created earlier (e.g. a Chroma/Qdrant store)

# Load documents
documents = SimpleDirectoryReader("data", recursive=True, filename_as_id=True).load_data()

print(f"Found {len(documents)} Documents")

# Generate an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
)

# Check if the folder exists
if os.path.exists("pipeline_storage"):
    # Restore the pipeline
    pipeline.load("pipeline_storage")

# Ingest directly into a vector db
nodes = pipeline.run(documents=documents)

print(f"Ingested {len(nodes)} Nodes")

for node in nodes:
    print(f"Node text: {node.text}")
    print(f"Node id: {node.id_}")

# save the pipeline
pipeline.persist("pipeline_storage")
(that looks right to me, nice!)