persist/load pipeline with vector store & docstore

i think i need some help making sure i understand how to persist/load using the pipeline workflow.

is this a correct flow?
Plain Text
# Generate an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

# restore the pipeline
pipeline.load("pipeline_storage")

# Ingest directly into a vector db
pipeline.run(documents)

# save the pipeline
pipeline.persist("pipeline_storage")


my assumption is that if there is no diff, then pipeline.run won't actually re-embed the docs. if there is, it will run and persist for next time.
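roughly, the cache idea (a simplified pure-python sketch of the concept, not the actual IngestionPipeline internals) is to key each transformation's output by a hash of its input, so an unchanged input skips recomputation:

```python
import hashlib

cache = {}

def cached_transform(text, transform):
    # Key on the transform's name plus the input text (simplified).
    key = hashlib.sha256((transform.__name__ + text).encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = transform(text)
    return cache[key]

calls = []

def embed(text):
    calls.append(text)          # track how often real work happens
    return [float(len(text))]   # toy stand-in for an embedding

cached_transform("hello world", embed)
cached_transform("hello world", embed)  # cache hit: embed() not called again
print(len(calls))  # -> 1
```

persisting the pipeline saves this cache to disk so the second run of the *process* (not just the second call) can also skip the work.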
...but maybe i don't understand the order of operations (or what the pipeline cache even does...)
pretty sure this is not correct, because when i add a document, the cache doesn't change and it doesn't seem to be generating new embeddings....
maybe i'm not understanding the difference between persist/load and the cache?
so i added a docstore...
Plain Text
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
)
what am i doing wrong that there are no nodes?

43 Docs => 0 nodes

Plain Text
# Load documents
documents = SimpleDirectoryReader("data", recursive=True, filename_as_id=True).load_data()

# this returns "Found 43 Documents"
print(f"Found {len(documents)} Documents")

# Generate an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    docstore=SimpleDocumentStore(),
)

# run the pipeline
nodes = pipeline.run(documents)

# this returns "Ingested 0 Nodes"
print(f"Ingested {len(nodes)} Nodes")
(same result with a vectorstore)
Persisting an ingestion pipeline only applies when you have a cache and/or docstore attached to it (and assumes those aren't hosted remotely, like in redis)
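to make the cache-vs-docstore distinction concrete, here's a simplified sketch of the docstore dedup idea (an illustration of the concept, not LlamaIndex's actual implementation): each document id maps to a content hash, and only new or changed docs get re-processed:

```python
import hashlib

def dedup_docs(documents, docstore):
    """Return only new or changed documents, updating the docstore.

    `documents` is a list of (doc_id, text) pairs; `docstore` is a plain
    dict mapping doc_id -> content hash (a stand-in for SimpleDocumentStore).
    """
    to_process = []
    for doc_id, text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if docstore.get(doc_id) != digest:   # new or modified document
            docstore[doc_id] = digest
            to_process.append((doc_id, text))
    return to_process

docstore = {}
docs = [("a.txt", "hello"), ("b.txt", "world")]
print(len(dedup_docs(docs, docstore)))  # first run: both are new -> 2
print(len(dedup_docs(docs, docstore)))  # second run: nothing changed -> 0
docs[1] = ("b.txt", "world!")
print(len(dedup_docs(docs, docstore)))  # only b.txt changed -> 1
```

note this is why `filename_as_id=True` matters: each file keeps a stable id across runs, which is what makes the hash comparison possible.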
thanks @Logan M -
that part makes sense after reading the document management pipeline docs more carefully...

...what doesn't make sense is why i have 43 documents that get converted to 0 nodes... (with or without a vectorstore)
Try .run(documents=documents)
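(for anyone hitting this later: i believe the cause is that `documents` isn't the first positional parameter of `IngestionPipeline.run` — the exact signature may vary by version — so a bare positional list silently binds to a different argument. a generic illustration of the pitfall:)

```python
def run(show_progress=False, documents=None):
    # Stand-in for a run(...) whose first positional parameter is NOT documents.
    return list(documents or [])

print(len(run(["doc1", "doc2"])))            # list binds to show_progress -> 0
print(len(run(documents=["doc1", "doc2"])))  # keyword argument -> 2
```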
ok so i think i'm finally up and running with docstore & persist/load!

before i jinx it...sharing with others what seems to work for me, with some debugging lines to help troubleshoot

Plain Text
# imports (assumes the llama_index >= 0.10 package layout; adjust for older versions)
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding

# vector_store is assumed to be created earlier (e.g. a Chroma/Qdrant store)

# Load documents
documents = SimpleDirectoryReader("data", recursive=True, filename_as_id=True).load_data()

print(f"Found {len(documents)} Documents")

# Generate an ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
)

# Check if the folder exists
if os.path.exists("pipeline_storage"):
    # Restore the pipeline
    pipeline.load("pipeline_storage")

# Ingest directly into a vector db
nodes = pipeline.run(documents=documents)

print(f"Ingested {len(nodes)} Nodes")

for node in nodes:
    print(f"Node text: {node.text}")
    print(f"Node id: {node.id_}")

# save the pipeline
pipeline.persist("pipeline_storage")
(that looks right to me, nice!)