Ingestion

Does anyone have an ingestion pipeline example for reading a directory with UnstructuredReader()? My use case: I want to embed/store all the PDFs in a folder, and be able to add new files without having to worry about re-embedding the old ones.

I tried to follow this example: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/root.html , but when I build the index this way (vs. plain old index = VectorStoreIndex.from_documents(documents)), it's clearly not working.
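For context, here's roughly the shape I'm aiming for. This is only a sketch: the ./data and ./chroma_db paths, the "pdfs" collection name, the chunk sizes, and the OpenAI embedding model are all placeholder choices of mine, and Chroma stands in for whatever vector store you actually use.

Python
import chromadb
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import UnstructuredReader
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load every PDF in ./data; filename_as_id gives each file a stable doc id
# so the pipeline can tell old files from new ones on the next run.
documents = SimpleDirectoryReader(
    "./data",
    file_extractor={".pdf": UnstructuredReader()},
    filename_as_id=True,
).load_data()

# A persistent vector store (Chroma here, purely as an example).
db = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=db.get_or_create_collection("pdfs"))

# Attaching both a docstore and a vector store turns on document management:
# unchanged docs are skipped, new/changed docs are embedded and upserted.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        OpenAIEmbedding(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)
pipeline.run(documents=documents)

# Persist the docstore/cache so a later run (with new files added) can skip
# everything already ingested; call pipeline.load("./pipeline_storage")
# before run() on those later runs.
pipeline.persist("./pipeline_storage")

# Query side: just connect to the vector store the pipeline wrote into.
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=OpenAIEmbedding())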
I mean, it seems like it didn't pull the correct data?

What happens when you check response.source_nodes? Does it make sense?
No, good catch. I didn't swap documents=[Document.example()] for documents=documents in the pipeline.run() call.
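For reference, the corrected call just feeds the pipeline the real files, something like this (a sketch, assuming documents comes from SimpleDirectoryReader over my ./data folder and pipeline is the IngestionPipeline from the docs example):

Python
from llama_index.core import SimpleDirectoryReader

# load the actual files instead of the docs' Document.example()
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
nodes = pipeline.run(documents=documents)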
The article builds up step by step, but once I have the pipeline storage, don't I want to replace that with the cached version?

Python
from llama_index.core import Document
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# save
pipeline.persist("./pipeline_storage")

# load and restore state
new_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
    ],
)
new_pipeline.load("./pipeline_storage")

# will run instantly due to the cache (note: run the *loaded* pipeline)
nodes = new_pipeline.run(documents=[Document.example()])
For example, even if the nodes are cached, do I still need to redo index = VectorStoreIndex.from_vector_store(vector_store) every time?
Yeah, from_vector_store() doesn't do anything but make a connection to an existing vector store (i.e. the one you attached to the pipeline).
I hope that kind of makes sense?
Whereas the pipeline is putting stuff into the vector store
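So on startup the query side is only a reconnect, something like this (still assuming the example Chroma store from the sketch above; nothing gets re-embedded here, and the embed model is only used to embed queries):

Python
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Reconnect to the store the pipeline already populated.
db = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=db.get_or_create_collection("pdfs"))

# from_vector_store() only wraps the existing store for querying,
# so it's cheap to call every time the app starts.
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=OpenAIEmbedding())
query_engine = index.as_query_engine()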
I think your document management example helps clarify.
@Logan M - I've got "addition" down, but how do I do "subtraction"? With the above example, files I remove from /data still persist...
Need to explicitly delete them:

Python
index.delete(document.doc_id)
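For anyone else doing that "subtraction" pass, here's a rough sketch of explicitly deleting whatever fell out of /data, assuming filename_as_id=True and the pipeline / index objects from the sketch above. I'm using delete_ref_doc, the per-ref-doc flavor of the delete call above, and matching on basenames because with filename_as_id the doc ids are file paths, optionally suffixed with "_part_N" for multi-part files.

Python
from pathlib import Path

# File names currently on disk (non-recursive, matching the reader above).
files_on_disk = {p.name for p in Path("./data").glob("*.pdf")}

# Doc ids the pipeline's docstore has ingested so far.
for doc_id in list(pipeline.docstore.docs.keys()):
    source_name = Path(doc_id.split("_part_")[0]).name
    if source_name not in files_on_disk:
        index.delete_ref_doc(doc_id)  # remove its nodes from the vector store
        pipeline.docstore.delete_document(doc_id, raise_error=False)  # and from the docstore

# Re-persist so the pruned docstore survives the next run.
pipeline.persist("./pipeline_storage")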