Ingestion

Does anyone have an ingestion pipeline example for reading a directory with UnstructuredReader()? My use case: I want to embed/store all the PDFs in a folder, and be able to add new files without having to worry about re-embedding the old ones.

I tried to follow this example: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/root.html , but when I build the index this way (vs. plain old index = VectorStoreIndex.from_documents(documents)), it's clearly not working.
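For context, here's roughly the shape I'm aiming for. This is only a sketch: the ./data and ./chroma_db paths, the "pdfs" collection name, the chunk sizes, and the OpenAI embedding model are all placeholder choices of mine, and Chroma stands in for whatever vector store you actually use.

Python
import chromadb
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import UnstructuredReader
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load every PDF in ./data; filename_as_id gives each file a stable doc id
# so the pipeline can tell old files from new ones on the next run.
documents = SimpleDirectoryReader(
    "./data",
    file_extractor={".pdf": UnstructuredReader()},
    filename_as_id=True,
).load_data()

# A persistent vector store (Chroma here, purely as an example).
db = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=db.get_or_create_collection("pdfs"))

# Attaching both a docstore and a vector store turns on document management:
# unchanged docs are skipped, new/changed docs are embedded and upserted.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        OpenAIEmbedding(),
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)
pipeline.run(documents=documents)

# Persist the docstore/cache so a later run (with new files added) can skip
# everything already ingested; call pipeline.load("./pipeline_storage")
# before run() on those later runs.
pipeline.persist("./pipeline_storage")

# Query side: just connect to the vector store the pipeline wrote into.
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=OpenAIEmbedding())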
I mean, it seems like it didn't pull the correct data?

What happens when you check response.source_nodes? Does it make sense?
No, good catch. I didn't swap documents=[Document.example()] for documents=documents in the pipeline.run() call.
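For reference, the corrected call just feeds the pipeline the real files, something like this (a sketch, assuming documents comes from SimpleDirectoryReader over my ./data folder and pipeline is the IngestionPipeline from the docs example):

Python
from llama_index.core import SimpleDirectoryReader

# load the actual files instead of the docs' Document.example()
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
nodes = pipeline.run(documents=documents)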
The article builds up step by step, but once I have the pipeline storage, don't I want to replace that with the cached version?

Python
from llama_index.core import Document
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# save
pipeline.persist("./pipeline_storage")

# load and restore state
new_pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
    ],
)
new_pipeline.load("./pipeline_storage")

# will run instantly due to the cache (note: run the *loaded* pipeline)
nodes = new_pipeline.run(documents=[Document.example()])
For example, even if the nodes are cached, do I still need to redo index = VectorStoreIndex.from_vector_store(vector_store) every time?
Yeah, from_vector_store() doesn't do anything but make a connection to an existing vector store (i.e. the one you attached to the pipeline).
I hope that kind of makes sense?
Whereas the pipeline is putting stuff into the vector store
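So on startup the query side is only a reconnect, something like this (still assuming the example Chroma store from the sketch above; nothing gets re-embedded here, and the embed model is only used to embed queries):

Python
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Reconnect to the store the pipeline already populated.
db = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=db.get_or_create_collection("pdfs"))

# from_vector_store() only wraps the existing store for querying,
# so it's cheap to call every time the app starts.
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=OpenAIEmbedding())
query_engine = index.as_query_engine()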
I think your document management example helps clarify.
@Logan M - I've got "addition" down, but how do I do "subtraction"? With the above example, files I remove from /data still persist...
Need to explicitly delete them:

Python
index.delete(document.doc_id)
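For anyone else doing that "subtraction" pass, here's a rough sketch of explicitly deleting whatever fell out of /data, assuming filename_as_id=True and the pipeline / index objects from the sketch above. I'm using delete_ref_doc, the per-ref-doc flavor of the delete call above, and matching on basenames because with filename_as_id the doc ids are file paths, optionally suffixed with "_part_N" for multi-part files.

Python
from pathlib import Path

# File names currently on disk (non-recursive, matching the reader above).
files_on_disk = {p.name for p in Path("./data").glob("*.pdf")}

# Doc ids the pipeline's docstore has ingested so far.
for doc_id in list(pipeline.docstore.docs.keys()):
    source_name = Path(doc_id.split("_part_")[0]).name
    if source_name not in files_on_disk:
        index.delete_ref_doc(doc_id)  # remove its nodes from the vector store
        pipeline.docstore.delete_document(doc_id, raise_error=False)  # and from the docstore

# Re-persist so the pruned docstore survives the next run.
pipeline.persist("./pipeline_storage")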