Document Management

I have quite a large database. Is there a way I can iterate on changing parts of my service_context (e.g., mixing and changing metadata extractors) without rebuilding the index?

Also, what would be the easiest way to push NULL as the embedding? I would like to do the embedding step in batch with a script on a rented GPU rather than as part of a pipeline.
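One way to defer embedding along these lines: serialize the parsed nodes with the `embedding` field set to null, then fill in the field with a separate batch script on the GPU machine. A minimal sketch using plain JSONL (the record shape here is a hypothetical stand-in, not the library's node schema, and `embed_batch` is a placeholder for a real embedding model call):

```python
import json

# Stage 1 (CPU box): write nodes with a null embedding placeholder.
nodes = [
    {"id_": "node-0", "text": "First chunk of the document.", "embedding": None},
    {"id_": "node-1", "text": "Second chunk of the document.", "embedding": None},
]
with open("nodes.jsonl", "w") as f:
    for node in nodes:
        f.write(json.dumps(node) + "\n")

# Stage 2 (GPU box): batch-embed only the rows still missing a vector.
def embed_batch(texts):
    # Stand-in for a real embedding model call on the GPU.
    return [[float(len(t))] for t in texts]

with open("nodes.jsonl") as f:
    nodes = [json.loads(line) for line in f]

pending = [n for n in nodes if n["embedding"] is None]
vectors = embed_batch([n["text"] for n in pending])
for node, vec in zip(pending, vectors):
    node["embedding"] = vec

# Write the now-embedded nodes back out for loading into the vector store.
with open("nodes.jsonl", "w") as f:
    for node in nodes:
        f.write(json.dumps(node) + "\n")
```

The point of the two-stage split is that stage 1 needs no GPU and stage 2 needs no access to the original documents, so the embedding pass can run anywhere the JSONL file can be copied.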
Not super clear if I can do this:

Python
# build original index

service_context = create_service_context(include_metadata_extractors=False)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context, show_progress=True
)

# change the service context to now include metadata extractors and refresh the index to act on any updates.
service_context = create_service_context(include_metadata_extractors=True)
index.refresh(documents, service_context=service_context)


It looks like refresh only re-processes a document if its text has changed...
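That matches refresh's change detection being keyed on document content: it compares a hash of each document against what was previously stored, so swapping metadata extractors in the service_context leaves the hash unchanged and nothing is re-processed. A hypothetical stand-in sketch of that logic (not the library's actual code):

```python
import hashlib

def doc_hash(text):
    # Content hash used to decide whether a document needs re-processing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hashes recorded when the index was first built.
stored_hashes = {"doc-1": doc_hash("original text")}

def needs_refresh(doc_id, text):
    # Re-process only when the content hash differs from the stored one.
    # Pipeline changes (e.g. new metadata extractors) leave the hash as-is,
    # so refresh skips the document even though its nodes would differ.
    return stored_hashes.get(doc_id) != doc_hash(text)
```

Under this model, changing only the extractors means rebuilding the affected nodes yourself rather than relying on refresh.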
hey @Wizboar, in this case you might want to just use the lower-level components to build the pipeline yourself. For example, first load documents with loaders, then use the node parser (https://docs.llamaindex.ai/en/stable/core_modules/data_modules/node_parsers/root.html#) to parse documents into nodes, then call the embedding model to compute embeddings on your own GPU.
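Schematically, that lower-level pipeline has three stages: load, parse into nodes, embed. A pure-Python sketch of the control flow, where all three components are simplified stand-ins (the real loader, node parser, and embedding model have richer APIs than shown here):

```python
# Three-stage sketch of the lower-level pipeline: load -> parse -> embed.
# Every function below is a simplified stand-in, not a library API.

def load_documents():
    # Stand-in for a reader/loader returning raw document texts.
    return ["Paragraph one. Paragraph two.", "Another document. More text."]

def parse_into_nodes(documents, chunk_on="."):
    # Stand-in for a node parser: split each document into chunk "nodes",
    # each carrying a back-reference to its source document.
    nodes = []
    for doc_id, text in enumerate(documents):
        for chunk in filter(None, (c.strip() for c in text.split(chunk_on))):
            nodes.append({"doc_id": doc_id, "text": chunk, "embedding": None})
    return nodes

def embed_nodes(nodes, batch_size=2):
    # Stand-in for batched GPU embedding: fill each node's vector in batches.
    for start in range(0, len(nodes), batch_size):
        for node in nodes[start:start + batch_size]:
            node["embedding"] = [float(len(node["text"]))]
    return nodes

nodes = embed_nodes(parse_into_nodes(load_documents()))
```

Because each stage just consumes and produces plain node records, any stage can be swapped out or re-run independently, which is exactly the flexibility the original question is after.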
@disiok when adding nodes to the VectorStoreIndex, I want to pass the nodes saved to JSONL from the node parser step.

Should I load them in as the BaseNode schema or the IndexNode schema?