Is it possible to add an existing, locally saved vector database to an external vector DB provider?

Hello! Is it possible to add an existing, locally saved vector database (created with llama index vector index) to an external vector db provider? I saved the index as files with storage_context.persist, but I would like to transfer the vector db to an external provider, without needing to recompute the whole index. Is it somehow possible or not yet?
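For context, the local persistence setup described here looks roughly like this — a minimal sketch using the legacy llama_index API (./data and ./storage are illustrative paths):

Plain Text
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# build the index and persist it to disk
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")

# later: reload without recomputing embeddings
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)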
13 comments
It's possible, but... slightly hacky haha. Embeddings are dirt cheap, so tbh I would recommend just re-building it

But if you really want to avoid re-computing the embeddings, here's the method. Basically you need to reconstruct the original nodes and attach their embeddings, and then use those to create a new index

Plain Text
# load the index
index = load_index_from_storage(...)

# get the nodes and embeddings
nodes = index.docstore.docs
embeddings = index.vector_store._data.embedding_dict

# attach the embeddings
nodes_with_embeddings = []
for node in nodes:
  node.embedding = embeddings[node.node_id]

# create a new index with the new backend
vector_index = VectorStoreIndex(nodes_with_embeddings, storage_context=storage_context)
not 100% sure this will work, but seems like it will based on the source code πŸ˜†
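For the storage_context in that last line, you'd point it at the new backend. A minimal sketch, assuming Chroma as the target (any supported vector store integration follows the same pattern):

Plain Text
# build a StorageContext backed by the external vector store (Chroma is just an example)
import chromadb
from llama_index import StorageContext
from llama_index.vector_stores import ChromaVectorStore

chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_index")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)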
Wow, thanks a lot! I will try it for sure, as I am working with quite a large volume of docs... πŸ˜„
@Logan M unfortunately I am getting an error:

Plain Text
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-7036255f6e3d> in <cell line: 3>()
      2 nodes_with_embeddings = []
      3 for node in nodes:
----> 4     node.embedding = embeddings[node.node_id]
      5

AttributeError: 'str' object has no attribute 'node_id'

But I am trying to figure out something
A string object, eh? πŸ€” It shouldn't be a string

One sec, I can also double-check the code I gave
yes, it's kind of weird
index.docstore.docs returns a dict, not a list
whoops, here's an updated version
Plain Text
# load the index
index = load_index_from_storage(...)

# get the nodes and embeddings
nodes = index.docstore.docs
embeddings = index.vector_store._data.embedding_dict

# attach the embeddings
nodes_with_embeddings = []
for node_id, node in nodes.items():
  node.embedding = embeddings[node_id]
  nodes_with_embeddings.append(node)

# create a new index with the new backend
vector_index = VectorStoreIndex(nodes_with_embeddings, storage_context=storage_context)
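Once that runs, the embeddings live in the external store and you can query the new index directly — a quick usage sketch (the query string is just an example):

Plain Text
# query the migrated index as usual
query_engine = vector_index.as_query_engine()
response = query_engine.query("What do the docs say about persistence?")
print(response)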
Works almost perfectly! I tried it with an index built for the auto-merging retriever. There are nodes without embeddings, but after counting them, I guess they are the parent chunks, which aren't embedded:

Plain Text
# Initialize an empty list to store nodes with their embeddings
nodes_with_embeddings = []

# Initialize a counter for nodes without embeddings
nodes_without_embeddings_count = 0

# Loop through each node_id and node in the nodes dictionary
for node_id, node in nodes.items():
    if node_id in embeddings:
        node.embedding = embeddings[node_id]
        nodes_with_embeddings.append(node)
    else:
        nodes_without_embeddings_count += 1  # Increment the counter

print(f"Number of nodes without embeddings: {nodes_without_embeddings_count}")
I think you could post your code snippet to the llama index wiki! πŸ™‚
Ah yes! Although heads up, the auto-merging retriever won't work with a vector store integration (at least by default)

With vector db integrations, the entire index is stored in the vector store

This is nice because you don't have to keep track of local files

The downside is that features that rely on the docstore will break (like the auto-merging retriever). You can override this behaviour by setting store_nodes_override=True in the constructor, but then you still need to manage the docstore and index_store files (either locally, or using an integration like Redis or MongoDB)
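To make that last part concrete, here's a minimal, untested sketch of keeping the docstore alongside a vector store integration so the auto-merging retriever keeps working (vector_store and the persist directory are assumptions carried over from above):

Plain Text
# keep nodes in the docstore even when a vector store integration is used
from llama_index import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=storage_context,
    store_nodes_override=True,  # also write nodes to the docstore
)

# the docstore/index_store still have to live somewhere, e.g. persisted locally
# (or swap in a RedisDocumentStore / MongoDocumentStore via StorageContext.from_defaults)
storage_context.persist(persist_dir="./storage")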