Is it possible to add an existing, locally saved vector database to an external vector DB provider?

Hello! Is it possible to add an existing, locally saved vector database (created with llama index vector index) to an external vector db provider? I saved the index as files with storage_context.persist, but I would like to transfer the vector db to an external provider, without needing to recompute the whole index. Is it somehow possible or not yet?
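For context, the local persistence setup described here looks roughly like this — a minimal sketch using the legacy llama_index API (./data and ./storage are illustrative paths):

Plain Text
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# build the index and persist it to disk
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")

# later: reload without recomputing embeddings
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)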
13 comments
It's possible, but... slightly hacky haha. Embeddings are dirt cheap, so tbh I would recommend just re-building it

But if you really want to avoid re-computing the embeddings, here's the method. Basically you need to reconstruct the original nodes and attach their embeddings, and then use those to create a new index

Plain Text
# load the index
index = load_index_from_storage(...)

# get the nodes and embeddings
nodes = index.docstore.docs
embeddings = index.vector_store._data.embedding_dict

# attach the embeddings
nodes_with_embeddings = []
for node in nodes:
  node.embedding = embeddings[node.node_id]

# create a new index with the new backend
vector_index = VectorStoreIndex(nodes_with_embeddings, storage_context=storage_context)
not 100% sure this will work, but seems like it will based on the source code πŸ˜†
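For the storage_context in that last line, you'd point it at the new backend. A minimal sketch, assuming Chroma as the target (any supported vector store integration follows the same pattern):

Plain Text
# build a StorageContext backed by the external vector store (Chroma is just an example)
import chromadb
from llama_index import StorageContext
from llama_index.vector_stores import ChromaVectorStore

chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_index")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)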
Wow, thanks a lot! I will try it for sure, as I am working with quite a large volume of docs... πŸ˜„
@Logan M unfortunately I am getting an error:

Plain Text
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-7036255f6e3d> in <cell line: 3>()
      2 nodes_with_embeddings = []
      3 for node in nodes:
----> 4     node.embedding = embeddings[node.node_id]
      5

AttributeError: 'str' object has no attribute 'node_id'

But I am trying to figure out something
A string object, eh? πŸ€” It shouldn't be a string

One sec, I can also double-check the code I gave
yes, it's kind of weird
index.docstore.docs returns a dict, not a list
whoops, here's an updated version
Plain Text
# load the index
index = load_index_from_storage(...)

# get the nodes and embeddings
nodes = index.docstore.docs
embeddings = index.vector_store._data.embedding_dict

# attach the embeddings
nodes_with_embeddings = []
for node_id, node in nodes.items():
  node.embedding = embeddings[node_id]
  nodes_with_embeddings.append(node)

# create a new index with the new backend
vector_index = VectorStoreIndex(nodes_with_embeddings, storage_context=storage_context)
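Once that runs, the embeddings live in the external store and you can query the new index directly — a quick usage sketch (the query string is just an example):

Plain Text
# query the migrated index as usual
query_engine = vector_index.as_query_engine()
response = query_engine.query("What do the docs say about persistence?")
print(response)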
Works almost perfectly! I tried it with an index built for the auto-merging retriever. There are nodes without embeddings, but after counting them, I guess they are the parent chunks, which aren't embedded:

Plain Text
# Initialize an empty list to store nodes with their embeddings
nodes_with_embeddings = []

# Initialize a counter for nodes without embeddings
nodes_without_embeddings_count = 0

# Loop through each node_id and node in the nodes dictionary
for node_id, node in nodes.items():
    if node_id in embeddings:
        node.embedding = embeddings[node_id]
        nodes_with_embeddings.append(node)
    else:
        nodes_without_embeddings_count += 1  # Increment the counter

print(f"Number of nodes without embeddings: {nodes_without_embeddings_count}")
I think you could post your code snippet to the llama index wiki! πŸ™‚
Ah yes! Although heads up, the auto-merging retriever won't work with a vector store integration (at least by default)

With vector db integrations, the entire index is stored in the vector store

This is nice because you don't have to keep track of local files

The downside is that features that rely on the docstore will break (like the auto-merging retriever). You can override this behaviour by setting store_nodes_override=True in the constructor, but then you still need to manage the docstore and index_store files (either locally, or using an integration like Redis or MongoDB)
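To make that last part concrete, here's a minimal, untested sketch of keeping the docstore alongside a vector store integration so the auto-merging retriever keeps working (vector_store and the persist directory are assumptions carried over from above):

Plain Text
# keep nodes in the docstore even when a vector store integration is used
from llama_index import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=storage_context,
    store_nodes_override=True,  # also write nodes to the docstore
)

# the docstore/index_store still have to live somewhere, e.g. persisted locally
# (or swap in a RedisDocumentStore / MongoDocumentStore via StorageContext.from_defaults)
storage_context.persist(persist_dir="./storage")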