Hi, I have indexed a big amount of data

Hi, I have indexed a large amount of data with llama-index and stored it on disk. The problem is that the default_vector_store.json file is bigger than 6GB. Now, each time I want to load the data into a storage_context and create a new VectorStoreIndex to run queries, it takes more than half an hour just to load it. Any ideas?
I would migrate to an actual vector db (faiss, qdrant, chroma, etc.)

(And I would have migrated once it got to the 1GB size πŸ˜… )

To migrate, you need to get the nodes + vectors out and then re-create using the vector db of your choice

Plain Text
from llama_index import StorageContext, VectorStoreIndex

vector_store = <My new vectordb integration>

# Pull the nodes out of the docstore and re-attach their stored embeddings
nodes_dict = index.docstore.docs
nodes_with_embeddings = []
for id_, vector in index.vector_store._data.embedding_dict.items():
    node = nodes_dict[id_]
    node.embedding = vector
    nodes_with_embeddings.append(node)

# Re-create the index on top of the new vector store (embeddings are reused, not recomputed)
new_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=StorageContext.from_defaults(vector_store=vector_store)
)
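For reference, here is a minimal sketch of what the <My new vectordb integration> placeholder could look like for Chroma, assuming the legacy llama_index package layout shown in the tracebacks below and a hypothetical on-disk path and collection name:

Plain Text
import chromadb
from llama_index.vector_stores import ChromaVectorStore

# Hypothetical path and collection name -- adjust to your setup
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")

# Wrap the Chroma collection in the LlamaIndex vector store integration
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)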
Thanks a lot
I have implemented your code to migrate my indexed data to ChromaDB, but when creating new_index to store the data in ChromaDB, I got this error:
Traceback (most recent call last):
  .........
  File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\segment.py", line 361, in _add
    validate_batch(
  File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\types.py", line 505, in validate_batch
    raise ValueError(
ValueError: Batch size 41665 exceeds maximum batch size 5461
ah yea, that might happen with a lot of data lol
Plain Text
from llama_index import VectorStoreIndex

vector_store = <My new vectordb integration>

# Pull the nodes out of the docstore and re-attach their stored embeddings
nodes_dict = index.docstore.docs
nodes_with_embeddings = []
for id_, vector in index.vector_store._data.embedding_dict.items():
    node = nodes_dict[id_]
    node.embedding = vector
    nodes_with_embeddings.append(node)

# Add to the vector store in batches to stay under chroma's max batch size
batch_size = 5000
for batch_idx in range(0, len(nodes_with_embeddings), batch_size):
    vector_store.add(nodes_with_embeddings[batch_idx:batch_idx + batch_size])

# then to use your vector store in an index
index = VectorStoreIndex.from_vector_store(vector_store)
Does that mean I have to store my data in batches of 5000 nodes?
yea, just inserting in batches (I picked 5000 as a round number, it seems chromadb has a max batch size of 5461)
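If you'd rather not hard-code the batch size, one option (a sketch, not from the thread; chroma_client is the Chroma client from the earlier sketch) is to read the limit from the client when the installed chromadb version exposes one, and fall back to a conservative value otherwise:

Plain Text
# Some chromadb versions expose the limit on the client; fall back to a value
# safely below the 5461 reported in the error if the attribute is missing.
batch_size = getattr(chroma_client, "max_batch_size", 5000)

for batch_idx in range(0, len(nodes_with_embeddings), batch_size):
    vector_store.add(nodes_with_embeddings[batch_idx:batch_idx + batch_size])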
I still get the same error. I am now adding batches of 5000 nodes at a time to the Chroma vector_store, but when that process finishes and it tries to build the new new_index VectorStoreIndex on top of the ChromaDB data, it throws this batching limit error:
File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 255, in build_index_from_nodes return self._build_index_from_nodes(nodes, **insert_kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 236, in _build_index_from_nodes self._add_nodes_to_index( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 190, in _add_nodes_to_index new_ids = self._vector_store.add(nodes, **insert_kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\vector_stores\chroma.py", line 243, in add self._collection.add( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\models\Collection.py", line 168, in add self._client._add(ids, self.id, embeddings, metadatas, documents, uris) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\telemetry\opentelemetry\__init__.py", line 127, in wrapper return f(*args, **kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\segment.py", line 361, in _add validate_batch( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\types.py", line 505, in validate_batch raise ValueError( ValueError: Batch size 41665 exceeds maximum batch size 5461
Any new ideas?
Which line of code throws that error? You should just need to do index = VectorStoreIndex.from_vector_store(vector_store)
exactly this line of code throws the error:
index = VectorStoreIndex.from_vector_store(vector_store)
The ChromaDB vector_store is written to disk before this index-creation line.
are you sure? The traceback doesn't make sense for that line of code
The chroma.sqlite3 file on my disk is now bigger than 5GB
from_vector_store() will never get into _add_nodes_to_index -- there's no nodes to add πŸ˜…
Sorry, this is the line of code that causes the trouble:
Plain Text
new_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=StorageContext.from_defaults(vector_store=chromadb_vs)
)
right -- you don't need that line of code anymore, if you already did vector_store.add() in batches
Ok, thanks. Now I will try to load the ChromaDB data and run some queries against it, but I fear I will hit more errors, since that last line of code, which tried to create an index object from the ChromaDB data, threw the batch limit errors.
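For completeness, a sketch of the reload-and-query step described above, assuming the same hypothetical path and collection name as the earlier sketch, and that the embedding model configured for querying matches the one used when the data was first indexed:

Plain Text
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Re-open the collection that was persisted to disk earlier
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Build the index directly on top of the existing vectors -- no re-embedding, no batch inserts
index = VectorStoreIndex.from_vector_store(vector_store)

query_engine = index.as_query_engine()
print(query_engine.query("your question here"))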