Updated 4 months ago

Hi, I have indexed a large amount of data

At a glance

The community member has indexed a large amount of data with llama-index and persisted it to disk, but the default_vector_store.json file has grown beyond 6GB, and loading it into a storage_context to create a new VectorStoreIndex for querying takes over half an hour. The community members suggest migrating the data to an actual vector database such as Faiss, Qdrant, or Chroma to improve performance.

The community members provide code to extract the nodes and embeddings from the llama-index index and add them to a Chroma vector store in batches so the maximum batch size is not exceeded. The community member still hits the batch-size error when trying to build a new VectorStoreIndex on top of the Chroma data; the community members point out that this step is no longer needed, and the Chroma vector store can be queried directly via VectorStoreIndex.from_vector_store().

Hi, I have indexed a large amount of data with llama-index and stored it to disk. The problem is that the default_vector_store.json file is bigger than 6GB. Now, every time I want to load the data into a storage_context and create a new VectorStoreIndex to make queries, it takes more than half an hour just to load it. Any ideas?
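For context, the slow load path being described is roughly the following sketch; the ./storage persist directory and the default LLM/embedding settings are assumptions, not something stated in the thread:
Plain Text
from llama_index import StorageContext, load_index_from_storage

# Rebuild the storage context from the persisted JSON files -- this is the
# step that has to read the 6GB+ default_vector_store.json from disk.
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# Recreate the index from the persisted docstore and vector store
index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("...")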
20 comments
I would migrate to an actual vector db (faiss, qdrant, chroma, etc.)

(And I would have migrated once it got to the 1GB size πŸ˜… )

To migrate, you need to get the nodes + vectors out and then re-create using the vector db of your choice

Plain Text
from llama_index import StorageContext, VectorStoreIndex

vector_store = <My new vectordb integration>

# pull the nodes and their embeddings out of the existing index
nodes_dict = index.docstore.docs
nodes_with_embeddings = []
for id_, vector in index.vector_store._data.embedding_dict.items():
    node = nodes_dict[id_]
    node.embedding = vector
    nodes_with_embeddings.append(node)

# re-create the index on top of the new vector store
new_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
)
Thanks a lot
I have implemented your code to migrate my indexed data to ChromaDB, but when creating new_index to store the data in ChromaDB, I got this error:
Traceback (most recent call last):
  .........
  File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\segment.py", line 361, in _add
    validate_batch(
  File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\types.py", line 505, in validate_batch
    raise ValueError(
ValueError: Batch size 41665 exceeds maximum batch size 5461
ah yea, that might happen with a lot of data lol
Plain Text
from llama_index import VectorStoreIndex

vector_store = <My new vectordb integration>

nodes_dict = index.docstore.docs
nodes_with_embeddings = []
for id_, vector in index.vector_store._data.embedding_dict.items():
    node = nodes_dict[id_]
    node.embedding = vector
    nodes_with_embeddings.append(node)

# batch add, to stay under chroma's max batch size
batch_size = 5000
for batch_idx in range(0, len(nodes_with_embeddings), batch_size):
    vector_store.add(nodes_with_embeddings[batch_idx:batch_idx + batch_size])

# then to use your vector store in an index
index = VectorStoreIndex.from_vector_store(vector_store)
Does that mean I have to store my data in batches of 5000 nodes?
yea, just inserting in batches (I picked 5000 as a round number, it seems chromadb has a max batch size of 5461)
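If you'd rather not hard-code the number, here is a small sketch that asks the client for its limit, assuming chroma_client is the chromadb client you created the collection with; whether the max_batch_size attribute exists depends on your chromadb version, so this is an assumption with a fallback:
Plain Text
# Use chroma's reported batch limit if the client exposes it
# (version dependent -- assumption), otherwise fall back to a round number.
batch_size = getattr(chroma_client, "max_batch_size", 5000)

for batch_idx in range(0, len(nodes_with_embeddings), batch_size):
    vector_store.add(nodes_with_embeddings[batch_idx:batch_idx + batch_size])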
I still get the same error. I am now adding batches of 5000 nodes at a time to the Chroma vector_store, but when that process is finished and it tries to build the new new_index VectorStoreIndex based on the ChromaDB data, it throws this batching limit error:
File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 255, in build_index_from_nodes return self._build_index_from_nodes(nodes, **insert_kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 236, in _build_index_from_nodes self._add_nodes_to_index( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 190, in _add_nodes_to_index new_ids = self._vector_store.add(nodes, **insert_kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\vector_stores\chroma.py", line 243, in add self._collection.add( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\models\Collection.py", line 168, in add self._client._add(ids, self.id, embeddings, metadatas, documents, uris) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\telemetry\opentelemetry\__init__.py", line 127, in wrapper return f(*args, **kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\segment.py", line 361, in _add validate_batch( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\types.py", line 505, in validate_batch raise ValueError( ValueError: Batch size 41665 exceeds maximum batch size 5461
Any new ideas?
Which line of code throws that error? You should just need to do index = VectorStoreIndex.from_vector_store(vector_store)
exactly this line of code throws the error:
index = VectorStoreIndex.from_vector_store(vector_store)
The ChromaDB vector_store is written to disk before this index-creating line.
are you sure? The traceback doesn't make sense for that line of code
The chroma.sqlite3 file on my disk is now bigger than 5GB
from_vector_store() will never get into _add_nodes_to_index -- there's no nodes to add πŸ˜…
Sorry, this is the line of code that causes the trouble:
new_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=StorageContext.from_defaults(vector_store=chromadb_vs),
)
right -- you don't need that line of code anymore, if you already did vector_store.add() in batches
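For the later load-and-query step, a minimal sketch, assuming the collection was created with chromadb's PersistentClient; the "./chroma_db" path and "my_collection" name are placeholders for whatever you actually used, and the same embedding model as at indexing time is assumed:
Plain Text
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Reconnect to the already-populated collection (placeholder path/name)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("my_collection")
chromadb_vs = ChromaVectorStore(chroma_collection=collection)

# No nodes are inserted here, so the max-batch-size check is never hit
index = VectorStoreIndex.from_vector_store(chromadb_vs)

query_engine = index.as_query_engine()
response = query_engine.query("...")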
Ok, thanks. Now I will try to load and make some queries based on this ChromaDB data, but I fear I will run into some errors, since the last line of code that tried to create an index object based on the ChromaDB data threw those batch limit errors.