Updated 4 months ago

Hi, I have indexed a large amount of data

At a glance

The community member has indexed a large amount of data with llama-index and persisted it to disk, but the default_vector_store.json file has grown beyond 6GB, and loading it into a storage_context to create a new VectorStoreIndex for querying takes over half an hour. The community members suggest migrating the data to an actual vector database such as Faiss, Qdrant, or Chroma to improve performance.

The community members provide code to extract the nodes and embeddings from the llama-index index and add them to a Chroma vector store in batches so the maximum batch size is not exceeded. The community member still hits the batch-size error when trying to build a new VectorStoreIndex on top of the Chroma data; the community members point out that this step is no longer needed, and the Chroma vector store can be queried directly via VectorStoreIndex.from_vector_store().

Hi, I have indexed a large amount of data with llama-index and stored it to disk. The problem is that the default_vector_store.json file is bigger than 6GB. Now, every time I want to load the data into a storage_context and create a new VectorStoreIndex to make queries, it takes more than half an hour just to load it. Any ideas?
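For context, the slow load path being described is roughly the following sketch; the ./storage persist directory and the default LLM/embedding settings are assumptions, not something stated in the thread:
Plain Text
from llama_index import StorageContext, load_index_from_storage

# Rebuild the storage context from the persisted JSON files -- this is the
# step that has to read the 6GB+ default_vector_store.json from disk.
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# Recreate the index from the persisted docstore and vector store
index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("...")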
20 comments
I would migrate to an actual vector db (faiss, qdrant, chroma, etc.)

(And I would have migrated once it got to the 1GB size πŸ˜… )

To migrate, you need to get the nodes + vectors out and then re-create using the vector db of your choice

Plain Text
from llama_index import StorageContext, VectorStoreIndex

vector_store = <My new vectordb integration>

# pull the nodes and their embeddings out of the existing index
nodes_dict = index.docstore.docs
nodes_with_embeddings = []
for id_, vector in index.vector_store._data.embedding_dict.items():
    node = nodes_dict[id_]
    node.embedding = vector
    nodes_with_embeddings.append(node)

# re-create the index on top of the new vector store
new_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
)
Thanks a lot
I have implemented your code to migrate my indexed data to ChromaDB, but when creating new_index to store the data in ChromaDB, I got this error:
Traceback (most recent call last):
  .........
  File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\segment.py", line 361, in _add
    validate_batch(
  File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\types.py", line 505, in validate_batch
    raise ValueError(
ValueError: Batch size 41665 exceeds maximum batch size 5461
ah yea, that might happen with a lot of data lol
Plain Text
from llama_index import VectorStoreIndex

vector_store = <My new vectordb integration>

nodes_dict = index.docstore.docs
nodes_with_embeddings = []
for id_, vector in index.vector_store._data.embedding_dict.items():
    node = nodes_dict[id_]
    node.embedding = vector
    nodes_with_embeddings.append(node)

# batch add, to stay under chroma's max batch size
batch_size = 5000
for batch_idx in range(0, len(nodes_with_embeddings), batch_size):
    vector_store.add(nodes_with_embeddings[batch_idx:batch_idx + batch_size])

# then to use your vector store in an index
index = VectorStoreIndex.from_vector_store(vector_store)
Does that mean I have to store my data in batches of 5000 nodes?
yea, just inserting in batches (I picked 5000 as a round number, it seems chromadb has a max batch size of 5461)
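If you'd rather not hard-code the number, here is a small sketch that asks the client for its limit, assuming chroma_client is the chromadb client you created the collection with; whether the max_batch_size attribute exists depends on your chromadb version, so this is an assumption with a fallback:
Plain Text
# Use chroma's reported batch limit if the client exposes it
# (version dependent -- assumption), otherwise fall back to a round number.
batch_size = getattr(chroma_client, "max_batch_size", 5000)

for batch_idx in range(0, len(nodes_with_embeddings), batch_size):
    vector_store.add(nodes_with_embeddings[batch_idx:batch_idx + batch_size])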
I still get the same error. I am now adding batches of 5000 nodes at a time to the Chroma vector_store, but when that process is finished and it tries to build the new new_index VectorStoreIndex based on the ChromaDB data, it throws this batching limit error:
File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 255, in build_index_from_nodes return self._build_index_from_nodes(nodes, **insert_kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 236, in _build_index_from_nodes self._add_nodes_to_index( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\indices\vector_store\base.py", line 190, in _add_nodes_to_index new_ids = self._vector_store.add(nodes, **insert_kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\llama_index\vector_stores\chroma.py", line 243, in add self._collection.add( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\models\Collection.py", line 168, in add self._client._add(ids, self.id, embeddings, metadatas, documents, uris) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\telemetry\opentelemetry\__init__.py", line 127, in wrapper return f(*args, **kwargs) File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\segment.py", line 361, in _add validate_batch( File "C:\Users\General\AppData\Roaming\Python\Python310\site-packages\chromadb\api\types.py", line 505, in validate_batch raise ValueError( ValueError: Batch size 41665 exceeds maximum batch size 5461
Any new ideas?
Which line of code throws that error? You should just need to do index = VectorStoreIndex.from_vector_store(vector_store)
exactly this line of code throws the error:
index = VectorStoreIndex.from_vector_store(vector_store)
The ChromaDB vector_store is written to disk before this index-creating line.
are you sure? The traceback doesn't make sense for that line of code
The chroma.sqlite3 file on my disk is now bigger than 5GB
from_vector_store() will never get into _add_nodes_to_index -- there's no nodes to add πŸ˜…
Sorry, this is the line of code that causes the trouble:
new_index = VectorStoreIndex(
    nodes_with_embeddings,
    storage_context=StorageContext.from_defaults(vector_store=chromadb_vs),
)
right -- you don't need that line of code anymore, if you already did vector_store.add() in batches
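For the later load-and-query step, a minimal sketch, assuming the collection was created with chromadb's PersistentClient; the "./chroma_db" path and "my_collection" name are placeholders for whatever you actually used, and the same embedding model as at indexing time is assumed:
Plain Text
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Reconnect to the already-populated collection (placeholder path/name)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("my_collection")
chromadb_vs = ChromaVectorStore(chroma_collection=collection)

# No nodes are inserted here, so the max-batch-size check is never hit
index = VectorStoreIndex.from_vector_store(chromadb_vs)

query_engine = index.as_query_engine()
response = query_engine.query("...")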
Ok, thanks. Now I will try to load and make some queries based on this ChromaDB data, but I fear I will run into some errors, since the last line of code that tried to create an index object based on the ChromaDB data threw those batch limit errors.