
Updated last year

Embedding

hey, I have a big list of documents and I'm trying to run VectorStoreIndex.from_documents on it, but the embedding generation takes very long. How can I fix this? thanks
8 comments
thanks imma try it out
Increasing the batch size to 2048 sped things up a little, but async does not work. For more context: I'm using Flask with ChromaDB, reading a CSV file with the paged CSV loader and inserting those documents into Chroma. For 100k documents it's still taking well over 40 minutes, any help is appreciated!!
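As a side note on why the batch size matters so much here: each embedding batch is one API round trip, so the batch size directly determines how many requests 100k rows turn into. A minimal sketch in plain Python (no llama_index dependency; the batched helper is illustrative, not the library's internal code, and the default batch size of 10 is an assumption):

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = list(range(100_000))  # stand-in for 100k CSV rows

# assumed small default batch size of 10 -> 10,000 round trips
print(sum(1 for _ in batched(docs, 10)))    # 10000
# batch size 2048 -> 49 round trips
print(sum(1 for _ in batched(docs, 2048)))  # 49
```

Fewer round trips means less per-request overhead, but each request still runs sequentially unless async is also enabled.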
Did you pass use_async=True in VectorStoreIndex?
I think you'll have to create an instance of VectorStoreIndex with use_async=True.
Something like this

Plain Text
from llama_index import VectorStoreIndex
index = VectorStoreIndex(documents, use_async=True, storage_context=storage_context, show_progress=True)
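For intuition on why use_async helps: instead of waiting for each embedding batch to return before sending the next, the batches are dispatched concurrently, so total time is closer to one round trip than to the sum of all of them. A rough sketch of that idea with asyncio; embed_batch here is a hypothetical stand-in for the real embedding call, not llama_index API:

```python
import asyncio

async def embed_batch(batch):
    # hypothetical stand-in for one embedding API round trip
    await asyncio.sleep(0.01)  # simulated network latency
    return [[0.0] * 3 for _ in batch]  # fake 3-dim vectors

async def embed_all(batches):
    # dispatch all batches concurrently instead of one at a time
    return await asyncio.gather(*(embed_batch(b) for b in batches))

batches = [["doc"] * 100 for _ in range(20)]
vectors = asyncio.run(embed_all(batches))
print(len(vectors), len(vectors[0]))  # 20 100
```

With sequential calls this would take about 20 simulated round trips; with gather it takes roughly one, which is the effect use_async aims for against a real embedding endpoint.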
I specified the batch size as 2048 but I'm still getting AssertionError: The batch size should not be larger than 2048. The progress bar also changed to .../2, which I think means it's over 2048
Updating llama_index to a newer version got rid of AssertionError: The batch size should not be larger than 2048.
The embedding process is much faster now, thanks @WhiteFang_Jr @Logan M !