Are you using a CPU? It takes time on CPU
Also, is your embedding model hosted somewhere remotely or locally on your machine?
I'm a noob at this. I think I'm using the CPU. I'm just running the .ipynb on my Apple Silicon machine without additional setup. I tried to enable the GPU in Google Colab, but it told me something like "You are not using the GPU". As for the embedding model, I followed the basic example and did not specify any:
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
Already 340 min since I started the index creation
Ah yes CPU tends to take time
You can enable the GPU in Google Colab just by clicking Runtime and then Change runtime type
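Once you've switched the runtime, a quick sanity check in a cell (a minimal sketch, assuming PyTorch is installed in the Colab runtime, which it normally is):

import torch

# True means Python can actually see a CUDA GPU; False means you are still on CPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))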
Yeah, I tried this. But while it's running I get a popup like "You chose GPU but are not using it". So I guessed I need some more config in the code itself
How much time? And how much faster will the GPU be? I'm currently at 420 min
The embedding step uses the GPU if it's available, imo
In my application I feed 2000+ docs
It takes 1 hour 30 min to complete documents + embedding + index creation
Thx, I'll give it another try, cuz it's already 440 min on my CPU and I have no idea if it will finish before 2026 LOL
How can I specify an embedding model, and do I have to?
It's a pain that the progress bar shows nothing. It got to 100% in 5 minutes and it's already been running for 1.5 h on the Google Colab GPU
Can you print how many documents there are?
It's a single JSON, ~150 MB. It's a collection of short messages
print(len(documents))
also returns 1
Do print(documents[0].text)
You have the JSON file under the data folder? Right?
Right. print(documents[0].text) prints the contents of my JSON
I think so. It prints part of my JSON and then "Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings..."
It's a single JSON which contains short messages from a chat, like {date, name, text}[], something like that
What embedding model are you using?
Also, is the GPU active now?
Then it should not be a single document. Something is not right
I would suggest you iterate over the JSON list, create Document objects on your own, and then pass those to the index creation step
I did not specify it. I just use:
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)
I have it selected in the runtime settings, but sometimes I still get that popup
Ah I see, you have OpenAI, so it must be using OpenAI for embedding too
That is why it is saying you are not using GPU
Iterate over your JSON list
For each dict, make a Document object
I'm away from the machine, I'll share a small code sample on this shortly
I used OpenAI only later. Before creating the index I had not specified any embedding model, since I followed the basic quickstart. I use the LLM only for index.as_query_engine(llm=llm) later. I do not know what embedding model is used by default
"For each dict make a Document object": do you mean creating a separate JSON file in the filesystem for each dict? Or just manually making documents = a list of objects?
Ah, I see, documents is just a list. I thought it was some special data structure used by LlamaIndex. Then I don't need SimpleDirectoryReader("data").load_data(). I can just parse my JSON and create the list myself. I'll try this later
from llama_index.core import Document, VectorStoreIndex

documents = []
json_sample = [  # considering this is how your JSON looks
    {"date": "2024-01-01", "name": "Alice", "text": "hello"},
    {"date": "2024-01-02", "name": "Bob", "text": "hi there"},
]

for record in json_sample:
    formatted_text = f"Date: {record['date']}\nName: {record['name']}\nText: {record['text']}"
    documents.append(Document(text=formatted_text))

# Indexing step. Since you have an OpenAI key and have not defined your own
# embedding model, it will default to the OpenAI embedding model.
index = VectorStoreIndex.from_documents(documents)
This will call OpenAI for the embedding step. If you don't want to use OpenAI for embedding, then you can define a local embedding model, which will use the GPU on Colab
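For example, something along these lines (a sketch only; it assumes the llama-index-embeddings-huggingface package is installed, and the model name is just an example):

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local embedding model; it runs on the Colab GPU if torch can see one, otherwise on CPU
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# With Settings.embed_model set, index creation uses the local model instead of calling OpenAI
index = VectorStoreIndex.from_documents(documents, show_progress=True)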
Thx! But how can I specify the embedding model to be used in index = VectorStoreIndex.from_documents(documents)? I haven't seen an llm property here. And I don't think it uses OpenAI, since that is configured later in my code, after index creation (which has never succeeded yet)
Have you set the OpenAI key in your env?
no, just hardcoded it to test
Here is my code:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Is the hardcoded part above this or below this?
Below I have:
llm = OpenAI(model="gpt-4o-mini", api_key="blahblah")
query_engine = index.as_query_engine(llm=llm, response_mode="tree_summarize")
But I guess it doesn't matter since it was not executed
So I haven't specified any embedding model
And you have not set anything like openai.api_key?
only this: model="gpt-4o-mini", api_key="blahblah"
LlamaIndex first checks whether you have provided any model for embedding or not. If not, it will default to OpenAI
lol let's do this:
Do you want to use OpenAI for embedding?
I do not know what is better. BTW how does LlamaIndex use OpenAI without me providing an API key?
# Do this at the top
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

# Create the OpenAI embedding model
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# See if it is working or not
print(embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
))

# Use it globally so index creation picks it up
Settings.embed_model = embed_model
It should not!! It should raise an error, I'm not sure why it didn't
Ah I see! I read the docs again: I needed to set export OPENAI_API_KEY=XXXXX
I expected it to be explicitly set in code somewhere. Seems like this is the problem. Strange behavior of LlamaIndex, not throwing an error and just running the index creation task endlessly lol
@WhiteFang_Jr Will try this again tomorrow with the embedding model. Thank you for the help!
I made it work with a HuggingFace embedding and the OpenAI LLM. But out of the box the answers are much worse than I expected. I can't tell whether it hallucinates or overgeneralizes things. Also, no citations are provided
---
Yes it hallucinates, and it seems like it prioritizes its own knowledge over the provided data
You can customize the prompt that your query/chat engine is using
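Roughly like this (a sketch only; it assumes the default compact response mode, which uses the text_qa_template slot, while tree_summarize reads the summary_template slot instead):

from llama_index.core import PromptTemplate

# Hypothetical custom QA prompt that tells the model to stick to the retrieved chat messages
qa_prompt = PromptTemplate(
    "Answer the question using ONLY the chat messages below.\n"
    "If the messages do not contain the answer, say you don't know.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Question: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=10,
    text_qa_template=qa_prompt,
)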
how are you querying, can you show
I tried to follow the basic example from the LlamaIndex docs:
from llama_index.core.agent.workflow import AgentWorkflow

query_engine = index.as_query_engine(llm=llm, response_mode="tree_summarize", similarity_top_k=10)

async def search_documents(query: str) -> str:
    response = await query_engine.aquery(query)
    return str(response)

agent = AgentWorkflow.from_tools_or_functions(
    [search_documents],
    llm=llm,
    system_prompt="""You are a helpful assistant which can provide answers about immigration, based on chat messages""",
)

print(await agent.run(
    "What documents do I need to open a bank account?"
))
And your data contains an answer about bank accounts?
yes it has some messages regarding problems people faced and some solutions
But the chat prefers to answer generally, like "it depends on the bank, blah blah blah... might be that, might be those", the same useless crap I would get just using ChatGPT without any RAG ))
Can you check whether your method search_documents was called or not? If it was, check the source nodes it picked
Also, what embedding model are you using?
async def search_documents(query: str) -> str:
    response = await query_engine.aquery(query)
    print(response.source_nodes)
    return str(response)
[NodeWithScore(node=TextNode(id_='d0b5a2ef-3352-42d4-818e-e69eebf91c29', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='ef524e7e-e231-4b42-8093-31edf904b6ae', node_type='4', metadata={}, hash='359b6bbd5a6a023daed679b8c45a466aa0ae5b1dae2b7dc388cb04ebdc2ace51'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='efdf39c8-bab6-447c-9cb3-06ac115bb059', node_type='1', metadata={}, hash='c7d90b310b7972173803800b6234774ede6eb701a660510ac7ecd941ac415dab')}, metadata_template='{key}: {value}', metadata_separator='\n', text=' .... TONS of text there
scores are like
score=0.7137104147544358