Are you using a CPU? It takes time on CPU
Also, is your embedding model hosted somewhere remotely or locally on your machine?
I'm a noob at this. I think I'm using the CPU. I'm just running the .ipynb on my Apple Silicon machine without additional setup. I tried to enable the GPU in Google Colab, but it told me something like "You are not using the GPU". As for the embedding model, I followed the basic example and did not specify any:
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
Already 340 min since I started the index creation
Ah yes CPU tends to take time
You can enable the GPU in Google Colab just by clicking Runtime and then Change runtime type
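Once you've switched the runtime, a quick sanity check in a cell (a minimal sketch, assuming PyTorch is installed in the Colab runtime, which it normally is):

import torch

# True means Python can actually see a CUDA GPU; False means you are still on CPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))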
Yeah, I tried this. But while it's running I get a popup like "You chose GPU but are not using it". So I guessed I need some more config in the code itself
How much time? And how much faster will the GPU be? I'm currently at 420 min
The embedding step uses the GPU if it's available, imo
In my application I feed 2000+ docs
It takes 1 hour 30 min to complete documents + embedding + index creation
Thx, I'll give it another try, cuz it's already 440 min on my CPU and I have no idea if it will finish before 2026 LOL
How can I specify an embedding model, and do I have to?
It's a pain that the progress bar shows nothing. It got to 100% in 5 minutes and it's already been running for 1.5 h on the Google Colab GPU
Can you print how many documents there are?
It's a single JSON, ~150 MB. It's a collection of short messages
print(len(documents))
also returns 1
Do print(documents[0].text)
You have the JSON file under the data folder? Right?
Right. print(documents[0].text) prints the contents of my JSON
I think so. It prints part of my JSON and then "Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings..."
It's a single JSON which contains short messages from a chat, like {date, name, text}[], something like that
What embedding model are you using?
Also, is the GPU active now?
Then it should not be a single document. Something is not right
I would suggest you iterate over the JSON list, create Document objects on your own, and then pass those to the index creation step
I did not specify it. I just use:
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)
I have it selected in the runtime settings, but sometimes I still get that popup
Ah I see, you have OpenAI, so it must be using OpenAI for embedding too
That is why it is saying you are not using GPU
Iterate over your JSON list
For each dict, make a Document object
I'm away from the machine, I'll share a small code sample on this shortly
I used OpenAI only later. Before creating the index I had not specified any embedding model, since I followed the basic quickstart. I use the LLM only for index.as_query_engine(llm=llm) later. I do not know what embedding model is used by default
"For each dict make a Document object": do you mean creating a separate JSON file in the filesystem for each dict? Or just manually making documents = a list of objects?
Ah, I see, documents is just a list. I thought it was some special data structure used by LlamaIndex. Then I don't need SimpleDirectoryReader("data").load_data(). I can just parse my JSON and create the list myself. I'll try this later
from llama_index.core import Document, VectorStoreIndex

documents = []
json_sample = [  # considering this is how your JSON looks
    {"date": "2024-01-01", "name": "Alice", "text": "hello"},
    {"date": "2024-01-02", "name": "Bob", "text": "hi there"},
]

for record in json_sample:
    formatted_text = f"Date: {record['date']}\nName: {record['name']}\nText: {record['text']}"
    documents.append(Document(text=formatted_text))

# Indexing step. Since you have an OpenAI key and have not defined your own
# embedding model, it will default to the OpenAI embedding model.
index = VectorStoreIndex.from_documents(documents)
This will call OpenAI for the embedding step. If you don't want to use OpenAI for embedding, then you can define a local embedding model, which will use the GPU on Colab
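For example, something along these lines (a sketch only; it assumes the llama-index-embeddings-huggingface package is installed, and the model name is just an example):

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local embedding model; it runs on the Colab GPU if torch can see one, otherwise on CPU
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# With Settings.embed_model set, index creation uses the local model instead of calling OpenAI
index = VectorStoreIndex.from_documents(documents, show_progress=True)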
Thx! But how can I specify the embedding model to be used in index = VectorStoreIndex.from_documents(documents)? I haven't seen an llm property here. And I don't think it uses OpenAI, since that is configured later in my code, after index creation (which has never succeeded yet)
Have you set the OpenAI key in your env?
no, just hardcoded it to test
Here is my code:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)
Is the hardcoded part above this or below this?
Below I have:
llm = OpenAI(model="gpt-4o-mini", api_key="blahblah")
query_engine = index.as_query_engine(llm=llm, response_mode="tree_summarize")
But I guess it doesn't matter since it was not executed
So I haven't specified any embedding model
And you have not set anything like openai.api_key?
only this: model="gpt-4o-mini", api_key="blahblah"
LlamaIndex first checks whether you have provided any model for embedding or not. If not, it will default to OpenAI
lol let's do this:
Do you want to use OpenAI for embedding?
I do not know what is better. BTW how does LlamaIndex use OpenAI without me providing an API key?
# Do this at the top
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

# Create the OpenAI embedding model
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# See if it is working or not
print(embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
))

# Use it globally so index creation picks it up
Settings.embed_model = embed_model
It should not!! It should raise an error, I'm not sure why it didn't
Ah I see! I read the docs again: I needed to set export OPENAI_API_KEY=XXXXX
I expected it to be explicitly set in code somewhere. Seems like this is the problem. Strange behavior of LlamaIndex, not throwing an error and just running the index creation task endlessly lol
@WhiteFang_Jr Will try this again tomorrow with the embedding model. Thank you for the help!
I made it work with a HuggingFace embedding and the OpenAI LLM. But out of the box the answers are much worse than I expected. I can't tell whether it hallucinates or overgeneralizes things. Also, no citations are provided
---
Yes it hallucinates, and it seems like it prioritizes its own knowledge over the provided data
You can customize the prompt that your query/chat engine is using
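Roughly like this (a sketch only; it assumes the default compact response mode, which uses the text_qa_template slot, while tree_summarize reads the summary_template slot instead):

from llama_index.core import PromptTemplate

# Hypothetical custom QA prompt that tells the model to stick to the retrieved chat messages
qa_prompt = PromptTemplate(
    "Answer the question using ONLY the chat messages below.\n"
    "If the messages do not contain the answer, say you don't know.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Question: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=10,
    text_qa_template=qa_prompt,
)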
how are you querying, can you show
I tried to follow the basic example from the LlamaIndex docs:
from llama_index.core.agent.workflow import AgentWorkflow

query_engine = index.as_query_engine(llm=llm, response_mode="tree_summarize", similarity_top_k=10)

async def search_documents(query: str) -> str:
    response = await query_engine.aquery(query)
    return str(response)

agent = AgentWorkflow.from_tools_or_functions(
    [search_documents],
    llm=llm,
    system_prompt="""You are a helpful assistant which can provide answers about immigration, based on chat messages""",
)

print(await agent.run(
    "What documents do I need to open a bank account?"
))
And your data contains an answer about bank accounts?
yes it has some messages regarding problems people faced and some solutions
But the chat prefers to answer generally, like "it depends on the bank, blah blah blah... might be that, might be those", the same useless crap I would get just using ChatGPT without any RAG ))
Can you check whether your method search_documents was called or not? If it was, check the source nodes it picked
Also, what embedding model are you using?
async def search_documents(query: str) -> str:
    response = await query_engine.aquery(query)
    print(response.source_nodes)
    return str(response)
[NodeWithScore(node=TextNode(id_='d0b5a2ef-3352-42d4-818e-e69eebf91c29', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='ef524e7e-e231-4b42-8093-31edf904b6ae', node_type='4', metadata={}, hash='359b6bbd5a6a023daed679b8c45a466aa0ae5b1dae2b7dc388cb04ebdc2ace51'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='efdf39c8-bab6-447c-9cb3-06ac115bb059', node_type='1', metadata={}, hash='c7d90b310b7972173803800b6234774ede6eb701a660510ac7ecd941ac415dab')}, metadata_template='{key}: {value}', metadata_separator='\n', text=' .... TONS of text there
scores are like
score=0.7137104147544358