
LLM Calls

Hey LlamaIndex team,

I wanted to raise a concern I've noticed while working with the OpenAI integration in LlamaIndex.

When I load the index in my local environment and run queries, the system makes multiple calls to the OpenAI chat method, so I'm seeing more OpenAI calls than I expect (the code snippet I used is attached below). These extra calls incur unnecessary cost, which I believe is unintentional.

Plain Text
from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.llms import OpenAI

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# load index
index = load_index_from_storage(storage_context)

llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
service_context = ServiceContext.from_defaults(llm=llm)

print("querying the index.")

query_engine = index.as_query_engine(service_context=service_context)

res = query_engine.query("What did the author do after his time at Y Combinator?")

print(res)


I wanted to bring this to your attention because it affects the cost-efficiency of using LlamaIndex in my workflow. I would appreciate it if you could look into this and, if possible, optimize the way the OpenAI integration is used to avoid the extra calls.

I'm happy to help resolve this. If you can provide the necessary context, I'd be glad to help identify the root cause and work on a fix.
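
For reference, here is a rough sketch of how I'm counting the calls on my side, extending the snippet above with a token-counting callback (the TokenCountingHandler setup is my own addition, not part of the original example):

Plain Text
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# Track tokens (and individual LLM calls) made during a query.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)
callback_manager = CallbackManager([token_counter])

service_context = ServiceContext.from_defaults(
    llm=llm, callback_manager=callback_manager
)
query_engine = index.as_query_engine(service_context=service_context)
res = query_engine.query("What did the author do after his time at Y Combinator?")

# One entry per LLM call, so the length shows how many calls were made.
print(len(token_counter.llm_token_counts), "LLM calls")
print(token_counter.total_llm_token_count, "LLM tokens")
print(token_counter.total_embedding_token_count, "embedding tokens")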
6 comments
When you created the index, did you use the default chunk size, or something else?
I just tested the same code with default settings (and set logging to debug). One request was sent for embeddings, and another for chat completions:
Plain Text
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=69 request_id=69b8f0e0b2762189cfe32d31f898acdc response_code=200
...
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=887 request_id=e3960f2cb051d9b0bf974a9e8c3d2d0d response_code=200
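For reference, this is how I turned on the debug output above, using Python's standard logging (the openai client logs each request at DEBUG level):

Plain Text
import logging
import sys

# Route DEBUG-level logs, including the openai client's per-request
# messages, to stdout so every API call is visible.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)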
I apologize for omitting the context in my previous message. To add more detail: I installed the package locally with pip install -e . and ran the code from the main branch in the examples/paul_graham_essay directory, where the index was persisted locally using ./data as the source. The index was created with the default OpenAI settings.

My goal is to gain a deep understanding of how LlamaIndex works so I can integrate it smoothly into my workflow and contribute to the repository when the opportunity arises. During my exploration, however, I noticed multiple calls being made, which struck me as a bit unusual.
Yea, nothing unusual going on as far as I can tell:
  • when you create a vector index, it calls the embedding model
  • when you query, it calls the embedding model AND the LLM (and possibly calls the LLM multiple times, depending on chunk_size and similarity_top_k; with default settings the LLM is only called once, see the sketch below)
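A minimal sketch of where those two knobs live (the specific values are just for illustration):

Plain Text
# chunk_size is fixed when the index is built; smaller chunks mean more
# nodes, and retrieved nodes that overflow the context window can push
# the response synthesizer into multiple LLM calls (e.g. refine mode).
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)

# similarity_top_k controls how many nodes are retrieved per query.
query_engine = index.as_query_engine(
    service_context=service_context,
    similarity_top_k=2,
)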
Yes, it appears the behavior is related to the chunk_size and similarity_top_k parameters, which can cause multiple calls to the LLM. Your clarification has been immensely helpful. Thanks πŸ™‚