Hey LlamaIndex Team,
I wanted to bring up a concern I've noticed while working with the OpenAI integration in LlamaIndex.
When I load the index in my local environment and run a query, the system makes multiple calls to the OpenAI chat endpoint. As a result, more OpenAI requests are billed than intended (code snippet attached below). This raises concerns about unnecessary costs incurred by these extra calls, which I believe is unintentional.
from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.llms import OpenAI
# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./storage")
# load index
index = load_index_from_storage(storage_context)
llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
service_context = ServiceContext.from_defaults(llm=llm)
print("querying the index.")
query_engine = index.as_query_engine(service_context=service_context)
res = query_engine.query("What did the author do after his time at Y Combinator?")
print(res)
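For context on where the extra calls may be coming from: as far as I understand, the default response synthesizer refines its answer across the retrieved chunks, issuing one chat call per chunk. The sketch below is a hypothetical, pure-Python illustration of that pattern (it is not LlamaIndex's actual implementation; `refine_synthesize` and the stub LLM are my own names):

```python
# Hypothetical sketch of a "refine"-style response synthesizer:
# one LLM call per retrieved chunk, each call refining the running answer.
# This is NOT LlamaIndex's real code, just an illustration of the call pattern.
def refine_synthesize(chunks, llm_call):
    """Synthesize an answer over chunks, calling the LLM once per chunk."""
    answer = None
    calls = 0
    for chunk in chunks:
        if answer is None:
            prompt = f"Answer the question using this context:\n{chunk}"
        else:
            prompt = f"Refine the answer '{answer}' using this context:\n{chunk}"
        answer = llm_call(prompt)  # one chat request per chunk
        calls += 1
    return answer, calls

# Stub LLM so the sketch runs without an API key.
fake_llm = lambda prompt: "answer"
_, n_calls = refine_synthesize(["chunk A", "chunk B", "chunk C"], fake_llm)
print(n_calls)  # one call per retrieved chunk -> 3
```

If this is indeed the mechanism, retrieving the default number of chunks would already explain seeing several chat calls for a single query.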
I wanted to bring this to your attention because it affects the cost-efficiency of using LlamaIndex in my workflow. I would greatly appreciate it if you could look into this and potentially optimize how the OpenAI integration is used so these extra calls are avoided.
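To quantify the problem on my side, I've been counting calls with a simple generic wrapper around the LLM callable. This is just my own debugging helper, not a LlamaIndex API:

```python
import functools

def count_calls(fn):
    """Wrap any callable and record how many times it is invoked."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# Demo with a stub LLM; in practice the wrapped function would be the
# actual chat call, so wrapper.calls shows the requests per query.
fake_llm = count_calls(lambda prompt: "answer")
for prompt in ["q1", "q2", "q3"]:
    fake_llm(prompt)
print(fake_llm.calls)  # 3
```

With something like this attached to the chat method, I consistently see more than one request per query.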
I'm happy to help resolve this issue. If you could provide the necessary context, I'd be glad to help identify the root cause and work on a solution.