
Updated 2 years ago

Hey, I've built a chatbot using llama

Hey, I've built a chatbot using llama-index, but I feel there is a lot of latency before I get an answer, even when I only use a small vector index. Btw, I'm using the free Pinecone tier. Do you think that's the main cause?
12 comments
The latency is by far dominated by the time the LLM takes (and depending on your index/query setup, you might be making more than one LLM call)

Enabling streaming is the best way to make things feel faster.
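As a rough illustration, streaming can be turned on when building the query engine. This is a minimal sketch of the older llama_index query-engine API, assuming an existing `index`; exact method names can vary between versions, and the question string is just a placeholder.

Plain Text
# Sketch: stream tokens back as the LLM generates them, so the user sees output sooner
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("What does the refund policy say?")
streaming_response.print_response_stream()  # prints tokens as they arrive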
Even when I reduce the chunk size and the similarity_top_k, the answers are still slow for a chatbot. I've seen chatbot apps like Botsonic with very fast response times. Do you think they are built on LangChain or LlamaIndex?
So, if I move from LlamaIndex to pure LangChain, it won't change the latency at all?
Any chatbot is going to be limited by how long the LLM call(s) take. Even with llama-index, 99% of the runtime is spent calling the LLM. When using OpenAI, LLM calls can take varying amounts of time depending on their server load.

Moving from llama-index to purely langchain likely won't improve latency, at least in my opinion.
Enabling streaming, or running your own LLM on a server are probably the best options for increasing speed.

The text-generation-inference library from Hugging Face works really well with llama-index, assuming you have the resources to run the models: https://github.com/huggingface/text-generation-inference
Great, thank you so much for your detailed answer, I appreciate it ! 🙂
Hey Logan, I have an endpoint set up using text-generation-inference. Do you have any examples of making it work with llama-index? I've tried using the LangChain HuggingFaceTextGenInference class, but it doesn't work out of the box.
Why doesn't it work out of the box?

Should be something like this

Plain Text
from langchain.llms import HuggingFaceTextGenInference
from llama_index import LLMPredictor, ServiceContext

# Wrap the TGI endpoint with LangChain's LLM class, then hand it to llama_index
llm = HuggingFaceTextGenInference(...)
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
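As a rough illustration of how a service context like that is typically used downstream (a minimal sketch; the ./data folder and the query string are placeholders, and the exact index class name can vary by llama_index version):

Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Build an index with the TGI-backed service context, then query it
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
print(query_engine.query("How do I deploy this?"))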
I’m getting an authentication/max retries error. My code is almost exactly what you wrote. Also, I tested the endpoint directly in code using

Plain Text
from text_generation import Client
client = Client(…)
client.generate(…)

And it works fine. Thanks for your response. I’ll keep digging; I just wanted to make sure there wasn’t a public example out there I missed.
Did you also set up an embed model? Otherwise it will be pinging OpenAI to generate embeddings 🙂
Ah, that’s it. I’m actually passing embeddings to my Documents (using a homegrown embedding endpoint), but I forgot I also need an embed model at query time.
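For reference, a minimal sketch of pointing the service context at a non-OpenAI embed model so query-time embeddings stay local (the "local" shorthand downloads a default sentence-transformers model; whether it is available depends on your llama_index version):

Plain Text
# Sketch: use a local embedding model so query-time embeddings don't call OpenAI
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model="local",  # or pass your own embedding class / endpoint wrapper here
)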