
Updated 2 years ago

Hey, I've built a chatbot using llama

Hey, I've built a chatbot using llama-index, but I feel there is a lot of latency before I get an answer, even when I only use a small vector index. Btw, I'm using the free Pinecone tier. Do you think that's the main cause?
12 comments
The latency is by far dominated by the time the LLM takes (and depending on your index/query setup, you might be making more than one LLM call)

Enabling streaming is the best way to make things feel faster.
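As a rough illustration, streaming can be turned on when building the query engine. This is a minimal sketch of the older llama_index query-engine API, assuming an existing `index`; exact method names can vary between versions, and the question string is just a placeholder.

Plain Text
# Sketch: stream tokens back as the LLM generates them, so the user sees output sooner
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("What does the refund policy say?")
streaming_response.print_response_stream()  # prints tokens as they arrive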
Even when I reduce the chunk size and the similarity_top_k, the answers are still slow for a chatbot. I've seen chatbot apps like Botsonic with very fast response times. Do you think they are built on LangChain or LlamaIndex?
So, if I move from LlamaIndex to pure LangChain, it won't change the latency at all?
Any chatbot is going to be limited by how long the LLM call(s) take. Even with llama-index, 99% of the runtime is spent calling the LLM. When using OpenAI, LLM calls can take varying amounts of time depending on their server load.

Moving from llama-index to purely langchain likely won't improve latency, at least in my opinion.
Enabling streaming, or running your own LLM on a server are probably the best options for increasing speed.

The text-generation-inference library from Hugging Face works really well with llama-index, assuming you have the resources to run the models: https://github.com/huggingface/text-generation-inference
Great, thank you so much for your detailed answer, I appreciate it ! 🙂
Hey Logan, I have an endpoint set up using text-generation-inference. Do you have any examples of making it work with llama-index? I've tried using the LangChain HuggingFaceTextGenInference class, but it doesn't work out of the box.
Why doesn't it work out of the box?

Should be something like this

Plain Text
from langchain.llms import HuggingFaceTextGenInference
from llama_index import LLMPredictor, ServiceContext

# Wrap the TGI endpoint with LangChain's LLM class, then hand it to llama_index
llm = HuggingFaceTextGenInference(...)
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
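As a rough illustration of how a service context like that is typically used downstream (a minimal sketch; the ./data folder and the query string are placeholders, and the exact index class name can vary by llama_index version):

Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Build an index with the TGI-backed service context, then query it
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
print(query_engine.query("How do I deploy this?"))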
I’m getting an authentication/max retries error. My code is almost exactly what you wrote. Also, I tested the endpoint directly in code using

Plain Text
from text_generation import Client
client = Client(…)
client.generate(…)

And it works fine. Thanks for your response. I’ll keep digging; I just wanted to make sure there wasn’t a public example out there I missed.
Did you also set up an embed model? Otherwise it will be pinging OpenAI to generate embeddings 🙂
Ah, that’s it. I’m actually passing embeddings to my Documents (using a homegrown embedding endpoint), but I forgot I also need an embed model at query time.
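For reference, a minimal sketch of pointing the service context at a non-OpenAI embed model so query-time embeddings stay local (the "local" shorthand downloads a default sentence-transformers model; whether it is available depends on your llama_index version):

Plain Text
# Sketch: use a local embedding model so query-time embeddings don't call OpenAI
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model="local",  # or pass your own embedding class / endpoint wrapper here
)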