MoekaChan
Joined December 5, 2024
Hello, I am using LlamaIndex with Ollama to build a chatbot that leverages our fine-tuned model using RAG and a custom vector database. I use bge_onnx for the embedding model and DuckDB for the database. Previously, the setup used an embedding model (~125MB) and a vector database (~1GB). In that configuration, the FaithfulnessEvaluator typically completed evaluations in about 2 seconds.

Recently, I switched to a new version of the bge_onnx embedding model (~2.2GB) and re-vectorized the database with DuckDB, bringing the database size to 1.75GB. After these updates, the FaithfulnessEvaluator takes more than 25 seconds for the first evaluation, while subsequent evaluations (2nd, 3rd, etc.) take only about 1 second.

Could you help me understand why the first evaluation is significantly slower after the updates and suggest ways to optimize the evaluation process?
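
A likely cause is one-time startup cost: the first evaluation has to load the ~2.2GB ONNX embedding model into memory and open the larger DuckDB file, while later evaluations reuse the already-loaded resources. A minimal warm-up sketch, assuming embed_model and index are the objects from the setup described above (the names are illustrative, not from the post):

# Run once at startup so the first real evaluation doesn't pay load costs.
_ = embed_model.get_text_embedding("warm-up")  # forces the ONNX weights to load
_ = index.as_retriever(similarity_top_k=1).retrieve("warm-up")  # opens DuckDB and touches the index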
1 comment
Happy new year. I recently fine-tuned a model and used Ollama to run it. It shows a welcome message in the Ollama terminal.
In my Python code, I use:

llm = Ollama(model="xxxx", request_timeout=60.0)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    llm=llm,
    memory=memory,
)
How can I get the welcome message the model generates when it starts?
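
The welcome message shown in the Ollama terminal is part of the interactive "ollama run" session; the LlamaIndex Ollama client only talks to the Ollama API and never sees it. A hedged workaround sketch, assuming you simply want the model to greet the user once at startup (the prompt text is illustrative):

# Ask the model for a greeting once, before entering the chat loop.
greeting = llm.complete("Introduce yourself briefly to the user.")
print(greeting.text)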
9 comments
Hi, I have a question about the chat store. I save the chat store as:

{"store": {"chat_history": [{"role": "user", "content": "which company create you?", "additional_kwargs": {}}, {"role": "assistant", "content": "I wasn't created by a specific company, but rather I am a product of Meta AI, a subsidiary of Meta Platforms, Inc.", "additional_kwargs": {}}, {"role": "user", "content": "Repeat the question I asked you", "additional_kwargs": {}}, {"role": "assistant", "content": "You asked: "Which company created you?" \n\n\nLet me know if you have any other questions!", "additional_kwargs": {}}]}, "class_name": "SimpleChatStore"}

What is "additional_kwargs" for? I want the chat store to contain the response time, token info, and source nodes. How can I do that? Is it possible to add that data to "additional_kwargs"?

Currently, I am using

self.memory = ChatMemoryBuffer.from_defaults(
    token_limit=3000,
    chat_store=self.chat_store,
)

self.chat_engine = self.index.as_chat_engine(
    chat_mode="context",
    llm=self.cur_lm,
    memory=self.memory,
)
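
"additional_kwargs" is a free-form dict carried on each ChatMessage, and SimpleChatStore serializes whatever you put in it, so custom metadata such as timing, token counts, or source node ids can ride along. A minimal sketch, assuming the self.memory from the code above (the key names and values are illustrative, not a built-in schema):

from llama_index.core.llms import ChatMessage, MessageRole

msg = ChatMessage(
    role=MessageRole.ASSISTANT,
    content="...model answer...",
    additional_kwargs={
        "response_time_s": 1.42,           # illustrative custom fields
        "token_count": 87,
        "source_node_ids": ["node-123"],
    },
)
self.memory.put(msg)  # persisted through the chat store when you save it

Note that the chat engine writes its own messages into memory, so to record per-response data you would append or update messages yourself after each chat() call rather than expecting the engine to fill these fields.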
10 comments
Hi, I have a few questions about using Ollama with llama_index.

If I am currently chatting with llama3.2 using:
llm = Ollama(model="llama3.2:latest")
and I want to switch to phi, should I do:
llm = Ollama(model="phi")?

If I want to continue the conversation with the previous llama3.2 instance after switching to phi, should I create two separate instances, one for llama3.2 and one for phi?

If I want to start a completely new chat with llama3.2, is it necessary to create a new instance for it?

If I have 5 different conversations (possibly using the same or different models), should I create 5 separate instances to manage them?

Thanks in advance for your help!
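
One hedged way to think about it, assuming the index object from your earlier setup: an Ollama object is a lightweight, stateless client for one model, while the conversation history lives in the memory attached to each chat engine. So you keep one llm per model and one memory (plus chat engine) per conversation; a brand-new chat needs a fresh memory, not a new llm. A sketch:

from llama_index.llms.ollama import Ollama
from llama_index.core.memory import ChatMemoryBuffer

llama = Ollama(model="llama3.2:latest")
phi = Ollama(model="phi")

# One memory per conversation; the llm objects can be reused freely.
memory_1 = ChatMemoryBuffer.from_defaults(token_limit=3000)
memory_2 = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_1 = index.as_chat_engine(chat_mode="context", llm=llama, memory=memory_1)
chat_2 = index.as_chat_engine(chat_mode="context", llm=phi, memory=memory_2)
# Five conversations would mean five memory objects (and chat engines),
# even if they share the same llm instances.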
11 comments