Performance debugging and optimizations

Hello everyone, I'm excited to be here and learn from you all! My program is currently taking a long time to return answers, and I suspect the issue may be related to the node parsing speed, as I'm using Llama Parse. I was wondering if there are any other ways I can optimize the code to improve performance.
I don't have much on the optimization side at this point

What I would recommend is adding Langfuse. It's great at capturing the phases and their timings. You need visibility before you know where to spend effort optimizing.

https://docs.llamaindex.ai/en/stable/examples/callbacks/LangfuseCallbackHandler.html

I run it locally while I'm developing; they also have a cloud / managed version
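For reference, a minimal sketch of wiring it up, assuming the `langfuse` package and a llama-index version that uses the global `Settings` object (keys and host here are placeholders for a self-hosted instance):

```python
# Minimal sketch: route LlamaIndex callbacks through Langfuse.
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from langfuse.llama_index import LlamaIndexCallbackHandler

langfuse_handler = LlamaIndexCallbackHandler(
    public_key="pk-...",           # placeholder keys
    secret_key="sk-...",
    host="http://localhost:3000",  # local / self-hosted; omit for the managed cloud
)
Settings.callback_manager = CallbackManager([langfuse_handler])
```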
Looking closer, what I would do is break your code into 2 files.
  1. Parse & build vector db
  2. run the queries
It looks like every time you want to ask a question, you have to redo step 1 (see the sketch below)
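Something like this split, as a sketch (the file names, data directory, and reader are assumptions; swap in LlamaParse for whatever loader you already use):

```python
# build_index.py -- step 1: parse documents and persist the vector index (run once)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
```

```python
# query.py -- step 2: load the persisted index and answer questions
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
print(query_engine.query("What does the document say about X?"))
```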
You make a good point! I came across a post on Langfuse earlier that seemed really interesting.
I like it a lot
Perfect, thanks for your recommendation btw!
you're welcome
Have you tried Flowise AI before? I hope they will integrate langfuse soon
I'm pretty new to getting back into AI, so learning all this stuff around LLMs
nope, never heard of it
It lets you build end-to-end LLM apps with a simple drag-and-drop interface
no code solution
my vim motions are faster than DnD 🀣
The main problem with that class of solutions is customization; you always end up having to drop into real code
This is the only true nocode platform out there: https://github.com/kelseyhightower/nocode
I just got my RAG + Chat working, so now I have to deploy qdrant & langfuse (devops guys always be self-hosting 🀣 )
@Nam Tran I think this code will create the index on every user message? You probably only want to create it once right?
that's correct, I am reworking it!
I tried to isolate the index creation but it still did not seem to work. Could you please have a look and tell me what I was doing wrong?
I would have used st.session_state to store the query engine πŸ‘€
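Roughly like this, as a sketch (`build_query_engine()` here is a stand-in for whatever already builds your index and engine):

```python
# Sketch: build the query engine once per Streamlit session, not on every message.
import streamlit as st

if "query_engine" not in st.session_state:
    # the expensive part (parse + embed + index) runs only once per session
    st.session_state.query_engine = build_query_engine()

question = st.chat_input("Ask a question")
if question:
    response = st.session_state.query_engine.query(question)
    st.write(str(response))
```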
Thanks, I will give it a try
@Logan M It worked! Thank you for your help!
@verdverm I keep getting an issue when integrating langfuse to my app. I was wondering if you have ever encountered this
ERROR:langfuse:An error occurred in _handle_LLM_events: 'NoneType' object has no attribute 'generation'
Traceback (most recent call last):
  File "/home/adminuser/venv/lib/python3.10/site-packages/langfuse/utils/error_logging.py", line 14, in wrapper
    return func(*args, **kwargs)
  File "/home/adminuser/venv/lib/python3.10/site-packages/langfuse/llama_index/llama_index.py", line 352, in _handle_LLM_events
    generation = parent.generation(
here is the code
Can you rename that file to have a .py ending? (should code highlight)
I haven't seen this exact error, but it seems like parent is None
I saw something like this in VS Code when I didn't use my direct calls to langfuse correctly
If you aren't making any calls yourself, this probably gets filed under the bug department
I tried to reach out to the langfuse team for help but never got any response
One thing I did was pass callback_manager=Settings.callback_manager in a number of places, like when you create the llm; there's an extra keyword arg for it
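For example, a sketch of what I mean (OpenAI is just a stand-in for whatever LLM you use, `documents` is assumed to be loaded already, and exactly which constructors accept the kwarg can vary by llama-index version):

```python
# Sketch: pass the shared callback manager explicitly so those spans show up in Langfuse.
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", callback_manager=Settings.callback_manager)
index = VectorStoreIndex.from_documents(
    documents,  # assumed loaded earlier
    callback_manager=Settings.callback_manager,
)
```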
also, I don't think it is tracing the cost correctly
I suspect this is more likely a llama-index bug, Langfuse support is lacking...
Right, so adding the callback_manager keyword arg in more places definitely improves the coverage
I'm still not seeing costs when I call VertexAI, was planning to file a bug at some point
Alright, thank you for your help! Will keep you updated if I hear back from them!
Here are a few places where I put the extra kwarg; there are more
Pow! just found another from this convo, so I owe you a bit for the indirect help :]
my llm line was also missing it
does the bge_large_en perform much better than the small one?
I'm not sure, but jinaai was terrible
I saw some benchmarks that it was supposed to be SoTA with reranking, but the baseline was trash compared to the published numbers
not sure if I'm holding it wrong, I also saw some people saying they were getting different results from local vs cloud and the Jina team was working to correct that
I went with small because I'm deploying it to the cloud for the first time and...
  1. anecdotally I didn't see a meaningful difference
  2. err on the small side
How long does it take for your model to return an answer? Mine takes at least 15-20 seconds
I'm planning to set up an evaluation pipeline so I can answer "is A better than B?" for all the components
Which model? The embedding is typically very fast, Vertex / OpenAI take some time
The model and input/output size both impact timings
When I send a query to my RAG model, it takes at least 15 seconds before it starts generating anything
I am now able to see most of it in Langfuse! So I can start to get a better answer to this
So I think there may be more than one call to the LLMs, depending on your setup (I haven't looked closely enough at yours)

Getting that callback_manager in all the places will help uncover this sort of thing
not sure if you have seen this, but I am planning to apply TruEra to my models for comparison purposes
seems like a good one to me
I have not, but will probably check it out!

there are two sides to this, which makes it different from normal programming
  1. DevOps / LogMon type visibility, which is how I am holding Langfuse (and maybe UpTrain)
  2. Quality control, unique to LLM systems because there is fuzziness to it that SQL DBs don't experience
@Nam Tran I figured out more about langfuse, cleaned up my calling methods, got pricing working for the whole thing
[Attachment: Screenshot_2024-03-21_at_3.28.43_AM.png]
And a second message that skips the RAG step
[Attachment: Screenshot_2024-03-21_at_3.39.45_AM.png]
@verdverm awesome! I will look closer into the set up later. Glad to know that it worked!
I may pull apart the embedding / lookup phase, as I expect to hit similar issues with cost calculation (note: you have to set total_cost manually; the comment in the Langfuse docs is wrong)
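What I mean by setting it manually, as a rough sketch (assumption: these are the usage fields the v2 Langfuse Python SDK accepts, so double-check the exact keys against the SDK version you're on):

```python
# Sketch: report cost explicitly on a Langfuse generation instead of relying on model-based pricing.
from langfuse import Langfuse

langfuse = Langfuse()  # keys / host read from environment variables
trace = langfuse.trace(name="rag-query")
trace.generation(
    name="llm-call",
    model="my-model",          # a model Langfuse can't price automatically
    usage={
        "input": 1200,
        "output": 250,
        "unit": "TOKENS",
        "total_cost": 0.0042,  # set manually, not derived from the token counts
    },
)
```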
@verdverm To follow up on the issue I mentioned earlier, this is what Langfuse relied me
"Llamaindex has issues with thwir concurrency model. If you execute multiple API requests at the same time, you run into this issue. I would advise you to add some sort of locking so that only one API request can be executed at a time"
Do you have a link to what they relayed* to you?

(*relayed, not relied, is the word I think you were after)
I sent a text message to them on their web page
are you calling .flush() yourself?
this is for the code that I showed you at the beginning
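For reference, flushing is just a call on the handler before a short-lived process exits; a sketch using the handler from the setup above:

```python
# Sketch: push any buffered Langfuse events before the run ends.
langfuse_handler.flush()
```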