Looking closer, what I would do is break your code into 2 files.
- Parse & build vector db
- run the queries
It looks like every time you want to ask a question, you have to do step 1
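Something like this, as a rough sketch (the file names, ./data folder, and ./storage dir are just placeholders; assumes a recent llama-index with the core imports):
```python
# build_index.py -- run once (or whenever the source docs change)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# persist so the query script doesn't have to re-parse / re-embed every time
index.storage_context.persist(persist_dir="./storage")
```
and then the second file only loads and queries:
```python
# query.py -- load the persisted index and just run queries
from llama_index.core import StorageContext, load_index_from_storage

storage = StorageContext.from_defaults(persist_dir="./storage")
query_engine = load_index_from_storage(storage).as_query_engine()
print(query_engine.query("your question here"))
```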
You make a good point! I came across a post on Langfuse earlier that seemed really interesting.
Perfect, thanks for your recommendation btw!
Have you tried Flowise AI before? I hope they will integrate langfuse soon
I'm just getting back into AI, so I'm pretty new to all this stuff around LLMs
It lets you build end-to-end LLM applications with a simple drag-and-drop interface
my vim motions are faster than DnD 🤣
The main problem with that class of solutions is customization; you always have to drop into real code
I just got my RAG + Chat working, so now I have to deploy qdrant & langfuse (devops guys always be self-hosting 🤣 )
@Nam Tran I think this code will create the index on every user message? You probably only want to create it once right?
that's correct, I am reworking it!
I tried to isolate the index creation but it still did not seem to work. Could you please have a look and tell me what I was doing wrong?
I would have used st.session_state to store the query engine
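Something along these lines (rough sketch, assuming the index is already persisted to ./storage as above):
```python
import streamlit as st
from llama_index.core import StorageContext, load_index_from_storage

# build the query engine once per session, not on every rerun / user message
if "query_engine" not in st.session_state:
    storage = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage)
    st.session_state.query_engine = index.as_query_engine()

question = st.text_input("Ask a question")
if question:
    response = st.session_state.query_engine.query(question)
    st.write(str(response))
```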
Thanks, I will give it a try
@Logan M It worked! Thank you for your help!
@verdverm I keep getting an issue when integrating langfuse to my app. I was wondering if you have ever encountered this
ERROR:langfuse:An error occurred in _handle_LLM_events: 'NoneType' object has no attribute 'generation'
Traceback (most recent call last):
  File "/home/adminuser/venv/lib/python3.10/site-packages/langfuse/utils/error_logging.py", line 14, in wrapper
    return func(*args, **kwargs)
  File "/home/adminuser/venv/lib/python3.10/site-packages/langfuse/llama_index/llama_index.py", line 352, in _handle_LLM_events
    generation = parent.generation(
Can you rename that file to have a .py ending? (should code highlight)
I haven't seen this exact error, but it seems like parent is None
I saw something like this in VS Code when I didn't use my direct calls to langfuse correctly
If you aren't making any calls yourself, this probably gets filed under the bug department
I tried to reach out to the langfuse team for help but never got any response
One thing I did do was to add callback_manager = Settings.callback_manager in a number of places, like when you create the llm, as an extra keyword arg
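Roughly like this (a sketch, not my exact code; OpenAI here is just a stand-in for whichever LLM class you use, e.g. Vertex):
```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.llms.openai import OpenAI
from langfuse.llama_index import LlamaIndexCallbackHandler

# global handler; reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from env
langfuse_handler = LlamaIndexCallbackHandler()
Settings.callback_manager = CallbackManager([langfuse_handler])

# pass the same manager explicitly where components are constructed,
# e.g. on the LLM, to improve trace coverage
llm = OpenAI(model="gpt-3.5-turbo", callback_manager=Settings.callback_manager)
```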
also, I don't think it is tracing the cost correctly
I suspect this is more likely a llama-index bug, Langfuse support is lacking...
Right, so adding the callback_manager keyword arg in more places definitely improves the coverage
I'm still not seeing costs when I call VertexAI, was planning to file a bug at some point
Alright, thank you for your help! Will keep you updated if I hear back from them!
Here are a few places I put the extra kwarg; there are more
Pow! just found another from this convo, so I owe you a bit for the indirect help :]
my llm line was also missing it
does the bge_large_en perform much better than the small one?
I'm not sure, but jinaai was terrible
I saw some benchmarks that it was supposed to be SoTA with reranking, but the baseline was trash compared to the published numbers
not sure if I'm holding it wrong, I also saw some people saying they were getting different results from local vs cloud and the Jina team was working to correct that
I went with small because I'm deploying it to the cloud for the first time and...
- anecdotally I didn't see a meaningful difference
- err on the small side
How long does it take for your model to return the answer? mine would take at least 15-20 seconds
I'm planning to get an evaluation pipeline in place so I can answer "is A better than B?" for all the components
Which model? The embedding is typically very fast, Vertex / OpenAI take some time
The model and input/output size both impact timings
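A crude way to see where the time goes is to wrap each stage in a timer, e.g.:
```python
import time

t0 = time.perf_counter()
response = query_engine.query("your question here")  # query_engine from the earlier sketch
print(f"retrieval + generation took {time.perf_counter() - t0:.1f}s")
```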
When I send a query to my rag model to get an answer, it will take at least 15 seconds before generating anything
I am now able to see most of it in Langfuse! So I can start to get a better answer to this
So I think there may be more than one call to the LLMs, depending on your setup (I haven't looked close enough at yours)
Getting that callback_manager in all the places will help uncover this sort of thing
not sure if you have seen this, but I am planning to apply truera on my models for comparison purposes
seems like a good one to me
I have not, but will probably check it out!
there are two sides to this, which makes it different from normal programming
- DevOps / LogMon type visibility, which is how I am holding Langfuse (and maybe UpTrain)
- Quality control, unique to LLM systems because there is fuzziness to it that SQL DBs don't experience
@Nam Tran I figured out more about langfuse, cleaned up my calling methods, got pricing working for the whole thing
And a second message that skips the RAG step
@verdverm awesome! I will look closer into the set up later. Glad to know that it worked!
I may pull apart the embedding / lookup phase, as I expect to hit similar issues with cost calculation (note, you have to set total_cost manually; the comment in the langfuse docs is wrong)
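For reference, roughly what I mean by setting it manually (a sketch against the v2 Python SDK; the usage field names are my reading of its ModelUsage type, so double-check them against your SDK version):
```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* env vars
trace = langfuse.trace(name="rag-query")
trace.generation(
    name="embedding-lookup",
    model="text-embedding-ada-002",  # placeholder model name
    usage={
        "input": 512,
        "output": 0,
        "unit": "TOKENS",
        "total_cost": 0.0001,  # set manually; not inferred for custom models
    },
)
langfuse.flush()
```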
@verdverm To follow up on the issue I mentioned earlier, this is what Langfuse relied me
"Llamaindex has issues with thwir concurrency model. If you execute multiple API requests at the same time, you run into this issue. I would advise you to add some sort of locking so that only one API request can be executed at a time"
Do you have a link to what they relayed* to you?
(*relayed, not relied, is the word I think you were after)
I sent a text message to them on their web page
are you calling .flush() yourself?
this is for the code that I showed you at the beginning