
Hi all! I'm wondering why response_mode=tree_summarize is so fast, and came to this: https://github.com/jerryjliu/llama_index/blob/99f3127012368cac3450c10b6d30e7942ae1bccd/llama_index/response_synthesizers/tree_summarize.py#L129 Is it fast because it works on each text chunk in parallel? I see that use_async is False by default, so it shouldn't be (?)
Plain Text
Trace: query
    |_CBEventType.QUERY ->  2.771728 seconds
      |_CBEventType.RETRIEVE ->  0.004651 seconds
      |_CBEventType.SYNTHESIZE ->  2.766966 seconds
        |_CBEventType.LLM ->  2.526945 seconds
Ok, it's doing only 1 call to the LLM, so it's not doing the whole job.. 🤔
If all the text chunks fit into a single LLM call, then it will only take one LLM call to summarize
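Roughly, tree_summarize packs the chunks into as few prompts as will fit, summarizes each pack, and then keeps summarizing the summaries until one remains, so with a single pack it's a single LLM call. A rough sketch of the idea (not the actual library code; pack and llm_summarize are hypothetical helpers):

Plain Text
# Sketch of the tree_summarize idea, NOT the actual llama_index implementation.
# `pack(chunks)` groups chunks so each group fits in one prompt;
# `llm_summarize(group)` is one LLM call that returns a summary string.
def tree_summarize_sketch(chunks, pack, llm_summarize):
    # Keep collapsing until a single summary remains; with one pack this is one LLM call.
    while True:
        groups = pack(chunks)  # group chunks so each group fits the context window
        # One LLM call per group -- these are the calls that use_async=True can run concurrently.
        chunks = [llm_summarize(group) for group in groups]
        if len(chunks) == 1:
            return chunks[0]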
Hi! Yup, I noticed that. Unfortunately that's not the case here, i.e. the 'compact' mode takes about 6 min.
The text nodes are about 4k in length, so I think it's not able to compact very much.
It's like it's stopping at the first node :shrugs:
What version of llama index do you have?
I know there was a bug for this around 0.6.2X?
llama-index 0.6.30
Try updating to 0.6.37 maybe?
0.7.0 is out too, but has some big changes lol
Oh, there is 0.7 already...
Jeez, this moves fast. A month ago or so I played a little with 0.5, I think, and now it's 0.7 🙂
Haha we move fast! Many things on our todo list ✨️
How is release management done? Are todo lists written as GitHub issues?
Most of our planning is actually done internally.

We do have a public changelog for keeping track of changes in each release though
Might make our planning more public at some point, but tbh we mostly take it week by week for most things
Well, it's taking about 40s now with 0.7.0, and I see about 10 queries triggered to the LLM. Looks better now 🙂
That sounds like that bug I was talking about 👍
thanks for the tip!!
Before, it was truncating everything into one LLM call if there were fewer than 10 nodes (😅).

The summaries should include more key details now too, since it now summarizes using the full node contents properly 🙏
Hm.. so tree_summarize should be easily parallelizable, unlike refine, right? How do you use async?
Plain Text
async def errr():

    list_index = ListIndex(nodes=nodes)
    qengine = list_index.as_query_engine(
        text_qa_template=prompts.get('text_qa_template'),
        refine_template=prompts.get('refine_template'),
        verbose=True,
        use_async=True,
        response_mode='tree_summarize',  # default, compact, tree_summarize, accumulate, compact_accumulate
    )

    response = await qengine.aquery(question)
    print_response(response)

await errr()
Looks like, written like that, it does the requests sequentially as well.
Hmm 🤔 it should be working. I can debug quickly when I get back to my computer
Seems to be working for me 👀

Plain Text
(venv) loganmarkewich@Logans-MBP examples % python ./tree_example.py 
Sequential time:  98.3254189491272
Async time:  27.86028218269348
Plain Text
from llama_index import ListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()

index = ListIndex.from_documents(documents)

import time

seq_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=False)

async_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)

start = time.time()
response = seq_query_engine.query("Summarize this text")
end = time.time()
print("Seqential time: ", end-start)

start = time.time()
response = async_query_engine.query("Summarize this text")
end = time.time()
print("Async time: ", end-start)
Oh... well, thanks for checking! I'll re-test with your example as inspiration. First time I'm using async, so I'm probably messing things up somewhere.
Hi, I'm sorry for asking this newbie question..
Plain Text
async def hi():
    index = ListIndex.from_documents(documents)
    qengine = index.as_query_engine(response_mode="tree_summarize", use_async=True)
    return await qengine.query("Summarize this text")
resp = await hi()
I'm trying to run that in Jupyter.. and I get RuntimeError: asyncio.run() cannot be called from a running event loop.
How am I supposed to write that?
In a notebook, for async to work, run this first

Plain Text
import nest_asyncio

nest_asyncio.apply()
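With that applied, something along these lines should run in a notebook cell (a sketch based on the earlier example, using aquery for the awaited call; the data path is just the one from that example):

Plain Text
import nest_asyncio
nest_asyncio.apply()

from llama_index import ListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = ListIndex.from_documents(documents)
qengine = index.as_query_engine(response_mode="tree_summarize", use_async=True)

# Top-level await works in a notebook cell:
response = await qengine.aquery("Summarize this text")
print(response)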
However, I noticed something with this:
Plain Text
from llama_index import ListIndex, SimpleDirectoryReader
import time

import nest_asyncio
nest_asyncio.apply()

documents = SimpleDirectoryReader("./data").load_data()  # placeholder path; load your own documents here
index = ListIndex.from_documents(documents)
async def asddd():
    seq_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=False)
    async_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)

    start = time.time()
    response = seq_query_engine.query("Summarize this text")
    end = time.time()
    print("Seqential time: ", end-start)

    start = time.time()
    response = async_query_engine.query("Summarize this text")
    end = time.time()
    print("Async time: ", end-start)

await asddd()
Sequential time: 32.56361508369446
Async time: 30.354051113128662
Plain Text
print('Docs:', len(documents), len(documents[0].text))
Docs: 1 131150
what version of llama_index are you using?
Plain Text
$ pip list |grep llama
llama-index             0.7.0
Just upgraded to 0.7.4 though; same times.
Try using response = await async_query_engine.aquery("Summarize this text") ? 🤔

I just tested this last week and the speedup was noticeable
Yeah, I know! I couldn't get back to testing until now (sorry, I was offline). Must be something I'm doing wrong.
With await + aquery on the async_query_engine, I get about the same time as well (well, 28s instead of 30s).
Hmmm.. one thing is that if all the text in your index fits into 1 LLM call, it will only make one LLM call 🤔
(and that single call can still be slow)
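One way to check how many LLM calls are actually being made is the debug callback handler. A sketch, assuming the LlamaDebugHandler from llama_index.callbacks (double-check the imports against your version):

Plain Text
from llama_index import ListIndex, ServiceContext, SimpleDirectoryReader
from llama_index.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

# Record every LLM call as a CBEventType.LLM event.
debug_handler = LlamaDebugHandler(print_trace_on_end=True)
service_context = ServiceContext.from_defaults(callback_manager=CallbackManager([debug_handler]))

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = ListIndex.from_documents(documents, service_context=service_context)

response = index.as_query_engine(response_mode="tree_summarize", use_async=True).query("Summarize this text")
print("LLM calls:", len(debug_handler.get_event_pairs(CBEventType.LLM)))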
No idea why, but it's making a difference now :shrugs:
(thanks for the patience!)
Ok, I found something interesting..
Plain Text
Without set_global_service_context:
Sequential time:  65.2854950428009
Async time:  18.973870038986206
Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import ServiceContext, LLMPredictor
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
from llama_index import set_global_service_context
set_global_service_context(service_context)
Plain Text
With set_global_service_context:
Sequential time:  17.991682052612305
Async time:  17.761921167373657
Looks like ChatOpenAI uses gpt-3.5 by default, but llama_index uses OpenAI (davinci-003?) by default.
Or something like that.. the thing is, it seems to make a difference with parallel calls.
I'll play around with this in a bit 🤔 very sus
Couldn't build a test case =( And I noticed my billing got to 35 bucks today.. 😛
I got pretty consistent results for global vs. local service context here

Plain Text
Non-async Non-global:  18.473615884780884
async Non-global:  5.219250917434692
Non-async global:  17.510083198547363
async global:  7.159470796585083
Code:

Plain Text
import time
from llama_index import ListIndex, ServiceContext, SimpleDirectoryReader, set_global_service_context
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0))

documents = SimpleDirectoryReader("./data/paul_graham").load_data()

index = ListIndex.from_documents(documents, service_context=service_context)

start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=False).query("Summarize this text")
end = time.time()
print("Non-async Non-global: ", end-start)

start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=True).query("Summarize this text")
end = time.time()
print("async Non-global: ", end-start)

set_global_service_context(service_context)
start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=False).query("Summarize this text")
end = time.time()
print("Non-async global: ", end-start)

start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=True).query("Summarize this text")
end = time.time()
print("async global: ", end-start)
Ok, I think it's something about the prompting. I'm using custom Spanish prompts, and I noticed sometimes I get an 'I don't know' answer. So I think tree_summarize, for some reason I don't understand, is somehow missing some of the tree traversal because of those answers.
So that could explain why the sequential run is sometimes very fast: it's probably not traversing the whole tree...
Hmm, no, it should be hitting every spot of the tree. Although the "I don't know" response is pretty classic for gpt-3.5 😅
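If the Spanish prompts are the culprit, one thing worth trying is passing the summary prompt explicitly to the tree_summarize synthesizer instead of only text_qa_template/refine_template. A sketch, reusing your own prompts dict and index; the summary_template kwarg and imports are assumptions, so verify them against your installed version:

Plain Text
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

# Build the synthesizer directly so the custom (Spanish) prompt is applied to the
# summarization step itself. `summary_template` is assumed here; `prompts` is your
# own dict of prompt objects, and `index`/`question` come from the earlier snippets.
synth = get_response_synthesizer(
    response_mode="tree_summarize",
    summary_template=prompts.get("summary_template"),
    use_async=True,
)
qengine = RetrieverQueryEngine(retriever=index.as_retriever(), response_synthesizer=synth)
response = await qengine.aquery(question)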