
Hi all! I'm wondering why response_mode=tree_summarize is so fast, and came to this: https://github.com/jerryjliu/llama_index/blob/99f3127012368cac3450c10b6d30e7942ae1bccd/llama_index/response_synthesizers/tree_summarize.py#L129 Is it fast because it works on each text chunk in parallel? I see that use_async is False by default, so it shouldn't be (?)
Plain Text
Trace: query
    |_CBEventType.QUERY ->  2.771728 seconds
      |_CBEventType.RETRIEVE ->  0.004651 seconds
      |_CBEventType.SYNTHESIZE ->  2.766966 seconds
        |_CBEventType.LLM ->  2.526945 seconds
Ok, it's doing only 1 call to the LLM, so it's not doing the whole job.. 🤔
If all the text chunks fit into a single LLM call, then it will only take one LLM call to summarize
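Roughly, tree_summarize packs the chunks into as few prompts as will fit, summarizes each pack, and then keeps summarizing the summaries until one remains, so with a single pack it's a single LLM call. A rough sketch of the idea (not the actual library code; pack and llm_summarize are hypothetical helpers):

Plain Text
# Sketch of the tree_summarize idea, NOT the actual llama_index implementation.
# `pack(chunks)` groups chunks so each group fits in one prompt;
# `llm_summarize(group)` is one LLM call that returns a summary string.
def tree_summarize_sketch(chunks, pack, llm_summarize):
    # Keep collapsing until a single summary remains; with one pack this is one LLM call.
    while True:
        groups = pack(chunks)  # group chunks so each group fits the context window
        # One LLM call per group -- these are the calls that use_async=True can run concurrently.
        chunks = [llm_summarize(group) for group in groups]
        if len(chunks) == 1:
            return chunks[0]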
Hi! Yup, I noticed that. Unfortunately that's not the case here, i.e. the 'compact' mode takes about 6 min.
The text nodes are about 4k in length, so I think it's not able to compact very much.
It's like it's stopping at the first node :shrugs:
What version of llama index do you have?
I know there was a bug for this around 0.6.2X?
llama-index 0.6.30
Try updating to 0.6.37 maybe?
0.7.0 is out too, but has some big changes lol
Oh, there is 0.7 already...
Jeez, this moves fast. A month ago or so I played a little with 0.5, I think, and now it's 0.7 🙂
Haha we move fast! Many things on our todo list ✨️
How is release management done? Are todo lists written as GitHub issues?
Most of our planning is actually done internally.

We do have a public changelog for keeping track of changes in each release though
Might make our planning more public at some point, but tbh we mostly take it week by week for most things
Well, it's taking about 40s now with 0.7.0, and I see about 10 queries triggered to the LLM. Looks better now 🙂
That sounds like that bug I was talking about 👍
thanks for the tip!!
Before, it was truncating everything into one LLM call if there were fewer than 10 nodes (😅).

The summaries should include more key details now too, since it now summarizes using the full node contents properly 🙏
Hm.. so tree_summarize should be easily parallelizable, unlike refine, right? How do you use async?
Plain Text
async def errr():

    list_index = ListIndex(nodes=nodes)
    qengine = list_index.as_query_engine(
        text_qa_template=prompts.get('text_qa_template'),
        refine_template=prompts.get('refine_template'),
        verbose=True,
        use_async=True,
        response_mode='tree_summarize',  # default, compact, tree_summarize, accumulate, compact_accumulate
    )

    response = await qengine.aquery(question)
    print_response(response)

await errr()
Looks like, written like that, it does the requests sequentially as well.
Hmm 🤔 it should be working. I can debug quickly when I get back to my computer
Seems to be working for me 👀

Plain Text
(venv) loganmarkewich@Logans-MBP examples % python ./tree_example.py 
Sequential time:  98.3254189491272
Async time:  27.86028218269348
Plain Text
from llama_index import ListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()

index = ListIndex.from_documents(documents)

import time

seq_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=False)

async_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)

start = time.time()
response = seq_query_engine.query("Summarize this text")
end = time.time()
print("Seqential time: ", end-start)

start = time.time()
response = async_query_engine.query("Summarize this text")
end = time.time()
print("Async time: ", end-start)
Oh... well, thanks for checking! I'll re-test with your example as inspiration. First time I'm using async, so I'm probably messing things up somewhere.
Hi, I'm sorry for asking this newbie question..
Plain Text
async def hi():
    index = ListIndex.from_documents(documents)
    qengine = index.as_query_engine(response_mode="tree_summarize", use_async=True)
    return await qengine.query("Summarize this text")
resp = await hi()
I'm trying to run that in Jupyter.. and I get RuntimeError: asyncio.run() cannot be called from a running event loop.
How am I supposed to write that?
In a notebook, for async to work, run this first

Plain Text
import nest_asyncio

nest_asyncio.apply()
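With that applied, something along these lines should run in a notebook cell (a sketch based on the earlier example, using aquery for the awaited call; the data path is just the one from that example):

Plain Text
import nest_asyncio
nest_asyncio.apply()

from llama_index import ListIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = ListIndex.from_documents(documents)
qengine = index.as_query_engine(response_mode="tree_summarize", use_async=True)

# Top-level await works in a notebook cell:
response = await qengine.aquery("Summarize this text")
print(response)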
However, I noticed something with this:
Plain Text
from llama_index import ListIndex, SimpleDirectoryReader
import time

import nest_asyncio
nest_asyncio.apply()

documents = SimpleDirectoryReader("./data").load_data()  # placeholder path; load your own documents here
index = ListIndex.from_documents(documents)
async def asddd():
    seq_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=False)
    async_query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)

    start = time.time()
    response = seq_query_engine.query("Summarize this text")
    end = time.time()
    print("Seqential time: ", end-start)

    start = time.time()
    response = async_query_engine.query("Summarize this text")
    end = time.time()
    print("Async time: ", end-start)

await asddd()
Sequential time: 32.56361508369446
Async time: 30.354051113128662
Plain Text
print('Docs:', len(documents), len(documents[0].text))
Docs: 1 131150
what version of llama_index are you using?
Plain Text
$ pip list |grep llama
llama-index             0.7.0
Just upgraded to 0.7.4 though; same times.
Try using response = await async_query_engine.aquery("Summarize this text") ? 🤔

I just tested this last week and the speedup was noticeable
Yeah, I know! I couldn't get back to testing until now (sorry, I was offline). Must be something I'm doing wrong.
With await + aquery on the async_query_engine, I get about the same time as well (well, 28s instead of 30s).
Hmmm.. one thing is that if all the text in your index fits into 1 LLM call, it will only make one LLM call 🤔
(and that single call can still be slow)
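One way to check how many LLM calls are actually being made is the debug callback handler. A sketch, assuming the LlamaDebugHandler from llama_index.callbacks (double-check the imports against your version):

Plain Text
from llama_index import ListIndex, ServiceContext, SimpleDirectoryReader
from llama_index.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

# Record every LLM call as a CBEventType.LLM event.
debug_handler = LlamaDebugHandler(print_trace_on_end=True)
service_context = ServiceContext.from_defaults(callback_manager=CallbackManager([debug_handler]))

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
index = ListIndex.from_documents(documents, service_context=service_context)

response = index.as_query_engine(response_mode="tree_summarize", use_async=True).query("Summarize this text")
print("LLM calls:", len(debug_handler.get_event_pairs(CBEventType.LLM)))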
No idea why, but it's making a difference now :shrugs:
(thanks for the patience!)
Ok, I found something interesting..
Plain Text
Without set_global_service_context:
Sequential time:  65.2854950428009
Async time:  18.973870038986206
Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import ServiceContext, LLMPredictor
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
from llama_index import set_global_service_context
set_global_service_context(service_context)
Plain Text
With set_global_service_context:
Sequential time:  17.991682052612305
Async time:  17.761921167373657
Looks like ChatOpenAI uses gpt-3.5 by default, but llama_index uses OpenAI (davinci-003?) by default.
Or something like that.. the thing is, it seems to make a difference with parallel calls.
I'll play around with this in a bit 🤔 very sus
Couldn't build a test case =( And I noticed my billing got to 35 bucks today.. 😛
I got pretty consistent results for global vs. local service context here

Plain Text
Non-async Non-global:  18.473615884780884
async Non-global:  5.219250917434692
Non-async global:  17.510083198547363
async global:  7.159470796585083
Code:

Plain Text
import time
from llama_index import ListIndex, ServiceContext, SimpleDirectoryReader, set_global_service_context
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0))

documents = SimpleDirectoryReader("./data/paul_graham").load_data()

index = ListIndex.from_documents(documents, service_context=service_context)

start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=False).query("Summarize this text")
end = time.time()
print("Non-async Non-global: ", end-start)

start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=True).query("Summarize this text")
end = time.time()
print("async Non-global: ", end-start)

set_global_service_context(service_context)
start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=False).query("Summarize this text")
end = time.time()
print("Non-async global: ", end-start)

start = time.time()
response = index.as_query_engine(response_mode="tree_summarize", use_async=True).query("Summarize this text")
end = time.time()
print("async global: ", end-start)
Ok, I think it's something about the prompting. I'm using custom Spanish prompts, and I noticed sometimes I get an 'I don't know' answer. So I think tree_summarize, for some reason I don't understand, is somehow missing some of the tree traversal because of those answers.
So that could explain why the sequential run is sometimes very fast: it's probably not traversing the whole tree...
Hmm, no, it should be hitting every spot of the tree. Although the "I don't know" response is pretty classic for gpt-3.5 😅
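If the Spanish prompts are the culprit, one thing worth trying is passing the summary prompt explicitly to the tree_summarize synthesizer instead of only text_qa_template/refine_template. A sketch, reusing your own prompts dict and index; the summary_template kwarg and imports are assumptions, so verify them against your installed version:

Plain Text
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

# Build the synthesizer directly so the custom (Spanish) prompt is applied to the
# summarization step itself. `summary_template` is assumed here; `prompts` is your
# own dict of prompt objects, and `index`/`question` come from the earlier snippets.
synth = get_response_synthesizer(
    response_mode="tree_summarize",
    summary_template=prompts.get("summary_template"),
    use_async=True,
)
qengine = RetrieverQueryEngine(retriever=index.as_retriever(), response_synthesizer=synth)
response = await qengine.aquery(question)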