Hey, I'm also using Azure OpenAI. There is asyncio support for many parts of the library, but a couple of other parts don't have it
You will have to go through and find the parts of the library that are sync code. Namely, anything that calls run_async_tasks
Hi, I am using ChatCompletion
I'm using the chatbot feature. If my company lets me open-source what I've done (50/50), I'll share that code. What I can say is you should wrap the sync parts of the code with threading
I am sending my context and instructions using it, and I have to send a lot of context.
That's where LlamaIndex is super important to reduce tokens
In terms of making things async, you should wrap the sync calls in to_thread
(py 3.9+) from asyncio. It will run the call in a different thread, preventing your main loop from being blocked and speeding up the program through concurrency
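For example, a minimal sketch of that wrapping (sync_llm_call is a hypothetical stand-in for whatever blocking call you have, not a real library function):

import asyncio

def sync_llm_call(prompt):
    # hypothetical blocking call into the sync parts of the library
    return "response for: " + prompt

async def main():
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker thread,
    # so the event loop stays free to do other work while waiting for the result
    result = await asyncio.to_thread(sync_llm_call, "your prompt here")
    print(result)

asyncio.run(main())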
So I have my own prompt. But I am doing some analysis of a whole document, which is quite big, so I am sending it part by part and then merging the analyses. It's a very inefficient method. Can I decrease it?
How can I use LlamaIndex to reduce tokens? Can you give me a hint?
So, LlamaIndex basically works by taking your document and splitting it up into chunks, which they call nodes. You can choose between 'stores' which are different ways of organizing and using these chunks.
The popular one is vector storage. Each chunk is turned into embeddings (vector representation), and then when you actually want to analyze a document
It embeds the keywords from your prompt and calculates the distance (across many dimensions) between your prompt and the chunks
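Roughly, the flow looks like this (a quick sketch assuming the older top-level llama_index imports; newer releases move these under llama_index.core, and the "data" folder is just an example path):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# load documents, split them into nodes (chunks), and embed each node
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# at query time, only the nodes closest to the question are sent to the LLM
query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about X?")
print(response)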
But for my analysis I will have to process every node.
If you need to process every single node, LlamaIndex likely won't be the tool for you (Logan may know more). It excels specifically because it only pulls the nodes relevant to your query, not the entire context.
This is to reduce the tokens.
If you are saying you need to put every single character into the LLM, tokens are proportional to characters, so you will not be able to reduce that.
Yes, that's the thing, I want to use all of the context. So I have to send it in parts, then do the analysis again, and it takes minutes for that.
Right right, so then LlamaIndex might not be the tool
Yes, I don't want to reduce tokens, I just want to send multiple requests so that it takes less time.
Asyncio is the tool for you
Yes, but I don't know how I can use it; I am quite new to asyncio.
If you can guide me, it would be a great help.
import asyncio

chunks = ["This is a", "sentence split", "into chunks"]
results = {}

async def call_to_llm(idx, chunk):
    result = ...  # your code to call the LLM goes here
    results[idx] = result

# grab the event loop and schedule one task per chunk
loop = asyncio.get_event_loop()
tasks = []
for idx, chunk in enumerate(chunks):
    tasks.append(loop.create_task(call_to_llm(idx, chunk)))
# run the loop until every task has finished before using `results`
loop.run_until_complete(asyncio.gather(*tasks))
Something like that is a quick pseudocode to accomplish what you are suggesting
When it comes to the asyncio part of things
You will want to grab the loop, via asyncio.get_event_loop()
You can then use the loop to create asynchronous tasks
These tasks run concurrently, which will speed up your calls the way you are hoping for
Tasks can be launched with loop.create_task(your_async_method(your_param)), and you can wait for all of them to finish with loop.run_until_complete(asyncio.gather(*tasks))
Thanks for the help let me try.
Hi @isaackogan, I wrote the code but I am getting a Timeout error when I process the original text:
Can you kindly check it? Thanks