Improving Inference Time for Llama Index with Parallel Query Engines

Hi,
I am a newbie to LlamaIndex. I want to do RAG over multiple PDF documents (10 to 100 docs). Is there a way I can create the query engines in parallel, and how can I improve the inference time?

Thanks
7 comments
Hey!
Since you are just starting out, I would suggest trying it on Google Colab. It gives you a free GPU, which helps cut down inference time.

Also, if the files are not changing, save the embeddings and load them from disk. That saves you from creating the embeddings again and again.
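For reference, a minimal sketch of that persist-and-reload pattern (assuming llama-index >= 0.10-style imports; ./pdfs and ./storage are placeholder paths):

```python
# Build the index once, persist it to disk, and reload it on later runs
# instead of re-embedding the PDFs every time.
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # placeholder location for the saved index

if not os.path.exists(PERSIST_DIR):
    # First run: read the PDFs, embed them, and save the index.
    documents = SimpleDirectoryReader("./pdfs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: load the stored index instead of recomputing embeddings.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
```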
Thanks, I have tried this.
My actual scenario is that I need to run a query against all the documents and build a map of the response for each document. Right now I am iterating over the list of query engines I created for each document and querying in a loop. Is there a way I can make this parallel?
FYI: I am using Ollama to serve the LLM.
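One way to make that loop concurrent is the async query API, roughly like the sketch below (query_engines is assumed to be a dict of {document_name: query_engine} built elsewhere; the actual speedup depends on the LLM integration supporting async calls and on how many parallel requests your Ollama server is configured to accept):

```python
# Run the same question against every per-document query engine concurrently
# and collect a {document_name: answer} map.
import asyncio


async def query_all(query_engines: dict, question: str) -> dict:
    async def run_one(name, engine):
        # aquery() is the async counterpart of query()
        response = await engine.aquery(question)
        return name, str(response)

    results = await asyncio.gather(
        *(run_one(name, engine) for name, engine in query_engines.items())
    )
    return dict(results)


# Usage (hypothetical question):
# answers = asyncio.run(query_all(query_engines, "What is the warranty period?"))
```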
Query engine list? Does that mean you create a query engine separately for each document?
Any specific reason to do this?
Yes, I am creating a separate query engine for each document. I am doing this to capture which document the response comes from. Is there a better way to do this?

Thanks
You can put that info in the metadata instead
Yes, as @Torsten said, you can add some metadata (maybe the file name, or anything else you want) to each document, and when you get the response you can check the source nodes to see which documents were used to form the answer.
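A minimal sketch of that single-index-plus-metadata approach (assuming llama-index >= 0.10-style imports; ./pdfs and the example question are placeholders):

```python
# One index over all PDFs; the file name is kept in each node's metadata,
# so the source document can be read back from the response's source nodes.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# SimpleDirectoryReader attaches file_name / file_path metadata by default;
# filename_as_id=True also makes the document IDs human-readable.
documents = SimpleDirectoryReader("./pdfs", filename_as_id=True).load_data()

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the warranty period?")
print(response)

# Each source node carries the metadata of the chunk it was retrieved from.
for node in response.source_nodes:
    print(node.metadata.get("file_name"), node.score)
```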
Ok will try that. Thanks🙌