Improving Inference Time for Llama Index with Parallel Query Engines

Hi,
I am a newbie to LlamaIndex. I want to do RAG over multiple PDF documents (10 to 100 docs). Is there a way I can create the query engines in parallel, and how can I improve the inference time?

Thanks
7 comments
Hey!
Since you are just starting out, I would suggest trying it on Google Colab. It gives you a free GPU, which helps cut down inference time.

Also, if the files are not changing, save the embeddings and load them from disk. That saves you from creating the embeddings again and again.
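For reference, a minimal sketch of that persist-and-reload pattern (assuming llama-index >= 0.10-style imports; ./pdfs and ./storage are placeholder paths):

```python
# Build the index once, persist it to disk, and reload it on later runs
# instead of re-embedding the PDFs every time.
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # placeholder location for the saved index

if not os.path.exists(PERSIST_DIR):
    # First run: read the PDFs, embed them, and save the index.
    documents = SimpleDirectoryReader("./pdfs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: load the stored index instead of recomputing embeddings.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
```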
Thanks, I have tried this.
My actual scenario is that I need to run a query against all the documents and build a map of the response for each document. Right now I am iterating over the list of query engines I created for each document and querying in a loop. Is there a way I can make this parallel?
FYI: I am using Ollama to serve the LLM.
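One way to make that loop concurrent is the async query API, roughly like the sketch below (query_engines is assumed to be a dict of {document_name: query_engine} built elsewhere; the actual speedup depends on the LLM integration supporting async calls and on how many parallel requests your Ollama server is configured to accept):

```python
# Run the same question against every per-document query engine concurrently
# and collect a {document_name: answer} map.
import asyncio


async def query_all(query_engines: dict, question: str) -> dict:
    async def run_one(name, engine):
        # aquery() is the async counterpart of query()
        response = await engine.aquery(question)
        return name, str(response)

    results = await asyncio.gather(
        *(run_one(name, engine) for name, engine in query_engines.items())
    )
    return dict(results)


# Usage (hypothetical question):
# answers = asyncio.run(query_all(query_engines, "What is the warranty period?"))
```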
Query engine list? Does that mean you create a query engine separately for each document?
Any specific reason to do this?
Yes, I am creating a separate query engine for each document. I am doing this to capture which document the response comes from. Is there a better way to do this?

Thanks
You can put that info in the metadata instead
Yes, as @Torsten said, you can add some metadata (maybe the file name, or anything else you want) to each document, and when you get the response you can check the source nodes to see which documents were used to form the answer.
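A minimal sketch of that single-index-plus-metadata approach (assuming llama-index >= 0.10-style imports; ./pdfs and the example question are placeholders):

```python
# One index over all PDFs; the file name is kept in each node's metadata,
# so the source document can be read back from the response's source nodes.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# SimpleDirectoryReader attaches file_name / file_path metadata by default;
# filename_as_id=True also makes the document IDs human-readable.
documents = SimpleDirectoryReader("./pdfs", filename_as_id=True).load_data()

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the warranty period?")
print(response)

# Each source node carries the metadata of the chunk it was retrieved from.
for node in response.source_nodes:
    print(node.metadata.get("file_name"), node.score)
```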
Ok will try that. Thanks🙌