Updated 2 months ago

I am not sure what the best way is to handle the number of tools when we have many more PDFs, say 100-200, in the case below:

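# Imports assume llama-index 0.10 or later (llama_index.core namespace).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Load every PDF in the directory.
input_dir = './data/netwrok_pdf/'
pdf_nw = SimpleDirectoryReader(input_dir=input_dir).load_data()

# Build the index and query engine.
vector_query_engine = VectorStoreIndex.from_documents(
    pdf_nw,
    use_async=True,
).as_query_engine()

# Wrap the query engine as a tool the router can select.
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context related to the network paper."
    ),
)

# Route each query to a tool chosen by the LLM selector.
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool],
    verbose=True,
)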

Now let's say I have 100 more PDFs. Should we create 100 separate tools, one for each PDF?
If yes, that does not seem to be scalable?
3 comments
Please stand by @Anuj as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
The LlamaIndex documentation suggests that as the number of documents scales up, the standard RAG stack (top-k retrieval + basic text splitting) doesn't perform well. In such a scenario, you may want to use structured information to help with more precise retrieval. This could be particularly useful if you have a question that's only relevant to a subset of your PDFs. Using structured information can ensure that the relevant PDFs get returned beyond raw embedding similarity with chunks [source].
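
To make the idea of structured retrieval concrete, here is a minimal sketch in plain Python. It is not LlamaIndex code: the `Chunk` class, the `retrieve` helper, the word-overlap score, and the `topic` / `file_name` metadata keys are all assumptions made up for illustration. It only shows the principle of narrowing the candidates by metadata before ranking them by similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a text chunk with attached metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(chunks, query, filters=None, top_k=2):
    """Keep chunks whose metadata matches every filter, then rank by score."""
    filters = filters or {}
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates, key=lambda c: overlap_score(query, c.text), reverse=True
    )
    return ranked[:top_k]


# Made-up sample data: three chunks from three hypothetical PDFs.
chunks = [
    Chunk("routing protocols compared for latency",
          {"file_name": "routing.pdf", "topic": "network"}),
    Chunk("latency of database queries under load",
          {"file_name": "db.pdf", "topic": "storage"}),
    Chunk("switch configuration and vlan setup",
          {"file_name": "switching.pdf", "topic": "network"}),
]

# The metadata filter restricts ranking to the "network" PDFs only.
hits = retrieve(chunks, "latency under load", filters={"topic": "network"})
```

Without the filter, the storage chunk would rank first on raw similarity alone; with it, only chunks from the network PDFs are considered. In a real setup the library's own metadata filtering would take the place of the toy `retrieve` helper.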

In the case of having multiple PDFs, you could consider creating a separate tool for each PDF as shown in the example from the LlamaIndex GitHub repository. Here, a dictionary of tools is created where each tool is associated with a specific document [source].

However, you're correct in noting that this approach may not scale well for a large number of PDFs. One possible solution could be to group related PDFs together and create a tool for each group, rather than for each individual PDF. This would reduce the number of tools you need to manage.
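
As a rough sketch of the grouping idea, the snippet below buckets PDF paths by their parent folder and produces one tool specification per group. The folder layout, the file names, and the `group_by_folder` and `tool_specs` helpers are hypothetical; it assumes related PDFs already live in the same subdirectory, which may not match your data.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_folder(pdf_paths):
    """Map each parent folder name to the sorted list of PDFs inside it."""
    groups = defaultdict(list)
    for path in pdf_paths:
        groups[PurePosixPath(path).parent.name].append(path)
    return {name: sorted(paths) for name, paths in sorted(groups.items())}


def tool_specs(groups):
    """Build one spec per group; a router would choose among these only."""
    return [
        {
            "name": f"{name}_tool",
            "description": (
                f"Useful for questions about the {name} papers "
                f"({len(paths)} PDFs)."
            ),
            "files": paths,
        }
        for name, paths in groups.items()
    ]


# Made-up example layout: five PDFs spread over three topic folders.
pdf_paths = [
    "data/network/routing.pdf",
    "data/network/switching.pdf",
    "data/security/firewalls.pdf",
    "data/storage/raid.pdf",
    "data/security/tls.pdf",
]

specs = tool_specs(group_by_folder(pdf_paths))
```

Here five PDFs collapse into three tools, so the selector has three descriptions to choose from instead of five. Each spec's `files` list could then be loaded with `SimpleDirectoryReader(input_files=...)` and wrapped with `QueryEngineTool.from_defaults` in the same way as the single tool in the question.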

Remember, the best approach will depend on your specific use case and the nature of your data. You may need to experiment with different strategies to find what works best for you.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai what will happen if I create only 1 tool ?