
Updated 4 months ago

When I build an index on 1 pdf file I

When I build an index on 1 pdf file, I can request info from that file. I ask "What percent of revenue did [company] spend on Research and Development in 2018" and I get a response of "10%". However, when I create an index of 10 files including that 1 file, I get "There is no information provided about what percent of revenue [company] spent on research and development in 2018."

Is there a parameter or class I should be looking at when working with multiple files vs 1 file?
31 comments
i guess it's something to do with your indexing structure
which then depends on the usage you want
after that, ask logan, but idk if the technique is different with multiple documents
I assume that's the case too, but I'm just using the GPTVectorStoreIndex , which from those articles should maintain articles as nodes. It would seem there must be an issue with the top_k that is returned not having the relevant info in it.
Those articles are pretty rudimentary and don't explain much beyond doc->node and general aggregation of docs by similarity score.
Logan? Can you tag them here?
try maybe a keyword index ? that would not work for other prompts but if you need smth specific
i guess so, i dont do it cuz they must get soooo many pings
you want his @
Yea. There's 3 logans.
he's one of the guys working on the project and he's helping us soooooo much
I was thinking I was just missing something silly. It sounds like that might not be the case.
These PDFs have tables with this info in them if that elucidates something
I'm kind of a beginner here too so I might not know the solution
that's what i have understood so far
i'm juggling a lot of code trying to understand how everything works, since there aren't a lot of docs or examples on the internet for what i'm trying to create
@Orion Pax dates are a little tricky for embedding retrieval. Are you using default options right now? You could try increasing the top k

index.as_query_engine(similarity_top_k=3)

Usually separating these documents into "groups" helps too. A group could be a single document, or a collection of documents on a topic. Then you can create an index for each group and use a router query engine or a graph to send your query to the correct documents.
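To see why adding files can push the right chunk out of the top k, here is a toy sketch in plain Python. It is not how the library scores nodes: word overlap (Jaccard) stands in for embedding similarity, and every chunk text below is invented for illustration. The point is only that near-identical passages from other files (same wording, different year) can outrank the table row that actually holds the answer.

```python
import re

def tokens(text):
    # Lowercased alphanumeric words, as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, chunk):
    # Jaccard overlap: a crude stand-in for embedding similarity (assumption).
    q, c = tokens(query), tokens(chunk)
    return len(q & c) / len(q | c)

def top_k(query, chunks, k):
    # Rank all chunks by score and keep the k best.
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

query = "percent of revenue spent on research and development in 2018"

# Invented chunks: the answer sits in a terse table row.
target = "Research and development | 2018 | 10% of revenue"
single_file = [
    target,
    "Company overview and risk factors",
    "Revenue grew in 2018 on cloud sales",
]

# Invented chunks from the other files: similar wording, wrong year or topic.
other_files = [
    "Percent of revenue spent on research and development in 2017",
    "Percent of revenue spent on research and development in 2016",
    "Percent of revenue spent on marketing and sales in 2018",
]
all_files = single_file + other_files

print(target in top_k(query, single_file, 2))  # one file: answer is retrieved
print(target in top_k(query, all_files, 3))    # many files: answer is crowded out
print(target in top_k(query, all_files, 4))    # larger k: answer is back
```

In this toy setup the answer row ranks fourth once the other files are added, so it only comes back when k reaches 4. Real embeddings behave differently in detail, but the crowding effect is the same idea.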
Yea. Default options
Had to increase it to 4 and that worked
Although, it gave 11% instead of 10
Which is interesting. I wonder if one of the docs rounded up
Makes sense!

If you are interested as well, here's a page on the router query engine, maybe it's helpful. For example, I might include a keyword index tool for queries that mention specific dates? Idk, takes some messing around sometimes lol

https://gpt-index.readthedocs.io/en/latest/examples/query_engine/RouterQueryEngine.html#router-query-engine
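The grouping idea can be sketched without the library: keep one small index per group, pick a group first, then search only inside it. This is a toy illustration, not the router query engine's implementation. The group names, descriptions, company names ("Acme", "Globex") and chunk texts are all hypothetical, and the routing rule here is plain word overlap against a hand-written description, which is an assumption made to keep the example runnable.

```python
import re

def tokens(text):
    # Lowercased alphanumeric words, as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical groups: one per filing, each with a short description.
GROUPS = {
    "acme_10k_2018": {
        "description": "Acme annual report 2018 revenue expenses research development",
        "chunks": [
            "Research and development | 2018 | 10% of revenue",
            "Acme risk factors",
        ],
    },
    "globex_10k_2018": {
        "description": "Globex annual report 2018 revenue expenses marketing",
        "chunks": [
            "Marketing | 2018 | 12% of revenue",
            "Globex risk factors",
        ],
    },
}

def route(query):
    # Pick the group whose description shares the most words with the query.
    q = tokens(query)
    return max(GROUPS, key=lambda name: len(q & tokens(GROUPS[name]["description"])))

def query_group(query, k=2):
    # Route first, then rank only the chosen group's chunks.
    name = route(query)
    q = tokens(query)
    ranked = sorted(GROUPS[name]["chunks"],
                    key=lambda ch: len(q & tokens(ch)), reverse=True)
    return name, ranked[:k]

name, hits = query_group(
    "What percent of revenue did Acme spend on research and development in 2018",
    k=1,
)
print(name, hits)
```

Because the search only ever sees one group's chunks, look-alike passages from other filings cannot crowd out the answer, which is why a small top k is enough again.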
Do you have an example or explanation of keyword index tools and how to incorporate them?
I'm definitely interested in that idea because of how riddled these SEC files are with financial terms. I suspect it'll make the queries MUCH better
Yea you can use a keyword index just like any other index. Even with your base setup right now you could try swapping the vector index for a keyword index

There are two keyword indexes, one that uses basically simple string parsing to find keywords (fast, but sometimes doesn't find keywords), and a smarter version that asks the LLM to identify keywords (slower + token usage when building the index, maybe better search results)
GPTKeywordTableIndex is the smarter one, GPTSimpleKeywordTableIndex is the faster one
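As a rough picture of what the "simple string parsing" flavor does, here is a toy keyword table in plain Python. It is a sketch under assumptions, not the library's code: the stopword list, the extraction rule and the chunk texts are invented. The LLM-based flavor would, in this picture, swap `extract_keywords` for a call that asks a model for the keywords.

```python
import re
from collections import defaultdict

# Tiny hypothetical stopword list, just for this example.
STOPWORDS = {"the", "of", "and", "in", "on", "a", "to", "was", "what", "did"}

def extract_keywords(text):
    # Simple string parsing: lowercase words minus stopwords.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def build_keyword_table(chunks):
    # Map each keyword to the indices of the chunks that contain it.
    table = defaultdict(set)
    for i, chunk in enumerate(chunks):
        for kw in extract_keywords(chunk):
            table[kw].add(i)
    return table

def lookup(query, chunks, table):
    # Count how many query keywords hit each chunk; most hits first.
    counts = defaultdict(int)
    for kw in extract_keywords(query):
        for i in table.get(kw, ()):
            counts[i] += 1
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    return [chunks[i] for i in ranked]

# Invented chunks for illustration.
chunks = [
    "Research and development expense was 10% of revenue in 2018",
    "Research and development expense was 9% of revenue in 2017",
    "Marketing expense in 2018",
]
table = build_keyword_table(chunks)
results = lookup("research and development 2018", chunks, table)
print(results[0])
```

Here the year is matched literally as a keyword, so the 2018 row ranks above the otherwise identical 2017 row. That is the reason a keyword index can be a good fit for queries that name specific dates or financial terms.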