When I build an index on 1 pdf file I

When I build an index on 1 pdf file, I can request info from that file. "What percent of revenue did [company] spend on Research and Development in 2018?" gets a response of "10%". However, when I create an index of 10 files including that 1 file, I get: "There is no information provided about what percent of revenue [company] spent on research and development in 2018."

Is there a parameter or class I should be looking at when working with multiple files vs 1 file?
I guess it's something to do with your indexing structure
which then depends on the usage you want
After that, ask Logan, but idk if the technique is different with multiple documents
I assume that's the case too, but I'm just using the GPTVectorStoreIndex, which from those articles should maintain documents as nodes. It would seem there must be an issue with the top_k results returned not containing the relevant info.
Those articles are pretty rudimentary and don't explain much beyond doc->node and general aggregation of docs by similarity score.
Logan? Can you tag them here?
Maybe try a keyword index? That wouldn't work for other prompts, but it could help if you need something specific
I guess so, I don't do it because they must get so many pings
you want his @
Yea. There are 3 Logans.
he's one of the guys working on the project and he's helping us soooooo much
I was thinking I was just missing something silly. It sounds like that might not be the case.
These PDFs have tables with this info in them if that elucidates something
I'm kind of a beginner here too so I might not know the solution
That's what I've understood so far
I'm juggling a lot of code trying to understand how everything works, since there isn't much documentation or many examples online for what I'm trying to create
@Orion Pax dates are a little tricky for embedding retrieval. Are you using default options right now? You could try increasing the top k

index.as_query_engine(similarity_top_k=3)

Usually separating these documents into "groups" helps too. A group could be a single document, or a collection of documents on a topic. Then you can create an index for each group and use a router query engine or a graph to send your query to the correct documents.
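
To make the top_k part concrete, here's a minimal sketch of the default setup with a larger similarity_top_k. It assumes a 0.6-era llama_index install, and the ./pdfs folder name is just a placeholder for wherever your 10 files live:

from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# load all the PDFs from a folder (path is just an example)
documents = SimpleDirectoryReader("./pdfs").load_data()
index = GPTVectorStoreIndex.from_documents(documents)

# retrieve more chunks per query so the right table makes it into the LLM context
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("What percent of revenue did [company] spend on Research and Development in 2018?")
print(response)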
Yea. Default options
Had to increase it to 4 and that worked
Although it gave 11% instead of 10
Which is interesting. I wonder if one of the docs rounded up
Makes sense!

If you are interested as well, here's a page on the router query engine, maybe it's helpful. For example, I might include a keyword index tool for queries that mention specific dates? Idk, takes some messing around sometimes lol

https://gpt-index.readthedocs.io/en/latest/examples/query_engine/RouterQueryEngine.html#router-query-engine
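
Roughly following that docs page, a router setup could look something like this. Treat it as a sketch rather than a drop-in: exact import paths can vary by version, the folder path and tool descriptions are just examples.

from llama_index import GPTVectorStoreIndex, GPTSimpleKeywordTableIndex, SimpleDirectoryReader
from llama_index.query_engine.router_query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools.query_engine import QueryEngineTool

documents = SimpleDirectoryReader("./pdfs").load_data()

vector_index = GPTVectorStoreIndex.from_documents(documents)
keyword_index = GPTSimpleKeywordTableIndex.from_documents(documents)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=4),
    description="Useful for general questions about the filings",
)
keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_index.as_query_engine(),
    description="Useful for questions that mention specific financial terms or dates",
)

# an LLM selector reads the tool descriptions and picks the best one for each query
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, keyword_tool],
)
response = query_engine.query("What percent of revenue did [company] spend on Research and Development in 2018?")
print(response)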
Do you have an example or explanation of keyword index tools and how to incorporate them?
I'm definitely interested in that idea because of how riddled these SEC files are with financial terms. I suspect it'll make the queries MUCH better
Yea you can use a keyword index just like any other index. Even with your base setup right now you could try swapping the vector index for a keyword index

There are two keyword indexes, one that uses basically simple string parsing to find keywords (fast, but sometimes doesn't find keywords), and a smarter version that asks the LLM to identify keywords (slower + token usage when building the index, maybe better search results)
GPTKeywordTableIndex is the smarter one, GPTSimpleKeywordTableIndex is the faster one
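
Something like this, as a sketch of swapping either one in (same placeholder ./pdfs folder as above, untested):

from llama_index import GPTKeywordTableIndex, GPTSimpleKeywordTableIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./pdfs").load_data()

# "smarter" version: asks the LLM to extract keywords while building (slower, costs tokens)
llm_keyword_index = GPTKeywordTableIndex.from_documents(documents)

# "faster" version: plain string parsing to pull keywords (free to build, can miss terms)
simple_keyword_index = GPTSimpleKeywordTableIndex.from_documents(documents)

response = llm_keyword_index.as_query_engine().query("What percent of revenue did [company] spend on Research and Development in 2018?")
print(response)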