A community member is indexing a large dataset of 1,600 product reviews and finds that the generated summary only considers a few of the reviews. The index contains 229 document IDs, each covering roughly 7 reviews. They suspect that LlamaIndex stops after finding relevant information in the first few documents, and ask whether there are settings in LlamaIndex to ensure all documents are reviewed for relevancy before responding.
In the comments, one community member points out that the asker is using a VectorStoreIndex, which by default returns only the top-k most similar nodes, and recommends controlling this via index.as_query_engine(similarity_top_k=2). Another suggests adding a decision layer that determines whether the user is asking for a broad summary or a pointed question.
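To illustrate why a vector index surfaces only a handful of documents, here is a toy sketch of top-k retrieval in plain Python (no LlamaIndex dependency; the vectors and the k value are made up for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_vec, doc_vecs, k=2):
    # Rank every document by similarity, but hand only the k best
    # to the LLM. The rest are never seen, which is why a summary
    # over 229 docs can read as if it used only a couple of them.
    ranked = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

docs = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.3],
    "doc_c": [0.1, 0.9],
}
print(retrieve_top_k([1.0, 0.0], docs, k=2))  # → ['doc_a', 'doc_b']
```

Raising similarity_top_k widens the window, but for a full summary of 1,600 reviews it would have to cover every node, which defeats the point of similarity retrieval.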
There is no explicitly marked answer in the comments.
We're indexing large datasets that we save as text files. The one in question is a set of 1,600 product reviews collected from various product review sites. The issue seems to be that if I put them all in one index (which we store for later recall) and ask for a summary of the reviews, only a few of the reviews are considered in the analysis. In this case there are 229 doc IDs in the docstore file of the index, and each doc contains similar information covering approximately 7 reviews. Is this because when I query the index, LlamaIndex sees relevant information for my query in the first couple of docs and just stops there?
As a follow-up, are there settings in LlamaIndex I should use to ensure that all docs are reviewed for relevancy before responding? Thoughts and help much appreciated!
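One answer not spelled out in the thread: for whole-corpus summaries, summarize hierarchically so every document is read once, rather than retrieving top-k (in LlamaIndex this is roughly what a SummaryIndex or the tree_summarize response mode does). A minimal map-reduce sketch, with a stand-in summarize() stub in place of a real LLM call:

```python
def summarize(texts):
    # Stand-in for an LLM summarization call: just join the inputs,
    # so the control flow can be shown without an API key.
    return " | ".join(texts)

def tree_summarize(docs, fanout=3):
    # Reduce level by level: summarize groups of `fanout` docs, then
    # summarize the summaries, until one summary remains. Every
    # document is read at the leaf level, unlike top-k retrieval,
    # which silently drops everything outside the top k.
    level = list(docs)
    while len(level) > 1:
        level = [summarize(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

reviews = [f"review {i}" for i in range(7)]
print(tree_summarize(reviews))  # every review appears in the result
```

The trade-off is cost: this makes one LLM call per group at each level, so summarizing all 229 docs is many calls rather than one.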
I meant for that and all the other variables, such as temperature and top-p, when they're not configured in the Settings call. Shouldn't there be a listing of the defaults somewhere?
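For context on the follow-up: in recent LlamaIndex versions, sampling options like temperature are not attributes of the global Settings object itself; they are constructor parameters of the LLM you assign to Settings.llm, and their defaults are defined by that LLM class. A hedged configuration sketch, assuming the llama-index-core and llama-index-llms-openai packages (the model name is illustrative):

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Temperature and other sampling options live on the LLM object;
# Settings just holds the configured LLM for the whole pipeline.
# Anything not set here falls back to the OpenAI class defaults,
# which is where to look for the "listing" of default values.
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
```

So the listing the poster is asking about is effectively the parameter list of the chosen LLM class, not a separate Settings table.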