Creating a summary index with specific document retrieval

I have a use case in which I need to ingest 1000+ documents and create a VectorIndex and a SummaryIndex over them. I am able to successfully create the VectorIndex by adding metadata and retrieving with VectorIndexAutoRetriever. However, I am stuck at creating the SummaryIndex, as I need to retrieve only one particular document (identified by its metadata) and create its summary. How can I achieve this?
We've been (slowly) implementing the vector_store.get_nodes() method for some vector stores, which lets you pass in node IDs or metadata filters
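For stores that support it, the call shape is roughly this (a quick sketch; the node IDs are placeholders, and metadata_filters would be a MetadataFilters object like the one built further down in this thread):

Plain Text
# fetch specific nodes by their IDs...
nodes = vector_store.get_nodes(node_ids=["<node-id-1>", "<node-id-2>"])

# ...or fetch every node matching some metadata filters
nodes = vector_store.get_nodes(filters=metadata_filters)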

What vector store are you using?
There is no problem with the VectorIndex; I am facing issues while creating the SummaryIndex.
I am creating a SummaryIndex over 1000+ docs.
Now if the user asks for a summary of doc A, it should retrieve the nodes corresponding to only doc A and then create the summary. However, I found that this is not possible with the current abstractions.
I might be wrong here, hence I need your guidance
Yea, the summary index always retrieves all documents in it

What you probably want is some LLM call to generate metadata filters, and retrieve from your vector store using those filters
I am still confused, because I think what you are trying to say is that I should be storing the SummaryIndex in the VectorStore?
Is that possible?
No need for a summary index 👀

You can use vector_store.get_nodes(filters=filters), and then pass those nodes into tree-summarize (assuming you are using a vector store that supports that function)

Plain Text
from llama_index.core.response_synthesizers import TreeSummarize

synth = TreeSummarize(llm=llm)

# fetch only the nodes belonging to the target document
nodes = vector_store.get_nodes(filters=filters)

# summarize the text of just those nodes
response_str = synth.get_response("query", [node.text for node in nodes])
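Here `filters` is a MetadataFilters object matching whatever metadata identifies the document — for example, assuming your documents were ingested with a file_name metadata key (a hypothetical key; use whatever you actually set at ingestion):

Plain Text
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

# exact match on the metadata that identifies "doc A"
filters = MetadataFilters(
    filters=[MetadataFilter(key="file_name", value="doc_a.pdf")]
)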
Thanks for this answer Logan, I am using AzureAIVectorStore, which I think doesn't implement get_nodes(); I will try to do a PR for it.
Also, I wanted to know if we have some abstraction that will generate metadata filters from the query?
You can define a Pydantic object and use structured outputs (structured_predict, or as_structured_llm as shown below) to fill out the filters.

This example assumes just exact match
Plain Text
from llama_index.core.vector_stores import MetadataFilters, MetadataFilter
from pydantic import BaseModel, Field

class Filter(BaseModel):
  """A filter on metadata."""
  key: str = Field(description="The key name to filter on.")
  value: str = Field(description="The value to match on.")

class Filters(BaseModel):
  """A list of metadata filters for a query."""
  filters: list[Filter]

sllm = llm.as_structured_llm(Filters)

response = sllm.complete(f"I have an index with metadata like <some examples>. Given a user query, generate some filters (if any) that can be used to help narrow down the search.\n\n{user_query}")

filters = Filters.model_validate_json(str(response))

# convert each parsed filter into a MetadataFilter
# (note: iterate filters.filters, not the model itself)
metadata_filters = []
for f in filters.filters:
  metadata_filters.append(MetadataFilter(key=f.key, value=f.value))

# pass the converted list, wrapped in MetadataFilters, to the vector store
nodes = vector_store.get_nodes(filters=MetadataFilters(filters=metadata_filters))
(I did not test that lol)
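For completeness, the two snippets chain together roughly like this (equally untested, reusing the names defined above):

Plain Text
# chain the two steps: LLM-generated filters -> filtered node fetch -> summary
nodes = vector_store.get_nodes(filters=MetadataFilters(filters=metadata_filters))
synth = TreeSummarize(llm=llm)
response_str = synth.get_response(user_query, [node.text for node in nodes])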
Hi @Logan M - a PR for this has been raised but is still not merged; I would like to know if anything is pending
Ah yea it's buried in the mountain of PRs ⛰️ thanks for the bump
no issues 🙂
happy to help