@Logan M

At a glance

The community member has built a Q&A system using hybrid retrieval and now wants to add document summaries to the system. They ask if this can be done using LlamaIndex and whether it can be integrated into their existing ingestion pipeline with minimal code changes. The community members discuss different approaches, such as:

- Using the TreeSummarize component from LlamaIndex to generate summaries and potentially adding a custom transform component to the pipeline to store the summaries.

- Generating and storing the summaries separately from the pipeline, and then implementing a custom retriever to fetch the summaries as needed.

The community members agree that the main requirement is to maintain a mapping between the summaries and the corresponding documents/nodes, so that the summaries can be retrieved when needed.

Anurag Agrawal

@Logan M
Question on hybrid (embedding + keyword) based retrieval vs document summary index based retrieval:

I built a Q&A system using hybrid retrieval. My next task is to get a summary of the same documents that the Q&A system is built on. I will not be using this summary for retrieval; my task is simply to present users with a summary of the documents.
1) Is this something I can do with LlamaIndex?
2) If yes, I want to do it with minimal code changes. I am currently using an IngestionPipeline to persist things into the docstore and vector_store. Is there a way for me to include document summaries in the same pipeline?

Let me know if you need anything additional. Thanks!

6 comments

Logan M

You can generate the summary pretty easily

from llama_index.core.response_synthesizers import TreeSummarize

synth = TreeSummarize(llm=llm)

response = synth.get_response("Summarize the provided text", ["text1", "text2", ...])

You could add a custom transform component to your pipeline I suppose to do this, not sure where you want to store these summaries (in the docstore I guess?)
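
A minimal sketch of the custom-transform idea, not a confirmed implementation. It assumes a pipeline transformation is a callable that takes a list of nodes and returns a list of nodes (in LlamaIndex this would be a `TransformComponent` subclass) and that each node carries the id of its source document. `FakeNode`, `SummaryCollector`, `first_words`, and the dict-based store are hypothetical stand-ins; in practice the `summarize` callable could wrap `TreeSummarize.get_response`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FakeNode:
    """Stand-in for a LlamaIndex node (hypothetical, for illustration only)."""
    node_id: str
    ref_doc_id: str
    text: str


class SummaryCollector:
    """Pass-through step that records one summary per source document."""

    def __init__(self, summarize: Callable[[List[str]], str], store: Dict[str, str]):
        self.summarize = summarize  # e.g. a wrapper around TreeSummarize.get_response
        self.store = store          # doc_id -> summary; could be backed by Redis, Postgres, ...

    def __call__(self, nodes: List[FakeNode], **kwargs) -> List[FakeNode]:
        # Group chunk texts by the document they came from.
        chunks_by_doc = defaultdict(list)
        for node in nodes:
            chunks_by_doc[node.ref_doc_id].append(node.text)
        # Summarize each document once and remember the result under its id.
        for doc_id, chunks in chunks_by_doc.items():
            self.store[doc_id] = self.summarize(chunks)
        return nodes  # nodes are handed on unchanged to the next step


def first_words(chunks: List[str]) -> str:
    # Toy summarizer: keeps the first five words of the joined chunks.
    return " ".join(" ".join(chunks).split()[:5])


summaries: Dict[str, str] = {}
collector = SummaryCollector(first_words, summaries)
nodes = [
    FakeNode("n1", "doc-a", "Hybrid search combines embeddings"),
    FakeNode("n2", "doc-a", "with keyword matching."),
    FakeNode("n3", "doc-b", "Summaries are stored per document."),
]
out = collector(nodes)
```

In an ingestion pipeline, such a step would presumably sit after the node splitter so that every chunk already knows its source document; that ordering is an assumption rather than something confirmed in this thread.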

Anurag Agrawal

Thanks @Logan M ! I understand that part. This is how I am creating my index currently:

pg_vector_store = PGVectorStore.from_params(
    database=config.pg_db_name,
    host=config.pg_db_host,
    password=config.pg_db_password,
    port=config.pg_db_port,
    user=config.pg_db_user,
    table_name=db_table_name,
    embed_dim=384,  # bge-small-v1.5 embedding dimension
    hybrid_search=True,
    text_search_config="english",
    perform_setup=False,
)

storage_context = StorageContext.from_defaults(vector_store=pg_vector_store)

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        config.redis_host, config.redis_port, namespace=f"xxx_docstore"
    ),
    vector_store=pg_vector_store,
    docstore_strategy=DocstoreStrategy.UPSERTS_AND_DELETE,
)

nodes = pipeline.run(documents=documents)

I want to add a DocumentSummaryIndex-like feature to this so that I don't have to generate the summary at query time. I read in one of the tutorials that DocumentSummaryIndex will save the document summary for each "node" in the docstore. I want to add that feature so that at query time I can simply pick the summary up from the index.

Logan M

You just need to generate the summary and store it somewhere 🤔 This can be done a million different ways tbh haha

The document summary index works by generating summaries when you build the index. Then the summaries are embedded and used at retrieval time to pick which documents/sets of nodes to use to answer a question

The actual documents/nodes are just in the docstore, and the summary stored has a reference to which nodes/documents are related to it

Logan M

I hope that makes sense -- basically you just need to maintain some mapping of which summaries belong to which nodes, so that you can retrieve as needed
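
One possible shape for that mapping, as a sketch only. It assumes summaries are keyed by document id and that each record also lists the node ids derived from that document, so a summary can be found from either side. `SummaryRecord` and `SummaryMap` are hypothetical names, and the in-memory dicts stand in for whatever store is actually used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SummaryRecord:
    """One summary plus the ids it refers to (hypothetical structure)."""
    doc_id: str
    summary: str
    node_ids: List[str] = field(default_factory=list)


class SummaryMap:
    """Keeps doc -> summary and node -> doc links in sync."""

    def __init__(self) -> None:
        self._by_doc: Dict[str, SummaryRecord] = {}
        self._doc_of_node: Dict[str, str] = {}

    def put(self, record: SummaryRecord) -> None:
        # Upsert: drop node links left over from an earlier version of the document.
        old = self._by_doc.get(record.doc_id)
        if old is not None:
            for node_id in old.node_ids:
                self._doc_of_node.pop(node_id, None)
        self._by_doc[record.doc_id] = record
        for node_id in record.node_ids:
            self._doc_of_node[node_id] = record.doc_id

    def for_doc(self, doc_id: str) -> Optional[str]:
        record = self._by_doc.get(doc_id)
        return record.summary if record is not None else None

    def for_node(self, node_id: str) -> Optional[str]:
        doc_id = self._doc_of_node.get(node_id)
        return self.for_doc(doc_id) if doc_id is not None else None


smap = SummaryMap()
smap.put(SummaryRecord("doc-a", "Old summary.", ["n1", "n2"]))
# Re-ingesting the document replaces its summary and its node links.
smap.put(SummaryRecord("doc-a", "New summary.", ["n2", "n3"]))
```

Keeping the reverse link (node to document) means a summary can be looked up directly from any retrieved chunk, without scanning every record.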

Anurag Agrawal

Thanks @Logan M ! Since I have already implemented the hybrid Q&A pipeline, I was trying to see if there was a way to include summary in this pipeline and fetch it as and when needed. Seems like that isn't possible?

I understand how to generate and store the summary somewhere else, so I'll go with that

Logan M

You could implement a custom retriever that grabs the summaries when needed? (Not sure how you decide "when needed" though)
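
A rough sketch of the custom-retriever idea, under stated assumptions. It wraps an existing retrieval function and attaches a stored summary to each hit on request; in LlamaIndex this would more likely be a `BaseRetriever` subclass overriding `_retrieve`. `Hit`, `SummaryAttachingRetriever`, `fake_hybrid_search`, and the `with_summaries` flag are hypothetical. The flag simply leaves the "when needed" decision to the caller, which the thread does not settle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Hit:
    """Stand-in for a retrieved node with a score (hypothetical)."""
    node_id: str
    ref_doc_id: str
    text: str
    score: float
    metadata: Dict[str, str] = field(default_factory=dict)


class SummaryAttachingRetriever:
    """Delegates retrieval, then optionally adds each hit's document summary."""

    def __init__(self, base_retrieve: Callable[[str], List[Hit]], summaries: Dict[str, str]):
        self.base_retrieve = base_retrieve  # e.g. the existing hybrid retriever
        self.summaries = summaries          # doc_id -> summary, built at ingestion time

    def retrieve(self, query: str, with_summaries: bool = False) -> List[Hit]:
        hits = self.base_retrieve(query)
        if with_summaries:
            for hit in hits:
                summary = self.summaries.get(hit.ref_doc_id)
                if summary is not None:  # documents without a summary are left as-is
                    hit.metadata["doc_summary"] = summary
        return hits


def fake_hybrid_search(query: str) -> List[Hit]:
    # Toy replacement for the real hybrid retriever; ignores the query.
    return [
        Hit("n1", "doc-a", "chunk text", 0.9),
        Hit("n2", "doc-z", "other chunk", 0.4),
    ]


retriever = SummaryAttachingRetriever(fake_hybrid_search, {"doc-a": "Summary of document A."})
plain = retriever.retrieve("what is hybrid search?")
enriched = retriever.retrieve("what is hybrid search?", with_summaries=True)
```

Because the summaries are read from a precomputed mapping, nothing is generated at query time; the retriever only performs a dictionary lookup per hit.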
