So here is the scenario that I am trying to accomplish: I have a PDF containing text, images, and tables. I need to develop a RAG pipeline that retrieves images along with the text based on query relevance.

I can achieve this by creating TextNodes, ImageNodes, and IndexNodes and then using RecursiveRetriever to retrieve the nodes along with the images.
However, this approach has a problem: if there are more TextNodes with relevant text than similarity_top_k, the ImageNode won't be retrieved.
To avoid this, is it possible to have a workaround (or a feature in the library) such that the RecursiveRetriever retrieves TextNodes and ImageNodes separately, along with their scores, so that as a user I can decide whether to pass just the TextNodes or TextNodes + ImageNodes to the LLM in its context?
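Roughly, the setup looks like this (simplified sketch, not my exact code; imports assume a recent llama-index version):
Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.schema import ImageNode, IndexNode, TextNode

# plain text chunks from the PDF
text_nodes = [TextNode(text="...text chunk from the PDF...")]

# an image plus an LLM-generated summary; the IndexNode links the summary to the image
image_node = ImageNode(image_path="figure_1.png")  # illustrative path
summary_node = IndexNode(
    text="Summary of figure 1 ...",
    index_id=image_node.node_id,
)

# only text + summary nodes are embedded; the ImageNode is resolved at retrieval time
index = VectorStoreIndex(nodes=text_nodes + [summary_node])
retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": index.as_retriever(similarity_top_k=3)},
    node_dict={image_node.node_id: image_node},
)
nodes = retriever.retrieve("What does figure 1 show?")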

This use case is an important one IMO, and I feel it should be built into the library. I would love to hear some discussion on this, and I am more than happy to contribute if the need arises.
There is a multimodal index/retriever for this
The as_retriever() makes a multimodal retriever, which has text_retrieve() and image_to_text_retrieve() functions, among others
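Something like this (rough sketch; the exact import path and constructor args depend on your llama-index version, and the image store needs an image embedding model such as CLIP):
Plain Text
from llama_index.core.indices.multi_modal.base import MultiModalVectorStoreIndex

# stored_nodes: your existing TextNodes + ImageNodes
mm_index = MultiModalVectorStoreIndex(nodes=stored_nodes)
mm_retriever = mm_index.as_retriever(similarity_top_k=2, image_similarity_top_k=2)

text_results = mm_retriever.text_retrieve("What are the main paradigms of RAG?")  # text nodes only
all_results = mm_retriever.retrieve("What are the main paradigms of RAG?")        # text + image nodes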
but does it do it recursively?
as in, if my node list contains an IndexNode, does it fetch the mapped ImageNode and retrieve it?
all retrievers do that by default, it's baked into the BaseRetriever class
cool, will try it, and revisit this thread if it doesn't work
by the way, I also want to chat over this kind of use case, but I have noticed that there is no ChatEngine specific for this?
does the ContextChatEngine work in the above case?
Hmmm yea there is no multi-modal chat engine just yet -- you'd have to make your own loop using llm calls
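Roughly something like this (untested sketch; assumes the OpenAIMultiModal LLM and its complete(prompt, image_documents=...) call, plus a retriever that returns both text and image nodes):
Plain Text
from llama_index.core.schema import ImageNode
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview")
history = []

def chat(query: str, retriever) -> str:
    # retrieve -> split text vs image nodes -> build prompt -> call the multimodal LLM
    nodes = retriever.retrieve(query)
    text_context = "\n\n".join(n.get_content() for n in nodes if not isinstance(n.node, ImageNode))
    images = [n.node for n in nodes if isinstance(n.node, ImageNode)]
    prompt = "\n".join(history + [f"Context:\n{text_context}", f"User: {query}", "Assistant:"])
    response = mm_llm.complete(prompt=prompt, image_documents=images)
    history.extend([f"User: {query}", f"Assistant: {response.text}"])
    return response.text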
okay, I will try to contribute to the library by writing one, but I would need some guidance
I am getting the following error: 'MultiModalVectorIndexRetriever' object has no attribute 'object_map'
hmm.. maybe that wasn't updated to handle this recursive business
Here is the code
Plain Text
# build the multimodal index over the existing nodes, with no separate image vector store
mm_vector_store_index = MultiModalVectorStoreIndex(nodes=stored_nodes, image_vector_store=None)
mm_vector_retriever = mm_vector_store_index.as_retriever(similarity_top_k=2, image_similarity_top_k=2)
mm_vector_retriever.retrieve("What are main paradigm of RAG?")
also I think this won't work for what I am trying to achieve
I am not storing the images separately
and I do not want to generate embeddings for the images
I am generating summaries for the images and linking them to the actual images using IndexNodes
I am trying to do what Lance Martin explained in this video -
https://www.youtube.com/watch?v=Rcqy92Ik6Uo
and I have already achieved that using LlamaIndex, whereas he has done it in LangChain
but there are a couple of limitations to this approach, which I am trying to solve using llama-index
and hence I posted my first question
maybe you can use metadata filtering to filter out images vs text
if you attach metadata to your nodes
well filtering is after the retrieval stage
I am thinking, is there a way to separate out the nodes and run similarity over two sets of nodes?
no, filtering is before actually
that's how vector dbs implement it -- apply a filter, then perform similarity search
so then I need to run two retrievals, one only for text nodes and one for text nodes linked with image nodes?
Yea, that's how you would retrieve text vs. images separately
Plain Text
metadata_filters = MetadataFilters(
    filters=[MetadataFilter(key="type", value="image", operator=FilterOperator.EQ)]
)
vector_retriever = vector_store_index.as_retriever(similarity_top_k=2, filters=metadata_filters)
is there a better way to write MetadataFilters? I couldn't find one
That is the way to do it
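Putting it together, something like this (rough sketch; assumes you attach a "type" metadata key to every node when building the index, and that your vector store supports metadata filters):
Plain Text
from llama_index.core.vector_stores import FilterOperator, MetadataFilter, MetadataFilters

# one retriever per node "type", so text and image-summary nodes are scored separately
image_filters = MetadataFilters(
    filters=[MetadataFilter(key="type", value="image", operator=FilterOperator.EQ)]
)
text_filters = MetadataFilters(
    filters=[MetadataFilter(key="type", value="text", operator=FilterOperator.EQ)]
)

image_retriever = vector_store_index.as_retriever(similarity_top_k=2, filters=image_filters)
text_retriever = vector_store_index.as_retriever(similarity_top_k=2, filters=text_filters)

query = "What are the main paradigms of RAG?"
text_nodes = text_retriever.retrieve(query)
image_nodes = image_retriever.retrieve(query)

# inspect scores and decide which nodes to hand to the LLM
for n in text_nodes + image_nodes:
    print(n.score, n.node.metadata.get("type"))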