Can I use MetadataFilters with ExactMatchFilter for multiple values for the same key?

At a glance

The community member is trying to filter a vector store index down to a set of documents using MetadataFilters and ExactMatchFilter. They initially built one ExactMatchFilter per document ID, all on the same key, but only the last document was being retrieved.

Another community member suggested that LlamaIndex does not natively support multiple conditions, but that some vector stores, such as Pinecone, support more complex metadata filtering when the filter is supplied in the store's own format. They gave a Pinecone filter example and suggested the Mongo-style $in operator for this use case.

The final solution, posted by the original community member, was to use the "in" operator of the PostgreSQL vector database to filter the index by a list of document IDs, which seems to have resolved the issue.

Joshhhh

Can I use MetadataFilters with ExactMatchFilter for multiple values for the same key? I want to filter my index on a set of documents.

9 comments

Joshhhh

I want to filter on a set of documents like this:

Plain Text

def index_to_query_engine(conversation_docs: List[str], index: VectorStoreIndex) -> BaseQueryEngine:
    doc_ids = [str(doc.id) for doc in conversation_docs]
    filters = MetadataFilters(
        filters=[ExactMatchFilter(key=DB_DOC_ID_KEY, value=str(doc_id)) for doc_id in doc_ids]
    )
    kwargs = {"similarity_top_k": 3, "filters": filters}
    return index.as_query_engine(**kwargs)

Joshhhh

This looks like it's setting up the filters correctly:

Plain Text

INFO:app.chat.engine:vector_query_engine_tools is: [
    {
        "_metadata": {
            "description": "...",
            "fn_schema": "...",
            "name": "..."
        },
        "_query_engine": {
            "_node_postprocessors": [],
            "_response_synthesizer": "...",
            "_retriever": "...'_filters': MetadataFilters(filters=[ExactMatchFilter(key='db_document_id', value='0a427515-c6d7-4fe4-9cc2-e6078eb6001b'), ExactMatchFilter(key='db_document_id', value='27b0b587-9629-41e2-a8e5-7281a0e6f300'), ExactMatchFilter(key='db_document_id', value='7a6c32ef-5bc3-44f9-ae24-0b428cc36a00'), ExactMatchFilter(key='db_document_id', value='d93366e0-9362-48ab-89bf-1f986f5f6a9a'), ExactMatchFilter(key='db_document_id', value='eb163ad6-e4aa-4a88-a643-40cc1c4352fb'), ExactMatchFilter(key='db_document_id', value='f356a86b-2365-4cd0-b6b9-d7de13622bbc')]), '_kwargs': {}}",
            "callback_manager": "..."
        }
    }
]

Joshhhh

The issue is that it only seems to be grabbing citations/sources/documents for the last value:

Plain Text

INFO:app.schema:QuestionAnswerPair.from_sub_question_answer_pair: citations: [
    {
        "document_id": "f356a86b-2365-4cd0-b6b9-d7de13622bbc",
        "page_number": 1,
        "score": 0.8420158436317318,
        "text": "Underwood, Susan Ardmore..."
    }
]
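One possible explanation for this symptom, offered as an assumption rather than something confirmed in this thread: if the vector store adapter collapses exact-match filters into a plain dict keyed by the metadata key, each repeated key overwrites the previous one, so only the last value survives. The sketch below uses a hypothetical `ExactMatch` stand-in and a hypothetical `collapse_to_dict` helper, not the real LlamaIndex classes or any specific adapter code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExactMatch:
    """Hypothetical stand-in for ExactMatchFilter: just a key/value pair."""
    key: str
    value: str


def collapse_to_dict(filters: List[ExactMatch]) -> Dict[str, str]:
    """Assumed adapter behaviour: one dict entry per metadata key."""
    collapsed: Dict[str, str] = {}
    for f in filters:
        # A repeated key silently overwrites the previous value.
        collapsed[f.key] = f.value
    return collapsed


# Placeholder IDs, all filtered on the same key as in the question.
filters = [ExactMatch("db_document_id", doc_id) for doc_id in ["doc-a", "doc-b", "doc-c"]]
print(collapse_to_dict(filters))  # {'db_document_id': 'doc-c'}
```

Whether a given store behaves this way depends on its adapter and version, so it is worth checking how your store translates MetadataFilters before relying on this explanation.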

Łukasz

Yes, AFAIK LlamaIndex does not support multiple conditions natively, but you can do this with Pinecone. You just need to supply the condition as JSON rather than Python:

Plain Text

def filter_by_document_id_and_education(document_id):
    return {
        "filter": {
            "$and": [
                {"uuid": str(document_id)},
                {"contains_education": 1},
            ]
        }
    }

Łukasz

In theory some other vector stores also support complex metadata filtering, but only Pinecone has the docs for it

Łukasz

https://docs.pinecone.io/docs/metadata-filtering

Łukasz

So you could just use the Mongo-style $in operator to achieve this for your use case
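As a rough sketch of that suggestion, the function below builds a Pinecone-style filter that matches any of several document IDs with `$in`. The function name `filter_by_document_ids` is hypothetical, and the `uuid` metadata key is borrowed from the example above; swap in whatever key your index actually stores.

```python
from typing import Any, Dict, Iterable


def filter_by_document_ids(document_ids: Iterable[Any]) -> Dict[str, Any]:
    """Build a Pinecone-style filter matching any of the given document IDs."""
    # "uuid" is an assumed metadata key, taken from the earlier example.
    return {"filter": {"uuid": {"$in": [str(doc_id) for doc_id in document_ids]}}}


print(filter_by_document_ids(["doc-a", "doc-b"]))
```

The returned dict has the same outer shape as the `$and` example, so it should be usable wherever that one was passed.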

Łukasz

FYI just in case you run into the same problem as I do: https://discord.com/channels/1059199217496772688/1059200010622873741/1167774606254424134

Joshhhh

Thanks for the rec!! I'm using pg vector db and it turns out they have undocumented support for in: https://github.com/langchain-ai/langchain/issues/9726#issuecomment-1705465285

Here's my updated function, which seems to be correctly pulling multiple documents now!

Plain Text

def index_to_query_engine(conversation_docs: List[str], index: VectorStoreIndex) -> BaseQueryEngine:
    doc_ids = [str(doc.id) for doc in conversation_docs]
    filters = {DB_DOC_ID_KEY: {"in": doc_ids}}
    kwargs = {"similarity_top_k": 3, "filter": filters}
    return index.as_query_engine(**kwargs)
