
Updated last year


At a glance

The community member wants to restrict a vector store index to a set of documents using MetadataFilters and ExactMatchFilter. They built one ExactMatchFilter per document ID, all on the same key, but only the last document was retrieved.

Another community member replied that, as far as they knew, LlamaIndex does not natively support multiple conditions on the same key, and showed how Pinecone accepts a raw filter condition instead. They added that other vector stores may also support complex metadata filtering and suggested the Mongo-style $in operator for this use case.

The original poster, who uses a PostgreSQL vector database, then found that it has undocumented support for an "in" operator. Filtering by a list of document IDs with that operator appears to have resolved the issue.

Can I use MetadataFilters with ExactMatchFilter for multiple values for the same key? I want to filter my index on a set of documents
I want to filter on a set of documents like this:
def index_to_query_engine(conversation_docs: List[str], index: VectorStoreIndex) -> BaseQueryEngine:
    # Collect the database ID of every document in the conversation.
    doc_ids = [str(doc.id) for doc in conversation_docs]
    # Build one ExactMatchFilter per document ID, all on the same metadata key.
    filters = MetadataFilters(
        filters=[ExactMatchFilter(key=DB_DOC_ID_KEY, value=str(doc_id)) for doc_id in doc_ids]
    )
    kwargs = {"similarity_top_k": 3, "filters": filters}
    return index.as_query_engine(**kwargs)
This looks like it's setting up the filters correctly:
INFO:app.chat.engine:vector_query_engine_tools is: [
    {
        "_metadata": {
            "description": "...",
            "fn_schema": "...",
            "name": "..."
        },
        "_query_engine": {
            "_node_postprocessors": [],
            "_response_synthesizer": "...",
            "_retriever": "...'_filters': MetadataFilters(filters=[ExactMatchFilter(key='db_document_id', value='0a427515-c6d7-4fe4-9cc2-e6078eb6001b'), ExactMatchFilter(key='db_document_id', value='27b0b587-9629-41e2-a8e5-7281a0e6f300'), ExactMatchFilter(key='db_document_id', value='7a6c32ef-5bc3-44f9-ae24-0b428cc36a00'), ExactMatchFilter(key='db_document_id', value='d93366e0-9362-48ab-89bf-1f986f5f6a9a'), ExactMatchFilter(key='db_document_id', value='eb163ad6-e4aa-4a88-a643-40cc1c4352fb'), ExactMatchFilter(key='db_document_id', value='f356a86b-2365-4cd0-b6b9-d7de13622bbc')]), '_kwargs': {}}",
            "callback_manager": "..."
        }
    }
]
The issue is that it only seems to be grabbing citations/sources/documents for the last value:
INFO:app.schema:QuestionAnswerPair.from_sub_question_answer_pair: citations: [
    {
        "document_id": "f356a86b-2365-4cd0-b6b9-d7de13622bbc",
        "page_number": 1,
        "score": 0.8420158436317318,
        "text": "Underwood, Susan Ardmore..."
    }
]
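One plausible explanation for the behavior above, offered as an assumption rather than a confirmed diagnosis: if a vector store integration collapses the filter list into a mapping keyed by metadata key, several exact-match filters on the same key overwrite one another and only the last value survives. The sketch below models that with a hypothetical `ExactMatch` class and `flatten_filters` helper; neither is part of LlamaIndex, and the IDs are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExactMatch:
    """Hypothetical stand-in for an exact-match filter (not the LlamaIndex class)."""
    key: str
    value: str


def flatten_filters(filters: List[ExactMatch]) -> Dict[str, str]:
    """Collapse filters into a dict keyed by metadata key.

    A dict holds one value per key, so a later filter on the same key
    silently replaces an earlier one.
    """
    return {f.key: f.value for f in filters}


# Three filters on the same key, mirroring the setup in the question.
filters = [ExactMatch("db_document_id", v) for v in ("doc-a", "doc-b", "doc-c")]
flattened = flatten_filters(filters)
print(flattened)
```

If the store in use behaves this way, the filter list looks correct when logged, yet the query only ever sees the final ID.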
Yes, AFAIK LlamaIndex does not support multiple conditions natively, but you can do this with Pinecone. You just need to supply the condition as a raw JSON-style dict rather than as Python filter objects:

def filter_by_document_id_and_education(document_id):
    return {
        "filter": {
            "$and": [
                {"uuid": str(document_id)},
                {"contains_education": 1},
            ]
        }
    }
In theory, some other vector stores also support complex metadata filtering, but only Pinecone documents it.
So you could just use the Mongo-style $in operator to achieve this for your use case.
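As a rough sketch of that suggestion, the filter below matches any of several document IDs with a single `$in` condition, which Pinecone's metadata filter syntax supports. The function name `filter_by_document_ids` is hypothetical, and the `uuid` key is carried over from the example above; adjust it to whatever metadata key the index actually stores.

```python
from typing import Any, Dict, Iterable


def filter_by_document_ids(document_ids: Iterable[Any]) -> Dict[str, Any]:
    """Build a Pinecone-style filter that matches any of the given IDs.

    Hypothetical helper: the "uuid" key is an assumption taken from the
    example above, not a fixed field name.
    """
    ids = [str(document_id) for document_id in document_ids]
    return {"filter": {"uuid": {"$in": ids}}}


example = filter_by_document_ids(["0a427515", "27b0b587"])
print(example)
```

The returned dict has the same outer shape as the `$and` example, so it can be passed wherever that one was.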
Thanks for the rec!! I'm using pg vector db and it turns out they have undocumented support for in: https://github.com/langchain-ai/langchain/issues/9726#issuecomment-1705465285

Here's my updated function, which seems to be correctly pulling multiple documents now!
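def index_to_query_engine(conversation_docs: List[str], index: VectorStoreIndex) -> BaseQueryEngine:
    # Collect the database ID of every document in the conversation.
    doc_ids = [str(doc.id) for doc in conversation_docs]
    # A single "in" condition on the key replaces the list of ExactMatchFilter objects.
    filters = {DB_DOC_ID_KEY: {"in": doc_ids}}
    # Note that the keyword is "filter" here, not "filters".
    kwargs = {"similarity_top_k": 3, "filter": filters}
    return index.as_query_engine(**kwargs)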