I'm struggling with a straightforward

Nate Patterson FT25

I'm struggling with a straightforward use case: I am adding documents with specific metadata to a Pinecone vector DB.

metadata_filters = {"document_name": document_name}
vector_store = PineconeVectorStore(
    index_name=index_name,
    environment=environment,
    metadata_filters=metadata_filters,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
print('create storage context')
# Load the document
document = loader.load_data()

# Create index from document
document_name = VectorStoreIndex.from_documents(
    document,
    storage_context=storage_context,
    service_context=service_context,
)
# Set summary text for document
document_name.index_struct.index_id = document_name

I want to query specific documents instead of the entire index. I implemented metadata filtering, but the response is None, even though I have checked that the metadata exists (with an exact match). I have also checked that a response is returned when I remove all filters.

pinecone.init(api_key=PINECONE_API_KEY, environment=environment)

llm = OpenAI(temperature=0, model="gpt-3.5-turbo", max_tokens=1024)
service_context = ServiceContext.from_defaults(llm=llm)
vector_store = PineconeVectorStore(pinecone.Index("test"))
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine(
    similarity_top_k=5,
    service_context=service_context,
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key='document_name', value=document_name)]
    ),
)
response = query_engine.query(instruction)
print(response)

What is wrong with this approach?

6 comments

Logan M

Hmm, I see a few things

1. No need to set metadata_filters in the PineconeVectorStore constructor

2. When you load the data with the loader, double check that each document has the metadata you expect

documents = loader.load_data()
for doc in documents:
  print(doc.metadata)  # should print a dictionary that hopefully has the `document_name` key

3. Not totally sure what this line is for, you can probably remove it: document_name.index_struct.index_id = document_name
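
To illustrate why the metadata check matters, here is a minimal library-free sketch of exact-match filtering. It is a toy in-memory model, not the Pinecone or LlamaIndex API; the `exact_match` helper, the node dictionaries, and the `report.pdf` value are all hypothetical. It shows that nodes ingested without the expected key can never satisfy the filter, so a filtered query comes back empty even though an unfiltered one returns results.

```python
# Toy in-memory model of exact-match metadata filtering.
# This is NOT the Pinecone or LlamaIndex API; all names here are hypothetical.
from typing import Any


def exact_match(nodes: list[dict[str, Any]], key: str, value: Any) -> list[dict[str, Any]]:
    """Keep only nodes whose metadata has `key` set to exactly `value`."""
    return [n for n in nodes if n["metadata"].get(key) == value]


# Nodes ingested without the expected key, as if the loader never set it.
untagged = [
    {"text": "chunk 1", "metadata": {}},
    {"text": "chunk 2", "metadata": {"page": 2}},
]

# Nodes ingested after the metadata was attached to each document.
tagged = [{"text": "chunk 1", "metadata": {"document_name": "report.pdf"}}]

no_hits = exact_match(untagged, "document_name", "report.pdf")
hits = exact_match(tagged, "document_name", "report.pdf")
print(len(no_hits), len(hits))
```

If the real store behaves like this model, the fix belongs at ingestion time (on each document's metadata) rather than in the vector store constructor.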

Nate Patterson FT25

Thanks Logan, I'll give it a shot troubleshooting with 2 & 3. On #1, I was following the usage pattern here: https://gpt-index.readthedocs.io/en/latest/examples/composable_indices/city_analysis/PineconeDemo-CityAnalysis.html. Why would I not set metadata_filters? Here's the basic pattern given:

# Build city document index
from llama_index.storage.storage_context import StorageContext

city_indices = {}
for pinecone_title, wiki_title in zip(pinecone_titles, wiki_titles):
    metadata_filters = {"wiki_title": wiki_title}
    vector_store = PineconeVectorStore(
        index_name=index_name,
        environment=environment,
        metadata_filters=metadata_filters,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    city_indices[wiki_title] = VectorStoreIndex.from_documents(
        city_docs[wiki_title],
        storage_context=storage_context,
        service_context=service_context,
    )
    # set summary text for city
    city_indices[wiki_title].index_struct.index_id = pinecone_title

Logan M

I think that's an outdated demo; metadata_filters isn't used in the __init__ in the source code

Nate Patterson FT25

That's interesting and probably the root of my problem. I'm not seeing any errors from using metadata_filters=. I took a look at the docs, and I haven't been able to find a description of default metadata key:values. I do see metadata extractor classes, but I don't know how to use that to ensure that the metadata I want is extracted.

Nate Patterson FT25

I'll use your #2 to figure out what exactly is being added as tags

Nate Patterson FT25

Logan, you were right. In case this isn't obvious to anyone else, I'm dropping the solution below. You can use the extractor classes to get doc metadata or create your own. Then just do this:

metadata_filters = {"document_name": document_name}
documents = loader.load_data()
for doc in documents:
    doc.metadata = metadata_filters
    print(doc.metadata)
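
One possible refinement of the loop above, sketched under assumptions: the `Doc` class here is a hypothetical stand-in for whatever the loader returns (only its `metadata` dict matters), and `tag_documents` is an illustrative helper, not a library function. Instead of assigning the same dictionary object to every document, it merges the extra keys into a fresh dictionary per document, so metadata the loader already set is kept and no two documents share one mutable dict.

```python
# Hypothetical stand-in for a loaded document; only the `metadata` dict matters here.
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)


def tag_documents(documents, extra):
    """Merge `extra` into each document's metadata, one new dict per document."""
    for doc in documents:
        # Build a new dict so documents never share one mutable object,
        # and so any metadata the loader already set is preserved.
        doc.metadata = {**doc.metadata, **extra}
    return documents


docs = [Doc("page one", {"page": 1}), Doc("page two", {"page": 2})]
tag_documents(docs, {"document_name": "report.pdf"})
for doc in docs:
    print(doc.metadata)
```

Whether merging or overwriting is the better choice depends on whether the loader's own metadata is needed later for filtering.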
