Does llama_index work on metadata filtering?

At a glance

The post asks if llama_index supports metadata filtering, specifically for 30 PDF files with the same format, to allow asking questions from specific files. The comments discuss the following:

- Community members suggest using the ExactMatchFilter with the correct file name format (e.g., "120.pdf") and note that the vector database used needs to support metadata filtering.

- The community member is currently using llama_index without a vector database and has other indexes like CSV and docs that are working. They ask if they should move to Pinecone for just the PDF index or if using separate indexes would be more beneficial.

- Community members suggest that if metadata filtering is required, using a vector database like Pinecone is the best option, or alternatively, a custom node postprocessor could be used, but this would only filter after retrieval.

- The community member shares their implementation of using Pinecone for the PDF index and asks how to avoid loading the data from the SimpleDirectoryReader every time the server starts, suggesting a solution similar to "load_index_from_storage".

- Community members provide a suggestion to load the existing vector index by setting up the vector store object and using "VectorStoreIndex.from_vector_store".

Does llama_index work on metadata filtering? I mean, we have 30 PDF files with the same format; can we ask questions about a specific file?
19 comments
"b96104d4-2c13-4cfe-aa37-1c070ce8c2ae": {
"data": {
"id_": "b96104d4-2c13-4cfe-aa37-1c070ce8c2ae",
"embedding": null,
"metadata": {
"page_label": "2",
"file_name": "120.pdf"
},
"excluded_embed_metadata_keys": [],
"excluded_llm_metadata_keys": [],
"relationships": {
"1": {
"node_id": "2f4e8bc3-8cf8-4d95-929f-5a2e577f2bcc",
"node_type": null,
"metadata": {
"page_label": "2",
"file_name": "120.pdf"
},
"hash": "d5003a8a9955d94e34be208dbe8d5999facf147231374b189421d3f44e0ff8d7"
}
},
"hash": "d5003a8a9955d94e34be208dbe8d5999facf147231374b189421d3f44e0ff8d7",
"text": "Text skip from here intentionally",
"start_char_idx": 0,
"end_char_idx": 748,
"text_template": "{metadata_str}\n\n{content}",
"metadata_template": "{key}: {value}",
"metadata_seperator": "\n"
},
"type": "1"
},

It uses this filter:
filters = MetadataFilters(filters=[ExactMatchFilter(key="file_name", value="120")])
This is not working, and the engine responds with "None".
Notice it says ExactMatchFilter -- so you should have value="120.pdf"
Also, the vector db you are using needs to support metadata filtering. Thankfully most popular ones do
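The mismatch is easy to reproduce outside the library. A minimal sketch of what exact-match filtering does against the node metadata shown above (`exact_match` here is a stand-in for illustration, not the actual llama_index function):

```python
# Metadata as stored on the node (see the node dump above).
node_metadata = {"page_label": "2", "file_name": "120.pdf"}

def exact_match(metadata: dict, key: str, value: str) -> bool:
    # Exact matching compares the stored value verbatim -- no substring
    # or prefix logic, so "120" never matches "120.pdf".
    return metadata.get(key) == value

print(exact_match(node_metadata, "file_name", "120"))      # False: the filter misses
print(exact_match(node_metadata, "file_name", "120.pdf"))  # True
```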
Currently I've not used any vector db; I'm using llama_index's default indexing.
Note that I have other indexes as well, like CSV and docs indexes. These are working fine with the default llama_index indexing.
Should I move to Pinecone for only the PDF files index, keeping the rest of the indexes the same?
What do you suggest @Logan M
If you want metadata filtering to work, that's basically the only option for vector indexes πŸ€”

You could write a node postprocessor though too, but this would only filter nodes AFTER retrieval, rather than before, so you would need to set the top K pretty high in order to have meaningful filtering https://gpt-index.readthedocs.io/en/latest/core_modules/query_modules/node_postprocessors/usage_pattern.html#custom-node-postprocessor
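The post-retrieval idea can be sketched without the library: retrieve a generous top-k, then keep only candidates whose metadata matches. (The candidates are mocked here as plain `(metadata, score)` tuples; a real implementation would subclass the postprocessor base class from the linked docs.)

```python
def filter_by_metadata(scored_nodes, key, value):
    """Keep only (metadata, score) pairs whose metadata[key] equals value.

    Because this runs AFTER retrieval, the wanted node must already be in
    the candidate list -- hence the advice to raise similarity_top_k.
    """
    return [(meta, score) for meta, score in scored_nodes
            if meta.get(key) == value]

candidates = [
    ({"file_name": "120.pdf"}, 0.91),
    ({"file_name": "121.pdf"}, 0.89),
    ({"file_name": "120.pdf"}, 0.85),
]
print(filter_by_metadata(candidates, "file_name", "120.pdf"))
# Keeps the two 120.pdf entries; the 121.pdf entry is dropped.
```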
@Logan M if the node postprocessor applies the filter after retrieval, and the node that contains the specific information is not retrieved, then it says "Not enough information".

If I implement it so that the PDF files index uses a vector db (Pinecone) and the rest of the indexes use the default llama_index indexing, will it be beneficial or not?
It would help with performing proper metadata filtering, yes
@Logan M I've implemented Pinecone for the PDF indexes.
There is an issue raised here:
import os
from pathlib import Path

import pinecone
from llama_index import (
    Document,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import SimpleNodeParser
from llama_index.vector_stores import PineconeVectorStore

property_info_docs = SimpleDirectoryReader(
    Path("data_folder/Property Info Sheets")).load_data()

docs = []
for row in property_info_docs:
    docs.append(Document(
        text=row.text,
        id_=row.id_,  # was `docid=row.id`, which is not a Document argument
        metadata={'file_name': row.metadata['file_name']}  # extra_info is a deprecated alias
    ))
print("Docs with 0 index: ", docs[0])
print("Length of docs: ", len(docs))

parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(docs)

# initialize connection to pinecone
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# create the index if it does not exist already
index_name = 'demo'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,  # output size of text-embedding-ada-002
        metric='cosine'
    )

# connect to the index
pinecone_index = pinecone.Index(index_name)
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# set up our storage (vector db)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# set up the index/query process, i.e. the embedding model (and completion if used)
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context,
    service_context=service_context
)

By using this code all vectors are upserted into Pinecone.
So the question is: each time I start the server, I need to read the data with SimpleDirectoryReader into docs, which is then used for the VectorStoreIndex.
How can we load the vectors from the Pinecone db and avoid loading the data again and again?
Similar to load_index_from_storage.
To load an existing vector index, set up the vector store object to point to the existing vector store, then do

index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
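Putting that suggestion together with the earlier setup, a server-startup path might look like the sketch below. It reuses the same assumptions as the ingest script ('demo' index name, the same environment variables and embedding model); SimpleDirectoryReader is then only needed for the initial upsert, never at startup:

```python
import os

import pinecone
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.vector_stores import PineconeVectorStore

# Reconnect to the existing Pinecone index -- no re-reading of the PDFs.
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)
pinecone_index = pinecone.Index('demo')
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# The embedding model must match the one used at ingest time,
# or query vectors won't line up with the stored vectors.
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Build the index object directly from the vector store.
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
```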
@Logan M I asked a question about the address of property 19; it pulls the information, but the answer says the information is not provided.
Even though I used similarity_top_k=10
[Attachment: image.png]
Ha wow ok πŸ˜­πŸ˜… :PepeHands:
I think the recent changes to prompt templates were a bit too restrictive in v0.8.0
About to cut a release that hopefully helps with that
Currently using llama-index==0.7.7.
Which version should I use? Please suggest, @Logan M
@Logan M can we upload data for different indexes into a single Pinecone db?
I mean the CSV index, PDF index, and docs index.
Will it be better for performance, or will using separate indexes be more beneficial?