Does llama index work on metadata

Does llama_index work on metadata filtering? I mean, we have 30 PDF files with the same format; can we ask questions about a specific file? One of the stored nodes looks like this:
"b96104d4-2c13-4cfe-aa37-1c070ce8c2ae": {
"data": {
"id_": "b96104d4-2c13-4cfe-aa37-1c070ce8c2ae",
"embedding": null,
"metadata": {
"page_label": "2",
"file_name": "120.pdf"
},
"excluded_embed_metadata_keys": [],
"excluded_llm_metadata_keys": [],
"relationships": {
"1": {
"node_id": "2f4e8bc3-8cf8-4d95-929f-5a2e577f2bcc",
"node_type": null,
"metadata": {
"page_label": "2",
"file_name": "120.pdf"
},
"hash": "d5003a8a9955d94e34be208dbe8d5999facf147231374b189421d3f44e0ff8d7"
}
},
"hash": "d5003a8a9955d94e34be208dbe8d5999facf147231374b189421d3f44e0ff8d7",
"text": "Text skip from here intentionally",
"start_char_idx": 0,
"end_char_idx": 748,
"text_template": "{metadata_str}\n\n{content}",
"metadata_template": "{key}: {value}",
"metadata_seperator": "\n"
},
"type": "1"
},

I use this filter:
filters = MetadataFilters(filters=[ExactMatchFilter(key="file_name", value="120")])
This is not working; the engine responds with "None".
Notice it says ExactMatchFilter -- so you should have value="120.pdf"
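The distinction matters because an exact-match filter compares the full stored value, not a prefix or substring. A minimal plain-Python sketch of the semantics (the node dicts and `exact_match_filter` helper below are illustrative, not llama_index API):

```python
def exact_match_filter(nodes, key, value):
    """Keep only nodes whose metadata[key] equals value exactly."""
    return [n for n in nodes if n["metadata"].get(key) == value]

nodes = [
    {"text": "page 2 of 120.pdf", "metadata": {"file_name": "120.pdf", "page_label": "2"}},
    {"text": "page 1 of 121.pdf", "metadata": {"file_name": "121.pdf", "page_label": "1"}},
]

# value="120" matches nothing -- the stored value is the full "120.pdf"
print(len(exact_match_filter(nodes, "file_name", "120")))      # 0
print(len(exact_match_filter(nodes, "file_name", "120.pdf")))  # 1
```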
Also, the vector db you are using needs to support metadata filtering. Thankfully, most popular ones do
Currently I've not used any vector db; I'm using the default llama_index storage.
Note that I have other indexes as well, like a CSV index and a docs index. These are working fine with the default llama_index storage.
Should I move to Pinecone for only the PDF files index, and keep the rest of the indexes the same?
What do you suggest @Logan M?
If you want metadata filtering to work, that's basically the only option for vector indexes πŸ€”

You could write a node postprocessor though too, but this would only filter nodes AFTER retrieval, rather than before, so you would need to set the top K pretty high in order to have meaningful filtering https://gpt-index.readthedocs.io/en/latest/core_modules/query_modules/node_postprocessors/usage_pattern.html#custom-node-postprocessor
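The trade-off of filtering after retrieval can be sketched in plain Python (the corpus, `retrieve`, and `postprocess` functions below are mocked for illustration and are not the llama_index postprocessor API):

```python
# Mock corpus: 300 nodes spread across 30 files, each tagged with file_name metadata.
corpus = [{"id": i, "metadata": {"file_name": f"{100 + i % 30}.pdf"}} for i in range(300)]

def retrieve(query, top_k):
    """Stand-in for vector retrieval: just return the first top_k nodes."""
    return corpus[:top_k]

def postprocess(nodes, key, value):
    """Filter AFTER retrieval, like a custom node postprocessor would."""
    return [n for n in nodes if n["metadata"].get(key) == value]

# With a small top_k, few (or no) nodes from the target file survive the filter;
# a large top_k is needed to keep enough relevant nodes after filtering.
small = postprocess(retrieve("q", top_k=5), "file_name", "120.pdf")
large = postprocess(retrieve("q", top_k=150), "file_name", "120.pdf")
print(len(small), len(large))  # prints "0 5"
```

If none of the retrieved nodes carry the target metadata, the filter leaves nothing for the LLM, which is exactly the "Not enough information" failure mode discussed below.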
@Logan M if the node postprocessor applies the filter after retrieval, then when the node that contains the specific information is not retrieved, it says "Not enough information"

If I implement it so that the PDF files index uses a vector db (Pinecone) and the rest of the indexes use the default llama_index storage, will it be beneficial or not?
It would help with performing proper metadata filtering, yes
@Logan M I've implemented Pinecone for the PDF indexes.
There is an issue raised here:
import os
from pathlib import Path

import pinecone
from llama_index import (
    Document,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import SimpleNodeParser
from llama_index.vector_stores import PineconeVectorStore

property_info_docs = SimpleDirectoryReader(
    Path("data_folder/Property Info Sheets")).load_data()

docs = []
for row in property_info_docs:
    docs.append(Document(
        text=row.text,
        doc_id=row.doc_id,
        extra_info={'file_name': row.metadata['file_name']}
    ))
print("Docs with 0 index: ", docs[0])
print("Length of docs: ", len(docs))

parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(docs)

# initialize connection to pinecone
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# create the index if it does not exist already
index_name = 'demo'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )

# connect to the index
pinecone_index = pinecone.Index(index_name)
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# setup our storage (vector db)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# setup the index/query process, ie the embedding model (and completion if used)
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context,
    service_context=service_context
)

By using this code, all vectors are upserted into Pinecone.
So the question is: each time I start the server, I need to read the files again with SimpleDirectoryReader and pass the resulting docs into the VectorStoreIndex.
How can we load the vectors from the Pinecone db and avoid loading the documents again and again?
Similar to load_index_from_storage.
To load an existing vector index, set up the vector store object to point to the existing vector store, then do

index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
@Logan M I asked a question about the address of property 19; it pulls the information, but the answer says the information is not provided.
Even though I used similarity_top_k=10.
[Attachment: image.png]
Ha wow ok πŸ˜­πŸ˜… :PepeHands:
I think the recent changes to prompt templates were a bit too restrictive in v0.8.0
About to cut a release that hopefully helps with that
I'm currently using llama-index==0.7.7.
Which version should I use? Please suggest @Logan M
@Logan M can we upload data from different indexes into a single Pinecone db?
I mean the CSV index, PDF index, and docs index.
Will it be better in performance, or will using separate indexes be more beneficial?