Updated 3 months ago

Hello! I am playing around with llama-index and pinecone as a vector db. It seems like there's no control over the id when ingesting data to pinecone:

Plain Text
from llama_index.core import Document, VectorStoreIndex

llama_doc = Document(id_="f1", text="My text")
index = VectorStoreIndex.from_documents([llama_doc], storage_context=storage_context)


Pinecone will have a different internal id. The reason why I'm asking is that there seems to be no way of deleting docs on pinecone with metadata filtering - you have to use the ID. And I don't seem to be able to get the pinecone assigned ID after ingestion either.
You can create the nodes yourself and set the IDs

I haven't gotten around to working around their serverless limitation here -- looking at their docs, the node ids have the prefix of the original document id, and you can delete by prefix
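
A minimal sketch of the delete-by-prefix idea, using a plain dictionary as a stand-in for the Pinecone index so it runs without any client library. The `doc_id#chunk_index` ID scheme and the helper names are assumptions for illustration, not something llama-index does by default. Against a real serverless index the same two steps would be listing the IDs that share a prefix and then deleting those IDs.

```python
def chunk_id(doc_id: str, chunk_index: int) -> str:
    """Build a vector ID that carries the document ID as a prefix (hypothetical scheme)."""
    return f"{doc_id}#{chunk_index}"


def ids_with_prefix(store: dict, doc_id: str) -> list:
    """Return every stored ID that belongs to doc_id."""
    prefix = f"{doc_id}#"
    return [vector_id for vector_id in store if vector_id.startswith(prefix)]


def delete_document(store: dict, doc_id: str) -> int:
    """Delete all chunks of one document and report how many were removed."""
    to_delete = ids_with_prefix(store, doc_id)
    for vector_id in to_delete:
        del store[vector_id]
    return len(to_delete)


# In-memory stand-in for the vector store: ID -> chunk text.
store = {
    chunk_id("f1", 0): "first chunk of f1",
    chunk_id("f1", 1): "second chunk of f1",
    chunk_id("f10", 0): "only chunk of f10",
}

removed = delete_document(store, "f1")
print(removed, sorted(store))
```

The `#` separator is part of the prefix on purpose: without it, deleting `f1` would also match the chunks of `f10`.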

In any case, you can create nodes like this
Plain Text
from llama_index.core.schema import TextNode
node = TextNode(text=text, id_=id_)


Or you can edit existing nodes
Plain Text
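from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter()
nodes = splitter(documents)  # documents: a list of Document objects
for i, node in enumerate(nodes):
    # "some_id" is a placeholder; the suffix keeps each node ID unique
    node.id_ = f"some_id_{i}"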
Thank you! I'll have to look at the docs for that, but I'm a bit worried that TextNode(text=text, id_=id_) might have the same problem, as the argument id_ looks suspiciously similar 🙂
When inserting directly through pinecone as described here: https://docs.llamaindex.ai/en/stable/examples/vector_stores/existing_data/pinecone_existing_data/
I managed to set the pinecone ids, but the documents on pinecone were missing a lot of other attributes that were there for documents inserted through llama-index and that scared me a bit.
Just to clarify, in the example I provided above I also explicitly set an id with id_ but this is not the one pinecone uses when deleting by prefix.

To illustrate, this is how an ingested node looks coming from llama-index. I can't delete with the id 22 that I set.
[Attachment: image.png]
Only with the id "49db..." in the top left I can delete
ah I see, it seems like pinecone is just auto generating the ID?
I welcome a PR to clean up the integration -- their move to serverless has been relatively painful, but I haven't seen enough complaints to prioritize it
Yes that's what I assume but I don't know enough about the inner workings of llama-index to safely say so
Do you have a pointer which class to look at?
Also, is there any other db you would recommend? Is there one that is particularly well supported by llama-index? I'm just getting started and chose pinecone because I had heard about it a few times and it could get me started quickly.
chroma, qdrant, weaviate(? well, maybe), lancedb, milvus are all popular ones that I see in the community
cool I'll have a look!
happy to review a PR if you open one!
I'm already tempted to switch to chroma at this point and it was the first you mentioned! But creating a PR would also be nice and maybe help me understand how llama-index and the dbs are integrated!
Ok so looking through the code I think it's actually llama-index that creates the ID, and I guess that makes sense; the problem is that Pinecone's deletion is so limited. From my understanding the following happens:

1) VectorStoreIndex.from_documents(...) creates the nodes
2) Looking at the BaseNode it has a default factory for the node id: default_factory=lambda: str(uuid.uuid4()), description="Unique ID of the node."
3) The PineconeVectorStore seems to upsert correctly with the nodes id field as id, which pinecone then uses.
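
The three steps above explain why the document ID never shows up in Pinecone: each node created from the document gets its own ID, and that ID is a random UUID unless one is supplied. Below is a rough sketch of that default-factory behavior using a plain dataclass. `SketchNode` is a hypothetical stand-in, not the real `BaseNode`, and the `f1#0` ID is just an example value.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SketchNode:
    """Simplified stand-in for a node; not the real llama-index BaseNode."""

    text: str
    # Mirrors the default factory quoted above: a fresh random UUID per node.
    id_: str = field(default_factory=lambda: str(uuid.uuid4()))


# No ID given, so the factory generates a random UUID string.
auto = SketchNode(text="My text")
# A caller-supplied ID is kept as-is and the factory is not called.
explicit = SketchNode(text="My text", id_="f1#0")

print(auto.id_)
print(explicit.id_)
```

If the vector store upserts with the node ID, as step 3 suggests, then whichever of these two values the node carries is what ends up as the Pinecone ID.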

I wonder if transformations could help with getting control over the node ids when using from_documents.

TLDR: what you suggested in the first place probably works!
yea, creating the nodes yourself, or editing the nodes after applying a splitter, would help. You can insert nodes instead of documents using VectorStoreIndex(nodes=nodes, ...) or index.insert_nodes(nodes)
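
One way this could look end to end, as a sketch rather than a tested recipe. The fixed-width splitter and the `doc_id#index` ID scheme are assumptions made to keep the example dependency-free, and `plan_nodes` and `build_index` are hypothetical helper names. Only `build_index` needs llama-index installed, and it expects a `storage_context` that is already wired to the vector store.

```python
def split_into_chunks(text: str, chunk_size: int = 40) -> list:
    """Naive fixed-width splitter, used here only to keep the sketch dependency-free."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def plan_nodes(doc_id: str, text: str) -> list:
    """Pair each chunk with an ID derived from the document ID (hypothetical scheme)."""
    return [(f"{doc_id}#{i}", chunk) for i, chunk in enumerate(split_into_chunks(text))]


def build_index(doc_id: str, text: str, storage_context):
    """Create TextNodes with the planned IDs and index them (requires llama-index)."""
    from llama_index.core import VectorStoreIndex
    from llama_index.core.schema import TextNode

    nodes = [
        TextNode(text=chunk, id_=node_id)
        for node_id, chunk in plan_nodes(doc_id, text)
    ]
    return VectorStoreIndex(nodes=nodes, storage_context=storage_context)


# The ID planning can be checked on its own, without any vector store.
planned = plan_nodes("f1", "x" * 100)
print([node_id for node_id, _ in planned])
```

Because the IDs are derived from the document ID and chunk position, re-ingesting the same document produces the same IDs, so they can be recomputed later when it is time to delete.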