def _delete(self, doc_id: str, **delete_kwargs: Any) -> None:
    """Delete a document."""
    self._index_struct.delete(doc_id)
    self._vector_store.delete(doc_id)
def delete(self, doc_id: str) -> None:
    """Delete a Node."""
    if doc_id not in self.doc_id_dict:
        raise ValueError("doc_id not found in doc_id_dict")
    for vector_id in self.doc_id_dict[doc_id]:
        del self.nodes_dict[vector_id]
    del self.doc_id_dict[doc_id]
I don't think an exception should be thrown when doc_id is not in self.doc_id_dict, because it's possible that my index is constructed like this:
return GPTQdrantIndex(
nodes=[],
client=qdrant_client_instance,
service_context=service_context,
collection_name=self._collection_name,
)
I only use the client to operate on the index without loading data into memory. This used to work fine in older versions.
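A minimal sketch of what I have in mind (my own suggestion, not the library's current code): treat an unknown doc_id as a no-op instead of raising, so client-only indices keep working:

def delete(self, doc_id: str) -> None:
    """Delete a Node; silently return if doc_id was never loaded in memory."""
    if doc_id not in self.doc_id_dict:
        # The index may have been built with nodes=[] and operate purely
        # through the vector store client, so there is nothing to clean up.
        return
    for vector_id in self.doc_id_dict[doc_id]:
        del self.nodes_dict[vector_id]
    del self.doc_id_dict[doc_id]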
I have two documents, one about EIP-1 and the other about EIP-2, both stored in Qdrant. When I query for EIP-1, I always get results for EIP-2 instead. It seems the two documents' embeddings are too similar for retrieval to tell them apart. Do you have any good solutions?
You could look into using required_keywords or exclude_keywords in your query 🤔
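For example (the query text and keywords are made up, and the exact kwargs depend on the version you're on):

response = index.query(
    "What does EIP-1 specify?",
    required_keywords=["EIP-1"],
    exclude_keywords=["EIP-2"],
)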
Also, maybe you can pre-split the documents into well-defined sections before indexing?
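Something like this (a rough sketch; split_into_sections and the heading delimiter are placeholders for whatever structure your docs actually have, and the import path depends on your version, gpt_index vs llama_index):

from llama_index import Document

def split_into_sections(raw_text: str) -> list[Document]:
    # Assumes markdown-style "## " headings delimit sections; adjust the
    # delimiter to match your documents' real structure.
    sections = raw_text.split("\n## ")
    return [Document(text=s) for s in sections if s.strip()]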
Or, maybe you can try creating a composable index, like a vector index for each document and then a top-level index (but this will increase latency a bit)
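Roughly like this (a sketch assuming the 0.5-era API; the index summaries are placeholders you'd write per document):

from llama_index import GPTSimpleVectorIndex, GPTListIndex, ComposableGraph

# One vector index per document, then a top-level list index over them.
doc_indices = [GPTSimpleVectorIndex.from_documents([doc]) for doc in documents]
graph = ComposableGraph.from_indices(
    GPTListIndex,
    doc_indices,
    index_summaries=["Summary of EIP-1", "Summary of EIP-2"],
)
response = graph.query("What does EIP-1 specify?")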
I have a lot of documents of various types, and I feel it's difficult to build a reasonable composable index. In addition, the query input is not always consistent, so it's hard to generate keywords automatically.
I think at this point it's a limitation of embeddings 🤔 my only further suggestion is increasing similarity_top_k (and using response_mode="compact" to keep response times reasonable), and maybe playing with chunk_size_limit
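For reference, the knobs I mean (the values here are just examples to tune, and the service_context would be passed when building the index, not at query time):

from llama_index import ServiceContext

# Smaller chunks at index time can make each embedding more focused.
service_context = ServiceContext.from_defaults(chunk_size_limit=512)

# Retrieve more candidates, but stuff them into fewer LLM calls.
response = index.query(
    "What does EIP-1 specify?",
    similarity_top_k=5,
    response_mode="compact",
)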