Implementing Anomaly Detection on Ingested Documents Using In-Memory Vector Stores

At a glance

The community member is implementing an anomaly detection system on top of documents stored in a vector store (Milvus and OpenSearch). They are trying to load the documents and their embeddings into an in-memory vector store (FAISS) to perform clustering and anomaly detection. However, they are having trouble getting the retriever to return the embeddings along with the text of the documents. The community members suggest that most vector stores don't have an option to return embeddings, mostly to save memory. They recommend using the underlying client for the vector database or submitting a pull request to add the desired functionality.

ddi5corder4701

I am implementing anomaly detection on top of documents that have already been ingested into a vector store (I've been using Milvus and OpenSearch so far). I am taking a poor man's approach: loading the documents along with their embeddings from the vector store into an in-memory vector store (FAISS) and then running some clustering and anomaly detection (LOF, DBSCAN, FAISS), which requires the embeddings to be loaded from the underlying vector store (Milvus, OpenSearch, etc.). I'm not sure this is a good approach, so if you have a better one, I would love to hear it!

So, I've been at it for hours and still can't figure out how to get the retriever to return the embeddings along with the text of the documents already stored in the vector store. I tried it with both the Milvus and OpenSearch vector store indexes, and both seem to strip the "embeddings" field somewhere before returning the nodes in the code shown below. I debugged into the MilvusVectorStore code and can see that the embeddings are returned from the Milvus query but are stripped in MilvusVectorStore#_parse_from_milvus_results(..).

retriever = self.index.as_retriever(similarity_top_k=top_k)
nodes = retriever.retrieve('*')  # Get all documents
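
For the clustering and anomaly-detection step mentioned in the question, a minimal sketch of Local Outlier Factor (LOF) on already-loaded embeddings might look like the following. It is pure Python for illustration only: the function name `local_outlier_factor`, the Euclidean metric, and the toy vectors are assumptions rather than anything from the thread, and ties at the k-th neighbour distance are ignored. In practice a library implementation such as scikit-learn's `LocalOutlierFactor` would likely be a better fit.

```python
import math
from typing import List, Sequence


def local_outlier_factor(vectors: Sequence[Sequence[float]], k: int = 3) -> List[float]:
    """Return one LOF score per vector; scores well above 1 suggest outliers."""
    n = len(vectors)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of vectors")

    # Pairwise Euclidean distances (O(n^2), fine for a small in-memory sample).
    dist = [[math.dist(a, b) for b in vectors] for a in vectors]

    # Indices of the k nearest neighbours of each point, excluding the point itself.
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [dist[i][neighbors[i][-1]] for i in range(n)]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = sum(max(k_dist[j], dist[i][j]) for j in neighbors[i])
        lrd.append(k / max(reach, 1e-12))  # guard against duplicate points

    # LOF: mean neighbour density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy stand-in for embeddings pulled from the vector store (hypothetical data).
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
scores = local_outlier_factor(embeddings, k=2)
outlier_index = max(range(len(scores)), key=scores.__getitem__)
print(outlier_index, [round(s, 2) for s in scores])
```

The highest score marks the most isolated vector. Real embeddings are usually compared with cosine distance, so normalizing the vectors first (or swapping the metric) is probably worth trying.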

4 comments

Logan M

most vector stores don't have an option to return embeddings

Logan M

mostly to save memory

Logan M

I would just use the underlying client for whichever vectordb you are using

Logan M

Or feel free to make a PR
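
The suggestion to use the underlying client could be sketched roughly as follows. This is an illustration under assumptions, not a documented recipe: the helper name `fetch_texts_and_embeddings` is hypothetical, the field names `text` and `embedding` depend on how the index was created, and the call shape is meant to mirror pymilvus's `MilvusClient.query` (`collection_name`, `filter`, `output_fields`, `limit`), so it should be checked against the installed version.

```python
from typing import Any, Dict, List, Tuple


def fetch_texts_and_embeddings(
    client: Any,
    collection_name: str,
    text_field: str = "text",            # assumed field name
    embedding_field: str = "embedding",  # assumed field name
    limit: int = 1000,
) -> Tuple[List[str], List[List[float]]]:
    """Pull raw rows, vectors included, through the database client itself."""
    rows: List[Dict[str, Any]] = client.query(
        collection_name=collection_name,
        filter="",  # no condition; an empty filter needs an explicit limit
        output_fields=[text_field, embedding_field],
        limit=limit,
    )
    texts = [row[text_field] for row in rows]
    embeddings = [list(row[embedding_field]) for row in rows]
    return texts, embeddings
```

With pymilvus this would presumably be called as `fetch_texts_and_embeddings(MilvusClient(uri="http://localhost:19530"), "my_collection")`, where the URI and collection name are placeholders. Large collections would need paging rather than a single `limit`.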
