I'm using MongoDB as a vector store with a LlamaIndex backend. Retrieval works when I have only a few embedded documents, but as the number of embeddings grows, the backend can no longer find anything, not even the same information it was able to retrieve before the vector store grew. Any advice would be much appreciated.
This is more of a Mongo vector store problem 😅
I'm relatively new to this. What can I do for next steps?
The first step would be to debug the Mongo vector store code to see how it is working.
I use the standard llama-create python backend from https://github.com/run-llama/create-llama
I'm using text-embedding-3-large with 3072 dimensions and gpt-4-turbo-preview.
I pipe-concatenate the fields and embed them. MongoDB creates the embedding and metadata fields. I also have a Mongo Atlas search index defined with 3072 dimensions and cosine similarity.
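For reference, an Atlas Vector Search index definition matching that setup might look like the following sketch (the "embedding" path is an assumption; it must match the field name the vectors are actually stored under):

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 3072,
      "similarity": "cosine"
    }
  ]
}
```

A mismatch between this path or dimension count and the stored documents is a common reason retrieval silently returns nothing.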
Is there any way to see why or why not llamaindex uses a particular embedding from the vector store?
I thought it would be able to match fields based on the metadata object stored in each embedding.
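There's no direct trace of why a particular embedding is chosen, but conceptually retrieval just ranks the stored vectors by similarity to the query vector and keeps the top k. A dependency-free sketch of that ranking (toy 2-d vectors and made-up metadata, not the real 3072-d store):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector store": each entry has an embedding and a metadata dict.
store = [
    {"embedding": [1.0, 0.0], "metadata": {"doc": "a"}},
    {"embedding": [0.0, 1.0], "metadata": {"doc": "b"}},
    {"embedding": [0.9, 0.1], "metadata": {"doc": "c"}},
]

def top_k(query, k=2):
    """Return the metadata 'doc' ids of the k entries most similar to query."""
    ranked = sorted(store, key=lambda e: cosine(query, e["embedding"]), reverse=True)
    return [e["metadata"]["doc"] for e in ranked[:k]]

print(top_k([1.0, 0.0]))  # ['a', 'c']
```

So a document stops being retrieved when enough other vectors score closer to the query; the metadata is carried along with each hit but does not by itself influence the ranking.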
This is what I have in the index of the engine:
import logging
import os

from llama_index.core.indices import VectorStoreIndex
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

logger = logging.getLogger("uvicorn")


def get_index():
    logger.info("Connecting to index from MongoDB...")
    store = MongoDBAtlasVectorSearch(
        db_name=os.environ["MONGODB_DATABASE"],
        collection_name=os.environ["MONGODB_VECTORS"],
        index_name=os.environ["MONGODB_VECTOR_INDEX"],
    )
    index = VectorStoreIndex.from_vector_store(store)
    logger.info("Finished connecting to index from MongoDB.")
    return index
How can I change index = VectorStoreIndex.from_vector_store(store) to include metadata in the embeddings? I suspect the metadata is stored in the vector store during embedding, but the BasicChatEngine, which uses the index above, doesn't necessarily know how to use the metadata field.
You need to add the metadata at the time the nodes are created, i.e. during ingestion. Once that's done, use vector_store_kwargs or metadata filter params at query time to restrict the query to those metadata.
The first part, during ingestion, is done. How would I do the second part? Any references would be appreciated. The MongoDBAtlasVectorSearch initialization function has metadata_key: str = "metadata" as an input param, which in my case is also the default name "metadata". I'm not querying directly; I have a BasicChatEngine which is initialized from the index. Would I need to change the BasicChatEngine to a query engine?
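Conceptually, a metadata filter just narrows the candidate set before similarity ranking. A toy, dependency-free illustration (the field names here are made up; in LlamaIndex the equivalent is passing a MetadataFilters object to the retriever or query engine):

```python
def apply_filters(entries, filters):
    """Keep only entries whose metadata matches every (key, value) pair."""
    return [
        e for e in entries
        if all(e["metadata"].get(k) == v for k, v in filters.items())
    ]

docs = [
    {"text": "q1 report", "metadata": {"year": 2023, "team": "sales"}},
    {"text": "q2 report", "metadata": {"year": 2024, "team": "sales"}},
    {"text": "design doc", "metadata": {"year": 2024, "team": "eng"}},
]

hits = apply_filters(docs, {"year": 2024, "team": "sales"})
print([d["text"] for d in hits])  # ['q2 report']
```

Similarity ranking then runs over only the surviving entries, so a filter is how metadata actually constrains retrieval, rather than the engine "matching fields" on its own.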
@kapa.ai @Logan M when I use the S3Reader to get documents, the text that I get is encrypted. How can I get the decrypted documents?
Let me check the knowledge base for you. One moment! :books:
----------------------
I'm sorry, but the provided knowledge sources do not contain information on how to decrypt documents when using the S3Reader in LlamaIndex. The S3Reader is designed to read files or directories from an S3 bucket and load them as documents. It does not inherently provide a decryption mechanism.

If your documents are encrypted, you would need to implement a decryption process either before using the S3Reader or after retrieving the documents. This would typically involve using the appropriate decryption key and algorithm that matches the encryption method used on your documents.

For more details on how to use the S3Reader, you can refer to the LlamaIndex documentation.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai can i use langchain's S3 document reader to get documents and then use llamaindex to create nodes ?
Let me check the knowledge base for you. One moment! :books:
----------------------
Yes, you can use LangChain's S3 document reader to get documents and then use LlamaIndex to create nodes. Here's a general outline of the process:

  1. Use the S3Reader to load documents from an S3 bucket. The load_s3_files_as_docs method can be used for this purpose. Here's the method signature from the LlamaIndex documentation:
Plain Text
load_s3_files_as_docs(temp_dir=None) -> List[Document]


  2. Once you have the documents, you can use LlamaIndex to create nodes. The VectorStoreIndex.from_documents method can be used to create an index from the documents. Here's an example from the LlamaIndex documentation:
Plain Text
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
index = VectorStoreIndex.from_documents(documents, fs=s3)


Please note that you'll need to replace SimpleDirectoryReader with the S3Reader and adjust the parameters accordingly. Also, fs=s3 indicates that the filesystem being used is S3.

Remember to handle the documents appropriately based on their format and your specific requirements.
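The conversion step the answer glosses over is just remapping fields between the two libraries' document types. A dependency-free sketch of that mapping (plain dataclasses stand in for langchain's Document and llama_index.core.Document, which have the same attribute names):

```python
from dataclasses import dataclass, field

@dataclass
class LCDocument:
    """Stand-in for langchain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class LIDocument:
    """Stand-in for llama_index.core.Document (text + metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)

def convert(lc_docs):
    """Map LangChain-style documents onto LlamaIndex-style ones."""
    return [LIDocument(text=d.page_content, metadata=d.metadata) for d in lc_docs]

docs = convert([LCDocument(page_content="hello", metadata={"source": "s3://bucket/key"})])
print(docs[0].text)  # hello
```

With the real libraries installed, the converted documents can then be passed straight to VectorStoreIndex.from_documents.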
----------------------
----------------------