Updated 3 months ago

I have a dataset of ~2000 documents

I have a dataset of ~2000 documents which contain information about/checklists for various sports trading card sets. I need index retrieval time to be about half of what it currently is (it currently takes about a minute). What considerations should I make when deciding what type of index to use? I am currently using a Vector Store, which gives decent results but takes too long. Will I have to break up the index if I want to retrieve faster?
8 comments
Are you using the simple vector store (i.e. the default in-memory one)? If you have a large number of documents, we recommend using either FaissVectorStore (also in-memory) or an external vector DB (e.g. Pinecone, Weaviate, etc.)
Thank you. Could you specify what you mean by "in-memory"?
I am using
Python
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=index_path)
to save the index to file, then loading it with
Python
index = load_index_from_storage(StorageContext.from_defaults(persist_dir="index-???"))
query_engine = index.as_query_engine()
to load it from file. Loading it also takes about 30 minutes, so if there is a way to speed that up as well that would be great. I will look into the FaissVectorStore but could you explain how an external vectorDB is different?
Thank you very much
external vectorDB means that your documents are being sent to a separate service, which holds the data for you
It'd generally be much faster to query; queries should be on the order of 10s or 100s of milliseconds
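The raw similarity math over 2000 vectors is itself only milliseconds, which is one reason a minute-long query usually points at overhead elsewhere (index loading, LLM response synthesis) rather than the vector search. A rough self-contained sketch in plain NumPy (not the LlamaIndex API; the 1536-dim embedding size is an assumption):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((2000, 1536)).astype("float32")  # ~2000 docs, 1536-dim embeddings (assumed size)
query = rng.random((1536,)).astype("float32")

start = time.perf_counter()
# Brute-force L2 distance from the query to every stored vector, then take the top 5
dists = np.linalg.norm(docs - query, axis=1)
top5 = np.argsort(dists)[:5]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"top-5 ids: {top5.tolist()}, search took {elapsed_ms:.1f} ms")
```

Even this unoptimized brute-force pass typically finishes in single-digit milliseconds on commodity hardware.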
Thank you, this is amazing.
When you said external vector DBs, was the Redis Vector Store included in that? I seem to be unable to get it to retrieve in under 10 seconds consistently, let alone 10s or 100s of milliseconds.