There are two main steps when you query a vector index in LlamaIndex:
- Retrieve
The query string is embedded, and that embedding is used to retrieve the top-k most similar nodes. This uses an embedding model optimized for similarity search (by default, OpenAI's text-embedding-ada-002). A sketch of this step follows the list.
- Synthesize
Using those top-k nodes and the query string, generate a response in natural language. In this step, the LLM reads the query and the top-k nodes (along with some extra instructions from a prompt template), then generates a response using all that info and returns it. See the second sketch below.
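Here's a minimal sketch of the retrieve step on its own. It assumes the current `llama_index.core` import layout (older versions import from `llama_index` directly), an `OPENAI_API_KEY` in the environment, and a couple of toy documents; the texts and the top-k value are made up for illustration:

```python
import os

from llama_index.core import Document, VectorStoreIndex

# Assumes OPENAI_API_KEY is set; the default embedding model is OpenAI's.
assert "OPENAI_API_KEY" in os.environ

# Toy documents, purely for illustration.
documents = [
    Document(text="LlamaIndex builds indexes over your documents."),
    Document(text="A vector index stores one embedding per node."),
]

# Building the index chunks the documents into nodes and embeds each node.
index = VectorStoreIndex.from_documents(documents)

# Retrieve step only: embed the query, return the top-k most similar nodes.
retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("How does a vector index store data?")

for n in nodes:
    # Each result carries the node's text and its similarity score.
    print(f"{n.score:.3f}  {n.node.get_content()[:60]}")
```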
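For the synthesize step, the usual route is a query engine, which runs retrieval and then hands the query plus the retrieved nodes (wrapped in a prompt template) to the LLM. Continuing the sketch above, with the same hedges:

```python
# Query engine = retrieve + synthesize in one call.
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("How does a vector index store data?")

# The LLM's answer, generated from the query plus the retrieved nodes.
print(response.response)

# The nodes that were fed to the LLM during synthesis.
for sn in response.source_nodes:
    print(f"{sn.score:.3f}  {sn.node.get_content()[:60]}")
```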