----------------------
Chunking data into smaller pieces can make the resulting embeddings more precise. When documents are ingested into an index in LlamaIndex, they are split into chunks with a certain amount of overlap. The default chunk size is 1024 tokens and the default chunk overlap is 20 tokens. Changing these parameters changes the embeddings that are calculated: a smaller chunk size yields more precise embeddings, while a larger chunk size yields embeddings that are more general but can miss fine-grained details.
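To see this splitting step in isolation, here is a minimal sketch that runs the splitter directly on a document, using SentenceSplitter (the splitter LlamaIndex uses by default) with the default parameters; the document text here is a placeholder:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Chunk sizes and overlap are measured in tokens
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents([Document(text="Some long text ...")])

# Each node holds one chunk of the original text
for node in nodes:
    print(node.get_content()[:80])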
For example, if you halve the default chunk size, you may also want to double similarity_top_k from the default of 2 to 4, so that roughly the same amount of text is retrieved for each query. Here is an example of how to do this:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core import Settings

# Load documents from the local ./data directory
documents = SimpleDirectoryReader("./data").load_data()

# Halve the default chunk size and increase the overlap;
# these global settings are applied when the index is built
Settings.chunk_size = 512
Settings.chunk_overlap = 50

index = VectorStoreIndex.from_documents(
    documents,
)

# Double similarity_top_k from its default of 2 to compensate for the smaller chunks
query_engine = index.as_query_engine(similarity_top_k=4)
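Once the query engine is built, querying works as usual; a minimal usage sketch (the question string is a placeholder):

response = query_engine.query("What is this document about?")
print(response)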
In this example, the chunk size is set to 512, the chunk overlap is set to 50, and the similarity_top_k parameter is set to 4, double the default value of 2.
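To confirm the effect of similarity_top_k, you can inspect the source nodes attached to the response from the sketch above; with similarity_top_k=4 there should be four of them:

# Each retrieved chunk is returned as a scored source node on the response
for source_node in response.source_nodes:
    print(source_node.score, source_node.node.node_id)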
For more information, you can refer to the LlamaIndex documentation on chunk sizes.
----------------------