How does a smaller chunk size help? Why is it required to chunk the data into smaller sizes?

@kapa.ai How does a smaller chunk size help? Why is it required to chunk the data into smaller sizes?
----------------------
Chunking data into smaller pieces helps make the embeddings more precise. When documents are ingested into an index in LlamaIndex, they are split into chunks with a certain amount of overlap; the default chunk size is 1024 and the default chunk overlap is 20. Changing these parameters changes the embeddings that are calculated. Since each chunk gets its own embedding, a smaller chunk size gives each embedding less text to represent, so the embeddings are more precise; a larger chunk size produces more general embeddings that can miss fine-grained details.
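As a rough illustration, here is a minimal sketch (assuming a ./data directory of documents, and using SentenceSplitter, LlamaIndex's default splitter) that splits the same documents at two chunk sizes so you can compare how many nodes, and therefore embeddings, each setting produces:

Python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

# Smaller chunks produce more nodes, each embedded separately,
# so each embedding summarizes less text
for size in (1024, 512):
    splitter = SentenceSplitter(chunk_size=size, chunk_overlap=20)
    nodes = splitter.get_nodes_from_documents(documents)
    print(f"chunk_size={size}: {len(nodes)} nodes")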

For example, if you halve the default chunk size from 1024 to 512, you may also want to double similarity_top_k from its default of 2 to 4, so that each query still retrieves roughly the same amount of data. Here is an example of how to do this:

Python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core import Settings

# Load the documents to be indexed
documents = SimpleDirectoryReader("./data").load_data()

# Halve the default chunk size (1024 -> 512) and widen the overlap
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# Documents are split into 512-token chunks when the index is built
index = VectorStoreIndex.from_documents(
    documents,
)

# Retrieve twice as many chunks per query (the default is 2)
query_engine = index.as_query_engine(similarity_top_k=4)


In this example, the chunk size is set to 512 and the chunk overlap is set to 50. The similarity_top_k parameter is set to 4, which is double the default value of 2.
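To see the engine in action, you can run a query against it; the question string below is just a placeholder:

Python
# The engine embeds the query, retrieves the 4 most similar chunks,
# and synthesizes an answer from them
response = query_engine.query("What is the main topic of these documents?")
print(response)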

For more information, you can refer to the LlamaIndex documentation on chunk sizes.
----------------------