I have set the `TokenTextSplitter` with the following parameters:

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=20)

My service context also contains the same params, chunk_size and chunk_overlap.

Now, when I create two document objects using the text splitter and insert them, I check the docstore and find three doc objects for the two of them. One of them got chunked one more time.

If the TokenTextSplitter and the service_context contain the same values for chunk_size and chunk_overlap, why is an extra doc being created?
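Roughly what I'm doing (a sketch only, against the legacy ServiceContext-style llama_index API; exact imports and kwargs vary by version, the placeholder text and file_name metadata are made up, and building the index assumes embedding credentials are configured):

Python
# Sketch of the setup (legacy llama_index ServiceContext API; adjust to your version).
from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=20)
service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=20)

# Pre-split a long text into ~512-token pieces and wrap each piece in a Document.
chunks = text_splitter.split_text("lorem ipsum dolor sit amet " * 400)  # placeholder text
docs = [Document(text=c, metadata={"file_name": "example.txt"}) for c in chunks[:2]]

# Indexing runs the node parser again, and two ~512-token documents
# can come out as three nodes in the docstore.
index = VectorStoreIndex.from_documents(docs, service_context=service_context)
print(len(index.docstore.docs))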
Okay, I see that my doc is being chunked because of the metadata:

Python
# NOTE: Consider metadata info str that will be added
#   to the chunk at query time. This reduces the effective
#   chunk size that we can have
if metadata_str is not None:
    # NOTE: extra 2 newline chars for formatting when prepending in query
    num_extra_tokens = len(self.tokenizer(f"{metadata_str}\n\n")) + 1
    effective_chunk_size = self._chunk_size - num_extra_tokens

    if effective_chunk_size <= 0:
        raise ValueError(
            "Effective chunk size is non positive "
            "after considering metadata"
        )
else:
    effective_chunk_size = self._chunk_size
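To make the arithmetic concrete, a rough back-of-the-envelope sketch (the token counts are made up, and this is a simplified sliding-window estimate, not the library's exact splitting logic):

Python
import math

chunk_size = 512
chunk_overlap = 20
doc_tokens = 512        # assumed length of one pre-split document, in tokens
metadata_tokens = 60    # assumed cost of the metadata string, newlines and the +1 included

def num_chunks(effective_size: int) -> int:
    # First window, then one more window per stride of (size - overlap) tokens.
    stride = effective_size - chunk_overlap
    return max(1, math.ceil((doc_tokens - effective_size) / stride) + 1)

print(num_chunks(chunk_size))                    # 1 node if metadata is ignored
print(num_chunks(chunk_size - metadata_tokens))  # 2 nodes once the effective size shrinks

So even with identical chunk_size values in the splitter and the service context, a document that just fits into one chunk can spill into a second once the metadata overhead is subtracted, which is where the extra doc comes from.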
Yeah, the metadata is considered when chunking. You can configure your documents/nodes to exclude certain metadata from the embedding or LLM steps.
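For example (the metadata keys here are hypothetical; excluded_embed_metadata_keys / excluded_llm_metadata_keys are the standard Document/node attributes, but check them against your llama_index version):

Python
from llama_index import Document

doc = Document(
    text="...",  # your document text
    metadata={"file_name": "report.pdf", "summary": "a very long auto-generated summary"},
)

# Keep the bulky field out of what the embedding model and the LLM see,
# which can also shrink the metadata string that eats into the chunk size.
doc.excluded_embed_metadata_keys = ["summary"]
doc.excluded_llm_metadata_keys = ["summary"]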