RE: this, it turns out my hypothesis was more or less correct, based on my current understanding.
According to the tokenizer page (https://beta.openai.com/tokenizer), the general rule of thumb for characters -> tokens is:
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
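For reference, that ratio is easy to check locally with tiktoken. This is just a sketch, assuming the `cl100k_base` encoding and an arbitrary sample sentence; the exact encoding depends on the model:

```python
# Quick check of the ~4 characters/token rule of thumb (assumes tiktoken is installed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption

text = "A helpful rule of thumb is that one token generally corresponds to ~4 characters."
tokens = enc.encode(text)

print(len(text))                # number of characters
print(len(tokens))              # number of tokens
print(len(text) / len(tokens))  # characters per token, typically around 4 for plain English
```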
My experience with the tokensplitter is that
`len(vector) > chunk_size`
pretty much always holds, and that has thrown me off before. Not sure whether that is related to the aforementioned error, though.
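If chunk_size is counted in tokens while `len()` counts characters, the ~4 chars/token ratio would explain why the lengths always look bigger. A minimal sketch of that idea, using tiktoken and a simplified chunking loop as a stand-in for the actual splitter (so the names here are assumptions, not the real implementation):

```python
# Why a token-based chunk "looks bigger" than chunk_size: chunk_size counts tokens,
# but len() on the resulting text counts characters. Assumes tiktoken is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunk_size = 100  # measured in tokens

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 50
tokens = enc.encode(text)

# Simplified stand-in for the splitter: slice the token list into fixed-size windows.
chunks = [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

for chunk in chunks:
    # Each chunk was built from at most chunk_size tokens, yet its character length
    # is roughly 4x that, so len(chunk) > chunk_size almost every time.
    print(len(chunk), ">", chunk_size, "->", len(chunk) > chunk_size)
```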