
Updated 8 months ago

I'm having a very strange issue with the SentenceSplitter

At a glance

A community member is experiencing an issue with the SentenceSplitter node parser in their application. When they set a positive value for the chunk_overlap, they receive an error stating that the chunk_overlap size is greater than the node_chunk_size, even though this is not the case. The community member provided an example where a chunk_overlap of 8 is considered larger than a chunk_size of 160.

In the comments, another community member suggests that the issue may be caused by the values being converted to strings when retrieved from the environment variables using os.getenv(). They mention that they encountered a similar issue with the HierarchicalNodeParser and resolved it by removing the default values from os.getenv() and wrapping the chunk_size and chunk_overlap variables in int().

There is no explicitly marked answer, but the community members are collaborating to understand and resolve the issue.

I'm having a very strange issue with the SentenceSplitter node parser. When I use a node_chunk_overlap of 0 I have no issues, but if I use a positive value I always get an error saying the chunk overlap is greater than the chunk size, when it definitely is not. For example, a node_chunk_overlap of 8 is considered larger than a node_chunk_size of 160, as shown here:

2024-07-04 10:11:54 Traceback (most recent call last):
2024-07-04 10:11:54 File "/app/main.py", line 27, in <module>
2024-07-04 10:11:54 init_settings()
2024-07-04 10:11:54 File "/app/app/settings.py", line 41, in init_settings
2024-07-04 10:11:54 Settings.node_parser = SentenceSplitter(chunk_size=Settings.chunk_size, chunk_overlap=Settings.chunk_overlap)
2024-07-04 10:11:54 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-04 10:11:54 File "/usr/local/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py", line 81, in __init__
2024-07-04 10:11:54 raise ValueError(
2024-07-04 10:11:54 ValueError: Got a larger chunk overlap (8) than chunk size (160), should be smaller.

I don't understand how it thinks 8 is larger than 160.
6 comments
The value is being set in the default llama-create-app settings file, in the init_settings function here:

def init_settings():
    node_chunk_size = os.getenv("NODE_CHUNK_SIZE", 256)
    node_chunk_overlap = os.getenv("NODE_CHUNK_OVERLAP", 24)
    Settings.chunk_size = node_chunk_size
    Settings.chunk_overlap = node_chunk_overlap
    Settings.node_parser = SentenceSplitter(chunk_size=node_chunk_size, chunk_overlap=node_chunk_overlap)
    llm_configs = llm_config_from_env()
    embedding_configs = embedding_config_from_env()
    if embedding_configs["model"] == "BAAI/bge-small-en-v1.5":
        Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
    else:
        Settings.embed_model = OpenAIEmbedding(**embedding_configs)
    Settings.llm = OpenAI(**llm_configs)
    Settings.transformations = [Settings.node_parser, Settings.embed_model]
I'm not really sure either, the code is pretty straightforward.
I also don't run into the same issue:

>>> from llama_index.core.node_parser import SentenceSplitter
>>> from llama_index.core import Settings
>>> Settings.chunk_size = 160
>>> Settings.chunk_overlap = 8
>>> Settings.node_parser = SentenceSplitter(chunk_size=160, chunk_overlap=8)
>>> 
I think the value was somehow being converted into a string when it is read with os.getenv(). The same thing happened with the HierarchicalNodeParser, so I removed the default values from os.getenv() and wrapped all the chunk_size and chunk_overlap variables in int(), and it seems to be working now.
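
For anyone who lands here later, this is a rough sketch of what was probably going on. os.getenv() returns a str whenever the variable is actually set (the default is only used when it is missing), and Python compares strings character by character, so "8" sorts after "160" while "0" does not. That would also explain why an overlap of 0 never raised the error. The helper name read_int_env below is made up for illustration and is not part of llama_index; the variable names and defaults are the ones from the settings file above.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Hypothetical helper: read an environment variable and return it as an int.

    Falls back to `default` when the variable is unset or empty.
    """
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# String comparison is lexicographic, which reproduces the confusing behaviour:
print("8" > "160")   # True, because "8" sorts after "1"
print("0" > "160")   # False, which is why an overlap of 0 caused no error
print(8 > 160)       # False, the comparison that was actually intended

# Simulate the values from the original post being set in the environment.
os.environ["NODE_CHUNK_SIZE"] = "160"
os.environ["NODE_CHUNK_OVERLAP"] = "8"

chunk_size = read_int_env("NODE_CHUNK_SIZE", 256)
chunk_overlap = read_int_env("NODE_CHUNK_OVERLAP", 24)

# These are now ints and can be passed on, e.g. as
# SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap).
print(chunk_size, chunk_overlap, chunk_overlap < chunk_size)
```

Converting once, right where the environment is read, keeps every later use of the values (Settings.chunk_size, Settings.chunk_overlap and the node parser) working with integers.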