Updated 6 months ago

I'm having a very strange issue with the SentenceSplitter

I'm having a very strange issue with the SentenceSplitter node parser. When I use a node_chunk_overlap of 0 I have no issues, but with any positive value I always get an error saying the chunk overlap is greater than the chunk size, when it definitely is not. For example, a node_chunk_overlap of 8 is considered larger than a node_chunk_size of 160, as shown here:

2024-07-04 10:11:54 Traceback (most recent call last):
2024-07-04 10:11:54 File "/app/main.py", line 27, in <module>
2024-07-04 10:11:54 init_settings()
2024-07-04 10:11:54 File "/app/app/settings.py", line 41, in init_settings
2024-07-04 10:11:54 Settings.node_parser = SentenceSplitter(chunk_size=Settings.chunk_size, chunk_overlap=Settings.chunk_overlap)
2024-07-04 10:11:54 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-04 10:11:54 File "/usr/local/lib/python3.11/site-packages/llama_index/core/node_parser/text/sentence.py", line 81, in __init__
2024-07-04 10:11:54 raise ValueError(
2024-07-04 10:11:54 ValueError: Got a larger chunk overlap (8) than chunk size (160), should be smaller.

I don't understand how it thinks 8 is larger than 160?
The value is being set in the default create-llama app settings file, in the init_settings function here:

def init_settings():
    node_chunk_size = os.getenv("NODE_CHUNK_SIZE", 256)
    node_chunk_overlap = os.getenv("NODE_CHUNK_OVERLAP", 24)
    Settings.chunk_size = node_chunk_size
    Settings.chunk_overlap = node_chunk_overlap
    Settings.node_parser = SentenceSplitter(chunk_size=node_chunk_size, chunk_overlap=node_chunk_overlap)
    llm_configs = llm_config_from_env()
    embedding_configs = embedding_config_from_env()
    if embedding_configs["model"] == "BAAI/bge-small-en-v1.5":
        Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
    else:
        Settings.embed_model = OpenAIEmbedding(**embedding_configs)
    Settings.llm = OpenAI(**llm_configs)
    Settings.transformations = [Settings.node_parser, Settings.embed_model]
I'm not really sure either, the code is pretty straightforward.
I also don't run into the same issue when I pass the values in directly:

>>> from llama_index.core.node_parser import SentenceSplitter
>>> from llama_index.core import Settings
>>> Settings.chunk_size = 160
>>> Settings.chunk_overlap = 8
>>> Settings.node_parser = SentenceSplitter(chunk_size=160, chunk_overlap=8)
>>> 
I think the value was coming in as a string when it's read with os.getenv(). The same thing happened with the HierarchicalNodeParser, so I removed the default values from os.getenv() and wrapped all the chunk_size and chunk_overlap variables in int(), and it seems to be working now.
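For anyone who lands here later, this is a small sketch of why the error message looks impossible. os.getenv() returns a str whenever the variable is set (the integer default is only used when it is unset), and Python compares strings character by character, so "8" > "160" is true because "8" sorts after "1". This assumes the check inside SentenceSplitter is a plain `>` comparison, which the traceback suggests but which I have not verified against the source. The `env_int` helper is hypothetical, not part of create-llama or LlamaIndex, and the environment values are set by hand to mirror the traceback.

```python
import os


def env_int(name: str, default: int) -> int:
    """Hypothetical helper: read an environment variable as an int.

    Falls back to `default` when the variable is unset or empty.
    """
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Simulate the container environment from the traceback (assumed values).
os.environ["NODE_CHUNK_SIZE"] = "160"
os.environ["NODE_CHUNK_OVERLAP"] = "8"

# The original pattern: both values come back as strings because the
# variables are set, so the comparison is lexicographic, not numeric.
raw_size = os.getenv("NODE_CHUNK_SIZE", 256)
raw_overlap = os.getenv("NODE_CHUNK_OVERLAP", 24)
print(type(raw_overlap).__name__, raw_overlap > raw_size)  # str True

# Converting to int first gives the comparison the splitter expects.
chunk_size = env_int("NODE_CHUNK_SIZE", 256)
chunk_overlap = env_int("NODE_CHUNK_OVERLAP", 24)
print(chunk_overlap > chunk_size)  # False
```

Unlike a bare int(os.getenv("NODE_CHUNK_SIZE")), which raises a TypeError when the variable is unset, the helper keeps the defaults. The resulting ints can then be passed to SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap) as in the settings file above.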