`RagDatasetGenerator` is not respecting `ServiceContext`'s chunk size/overlap.

```
File "/Users/joshuasabol/Library/Caches/pypoetry/virtualenvs/llama-app-backend-CfJQzey9-py3.11/lib/python3.11/site-packages/llama_index/llama_dataset/generator.py", line 105, in from_documents
    nodes = run_transformations(
...
File "/Users/joshuasabol/Library/Caches/pypoetry/virtualenvs/llama-app-backend-CfJQzey9-py3.11/lib/python3.11/site-packages/llama_index/node_parser/text/sentence.py", line 147, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (1493) is longer than chunk size (1024). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
```
```python
NODE_PARSER_CHUNK_SIZE = 3000
NODE_PARSER_CHUNK_OVERLAP = 200

# set context for llm provider
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(
        model="gpt-3.5-turbo-1106",
        temperature=0.1,
        chunk_size=NODE_PARSER_CHUNK_SIZE,
        chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
    )
)

# instantiate a DatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    parent_nodes,
    service_context=gpt_35_context,
    num_questions_per_chunk=2,  # set the number of questions per node
    show_progress=True,
)
```
Based on the traceback, the error is raised in the `split_text_metadata_aware` method of the `SentenceSplitter` class. This error occurs when the length of the metadata is longer than the chunk size. The chunk size is defined by the `chunk_size` attribute of that class, and its default value is set by `DEFAULT_CHUNK_SIZE` (1024).

Even though you've set `NODE_PARSER_CHUNK_SIZE = 3000` and `NODE_PARSER_CHUNK_OVERLAP = 200`, the error message indicates that the chunk size is still 1024, which is less than the length of your metadata (1493). This suggests that the chunk size you've set is not being used by the `SentenceSplitter`.
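As a first check, you can inspect which chunk settings the node parser inside your `ServiceContext` actually ended up with. A minimal sketch, assuming a legacy 0.9.x `llama_index` where `ServiceContext` exposes its `transformations` and the default node parser is a `SentenceSplitter`:

```python
from llama_index.node_parser import SentenceSplitter

# Print the chunk settings the transformations pipeline will actually use.
# Assumption: llama_index 0.9.x, where ServiceContext exposes `transformations`.
for transform in gpt_35_context.transformations:
    if isinstance(transform, SentenceSplitter):
        print(transform.chunk_size, transform.chunk_overlap)
```

If this prints 1024 rather than 3000, the values you configured never reached the splitter.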
One possibility is that the `ServiceContext` you're using to create the `RagDatasetGenerator` is not correctly configured with the chunk size and overlap you've set. When you create the `ServiceContext` with `ServiceContext.from_defaults`, make sure you're setting the `chunk_size` and `chunk_overlap` arguments correctly.

Another possibility is that the `SentenceSplitter` is not using the chunk size from the `ServiceContext`, but is instead using its default value. You might need to check the implementation of the `SentenceSplitter` and the `run_transformations` function to make sure they're using the chunk size from the `ServiceContext`.
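To confirm that a splitter configured with your intended chunk size can handle this metadata, you can also call it directly on one of your inputs, outside of `RagDatasetGenerator`. A minimal sketch, assuming `llama_index` 0.9.x and reusing `parent_nodes` from your snippet:

```python
from llama_index.node_parser import SentenceSplitter

# Reproduce the metadata-aware split on its own. With chunk_size=3000 the
# metadata (length 1493, per the error) fits; with the default 1024 this same
# call raises the ValueError shown in your traceback.
splitter = SentenceSplitter(chunk_size=3000, chunk_overlap=200)
doc = parent_nodes[0]  # one of your input documents
chunks = splitter.split_text_metadata_aware(doc.get_content(), doc.get_metadata_str())
print(len(chunks))
```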
Finally, there might be an issue with the `ServiceContext` itself, or with the documents themselves. You might need to check the content of your documents, or the configuration of your `ServiceContext` and its transformations.

In your snippet, `chunk_size` and `chunk_overlap` are being passed to the `OpenAI` constructor rather than to `ServiceContext.from_defaults`. Pass them at the `ServiceContext` level instead:

```python
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(
        model="gpt-3.5-turbo-1106",
        temperature=0.1,
    ),
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)
```
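If you want to rule out any ambiguity about which splitter the pipeline uses, you can also build the `SentenceSplitter` yourself and hand it to the `ServiceContext`. A minimal sketch, assuming a 0.9.x `llama_index` where `ServiceContext.from_defaults` accepts a `node_parser` argument, and reusing the constants from your snippet:

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter

# Assumption: llama_index 0.9.x. Constructing the splitter explicitly means the
# chunk settings cannot be silently replaced by the defaults.
splitter = SentenceSplitter(
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)

gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1),
    node_parser=splitter,
)
```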