I don't think `RagDatasetGenerator` is respecting `ServiceContext`'s chunk size/overlap

I don't think RagDatasetGenerator is respecting ServiceContext's chunk size/overlap.

Getting this error:
Plain Text
File "/Users/joshuasabol/Library/Caches/pypoetry/virtualenvs/llama-app-backend-CfJQzey9-py3.11/lib/python3.11/site-packages/llama_index/llama_dataset/generator.py", line 105, in from_documents
    nodes = run_transformations(
...
File "/Users/joshuasabol/Library/Caches/pypoetry/virtualenvs/llama-app-backend-CfJQzey9-py3.11/lib/python3.11/site-packages/llama_index/node_parser/text/sentence.py", line 147, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (1493) is longer than chunk size (1024). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.


Despite explicitly setting chunk params:
Plain Text
NODE_PARSER_CHUNK_SIZE = 3000
NODE_PARSER_CHUNK_OVERLAP = 200

# set context for llm provider
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106",
               temperature=0.1,
               chunk_size=NODE_PARSER_CHUNK_SIZE,
               chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
    )
)

# instantiate a DatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    parent_nodes,
    service_context=gpt_35_context,
    num_questions_per_chunk=2,  # set the number of questions per node
    show_progress=True,
)
2 comments
The error you're encountering is raised in the split_text_metadata_aware method of the SentenceSplitter class. The splitter subtracts the metadata length from the chunk size to work out how much room is left for the actual text, so if the metadata alone is longer than the chunk size there is nothing left to split and it raises this ValueError. The chunk size is defined by the chunk_size attribute of the SentenceSplitter, and its default value is set by DEFAULT_CHUNK_SIZE (1024).

Even though you've set NODE_PARSER_CHUNK_SIZE = 3000 and NODE_PARSER_CHUNK_OVERLAP = 200, the error message indicates that the chunk size is still 1024, which is less than the length of your metadata (1493). This suggests that the chunk size you've set is not being used by the SentenceSplitter.

One possibility is that the ServiceContext you're using to create the RagDatasetGenerator is not actually configured with the chunk size and overlap you've set. When you create the ServiceContext with ServiceContext.from_defaults, make sure chunk_size and chunk_overlap are passed as arguments to from_defaults itself.
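For example (a minimal sketch, assuming the llama_index 0.9.x ServiceContext API), the chunk settings belong on from_defaults, not on the OpenAI constructor:
Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# chunk_size/chunk_overlap are keyword arguments of from_defaults itself
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1),
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)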

Another possibility is that the SentenceSplitter is not picking up the chunk size from the ServiceContext and is falling back to its default value. You might need to check how run_transformations builds its node parser to confirm it uses the settings from the ServiceContext.
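You can also remove the ambiguity entirely by constructing the splitter yourself and handing it to the service context (a sketch, assuming the 0.9.x-style node_parser argument):
Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter

# Build the splitter explicitly so the chunk settings cannot be silently ignored
splitter = SentenceSplitter(
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)

gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1),
    node_parser=splitter,
)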

If adjusting the chunk size and metadata doesn't resolve the issue, there might be a problem with the transformations defined in the ServiceContext, or with the documents themselves. In that case, check the content and metadata of your documents as well as the ServiceContext configuration and its transformations.
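A rough way to check whether the documents' metadata is the culprit is to print its length per document and, if needed, stop bulky keys from counting toward the chunk budget. This is only a sketch: the key names are hypothetical placeholders, and it assumes the 0.9.x Document/BaseNode API.
Plain Text
from llama_index.schema import MetadataMode

for doc in parent_nodes:
    # Approximate length of the metadata string the splitter weighs against chunk_size
    print(doc.doc_id, len(doc.get_metadata_str(mode=MetadataMode.LLM)))

    # Optionally exclude bulky fields from the LLM/embedding metadata.
    # "file_path" and "raw_html" are placeholder key names.
    doc.excluded_llm_metadata_keys = ["file_path", "raw_html"]
    doc.excluded_embed_metadata_keys = ["file_path", "raw_html"]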
Oh duh, I put chunk_size/chunk_overlap inside of the LLM parentheses 🤦‍♂️
Plain Text
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106",
            temperature=0.1,
    ),
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)