I don't think `RagDatasetGenerator` is respecting `ServiceContext`'s chunk size/overlap

I don't think RagDatasetGenerator is respecting ServiceContext's chunk size/overlap.

Getting this error:
Plain Text
File "/Users/joshuasabol/Library/Caches/pypoetry/virtualenvs/llama-app-backend-CfJQzey9-py3.11/lib/python3.11/site-packages/llama_index/llama_dataset/generator.py", line 105, in from_documents
    nodes = run_transformations(
...
File "/Users/joshuasabol/Library/Caches/pypoetry/virtualenvs/llama-app-backend-CfJQzey9-py3.11/lib/python3.11/site-packages/llama_index/node_parser/text/sentence.py", line 147, in split_text_metadata_aware
    raise ValueError(
ValueError: Metadata length (1493) is longer than chunk size (1024). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.


Despite explicitly setting chunk params:
Plain Text
NODE_PARSER_CHUNK_SIZE = 3000
NODE_PARSER_CHUNK_OVERLAP = 200

# set context for llm provider
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106",
               temperature=0.1,
               chunk_size=NODE_PARSER_CHUNK_SIZE,
               chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
    )
)

# instantiate a DatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    parent_nodes,
    service_context=gpt_35_context,
    num_questions_per_chunk=2,  # set the number of questions per node
    show_progress=True,
)
2 comments
The error you're encountering is raised in the split_text_metadata_aware method of the SentenceSplitter class. The splitter subtracts the metadata length from the chunk size to work out how much room is left for the actual text, so if the metadata alone is longer than the chunk size there is nothing left to split and it raises this ValueError. The chunk size is defined by the chunk_size attribute of the SentenceSplitter, and its default value is set by DEFAULT_CHUNK_SIZE (1024).

Even though you've set NODE_PARSER_CHUNK_SIZE = 3000 and NODE_PARSER_CHUNK_OVERLAP = 200, the error message indicates that the chunk size is still 1024, which is less than the length of your metadata (1493). This suggests that the chunk size you've set is not being used by the SentenceSplitter.

One possibility is that the ServiceContext you're using to create the RagDatasetGenerator is not actually configured with the chunk size and overlap you've set. When you create the ServiceContext with ServiceContext.from_defaults, make sure chunk_size and chunk_overlap are passed as arguments to from_defaults itself.
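For example (a minimal sketch, assuming the llama_index 0.9.x ServiceContext API), the chunk settings belong on from_defaults, not on the OpenAI constructor:
Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# chunk_size/chunk_overlap are keyword arguments of from_defaults itself
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1),
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)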

Another possibility is that the SentenceSplitter is not picking up the chunk size from the ServiceContext and is falling back to its default value. You might need to check how run_transformations builds its node parser to confirm it uses the settings from the ServiceContext.
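You can also remove the ambiguity entirely by constructing the splitter yourself and handing it to the service context (a sketch, assuming the 0.9.x-style node_parser argument):
Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter

# Build the splitter explicitly so the chunk settings cannot be silently ignored
splitter = SentenceSplitter(
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)

gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1),
    node_parser=splitter,
)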

If adjusting the chunk size and metadata doesn't resolve the issue, there might be a problem with the transformations defined in the ServiceContext, or with the documents themselves. In that case, check the content and metadata of your documents as well as the ServiceContext configuration and its transformations.
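A rough way to check whether the documents' metadata is the culprit is to print its length per document and, if needed, stop bulky keys from counting toward the chunk budget. This is only a sketch: the key names are hypothetical placeholders, and it assumes the 0.9.x Document/BaseNode API.
Plain Text
from llama_index.schema import MetadataMode

for doc in parent_nodes:
    # Approximate length of the metadata string the splitter weighs against chunk_size
    print(doc.doc_id, len(doc.get_metadata_str(mode=MetadataMode.LLM)))

    # Optionally exclude bulky fields from the LLM/embedding metadata.
    # "file_path" and "raw_html" are placeholder key names.
    doc.excluded_llm_metadata_keys = ["file_path", "raw_html"]
    doc.excluded_embed_metadata_keys = ["file_path", "raw_html"]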
Oh duh, I put chunk_size/chunk_overlap inside of the LLM parentheses 🤦‍♂️
Plain Text
gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-1106",
            temperature=0.1,
    ),
    chunk_size=NODE_PARSER_CHUNK_SIZE,
    chunk_overlap=NODE_PARSER_CHUNK_OVERLAP,
)