Chunking and breaking down metadata for ingestion

At a glance

The community member's metadata is too long for the chunk size in LlamaIndex, which raises a ValueError during ingestion. They ask for best practices for handling this situation. The comments suggest a few options:

1. Increase the chunk size to a larger value, such as 9,999,999.

2. Reduce the amount of metadata by using fewer or lighter metadata extractors (a sketch follows this summary).

3. Exclude certain metadata keys from what the LLM and embedding model see, using document.excluded_llm_metadata_keys and document.excluded_embed_metadata_keys.

No comment is explicitly marked as the answer, but the thread discusses these potential solutions.
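
For example, option 2 could be as simple as trimming the heaviest extractor. In the question below, SummaryExtractor(summaries=["prev", "self", "next"]) writes three section summaries into every node's metadata; limiting it to the current section cuts that to roughly a third. A minimal sketch, assuming the same `llm` as in the question:

Plain Text
from llama_index.core.extractors import SummaryExtractor

# Sketch of option 2: summarize only the current section instead of
# prev/self/next, so far less summary text lands in each node's metadata.
summary_extractor = SummaryExtractor(summaries=["self"], llm=llm, num_workers=8)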

Is there a way to chunk or break down metadata into smaller chunks to save in LlamaIndex?

I'm having an issue where my metadata is too long for the chunk size:

Plain Text
ValueError: Metadata length (379349) is longer than chunk size (2048). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
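
For context: SentenceSplitter is metadata-aware, so each chunk's budget is chunk_size minus the length of the node's serialized metadata, and this ValueError fires when the metadata alone exceeds chunk_size. A quick way to see what is being measured (a sketch assuming `documents` is the list being ingested; the splitter counts tokens, so character length is only an approximation, and the extractors will add more metadata before splitting):

Plain Text
from llama_index.core.schema import MetadataMode

# Rough check: character length of the serialized metadata per document.
for doc in documents:
    print(len(doc.get_metadata_str(mode=MetadataMode.EMBED)))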
My code is as follows; the doc giving me the error has a text of length 63618.
Plain Text
import os

from llama_index.core.extractors import (
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.readers.json import JSONReader  # import path may vary by version
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# `llm` and `vector_store` are assumed to be defined elsewhere.
title_extractor = TitleExtractor(llm=llm, num_workers=8)
qa_extractor = QuestionsAnsweredExtractor(llm=llm, questions=3, num_workers=8)
summary_extractor = SummaryExtractor(summaries=["prev", "self", "next"], llm=llm, num_workers=8)
keyword_extractor = KeywordExtractor(llm=llm, num_workers=8)
sentence_splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=512)
huggingface_embedding = HuggingFaceEmbedding(model_name="../../huggingface_models/bge-large-en-v1.5/")

# Load every JSON file under ./cleaned_json/ into a flat list of documents.
documents = []
for root, folders, files in os.walk("./cleaned_json/"):
    for file in files:
        filepath = f"{root}/{file}"
        file_doc = JSONReader(levels_back=0).load_data(input_file=filepath)
        documents.extend(file_doc)

# Extractors run first, then the splitter, then the embedding model.
pipeline = IngestionPipeline(
    transformations=[
        title_extractor,
        qa_extractor,
        summary_extractor,
        keyword_extractor,
        sentence_splitter,
        huggingface_embedding,
    ],
    vector_store=vector_store,
)

pipeline.run(documents=documents, show_progress=True, cache_collection="./pipeline_storage")

pipeline.persist("./pipeline_storage_persist")
What are the best practices for metadata that's too long? Is it advisable for me to simply increase the chunk size to something large, e.g. 9,999,999? Or should I just keep my metadata shorter by using fewer metadata extractors? Or am I able to somehow store multiple "chunks" of metadata alongside the node?

If not, what should I be doing?
If you have long metadata, you can (and should) exclude it:

Plain Text
document.excluded_llm_metadata_keys = ["key1", ...]
document.excluded_embed_metadata_keys = ["key1", ...]
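
Concretely, a sketch for the pipeline in this thread (the key names below are the defaults these extractors write in recent llama-index versions; treat them as assumptions and verify against your installed version):

Plain Text
# Hide the extractor-generated metadata from the LLM and the embedding model
# (and therefore from the splitter's length check) while still storing it on
# each node. Key names are assumed defaults; confirm for your version.
EXCLUDED_KEYS = [
    "document_title",                     # TitleExtractor
    "questions_this_excerpt_can_answer",  # QuestionsAnsweredExtractor
    "prev_section_summary",               # SummaryExtractor
    "section_summary",
    "next_section_summary",
    "excerpt_keywords",                   # KeywordExtractor
]

for document in documents:
    document.excluded_llm_metadata_keys = EXCLUDED_KEYS
    document.excluded_embed_metadata_keys = EXCLUDED_KEYS

Nodes produced by the splitter inherit these exclusion lists from their source document, so the excluded keys stop counting against chunk_size at split time.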