
Chunking and breaking down metadata for ingestion

Is there a way to chunk or break down metadata into smaller chunks to save in LlamaIndex?

I'm having an issue where my metadata is too long for the chunk size:

Plain Text
ValueError: Metadata length (379349) is longer than chunk size (2048). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
My code is as follows; the document giving me the error has a text length of 63618.
Plain Text
import os

# import paths below assume a recent (0.10+) LlamaIndex install
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    KeywordExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.json import JSONReader

# llm and vector_store are assumed to be defined earlier in the script
title_extractor = TitleExtractor(llm=llm, num_workers=8)
qa_extractor = QuestionsAnsweredExtractor(llm=llm, questions=3, num_workers=8)
summary_extractor = SummaryExtractor(summaries=["prev", "self", "next"], llm=llm, num_workers=8)
keyword_extractor = KeywordExtractor(llm=llm, num_workers=8)
sentence_splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=512)
huggingface_embedding = HuggingFaceEmbedding(model_name="../../huggingface_models/bge-large-en-v1.5/")

# load every cleaned JSON file into a flat list of Documents
documents = []
for root, folders, files in os.walk("./cleaned_json/"):
    for file in files:
        filepath = f"{root}/{file}"
        file_doc = JSONReader(levels_back=0).load_data(input_file=filepath)
        documents.extend(file_doc)

pipeline = IngestionPipeline(
    transformations=[
        title_extractor,
        qa_extractor,
        summary_extractor,
        keyword_extractor,
        sentence_splitter,
        huggingface_embedding,
    ],
    vector_store=vector_store,
)

pipeline.run(documents=documents, show_progress=True, cache_collection="./pipeline_storage")

pipeline.persist("./pipeline_storage_persist")
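
A quick way to check how much metadata each document already carries before the pipeline runs (a sketch using the documents list above; get_metadata_str is the standard accessor on LlamaIndex documents and nodes, and 2048 simply mirrors the chunk size used here; note that metadata added by the extractors mid-pipeline won't show up in this check):
Plain Text
for doc in documents:
    # metadata string the splitter must fit inside every chunk alongside the text
    metadata_len = len(doc.get_metadata_str())
    if metadata_len > 2048:
        # per-key sizes make the offending field easy to spot
        sizes = {key: len(str(value)) for key, value in doc.metadata.items()}
        print(doc.doc_id, metadata_len, sizes)
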
What are the best practices for handling metadata that's too long? Is it advisable for me to simply increase the chunk size to something large, e.g. 9,999,999? Or should I just keep my metadata shorter by using fewer metadata extractors? Or am I able to somehow store multiple "chunks" of metadata alongside the node?

If not, what should I be doing?
If you have long metadata, you can (and should) exclude it from what gets sent to the LLM and the embedding model:

Plain Text
document.excluded_llm_metadata_keys = ["key1", ...]
document.excluded_embed_metadata_keys = ["key1", ...]
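
For example, applied to the documents from the question before running the pipeline (a sketch; the key names are illustrative stand-ins for whatever fields are actually long in your metadata, such as the summary and question keys those extractors write):
Plain Text
# illustrative key names; list whichever metadata keys are actually long
long_keys = [
    "section_summary",
    "prev_section_summary",
    "next_section_summary",
    "questions_this_excerpt_can_answer",
]

for doc in documents:
    # keep the long fields out of both the LLM prompt and the embedding text;
    # the nodes the splitter produces inherit these exclusion lists
    doc.excluded_llm_metadata_keys = long_keys
    doc.excluded_embed_metadata_keys = long_keys

# then run the pipeline exactly as before
pipeline.run(documents=documents, show_progress=True, cache_collection="./pipeline_storage")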