This might be a bug in llama-index, or I may not understand how to use the new IngestionPipeline transformations properly.

My nodes carry a lot of metadata for logging and post-processing tasks. If that metadata gets included in a transformation, the input hits the 3900-token limit set in my LlamaCpp config, so I need to exclude it from any transformation that relies on the LLM. I'm trying to use SummaryExtractor(), which I have set to use the Mistral 7B model, but nothing I try keeps the metadata out of what SummaryExtractor() sends to Mistral 7B. My code (deliberately a bit redundant, for extra certainty) looks like this:
    pipeline = IngestionPipeline(
        transformations=[
            CustomTransformation(),
            SummaryExtractor(
                llm=llm,
                excluded_embed_metadata_keys=[
                    DEFAULT_WINDOW_METADATA_KEY,
                    DEFAULT_OG_TEXT_METADATA_KEY,
                ],
                excluded_llm_metadata_keys=[
                    DEFAULT_WINDOW_METADATA_KEY,
                    DEFAULT_OG_TEXT_METADATA_KEY,
                ],
            ),
            service_context.embed_model,
        ]
    )

    excluded_embed_metadata_keys = [
        DEFAULT_WINDOW_METADATA_KEY,
        DEFAULT_OG_TEXT_METADATA_KEY,
    ]
    excluded_llm_metadata_keys = [
        DEFAULT_WINDOW_METADATA_KEY,
        DEFAULT_OG_TEXT_METADATA_KEY,
    ]

    nodes = pipeline.run(
        nodes=nodes,
        excluded_embed_metadata_keys=excluded_embed_metadata_keys,
        excluded_llm_metadata_keys=excluded_llm_metadata_keys,
    )
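
To make the expected behaviour concrete, here is a small stand-alone sketch of what I mean by "excluding" metadata. It is not llama-index code: `FakeNode` and `llm_text` are hypothetical stand-ins I made up, the two key strings are my assumption about the values of the llama-index constants, and the only idea borrowed from the library is a per-node `excluded_llm_metadata_keys` list.

```python
from dataclasses import dataclass, field

# Assumed values of the two llama-index constants; verify them in your install.
DEFAULT_WINDOW_METADATA_KEY = "window"
DEFAULT_OG_TEXT_METADATA_KEY = "original_text"


@dataclass
class FakeNode:
    """Hypothetical stand-in for a node; NOT the llama-index class."""
    text: str
    metadata: dict = field(default_factory=dict)
    excluded_llm_metadata_keys: list = field(default_factory=list)


def llm_text(node: FakeNode) -> str:
    """Build the text an LLM-based step would see, skipping excluded keys."""
    visible = [
        f"{key}: {value}"
        for key, value in node.metadata.items()
        if key not in node.excluded_llm_metadata_keys
    ]
    return "\n".join(visible + [node.text])


node = FakeNode(
    text="A short sentence.",
    metadata={
        DEFAULT_WINDOW_METADATA_KEY: "long surrounding window " * 50,
        DEFAULT_OG_TEXT_METADATA_KEY: "A short sentence.",
        "file_name": "report.txt",
    },
)

before = llm_text(node)

# In this sketch the exclusion list lives on the node itself,
# not on the pipeline or on the extractor.
node.excluded_llm_metadata_keys = [
    DEFAULT_WINDOW_METADATA_KEY,
    DEFAULT_OG_TEXT_METADATA_KEY,
]
after = llm_text(node)

print(len(before), len(after))
```

In the sketch, the prompt text shrinks only once the keys are set on the node. If the real extractor builds its prompt in a similar way, passing the keys to SummaryExtractor() or pipeline.run() might simply have no effect, but I have not confirmed that this is how llama-index behaves.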