Is there a way to assemble existing LlamaIndex components so that I can run PydanticProgramExtractor over only the first few nodes of every document (like TitleExtractor)? I'm trying to categorize the document based on data that's on the first page of every doc. What is the best approach?
you could... run a pydantic program over the first few nodes of a document?
I'm wondering how to accomplish this. TitleExtractor accepts a parameter called nodes but PydanticProgramExtractor doesn't.
they both accept nodes, curious where you saw that they don't πŸ‘€

But in any case, I would just use a pydantic program on its own, more control anyways

Python
from llama_index.core.bridge.pydantic import BaseModel
from llama_index.program.openai import OpenAIPydanticProgram


class Category(BaseModel):
    """A category for a piece of text."""

    name: str


program = OpenAIPydanticProgram.from_defaults(
    output_cls=Category,
    prompt_template_str="Given a piece of text, assign a category.\n\nText:\n{text}",
    verbose=True,
)

# `node` here is one of your ingested nodes
category = program.run(text=node.text)
node.metadata["category"] = category.name
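Since PydanticProgramExtractor has no equivalent of TitleExtractor's node-count setting, one workaround is to pre-select the first few nodes of each document yourself, run the program only on those, and then copy the result onto the rest. Below is a minimal sketch of that selection step. The Node dataclass is a stand-in for illustration (real LlamaIndex nodes expose ref_doc_id similarly), first_nodes_per_doc is a hypothetical helper, and the node list is assumed to arrive in document order:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a LlamaIndex node: text, source doc id, metadata."""

    text: str
    ref_doc_id: str
    metadata: dict = field(default_factory=dict)


def first_nodes_per_doc(nodes, max_nodes=3):
    """Group nodes by source document, keeping only the first few of each."""
    grouped = defaultdict(list)
    for node in nodes:
        if len(grouped[node.ref_doc_id]) < max_nodes:
            grouped[node.ref_doc_id].append(node)
    return dict(grouped)


nodes = [
    Node("page 1 of doc A", ref_doc_id="a"),
    Node("page 2 of doc A", ref_doc_id="a"),
    Node("page 1 of doc B", ref_doc_id="b"),
]
subset = first_nodes_per_doc(nodes, max_nodes=1)

# Then run the pydantic program on the kept nodes only, and propagate
# the category to every node of the same document, e.g.:
# for doc_id, doc_nodes in subset.items():
#     category = program.run(text=doc_nodes[0].text)
#     for node in nodes:
#         if node.ref_doc_id == doc_id:
#             node.metadata["category"] = category.name
```

The extractor machinery isn't needed here at all; plain iteration gives you full control over which nodes get an LLM call.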
Thanks for the response, @Logan M! I was looking at the constructor for TitleExtractor and saw nodes, but did not see a way to configure PydanticProgramExtractor to evaluate similarly (at a document level), nor in BaseExtractor.
To clarify, the nodes parameter I was talking about from TitleExtractor refers to the number of nodes from the beginning of a document. But I'm seeing what you're saying about program accepting a node.
Ahhh yea I see what you mean now