llama_index/docs/examples/metadata


hello everyone!

i'm working on adding a custom extractor to my vector query engine pipeline and looking at this notebook as reference

Plain Text

class CustomExtractor(BaseExtractor):
    def extract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list

however, i'm getting a TypeError: Can't instantiate abstract class CustomExtractor with abstract method aextract when pasting the documentation code as is

(running llama-index v 0.9.25.post1)

does anyone have pointers on writing custom extractors?

8 comments

Logan M

Dang, I need to update that example. Doing that now

Should be

Plain Text

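# import path as used in the 0.9.x docs; newer releases may differ
from llama_index.extractors import BaseExtractor


class CustomExtractor(BaseExtractor):
    # aextract is the abstract method, so it has to be defined (and be async)
    async def aextract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list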

Logan M

it got updated to be async-first
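
The TypeError itself comes from Python's abstract base class machinery rather than anything extractor-specific. Here is a minimal library-free sketch of the pattern. `BaseExtractorSketch` is a hypothetical stand-in written for illustration, not the actual llama-index class, and the real base class may wire up its sync wrapper differently.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseExtractorSketch(ABC):
    """Hypothetical stand-in for an async-first base class (not llama-index code)."""

    @abstractmethod
    async def aextract(self, nodes):
        """Subclasses must implement the async method."""

    def extract(self, nodes):
        # Sync convenience wrapper that delegates to the async method.
        return asyncio.run(self.aextract(nodes))


class SyncOnly(BaseExtractorSketch):
    # Overrides only extract(), so aextract stays abstract.
    def extract(self, nodes):
        return [{"custom": n} for n in nodes]


class AsyncFirst(BaseExtractorSketch):
    # Implements the abstract method, so the class can be instantiated.
    async def aextract(self, nodes):
        return [{"custom": n} for n in nodes]


try:
    SyncOnly()
    error_message = None
except TypeError as exc:
    # Same kind of error as in the original post.
    error_message = str(exc)

result = AsyncFirst().extract(["a", "b"])
```

In this sketch, `SyncOnly()` fails at instantiation because the abstract `aextract` is never overridden, while `AsyncFirst` works through both the async method and the inherited sync wrapper.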

enginirmata 🐲

nice, that's running!

just getting

node.metadata["document_title"]
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'document_title'

because i've overridden all the default transformers

is there a way to add my extractor to the default list?

Plain Text

    pipe = IngestionPipeline(transformations=[
        get_default_transformers(),
        CustomExtractor()
    ])
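
One thing to watch in a snippet shaped like this, independent of the library: if a helper returns a list, calling it inside another list literal nests it instead of merging it. A small sketch in plain Python, where `hypothetical_default_transformations` is a made-up helper and the strings are placeholders for real transformation objects:

```python
def hypothetical_default_transformations():
    # Hypothetical helper: stands in for any function returning a list of steps.
    return ["splitter", "title_extractor"]


custom_step = "custom_extractor"

# Nesting the call puts a list inside the list: two entries, the first one a list.
nested = [hypothetical_default_transformations(), custom_step]

# Unpacking with * yields one flat list of individual steps.
flat = [*hypothetical_default_transformations(), custom_step]
```

A pipeline that expects a flat list of transformations would need the unpacked form.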

Logan M

hmm, not sure what you mean 🤔

what does get_default_transformers() return?

enginirmata 🐲

i'm just diving into ingestion so forgive me

i'm using SimpleDirectoryReader and VectorStoreIndex to run some simple queries over data

i want to have that running 'as is' (defaults?) and add a custom extractor

things like text and code splitting, the summary, QA: i want to keep all that metadata and just add one more piece

enginirmata 🐲

Plain Text

extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    # EntityExtractor(prediction_threshold=0.5),
    # SummaryExtractor(summaries=["prev", "self"], llm=llm),
    # KeywordExtractor(keywords=10, llm=llm),
    # CustomExtractor()
]

i think the docs are showing that i should declare them manually once i start adding custom extractors

enginirmata 🐲

i just hadn't configured them before looking to add new metadata

Logan M

Hmm, there are no default extractors. At minimum you want a splitter/node parser (and an embedding model, if you want to put the nodes into a vector store)

You should be able to just add your custom one to the list 🤔
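
For the KeyError earlier in the thread, the likely cause is ordering: the custom extractor reads `document_title` and `excerpt_keywords`, so whatever writes those keys has to run before it, and it should come last in the list. A rough library-free sketch of that idea, with plain dicts standing in for nodes and plain functions standing in for extractors. All names here are hypothetical, and `.get` with a default is one way to tolerate a missing key.

```python
def title_step(nodes):
    # Stand-in for an extractor that writes "document_title".
    for node in nodes:
        node["metadata"]["document_title"] = "Example title"
    return nodes


def custom_step(nodes):
    # Reads keys written by earlier steps; .get avoids a KeyError if absent.
    for node in nodes:
        meta = node["metadata"]
        title = meta.get("document_title", "")
        keywords = meta.get("excerpt_keywords", "")
        meta["custom"] = f"{title}\n{keywords}"
    return nodes


def run_pipeline(steps, nodes):
    # Applies each step in list order, like a simple transformation pipeline.
    for step in steps:
        nodes = step(nodes)
    return nodes


def make_nodes():
    # Fresh single-node input with empty metadata.
    return [{"text": "some chunk", "metadata": {}}]


with_title = run_pipeline([title_step, custom_step], make_nodes())
without_title = run_pipeline([custom_step], make_nodes())
```

With the title step first, the custom value includes the title. Without it, the custom step still runs but only has empty strings to combine.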
