Updated 2 months ago

llama_index/docs/examples/metadata_extra...

hello everyone!

i'm working on adding a custom extractor to my vector query engine pipeline and looking at this notebook as reference

class CustomExtractor(BaseExtractor):
    def extract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list


however, i'm getting a TypeError: Can't instantiate abstract class CustomExtractor with abstract method aextract when pasting the documentation code as is

(running llama-index v 0.9.25.post1)

does anyone have pointers on writing custom extractors?
Dang, I need to update that example. Doing that now

Should be

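# import path as used in the llama-index 0.9.x docs
from llama_index.extractors import BaseExtractor


class CustomExtractor(BaseExtractor):
    # BaseExtractor declares aextract as abstract, so the override must be async
    async def aextract(self, nodes):
        # one metadata dict per node, in the same order as the input nodes
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list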
it got updated to be async-first
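For anyone wondering why the sync version fails, here is a minimal stdlib-only sketch of the mechanism. `StandInBaseExtractor` is a hypothetical stand-in, not the real llama-index class; it only assumes that the base class marks `aextract` as abstract and that the sync `extract` drives it.

```python
import asyncio
from abc import ABC, abstractmethod


class StandInBaseExtractor(ABC):
    """Hypothetical stand-in, NOT the real llama-index BaseExtractor."""

    @abstractmethod
    async def aextract(self, nodes):
        """Subclasses must implement the async method."""

    def extract(self, nodes):
        # The sync entry point just drives the async implementation.
        return asyncio.run(self.aextract(nodes))


class SyncOnlyExtractor(StandInBaseExtractor):
    # Overrides only extract(), so aextract stays abstract.
    def extract(self, nodes):
        return [{"custom": "x"} for _ in nodes]


class AsyncExtractor(StandInBaseExtractor):
    # Implements the abstract method, so the class can be instantiated.
    async def aextract(self, nodes):
        return [{"custom": n["title"]} for n in nodes]


try:
    SyncOnlyExtractor()
    error_message = None
except TypeError as exc:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    error_message = str(exc)

result = AsyncExtractor().extract([{"title": "a"}, {"title": "b"}])
```

So defining only `extract` leaves the abstract `aextract` unimplemented, which is what the TypeError is complaining about.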
nice, that's running!

just getting KeyError: 'document_title' at node.metadata["document_title"], because i've overridden all the default transformers

is there a way to add my extractor to the default list?

    pipe = IngestionPipeline(transformations=[
        get_default_transformers(),
        CustomExtractor()
    ])
hmm, not sure what you mean 🤔

what does get_default_transformers() return?
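The return type matters here. If the helper returns a list (an assumption, the thread never says), wrapping it in another list nests it instead of extending it. A small sketch with a hypothetical stand-in helper and placeholder strings instead of real transformations:

```python
def stand_in_default_transformers():
    # Hypothetical helper; the thread never shows what the real one returns.
    return ["splitter", "embed_model"]


# Nested: the first element is itself a list, not a transformation.
nested = [stand_in_default_transformers(), "custom_extractor"]

# Flat: unpacking with * puts every item at the top level.
flat = [*stand_in_default_transformers(), "custom_extractor"]
```

A list of transformations would normally need the flat form.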
i'm just diving into ingestion so forgive me

i'm using SimpleDirectoryReader and VectorStoreIndex to run some simple queries over data

i want to have that running 'as is' (defaults?) and add a custom extractor

things like text and code splitting, the summary, QA: i want to keep all that metadata and just add one more piece
extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    # EntityExtractor(prediction_threshold=0.5),
    # SummaryExtractor(summaries=["prev", "self"], llm=llm),
    # KeywordExtractor(keywords=10, llm=llm),
    # CustomExtractor()
]


i think the docs are showing that i should declare them manually once i start adding custom extractors
i just hadn't configured them before looking to add new metadata
Hmm, there are no default extractors. At minimum you want a splitter/node parser (and an embedding model, if you want to put the nodes into a vector store)

You should be able to just add your custom one to the list 🤔
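One thing to watch when adding it to the list: the custom extractor reads `document_title` and `excerpt_keywords`, so whatever writes those keys has to run earlier in the list, otherwise the KeyError comes back. Here is a rough stdlib-only sketch of that ordering. `StandInNode`, the `*_step` functions and `run_steps` are all hypothetical stand-ins, and it assumes each step's output gets merged into `node.metadata` before the next step runs.

```python
from dataclasses import dataclass, field


@dataclass
class StandInNode:
    """Hypothetical stand-in for a node: just text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def title_step(nodes):
    # Stands in for an extractor that writes "document_title".
    return [{"document_title": n.text.split()[0].title()} for n in nodes]


def keyword_step(nodes):
    # Stands in for an extractor that writes "excerpt_keywords".
    return [{"excerpt_keywords": ", ".join(n.text.split()[:2])} for n in nodes]


def custom_step(nodes):
    # Reads keys written by earlier steps; .get avoids a KeyError if one is missing.
    return [
        {
            "custom": n.metadata.get("document_title", "")
            + "\n"
            + n.metadata.get("excerpt_keywords", "")
        }
        for n in nodes
    ]


def run_steps(nodes, steps):
    # Apply each step in order and merge its output into node.metadata.
    for step in steps:
        for node, extra in zip(nodes, step(nodes)):
            node.metadata.update(extra)
    return nodes


nodes = run_steps(
    [StandInNode("vector search basics")],
    [title_step, keyword_step, custom_step],
)
custom_value = nodes[0].metadata["custom"]

# With the earlier steps left out, the custom value falls back to empty strings.
alone = run_steps([StandInNode("vector search basics")], [custom_step])
alone_value = alone[0].metadata["custom"]
```

In the extractor list posted above, the keyword extractor is commented out, so `excerpt_keywords` would likely be missing unless it is re-enabled or the custom extractor falls back to a default as in this sketch.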