Hi, I am new to llamaindex and I'm

Jimmy Phan

Hi, I am new to LlamaIndex and I'm trying to extract metadata for each node. I'm following the tutorial in the documentation and using CustomExtractor as instructed. However, it raises the error shown in the attachment. I have searched for solutions online but haven't found anything helpful. Please help me solve this.

Attachment

7 comments

WhiteFang_Jr

Hi, can you show the CustomExtractor code?

Also, are you following this tutorial? https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MetadataExtractionSEC.html

Logan M

Ah, we swapped the base class to be async-first.

We need to update that example.

In the meantime, also implement the aextract function, but just have it call self.extract() if there's nothing async about your code.

https://github.com/run-llama/llama_index/blob/fadef5f31ef6acd9b39b72103931b5eb62f98585/llama_index/extractors/interface.py#L74
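To illustrate the suggestion above, here is a minimal sketch of how an async-first base class behaves. It assumes the base class declares aextract as an abstract method; AsyncFirstExtractor, SyncOnly, and SyncWithAsyncShim are hypothetical stand-ins written with the standard library only, not llama_index code. Under that assumption, a subclass that defines only extract cannot be instantiated, while one that adds an aextract delegating to extract works through both entry points.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class AsyncFirstExtractor(ABC):
    """Hypothetical stand-in for an async-first base class (not llama_index code)."""

    @abstractmethod
    async def aextract(self, nodes: Sequence[dict]) -> List[Dict]:
        """Subclasses must provide the async entry point."""

    def extract(self, nodes: Sequence[dict]) -> List[Dict]:
        # Default sync wrapper: run the async method to completion.
        return asyncio.run(self.aextract(nodes))


class SyncOnly(AsyncFirstExtractor):
    # Only the sync method is defined, so aextract is still abstract.
    def extract(self, nodes):
        return [{"custom": n["title"]} for n in nodes]


class SyncWithAsyncShim(AsyncFirstExtractor):
    async def aextract(self, nodes):
        # Nothing here is truly async, so delegate to the sync code.
        return self.extract(nodes)

    def extract(self, nodes):
        return [{"custom": n["title"]} for n in nodes]


def try_instantiate(cls):
    """Return an instance, or the TypeError raised by the abstract-class check."""
    try:
        return cls()
    except TypeError as exc:
        return exc


failed = try_instantiate(SyncOnly)            # a TypeError instance
working = try_instantiate(SyncWithAsyncShim)  # a usable extractor
sample = [{"title": "Report"}]                # made-up sample input
sync_result = working.extract(sample)
async_result = asyncio.run(working.aextract(sample))
```

Because the subclass overrides extract itself, the delegation in aextract does not loop back into the base class's sync wrapper.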

Jimmy Phan

Yes, I'm following that tutorial step by step and ran into this error. My CustomExtractor class is exactly the same as the one in the tutorial.

Jimmy Phan

Thank you, @Logan M. I will try your suggestion. I also look forward to an update to that example.

Jimmy Phan

Hi @WhiteFang_Jr, here is my code and the type error it produces.

# read documents
documents = SimpleDirectoryReader(input_files=["data/impact-of-large-language-models-in-business--09:10:2023.txt"]).load_data()
# define llm
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
# define text splitter
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

class CustomExtractor(BaseExtractor):
    def extract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list

extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    SummaryExtractor(summaries=["prev", "self"], llm=llm),
    KeywordExtractor(keywords=10, llm=llm),
    CustomExtractor()
]

transformations = [text_splitter] + extractors
pipeline = IngestionPipeline(transformations=transformations)
nodes = pipeline.run(documents=documents, show_progress=True)

Attachment

WhiteFang_Jr

Hey!
Did you try @Logan M's suggestion?


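# Import paths assume the llama_index package layout of the commit linked above;
# newer releases may expose these names elsewhere.
from typing import Dict, List, Sequence

from llama_index.extractors import BaseExtractor
from llama_index.schema import BaseNode


class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
        """Extracts metadata for a sequence of nodes, returning a list of
        metadata dictionaries corresponding to each node.

        Args:
            nodes (Sequence[BaseNode]): nodes to extract metadata from

        """
        # Nothing in this extractor is async, so delegate to the sync method.
        return self.extract(nodes)

    def extract(self, nodes):
        # Combine the title and keywords already stored on each node.
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list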
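The extract logic above reads document_title and excerpt_keywords from each node, so it presumably relies on the earlier TitleExtractor and KeywordExtractor steps having filled those keys. The sketch below tries the same concatenation without an LLM or llama_index: FakeNode and build_custom_metadata are hypothetical names, and the metadata values are made-up sample data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a node; only the metadata dict is modeled."""
    metadata: Dict[str, str] = field(default_factory=dict)


def build_custom_metadata(nodes: List[FakeNode]) -> List[Dict[str, str]]:
    # Same concatenation as CustomExtractor.extract: title, newline, keywords.
    return [
        {
            "custom": (
                node.metadata["document_title"]
                + "\n"
                + node.metadata["excerpt_keywords"]
            )
        }
        for node in nodes
    ]


# A node that already carries both keys (made-up sample values).
complete = FakeNode(
    {"document_title": "LLMs in Business", "excerpt_keywords": "llm, business"}
)
result = build_custom_metadata([complete])

# A node missing one key: the lookup raises KeyError naming that key.
incomplete = FakeNode({"document_title": "No keywords yet"})
try:
    build_custom_metadata([incomplete])
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]
```

If the custom step ran before the extractors that populate those keys, the same KeyError would be expected, which is one reason to keep it last in the list.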

Jimmy Phan

Yes, it worked. You're my lucky charm haha 🥰
