Hi everyone, I'm trying to load

At a glance

A community member is loading several source file types with different readers and hit an error from HTMLTagReader. They subclassed the readers for HTML, JSON, PowerPoint, and XML files and passed them to a SimpleDirectoryReader through a file_extractors dictionary. A commenter points out that the dictionary must hold instances of the reader classes (for example MyHTMLTagReader()) rather than the classes themselves. The community member acknowledges the fix and thanks the commenter.

Rach

Hi everyone, I'm trying to load different types of source files using different readers. I just got this error for the HTMLTagReader:

Failed to load file NAME with error: HTMLTagReader.load_data() missing 1 required positional argument: 'file'. Skipping...

and now I'm second-guessing my function:

def document_loader(docs_relative_path):
    # Define custom readers
    ##Readers found in https://llamahub.ai/?tab=readers
    class MyHTMLTagReader(HTMLTagReader):
        pass

    class MyJSONReader(JSONReader):
        pass

    class MyPPTReader(PptxReader):
        pass

    class MyXMLReader(XMLReader):
        pass

    #Currently just for .pdf
    ##LlamaCloud account
    parser = LlamaParse(
        api_key="",
        result_type="text",
        verbose=True,
    )

    # Create custom file extractors dictionary
    file_extractors = {
        ".html": MyHTMLTagReader,
        ".json": MyJSONReader,
        ".pdf": parser,
        ".pptx, .ppt": MyPPTReader,
        ".xml": MyXMLReader,
    }

    # Initialize SimpleDirectoryReader with custom file extractors
    ## SimpleDirectoryReader reads any files it finds, treating them all as text. It explicitly supports: .csv, .docx, .epub, .hwp, .ipynb, .jpeg, .jpg, .mbox, .md, .mp3, .mp4, .pdf, .png, .ppt, .pptm, .pptx
    reader = SimpleDirectoryReader(input_dir=docs_relative_path, file_extractor=file_extractors, filename_as_id=False)

    # Load documents
    documents = reader.load_data()
    print("Number of documents loaded:", len(documents))

    # Do further processing with loaded documents
    return documents

Any tips?

2 comments

Logan M

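file_extractors = {
    ".html": MyHTMLTagReader(),
    ".json": MyJSONReader(),
    ".pdf": parser,
    # Extractors are looked up by a file's suffix, so each extension
    # needs its own key; a combined ".pptx, .ppt" key never matches.
    ".pptx": MyPPTReader(),
    ".ppt": MyPPTReader(),
    ".xml": MyXMLReader(),
}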

Need to initialize your classes here
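
For anyone wondering why the message complains about a missing 'file' argument, here is a minimal sketch that reproduces it without LlamaIndex installed. DemoReader and load_with are hypothetical stand-ins invented for illustration (they are not LlamaIndex APIs), and the sketch assumes the loader simply calls load_data on whatever object the dictionary holds.

```python
from pathlib import Path


class DemoReader:
    """Hypothetical stand-in for a reader such as HTMLTagReader."""

    def load_data(self, file):
        # A real reader would parse the file; this one only echoes its name.
        return [f"contents of {Path(file).name}"]


def load_with(extractor, file):
    """Hypothetical loader: calls load_data on whatever the mapping holds."""
    return extractor.load_data(file)


# Passing the class itself: the path is bound to `self`, so Python
# reports `file` as the missing positional argument.
try:
    load_with(DemoReader, "page.html")
except TypeError as err:
    class_error = str(err)
    print("class:", class_error)

# Passing an instance: `self` is bound automatically and the path
# arrives as `file`, which is what the loader expects.
instance_result = load_with(DemoReader(), "page.html")
print("instance:", instance_result)
```

The first call fails in the same way as the error in the question, which is why adding the parentheses in the dictionary is enough.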

Rach

Ahhhh - thank you!
