
Updated 12 months ago

Hi everyone, I'm trying to load

At a glance

The community member is trying to load different types of source files using different readers, but encountered an error with the HTMLTagReader. They have defined custom readers for various file types, including HTML, JSON, PowerPoint, and XML, and are using a SimpleDirectoryReader to load the documents. One of the comments suggests that the community member needs to initialize the custom reader classes in the file_extractors dictionary. The community member acknowledges this suggestion and thanks the commenter.

Hi everyone, I'm trying to load different types of source files using different readers. I just got this error for the HTMLTagReader: "Failed to load file NAME with error: HTMLTagReader.load_data() missing 1 required positional argument: 'file'. Skipping...", and now I'm second-guessing my function:
def document_loader(docs_relative_path):
    # Define custom readers
    ## Readers found in https://llamahub.ai/?tab=readers
    class MyHTMLTagReader(HTMLTagReader):
        pass

    class MyJSONReader(JSONReader):
        pass

    class MyPPTReader(PptxReader):
        pass

    class MyXMLReader(XMLReader):
        pass

    # Currently just for .pdf
    ## LlamaCloud account
    parser = LlamaParse(
        api_key="",
        result_type="text",
        verbose=True,
    )

    # Create custom file extractors dictionary
    file_extractors = {
        ".html": MyHTMLTagReader,
        ".json": MyJSONReader,
        ".pdf": parser,
        ".pptx, .ppt": MyPPTReader,
        ".xml": MyXMLReader,
    }

    # Initialize SimpleDirectoryReader with custom file extractors
    ## SimpleDirectoryReader reads any files it finds, treating them all as text.
    ## It explicitly supports: .csv, .docx, .epub, .hwp, .ipynb, .jpeg, .jpg, .mbox, .md, .mp3, .mp4, .pdf, .png, .ppt, .pptm, .pptx
    reader = SimpleDirectoryReader(
        input_dir=docs_relative_path,
        file_extractor=file_extractors,
        filename_as_id=False,
    )

    # Load documents
    documents = reader.load_data()
    print("Number of documents loaded:", len(documents))

    # Do further processing with loaded documents
    return documents

Any tips?
2 comments
file_extractors = {
    ".html": MyHTMLTagReader(),
    ".json": MyJSONReader(),
    ".pdf": parser,
    ".pptx, .ppt": MyPPTReader(),
    ".xml": MyXMLReader(),
}

You need to initialize your classes here: pass instances such as MyHTMLTagReader(), not the bare class MyHTMLTagReader.
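To see why the missing parentheses produce that exact error, here is a minimal sketch in plain Python. StubReader and load_with are hypothetical stand-ins, not part of LlamaIndex; the sketch only assumes that the directory loader calls load_data(file) on whatever object it finds in the dictionary.

```python
from pathlib import Path


class StubReader:
    """Hypothetical stand-in for a reader such as HTMLTagReader."""

    def load_data(self, file):
        # A real reader would parse the file; this stub only echoes its name.
        return [f"parsed {Path(file).name}"]


def load_with(extractor, file):
    """Call load_data on whatever object was registered (assumed loader behavior)."""
    return extractor.load_data(file)


# Instance: `self` is bound automatically, so the path lands in `file`.
ok = load_with(StubReader(), "page.html")

# Bare class: nothing is bound, the path fills the `self` slot,
# and Python reports that `file` is missing.
try:
    load_with(StubReader, "page.html")
    error = None
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

The second call fails with a TypeError about a missing positional argument 'file', which matches the message in the question.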
Ahhhh - thank you!
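One more thing that may be worth checking in the same dictionary: the key ".pptx, .ppt" is a single string. As far as I know, SimpleDirectoryReader picks an extractor by each file's own suffix (for example ".pptx"), so a combined key would likely never match and those files would fall back to the default reader. The sketch below illustrates the lookup with plain dictionaries; pick_extractor and the placeholder values are hypothetical, not LlamaIndex code.

```python
from pathlib import Path

# Hypothetical placeholder values; in the thread these would be reader instances.
combined = {".pptx, .ppt": "ppt_reader"}
separate = {".pptx": "ppt_reader", ".ppt": "ppt_reader"}


def pick_extractor(file_extractors, file):
    """Look up an extractor by the file's own lowercase suffix (assumed behavior)."""
    return file_extractors.get(Path(file).suffix.lower())


print(pick_extractor(combined, "slides.pptx"))  # no match for the combined key
print(pick_extractor(separate, "slides.pptx"))  # matches the ".pptx" entry
```

If that assumption holds, giving each extension its own entry that points at the same reader instance would cover both .pptx and .ppt.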