
Updated 9 months ago

Hi everyone, I'm trying to load

Hi everyone, I'm trying to load different types of source files using different readers. I just got this error for the HTMLTagReader:

Failed to load file NAME with error: HTMLTagReader.load_data() missing 1 required positional argument: 'file'. Skipping...

and now I'm second-guessing my function:
def document_loader(docs_relative_path):
    # Define custom readers
    ## Readers found in https://llamahub.ai/?tab=readers
    class MyHTMLTagReader(HTMLTagReader):
        pass

    class MyJSONReader(JSONReader):
        pass

    class MyPPTReader(PptxReader):
        pass

    class MyXMLReader(XMLReader):
        pass

    # Currently just for .pdf
    ## LlamaCloud account
    parser = LlamaParse(
        api_key="",
        result_type="text",
        verbose=True,
    )

    # Create custom file extractors dictionary
    file_extractors = {
        ".html": MyHTMLTagReader,
        ".json": MyJSONReader,
        ".pdf": parser,
        ".pptx, .ppt": MyPPTReader,
        ".xml": MyXMLReader,
    }

    # Initialize SimpleDirectoryReader with custom file extractors
    ## SimpleDirectoryReader reads any files it finds, treating them all as text.
    ## It explicitly supports: .csv, .docx, .epub, .hwp, .ipynb, .jpeg, .jpg, .mbox,
    ## .md, .mp3, .mp4, .pdf, .png, .ppt, .pptm, .pptx
    reader = SimpleDirectoryReader(
        input_dir=docs_relative_path,
        file_extractor=file_extractors,
        filename_as_id=False,
    )

    # Load documents
    documents = reader.load_data()
    print("Number of documents loaded:", len(documents))

    # Do further processing with loaded documents
    return documents

Any tips?
2 comments
file_extractors = {
        ".html": MyHTMLTagReader(),
        ".json": MyJSONReader(),
        ".pdf": parser,
        ".pptx, .ppt": MyPPTReader(),
        ".xml": MyXMLReader(),
    }


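You need to instantiate your reader classes here. The values in file_extractors should be reader instances (note the parentheses), not the classes themselves.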
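For anyone wondering why the message complains about 'file' specifically, here is a minimal sketch that reproduces it without llama_index installed. DemoReader and try_load are hypothetical stand-ins, and the sketch assumes the directory reader simply calls load_data(path) on whatever object is stored in the dictionary. When that object is the class itself, the path gets bound to self and nothing is left over for file.

```python
class DemoReader:
    """Hypothetical stand-in for a reader such as HTMLTagReader."""

    def load_data(self, file):
        # A real reader would parse the file; this just echoes the path.
        return [f"contents of {file}"]


def try_load(extractor, path):
    """Call load_data on the extractor and report a TypeError as text."""
    try:
        return extractor.load_data(path)
    except TypeError as err:
        return f"Failed to load file {path} with error: {err}"


# Class passed: the path is bound to `self`, so `file` is missing.
as_class = try_load(DemoReader, "page.html")

# Instance passed: `self` is already bound, so the path fills `file`.
as_instance = try_load(DemoReader(), "page.html")

print(as_class)
print(as_instance)
```

The first call should print a message ending in "missing 1 required positional argument: 'file'", and the second returns the loaded list.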
Ahhhh - thank you!
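One more thing that may be worth checking in the same dictionary: the key ".pptx, .ppt" is a single string. As far as I know, SimpleDirectoryReader looks extractors up by the file's suffix (treat that as an assumption and check it against your installed version), so that combined key would never match a .pptx or .ppt file. A small sketch of splitting such keys into one entry per extension follows. expand_extension_keys is a hypothetical helper, not part of llama_index, and the string values are placeholders for reader instances.

```python
from pathlib import Path


def expand_extension_keys(extractors):
    """Split keys like ".pptx, .ppt" into one dictionary entry per extension."""
    expanded = {}
    for key, reader in extractors.items():
        for ext in key.split(","):
            expanded[ext.strip().lower()] = reader
    return expanded


# Placeholder values stand in for reader instances.
raw = {".html": "html_reader", ".pptx, .ppt": "ppt_reader"}
fixed = expand_extension_keys(raw)

suffix = Path("slides.pptx").suffix  # ".pptx"
print(suffix in raw)    # False: the combined key does not equal ".pptx"
print(suffix in fixed)  # True: each extension now has its own key
```

Writing ".pptx" and ".ppt" as two separate keys by hand does the same thing.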