The community member is trying to load different types of source files using different readers, but encountered an error with the HTMLTagReader. They have defined custom readers for various file types, including HTML, JSON, PowerPoint, and XML, and are using a SimpleDirectoryReader to load the documents. One of the comments suggests that the community member needs to instantiate the custom reader classes in the file_extractors dictionary, that is, pass MyHTMLTagReader() rather than the bare class MyHTMLTagReader. The community member acknowledges this suggestion and thanks the commenter.
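Hi everyone, I'm trying to load different types of source files using different readers. I just got this error for the HTMLTagReader:
Failed to load file NAME with error: HTMLTagReader.load_data() missing 1 required positional argument: 'file'. Skipping...
Now I'm second-guessing my function:
# Imports assumed for llama-index 0.10+ (llama-index-readers-file, llama-index-readers-json, llama-parse); adjust to your installed version
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import HTMLTagReader, PptxReader, XMLReader
from llama_index.readers.json import JSONReader
from llama_parse import LlamaParse

def document_loader(docs_relative_path):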
    # Define custom readers
    ## Readers found in https://llamahub.ai/?tab=readers
    class MyHTMLTagReader(HTMLTagReader):
        pass
    class MyJSONReader(JSONReader):
        pass
    class MyPPTReader(PptxReader):
        pass
    class MyXMLReader(XMLReader):
        pass
    # Currently just for .pdf
    ## LlamaCloud account
    parser = LlamaParse(
        api_key="",
        result_type="text",
        verbose=True,
    )
    # Create custom file extractors dictionary
    # Values must be reader instances, not classes; passing the bare class is what
    # triggers "load_data() missing 1 required positional argument: 'file'"
    file_extractors = {
        ".html": MyHTMLTagReader(),
        ".json": MyJSONReader(),
        ".pdf": parser,
        # One key per extension; a combined key such as ".pptx, .ppt" never matches a file suffix
        ".pptx": MyPPTReader(),
        ".ppt": MyPPTReader(),
        ".xml": MyXMLReader(),
    }
    # Initialize SimpleDirectoryReader with custom file extractors
    ## SimpleDirectoryReader reads any files it finds, treating them all as text. It explicitly supports: .csv, .docx, .epub, .hwp, .ipynb, .jpeg, .jpg, .mbox, .md, .mp3, .mp4, .pdf, .png, .ppt, .pptm, .pptx
    reader = SimpleDirectoryReader(input_dir=docs_relative_path, file_extractor=file_extractors, filename_as_id=False)
    # Load documents
    documents = reader.load_data()
    print("Number of documents loaded:", len(documents))
    # Do further processing with loaded documents
    return documents
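For anyone hitting the same error, here is a minimal sketch of why it happens, runnable without llama_index installed. StubReader and load_one are hypothetical stand-ins, not the library's classes. The sketch assumes that SimpleDirectoryReader picks the extractor by file suffix and then calls load_data on whatever object is stored under that key.

```python
from pathlib import Path


class StubReader:
    """Hypothetical stand-in for a file reader such as HTMLTagReader."""

    def load_data(self, file, extra_info=None):
        return [f"loaded {Path(file).name}"]


def load_one(path, file_extractor):
    """Hypothetical, simplified dispatch: pick the extractor by file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in file_extractor:
        return ["fallback: read as plain text"]
    return file_extractor[suffix].load_data(Path(path))


# Mistake 1: registering the class. load_data is then called without an
# instance, so the path is bound to `self` and `file` is reported missing.
error_message = None
try:
    load_one("page.html", {".html": StubReader})
except TypeError as exc:
    error_message = str(exc)

# Fix: register an instance.
fixed = load_one("page.html", {".html": StubReader()})

# Mistake 2: a combined key never equals a real suffix such as ".pptx",
# so the custom reader is silently skipped.
combined = load_one("deck.pptx", {".pptx, .ppt": StubReader()})

# Fix: one key per extension.
separate = load_one("deck.pptx", {".pptx": StubReader(), ".ppt": StubReader()})

print(error_message)
print(fixed, combined, separate)
```

The first case reproduces the same kind of TypeError as in the question. The combined-key case raises nothing, so it is worth checking that each extension in the dictionary is its own key.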