----------------------
The SimpleDirectoryReader in LlamaIndex is designed to load data from local files into LlamaIndex. It supports a variety of file types, including .pdf and .html. However, the documentation does not explicitly mention how it handles images within these files.
By default, SimpleDirectoryReader will try to read any files it finds, treating them all as text. In addition to plain text, it explicitly supports the following file types, which are automatically detected based on file extension:
- .csv - comma-separated values
- .docx - Microsoft Word
- .epub - EPUB ebook format
- .hwp - Hangul Word Processor
- .ipynb - Jupyter Notebook
- .jpeg, .jpg - JPEG image
- .mbox - MBOX email archive
- .md - Markdown
- .mp3, .mp4 - audio and video
- .pdf - Portable Document Format
- .png - Portable Network Graphics
- .ppt, .pptm, .pptx - Microsoft PowerPoint
For JSON files, it is recommended to use the JSON Loader.
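As a rough illustration of the extension-based detection described above, here is a minimal sketch. The helper names `split_by_handling` and `load_documents` and the `DEDICATED_EXTS` set are hypothetical (they are not part of LlamaIndex), and the import path assumes a recent release where the reader is exposed from `llama_index.core`; older versions may import it from `llama_index` instead.

```python
from pathlib import Path

# Extensions copied from the list above; by the default behavior described
# there, anything else would be read as plain text.
DEDICATED_EXTS = {
    ".csv", ".docx", ".epub", ".hwp", ".ipynb", ".jpeg", ".jpg",
    ".mbox", ".md", ".mp3", ".mp4", ".pdf", ".png",
    ".ppt", ".pptm", ".pptx",
}


def split_by_handling(paths):
    """Partition file names into (dedicated, plain_text) by extension.

    Hypothetical helper: it only mirrors the list above and does not
    call into LlamaIndex.
    """
    dedicated, plain_text = [], []
    for p in map(Path, paths):
        bucket = dedicated if p.suffix.lower() in DEDICATED_EXTS else plain_text
        bucket.append(p.name)
    return sorted(dedicated), sorted(plain_text)


def load_documents(input_dir, exts=(".pdf", ".md")):
    """Load only files with the given extensions via SimpleDirectoryReader.

    The import is deferred so the helper above works even when
    LlamaIndex is not installed (assumed import path, see lead-in).
    """
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(
        input_dir=input_dir,
        required_exts=list(exts),  # restrict loading to these extensions
        recursive=True,            # also walk subdirectories
    )
    return reader.load_data()
```

Running `split_by_handling` over a directory listing first can show which files (for example .html or .json) are not in the list and would fall back to plain-text reading.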
However, it's important to note that while SimpleDirectoryReader can read these file types, the documentation does not explicitly mention whether it can extract or handle images or other non-text content embedded within these files. If your PDF or HTML files contain images, you may need additional tools or methods to extract the text from these images, which is not covered in the provided knowledge sources.
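Before reaching for extra tooling, it can help to check whether an HTML file contains images at all. The sketch below uses only Python's standard library; `ImageCollector` and `find_images` are hypothetical names, and the code only lists `<img>` sources and alt text. It does not perform OCR or any image-to-text extraction.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed here by default.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


def find_images(html_text):
    """Return the (src, alt) pairs found in an HTML string."""
    collector = ImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.images
```

Files for which `find_images` returns a non-empty list are the ones where text inside images may be missed by a text-only load.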
Here is the source for your reference.
----------------------