Updated 2 years ago

I want my app to use both pdf files and

I want my app to use both PDF and PPTX files as documents. How can I combine them? I know this code loads a PPTX file:

from pathlib import Path
from llama_index import download_loader
PptxReader = download_loader("PptxReader")
loader = PptxReader()
documents = loader.load_data(file=Path('./deck.pptx'))

But how do I make sure that documents uses both the normal SimpleDirectoryReader and this reader?
11 comments
You can pass in a map of file-extension->loader into the simple directory reader

Plain Text
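from llama_index import SimpleDirectoryReader
from llama_index.readers.file.base import DEFAULT_FILE_EXTRACTOR

# Copy the defaults so the shared mapping is not mutated, then register the pptx reader.
file_extractor = dict(DEFAULT_FILE_EXTRACTOR)
file_extractor[".pptx"] = PptxReader()
docs = SimpleDirectoryReader("./data", file_extractor=file_extractor).load_data()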
Is this code still usable? I am getting

ModuleNotFoundError: No module named 'llama_index.readers.file.base'

and

pip install llama-index-readers-base
ERROR: Could not find a version that satisfies the requirement llama-index-readers-base (from versions: none)
ERROR: No matching distribution found for llama-index-readers-base
It would be from llama_index.core.readers.base import .. in this case
No idea why that pip install fails for you, but that's for the actual file based readers

https://pypi.org/project/llama-index-readers-file/
I just tried that and it failed too. Searching for DEFAULT_FILE_EXTRACTOR in the GitHub codebase returns only one result.
ImportError: cannot import name 'DEFAULT_FILE_EXTRACTOR' from 'llama_index.readers.file'
ImportError: cannot import name 'DEFAULT_FILE_EXTRACTOR' from 'llama_index.core.readers.base'

And base doesn't seem to have any DEFAULT_FILE_EXTRACTOR defined
The default mapping now gets built on the fly, so there is no DEFAULT_FILE_EXTRACTOR constant to import anymore.
Maybe I should open a new thread, as I am trying to make XMLReader() work with multiple files and, judging from the link you shared, it is not included in default_file_reader_cls.
Indeed it is not. You just have to add it:

Plain Text
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import XMLReader

file_extractor = {".xml": XMLReader()}

documents = SimpleDirectoryReader(..., file_extractor=file_extractor).load_data()
Thanks for the example! I was trying to follow the docs at https://llamahub.ai/l/readers/llama-index-readers-file?from=all and, from the info there, it's not very clear which readers are already built in and which ones are not.
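
For anyone landing here later, the idea behind the file_extractor argument can be illustrated without llama_index at all. The sketch below is not the library's implementation. It is a minimal pure-Python illustration, under the assumption that the directory reader picks a reader by file extension and falls back to a default for anything not in the map. All names in it (load_directory, read_plain_text, read_xml_stub, demo) are hypothetical.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Callable, Dict, List

# Hypothetical reader type: takes a file path, returns a list of text "documents".
Reader = Callable[[Path], List[str]]


def read_plain_text(path: Path) -> List[str]:
    """Fallback reader: return the whole file as a single text document."""
    return [path.read_text(encoding="utf-8")]


def read_xml_stub(path: Path) -> List[str]:
    """Stand-in for a custom reader; a real one would parse the XML."""
    return [f"xml:{path.name}"]


def load_directory(directory: Path, file_extractor: Dict[str, Reader]) -> List[str]:
    """Load every file in a directory, choosing a reader by file extension."""
    documents: List[str] = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        # Extensions present in the map use the custom reader;
        # everything else falls back to the default.
        reader = file_extractor.get(path.suffix.lower(), read_plain_text)
        documents.extend(reader(path))
    return documents


def demo() -> List[str]:
    """Build a throwaway directory with two file types and load both."""
    with TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("hello", encoding="utf-8")
        (root / "b.xml").write_text("<r/>", encoding="utf-8")
        return load_directory(root, {".xml": read_xml_stub})


if __name__ == "__main__":
    print(demo())
```

The point of the design is that you only list the extensions you want to override or add (for example ".pptx" or ".xml"), while other file types such as PDFs keep going through whatever default handling is available.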