I want my app to use both pdf files and

I want my app to use both pdf files and pptx as documents. How can I combine them? I know there is this code to accept pptx:

from pathlib import Path
from llama_index import download_loader
PptxReader = download_loader("PptxReader")
loader = PptxReader()
documents = loader.load_data(file=Path('./deck.pptx'))

But how do I make sure that documents uses both the normal SimpleDirectoryReader and this reader?

11 comments

Logan M

You can pass in a map of file-extension->loader into the simple directory reader

from llama_index.readers.file.base import DEFAULT_FILE_EXTRACTOR

file_extractor = DEFAULT_FILE_EXTRACTOR
file_extractor.update(
{
    ".pptx": PptxReader()
})
docs = SimpleDirectoryReader("./data", file_extractor=file_extractor).load_data()

elsatch

Is this code still usable? I am getting

ModuleNotFoundError: No module named 'llama_index.readers.file.base'

and

pip install llama-index-readers-base
ERROR: Could not find a version that satisfies the requirement llama-index-readers-base (from versions: none)
ERROR: No matching distribution found for llama-index-readers-base

Logan M

It would be from llama_index.core.readers.base import .. in this case

Logan M

No idea why that pip install fails for you, but that's for the actual file based readers

https://pypi.org/project/llama-index-readers-file/

elsatch

I just tried that and it failed too. Searching for DEFAULT_FILE_EXTRACTOR returns only one result in the GitHub codebase:

Attachment

elsatch

ImportError: cannot import name 'DEFAULT_FILE_EXTRACTOR' from 'llama_index.readers.file'
ImportError: cannot import name 'DEFAULT_FILE_EXTRACTOR' from 'llama_index.core.readers.base'

And base doesn't seem to have any DEFAULT_FILE_EXTRACTOR defined

Attachment

Logan M

ah yea that got removed, forgot
https://github.com/run-llama/llama_index/blob/0ae69d46e3735a740214c22a5f72e05d46d92635/llama-index-core/llama_index/core/readers/file/base.py#L20

Logan M

It needs to get built on the fly
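
A minimal sketch of what "built on the fly" could look like, in plain Python with stand-in classes. StubReader, build_default_extractors and build_file_extractor are hypothetical names for illustration, not llama_index APIs. The idea is only that the default extension-to-reader map is created when it is asked for, and custom entries are layered on top:

```python
from typing import Dict, Optional


class StubReader:
    """Stand-in for a real reader class; hypothetical, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name


def build_default_extractors() -> Dict[str, StubReader]:
    # Built fresh on every call instead of living in a module-level constant.
    return {".pdf": StubReader("pdf"), ".docx": StubReader("docx")}


def build_file_extractor(
    custom: Optional[Dict[str, StubReader]] = None,
) -> Dict[str, StubReader]:
    # Start from the freshly built defaults, then let custom entries override them.
    extractors = build_default_extractors()
    extractors.update(custom or {})
    return extractors


file_extractor = build_file_extractor({".pptx": StubReader("pptx")})
print(sorted(file_extractor))  # ['.docx', '.pdf', '.pptx']
```

Because every call returns a new dict, updating the result never mutates shared state, which the earlier DEFAULT_FILE_EXTRACTOR.update(...) pattern did.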

elsatch

Maybe I should open a new thread as I am trying to make XMLReader() work with multiple files, and from this link you shared, it is not included in the default_file_reader_cls

Logan M

indeed it is not. You just have to add it

file_extractor = {".xml": XMLReader()}

documents = SimpleDirectoryReader(..., file_extractor=file_extractor).load_data()
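
For a current install, the snippet above might be wrapped roughly as follows. This is a sketch under assumptions: llama-index-core and llama-index-readers-file are installed, XMLReader is importable from llama_index.readers.file, and normalize_extension and load_directory are hypothetical helper names, not library functions. From the linked base.py it appears that extensions missing from the map still fall back to the built-in readers, so only the extra ones need to be listed:

```python
from typing import Any, Dict, List


def normalize_extension(ext: str) -> str:
    # Keys are compared with file suffixes, so use the ".xml" form:
    # lowercase with a leading dot.
    ext = ext.strip().lower()
    return ext if ext.startswith(".") else "." + ext


def load_directory(data_dir: str, extra_readers: Dict[str, Any]) -> List[Any]:
    # Imported lazily so normalize_extension works without llama_index installed.
    from llama_index.core import SimpleDirectoryReader

    file_extractor = {normalize_extension(k): v for k, v in extra_readers.items()}
    return SimpleDirectoryReader(data_dir, file_extractor=file_extractor).load_data()


# Usage (needs llama-index-core and llama-index-readers-file):
#   from llama_index.readers.file import XMLReader  # assumed import path
#   documents = load_directory("./data", {"xml": XMLReader()})
```

The same call should cover the original question by passing {".pptx": PptxReader()} instead, assuming PptxReader and its extra dependencies are installed.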

elsatch

Thanks for the example! I was trying to follow the docs at https://llamahub.ai/l/readers/llama-index-readers-file?from=all and from the info there, it's not very clear which ones are already built in and which ones are not
