A reader that takes in a python file object instead of ...

At a glance

A community member who is new to LlamaIndex is looking for a Reader that accepts a Python file object instead of a file path. After reading the SimpleDirectoryReader source, they believe they could override some functions to build a custom reader, and they ask whether a ready-made solution exists or which functions they would need to override.

The thread covers two options. CodeSplitter is suggested first, but the asker points out that it targets source code, while their inputs are arbitrary files. The second option is a custom reader that subclasses BaseReader and overrides the load_data function.

The asker implements a basic custom reader class called PythonFileObjectReader, but hits a validation error: the Document class expects a string for its text, while they are passing in the raw bytes returned by b64decode. This works for some files and fails for others, and they are unsure whether decoding the bytes as UTF-8 is valid across all file types.

There is no explicitly marked answer. The final suggestion is to debug the input and ensure that whatever reaches the Document class is a valid string.

AltairSama2

hey folks, I am pretty new to LlamaIndex in general, but I was wondering if there is a Reader that takes in a Python file object instead of a file path and returns the output? I was looking into the source code for SimpleDirectoryReader and it looks clear enough how to override some functions and create one, but is there any ready-made reader, or even some idea of which functions I'll need to override for it to work properly? Appreciate any help on this, thanks!

16 comments

WhiteFang_Jr

You can use CodeSplitter, which helps split code in the correct format.
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#codesplitter

AltairSama2

hey, I just looked at it and it looks like it's for programming languages and specific code? Our files are not source code, they can be any file. I just wanted a way to feed it something like a BytesIO object and have it return the parsed content to me

AltairSama2

right now the workaround is to create a temp file, pass the path explicitly and then delete it afterwards. That's a lot of unnecessary IO, especially at scale. I just wanted to remove this middleman

WhiteFang_Jr

Ah, then you probably have to write your own custom reader that handles your requirements.

WhiteFang_Jr

Something like this:

Plain Text

from io import BytesIO
from typing import List

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class CustomReader(BaseReader):
    def load_data(self, file_obj: BytesIO, **kwargs) -> List[Document]:
        # Read the raw bytes from the in-memory file object (no temp file needed)
        raw = file_obj.read()
        # Document text must be a str; decoding as UTF-8 is an assumption that only holds for text files
        content = raw.decode("utf-8")
        # Create a Document object with the processed content
        document = Document(text=content)
        return [document]


# byte_data holds the raw file bytes
documents = CustomReader().load_data(BytesIO(byte_data))

Then you can pass the documents directly to the index creation step

WhiteFang_Jr

No need for SimpleDirectoryReader then

AltairSama2

Hey, thanks! Appreciate it. So just to confirm, I just need to override load_data while subclassing BaseReader, right?

WhiteFang_Jr

Yes

AltairSama2

hey, I ended up implementing a basic class but I am facing some weird issues with it

Plain Text

class PythonFileObjectReader(BaseReader):
    def __init__(self):
        super().__init__()

    def load_data(self, file_content, **kwargs) -> list[Document]:
        document = Document(text=file_content)  # file_content is the bytes output of b64decode
        return [document]

file_content is just a b64-decoded file (in bytes format, basically the output of b64decode). It works perfectly for some files, but for others I get this pydantic error that I can't make sense of in LlamaIndex's context

Plain Text

ValidationError: 1 validation error for MediaResource
text
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'PK\x03\x04\x14\x00\x06\...00x\xaa\x1d\x00\x00\x00', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.9/v/string_unicode

do you have any idea what could be the issue?

AltairSama2

I know that we can simply decode the bytes as UTF-8, but I'm not sure this encoding is valid across all file types

WhiteFang_Jr

This error is for the MediaResource class, which is not in the above code

AltairSama2

that's an internal llama schema class, called by Document

WhiteFang_Jr

Sorry, my bad!
Yeah, so the required type for text is string, and the data you are trying to add is of a different type.
You could add a check before insertion, for instance, and if the data is not a string, convert it from bytes to string first and then add it.
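
A check like the one suggested here could look roughly as follows. This is a minimal sketch, not LlamaIndex API: `ensure_text` is a hypothetical helper name, and it assumes text payloads are UTF-8. It passes strings through, decodes bytes strictly, and raises a clear error for binary content instead of letting validation fail later.

```python
def ensure_text(data, encoding="utf-8"):
    """Return `data` as a str, decoding bytes-like input strictly.

    Hypothetical helper (not part of LlamaIndex); assumes text payloads
    are encoded with `encoding`.
    """
    if isinstance(data, str):
        return data
    if isinstance(data, (bytes, bytearray, memoryview)):
        try:
            return bytes(data).decode(encoding)
        except UnicodeDecodeError as exc:
            # Binary formats (archives, images, office files) end up here.
            raise ValueError(
                f"payload is not valid {encoding} text; it needs a "
                "format-specific parser instead of a plain decode"
            ) from exc
    raise TypeError(f"expected str or bytes, got {type(data).__name__}")
```

With this in place, `Document(text=ensure_text(file_content))` would only ever receive a string, and non-text files would fail early with a readable message.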

AltairSama2

that's the thing, since I am passing in the output of b64decode it'll always be bytes, right? But it works in some cases and in some it doesn't

AltairSama2

and the only thing I can figure is that the encoding is not being handled properly
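
One likely explanation: the `b'PK\x03\x04...'` prefix in the error above is the ZIP signature, which formats such as .docx and .xlsx also use, so the failing inputs are probably binary files rather than badly encoded text. Below is a rough sketch of sniffing the payload before decoding. `classify_payload` and its labels are hypothetical, and the checks cover only a few formats.

```python
import io
import zipfile


def classify_payload(raw: bytes) -> str:
    """Roughly classify decoded bytes as 'zip', 'pdf', 'text' or 'binary'.

    Hypothetical helper for illustration; real-world detection needs
    more signatures than the few checked here.
    """
    # ZIP containers (including .docx/.xlsx/.pptx) start with b"PK\x03\x04".
    if raw.startswith(b"PK\x03\x04") and zipfile.is_zipfile(io.BytesIO(raw)):
        return "zip"
    # PDF files start with the b"%PDF-" header.
    if raw.startswith(b"%PDF-"):
        return "pdf"
    try:
        # Anything that decodes cleanly is treated as UTF-8 text.
        raw.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "binary"
```

Payloads classified as text can be decoded and passed to `Document(text=...)`. The others need a format-specific parser that accepts bytes or a file-like object before any text reaches Document.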

WhiteFang_Jr

Not entirely sure about this. You'll have to debug it.
But yeah, on the Document side it will always need string data

