
Updated 2 weeks ago

A reader that takes in a Python file object instead of a file path

hey folks, I am pretty new to LlamaIndex in general, but I was wondering if there is a Reader that takes in a Python file object instead of a file path and returns the parsed output? I was looking into the source code for SimpleDirectoryReader and it looks clear enough how to override some functions and create one, but is there a ready-made reader, or at least some idea of which functions I'd need to override for it to work properly? Appreciate any help on this, thanks!
16 comments
You can use CodeSplitter, which splits source code into correctly structured chunks.
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#codesplitter
hey, I just looked at it and it looks like it's for programming languages and source code specifically? our files are not source code, they can be any file type. I just wanted a way to feed it something like a BytesIO object and get the parsed content back
right now the workaround is to create a temp file, pass the path explicitly, and then delete it afterwards. that's a lot of unnecessary IO, especially at scale, so I just wanted to remove this middleman
Ah then you probably have to write your own custom reader which can handle your requirements.
Something like this:

Plain Text
from typing import List

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class CustomReader(BaseReader):
    def load_data(self, byte_data: bytes, **kwargs) -> List[Document]:
        # Extract the content from your bytes / BytesIO object here.
        # Assumption: the file is UTF-8 text; binary formats need their own parser.
        content = byte_data.decode("utf-8")
        # Create a Document object with the processed content
        document = Document(text=content)
        return [document]


# byte_data is the raw content of your file; load_data needs an instance
documents = CustomReader().load_data(byte_data)

Then you can pass the documents directly to the index creation step
No need for SimpleDirectoryReader then
Hey, thanks! appreciate it. so just to confirm, I just need to override load_data while subclassing BaseReader, right?
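To make the "no temp file" part concrete, here is a minimal sketch of the input handling such a `load_data` override might start with. It only uses the standard library; `read_all_bytes` is a hypothetical helper name, not part of LlamaIndex, and it assumes the caller passes raw bytes, a path, or an already-open file object.

```python
import io
from pathlib import Path
from typing import IO, Union

# Hypothetical input type: raw bytes, a path, or an open file-like object.
Source = Union[bytes, bytearray, str, Path, IO]


def read_all_bytes(source: Source) -> bytes:
    """Return the full content of `source` as bytes without writing a temp file."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        # A path is still supported, so existing callers keep working.
        return Path(source).read_bytes()
    # Anything else is treated as a file-like object (BytesIO, open(..., "rb"), ...).
    data = source.read()
    if isinstance(data, str):
        # Text-mode file objects return str; normalize to UTF-8 bytes.
        data = data.encode("utf-8")
    return data


payload = read_all_bytes(io.BytesIO(b"hello from memory"))
```

Inside a custom reader, the returned bytes would then be parsed or decoded into a string before building the `Document`.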
hey, I ended up implementing a basic class but I am facing some weird issues with it

Plain Text
class PythonFileObjectReader(BaseReader):
    def __init__(self):
        super().__init__()

    def load_data(self, file_content, **kwargs) -> list[Document]:
        document = Document(text=file_content)  # file_content is the bytes output of b64decode
        return [document]

file_content is just a b64-decoded file (in bytes format, basically the output of b64decode). it works perfectly for some files, but for others I get this pydantic error that I can't make sense of in LlamaIndex's context

Plain Text
ValidationError: 1 validation error for MediaResource
text
  Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'PK\x03\x04\x14\x00\x06\...00x\xaa\x1d\x00\x00\x00', input_type=bytes]
    For further information visit https://errors.pydantic.dev/2.9/v/string_unicode

do you have any idea what could be the issue?
I know that we can simply decode the bytes with utf-8, but I'm not sure if this encoding is valid across all file types
This error is for the MediaResource class, which is not in the code above
that's an internal LlamaIndex schema class, used by Document
sorry my bad!
Yeah, so the required type for text is string, and the data you are trying to add is of a different type.
You could add a check before insertion: if the value is not a string, convert it from bytes to string first and then add it
that's the thing, since I am passing in the output of b64decode it'll always be bytes, right? but it works in some cases and in others it doesn't
and the only thing I can figure is that the decoding is not being done properly
Not entirely sure about this. You'll have to debug it.
But yeah, on the Document side it will always need string data
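A likely explanation for "works for some files, fails for others": Pydantic will accept bytes for a string field when they happen to be valid UTF-8, so plain-text files pass, while binary files fail with `string_unicode`. The `b'PK\x03\x04'` prefix in the error above is the ZIP signature, which is also how .docx, .xlsx, and .pptx files start, so that input was never text in the first place. Below is a minimal sketch of an explicit check to run before building the `Document`; `bytes_to_text` and `ZIP_MAGIC` are hypothetical names, and treating every non-UTF-8 input as an error is an assumption, not LlamaIndex behavior.

```python
# Signature at the start of ZIP archives (and ZIP-based Office files).
ZIP_MAGIC = b"PK\x03\x04"


def bytes_to_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode text-based file content, failing loudly on binary input."""
    if data.startswith(ZIP_MAGIC):
        # ZIP containers need a format-specific parser, not a text decode.
        raise ValueError("ZIP-based file detected; decode it with a format-specific parser")
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(f"content is not valid {encoding} text") from exc


text = bytes_to_text(b"plain text survives the round trip")
```

The resulting string can be passed as `Document(text=text)`. For files that raise here, the bytes have to go through a parser for that format first, and only the extracted text should reach the `Document`.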