Is there a `Reader` that takes in a Python file object instead of a file path and returns the output? I was looking into the source code for `SimpleDirectoryReader`, and it looks clear enough how to override some functions and create one, but is there any ready-made reader, or even some idea of which functions I'll need to override for it to work properly? Appreciate any help on this, thanks!

Ideally I pass in a `BytesIO` object, and it can return the parsed content to me.

    from io import BytesIO
    from typing import List

    from llama_index.core.readers.base import BaseReader
    from llama_index.core.schema import Document


    class CustomReader(BaseReader):
        def load_data(self, byte_data: BytesIO, **kwargs) -> List[Document]:
            # Implement your custom logic to read and process the file.
            # Not `with open(file_path, 'r') as file: content = file.read()`:
            # at this place, extract the content from your BytesIO object
            # and create the Document object.
            content = byte_data.read()  # note: read() on a BytesIO returns bytes

            # Create a Document object with the processed content
            document = Document(text=content)
            return [document]


    # load_data is an instance method, so instantiate the reader first
    documents = CustomReader().load_data(byte_data)

    class PythonFileObjectReader(BaseReader):
        def __init__(self):
            super().__init__()

        def load_data(self, file_content, **kwargs) -> list[Document]:
            # file_content is the b64-decoded file content (bytes)
            document = Document(text=file_content)
            return [document]

`file_content` is just a b64-decoded file (in bytes format, basically the output of `b64decode`), and it works perfectly for some files, but for other files I get this pydantic error that I can't make sense of in Llama's context:

    ValidationError: 1 validation error for MediaResource
    text
      Input should be a valid string, unable to parse raw data as a unicode string [type=string_unicode, input_value=b'PK\x03\x04\x14\x00\x06\...00x\xaa\x1d\x00\x00\x00', input_type=bytes]
        For further information visit https://errors.pydantic.dev/2.9/v/string_unicode
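
A minimal sketch of why the same code can work for some files and fail for others, using only the standard library. The likely explanation (an assumption, not confirmed in this thread) is that pydantic accepts `bytes` for a `str` field only when they decode as UTF-8. The `PK\x03\x04` prefix in the error is the ZIP local file header signature, which suggests a ZIP-based file such as a `.docx` or `.xlsx`. The binary sample below is made up from the visible ends of the error message, not the real file, and `try_decode_utf8` is a hypothetical helper.

```python
import base64
from typing import Optional


def try_decode_utf8(raw: bytes) -> Optional[str]:
    """Return the text if raw is valid UTF-8, otherwise None."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Two base64 payloads: one plain-text file, one binary (ZIP-like) file.
text_payload = base64.b64encode("plain text file".encode("utf-8"))
# Made-up sample: only the bytes visible in the error message, glued together.
binary_payload = base64.b64encode(b"PK\x03\x04\x14\x00\x06\x00x\xaa\x1d\x00\x00\x00")

# b64decode always returns bytes, whatever the file type was.
text_bytes = base64.b64decode(text_payload)
binary_bytes = base64.b64decode(binary_payload)

print(try_decode_utf8(text_bytes))         # decodes cleanly
print(try_decode_utf8(binary_bytes))       # None: 0xaa is not valid UTF-8 here
print(binary_bytes[:4] == b"PK\x03\x04")   # ZIP signature check
```

So the type is always `bytes`; what differs between files is whether those bytes happen to be valid UTF-8 text.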

I could decode it as `utf-8`, but I'm not sure this encoding is valid across all file types.

`text` is a string, and the data you are trying to add is of a different type.

With `b64decode` it'll always be bytes, right? But it works in some cases and in some it doesn't.
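
One way to make the failure explicit is to decode the stream yourself before building the document, rather than handing raw bytes to a string field. This is a rough sketch under stated assumptions: `read_text` is a hypothetical helper, UTF-8 is only a default guess, and it handles plain-text content only. Binary formats would still need a format-specific parser.

```python
from io import BytesIO
from typing import BinaryIO


def read_text(stream: BinaryIO, encoding: str = "utf-8") -> str:
    """Read a binary file-like object and return its content as text.

    Raises ValueError when the bytes are not valid text in `encoding`,
    which is what binary formats (ZIP-based files, images, ...) will hit.
    """
    raw = stream.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"content is not valid {encoding} text; use a format-specific parser"
        ) from exc


# Usage with an in-memory file object.
text = read_text(BytesIO("hello, world".encode("utf-8")))
print(text)
# The resulting str is what you would then pass on, e.g. Document(text=text).
```

Inside a custom reader, the same call would go where the content is extracted, so a binary upload fails with a readable message instead of a pydantic validation error.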