Is there a documented method to use Azure container storage for your documents?

Is there a documented method to use Azure container storage for your documents? I'm having trouble finding an example of reading files into the index that way. I've been using SimpleDirectoryReader up until now, but I'm curious if there's a way to stream the files from ACS instead.
@Logan M my current approach is to download all of the files in a list to a temp directory with unique ids, then pass their paths to SimpleDirectoryReader. I'm getting some weird effects, though. Does the document reader rely on the file extension to determine file types and processing? A PDF without the extension produced 240 vectors instead of 13 (which previously worked in queries).
Additionally, after doing documents = SimpleDirectoryReader(input_files=file_paths).load_data() with a single element in file_paths, it returns 13 documents. I don't understand why.
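For reference, a minimal sketch of that download step using the azure-storage-blob SDK; the container name, the connection-string environment variable, and the file_paths/azure_path lists are assumptions (the latter two match the snippet further down):
Python
import os
import tempfile
import uuid

from azure.storage.blob import ContainerClient

# Hypothetical connection details -- substitute your own container
# name and connection string.
container = ContainerClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="documents",
)

tmp_dir = tempfile.mkdtemp()
file_paths, azure_path = [], []
for blob in container.list_blobs():
    # Preserve the original extension: the reader uses it to pick
    # a file parser (see the answer below).
    ext = os.path.splitext(blob.name)[1]
    local_path = os.path.join(tmp_dir, f"{uuid.uuid4()}{ext}")
    with open(local_path, "wb") as f:
        f.write(container.download_blob(blob.name).readall())
    file_paths.append(local_path)
    azure_path.append(blob.name)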
Yes, it does use the extension.

PDF files get split into one document object per page.
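A quick way to see the per-page splitting (report.pdf is a hypothetical 13-page file):
Python
from llama_index import SimpleDirectoryReader

# One Document object comes back per PDF page
docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()
print(len(docs))  # 13 for a 13-page PDF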
How do you update the doc_id given that design, since they're all part of the same doc?
And if you load multiple files, you won't know which array index is which doc.
If you use the filename_as_id=True parameter, it just appends a part_X suffix to each doc id, so the ids are still deterministic.

Or you could devise your own scheme for the doc ids.
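For example, with the legacy llama_index package (the exact suffix format is the filename plus part_X, as described above):
Python
from llama_index import SimpleDirectoryReader

docs = SimpleDirectoryReader(
    input_files=["report.pdf"], filename_as_id=True
).load_data()
for doc in docs:
    print(doc.doc_id)  # e.g. "report.pdf_part_0", "report.pdf_part_1", ...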
Cool. I did something like this:
Python
import logging

from llama_index import SimpleDirectoryReader

documents = []
for x, file_path in enumerate(file_paths):
    docs = SimpleDirectoryReader(input_files=[file_path]).load_data()

    # Give each per-page document a deterministic id based on
    # the file's Azure path and its part index
    logging.info("Adding doc_id to documents")
    for i, doc in enumerate(docs):
        doc.doc_id = f"{azure_path[x]}_part_{i}"

    documents.extend(docs)
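A sketch of why the deterministic ids pay off, assuming a standard VectorStoreIndex: refresh_ref_docs can upsert re-downloaded documents instead of re-embedding everything.
Python
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

# On a later sync, documents whose ids and content are unchanged
# are skipped; changed or new ones are re-inserted.
index.refresh_ref_docs(documents)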