
Updated last year

Is there a documented method to use

Is there a documented method to use Azure container storage for your documents? I'm having trouble finding an example of reading files into the index that way. I've been using SimpleDirectoryReader up until now, but I'm curious whether there's a way to stream the files from ACS instead.
@Logan M My current approach is to download all of the files in a list to a temp directory with unique ids, then pass their paths to SimpleDirectoryReader. I'm getting some odd results, though. Does the reader rely on the file extension to determine the file type and how to process it? A PDF saved without its extension produced 240 vectors instead of 13 (the 13-vector version previously worked in queries).
Additionally, documents = SimpleDirectoryReader(input_files=file_paths).load_data() with a single element in file_paths returns 13 documents, and I don't understand why.
Yes, it does use the extension.

PDF files get split into one Document object per page, which is why a single PDF can come back as several documents.
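Since the reader picks its parser from the file extension, the temp-directory approach described above needs to keep each file's suffix when it assigns a unique local name. The following is only a rough sketch of that download step, not a documented LlamaIndex or Azure recipe. The function name download_blobs_to_dir is hypothetical, and container_client is assumed to behave like azure.storage.blob.ContainerClient (list_blobs() yielding items with a .name, and download_blob(name).readall() returning bytes).

```python
import uuid
from pathlib import Path, PurePosixPath


def download_blobs_to_dir(container_client, target_dir):
    """Download every blob into target_dir and return {local_path: blob_name}.

    Assumption: `container_client` behaves like
    azure.storage.blob.ContainerClient, i.e. list_blobs() yields objects
    with a .name attribute and download_blob(name).readall() returns bytes.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    path_to_blob = {}
    for blob in container_client.list_blobs():
        # Keep the original suffix (".pdf", ".docx", ...) so the reader can
        # pick the right parser; the uuid only guards against name clashes.
        suffix = PurePosixPath(blob.name).suffix
        local_path = target / f"{uuid.uuid4().hex}{suffix}"
        local_path.write_bytes(container_client.download_blob(blob.name).readall())
        path_to_blob[str(local_path)] = blob.name
    return path_to_blob
```

The keys of the returned mapping can then be passed as input_files to SimpleDirectoryReader, and the values give the original blob path for each local file, which is handy when building doc ids later.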
How do you update the doc_id given that design, since all the pages come from the same doc?
And if you load multiple files, you won't know which array index belongs to which doc.
If you use the filename_as_id=True parameter, it just appends a part_X to each doc id, so the ids are still deterministic.

Or, you could devise your own scheme for the doc ids
cool. I did something like this:
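import logging

# Import path used by older llama_index releases; newer ones expose it from llama_index.core
from llama_index import SimpleDirectoryReader

# file_paths: local temp paths; azure_path: the matching Azure paths, in the same order
documents = []
for x, file_path in enumerate(file_paths):
    docs = SimpleDirectoryReader(input_files=[file_path]).load_data()

    # A PDF loads as one Document per page, so suffix each id with its part index
    logging.info("Adding doc_id to documents")
    for i, doc in enumerate(docs):
        doc.doc_id = f"{azure_path[x]}_part_{i}"

    documents.extend(docs)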
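With ids of the form <source>_part_<i>, the earlier worry about not knowing which array index belongs to which file can be handled by parsing the id back apart. Below is a minimal sketch under that assumption. split_doc_id and group_by_source are hypothetical helper names, not part of LlamaIndex, and the code only relies on each document having a doc_id attribute.

```python
from collections import defaultdict


def split_doc_id(doc_id):
    """Split "<source>_part_<n>" into (source, n)."""
    # rpartition splits on the LAST "_part_", so a source path that itself
    # contains "_part_" is still handled.
    source, sep, part = doc_id.rpartition("_part_")
    if not sep or not part.isdigit():
        raise ValueError(f"not a part id: {doc_id!r}")
    return source, int(part)


def group_by_source(documents):
    """Map each source path to its documents, ordered by part number."""
    groups = defaultdict(list)
    for doc in documents:
        source, part = split_doc_id(doc.doc_id)
        groups[source].append((part, doc))
    return {
        source: [doc for _, doc in sorted(items, key=lambda item: item[0])]
        for source, items in groups.items()
    }
```

Calling group_by_source(documents) on the combined list returns one entry per Azure path, so the pages of each file can be looked up, replaced, or deleted together.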