
Updated 3 months ago

I am indexing files when starting an application using
Plain Text
documents = SimpleDirectoryReader(documents_dir, recursive=True, filename_as_id=True, num_files_limit=10).load_data()
This works as expected: the docs are split up and doc IDs are assigned to the source documents (e.g. my-doc.md_part_1, my-doc.md_part_2, etc.). My problem comes when uploading new documents using
Plain Text
document = SimpleDirectoryReader(input_files=[doc_text]).load_data()[0]
In that case, only a small portion of the document is indexed (e.g. my-doc.md). Am I missing a detail needed to properly chunk the single document and include all of its parts in the index?
7 comments
is doc_text a file or just the entire string?
@Logan M I am following the example from the docs and submitting the file like so:
Plain Text
...
manager.register('insert_into_index')
...

@app.route("/uploadFile", methods=["POST"])
def upload_file():
    global manager
    if 'file' not in request.files:
        return "Please send a POST request with a file", 400

    filepath = None
    try:
        uploaded_file = request.files["file"]
        filename = secure_filename(uploaded_file.filename)
        filepath = os.path.join('documents', os.path.basename(filename))
        uploaded_file.save(filepath)

        if request.form.get("filename_as_doc_id", None) is not None:
            manager.insert_into_index(filepath, doc_id=filename)
        else:
            manager.insert_into_index(filepath)
    except Exception as e:
        # cleanup temp file
        if filepath is not None and os.path.exists(filepath):
            os.remove(filepath)
        return "Error: {}".format(str(e)), 500

    # cleanup temp file
    if filepath is not None and os.path.exists(filepath):
        os.remove(filepath)

    return "File inserted!", 200
and the insert_into_index is:
Plain Text
def insert_into_index(doc_text, doc_id=None):
    global index
    document = SimpleDirectoryReader(input_files=[doc_text]).load_data()[0]
    if doc_id is not None:
        document.doc_id = doc_id

    with lock:
        index.insert(document)
        index.storage_context.persist()
ah right, just a badly named variable LOL

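I think this is just a small bug in the data-loading assumption: I assumed only txt files when I wrote this, and a txt file always produces one document (hence the .load_data()[0]). A file that gets split into several documents, like your markdown doc, loses everything after the first part.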

Try something like this instead

Plain Text
def insert_into_index(doc_text, doc_id=None):
    global index
    # doc_text is a file path; one file may load as several documents
    documents = SimpleDirectoryReader(input_files=[doc_text], filename_as_id=True).load_data()

    with lock:
        # insert every part, not just the first
        for doc in documents:
            index.insert(doc)
        index.storage_context.persist()
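To see why taking [0] only indexes a small portion, here is a rough, library-free sketch. Everything in it (Part, load_parts, ToyIndex, and the blank-line splitting rule) is a made-up stand-in for illustration, not LlamaIndex's API or its real chunking logic; the ID scheme only mimics the my-doc.md_part_N names mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    """Hypothetical stand-in for one loaded document part."""
    doc_id: str
    text: str


def load_parts(filename: str, text: str) -> List[Part]:
    """Toy loader (assumption): one part per blank-line-separated section."""
    sections = [s for s in text.split("\n\n") if s.strip()]
    return [Part(f"{filename}_part_{i}", s) for i, s in enumerate(sections)]


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index; it only records inserted ids."""
    inserted: List[str] = field(default_factory=list)

    def insert(self, part: Part) -> None:
        self.inserted.append(part.doc_id)


text = "# Intro\nhello\n\n# Usage\nrun it\n\n# FAQ\nnone yet"
parts = load_parts("my-doc.md", text)

# Equivalent of .load_data()[0]: only the first part reaches the index.
first_only = ToyIndex()
first_only.insert(parts[0])

# Equivalent of looping over .load_data(): every part reaches the index.
all_parts = ToyIndex()
for part in parts:
    all_parts.insert(part)

print(first_only.inserted)  # one id
print(all_parts.inserted)   # three ids
```

The same shape applies to the real function: whatever list the loader returns, loop over all of it before persisting.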
Trying now 😉
@Logan M Righteous! All is working now.