Updated 2 years ago

If I build an index over some documents

At a glance

The community member is using llama-index to manage a document index. They can use the refresh() function to update documents with the same doc_id but different text, and add new documents. However, they would also like to delete documents that are not present in the new scrape. The community member is interested in whether llama-index has a built-in management utility for this.

In the comments, another community member suggests a manual approach to delete documents not present in the new scrape. They provide sample code to find the doc_ids for deletion, delete them from the index, and then refresh the index with the newly scraped documents. Another community member thinks this approach seems feasible and suggests the community member could add a boolean option to the refresh() function in a pull request.

If I build an index over some documents and then re-scrape the document source, I can use refresh() to update documents with the same doc_id but different text, as well as add new documents. But I'd also like to delete documents that are not present in the new scrape. Does llama-index have a built-in management utility for this too?
3 comments
Also interested in this.
I'm doing the simplest thing possible here, but maybe there is something better.

# IDs that are in the index but missing from the new scrape
doc_ids_for_deletion = set(index.ref_doc_info.keys()) - {doc.doc_id for doc in newly_scraped_docs}

# Remove each stale document from the index and the docstore
for doc_id in doc_ids_for_deletion:
    if doc_id in index.ref_doc_info:
        index.delete_ref_doc(doc_id, delete_from_docstore=True)
        print(f"Deleted doc {doc_id}")
    else:
        print(f"Doc {doc_id} not found in index")

# Update changed documents and insert new ones
refreshed_docs = index.refresh_ref_docs(newly_scraped_docs, update_kwargs={"delete_kwargs": {"delete_from_docstore": True}})
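To make the set difference that opens the snippet above concrete, here is a minimal sketch that uses plain string IDs in place of a real index. The helper name `partition_doc_ids` and the sample IDs are made up for illustration; they are not part of llama-index.

```python
def partition_doc_ids(indexed_ids, scraped_ids):
    """Split doc IDs into stale, shared, and new groups (hypothetical helper)."""
    indexed, scraped = set(indexed_ids), set(scraped_ids)
    return {
        "stale": indexed - scraped,   # in the index, gone from the scrape: delete these
        "shared": indexed & scraped,  # in both: refresh decides whether the text changed
        "new": scraped - indexed,     # only in the scrape: refresh inserts these
    }


# Sample IDs are invented: "a" disappeared from the source, "d" is new.
groups = partition_doc_ids(["a", "b", "c"], ["b", "c", "d"])
```

Only the "stale" group needs the manual delete loop; the other two groups are what the refresh call is described as handling in the question.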
I think that approach seems feasible.

If you want, you could add a boolean option to the refresh function in a PR 💪
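One way the boolean option might look, sketched as a standalone wrapper instead of a change to the library itself. The function name `refresh_and_prune` and the `delete_missing` flag are hypothetical. The sketch assumes only that the index exposes `ref_doc_info`, `delete_ref_doc`, and `refresh_ref_docs` as they are used in the snippet earlier in this thread.

```python
def refresh_and_prune(index, documents, delete_missing=False):
    """Refresh `index` from `documents`; optionally drop docs absent from them.

    Hypothetical wrapper, not a llama-index API. `index` is duck-typed: it
    needs `ref_doc_info`, `delete_ref_doc`, and `refresh_ref_docs`.
    """
    deleted = []
    if delete_missing:
        current_ids = {doc.doc_id for doc in documents}
        # Build the stale-ID list up front so deletions do not disturb iteration.
        stale_ids = sorted(set(index.ref_doc_info.keys()) - current_ids)
        for doc_id in stale_ids:
            index.delete_ref_doc(doc_id, delete_from_docstore=True)
            deleted.append(doc_id)
    # Update changed documents and insert new ones, as in the original snippet.
    refreshed = index.refresh_ref_docs(documents)
    return deleted, refreshed
```

With `delete_missing` left at its default of `False`, the wrapper behaves like a plain refresh, so existing callers would not be affected by the new flag.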