
Updated 2 years ago

If I build an index over some documents

If I build an index over some documents and then re-scrape the document source, I can use refresh() to update documents that have the same doc_id but different text, as well as add new documents. But I'd also like to delete documents that are not present in the new scrape. Does llama-index have a built-in management utility for this too?
3 comments
Also interested in this.
I'm doing the simplest thing possible here, but maybe there is something better.

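# IDs that are in the index but missing from the new scrape
doc_ids_for_deletion = set(index.ref_doc_info.keys()) - {doc.doc_id for doc in newly_scraped_docs}

# Remove stale documents from both the index and the docstore.
# Every ID here came from index.ref_doc_info, so no membership check is needed.
for doc_id in doc_ids_for_deletion:
    index.delete_ref_doc(doc_id, delete_from_docstore=True)
    print(f"Deleted doc {doc_id}")

# Update changed documents and insert new ones.
# refresh_ref_docs returns one bool per document: True if it was inserted or updated.
refreshed_docs = index.refresh_ref_docs(
    newly_scraped_docs,
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)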
I think that approach seems feasible.

If you want, you could add a boolean option to the refresh function in a PR 💪
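
For anyone who wants this before such an option exists, here is a rough sketch of what it might look like as a standalone helper. `refresh_and_prune` and its `delete_missing` flag are made-up names, not part of llama-index. The sketch assumes the index exposes `ref_doc_info`, `delete_ref_doc`, and `refresh_ref_docs` as used in the snippet earlier in the thread, and that every document has a `doc_id` attribute.

```python
from typing import Any, Iterable, Set


def refresh_and_prune(
    index: Any,
    documents: Iterable[Any],
    delete_missing: bool = True,
) -> Set[str]:
    """Refresh `index` from `documents` and optionally prune stale entries.

    Hypothetical helper, not a llama-index API. Returns the set of doc IDs
    that were deleted because they were absent from `documents`.
    """
    docs = list(documents)
    deleted: Set[str] = set()

    if delete_missing:
        scraped_ids = {doc.doc_id for doc in docs}
        # Snapshot the keys first so the mapping is not mutated while iterating.
        stale_ids = set(index.ref_doc_info.keys()) - scraped_ids
        for doc_id in stale_ids:
            index.delete_ref_doc(doc_id, delete_from_docstore=True)
            deleted.add(doc_id)

    # Update changed documents and insert new ones.
    index.refresh_ref_docs(
        docs,
        update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
    )
    return deleted
```

You would call it as `refresh_and_prune(index, newly_scraped_docs)`, or pass `delete_missing=False` to get plain refresh behaviour. Deleting before refreshing keeps the stale set computed from the index state prior to any inserts.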