I'm trying to understand how refresh_ref_docs works when my vector store, docstore, and index store are all in Postgres but the files I want to ingest are on the file system. It seems to handle new files and changes to existing files, but what about deleted files? Is there a way to delete the associated rows in the docstore and vector store when I've deleted a file from the file system since the last ingestion? Or do I need to handle that manually? (If so, sample code would really help.)

Here's a snippet of my indexing code. Please let me know if I'm doing anything wrong!
storage_context = StorageContext.from_defaults(
    vector_store=postgres_vector_store,
    index_store=postgres_index_store,
    docstore=postgres_docstore,
)
# Add filename to metadata.
filename_fn = lambda filename: {"file_name": filename}
documents = SimpleDirectoryReader(
    "./sources",
    recursive=True,
    file_metadata=filename_fn,
    filename_as_id=True,
).load_data()
try:
    print("Loading index from docstore...")
    index = load_index_from_storage(
        storage_context=storage_context, service_context=service_context
    )
except Exception:
    # No index in storage yet, so build one from the loaded documents.
    print("Creating initial docstore...")
    index = VectorStoreIndex.from_documents(
        documents=documents,
        store_nodes_override=True,  # Do I need to set this override?
        storage_context=storage_context,
        service_context=service_context,
        show_progress=True,
    )
print(
    "Refreshing vector database with only new documents from the file system. "
    "TO DO: Handle deleted files."
)
refreshed_docs = index.refresh_ref_docs(
    documents=documents,
)
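
For the deleted-file case, this is roughly what I imagine the manual handling would look like. It's only a sketch: `find_deleted_ref_doc_ids` and `delete_missing_ref_docs` are names I made up, and I'm assuming that `index.ref_doc_info` lists every ingested ref doc ID and that `index.delete_ref_doc(..., delete_from_docstore=True)` removes the rows from both the vector store and the docstore. Please correct me if either assumption is wrong.

```python
from typing import Iterable, List


def find_deleted_ref_doc_ids(
    stored_ids: Iterable[str], current_ids: Iterable[str]
) -> List[str]:
    """Return the IDs that are in storage but no longer present on disk."""
    return sorted(set(stored_ids) - set(current_ids))


def delete_missing_ref_docs(index, documents) -> List[str]:
    """Delete ref docs whose source files have disappeared (hypothetical helper).

    Assumes `index.ref_doc_info` maps ref_doc_id -> info, and that
    `index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)` removes the
    document's nodes from the vector store and the docstore.
    """
    # With filename_as_id=True, each document ID is derived from its file path,
    # so the IDs are stable between ingestion runs.
    current_ids = {doc.doc_id for doc in documents}
    stale_ids = find_deleted_ref_doc_ids(index.ref_doc_info.keys(), current_ids)
    for ref_doc_id in stale_ids:
        index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)
    return stale_ids
```

I would call `delete_missing_ref_docs(index, documents)` right after `refresh_ref_docs`. As far as I can tell, `ref_doc_info` only works on a vector store that stores text when `store_nodes_override=True` is set, which may be the answer to my question about the override, but I'm not certain.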