I'm trying to understand how `refresh_ref_docs` works

I'm trying to understand how refresh_ref_docs works when my vector_store, docstore, and index_store are all in Postgres but the files I want to ingest are on the file system. It seems to handle changes to existing files, but what about deleted files? Is there a way to delete the associated rows in the docstore and vector_store when I've deleted a file from the file system since the last ingestion, or do I need to handle that manually? (If so, sample code would really help.) Here's a snippet of my indexing code. Please let me know if I'm doing anything wrong!

Plain Text
storage_context = StorageContext.from_defaults(
    vector_store=postgres_vector_store,
    index_store=postgres_index_store,
    docstore=postgres_docstore,
)

# Add filename to metadata.
filename_fn = lambda filename: {"file_name": filename}

documents = SimpleDirectoryReader(
    "./sources",
    recursive=True,
    file_metadata=filename_fn,
    filename_as_id=True,
).load_data()

try:
    print("Loading index from docstore...")
    index = load_index_from_storage(
        storage_context=storage_context, service_context=service_context
    )
except Exception:
    print("Creating initial docstore...")
    index = VectorStoreIndex.from_documents(
        documents=documents,
        store_nodes_override=True, # Do I need to set this override?
        storage_context=storage_context,
        service_context=service_context,
        show_progress=True,
    )

print("Refreshing vector database with only new documents from the file system. TO DO: Handle deleted files.")
refreshed_docs = index.refresh_ref_docs(
    documents=documents,
)
@Logan M Adding delete_from_docstore doesn't seem to delete nodes for files that are no longer present in documents. Am I missing something?
there's no handling really for deleted documents with refresh_ref_docs
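
(For reference, a manual workaround sketch if you stay on refresh_ref_docs: diff the ref_doc_ids the docstore is tracking against the docs just loaded from disk, and delete anything that disappeared. index.delete_ref_doc and docstore.get_all_ref_doc_info are existing APIs; the diff logic itself is an untested sketch and assumes filename_as_id=True so doc ids line up with ref_doc_ids.)

Plain Text
# Ids of everything currently on disk (filename_as_id=True makes
# these the same ids the docstore tracks as ref_doc_ids).
current_ids = {doc.doc_id for doc in documents}

# Everything the docstore has seen in previous ingestions.
ref_doc_info = index.docstore.get_all_ref_doc_info() or {}

for ref_doc_id in list(ref_doc_info.keys()):
    if ref_doc_id not in current_ids:
        # File was deleted on disk: drop its nodes from the vector
        # store and its bookkeeping entry from the docstore.
        index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)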

tbh this flow is slightly janky

I would recommend the ingestion pipeline flow, which does have a mode for handling deleted documents (if, upon rerunning, a document that was present before is now missing, it gets deleted; this assumes the full document set is being passed each time, though)

Plain Text
from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    vector_store=postgres_vector_store,
    docstore=postgres_docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS_AND_DELETE,
)

pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(vector_store, ...)
Sounds like exactly what I need! Will this work with 0.9.48 by any chance?
it should 🙏

The import is slightly different for v0.9.x

from llama_index.ingestion import IngestionPipeline, DocstoreStrategy
Thank you so much! I can't believe how far I can get with this tool and your help despite having no idea what I'm doing 🙂
Not quite sure what to do after running the pipeline. Creating the index with VectorStoreIndex.from_vector_store seems to just create the docstore table in the database, but the main table that's supposed to contain the nodes is empty.
after running the pipeline, it should have already inserted stuff into your vector store, and set up some tracking in the docstore for what's been inserted
So VectorStoreIndex.from_vector_store(vector_store) is just connecting to a populated vector db
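
(Concretely, a minimal sketch of that reconnect step; the query string is just a placeholder:)

Plain Text
# Reconnect to the vector store the pipeline populated and query it.
index = VectorStoreIndex.from_vector_store(
    vector_store=postgres_vector_store,
    service_context=service_context,  # your existing custom llm/embed setup
)
query_engine = index.as_query_engine()
response = query_engine.query("What changed in the latest sources?")
print(response)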
Hmm. It doesn't seem to create the initial vector store rows. I'm trying to do this with a custom LLM class and a custom embed_model class set in the service_context.
It works now! Even the deleting of documents! Looks like I just needed to update my custom embedding class to work as a transformation.
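
(For anyone landing here later: the pipeline calls the embedding object directly as a transformation, so a custom class has to implement the BaseEmbedding interface. A minimal sketch; MyEmbedding and its dummy vector are stand-ins for your real model:)

Plain Text
from typing import List

# v0.9.x import; in v0.10+ it's llama_index.core.embeddings
from llama_index.embeddings.base import BaseEmbedding

class MyEmbedding(BaseEmbedding):
    """Hypothetical custom embedding usable as a pipeline transformation."""

    def _get_text_embedding(self, text: str) -> List[float]:
        # Replace with a real call to your embedding backend.
        return [0.0] * 384

    def _get_query_embedding(self, query: str) -> List[float]:
        return self._get_text_embedding(query)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)

# Reusing the imports and stores from the pipeline snippet above.
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), MyEmbedding()],
    vector_store=postgres_vector_store,
    docstore=postgres_docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS_AND_DELETE,
)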
there we go! Right!