
Updated 2 months ago


Hi - I'm having trouble getting refresh() to work: it creates a new document every time. Is there an error in here? Thank you!
Plain Text
db_documents = db.load_data(query=query)
for document in db_documents:
    document.doc_id = VERSION_NUMBER + "_" + "string"

vector_store = PGVectorStore.from_params(
    database="postgres",
    host=HOSTNAME,
    password=PASS,
    port=5432,
    user=USER,
    table_name=TABLE,
    embed_dim=1536,
    hybrid_search=True,
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

refreshed_docs = index.refresh(
    db_documents,
)
index.storage_context.persist()

(there's only one row loaded from the database currently)
13 comments
refresh() only works with the default simple vector store

It relies on information stored in the docstore, but the docstore isn't used with a vector db integration, so every document looks new and gets inserted again

You can override this, but then you need to handle persisting the docstore/index store yourself
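To make the docstore point concrete, here is a minimal pure-Python sketch of the kind of decision a refresh has to make. It is not LlamaIndex's actual code: `ToyIndex`, `doc_hashes`, and `vector_rows` are hypothetical names, and the assumption is only that refresh compares a stored hash per doc_id against the incoming document.

```python
import hashlib


def text_hash(text: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ToyIndex:
    """Hypothetical stand-in for an index; NOT the LlamaIndex implementation."""

    def __init__(self, keep_docstore: bool):
        self.keep_docstore = keep_docstore
        self.doc_hashes = {}   # doc_id -> hash of last seen text (plays the "docstore")
        self.vector_rows = []  # (doc_id, text) rows (plays the "vector db")

    def refresh(self, docs):
        """docs: list of (doc_id, text). Returns one bool per doc: True if written."""
        written = []
        for doc_id, text in docs:
            new_hash = text_hash(text)
            old_hash = self.doc_hashes.get(doc_id)
            if old_hash == new_hash:
                # same id, same content: nothing to do
                written.append(False)
                continue
            if old_hash is not None:
                # known id, different content: drop the old rows first
                self.vector_rows = [r for r in self.vector_rows if r[0] != doc_id]
            self.vector_rows.append((doc_id, text))
            if self.keep_docstore:
                self.doc_hashes[doc_id] = new_hash
            written.append(True)
        return written


docs = [("v1_row", "hello")]

with_docstore = ToyIndex(keep_docstore=True)
with_docstore.refresh(docs)
with_docstore.refresh(docs)

without_docstore = ToyIndex(keep_docstore=False)
without_docstore.refresh(docs)
without_docstore.refresh(docs)

# prints "1 2": without remembered hashes, the second refresh inserts a duplicate
print(len(with_docstore.vector_rows), len(without_docstore.vector_rows))
```

With nothing remembered between calls, every document is treated as unseen, which matches the "creates a new document every time" symptom above.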
thank you!
i see, so i'm currently using PGVector as the vector_store, but i would need to also define the docstore and index_store, ya?

once i set those two to remote, would i then be able to use refresh(), or is there another approach?

i'm open to any suggestion on easier approaches vs using refresh() as well. i guess i can store my own hashes and use update_ref_doc() instead?
actually i see VectorStoreIndex doesn't support update either
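For the "store my own hashes" idea, a rough sketch could look like the following. Everything here is an assumption for illustration: the manifest file name and the helpers `load_manifest`, `save_manifest`, and `split_changed` are hypothetical, and the code only works out which doc_ids are new or changed; handing them to the index is left out.

```python
import hashlib
import json
import os
import tempfile


def load_manifest(path):
    """Return the saved doc_id -> hash mapping, or an empty one on first run."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_manifest(path, manifest):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f)


def split_changed(docs, manifest):
    """docs: dict of doc_id -> text. Returns (new_ids, updated_ids, new_manifest)."""
    new_ids, updated_ids = [], []
    new_manifest = dict(manifest)
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id not in manifest:
            new_ids.append(doc_id)
        elif manifest[doc_id] != digest:
            updated_ids.append(doc_id)
        new_manifest[doc_id] = digest
    return new_ids, updated_ids, new_manifest


# hypothetical location for the hash manifest
manifest_path = os.path.join(tempfile.mkdtemp(), "doc_hashes.json")

# first run: nothing is known yet, so doc1 counts as new
new_ids, updated_ids, manifest = split_changed(
    {"doc1": "document 1"}, load_manifest(manifest_path)
)
save_manifest(manifest_path, manifest)
first_run = (new_ids, updated_ids)

# second run: doc1 has new text, doc2 has never been seen
second_run = split_changed(
    {"doc1": "new document 1", "doc2": "document 2"}, load_manifest(manifest_path)
)[:2]

print(first_run, second_run)
```

The new ids would then be inserted and the updated ids passed to something like update_ref_doc(), with the manifest saved only after those writes succeed.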
Nah it works, lemme setup an example (it's a little complicated tbh)
I find this aspect of working with LlamaIndex and vector stores super confusing. I used refresh_ref_docs() to add all my documents to a chromadb PersistentClient and I thought it worked.
Plain Text
from llama_index import VectorStoreIndex, StorageContext, Document
from llama_index.vector_stores import ChromaVectorStore

import chromadb
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = [Document(text="document 1", doc_id="doc1")]

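# store_nodes_override=True keeps the nodes in the docstore/index store too, which refresh relies on
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, store_nodes_override=True)

# test and confirm single document is retrieved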
nodes = index.as_retriever(similarity_top_k=10).retrieve("document")
print("Initial: ", [(node.text, node.node_id) for node in nodes])

# save the docstore/index store
index.storage_context.persist(persist_dir="./storage")

# load the index, reusing the same vector store plus the persisted docstore/index store
new_storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./storage")

from llama_index import load_index_from_storage
# optionally also pass service_context=service_context here
loaded_index = load_index_from_storage(new_storage_context)

# test and confirm single document is retrieved
nodes = loaded_index.as_retriever(similarity_top_k=10).retrieve("document")
print("Loaded: ", [(node.text, node.node_id) for node in nodes])

# test that refresh works
documents = [Document(text="new document 1", doc_id="doc1"), Document(text="document 2", doc_id="doc2")]
loaded_index.refresh_ref_docs(documents)

# test and confirm refreshed documents are retrieved
nodes = loaded_index.as_retriever(similarity_top_k=10).retrieve("document")
print("Refreshed: ", [(node.text, node.node_id) for node in nodes])
I used chromadb, but this will work with any vector db
At the end, the existing doc1 is updated with new text, and doc2 is also inserted
LOL jk, I messed up somewhere. It's very close, one sec
Ok it works! Here's the output

Plain Text
Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1
Initial:  [('document 1', 'f6d7740e-f483-4c2a-a017-eaddd3916382')]
Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1
Loaded:  [('document 1', 'f6d7740e-f483-4c2a-a017-eaddd3916382')]
Number of requested results 10 is greater than number of elements in index 2, updating n_results = 2
Refreshed:  [('document 2', 'a9e63361-7c7f-4a4a-925f-7956820ef8c1'), ('new document 1', '349c1488-0433-4dbc-a081-ecc5d1829496')]
amazing thank you! will try
works perfectly, thank you!!

btw not sure if the docs need to be updated
https://docs.llamaindex.ai/en/stable/core_modules/data_modules/index/document_management.html#refresh
still uses the older refresh() method and the id_ property instead of doc_id
Both work actually, but yea, could update that 👍