Hi I m having trouble getting refresh to

Hi - I'm having trouble getting refresh() to work: it creates a new document every time. Is there an error in here? Thank you!

Plain Text

db_documents = db.load_data(query=query)
for document in db_documents:
    document.doc_id = VERSION_NUMBER + "_" + "string"

vector_store = PGVectorStore.from_params(
    database="postgres",
    host=HOSTNAME,
    password=PASS,
    port=5432,
    user=USER,
    table_name=TABLE,
    embed_dim=1536,
    hybrid_search=True,
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

refreshed_docs = index.refresh(
    db_documents,
)
index.storage_context.persist()

(There's currently only one row loaded from the database.)

13 comments

Logan M

refresh() only works with the base simple vector store :PSadge:

It relies on information stored in the docstore, but the docstore isn't used when you use a vector db integration.

You can override this, but then you need to handle persisting the docstore/index store
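To illustrate why a missing docstore makes every document look new, here is a plain-Python stand-in (not LlamaIndex's actual code). It assumes refresh compares a hash of each incoming document against a hash remembered under the same doc_id. The names `content_hash`, `refresh_sketch`, and the `stored_hashes` dict (standing in for the docstore) are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the document text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh_sketch(stored_hashes: dict, documents: list) -> list:
    """Decide, per (doc_id, text) pair, whether to insert, update, or skip.

    `stored_hashes` plays the role of the docstore (an assumption for this
    sketch): it maps doc_id -> hash of the text seen last time.
    """
    actions = []
    for doc_id, text in documents:
        new_hash = content_hash(text)
        old_hash = stored_hashes.get(doc_id)
        if old_hash is None:
            actions.append("insert")   # never seen this doc_id
        elif old_hash != new_hash:
            actions.append("update")   # same doc_id, different text
        else:
            actions.append("skip")     # unchanged
        stored_hashes[doc_id] = new_hash
    return actions


docs = [("doc1", "document 1")]

# Hashes survive between runs: the second run has nothing to do.
persisted = {}
first_run = refresh_sketch(persisted, docs)
second_run = refresh_sketch(persisted, docs)

# Hashes are lost between runs (empty store each time): always "insert".
lost_first = refresh_sketch({}, docs)
lost_second = refresh_sketch({}, docs)

print(first_run, second_run)    # ['insert'] ['skip']
print(lost_first, lost_second)  # ['insert'] ['insert']
```

The second case matches the symptom in the question: if nothing remembers the previous hashes, the same row is treated as a new document on every run.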

Byron

thank you!
i see, so i'm currently using PGVector as the vector_store, but i would need to also define the docstore and index_store, ya?

once i set those two to remote, would i then be able to use refresh(), or is there another approach?

i'm open to any suggestion on easier approaches vs using refresh() as well. i guess i can store my own hashes and use update_ref_doc() instead?
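For the "store my own hashes" idea, a minimal sketch might look like the following. It is only an illustration under stated assumptions: the hashes live in a local JSON file, and `changed_doc_ids` and the `hashes.json` file name are hypothetical. It only reports which doc ids are new or changed; what you then call on the index for those ids is up to you.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def changed_doc_ids(docs: dict, manifest_path: Path) -> list:
    """Return the ids in `docs` whose text is new or changed since last time.

    `docs` maps doc_id -> text. The manifest (a hypothetical JSON file) maps
    doc_id -> sha256 of the text seen on the previous call, and is updated
    before returning.
    """
    previous = {}
    if manifest_path.exists():
        previous = json.loads(manifest_path.read_text())

    current = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in docs.items()
    }
    changed = [doc_id for doc_id, h in current.items() if previous.get(doc_id) != h]

    # Merge rather than overwrite, so ids missing from this batch are kept.
    previous.update(current)
    manifest_path.write_text(json.dumps(previous, indent=2))
    return changed


# Usage: three consecutive "runs" against the same manifest file.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "hashes.json"
    run1 = changed_doc_ids({"doc1": "document 1"}, manifest)
    run2 = changed_doc_ids({"doc1": "document 1"}, manifest)
    run3 = changed_doc_ids({"doc1": "new document 1", "doc2": "document 2"}, manifest)

print(run1, run2, run3)
```

The trade-off is that you now own a second piece of state (the manifest) that has to stay in sync with the vector store.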

Byron

actually i see VectorStoreIndex doesn't support update either

Logan M

Nah it works, lemme setup an example (it's a little complicated tbh)

theta

I find this aspect of working with LlamaIndex and vector stores super confusing. I used refresh_ref_docs() to add all my documents to a chromadb PersistentClient and I thought it worked.

Logan M

Plain Text

from llama_index import VectorStoreIndex, StorageContext, Document
from llama_index.vector_stores import ChromaVectorStore

import chromadb
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = [Document(text="document 1", doc_id="doc1")]

# test and confirm single document is retrieved
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, store_nodes_override=True)
nodes = index.as_retriever(similarity_top_k=10).retrieve("document")
print("Initial: ", [(node.text, node.node_id) for node in nodes])

# save the docstore/index store
index.storage_context.persist(persist_dir="./storage")

# load the index
new_storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./storage")

from llama_index import load_index_from_storage
# optional service context
loaded_index = load_index_from_storage(new_storage_context) # , service_context=service_context)

# test and confirm single document is retrieved
nodes = loaded_index.as_retriever(similarity_top_k=10).retrieve("document")
print("Loaded: ", [(node.text, node.node_id) for node in nodes])

# test that refresh works
documents = [Document(text="new document 1", doc_id="doc1"), Document(text="document 2", doc_id="doc2")]
loaded_index.refresh_ref_docs(documents)

# test and confirm refreshed documents are retrieved
nodes = loaded_index.as_retriever(similarity_top_k=10).retrieve("document")
print("Refreshed: ", [(node.text, node.node_id) for node in nodes])

Logan M

I used chromadb, but this will work with any vector db

Logan M

At the end, the existing doc1 is updated with new text, and doc2 is also inserted

Logan M

LOL jk, I messed up somewhere. It's very close, one sec

Logan M

Ok it works! Here's the output:

Plain Text

Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1
Initial:  [('document 1', 'f6d7740e-f483-4c2a-a017-eaddd3916382')]
Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1
Loaded:  [('document 1', 'f6d7740e-f483-4c2a-a017-eaddd3916382')]
Number of requested results 10 is greater than number of elements in index 2, updating n_results = 2
Refreshed:  [('document 2', 'a9e63361-7c7f-4a4a-925f-7956820ef8c1'), ('new document 1', '349c1488-0433-4dbc-a081-ecc5d1829496')]

Byron

amazing thank you! will try

Byron

works perfectly, thank you!!

btw not sure if the docs need to be updated
https://docs.llamaindex.ai/en/stable/core_modules/data_modules/index/document_management.html#refresh
still uses the older refresh() method and the id_ property instead of doc_id

Logan M

Both work actually, but yea, could update that 👍
