
Updated 3 months ago

Refresh

I am trying to insert a doc using refresh, i.e. if the doc has been inserted previously, don't insert it again. Here's my snippet of code, and it isn't working: every run with the same page and the same URL results in a fresh insertion.

index = VectorStoreIndex.from_documents([])
index.storage_context.persist(storage_location)

def add_page_to_index(page_text, page_url):
    document = Document(url=page_url, text=page_text)
    documents = [document]
    retval = index.refresh(documents)
    index.storage_context.persist(storage_location)
    return
10 comments
Refresh depends on having the same doc_id every time 👀 So in this case, you should set the doc id to something consistent.
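One way to get a consistent doc id is to derive it from the page URL. The sketch below is only an illustration: `stable_doc_id` is a hypothetical helper, not part of the library, and the normalization rules (lowercase host, drop the fragment) are assumptions you may want to change.

```python
from urllib.parse import urlsplit, urlunsplit


def stable_doc_id(page_url: str) -> str:
    """Return the same ID for the same page on every run (hypothetical helper)."""
    parts = urlsplit(page_url.strip())
    # Lowercase scheme and host and drop the #fragment; path and query are kept as-is.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


# Assumed usage with the snippet above. The keyword for the ID is an assumption
# and may differ between library versions (for example doc_id vs. id_):
#   document = Document(text=page_text, doc_id=stable_doc_id(page_url))
```

With an ID like this, two runs over the same URL hand refresh the same doc_id, so it has something to compare against.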
Thanks. Now, if I want to update only when the content has actually changed, should I retrieve the stored data and compare the page_text contents myself, or does the VectorStore have some built-in smarts to compare, say, a digest computed over the old and new page_text?

Second question: what is the size limitation on the index? Another approach for a doc id might be to stash the URL and MD5 hash together as the doc id, and I am wondering if that might violate certain assumptions.

Finally, is there a call to retrieve a document from a VectorStoreIndex based on doc id? If there is one, I can't seem to find it on this page: https://gpt-index.readthedocs.io/en/latest/api_reference/indices/vector_store.html
Yea it compares based on hash, assuming that the doc_id remains consistent.

Right now, the approach only works for the simple vector store actually 🫠 been meaning to add more broad support for other vector stores though. For the simple vector store, it fetches the node from the docstore
It is essentially caching the doc_id + hash yea
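To make the "doc_id + hash" idea concrete, here is a toy sketch of that caching logic. It is not the library's code: `RefreshCache` is a hypothetical class, SHA-256 is an assumed choice of hash, and a plain dict stands in for the docstore.

```python
import hashlib
from typing import Dict, List, Tuple


class RefreshCache:
    """Toy stand-in for a docstore that remembers doc_id -> content hash."""

    def __init__(self) -> None:
        self._hashes: Dict[str, str] = {}

    def refresh(self, documents: List[Tuple[str, str]]) -> List[bool]:
        """Take (doc_id, text) pairs; return True where an insert or update is needed."""
        changed = []
        for doc_id, text in documents:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            is_new_or_changed = self._hashes.get(doc_id) != digest
            if is_new_or_changed:
                # A real index would re-embed and upsert the document here.
                self._hashes[doc_id] = digest
            changed.append(is_new_or_changed)
        return changed


cache = RefreshCache()
first = cache.refresh([("https://example.com/a", "hello")])
second = cache.refresh([("https://example.com/a", "hello")])
third = cache.refresh([("https://example.com/a", "hello, edited")])
```

Under this model the doc id should not include the content hash: if it did, edited content would arrive under a new ID and be inserted as a separate document instead of replacing the old one.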
Thanks for the clarification. Btw, what is the underlying store for the Simple Vector Store? Is that Chroma or something that LlamaIndex has custom-built?
It's just a very simple in memory list haha and does pairwise comparison to get the top k using numpy.

It's very simple, but honestly it works pretty well
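For a feel of what "an in-memory list plus pairwise comparison" looks like, here is a rough sketch. It swaps numpy for plain Python so it runs without dependencies, and it assumes cosine similarity as the metric; `cosine` and `top_k` are hypothetical names, not the library's functions.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score the query against every stored vector and keep the k best matches."""
    scored = ((node_id, cosine(query, vec)) for node_id, vec in embeddings.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k([1.0, 0.0], store, k=2)
```

Every query touches every stored vector, so the cost grows linearly with the number of nodes. That is fine for small indexes and is the usual reason to move to a dedicated vector database for larger ones.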
Thanks again. Btw, what's the operating model for LlamaIndex when it comes to interop with databases? Does your team implement the connector, or does somebody from the database side do it?
Anyone can implement it and send in a PR! 💪
Got it. Who usually does it, though? Just asking because I was wondering what to expect in terms of the velocity at which new database features would make it into llama-index modules.
It's largely community members who contribute updates to vector dbs. We do our best to get things reviewed/merged within a day or two. Larger PRs that change user-facing APIs may take longer though, as they usually require some iteration.