Refresh

At a glance

The post is about a community member trying to insert a document into a VectorStoreIndex, but the index keeps inserting a new document every time, even if the document has been inserted previously. The community members suggest that the issue is with the doc_id not being consistent, and that the community member should set the doc_id to something consistent.

The community members also discuss other aspects of the VectorStoreIndex, such as how it compares documents based on hash, the size limitations of the index, and the possibility of using the URL and MD5 hash as the doc_id. They also discuss the interoperability of Llama-index with databases, where the community members can contribute connectors by sending pull requests.

There is no explicitly marked answer in the comments.


nnwdan

I am trying to insert a doc using refresh, i.e. if the doc has been inserted previously, don't insert it again. Here's my snippet of code, and it isn't working: every run with the same page and the same URL results in a fresh insertion.

index = VectorStoreIndex.from_documents([])
index.storage_context.persist(storage_location)

def add_page_to_index(page_text, page_url):
    document = Document(url=page_url, text=page_text)
    documents = [document]
    retval = index.refresh(documents)
    index.storage_context.persist(storage_location)
    return

10 comments

Logan M

Refresh depends on having the same doc_id every time 👀 so in this case, you should set the doc id to something consistent
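One way to get a consistent doc_id is to derive it from the page URL, so the same page maps to the same ID on every run. The sketch below is only an illustration: `stable_doc_id` is a hypothetical helper, not part of LlamaIndex, and the normalization rules (lowercasing the host, dropping the fragment) are assumptions you may want to change.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def stable_doc_id(page_url: str) -> str:
    """Hypothetical helper: derive the same fixed-length ID for the same page on every run."""
    parts = urlsplit(page_url.strip())
    # Lowercase the scheme and host and drop the fragment, so trivially
    # different spellings of one URL map to a single ID (an assumption).
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    # Hash the normalized URL so the ID has a fixed length regardless of URL size.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

In the snippet from the question, the ID would then be passed when the document is built, for example `Document(text=page_text, doc_id=stable_doc_id(page_url))`. The exact keyword depends on the installed version, so treat that call as an assumption and check it against your release.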

nnwdan

Thanks. Now, if I want to update only when the content has actually changed, should I retrieve the stored data and compare the page_text contents myself, or does the VectorStore have some built-in smarts to compare, say, a digest computed over the old and new page_text?

Second question: what is the size limitation on the index? Another approach for a doc id might be to stash the URL and MD5 hash together as the doc id, and I am wondering if that might violate certain assumptions.

Finally, is there a call to retrieve a document from a VectorStoreIndex based on doc id? If there is one, I can't seem to find it on this page: https://gpt-index.readthedocs.io/en/latest/api_reference/indices/vector_store.html

Logan M

Yea it compares based on hash, assuming that the doc_id remains consistent.

Right now, the approach only works for the simple vector store actually 🫠 been meaning to add more broad support for other vector stores though. For the simple vector store, it fetches the node from the docstore

Logan M

It is essentially caching the doc_id + hash yea
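The "doc_id + hash" caching described above can be sketched in plain Python. This is not LlamaIndex's actual implementation: `content_hash` and `refresh` are hypothetical names, and a dict stands in for the docstore. It only illustrates why a document is skipped when both its ID and its content hash match what was stored earlier.

```python
import hashlib


def content_hash(text: str) -> str:
    """Digest of the document text (SHA-256 chosen here as an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh(cache: dict[str, str], docs: list[tuple[str, str]]) -> list[bool]:
    """Return one flag per (doc_id, text) pair: True if it was inserted or updated."""
    changed = []
    for doc_id, text in docs:
        new_hash = content_hash(text)
        if cache.get(doc_id) == new_hash:
            # Same ID and same content: nothing to do.
            changed.append(False)
        else:
            # Unknown ID, or known ID with different content: (re)insert it.
            cache[doc_id] = new_hash
            changed.append(True)
    return changed
```

Under this model, a document that gets a fresh random ID on every run never matches a cached entry, which is consistent with the repeated insertions in the original question.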

nnwdan

Thanks for the clarification. Btw, what is the underlying store for the Simple Vector Store? Is that Chroma or something that Llama-index has custom built?

Logan M

It's just a very simple in memory list haha and does pairwise comparison to get the top k using numpy.

It's very simple, but honestly it works pretty well
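For a sense of what an in-memory, compare-against-everything top-k lookup with numpy can look like, here is a minimal sketch. It assumes cosine similarity and a 2-D array of stored embeddings; `top_k` is a hypothetical function and not the simple vector store's real code.

```python
import numpy as np


def top_k(query: np.ndarray, embeddings: np.ndarray, k: int) -> list[int]:
    """Return indices of the k stored embeddings most similar to the query (cosine)."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Score the query against every stored embedding in one matrix-vector product.
    scores = e @ q
    # Sort by descending score and keep the first k indices.
    return np.argsort(-scores)[:k].tolist()
```

This scans every stored vector on each query, so the cost grows linearly with the number of embeddings.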

nnwdan

Thanks again. Btw, what's the operating model for Llama-index when it comes to interop with databases? Does your team implement the connector, or does somebody from the database do it?

Logan M

Anyone can implement it and send in a PR! 💪

nnwdan

Got it - who usually does it, though? Just asking because I was wondering what to expect in terms of the velocity at which new database features would make it into llama-index modules.

Logan M

It's largely community members who contribute updates to vector dbs. We do our best to get things reviewed/merged within a day or two. Larger PRs that change user-facing APIs may take longer though as they usually require some iteration
