
SQL rows

I am working with LlamaIndex to convert a SQL database to a Pinecone vector database. How do I ensure that when I'm adding the row data to the index, I drop duplicates for rows already in the vector store? The SQL database is frequently updated.
Is there some kind of constant identifier in the SQL database for each row? You could use that as the doc_id and use our update or refresh functionality 🤔
Yes, each row has a unique ID (it can also be derived from two columns).
This is how I'm creating the documents currently:
```python
from llama_index import Document  # adjust the import path to your llama_index version

documents = []
for row in job.result():
    doc_str = Document(
        text=str(row['review_text']),
        doc_id=str(row['review_id']),  # stable ID taken from the SQL row
        extra_info={
            'rating': str(row['review_rating']),
            'asin': str(row['asin']),
            'review_date': str(row['review_date']),
        },
    )
    documents.append(doc_str)
```
How then can I use the update or refresh?
So, assuming the review_id / doc_id stays constant, you can call this function:

index.refresh_ref_docs(documents)

which will (a) update any documents with the same doc_id but different content, and (b) insert any documents whose doc_ids are not already present.
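For context, a minimal sketch of how that could be wired up against Pinecone, assuming the llama_index Pinecone integration of that era (the index name, API key, and environment below are placeholders, not from the thread). Note that refresh_ref_docs relies on llama_index's own docstore bookkeeping, so it can only deduplicate documents it has previously seen through this index.

```python
# A sketch, not from the thread: assumes the (older) llama_index + pinecone-client
# APIs; "reviews", the API key, and the environment are placeholder values.
import pinecone
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import PineconeVectorStore

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
pinecone_index = pinecone.Index("reviews")

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# First run: build the index from the initial batch of documents.
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Later runs: refresh against the same doc_ids. Changed documents are
# re-inserted, new doc_ids are added, unchanged documents are skipped.
refreshed = index.refresh_ref_docs(documents)
print(f"{sum(refreshed)} documents were updated or inserted")
```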
@Logan M I have already put 60,000 vectors (60,000 rows) into a Pinecone index from the database. How do I avoid duplicating this data in Pinecone? The database has over 1,000,000 rows.
Sadly, that's not really supported with Pinecone. The best I can find is something like this: https://community.pinecone.io/t/removing-duplicate-embeddings/1186

It's kind of a case where you need to be thinking about duplication before inserting, really 😅
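One possible workaround, sketched here as an assumption rather than anything from the thread: since the review_id is the doc_id, you can keep your own record of which IDs have already been pushed and only build Documents for the remaining rows. The bookkeeping file and helper functions below are hypothetical.

```python
# Hypothetical local bookkeeping: track which review_ids have already been
# ingested so the 60,000 rows already in Pinecone are not re-embedded.
import json
from pathlib import Path

INGESTED_IDS_FILE = Path("ingested_review_ids.json")  # placeholder file name

def load_ingested_ids() -> set:
    """Return the set of doc_ids already pushed to Pinecone (empty on first run)."""
    if INGESTED_IDS_FILE.exists():
        return set(json.loads(INGESTED_IDS_FILE.read_text()))
    return set()

def save_ingested_ids(ids: set) -> None:
    """Persist the updated set of ingested doc_ids."""
    INGESTED_IDS_FILE.write_text(json.dumps(sorted(ids)))

ingested = load_ingested_ids()
new_documents = [doc for doc in documents if doc.doc_id not in ingested]

if new_documents:
    index.refresh_ref_docs(new_documents)  # only the unseen rows
    ingested.update(doc.doc_id for doc in new_documents)
    save_ingested_ids(ingested)
```

For the 60,000 rows already in the index, you would seed that file with their review_ids once (for example, from the same SQL query that produced them).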