SQL rows

I am working with LlamaIndex to convert a SQL database to a Pinecone vector database. How do I ensure that, when I am adding the row data to the index, I drop duplicates for rows already in the vector store? The SQL database is frequently updated.
Is there some kind of constant identifier in the SQL database for each row? You could use that as the doc_id and use our update or refresh functionality πŸ€”
Yes, each row has a unique ID (it can also be constructed from two columns).
This is how I am creating the documents currently:
documents = []
for row in job.result():
    doc_str = Document(
        text=str(row['review_text']),
        doc_id=str(row['review_id']),
        extra_info={
            'rating': str(row['review_rating']),
            'asin': str(row['asin']),
            'review_date': str(row['review_date']),
        },
    )
    documents.append(doc_str)
How then can I use the update or refresh?
So assuming that the review_id / doc_id stays constant, you can call this function

index.refresh_ref_docs(documents), which will a) update any documents with the same doc_id but different content, and b) insert any documents with doc_ids that are not already present
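For anyone reading along, here is a minimal sketch of that refresh flow, separate from the Pinecone setup in this thread. It assumes a llama_index version where Document and VectorStoreIndex are importable from the top-level package, that an embedding model / API key is configured, and that the default local docstore tracks document hashes; the review IDs and texts are made up.

```python
from llama_index import Document, VectorStoreIndex

# First ingestion: two reviews, keyed by a stable doc_id.
docs_v1 = [
    Document(text="Great product", doc_id="rev-1"),
    Document(text="Arrived late", doc_id="rev-2"),
]
index = VectorStoreIndex.from_documents(docs_v1)

# Later run: rev-1 was edited, rev-2 is unchanged, rev-3 is new.
docs_v2 = [
    Document(text="Great product, still happy a month later", doc_id="rev-1"),
    Document(text="Arrived late", doc_id="rev-2"),
    Document(text="Broke after a week", doc_id="rev-3"),
]

# refresh_ref_docs compares each doc_id's content hash against what the
# docstore has seen: changed documents are re-inserted, unchanged ones are
# skipped, and unseen doc_ids are inserted fresh.
refreshed = index.refresh_ref_docs(docs_v2)
print(refreshed)  # one bool per document, e.g. [True, False, True]
```

Note that this hash bookkeeping lives in the index's local docstore, so across separate runs it only helps if that storage is persisted and reloaded.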
@Logan M I have already put 60,000 vectors (60,000 rows) into a Pinecone database from the SQL database. How do I avoid duplicating this data in Pinecone? The database has over 1,000,000 rows.
Sadly, that's not really supported with Pinecone. The best I can find is something like this: https://community.pinecone.io/t/removing-duplicate-embeddings/1186

It's kind of a case where you really need to be thinking about deduplication before inserting πŸ˜…
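One hedged sketch of what "deduplication before inserting" could look like here: keep your own record of which review_ids have already been pushed, and only build or refresh Documents for rows outside that set. The JSON file path and helper names below are made up for illustration; any durable store (another SQL table, for instance) would work the same way.

```python
import json
from pathlib import Path

# Hypothetical bookkeeping file tracking review_ids already sent to Pinecone.
INGESTED_IDS_PATH = Path("ingested_review_ids.json")

def load_ingested_ids() -> set:
    if INGESTED_IDS_PATH.exists():
        return set(json.loads(INGESTED_IDS_PATH.read_text()))
    return set()

def save_ingested_ids(ids: set) -> None:
    INGESTED_IDS_PATH.write_text(json.dumps(sorted(ids)))

ingested = load_ingested_ids()

# `documents` is the list built from the SQL rows earlier in the thread.
new_documents = [doc for doc in documents if doc.doc_id not in ingested]

if new_documents:
    # Only rows never pushed before get embedded and sent to the vector store.
    index.refresh_ref_docs(new_documents)
    ingested.update(doc.doc_id for doc in new_documents)
    save_ingested_ids(ingested)
```

If you can reconstruct which 60,000 review_ids were already pushed, the set can be seeded with them once so those rows are never re-embedded.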