
At a glance
A community member using SimpleKeywordTableIndex with PostgresIndexStore finds that a new row is created in the index store table every time from_documents is called, even when the documents are unchanged. Another community member suggests handling duplicates with an IngestionPipeline and an attached SimpleDocumentStore. The original poster then solves the problem by retrieving the previously persisted data with pg_index_store.get_index_struct and feeding it to the SimpleKeywordTableIndex constructor, which lets them reuse the stored data.
Hi, I'm reading about SimpleKeywordTableIndex and doing tests with it. I've been using it with PostgresIndexStore.

PostgresIndexStore seems to be storing the index_store data correctly, as JSON with a table object. But every time I call from_documents, it runs the transformations again and stores a new row in that table, even though the documents are the same.

Shouldn't it use the document hashes to avoid re-indexing and re-storing the table index every time?
3 comments
Every time you call from_documents(), it makes no assumptions about duplicates.

I would use an ingestion pipeline with a docstore attached for handling duplicates

https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html

Python
# Imports assume llama_index >= 0.10
from llama_index.core import SimpleKeywordTableIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# The docstore records document hashes, so reruns of the pipeline
# skip documents that haven't changed.
docstore = SimpleDocumentStore()

pipeline = IngestionPipeline(
    transformations=[...], docstore=docstore
)

nodes = pipeline.run(documents=documents)

index = SimpleKeywordTableIndex(nodes=nodes, ...)
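As a quick sanity check (a sketch, assuming the same docstore is reused between runs), a second pipeline.run() with unchanged documents should produce no new nodes, since the docstore compares document hashes:

Python
# Second run with the same documents: the docstore's hash comparison
# filters out unchanged docs, so no transformations are re-executed.
nodes_again = pipeline.run(documents=documents)
assert len(nodes_again) == 0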
The docstore is working fine for me, updating data only when there's a change to a document.

The index_store isn't. Every time I execute the following code, a new row is created in the index store's Postgres table.

Python
from llama_index.core import SimpleKeywordTableIndex, StorageContext

# The init_*_from_env() calls are my own helpers that build the
# Postgres-backed stores from environment variables.
pg_docstore = init_pg_docstore_from_env()
pg_vector_store = init_pg_vector_store_from_env()
pg_index_store = init_pg_index_store_from_env()

storage_context = StorageContext.from_defaults(
    docstore=pg_docstore,
    vector_store=pg_vector_store,
    index_store=pg_index_store,
)

index = SimpleKeywordTableIndex(
    nodes=[],
    storage_context=storage_context,
)

# Inserts/updates only the documents whose hashes changed.
index.refresh_ref_docs(documents)
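If I'm reading the behavior right (treat this as a sketch), each construction builds a fresh index struct with a new random index_id, and the index store persists it as a new row instead of updating the old one:

Python
# Illustrative: the index gets a fresh UUID each run, so the index
# store accumulates one new struct (row) per execution.
print(index.index_id)                       # different id on every run
print(len(pg_index_store.index_structs()))  # grows by one per run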
Ok, I found a way to reuse the data stored in my Postgres database to rebuild the SimpleKeywordTableIndex.

I use pg_index_store.get_index_struct to fetch the previously created data and feed it to the SimpleKeywordTableIndex constructor.

Python
from llama_index.core import SimpleKeywordTableIndex, StorageContext
from llama_index.core.data_structs import KeywordTable

storage_context = StorageContext.from_defaults(
    docstore=pg_docstore,
    vector_store=pg_vector_store,
    index_store=pg_index_store,
)

# Reuse the previously persisted struct if it exists; otherwise start
# a fresh one under a stable, known id.
index_struct = pg_index_store.get_index_struct("any-keyword-id") or KeywordTable(
    index_id="any-keyword-id"
)

index = SimpleKeywordTableIndex(
    index_struct=index_struct,
    storage_context=storage_context,
)

print(len(index.as_retriever().retrieve("neon")))
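
A related option I haven't fully tested (so treat it as a sketch): once the struct is persisted under a known id, load_index_from_storage can rebuild the index from the same storage context:

Python
from llama_index.core import load_index_from_storage

# Rebuilds the index from the persisted struct; this raises if no
# struct with that id exists in the index store yet.
index = load_index_from_storage(storage_context, index_id="any-keyword-id")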


Leaving this here for anyone who finds this through search in the future.

Thanks for your reply, Logan!