
At a glance
A community member using SimpleKeywordTableIndex with PostgresIndexStore finds that a new row is created in the index store table every time from_documents is called, even when the documents are unchanged. Another community member suggests handling duplicates with an IngestionPipeline and an attached SimpleDocumentStore. The original poster then solves the problem by retrieving the previously persisted data with pg_index_store.get_index_struct and feeding it to the SimpleKeywordTableIndex constructor, which lets them reuse the stored data.
Hi, I'm reading about SimpleKeywordTableIndex and doing tests with it. I've been using it with PostgresIndexStore.

PostgresIndexStore seems to be storing the index_store data correctly, as JSON with a table object. But every time I call from_documents, it runs the transformations again and stores a new row in that table, even though the documents are the same.

Shouldn't it use the document hashes to avoid re-indexing and re-storing the table index every time?
3 comments
Every time you call from_documents(), it makes no assumptions about duplicates.

I would use an ingestion pipeline with a docstore attached for handling duplicates

https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html

Python
# Imports assume llama_index >= 0.10
from llama_index.core import SimpleKeywordTableIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# The docstore records document hashes, so reruns of the pipeline
# skip documents that haven't changed.
docstore = SimpleDocumentStore()

pipeline = IngestionPipeline(
    transformations=[...], docstore=docstore
)

nodes = pipeline.run(documents=documents)

index = SimpleKeywordTableIndex(nodes=nodes, ...)
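As a quick sanity check (a sketch, assuming the same docstore is reused between runs), a second pipeline.run() with unchanged documents should produce no new nodes, since the docstore compares document hashes:

Python
# Second run with the same documents: the docstore's hash comparison
# filters out unchanged docs, so no transformations are re-executed.
nodes_again = pipeline.run(documents=documents)
assert len(nodes_again) == 0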
The docstore is working fine for me, updating data only when there's a change to a document.

The index_store isn't. Every time I execute the following code, a new row is created in the index store's Postgres table.

Python
from llama_index.core import SimpleKeywordTableIndex, StorageContext

# The init_*_from_env() calls are my own helpers that build the
# Postgres-backed stores from environment variables.
pg_docstore = init_pg_docstore_from_env()
pg_vector_store = init_pg_vector_store_from_env()
pg_index_store = init_pg_index_store_from_env()

storage_context = StorageContext.from_defaults(
    docstore=pg_docstore,
    vector_store=pg_vector_store,
    index_store=pg_index_store,
)

index = SimpleKeywordTableIndex(
    nodes=[],
    storage_context=storage_context,
)

# Inserts/updates only the documents whose hashes changed.
index.refresh_ref_docs(documents)
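If I'm reading the behavior right (treat this as a sketch), each construction builds a fresh index struct with a new random index_id, and the index store persists it as a new row instead of updating the old one:

Python
# Illustrative: the index gets a fresh UUID each run, so the index
# store accumulates one new struct (row) per execution.
print(index.index_id)                       # different id on every run
print(len(pg_index_store.index_structs()))  # grows by one per run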
Ok, I found a way to reuse the data stored in my Postgres database to rebuild the SimpleKeywordTableIndex.

I use pg_index_store.get_index_struct to fetch the previously created data and feed it to the SimpleKeywordTableIndex constructor.

Python
from llama_index.core import SimpleKeywordTableIndex, StorageContext
from llama_index.core.data_structs import KeywordTable

storage_context = StorageContext.from_defaults(
    docstore=pg_docstore,
    vector_store=pg_vector_store,
    index_store=pg_index_store,
)

# Reuse the previously persisted struct if it exists; otherwise start
# a fresh one under a stable, known id.
index_struct = pg_index_store.get_index_struct("any-keyword-id") or KeywordTable(
    index_id="any-keyword-id"
)

index = SimpleKeywordTableIndex(
    index_struct=index_struct,
    storage_context=storage_context,
)

print(len(index.as_retriever().retrieve("neon")))
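
A related option I haven't fully tested (so treat it as a sketch): once the struct is persisted under a known id, load_index_from_storage can rebuild the index from the same storage context:

Python
from llama_index.core import load_index_from_storage

# Rebuilds the index from the persisted struct; this raises if no
# struct with that id exists in the index store yet.
index = load_index_from_storage(storage_context, index_id="any-keyword-id")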


Leaving this here for anyone who finds this through search in the future.

Thanks for your reply, Logan!