Updated 3 months ago

Hi, I'm reading about SimpleKeywordTableIndex and running some tests with it. I've been using it with PostgresIndexStore.

PostgresIndexStore seems to store the index_store data correctly, in JSON format with a table object. But every time I call from_documents, it runs the transformations again and stores a new row in that table, even though the documents are the same.

Doesn't it use the document hashes to avoid re-indexing and storing the table index every time?
3 comments
Every time you call from_documents(), it makes no assumptions about duplicates.

I would use an ingestion pipeline with a docstore attached to handle duplicates:

https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html

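# Import paths assume llama-index >= 0.10 (the llama_index.core namespace).
from llama_index.core import SimpleKeywordTableIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

# The docstore keeps document hashes so duplicates can be detected on later runs.
docstore = SimpleDocumentStore()

pipeline = IngestionPipeline(
    transformations=[...], docstore=docstore
)

# Run the transformations; documents already seen unchanged are skipped.
nodes = pipeline.run(documents=documents)

index = SimpleKeywordTableIndex(nodes=nodes, ...)
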
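To picture what a docstore-backed pipeline does about duplicates, here is a minimal sketch in plain Python. It is a toy model, not LlamaIndex's actual implementation: `ToyDocstore`, `needs_processing` and `run_pipeline` are hypothetical names, and the only assumption is that the store remembers one content hash per document ID and skips documents whose hash has not changed.

```python
import hashlib


class ToyDocstore:
    """Hypothetical in-memory stand-in for a docstore (not the LlamaIndex class)."""

    def __init__(self):
        self.hashes = {}  # doc_id -> content hash seen on the last run

    def needs_processing(self, doc_id, text):
        """Return True if the document is new or its content changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return False  # same ID, same content: skip it
        self.hashes[doc_id] = digest
        return True


def run_pipeline(docstore, documents):
    """Return the IDs of the documents that would actually be transformed."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if docstore.needs_processing(doc_id, text)
    ]


store = ToyDocstore()
docs = {"a": "neon is a gas", "b": "postgres is a database"}

first = run_pipeline(store, docs)    # both documents are new
second = run_pipeline(store, docs)   # nothing changed, nothing to do
docs["b"] = "postgres is a relational database"
third = run_pipeline(store, docs)    # only the edited document is reprocessed

print(first, second, third)
```

Note that this kind of check only covers documents. It says nothing about the index struct that the index itself writes to the index store.
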
The docstore is working fine for me, updating data only when there's a change to a document.

The index_store isn't. Every time I run the following code, a new row is created in the index store's Postgres table.

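from llama_index.core import SimpleKeywordTableIndex, StorageContext

# Helpers (not shown) that build the Postgres-backed stores from the environment.
pg_docstore = init_pg_docstore_from_env()
pg_vector_store = init_pg_vector_store_from_env()
pg_index_store = init_pg_index_store_from_env()

storage_context = StorageContext.from_defaults(
    docstore=pg_docstore,
    vector_store=pg_vector_store,
    index_store=pg_index_store,
)

# No index_struct is passed here, so a fresh one is built and saved on every run.
index = SimpleKeywordTableIndex(
    nodes=[],
    storage_context=storage_context,
)

# Insert new documents and update changed ones, based on the docstore.
index.refresh_ref_docs(documents)
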
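A likely explanation for the extra rows, sketched as a toy model: when no index struct is passed in, a new one is created, and its ID defaults to a random UUID, so each run is saved under a different key. The classes below (`ToyIndexStruct`, `ToyIndexStore`, `build_index`) are hypothetical stand-ins written for illustration, not LlamaIndex code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyIndexStruct:
    # Assumption for this sketch: the ID defaults to a random UUID.
    index_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    table: dict = field(default_factory=dict)


class ToyIndexStore:
    """Hypothetical index store: one row per index_id."""

    def __init__(self):
        self.rows = {}

    def add_index_struct(self, struct):
        self.rows[struct.index_id] = struct  # upsert keyed by the ID


def build_index(store, index_struct=None):
    """Mimic an index constructor that saves its struct to the store."""
    if index_struct is None:
        index_struct = ToyIndexStruct()  # new random ID on every call
    store.add_index_struct(index_struct)
    return index_struct


store = ToyIndexStore()
run_1 = build_index(store)  # first script run
run_2 = build_index(store)  # second script run, same documents

print(len(store.rows))
```

Because the two runs produce different IDs, the store ends up with two rows even though nothing about the documents changed.
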
OK, I found a way to reuse the data stored in my Postgres to rebuild the SimpleKeywordTableIndex.

I use pg_index_store.get_index_struct to get the previously created data and pass it to the SimpleKeywordTableIndex constructor.

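from llama_index.core import SimpleKeywordTableIndex, StorageContext
from llama_index.core.data_structs import KeywordTable

# pg_docstore, pg_vector_store and pg_index_store are created as in my previous snippet.
storage_context = StorageContext.from_defaults(
    docstore=pg_docstore,
    vector_store=pg_vector_store,
    index_store=pg_index_store,
)

# Load the saved struct by a fixed ID, or start an empty one under that same ID.
index_struct = pg_index_store.get_index_struct("any-keyword-id") or KeywordTable(
    index_id="any-keyword-id"
)

# Passing index_struct makes the index reuse it instead of building a new one.
index = SimpleKeywordTableIndex(
    index_struct=index_struct,
    storage_context=storage_context,
)

# Sanity check: the reloaded index still returns results.
print(len(index.as_retriever().retrieve("neon")))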
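
The same get-or-create pattern, reduced to a self-contained toy so the effect of the fixed ID is easy to see. `ToyKeywordTable`, `ToyIndexStore` and `load_or_create` are hypothetical names for illustration only; the sketch assumes the store returns `None` for an unknown ID and upserts by ID.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ToyKeywordTable:
    index_id: str
    table: Dict[str, Set[str]] = field(default_factory=dict)  # keyword -> node IDs


class ToyIndexStore:
    """Hypothetical index store: one row per index_id."""

    def __init__(self):
        self._rows: Dict[str, ToyKeywordTable] = {}

    def get_index_struct(self, index_id: str) -> Optional[ToyKeywordTable]:
        return self._rows.get(index_id)  # None when nothing was saved yet

    def add_index_struct(self, struct: ToyKeywordTable) -> None:
        self._rows[struct.index_id] = struct  # upsert keyed by the ID

    def row_count(self) -> int:
        return len(self._rows)


def load_or_create(store: ToyIndexStore, index_id: str) -> ToyKeywordTable:
    """Reuse the saved struct if present, else create one under the same ID."""
    struct = store.get_index_struct(index_id)
    if struct is None:
        struct = ToyKeywordTable(index_id=index_id)
    store.add_index_struct(struct)
    return struct


store = ToyIndexStore()

first = load_or_create(store, "any-keyword-id")         # first run: created
first.table.setdefault("neon", set()).add("node-1")     # pretend we indexed a node

second = load_or_create(store, "any-keyword-id")        # second run: reused

print(store.row_count(), sorted(second.table["neon"]))
```

An explicit `is None` check is used here instead of `or`, which avoids surprises if a struct type ever evaluates as falsy when empty.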


Leaving this here for anyone who finds it through search in the future.

Thanks for your reply, Logan!