Folks I have using PGVector as the

Shrikar

Folks, I am using PGVector as the vector store for LlamaIndex. However, when I add/refresh documents, a new row is created in the table. Why is that? I was under the impression that if the content and the IDs are the same, new embeddings won't be created and no new rows will be created in the Postgres database.


Logan M

Indeed, if the document IDs and content are the same, it won't be inserted. BUT this depends on you saving/loading a docstore to manage that logic.

Logan M

I think the easiest approach though is to use the ingestion pipeline

Logan M

It makes it much clearer what's going on

Logan M

https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html

Shrikar

Isn't storing to the docstore handled if we provide the storage and service context?

Shrikar

Are there any examples of index creation with pipelines?

Logan M

When using a vector db integration, the docstore is usually disabled to simplify storage (unless you set store_nodes_override=True in the constructor).

With a pipeline, just attach a vector db and a docstore to the pipeline and run it.

Then you can create the index with index = VectorStoreIndex.from_vector_store(vector_store)

Just don't forget to save/load your docstore somewhere, or use a remote docstore (MongoDB, Redis, etc.).
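To make the docstore's role concrete, here is a minimal plain-Python sketch of the bookkeeping described above. It is not LlamaIndex code: `ToyDocstore`, `content_hash`, `refresh`, and the `rows` dict standing in for the Postgres table are all hypothetical names invented for illustration. The idea is only that a saved map of document ID to content hash is what lets a later run skip unchanged documents.

```python
import hashlib
import json
from pathlib import Path


class ToyDocstore:
    """Hypothetical stand-in for a docstore: maps doc_id -> content hash."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously saved hashes so a new process remembers old documents.
        self.hashes = json.loads(self.path.read_text()) if self.path.exists() else {}

    def persist(self):
        # Without this step every run starts empty and re-inserts everything.
        self.path.write_text(json.dumps(self.hashes))


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh(docs, docstore, rows):
    """Insert new docs, replace changed ones, skip unchanged ones.

    docs: dict of doc_id -> text.
    rows: dict standing in for the vector table (an assumption of this sketch).
    Returns one bool per document: True if a row was written.
    """
    written = []
    for doc_id, text in docs.items():
        new_hash = content_hash(text)
        if docstore.hashes.get(doc_id) == new_hash:
            written.append(False)  # same ID and same content: nothing to do
            continue
        rows[doc_id] = text  # overwriting models "delete old row, insert new"
        docstore.hashes[doc_id] = new_hash
        written.append(True)
    docstore.persist()
    return written
```

Running `refresh` twice with the same documents and a docstore loaded from the same path writes rows only on the first call. If the docstore file is never saved, or is saved before any documents were processed, the second run has nothing to compare against and inserts everything again.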

Shrikar

I am doing this:

Plain Text

docstore = SimpleDocumentStore()
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, docstore=docstore
)
index = VectorStoreIndex.from_documents(
    documents=[],
    service_context=service_context,
    storage_context=storage_context,
    show_progress=True,
    store_nodes_override=True,
)
storage_context.persist(persist_dir=datadir)

Shrikar

Still, the docstore.json in the persist directory is empty:

Shrikar

Plain Text

{}

Shrikar

Here is my full code:

Plain Text

llm = LlamaOpenAI(temperature=0, model="gpt-4")
embed_model = OpenAIEmbedding()
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=256)
prompt_helper = PromptHelper(
    context_window=8192,
    num_output=256,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None,
)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    text_splitter=text_splitter,
    prompt_helper=prompt_helper,
)

url = make_url(connection_string)
port = url.port or 5432
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=port,
    user=url.username,
    table_name="llama_index",
    embed_dim=1536,  # openai embedding dimension
)


docstore = SimpleDocumentStore()
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, docstore=docstore
)
index = VectorStoreIndex.from_documents(
    documents=[],
    service_context=service_context,
    storage_context=storage_context,
    show_progress=True,
    store_nodes_override=True,
)
storage_context.persist(persist_dir=datadir)
print(index.refresh_ref_docs(documents))

Shrikar

Let me know if I am missing something. I didn't explicitly add the nodes to the document store, mainly because I am passing store_nodes_override=True.

Shrikar

I tried persisting the docstore and made sure I changed the node ID to a hash, so that the same content gets the same ID. Still, I see duplicate entries being created when using index.insert_nodes(nodes).

Attachment
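For reference, the hash-based ID approach described in the message above might look like the following plain-Python sketch. `stable_node_id` and `only_new` are hypothetical helpers, not LlamaIndex functions. The sketch also assumes (this is not confirmed in the thread) that the vector table does not enforce uniqueness on the node ID, in which case stable IDs only prevent duplicates if something filters on them before inserting.

```python
import hashlib


def stable_node_id(text, doc_id=""):
    """Derive an ID from content so the same text always maps to the same ID."""
    return hashlib.sha256(f"{doc_id}\x00{text}".encode("utf-8")).hexdigest()


def only_new(chunks, existing_ids):
    """Return (node_id, text) pairs whose ID is not already stored.

    existing_ids: IDs already present in the store, fetched by the caller
    (how they are fetched is left open in this sketch).
    """
    seen = set(existing_ids)
    fresh = []
    for text in chunks:
        node_id = stable_node_id(text)
        if node_id not in seen:
            seen.add(node_id)  # also drops repeats within the same batch
            fresh.append((node_id, text))
    return fresh
```

Only the pairs returned by `only_new` would then be embedded and inserted, so an identical second batch results in no new rows.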

Logan M

Just use the pipeline tbh 😅 it's going to be more janky than it needs to be with the index directly

Shrikar

Cool, will give it a try.
