Given an existing VectorIndex and DocStore (both in postgres). How can I get all of the nodes extracted from a document using llama-index abstractions?

I'm able to get a document like this and confirm it's in the docstore:

Plain Text
doc = docstore.get_document(llama_id)
assert doc is not None
doc_content = doc.get_content()
assert doc_content is not None

But when I do this:

Plain Text
docstore_ref_doc_info = docstore.get_all_ref_doc_info()


docstore_ref_doc_info is an empty object

Also, I get this error when I call index.ref_doc_info:
Plain Text
NotImplementedError: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet.


Is there a workaround for this?
docstore.docs will fetch everything in the docstore
Can be an expensive operation though
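As a rough sketch of that approach (assuming the extracted nodes were actually persisted to the docstore, and with target_ref_doc_id as a placeholder for your document's id):

Python
all_nodes = docstore.docs  # {node_id: node}; loads the entire docstore into memory
target_ref_doc_id = "..."  # placeholder: the llama-index id of your source document
children = [
    node
    for node in all_nodes.values()
    if node.ref_doc_id == target_ref_doc_id
]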
Our docstore has close to 100k documents in it, so that's not an option.

My real question is: how can I get all the nodes that were extracted from a document?

There has got to be a way to do that, right?
I also noticed that the pipeline.run method stores the input_nodes before running any transformations, meaning it can't add relationship information (by listing all of the nodes generated from each input as Relationship.CHILD).
hmmm

vector_store.get_nodes(filters=MetadataFilters(filters=[MetadataFilter(key="ref_doc_id", value="...", operator="==")]))
jk that function isn't implemented yet for postgres vector store
I'm out of ideas. Ideally that method gets implemented for postgres
also, I don't think the ingestion pipeline is storing the nodes in the docstore, only the top-level documents?
I might be forgetting how that works
by "that method" are you referring to .get_nodes?
It's in the base class, and in a few vector stores, but hasn't made its way to every vector store yet
the pipeline.run method first inserts the source_nodes, deduping where necessary, then runs the transformations on those nodes and inserts the results into the vector store.
Would it work to just retrieve with some random vector and filter by those criteria?
Yea that would be a good workaround. Just need to set the top-k to be large enough to capture all nodes with that metadata
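As a sketch of that workaround, querying the vector store directly (embed_dim and the top-k value here are assumptions to adjust for your setup; "..." stays a placeholder for the ref_doc_id):

Python
import random

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
from llama_index.core.vector_stores.types import VectorStoreQuery

embed_dim = 1536  # assumption: the dimension your embeddings were stored with
random_embedding = [random.random() for _ in range(embed_dim)]

query = VectorStoreQuery(
    query_embedding=random_embedding,
    similarity_top_k=10_000,  # large enough to capture every node for the doc
    filters=MetadataFilters(
        filters=[MetadataFilter(key="ref_doc_id", value="...", operator="==")]
    ),
)
nodes = vector_store.query(query).nodes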
Seems like a pretty funky way to handle a fairly straightforward requirement. What do you think about inserting the source_nodes into the docstore after the transformations have run, and then adding all the transformed nodes as children in the document's relationships field? Then every document could easily tell you the IDs of all its child nodes.
IMO all documents should have references to their children by default. It would also make deleting/reingesting/updating much easier
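As a sketch, you can wire that up manually today (this is not current llama-index behavior; pipeline and doc are assumed to be your existing pipeline and source document):

Python
from llama_index.core.schema import NodeRelationship, RelatedNodeInfo

# Run the transformations, then record each resulting node as a CHILD
# of its source document before persisting the document again.
nodes = pipeline.run(documents=[doc])
doc.relationships[NodeRelationship.CHILD] = [
    RelatedNodeInfo(node_id=n.node_id) for n in nodes
]
docstore.add_documents([doc])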
Yea it could. Most people wouldn't want the duplication though (since nodes are serialized into the vector store)
I welcome a PR or github issue to scope this out a little more
Yeah, it seems like there should be an option to not serialize the node relationships. I've noticed a TON of duplication: I want to keep track of the order of the nodes in my documents, so naturally I record previous, next, and parent (which points to the document), and that results in a ton of bloat.
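For reference, a sketch of the prev/next/parent wiring being described (nodes is assumed to be the ordered list of nodes split from doc); every RelatedNodeInfo below gets serialized into each node stored in the vector store, which is where the bloat comes from:

Python
from llama_index.core.schema import NodeRelationship, RelatedNodeInfo

for i, node in enumerate(nodes):
    node.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=doc.doc_id)
    if i > 0:
        node.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(
            node_id=nodes[i - 1].node_id
        )
    if i < len(nodes) - 1:
        node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
            node_id=nodes[i + 1].node_id
        )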