Given an existing VectorIndex and DocStore (both in Postgres), how can I get all of the nodes extracted from a document using llama-index abstractions?

I'm able to get a document like this and confirm that it has content:

Plain Text
doc = docstore.get_document(llama_id)
assert doc is not None
doc_content = doc.get_content()
assert doc_content is not None



But when I do this:

Plain Text
docstore_ref_doc_info = docstore.get_all_ref_doc_info()


docstore_ref_doc_info is an empty object

Also, I get this error when I call index.ref_doc_info:
Plain Text
NotImplementedError: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet.


Is there a workaround for this?
21 comments
docstore.docs will fetch everything in the docstore
It can be an expensive operation, though
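For reference, a minimal brute-force sketch of that approach (assuming llama_id holds the source document's id and that each node's ref_doc_id points back to it):

Plain Text
# Sketch: scan the whole docstore and keep nodes from one source document.
# Expensive: docstore.docs loads every stored node/document into memory.
all_nodes = docstore.docs  # dict of node_id -> node
child_nodes = [
    node
    for node in all_nodes.values()
    if node.ref_doc_id == llama_id
]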
Our docstore has close to 100k documents in it, so that's not an option.

My real question is: how can I get all the nodes that were extracted from a document?

There has got to be a way to do that right?
I also noticed that the pipeline.run method stores the input_nodes before running any transformations, meaning it can't add relationship information (by listing all of the nodes generated from a document as Relationship.CHILD)
hmmm

vector_store.get_nodes(filters=MetadataFilters(filters=[MetadataFilter(key="ref_doc_id", value="...", operator="==")]))
jk, that function isn't implemented yet for the postgres vector store
I'm out of ideas. Ideally that method gets implemented for postgres
Also, I don't think the docstore with the ingestion pipeline is storing the nodes, only the top-level documents?
I might be forgetting how that works
by "that method" are you referring to .get_nodes?
It's in the base class, and in a few vector stores, but hasn't made its way to every vector store yet
The pipeline.run method first inserts the source_nodes, deduping where necessary; then it runs the transformations on those nodes and inserts the results into the vector store.
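Roughly, the flow described above looks like this (a simplified sketch, not the actual library internals):

Plain Text
# Simplified sketch of pipeline.run's flow as described above:
nodes = source_nodes                 # the documents/nodes passed in
docstore.add_documents(nodes)        # upsert/dedup happens here, pre-transform
for transform in transformations:    # e.g. a splitter, then an embedding step
    nodes = transform(nodes)         # each TransformComponent is callable
vector_store.add(nodes)              # only the transformed nodes land here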
Would it work to just retrieve with some random vector and filter by those criteria?
Yea, that would be a good workaround. You'd just need to set the top-k large enough to capture all nodes with that metadata
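Something like this minimal sketch (the embedding dimension, and ref_doc_id being present in node metadata, are assumptions):

Plain Text
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
    VectorStoreQuery,
)

# Workaround sketch: query with a dummy vector and rely on the filter.
query = VectorStoreQuery(
    query_embedding=[0.0] * 1536,  # any vector; we only care about the filter
    similarity_top_k=10_000,       # large enough to cover every node of the doc
    filters=MetadataFilters(
        filters=[MetadataFilter(key="ref_doc_id", value=llama_id, operator="==")]
    ),
)
result = vector_store.query(query)  # returns a VectorStoreQueryResult
nodes = result.nodes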
Seems like a pretty funky way to handle a fairly straightforward requirement. What do you think about inserting the source_nodes into the docstore after the transformations have run, and then adding all the transformed nodes as children in the document's relationships field? Then every document could easily tell you the ids of all its child nodes.
IMO all documents should have references to their children by default. It would also make deleting/reingesting/updating much easier
Yea it could. Most people wouldn't want the duplication though (since nodes are serialized into the vector store)
I welcome a PR or github issue to scope this out a little more
Yeah, it seems like there should be an option to not serialize the node relationships. I've noticed a TON of duplication: I want to keep track of the order of the nodes in my documents, so naturally I record previous, next, and parent (which points back to the document), and that results in a ton of bloat.
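For context, a sketch of recording that ordering by hand (this is the pattern that produces the serialization bloat mentioned above; nodes and doc are assumed to come from an earlier extraction step):

Plain Text
from llama_index.core.schema import NodeRelationship

# Record parent + previous/next pointers on each extracted node.
for i, node in enumerate(nodes):
    node.relationships[NodeRelationship.PARENT] = doc.as_related_node_info()
    if i > 0:
        node.relationships[NodeRelationship.PREVIOUS] = nodes[i - 1].as_related_node_info()
    if i < len(nodes) - 1:
        node.relationships[NodeRelationship.NEXT] = nodes[i + 1].as_related_node_info()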