it seems that ref_doc_id is deprecated. It refers to the source field of the TextNode rather than to the actual document id
@property
def ref_doc_id(self) -> Optional[str]:
    """Deprecated: Get ref doc id."""
    source_node = self.source_node
    if source_node is None:
        return None
    return source_node.node_id
So how can I get the document id the node came from? If I am submitting a batch of documents, and get all these nodes back...how do I know which one each node belongs to?
so if you are using a hierarchical parser or sentence to window it will not refer to the doc id but to the source node
not sure ^^ I think you need to fetch sources hierarchically until you get the actual source
in my test, the highest-level nodes refer to a SOURCE which is an ObjectType.Document with a documentId
did not find a shortcut to this. not sure why they would not refer the docId for all the nodes
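The walk described above could look roughly like this. It's a minimal sketch using stand-in classes, not LlamaIndex's own types: `StubNode`, `store`, and `find_root_doc_id` are hypothetical names, and in a real setup the lookup would go through whatever docstore holds your nodes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StubNode:
    """Stand-in for a node or document (hypothetical, not a LlamaIndex class)."""
    node_id: str
    source_id: Optional[str] = None  # id of the SOURCE relationship, if any
    is_document: bool = False


def find_root_doc_id(node: StubNode, store: Dict[str, StubNode]) -> Optional[str]:
    """Follow SOURCE links upward until a document is reached."""
    seen = set()
    current = node
    while current is not None and current.node_id not in seen:
        if current.is_document:
            return current.node_id
        seen.add(current.node_id)  # guard against accidental cycles
        current = store.get(current.source_id) if current.source_id else None
    return None


# Toy hierarchy: leaf -> parent -> doc-1
store = {
    "doc-1": StubNode("doc-1", is_document=True),
    "parent": StubNode("parent", source_id="doc-1"),
    "leaf": StubNode("leaf", source_id="parent"),
}
root_id = find_root_doc_id(store["leaf"], store)
```

With the toy hierarchy above, the leaf resolves to the top-level document; a node with no source at all resolves to `None`.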
I hacked it by adding the document id to metadata which all nodes inherit anyway so I get it that way but it seems there should be a simpler way
yes if this is what you need. In my case I don't like this solution because I want to be able to delete all nodes referring to a doc
if it is in the metadata it is much slower than a direct access
for sure but that's why I am wondering why there isn't a better native way
traversing a tree seems too much
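One way around both the tree walk and the metadata scan might be to keep your own mapping from document ID to node IDs at ingestion time. This is a sketch of the idea only: `DocNodeIndex` and its methods are made-up names, not something LlamaIndex provides, and you'd still have to call your store's real delete for each returned ID.

```python
from collections import defaultdict
from typing import Dict, List, Set


class DocNodeIndex:
    """Hypothetical side index: document id -> ids of nodes derived from it."""

    def __init__(self) -> None:
        self._by_doc: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, node_id: str) -> None:
        # Record the link once, when the node is created.
        self._by_doc[doc_id].add(node_id)

    def pop_doc(self, doc_id: str) -> List[str]:
        """Remove a document's entry and return its node ids for deletion."""
        return sorted(self._by_doc.pop(doc_id, set()))


index = DocNodeIndex()
index.add("doc-1", "n1")
index.add("doc-1", "n2")
index.add("doc-2", "n3")
to_delete = index.pop_doc("doc-1")
```

The lookup is a single dict access, so deleting everything for one document doesn't need a scan over all nodes.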
Did you find a way to specify a document ID?
When I create a Document, it generates automatically a UUID, and I did not find a way to specify it myself
document.doc_id when you create the document
Oh wow, not sure why the docstring says that, it's definitely not deprecated
So when you create a document, it gets a randomly generated ID if you don't explicitly give it one
When you run a node parser/text splitter, node.ref_doc_id is made to point to the parent document that the node came from
If you want to keep track of things better, give your documents actual IDs
This could be a file path
SimpleDirectoryReader(..., filename_as_id=True).load_data()
Or setting it however you want
document.id_ = "123"
@Logan M yes, this is what I am doing now. I am assigning the doc_id explicitly. This issue seems random and isn't consistent so I am trying to figure out how to reproduce it. I am wondering if it may have anything to do with cached pipelines....?
hmmm, maybe, if you are using a cache explicitly
I'll work on reproducing it more consistently
@Logan M actually node.ref_doc_id seems to be referring to the source node ID, which is not always the document id itself (for instance, with the hierarchical parser it is the parent node, not the document id). Is there a way for a child node to actually refer to the original doc_id rather than the source node?
Only if you manually set it, or add it to the metadata of your nodes
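For the metadata route, here's a rough sketch of the idea with plain dicts standing in for documents and nodes. The `split` helper and the `source_doc_id` key are invented for illustration; a real node parser would do the splitting, and the sketch just assumes nodes inherit the document's metadata as described earlier in the thread.

```python
from collections import defaultdict
from typing import Dict, List


def split(doc: Dict, chunk_size: int = 10) -> List[Dict]:
    """Toy splitter: each chunk inherits a copy of the document's metadata."""
    text = doc["text"]
    return [
        {"text": text[i:i + chunk_size], "metadata": dict(doc["metadata"])}
        for i in range(0, len(text), chunk_size)
    ]


docs = [
    {"id": "a.txt", "text": "x" * 25, "metadata": {}},
    {"id": "b.txt", "text": "y" * 5, "metadata": {}},
]

nodes = []
for doc in docs:
    doc["metadata"]["source_doc_id"] = doc["id"]  # stamp before splitting
    nodes.extend(split(doc))

# Group the batch's nodes by the document they came from.
by_doc = defaultdict(list)
for node in nodes:
    by_doc[node["metadata"]["source_doc_id"]].append(node)
```

Because the ID is stamped before splitting, every node carries it no matter how many parser levels sit between the node and the document.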