Hi everyone. I have a "problem" with

It seems that ref_doc_id is deprecated. It refers to the source field of the TextNode rather than to the actual document ID:

    @property
    def ref_doc_id(self) -> Optional[str]:
        """Deprecated: Get ref doc id."""
        source_node = self.source_node
        if source_node is None:
            return None
        return source_node.node_id

So how can I get the document id the node came from? If I am submitting a batch of documents, and get all these nodes back...how do I know which one each node belongs to?

so if you are using a hierarchical parser or a sentence-window parser, it will not refer to the doc ID but to the source node

not sure ^^ I think you need to fetch sources hierarchically until you get the actual source

in my test, the highest-level nodes refer to a SOURCE, which is an ObjectType.Document with a document ID
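The "fetch sources hierarchically" idea above can be sketched roughly like this. It is only an illustration under stated assumptions: `StandInNode`, `find_root_doc_id`, and the `store` dict are hypothetical names made up for the example, not LlamaIndex classes or APIs. Each node is assumed to record the ID of whatever it was derived from, and only the original document is flagged as a document.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StandInNode:
    """Hypothetical stand-in for a parsed node; not a LlamaIndex class."""
    node_id: str
    source_id: Optional[str]   # ID of whatever this node was derived from
    is_document: bool = False  # True only for the original document


def find_root_doc_id(node_id: str, store: Dict[str, StandInNode]) -> Optional[str]:
    """Follow source links upward until a document is reached."""
    seen = set()  # guards against accidental cycles in the source links
    current = store.get(node_id)
    while current is not None and current.node_id not in seen:
        if current.is_document:
            return current.node_id
        seen.add(current.node_id)
        if current.source_id is None:
            return None  # chain ended without reaching a document
        current = store.get(current.source_id)
    return None


# Document -> parent chunk -> child chunk, as a hierarchical parser might produce.
store = {
    "doc-1": StandInNode("doc-1", None, is_document=True),
    "parent-1": StandInNode("parent-1", "doc-1"),
    "child-1": StandInNode("child-1", "parent-1"),
}
print(find_root_doc_id("child-1", store))  # prints "doc-1"
```

The walk costs one lookup per level of the hierarchy, which is the overhead being discussed here.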

hmm, I see

I did not find a shortcut for this. Not sure why they would not reference the doc ID on all the nodes

I hacked it by adding the document ID to the metadata, which all nodes inherit anyway, so I get it that way, but it seems there should be a simpler way

yes, if this is what you need. In my case I don't like this solution because I want to be able to delete all nodes referring to a doc

if it is in the metadata it is much slower than a direct access
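For reference, the metadata workaround and the delete-by-document concern can be sketched as below. This is a minimal illustration with plain dicts as stand-ins for nodes; `split_with_doc_id` and `delete_by_doc_id` are hypothetical helpers, not LlamaIndex functions, and a real vector store may filter on metadata very differently.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]  # hypothetical stand-in: {"text": ..., "metadata": {...}}


def split_with_doc_id(doc_id: str, chunks: List[str], doc_metadata: Dict[str, Any]) -> List[Node]:
    """Copy the document's metadata, plus its ID, onto every chunk."""
    inherited = {**doc_metadata, "doc_id": doc_id}
    return [{"text": chunk, "metadata": dict(inherited)} for chunk in chunks]


def delete_by_doc_id(nodes: List[Node], doc_id: str) -> List[Node]:
    """Drop every node stamped with doc_id; this scans all nodes, so it is O(n)."""
    return [n for n in nodes if n["metadata"].get("doc_id") != doc_id]


nodes = split_with_doc_id("doc-a", ["first", "second"], {"author": "x"})
nodes += split_with_doc_id("doc-b", ["third"], {})
remaining = delete_by_doc_id(nodes, "doc-a")
print([n["text"] for n in remaining])  # prints ['third']
```

The full scan in `delete_by_doc_id` is the cost being objected to: without an index keyed on the document ID, every node has to be inspected.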

for sure but that's why I am wondering why there isn't a better native way

traversing a tree seems too much

true

Did you find a way to specify a document ID?

When I create a Document, it automatically generates a UUID, and I did not find a way to specify it myself

document.doc_id when you create the document

Oh wow, not sure why the docstring says that, it's definitely not deprecated

So when you create a document, it gets a randomly generated ID if you don't explicitly give it one

When you run a node parser/text splitter, node.ref_doc_id is made to point to the parent document that the node came from

If you want to keep track of things better, give your documents actual IDs

This could be a file path
SimpleDirectoryReader(..., filename_as_id=True).load_data()

Or setting it however you want
document.id_ = "123"

@Logan M yes, this is what I am doing now. I am assigning the doc_id explicitly. This issue seems random and isn't consistent so I am trying to figure out how to reproduce it. I am wondering if it may have anything to do with cached pipelines....?

hmmm, maybe, if you are using a cache explicitly 🤔

I'll work on reproducing it more consistently

@Logan M actually node.ref_doc_id seems to be referring to the source node ID, which is not always the document ID itself (for instance, with the hierarchical parser it is the parent node, not the document ID). Is there a way for a child node to actually refer to the original doc_id rather than the source node?