
Updated 9 months ago

Hi everyone. I have a "problem" with TextNode.ref_doc_id. I am not sure if this is a bug or I just don't understand how this is supposed to work?

I have situations right now where after submitting a Document to my data ingest pipeline of transformers (no storage) of let's say doc_id=123 I get TextNodes from that doc which do not all have ref_doc_id=123. I get some nodes returned with ref_doc_ids of previously ingested documents or IDs I don't recognize at all. What is the expected behavior?

I have checked I am not writing to this field anywhere in my pipeline. So, am I misunderstanding the behavior around TextNode.ref_doc_id?
25 comments
It seems that ref_doc_id is deprecated. It returns the ID of the TextNode's source node rather than the actual document ID:
@property
def ref_doc_id(self) -> Optional[str]:
    """Deprecated: Get ref doc id."""
    source_node = self.source_node
    if source_node is None:
        return None
    return source_node.node_id
So how can I get the document id the node came from? If I am submitting a batch of documents, and get all these nodes back...how do I know which one each node belongs to?
so if you are using a hierarchical parser or a sentence window parser, it will not refer to the doc ID but to the source node
not sure ^^ I think you need to fetch sources hierarchically until you get to the actual source
in my test, the highest-level nodes refer to a SOURCE that is an ObjectType.DOCUMENT with the document ID
hmm, I see
I did not find a shortcut for this. Not sure why they would not store the doc ID on all the nodes
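For anyone who wants to try the "fetch sources hierarchically" idea, here is a minimal sketch. It assumes each node exposes `node_id` and a `source_node` reference that itself has a `node_id`, as in the property quoted above. The `SourceRef` and `Node` classes and the `root_source_id` helper are hypothetical stand-ins so the example runs without the library; they are not part of its API. The walk stops at the first source ID that is not among the parsed nodes and treats that ID as the document's.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SourceRef:
    """Hypothetical stand-in for the library's reference to a source node."""
    node_id: str


@dataclass
class Node:
    """Hypothetical stand-in for a TextNode; only the fields used here."""
    node_id: str
    source_node: Optional[SourceRef] = None


def root_source_id(node, nodes_by_id: Dict[str, Node]) -> Optional[str]:
    """Follow source links upward and return the topmost source ID.

    Returns None when the starting node has no source at all.
    """
    current = node
    seen = {node.node_id}
    top_id = None
    while current is not None and current.source_node is not None:
        top_id = current.source_node.node_id
        if top_id in seen:
            # Guard against malformed data that links back on itself.
            raise ValueError(f"cycle in source links at {top_id!r}")
        seen.add(top_id)
        # A source that is not a parsed node (e.g. the document) yields None
        # here, which ends the walk with top_id holding that source's ID.
        current = nodes_by_id.get(top_id)
    return top_id


# Document "123" -> parent node "p1" -> child node "c1".
parent = Node("p1", SourceRef("123"))
child = Node("c1", SourceRef("p1"))
orphan = Node("x1")
nodes_by_id = {n.node_id: n for n in (parent, child, orphan)}

child_root = root_source_id(child, nodes_by_id)
parent_root = root_source_id(parent, nodes_by_id)
orphan_root = root_source_id(orphan, nodes_by_id)
```

This needs a lookup of all parsed nodes by ID, which is the extra bookkeeping people in this thread are unhappy about.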
I hacked it by adding the document id to metadata which all nodes inherit anyway so I get it that way but it seems there should be a simpler way
yes, if this is what you need. In my case I don't like this solution because I want to be able to delete all nodes referring to a doc
if it is in the metadata it is much slower than a direct access
for sure but that's why I am wondering why there isn't a better native way
traversing a tree seems too much
Did you find a way to specify a document ID?
When I create a Document, it automatically generates a UUID, and I did not find a way to specify it myself
document.doc_id when you create the document
Oh wow, not sure why the docstring says that, it's definitely not deprecated
So when you create a document, it gets a randomly generated ID if you don't explicitly give it one

When you run a node parser/text splitter, node.ref_doc_id is made to point to the parent document that the node came from
If you want to keep track of things better, give your documents actual IDs

This could be a file path
SimpleDirectoryReader(..., filename_as_id=True).load_data()

Or setting it however you want
document.id_ = "123"
@Logan M yes, this is what I am doing now. I am assigning the doc_id explicitly. This issue seems random and isn't consistent so I am trying to figure out how to reproduce it. I am wondering if it may have anything to do with cached pipelines....?
hmmm, maybe, if you are using a cache explicitly 🤔
I'll work on reproducing it more consistently
@Logan M actually node.ref_doc_id seems to be referring to the source node ID, which is not always the document ID itself (for instance, with the hierarchical parser it is the parent node, not the document ID). Is there a way for a child node to actually refer to the original doc_id rather than the source node?
Only if you manually set it, or add it to the metadata of your nodes
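A rough sketch of the metadata approach mentioned in this thread: copy each document's ID into its metadata before parsing, then group the returned nodes by that key. It assumes documents expose `doc_id` and `metadata`, and that nodes inherit their document's metadata, as described above. The key name `source_doc_id`, both helper functions, and the `SimpleNamespace` objects are hypothetical stand-ins so the example runs without the library installed.

```python
from collections import defaultdict
from types import SimpleNamespace

DOC_ID_KEY = "source_doc_id"  # hypothetical metadata key, not a library convention


def stamp_doc_id(documents):
    """Copy each document's ID into its own metadata before parsing."""
    for doc in documents:
        doc.metadata[DOC_ID_KEY] = doc.doc_id
    return documents


def group_nodes_by_doc(nodes):
    """Bucket parsed nodes by the document ID carried in their metadata."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.metadata.get(DOC_ID_KEY)].append(node)
    return dict(groups)


# Stand-in documents with explicit IDs.
docs = stamp_doc_id([
    SimpleNamespace(doc_id="123", metadata={}),
    SimpleNamespace(doc_id="456", metadata={}),
])

# Pretend a parser produced these nodes, each inheriting its document's metadata.
nodes = [
    SimpleNamespace(node_id="a", metadata=dict(docs[0].metadata)),
    SimpleNamespace(node_id="b", metadata=dict(docs[0].metadata)),
    SimpleNamespace(node_id="c", metadata=dict(docs[1].metadata)),
]

by_doc = group_nodes_by_doc(nodes)
```

This answers "which document did each node in the batch come from" without walking source links, at the cost of carrying an extra metadata field on every node.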