Updated 6 months ago

I make a copy of a docx file with

I make a copy of a docx file with Windows Explorer, then read both files into documents via SimpleDirectoryReader.
I want to check whether both files are the same.

If I compare document.get_content() for both, the results are equal,
but the document.hash values are different.

When I ask Bing Chat what goes into generating the hash, it replies: only the content.

What is really used for building the hash?
Where is this documented?

I used llama_index 0.7.4.
Many thanks for your help.

4 comments

Logan M

The hash is based on a) the text and b) the metadata (I looked at the code lol)

Looking at the code for the docx loader, it's inserting the filename as metadata, so if the filename changed, the hash would not be equal
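
A minimal sketch of the behaviour described above, not the library's actual code: it assumes the hash is a SHA-256 digest over the text concatenated with the string form of the metadata. The function `sketch_hash` and the `file_name` key are hypothetical names used only for illustration.

```python
import hashlib


def sketch_hash(text: str, metadata: dict) -> str:
    """Hash the text together with the metadata (assumed scheme, for illustration)."""
    identity = text + str(metadata)
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


text = "Same content in both files."
original = sketch_hash(text, {"file_name": "report.docx"})
copy = sketch_hash(text, {"file_name": "report - Copy.docx"})

# Identical text, but a different file name in the metadata: the hashes differ.
hashes_differ = original != copy
# Identical text and identical metadata: the hashes match.
same_metadata_matches = original == sketch_hash(text, {"file_name": "report.docx"})
```

Under this assumption, two documents with equal content only get equal hashes when their metadata is equal too.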

Dieter

That is what I assumed: that the metadata is used. OK, I can do a workaround.
Thanks a lot for your quick response.
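
One possible workaround, sketched under the assumption that only the text matters for the comparison: hash the content string on its own and leave the metadata out. `content_hash` is a hypothetical helper; in practice the two strings would come from document.get_content() on each document.

```python
import hashlib


def content_hash(text: str) -> str:
    """Digest of the text only, so the file name and other metadata play no role."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Stand-ins for the text returned by document.get_content() on each document.
text_original = "Same content in both files."
text_copy = "Same content in both files."

files_match = content_hash(text_original) == content_hash(text_copy)
```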

Dieter

Hi Logan,
one additional question: what is the idea behind including the metadata in the hash? What benefit do I get from this?
Thanks

Logan M

Metadata is injected into the text of the node. So if the metadata changes, the hash probably should too.
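
To illustrate why that matters, here is a rough sketch of metadata being prepended to a node's text. The `key: value` layout and the `render_node_text` helper are assumptions made for this example, not the library's actual template.

```python
def render_node_text(text: str, metadata: dict) -> str:
    """Prepend metadata as 'key: value' lines to the text (assumed layout)."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{text}" if header else text


body = "Same content in both files."
rendered_original = render_node_text(body, {"file_name": "report.docx"})
rendered_copy = render_node_text(body, {"file_name": "report - Copy.docx"})

# The body is identical, but the rendered strings are not.
rendered_differ = rendered_original != rendered_copy
```

If the rendered text is what gets embedded or sent to the model, a metadata change alters that input, so a hash that tracks it can signal that the node should be reprocessed.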
