Updated 3 months ago

I made a copy of a docx file with Windows Explorer, then read the files into documents via SimpleDirectoryReader.
I want to check whether both files are the same.
  1. If I compare document.get_content, they are equal,
  2. but the document.hash values are different.
When I asked Bing Chat what goes into generating the hash, it replied: only the content.
  • What is really used for building the hash?
  • Where is this documented?
I am using llama_index 0.7.4.
Many thanks for your help.
The hash is based on a) the text and b) the metadata (I looked at the code lol)

Looking at the code for the docx loader, it's inserting the filename as metadata, so if the filename changes, the hashes will not be equal.
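
To illustrate why a renamed copy ends up with a different hash, here is a minimal sketch. It assumes the hash is a SHA-256 digest over the text concatenated with the string form of the metadata; the actual llama_index code may differ in detail. The function name `sketch_hash` and the `file_name` metadata key are illustrative assumptions, not the library's API.

```python
import hashlib


def sketch_hash(text: str, metadata: dict) -> str:
    # Assumption: the identity string is the text plus the string form
    # of the metadata, so any metadata change alters the digest.
    identity = text + str(metadata)
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


text = "Same content in both files."

# Hypothetical filenames: the original and its Explorer copy.
original = sketch_hash(text, {"file_name": "report.docx"})
copy = sketch_hash(text, {"file_name": "report - Copy.docx"})

# Same text, different filename metadata, so the digests differ.
print(original == copy)
```

With identical text and identical metadata the digest is stable; only the filename difference makes the two values diverge.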
That is what I assumed: the metadata is used as well. OK, I can work around that.
Thanks a lot for your quick response.
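
One possible shape for such a workaround, as a sketch only: hash the text content alone and compare those digests instead of document.hash. The helper names `content_hash` and `same_content` are hypothetical; with llama_index you would pass in the strings returned by document.get_content().

```python
import hashlib


def content_hash(text: str) -> str:
    # Digest over the text only; metadata such as the filename is ignored.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def same_content(text_a: str, text_b: str) -> bool:
    # Two documents count as equal when their text digests match.
    return content_hash(text_a) == content_hash(text_b)


# Stand-ins for the text of the original file and its copy.
text_original = "Quarterly report, version 1."
text_copy = "Quarterly report, version 1."

print(same_content(text_original, text_copy))
```

Storing the content digest next to each document also lets you detect changed files later without keeping the full text around.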
Hi Logan,
One additional question: what is the idea behind including the metadata in the hash? What benefit do I get from this?
Thanks
Metadata is injected into the text of the node. So if the metadata changes, the hash probably should too.