Are you querying a graph? Or just a single vector index?
It's still returning a relevant document from the query
What does the structure of the graph look like?
very simple, very ham-fisted collection of Simple Vector indexes
the query is vanilla, nothing fancy configured on it
response = graph.query(query_string_from_input_widget)
Hmmmm the only thing I can think of off the top of my head is that the node is actually the summary of one of the indexes?
That's odd. He is one of 6 people in that index. Strange that he would be the result of the summary. Even so, don't summaries get an embedding?
Yea they don't get an embedding, which is why it doesn't have a score? 🤔
Very strange I agree. Just trying to guess how the score could be none lol
do you mean an interstitial representation of the document?
(done live, not while indexing)
How many source nodes did your query return, out of curiosity?
And by summary, I mean the index_summaries that you passed in when creating the index
OK, one per sub index, at least that makes sense lol
Idk man, this is pretty spooky. Maybe some of the embeddings failed to create? 🤔
What would be a good way to trace a document ID back to the source document? Does the index keep track of where its elements came from?
This might be likely. maybe it bugged out on doc/docx processing or something?
just saw this in the debugger I built
You can trace back to the ref doc, but this is usually only useful if you assigned the documents useful doc_ids
response.source_nodes[0].node.ref_doc_id
should map it back I think
it gives a hash like this: 7118a538-ebe6-4aa0-87a9-9d053bdee95f, but my files might look like bio_sketches/LoganM.pdf. I assume the reference to the path is thrown away past the data-loading step?
Yeaaa when you load the documents, the doc ID is randomly generated, or you can set it manually to a unique value
Alternatively, you can also set the extra_info dict of each document object with other info that should get passed to the nodes
ah ok, so this would change, perhaps, how the index is constructed. Instead of pointing a SimpleDirectoryReader to the directory, I would probably be loading in documents one at a time and attaching the path data to each one, either by overriding the doc ID or by passing in the extra info. 🤔
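A rough sketch of that one-at-a-time loading idea, in plain Python so it runs without the library. `Doc` is a hypothetical stand-in for the real Document object, and `load_with_paths` and `read_text` are made-up names for illustration; only `doc_id` and `extra_info` come from the discussion above. Treat it as a pattern, not the library's API.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for the library's Document object."""
    text: str
    doc_id: Optional[str] = None
    extra_info: Dict[str, str] = field(default_factory=dict)


def _default_read_text(path: Path) -> str:
    # Assumption: plain-text files. A real pdf/docx parser would go here.
    return path.read_text(errors="ignore")


def load_with_paths(
    root: str, read_text: Callable[[Path], str] = _default_read_text
) -> List[Doc]:
    """Load files one at a time, keeping each file's relative path."""
    root_path = Path(root)
    docs = []
    for path in sorted(root_path.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        rel = path.relative_to(root_path).as_posix()
        # Use the relative path as the doc ID instead of a random hash,
        # and also store it in extra_info so it can travel with the nodes.
        docs.append(
            Doc(text=read_text(path), doc_id=rel, extra_info={"file_path": rel})
        )
    return docs
```

With IDs assigned like this, a ref doc ID would read as something like bio_sketches/LoganM.pdf rather than an opaque hash, assuming the index carries the document's ID through to its nodes.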
The node score can be none when the node is selected based on relationship with another node.
ohh so if a document is chunked
So for example a node with score can have its next node also in the source nodes but with score none.
This next node might not have an embedding score, but it's being queried because it's related to the node that has the score
that is a good explanation
I hope that is what’s happening in your case
You can confirm by looking at the next and previous relationships of the nodes that are in the source nodes
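One way to run that check, sketched with stand-in classes so it works without the library. `Node` and `NodeWithScore` here are hypothetical minimal versions, and `node_id` is an assumed field name; `prev_node_id`, `next_node_id`, `ref_doc_id`, and `score` are the attributes mentioned in this thread. The helper name `unexplained_none_nodes` is made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for an index node."""
    node_id: str  # assumed field name for the node's own ID
    prev_node_id: Optional[str] = None
    next_node_id: Optional[str] = None
    ref_doc_id: Optional[str] = None


@dataclass
class NodeWithScore:
    """Hypothetical stand-in for one entry of response.source_nodes."""
    node: Node
    score: Optional[float] = None


def unexplained_none_nodes(source_nodes: List[NodeWithScore]) -> List[str]:
    """Return IDs of None-score nodes that are not a prev/next neighbor
    of any scored node in the same result set."""
    neighbor_ids = set()
    for sn in source_nodes:
        if sn.score is not None:
            for nid in (sn.node.prev_node_id, sn.node.next_node_id):
                if nid is not None:
                    neighbor_ids.add(nid)
    return [
        sn.node.node_id
        for sn in source_nodes
        if sn.score is None and sn.node.node_id not in neighbor_ids
    ]
```

If this returns an empty list, every None-score node is explained by a relationship to a scored node. Anything it does return points at some other cause.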
the None nodes have None for their ref doc
@BioHacker to no avail. But thank you for the suggestion. I never knew about prev_node_id and next_node_id, useful to be able to reference that
Something is potentially wrong with those nodes. The easiest way to deal with this is to set the similarity_cutoff arg in the query call. It can also be done by using a node postprocessor to eliminate these None nodes before the response is synthesized.
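To make the effect concrete, here is a small standalone sketch of what such a filter does to the retrieved nodes. It is not the library's implementation: `ScoredNode` and `apply_cutoff` are hypothetical names, and the assumption is that a node with no score is treated as failing the cutoff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredNode:
    """Hypothetical stand-in for a retrieved node and its similarity score."""
    node_id: str
    score: Optional[float] = None


def apply_cutoff(nodes: List[ScoredNode], similarity_cutoff: float) -> List[ScoredNode]:
    """Keep only nodes that have a score at or above the cutoff.

    Assumption: nodes whose score is None are dropped as well.
    """
    return [
        n for n in nodes
        if n.score is not None and n.score >= similarity_cutoff
    ]


retrieved = [
    ScoredNode("a", 0.82),
    ScoredNode("b", None),  # the kind of node discussed in this thread
    ScoredNode("c", 0.41),
]
kept = apply_cutoff(retrieved, similarity_cutoff=0.7)
print([n.node_id for n in kept])  # prints ['a']
```

Note that this hides the None-score nodes rather than explaining them, so it is worth pairing with a check on why they have no score.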
@BioHacker yeah, I was considering that too. I was considering re-indexing everything as a first step. Just to get the document source name into the index nodes. And to try a few other things. Including doing bespoke doc/docx processing.