get_formatted_sources() truncates the string. In that example, the full source is actually pretty relevant.
If you are getting good responses, the sources must be helping
One thing I did notice, with OpenAI embeddings especially: the node similarities are always around 0.75, unless they are super super relevant haha
Hmm... For me, even when manually adjusting and printing out a very long get_formatted_sources, it's still irrelevant to the actual response it provides
Hm, spooky lol
Not sure what to tell ya
two things I think
I guess the questions you are asking are general enough that the LLM figures it out without the context?
And also, embeddings seem to not be capturing what you are trying to query
Yeah not really sure, it definitely looks like it's using the embeddings correctly, but it just prints out overly verbose formatted sources that maybe have a slight mention of the actual query context. Maybe it just makes more sense to use links to the original documents instead of printing out the formatted sources. My original idea was to get the exact text passage showing where the response was generated from, but that doesn't seem to be working...
Are you setting chunk size limit? I know the embeddings seem to work best for ~1024 chunk sizes
I am, how are you setting yours?
Yea usually at 1024, top k of 2-3
How about for the rest? Like max chunk overlap, num outputs etc...
Usually just the defaults for those: 256 and 20
Hmm weird. That didn't help. Actually when I set the chunk_size_limit to 1024, it stopped even finding the context. With a maxed out chunk size limit it finds the context, but not with lower values
But yeah I wish the formatted sources could somehow display only the relevant text they used for the response, even when it displays the answer in there, it's overly verbose and hard to find the spots where the actual query response is included in the original text
Can you think of a workaround? My goal is to have every LLM response have a link/display the original text that was used for creating the response
With PDFs I've experimented with including the links but the issue is it doesn't display the actual parts which were used for the response, just the whole document link
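One possible workaround for the goal above, as a rough sketch only: take the text of a retrieved source chunk and the LLM response, and keep just the sentences from the chunk that share the most words with the response. Everything here is an assumption for illustration (the function names, the naive sentence splitter, the word-overlap score); it is plain Python and not part of the library, and word overlap is a crude proxy for "what the model actually used".

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased set of alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_relevant_sentences(source_text, response_text, top_n=2):
    """Return up to top_n sentences of source_text that overlap most with the response."""
    response_words = tokenize(response_text)
    scored = []
    for position, sentence in enumerate(split_sentences(source_text)):
        words = tokenize(sentence)
        if not words:
            continue
        # Fraction of the sentence's words that also appear in the response.
        overlap = len(words & response_words) / len(words)
        scored.append((overlap, position, sentence))
    # Highest overlap first; ties keep the original sentence order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for overlap, _, sentence in scored[:top_n] if overlap > 0]


if __name__ == "__main__":
    chunk = "The cat sat on the mat. Paris is the capital of France. Bananas are yellow."
    answer = "The capital of France is Paris."
    for sentence in most_relevant_sentences(chunk, answer, top_n=1):
        print(sentence)
```

Showing those few sentences next to a link to the original document would keep the citation short while still pointing at the full text.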
The response.source_nodes will show the exact start/end string position of each node, which might help?
At the end of the day, whatever is in response.source_nodes is what the model read to create the answer
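To make the start/end idea concrete, here is a minimal sketch of turning those positions into citations. It assumes you can pull a document id, start offset, end offset, and similarity out of each source node; `NodeSpan` and `cite_passages` are hypothetical names, and the actual attribute names on the library's source nodes may differ by version, so check the object you get back before relying on this.

```python
from dataclasses import dataclass


@dataclass
class NodeSpan:
    # Hypothetical stand-in for the fields read off a source node.
    doc_id: str
    start: int         # start character offset in the original document
    end: int           # end character offset (exclusive)
    similarity: float  # retrieval similarity score


def cite_passages(documents, spans):
    """Build one citation string per span, most similar first.

    documents maps doc_id -> full original text.
    """
    citations = []
    for span in sorted(spans, key=lambda s: s.similarity, reverse=True):
        full_text = documents[span.doc_id]
        # Slice the exact passage the node covers out of the original text.
        passage = full_text[span.start:span.end]
        citations.append(
            f"[{span.doc_id} chars {span.start}-{span.end}, "
            f"similarity {span.similarity:.2f}] {passage}"
        )
    return citations


if __name__ == "__main__":
    docs = {"doc1": "Hello world. The sky is blue."}
    spans = [NodeSpan("doc1", 0, 12, 0.75), NodeSpan("doc1", 13, 29, 0.90)]
    for line in cite_passages(docs, spans):
        print(line)
```

This only narrows the citation down to the node the model read, not to the specific sentence inside it, so with large chunk sizes the passage will still be long.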
Yeah printing out the response.source_nodes seems better? The formatted sources tend to be truncated and not really relevant. Is the idea of formatted sources to parse the text from the response source nodes that was used for the answer?