going from refine to llm lingua + summarize

going from refine to llm lingua + summarize basically turned a 4.5 minute subquestion chain into <1 minute with the same quality on my pipeline
curious how you are evaluating the quality of final answers?
not very scientifically 😅

but I have a set of questions that I'm asking the bot that I know the answers to already and I'm looking for specific information

I asked it what the synergistic mechanisms of action would be between three different medications, and I'm looking to see that it:

  1. Identifies the medications correctly
  2. Identifies their mechanisms of action correctly
  3. Explains how they would complement each other (i.e. which cellular pathways would be activated and how that would be synergistic)
I found that using compact and/or summarize would often miss a lot of the finer details, or would get them mixed up between the medications

However, using refine would keep all the details and assign them appropriately in the final response

The LLM Lingua compression before a simple summarize step also retains those details and assigns them appropriately
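Roughly the shape of the pipeline, in case it helps. This is a sketch from memory, assuming llama_index 0.9-era import paths and the LongLLMLinguaPostprocessor integration, so the argument names and defaults may not match exactly:

Python
# Retrieve wide, compress the retrieved nodes with LLMLingua, then do a single
# summarize pass instead of a long refine chain.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.postprocessor import LongLLMLinguaPostprocessor

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

compressor = LongLLMLinguaPostprocessor(
    # model_name is where the compressor gets pointed at a small model
    # (e.g. a TinyLlama checkpoint) -- this arg name is an assumption, check
    # the integration's constructor.
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    instruction_str="Given the context, please answer the final question",
    target_token=300,             # token budget for the compressed context
    rank_method="longllmlingua",
)

query_engine = index.as_query_engine(
    similarity_top_k=10,               # pull in plenty of candidate chunks...
    node_postprocessors=[compressor],  # ...then squeeze them down
    response_mode="tree_summarize",    # one summarize pass over the compressed context
)

response = query_engine.query(
    "What would the synergistic mechanism of action be between drug A, drug B, and drug C?"
)
print(response)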
if you have a better way, I'm totally all ears
or maybe we could implement something like this: https://github.com/EleutherAI/lm-evaluation-harness
well that's definitely something to look into
yea, I'm actually pretty curious how well this is working for you 😅
Anecdotally? Very, very well
Here's a sample response... Check out all the sources that it compressed to synthesize an answer from
Attachments
Screenshot_20240115_160044_Discord.jpg
Screenshot_20240115_160107_Discord.jpg
Screenshot_20240115_160141_Discord.jpg
Screenshot_20240115_160159_Discord.jpg
Screenshot_20240115_160126_Discord.jpg
It did all that in under a minute on a 3060 12GB
which LLM are you using? 👀
Mistral 7B 4-bit quantized via vLLM, and then TinyLlama Chat v1.0 4-bit quantized for LLMLingua
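For reference, the serving side is roughly this (the checkpoint name is a placeholder for whichever 4-bit AWQ Mistral build you grab, not necessarily the exact one used here):

Python
# Offline vLLM engine with a 4-bit (AWQ) Mistral 7B; fits in 12 GB alongside
# a small compressor model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder quantized checkpoint
    quantization="awq",
    gpu_memory_utilization=0.85,  # leave headroom for the LLMLingua model
)

params = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(["Summarize the mechanism of action of metformin."], params)
print(outputs[0].outputs[0].text)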
ooo interesting
Yeah, I'm shocked how well it works, lol
oh, do I need GPT-4 to run these? they're not standalone?
You need an LLM to evaluate the responses (gpt-4 in this case)
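Under the hood it's the LlamaIndex evaluators with GPT-4 as the judge. A minimal sketch, assuming llama_index 0.9-style imports (the mean_* table below looks like it's aggregated from the same family of evaluators):

Python
# Grade the pipeline's answers with GPT-4: faithfulness (is the answer grounded
# in the retrieved context?) and relevancy (does it answer the question?).
# Correctness additionally needs reference answers, so it's left out here.
import asyncio

from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

judge = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0))

runner = BatchEvalRunner(
    {
        "faithfulness": FaithfulnessEvaluator(service_context=judge),  # 0 or 1 per query
        "relevancy": RelevancyEvaluator(service_context=judge),        # 0 or 1 per query
    },
    workers=4,
)

questions = [
    "What would the synergistic mechanism of action be between drug A, drug B, and drug C?",
]

# query_engine is whatever pipeline is being graded (e.g. the sketch earlier in the thread).
results = asyncio.run(runner.aevaluate_queries(query_engine, queries=questions))
for name, evals in results.items():
    print(name, sum(e.score for e in evals) / len(evals))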
yeah.... makes sense after I looked through the code, lol
so how do you interpret these results?

Plain Text
mean_correctness_score         2.931818
mean_relevancy_score           0.272727
mean_faithfulness_score        0.977273
mean_context_similarity_score  0.808573
Hmm... seems to be a bug. Looking at _evaluations.json I see a bunch of records for "relevancy" that look like this:

Plain Text
{
      "query": "Paul Graham mentions his experience of leaving YC and no longer working with Jessica. How does he describe this experience and what does it reveal about his personal and professional relationship with Jessica?",
      "contexts": null,
      "response": " Paul Graham describes his experience of leaving YC and no longer working with Jessica as a decision made due to the need for a change in leadership and the desire to let Sam Altman take over. He mentions that they had been discussing the possibility of bringing Sam on board since the previous year and that he had been helping with YC since then. Paul and Jessica had agreed that if Sam accepted the offer to become the president of YC, they would step back and become ordinary partners. Paul also mentions that he had been running YC more and more, partly because he was focused on his mother's cancer treatment. The conversation about bringing Sam on board took place at a corner caf\u00e9 in March 2014, and they both agreed to implement the change. The text reveals that Paul and Jessica had a professional relationship as co-founders of YC, and Paul had a high regard for her abilities and contributions to the organization. However, he felt that the time had come for a change in leadership and for him to step back and let someone else take the reins.",
      "passing": false,
      "feedback": "NO",
      "score": 0.0,
      "pairwise_source": null,
      "invalid_result": false,
      "invalid_reason": null
    }
contexts being null doesn't seem right
actually, all of the contexts are null
Those scores are actually not bad. Correctness is on a 0-5 scale; the rest are 0-1.

You can see an explanation here
https://docs.llamaindex.ai/en/stable/module_guides/evaluating/root.html#response-evaluation

Not sure what's going on with the context though
gotcha, yeah, running it again I can see that it's evaluating properly, but just not providing the contexts for some reason -shrug-
it's interesting that my setup works great on my dataset, but falls over on Paul's essays
dang, GPT-4 is a harsh critic, lol
it'd be interesting to see some "high water marks" so to speak for LLMs of various sizes
like, what should we expect out of a 7B model, etc?
that being said, I'm finding that the model doesn't seem to matter as much as the splitting and retrieval
yea, getting the accurate chunks for generating a response is a pretty large factor
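e.g. the knobs that seem to move the score the most are on that side; a quick sketch (0.9-style imports, the sizes are just examples to sweep, not recommendations):

Python
# Chunking and retrieval width get tuned independently of the generation model.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

# Smaller chunks with some overlap keep fine-grained details separated;
# sweeping these seems to move the eval scores more than swapping the 7B model.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=8)  # retrieval width to sweep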
been tinkering over the last couple days, it's actually really difficult to improve on that score, wow
also, wow GPT-4 is expensive, lol
Yaaaa, I try not to use it lol, it really does add up
has anyone tried using BAAI/JudgeLM-7B-v1.0 yet? they claim to get 84% consistency w/ GPT-4 on their 7B model and 92% consistency w/ a 33B model
I have not tried it yet 👀