going from refine to llm lingua + summarize

going from refine to llm lingua + summarize basically turned a 4.5 minute subquestion chain into <1 minute with the same quality on my pipeline
curious how you are evaluating the quality of final answers?
not very scientifically 😅

but I have a set of questions that I'm asking the bot that I know the answers to already and I'm looking for specific information

I asked it what the synergistic mechanisms of action would be between three different medications, and I'm looking to see that it:

  1. Identifies the medications correctly
  2. Identifies their mechanisms of action correctly
  3. Explains how they would complement each other (i.e. which cellular pathways would be activated and how that would be synergistic)
I found that using compact and/or summarize would often miss a lot of the finer details, or would get them mixed up between the medications

However, using refine would keep all the details and assign them appropriately in the final response

The LLM Lingua compression before a simple summarize step also retains those details and assigns them appropriately
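Roughly the shape of the pipeline, in case it helps. This is a sketch from memory, assuming llama_index 0.9-era import paths and the LongLLMLinguaPostprocessor integration, so the argument names and defaults may not match exactly:

Python
# Retrieve wide, compress the retrieved nodes with LLMLingua, then do a single
# summarize pass instead of a long refine chain.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.postprocessor import LongLLMLinguaPostprocessor

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

compressor = LongLLMLinguaPostprocessor(
    # model_name is where the compressor gets pointed at a small model
    # (e.g. a TinyLlama checkpoint) -- this arg name is an assumption, check
    # the integration's constructor.
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    instruction_str="Given the context, please answer the final question",
    target_token=300,             # token budget for the compressed context
    rank_method="longllmlingua",
)

query_engine = index.as_query_engine(
    similarity_top_k=10,               # pull in plenty of candidate chunks...
    node_postprocessors=[compressor],  # ...then squeeze them down
    response_mode="tree_summarize",    # one summarize pass over the compressed context
)

response = query_engine.query(
    "What would the synergistic mechanism of action be between drug A, drug B, and drug C?"
)
print(response)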
if you have a better way, I'm totally all ears
or maybe we could implement something like this: https://github.com/EleutherAI/lm-evaluation-harness
well that's definitely something to look into
yea, I'm actually pretty curious how well this is working for you 😅
Anecdotally? Very, very well
Here's a sample response... Check out all the sources that it compressed to synthesize an answer from
Attachments
Screenshot_20240115_160044_Discord.jpg
Screenshot_20240115_160107_Discord.jpg
Screenshot_20240115_160141_Discord.jpg
Screenshot_20240115_160159_Discord.jpg
Screenshot_20240115_160126_Discord.jpg
It did all that in under a minute on a 3060 12GB
which LLM are you using? 👀
Mistral 7B 4-bit quantized via vLLM, and then TinyLlama Chat v1.0 4-bit quantized for LLMLingua
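For reference, the serving side is roughly this (the checkpoint name is a placeholder for whichever 4-bit AWQ Mistral build you grab, not necessarily the exact one used here):

Python
# Offline vLLM engine with a 4-bit (AWQ) Mistral 7B; fits in 12 GB alongside
# a small compressor model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder quantized checkpoint
    quantization="awq",
    gpu_memory_utilization=0.85,  # leave headroom for the LLMLingua model
)

params = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(["Summarize the mechanism of action of metformin."], params)
print(outputs[0].outputs[0].text)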
ooo interesting
Yeah, I'm shocked how well it works, lol
oh, do I need GPT-4 to run these? they're not standalone?
You need an LLM to evaluate the responses (gpt-4 in this case)
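Under the hood it's the LlamaIndex evaluators with GPT-4 as the judge. A minimal sketch, assuming llama_index 0.9-style imports (the mean_* table below looks like it's aggregated from the same family of evaluators):

Python
# Grade the pipeline's answers with GPT-4: faithfulness (is the answer grounded
# in the retrieved context?) and relevancy (does it answer the question?).
# Correctness additionally needs reference answers, so it's left out here.
import asyncio

from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

judge = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4", temperature=0))

runner = BatchEvalRunner(
    {
        "faithfulness": FaithfulnessEvaluator(service_context=judge),  # 0 or 1 per query
        "relevancy": RelevancyEvaluator(service_context=judge),        # 0 or 1 per query
    },
    workers=4,
)

questions = [
    "What would the synergistic mechanism of action be between drug A, drug B, and drug C?",
]

# query_engine is whatever pipeline is being graded (e.g. the sketch earlier in the thread).
results = asyncio.run(runner.aevaluate_queries(query_engine, queries=questions))
for name, evals in results.items():
    print(name, sum(e.score for e in evals) / len(evals))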
yeah.... makes sense after I looked through the code, lol
so how do you interpret these results?

Plain Text
mean_correctness_score         2.931818
mean_relevancy_score           0.272727
mean_faithfulness_score        0.977273
mean_context_similarity_score  0.808573
Hmm... seems to be a bug. Looking at _evaluations.json I see a bunch of records for "relevancy" that look like this:

Plain Text
{
      "query": "Paul Graham mentions his experience of leaving YC and no longer working with Jessica. How does he describe this experience and what does it reveal about his personal and professional relationship with Jessica?",
      "contexts": null,
      "response": " Paul Graham describes his experience of leaving YC and no longer working with Jessica as a decision made due to the need for a change in leadership and the desire to let Sam Altman take over. He mentions that they had been discussing the possibility of bringing Sam on board since the previous year and that he had been helping with YC since then. Paul and Jessica had agreed that if Sam accepted the offer to become the president of YC, they would step back and become ordinary partners. Paul also mentions that he had been running YC more and more, partly because he was focused on his mother's cancer treatment. The conversation about bringing Sam on board took place at a corner caf\u00e9 in March 2014, and they both agreed to implement the change. The text reveals that Paul and Jessica had a professional relationship as co-founders of YC, and Paul had a high regard for her abilities and contributions to the organization. However, he felt that the time had come for a change in leadership and for him to step back and let someone else take the reins.",
      "passing": false,
      "feedback": "NO",
      "score": 0.0,
      "pairwise_source": null,
      "invalid_result": false,
      "invalid_reason": null
    }
contexts being null doesn't seem right
actually, all of the contexts are null
Those scores are actually not bad. Correctness is on a 0-5 scale; the rest are 0-1.

You can see an explanation here
https://docs.llamaindex.ai/en/stable/module_guides/evaluating/root.html#response-evaluation

Not sure what's going on with the context though
gotcha, yeah, running it again I can see that it's evaluating properly, but just not providing the contexts for some reason -shrug-
it's interesting that my setup works great on my dataset, but falls over on Paul's essays
dang, GPT-4 is a harsh critic, lol
it'd be interesting to see some "high water marks" so to speak for LLMs of various sizes
like, what should we expect out of a 7B model, etc?
that being said, I'm finding that the model doesn't seem to matter as much as the splitting and retrieval
yea, getting the accurate chunks for generating a response is a pretty large factor
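e.g. the knobs that seem to move the score the most are on that side; a quick sketch (0.9-style imports, the sizes are just examples to sweep, not recommendations):

Python
# Chunking and retrieval width get tuned independently of the generation model.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

# Smaller chunks with some overlap keep fine-grained details separated;
# sweeping these seems to move the eval scores more than swapping the 7B model.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=8)  # retrieval width to sweep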
been tinkering over the last couple days, it's actually really difficult to improve on that score, wow
also, wow GPT-4 is expensive, lol
Yaaaa, I try not to use it lol, it really does add up
has anyone tried using BAAI/JudgeLM-7B-v1.0 yet? they claim to get 84% consistency w/ GPT-4 on their 7B model and 92% consistency w/ a 33B model
I have not tried it yet 👀