I am currently studying prompt tuning (soft prompting), one of the PEFT techniques.
The description accompanying the image is as follows:
"One potential issue to consider is the interpretability of learned virtual tokens. Remember, because the soft prompt tokens can take any value within the continuous embedding vector space. The trained tokens don't correspond to any known token, word, or phrase in the vocabulary of the LLM. However, an analysis of the nearest neighbor tokens to the soft prompt location shows that they form tight semantic clusters. In other words, the words closest to the soft prompt tokens have similar meanings. The words identified usually have some meaning related to the task, suggesting that the prompts are learning word like representations."
But doesn't this semantic clustering mean the soft prompt is working well, i.e., doesn't it improve the prompt's performance? Why is it described as a problem?