OUTLINE:
0:00 - Intro & Overview
2:00 - Frozen Pretrained Transformers
4:50 - Evaluated Tasks
10:05 - The Importance of Training LayerNorm
17:10 - Modality Transfer
25:10 - Network Architecture Ablation
26:10 - Evaluation of the Attention Mask
27:20 - Are FPTs Overfitting or Underfitting?
28:20 - Model Size Ablation
28:50 - Is Initialization All You Need?
31:40 - Full Model Training Overfits
32:15 - Again the Importance of Training LayerNorm
33:10 - Conclusions & Comments
@PotatoKaboom3 жыл бұрын
As always, very well done! Very clear explanation and great thoughts on the paper. Blows my mind that you seem to do these videos in a single take.
@normalchannel47473 жыл бұрын
This is a hell of a paper. Thank you Yannic for sharing this work
@twmicrosheep3 жыл бұрын
I think the paper "K for the Price of 1: Parameter-efficient Multi-task and Transfer Learning" from ICLR 2019 already demonstrated that it is possible to transfer/fine-tune a model using only the parameters of the normalization layers (scales and biases). It also shows how pre-trained models can achieve better results than randomly initialized models. Quoting from the abstract: "The basic approach is to learn a model patch - a small set of parameters - that will specialize to each task, instead of finetuning the last layer or the entire network. For instance, we show that learning a set of scales and biases is sufficient to convert a pretrained network to perform well on qualitatively different problems (e.g. converting a Single Shot MultiBox Detection (SSD) model into a 1000-class image classification model while reusing 98% of parameters of the SSD feature extractor)."
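A minimal sketch of that "train only the normalization scales and biases" idea, assuming PyTorch and torchvision; ResNet-18 stands in here for whatever pretrained backbone you have, and the `weights=` string argument assumes a recent torchvision version:

```python
# Sketch: fine-tune only the normalization parameters (scales and biases)
# of a pretrained model, freezing everything else.
import torch.nn as nn
import torchvision

def freeze_all_but_norms(model: nn.Module) -> nn.Module:
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Re-enable gradients only for normalization layers (affine scale + bias).
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)):
            for p in m.parameters():
                p.requires_grad = True
    return model

# Older torchvision versions use pretrained=True instead of weights=...
model = freeze_all_but_norms(torchvision.models.resnet18(weights="IMAGENET1K_V1"))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```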
@pensiveintrovert43183 жыл бұрын
Transformers simply build multi-level look-up tables. It is basically an algorithm for that game of "Is it a plant? Is it an animal? Does it have fur?" and so on. By combining information about a token and its context, transformers disambiguate the meaning of every token within a unique context.
@valthorhalldorsson93003 жыл бұрын
Good video! I think they’ve put their finger on an interesting empirical finding that is worth investigating further. I’m not sure it’s worth getting excited about yet, though, considering previous papers showing that you can get competitive performance on some architectures by training only the batch norm layers.
@hannesstark50243 жыл бұрын
~"Because we are in the low data domain we are not better than the fully trained transformer" But we also have the same performance for ListOps where we do have a lot of training data.
@christophecerisara87413 жыл бұрын
Great video, thanks! My guess is that they talk about zero-shot because they place themselves in the perspective of the pretrained transformer: all they're doing is, in a way, training the minimum required to adapt to a new task (the input encoding, the output classes, and the range of values taken by the hidden representations), while the pretrained modules stay fixed. But I agree with your point, of course... Thanks!
@ce65353 жыл бұрын
One of the things that I think supports their hypothesis that language is special, in an indirect way, is that essentially every novel item in an image corresponds to some new noun in the language model. So if you had an image classifier with the same number of classes as some other language model, the language model that 'corresponds' to that image classifier might be larger again.
@DamianReloaded3 жыл бұрын
This could open the possibility of having a pre-trained multi-purpose core model to which you can just append input layers and output layers and it could be able to process whatever data you throw at it. Imagine if a company provided this core-model as a service and "clients" only had to provide the input layers, the output layers and the data...
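A rough sketch of what that frozen-core-plus-client-supplied-layers setup could look like, assuming the Hugging Face `transformers` GPT-2 as the core; the input dimension, class count, and mean-pooling choice are made up for illustration, and the layer norms (which the paper also trains) are left frozen here for brevity:

```python
# Sketch: a frozen pretrained transformer core with trainable input/output
# layers bolted on, roughly in the spirit of the paper's FPT setup.
import torch
import torch.nn as nn
from transformers import GPT2Model

class FrozenCoreClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.core = GPT2Model.from_pretrained("gpt2")
        for p in self.core.parameters():
            p.requires_grad = False  # the shared core stays fixed
        d_model = self.core.config.n_embd  # 768 for base GPT-2
        # The "client" only provides these two small trainable pieces.
        self.input_proj = nn.Linear(input_dim, d_model)
        self.output_head = nn.Linear(d_model, num_classes)

    def forward(self, x):  # x: (batch, seq_len, input_dim)
        h = self.core(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_head(h.mean(dim=1))  # pool over the sequence

model = FrozenCoreClassifier(input_dim=16, num_classes=10)
logits = model(torch.randn(2, 64, 16))
print(logits.shape)  # torch.Size([2, 10])
```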
@Daniel-ih4zh3 жыл бұрын
Yeah, it's basically plug-in priors.
@RobotProctor3 жыл бұрын
Isn't that what Google's BERT is?
@DamianReloaded3 жыл бұрын
@@RobotProctor I think Google BERT is intended to be used only for NLP. This paper talks about NNs trained for NLP being used with any kind of data, say images, without re-training them with image datasets.
@RobotProctor3 жыл бұрын
@@DamianReloaded got it
@andrewminhnguyen94463 жыл бұрын
Amazing. You wanna do it? MaaS (model-as-a-service).
@burhanrashidhussein60373 жыл бұрын
22:10 Your insights could be important here; they should try more complex tasks, for example fine-grained classification. MNIST and CIFAR-10 have been repeatedly used to investigate model behaviours and to draw big conclusions, while there are more complex tasks when it comes to practical implications. Thanks for the videos, we are really benefiting.
@ahmedmagdy29323 жыл бұрын
I like your point about the number of classes. Also, honestly, this randomly initialized model is surprisingly good. I can understand the point about the layer norm effect, but still, you know, it is randomly initialized lol. I have seen a couple of cases where people try random initialization and get fairly good results. I have a theory that I feel will be proven one day: the transformer idea is good, but these large models are not necessarily needed. I don't think leaving all of these weights behaving blindly is the solution.
@f14-werto3 жыл бұрын
Now that I've heard of this "computational utility", I wonder if a more "artificial" hard task like SAT solving can encode better "utilities" than a natural language task.
@norik16163 жыл бұрын
I like it! Sadly I think it will require more time to converge than GPT-2, and that would (for now) require an amount of compute out of a mere mortal's reach.
@f14-werto3 жыл бұрын
@@norik1616 Guess I'll keep wondering then
@conduit2423 жыл бұрын
Embeddings are all you need 🤷🏻♂️
@BBorn2233 жыл бұрын
Wow, thanks for sharing. I have fine-tuned a Hugging Face model for an NLP task. Watching this, I think that even if I want to classify other languages, I could just use a GPT-2 model and fine-tune only those layer norms. I think this is worth trying.
@norik16163 жыл бұрын
I wonder why there are never confidence intervals. From my limited experience, even ±1% can be within 2× the standard deviation, just based on the seed.
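A quick sketch of what reporting that would look like, given a handful of runs with different seeds; the accuracy numbers are made up, and a normal approximation is used for the interval:

```python
# Sketch: turn per-seed results into mean ± a rough 95% interval,
# so a ~1% gap can be judged against seed variance.
import numpy as np

accs = np.array([0.712, 0.698, 0.705, 0.721, 0.693])  # hypothetical per-seed accuracies
mean = accs.mean()
std = accs.std(ddof=1)                  # sample standard deviation across seeds
ci95 = 1.96 * std / np.sqrt(len(accs))  # normal approximation to the 95% CI
print(f"accuracy: {mean:.3f} ± {ci95:.3f} (std {std:.3f}, n={len(accs)})")
```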
@howardkong89272 жыл бұрын
Yeah. The improvements in the paper seem pretty marginal; I wonder if the error bars would just make them disappear.
@freemind.d27143 жыл бұрын
Your second guess about why pretrained transformers even work is the same as mine.
@techma823 жыл бұрын
Loved this one!
@dimonenka3 жыл бұрын
19:36 It does not seem to me that the vision transformer lags behind. It performs considerably worse on Homology than FPT, but it also underperforms Random on Homology. I don't know what this task is, but I'm guessing the vision transformer just has bad inductive biases, or learns bad patterns or whatever, for that task, and that the task is more similar to learning natural language. Bottom line, it really matters what set of problems one chooses to test transformers on, and based on the choice in this paper the results are inconclusive.
@liammcdevitt75943 жыл бұрын
I noticed that you made a video at the end of 2017 called "Attention Is All You Need". There is a new paper called "Is Attention Better Than Matrix Decomposition?" that proposes a new method called Hamburger, which uses matrix decomposition, claiming that this 20-year-old technique is better than attention. I was wondering if you could read this paper and make a video helping me understand this new method?
@liammcdevitt75943 жыл бұрын
@@pshyilocibinoaut9433 Maybe, we'll have to see if Yannic does a video on it ;)
@piotr7803 жыл бұрын
Pretrained networks are essentially some kind of decomposition of the original data, so maybe pretraining is discovering some unknown (or known) decomposition method, or approximating an existing one? I once tried a series of NMFs on data, but I had problems with the architecture itself, like: split the data in two in the second layer and apply two NMFs? Or add noise after the first layer? Etc.
@rtluo15463 жыл бұрын
Hi, can you provide a paper reference for the adapters between transformer layers? Thank you
@Dougystyle113 жыл бұрын
Look up adapterhub
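For context, the adapters discussed there (introduced in "Parameter-Efficient Transfer Learning for NLP", Houlsby et al., 2019, which AdapterHub builds on) are small bottleneck modules inserted inside each transformer block. A minimal sketch of one such module, with dimensions chosen arbitrarily:

```python
# Sketch of a bottleneck adapter: down-project, nonlinearity, up-project,
# plus a residual connection. Only these few parameters are trained while
# the surrounding transformer layer stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, h):  # h: (batch, seq_len, d_model)
        # Residual keeps the frozen path intact; the adapter only adds a small correction.
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 10, 768)
print(Adapter()(h).shape)  # torch.Size([2, 10, 768])
```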
@piotr7803 жыл бұрын
OK, but what is your conclusion? Your paper summary is really good, but what is your opinion on the source of the effectiveness of pretrained transformers? (I guess that language has the highest entropy, so it is the most challenging modality, so pretraining on it gives better results.)
@aleph05403 жыл бұрын
Does the number of degrees of freedom have something to do with the complexity of the tasks here? Or rather the success of these tasks?
@RandyArdywibowo3 жыл бұрын
Nice overview! Indeed, the layer norm training somewhat weakens the "Universal Computation Engine" claim. Also, what about ablations against simpler feature extractors like Fourier transforms, DCT, or wavelets replacing the transformer architecture? If the results for these simple feature extractors are only slightly worse, then it's a bit silly to pretrain on a huge language dataset to eke out a few percentage points of performance, don't you think?
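A sketch of what such a baseline could look like, assuming scipy and scikit-learn: truncated DCT coefficients as frozen "features" with only a linear probe trained on top. The data here is a random placeholder, so the printed accuracy is meaningless; the point is the shape of the pipeline.

```python
# Sketch: a cheap frozen feature extractor (truncated DCT) plus a trained
# linear probe, as a sanity-check baseline against a frozen transformer.
import numpy as np
from scipy.fft import dct
from sklearn.linear_model import LogisticRegression

def dct_features(x: np.ndarray, keep: int = 64) -> np.ndarray:
    # x: (n_samples, n_features); keep only the lowest-frequency coefficients.
    return dct(x, type=2, norm="ortho", axis=-1)[:, :keep]

rng = np.random.default_rng(0)
x_train = rng.normal(size=(512, 256))            # placeholder inputs
y_train = rng.integers(0, 10, size=512)          # placeholder labels
probe = LogisticRegression(max_iter=1000).fit(dct_features(x_train), y_train)
print("train accuracy:", probe.score(dct_features(x_train), y_train))
```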
@conduit2423 жыл бұрын
Bingo, enormous value is being hidden in the embedding training and positional encoding. Universal computation shouldn’t require that.
@sheggle3 жыл бұрын
I loved the paper, interested to see what you think
@user-rh8hi4ph4b3 жыл бұрын
The comparison to a randomly initialized frozen transformer, while still showing a significant difference, is close enough to FPT to really throw me off. Same thing with a random CNN of which only the batchnorms are trained. The idea that many supposedly difficult tasks can be solved reasonably well with random transformations, where only the scale and mean of some layers/scalars/etc. are tweaked, tells me we're really missing something. It's almost like machine intelligence is a matter of trying as many random things as you can and scaling the things that don't work down into oblivion. Are there any published experiments that compare a fully trained network to its initialized state and look at how much the parameters have actually changed, and which ones?
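A sketch of the kind of measurement that question asks for: snapshot the initialization, train, then report the relative change per parameter tensor. The tiny model and the omitted training step are placeholders.

```python
# Sketch: how far did each parameter tensor move from its initialization?
# Relative L2 change per tensor, computed after (hypothetical) training.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.LayerNorm(64), nn.Linear(64, 10))
init_state = copy.deepcopy(model.state_dict())  # snapshot before training

# ... train the model here ...

for name, p in model.named_parameters():
    p0 = init_state[name]
    rel_change = (p.detach() - p0).norm() / (p0.norm() + 1e-12)
    print(f"{name:20s} relative change: {rel_change.item():.4f}")
```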
@codeWithBatta3 жыл бұрын
Hey Yannic, how many books do you read per week? Or at least give me some idea (of which genre); I am getting a bit competitive with you :). Thanks for replying. Btw, do you know anyone doing similar stuff to you on YouTube?
@krzysztofwos18563 жыл бұрын
These transformers develop a universal basis for understanding the structure of the world. Language is the intellect's way of making sense of sensory perceptions, so the structure of all our sensory perceptions is encoded in the language. This is why we can imagine what certain words, e.g. "elated", feel like. It's a path through a space that activates memories of related sensory perceptions. Words move you from one state to another. So the language model develops a fundamental understanding of reality as perceived by the users of the language.
@0102030405Jacky3 жыл бұрын
I think that the model pretrained on language outperforms ViT or BiT because language has more complicated syntax, such as causality or recursion. I guess that a transformer pretrained on logical reasoning tasks might perform as well as FPT.
@dianchen90833 жыл бұрын
Is there a paper showing that "training only the batch norm layers of a randomly initialized CNN gives non-trivial results", which Yannic mentioned at around 15:45? Can someone please tell me where I could find a reference for that conclusion? Thanks!
@nathancooper10013 жыл бұрын
I totally agree with some of your intuition as to what these models are learning. However, I'm betting there is a huge overlap between "natural signals" and "computational primitives", especially if you are on the team of the universe itself being computable, since then the natural signal may just be a higher-level computation.
@hailking55882 жыл бұрын
Why does it make sense to train the input layer when training on new modalities?
@andrewminhnguyen94463 жыл бұрын
Re: fine-tuning self-attention and feedforward layers resulting in performance degradation, the authors state that they don't "[change] the optimization or learning rate scheme." I'm not sure it's a foregone conclusion that the network is overfitting. I would be curious to see if it continues to degrade even if they make a judicious choice of learning rate as they gradually unfreeze the layers from the top. Thanks, Yannic.
@slavamoiseev8033 жыл бұрын
It triggers some speculation about a new type of hardware that provides some kind of general inference for downstream tasks with minimal tuning.
@williechen18763 жыл бұрын
Waiting for their pretrained model
@adamtran57472 жыл бұрын
I love you Yannic.
@louis31953 жыл бұрын
Correct me if I am wrong: these researchers are trying to create software that reproduces the computation that some strange apes learned over millions of years fighting mammoths? So can this communication tool "language" contain all the information needed to understand how to survive in a world surrounded by mammoths (and other apes)? Did they try the other way around, from vision to language? Because I think vision is by far the most useful input for human beings (and so the one bringing the most important knowledge for surviving among apes and mammoths)?
@MrMIB9833 жыл бұрын
Amazing
@alonsomartinez95883 жыл бұрын
Yannic, could you do a video on "Predicting Video with VQVAE"??
@herp_derpingson3 жыл бұрын
14:55 Maybe it's not zero-shot. It is epsilon-shot ;). Also, I think you should add the #transformer tag to all videos related to transformers, because #pretrainedtransformers != #transformer and will probably not show up in the list/recommendations.
@silberlinie3 жыл бұрын
I think, Yannic, it is similar to what we call learning to learn in education.
@Dendus903 жыл бұрын
The next exercise that comes to my mind is time-series prediction. For instance, we have a timestamp represented as an n-dimensional vector. It would be great to see whether pretrained LM models can outperform LSTMs and transformers trained from scratch on such tasks. What do you guys think?
@conduit2423 жыл бұрын
Absolutely, let’s see it predict hierarchical step functions or simple “linearly growing volatility” non-stationary time series
@matthieulin3353 жыл бұрын
Is it possible that the model overfitted because it is smaller (the double descent phenomenon)?
@andres_pq3 жыл бұрын
Thinking a lot about those computational primitives, perhaps those are the ones that do the information routing altogether. It has been somewhat shown that self-attention by itself is not useful; it is needed along with skip connections and MLPs to make a transformer a superior model. Perhaps there can be another architecture that encompasses said primitive computations.
@jabowery3 жыл бұрын
Turing not found
@carlosxaviersoto57433 жыл бұрын
If you don't know what an attention layer is, I'm sure you'll find some video on youtube that explains it... 😏
@jeremykothe28473 жыл бұрын
But where???
@conduit2423 жыл бұрын
Pairwise distance. Done
@jeremykothe28473 жыл бұрын
@@conduit242 If you already know what it means, that works :P I'm hoping everyone watching this probably does by now...
@conduit2423 жыл бұрын
@@jeremykothe2847 I probably should have said weighted distance 😌
@larrybird37293 жыл бұрын
if an object has all of its components replaced does it remain fundamentally the same?🤯
@swordwaker77493 жыл бұрын
Hard pre-training task? What about programming language execution? Make a simple instruction set and let the model predict the result.
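A toy sketch of what such a pre-training corpus could look like: random programs over a tiny accumulator instruction set, serialized as text, with the model asked to predict the result after "=>". The instruction set is invented purely for illustration.

```python
# Sketch: generate (program, result) pairs for a made-up accumulator machine,
# as raw text for a "predict the execution result" pre-training task.
import random

OPS = {"ADD": lambda a, k: a + k, "SUB": lambda a, k: a - k, "MUL": lambda a, k: a * k}

def sample_program(length: int = 4) -> str:
    acc, steps = 0, []
    for _ in range(length):
        op, k = random.choice(list(OPS)), random.randint(1, 9)
        acc = OPS[op](acc, k)          # execute the instruction on the accumulator
        steps.append(f"{op} {k}")
    return " ; ".join(steps) + f" => {acc}"

for _ in range(3):
    print(sample_program())
# e.g. "MUL 7 ; ADD 3 ; SUB 5 ; ADD 2 => 0"
```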
@mwdcodeninja3 жыл бұрын
Finance immediately comes to mind. Anyone have an order flow stream they'd like to share? For Science!
@adamrak75603 жыл бұрын
Their claim is truly massive, and the 19-page article seems very short to support it. I think somebody will have to try it on much bigger datasets with much bigger networks too. This could be truly significant if it is true, but the article mostly demonstrates it on "toy" examples, so I am not fully convinced.
@conduit2423 жыл бұрын
Transformers are approximate Turing machines because natural language is Turing complete, so I'm not sure what this paper is proving besides saying that looking at various points in history in combination is valuable to Turing machines. This point is already well understood; there are only so many ways to approximate look-back and fractional windows 🤷🏻♂️
@adamrak75603 жыл бұрын
@@conduit242 They imply that trained transformers with most of the weights frozen are universal Turing machines. That is very different from knowing that the architecture is Turing complete.