OUTLINE:
0:00 - Intro & Overview
2:00 - Frozen Pretrained Transformers
4:50 - Evaluated Tasks
10:05 - The Importance of Training LayerNorm
17:10 - Modality Transfer
25:10 - Network Architecture Ablation
26:10 - Evaluation of the Attention Mask
27:20 - Are FPTs Overfitting or Underfitting?
28:20 - Model Size Ablation
28:50 - Is Initialization All You Need?
31:40 - Full Model Training Overfits
32:15 - Again the Importance of Training LayerNorm
33:10 - Conclusions & Comments
@PotatoKaboom3 жыл бұрын
As always, very well done! Very clear explanation and great thoughts on the paper. Blows my mind that you seem to do these videos in a single take.
@normalchannel47473 жыл бұрын
This is a hell of a paper. Thank you Yannic for sharing this work
@twmicrosheep3 жыл бұрын
I think the paper "K for the Price of 1: Parameter-efficient Multi-task and Transfer Learning" from ICLR 2019 already demonstrated that it is possible to transfer/fine-tune a model using only the parameters of the normalization layers (scales and biases). It also shows how pre-trained models can achieve better results than randomly initialized models. Quoting from the abstract: "The basic approach is to learn a model patch - a small set of parameters - that will specialize to each task, instead of finetuning the last layer or the entire network. For instance, we show that learning a set of scales and biases is sufficient to convert a pretrained network to perform well on qualitatively different problems (e.g. converting a Single Shot MultiBox Detection (SSD) model into a 1000-class image classification model while reusing 98% of parameters of the SSD feature extractor)."
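A minimal sketch of that "train only the normalization scales and biases" idea, assuming PyTorch and torchvision; ResNet-18 stands in here for whatever pretrained backbone you have, and the `weights=` string argument assumes a recent torchvision version:

```python
# Sketch: fine-tune only the normalization parameters (scales and biases)
# of a pretrained model, freezing everything else.
import torch.nn as nn
import torchvision

def freeze_all_but_norms(model: nn.Module) -> nn.Module:
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Re-enable gradients only for normalization layers (affine scale + bias).
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)):
            for p in m.parameters():
                p.requires_grad = True
    return model

# Older torchvision versions use pretrained=True instead of weights=...
model = freeze_all_but_norms(torchvision.models.resnet18(weights="IMAGENET1K_V1"))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```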
@pensiveintrovert43183 жыл бұрын
Transformers simply build multi-level look-up tables. It is basically an algorithm for that game of "Is it a plant? Is it an animal? Does it have fur?" and so on. By combining information about a token and its context, transformers disambiguate the meaning of every token within a unique context.
@valthorhalldorsson93003 жыл бұрын
Good video! I think they’ve put their finger on an interesting empirical finding that is worth investigating further. I’m not sure it’s worth getting excited about yet, though, considering previous papers showing that you can get competitive performance on some architectures by training only the batch norm layers.
@hannesstark50243 жыл бұрын
~"Because we are in the low data domain we are not better than the fully trained transformer" But we also have the same performance for ListOps where we do have a lot of training data.
@christophecerisara87413 жыл бұрын
Great video, thanks! My guess is that they talk about zero-shot because they place themselves in the perspective of the pretrained transformer: all they're doing is, in a way, training the minimum required to adapt to a new task (the input encoding, the output classes, and the range of values taken by the hidden representations), while the pretrained modules stay fixed. But I agree with your point, of course... Thanks!
@ce65353 жыл бұрын
One of the things that I think supports their hypothesis that language is special, in an indirect way, is that essentially every novel item in an image corresponds to some new noun in the language model. So if you had an image classifier with the same number of classes as some other language model, the language model that 'corresponds' to that image classifier might be larger again.
@DamianReloaded3 жыл бұрын
This could open the possibility of having a pre-trained multi-purpose core model to which you can just append input layers and output layers and it could be able to process whatever data you throw at it. Imagine if a company provided this core-model as a service and "clients" only had to provide the input layers, the output layers and the data...
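A rough sketch of what that frozen-core-plus-client-supplied-layers setup could look like, assuming the Hugging Face `transformers` GPT-2 as the core; the input dimension, class count, and mean-pooling choice are made up for illustration, and the layer norms (which the paper also trains) are left frozen here for brevity:

```python
# Sketch: a frozen pretrained transformer core with trainable input/output
# layers bolted on, roughly in the spirit of the paper's FPT setup.
import torch
import torch.nn as nn
from transformers import GPT2Model

class FrozenCoreClassifier(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.core = GPT2Model.from_pretrained("gpt2")
        for p in self.core.parameters():
            p.requires_grad = False  # the shared core stays fixed
        d_model = self.core.config.n_embd  # 768 for base GPT-2
        # The "client" only provides these two small trainable pieces.
        self.input_proj = nn.Linear(input_dim, d_model)
        self.output_head = nn.Linear(d_model, num_classes)

    def forward(self, x):  # x: (batch, seq_len, input_dim)
        h = self.core(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_head(h.mean(dim=1))  # pool over the sequence

model = FrozenCoreClassifier(input_dim=16, num_classes=10)
logits = model(torch.randn(2, 64, 16))
print(logits.shape)  # torch.Size([2, 10])
```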
@Daniel-ih4zh3 жыл бұрын
Yeah, it's basically plug-in priors.
@RobotProctor3 жыл бұрын
Isn't that what Google's BERT is?
@DamianReloaded3 жыл бұрын
@@RobotProctor I think Google BERT is intended to be used only for NLP. This paper talks about NNs trained for NLP being used with any kind of data, say images, without re-training them with image datasets.
@RobotProctor3 жыл бұрын
@@DamianReloaded got it
@andrewminhnguyen94463 жыл бұрын
Amazing. You wanna do it? MaaS (model-as-a-service).
@burhanrashidhussein60373 жыл бұрын
22:10 Your insights could be important here; they should try more complex tasks, for example fine-grained classification. MNIST and CIFAR-10 have been repeatedly used to investigate model behaviours and to draw big conclusions, while there are more complex tasks when it comes to practical implications. Thanks for the videos, we are really benefiting.
@ahmedmagdy29323 жыл бұрын
I like your point about the number of classes. Also, honestly, this randomly initialized model is surprisingly good. I can understand the point about the layer norm effect, but still, you know, it is randomly initialized lol. I have seen a couple of cases where people try random initialization and get fairly good results. I have a theory that I feel will be proven one day: the transformer idea is good, but these large models are not necessarily needed. I don't think leaving all of these weights behaving blindly is the solution.
@f14-werto3 жыл бұрын
Now that I've heard of this "computational utility", I wonder if a more "artificial" hard task like SAT solving can encode better "utilities" than a natural language task.
@norik16163 жыл бұрын
I like it! Sadly I think it will require more time to converge than GPT-2, and that would (for now) require an amount of compute out of a mere mortal's reach.
@f14-werto3 жыл бұрын
@@norik1616 Guess I'll keep wondering then
@conduit2423 жыл бұрын
Embeddings are all you need 🤷🏻♂️
@BBorn2233 жыл бұрын
Wow, thanks for sharing. I have fine-tuned a Hugging Face model for an NLP task. Watching this, I think that even if I want to classify other languages, I could just use a GPT-2 model and fine-tune only those layer norms. I think this is worth trying.
@norik16163 жыл бұрын
I wonder why there are never confidence intervals. From my limited experience, even ±1% can be within 2× the standard deviation, just based on the seed.
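A quick sketch of what reporting that would look like, given a handful of runs with different seeds; the accuracy numbers are made up, and a normal approximation is used for the interval:

```python
# Sketch: turn per-seed results into mean ± a rough 95% interval,
# so a ~1% gap can be judged against seed variance.
import numpy as np

accs = np.array([0.712, 0.698, 0.705, 0.721, 0.693])  # hypothetical per-seed accuracies
mean = accs.mean()
std = accs.std(ddof=1)                  # sample standard deviation across seeds
ci95 = 1.96 * std / np.sqrt(len(accs))  # normal approximation to the 95% CI
print(f"accuracy: {mean:.3f} ± {ci95:.3f} (std {std:.3f}, n={len(accs)})")
```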
@howardkong89272 жыл бұрын
Yeah. The improvements in the paper seem pretty marginal; I wonder if the error bars would just make them disappear.
@freemind.d27143 жыл бұрын
Your second guess about why pretrained transformers even work is the same as mine.
@techma823 жыл бұрын
Loved this one!
@dimonenka3 жыл бұрын
19:36 It does not seem to me that the vision transformer lags behind. It performs considerably worse on Homology than FPT, but it also underperforms Random on Homology. I don't know what this task is, but I'm guessing the vision transformer just has bad inductive biases, or learns bad patterns or whatever, for that task, and that the task is more similar to learning natural language. Bottom line, it really matters what set of problems one chooses to test transformers on, and based on the choice in this paper the results are inconclusive.
@liammcdevitt75943 жыл бұрын
I noticed that you made a video at the end of 2017 called "Attention Is All You Need". There is a new paper called "Is Attention Better Than Matrix Decomposition?" that proposes a new method called Hamburger, which uses matrix decomposition, claiming that this 20-year-old technique is better than attention. I was wondering if you could read this paper and make a video helping me understand this new method?
@liammcdevitt75943 жыл бұрын
@@pshyilocibinoaut9433 Maybe, we'll have to see if Yannic does a video on it ;)
@piotr7803 жыл бұрын
Pretrained networks are essentially some kind of decomposition of the original data, so maybe pretraining is discovering some unknown (or known) decomposition method, or approximating an existing one? I once tried a series of NMFs on data, but I had problems with the architecture itself, like: split the data in two in the second layer and apply two NMFs? Or add noise after the first layer? Etc.
@rtluo15463 жыл бұрын
Hi, can you provide a paper reference for the adapters between transformer layers? Thank you
@Dougystyle113 жыл бұрын
Look up adapterhub
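For context, the adapters discussed there (introduced in "Parameter-Efficient Transfer Learning for NLP", Houlsby et al., 2019, which AdapterHub builds on) are small bottleneck modules inserted inside each transformer block. A minimal sketch of one such module, with dimensions chosen arbitrarily:

```python
# Sketch of a bottleneck adapter: down-project, nonlinearity, up-project,
# plus a residual connection. Only these few parameters are trained while
# the surrounding transformer layer stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, h):  # h: (batch, seq_len, d_model)
        # Residual keeps the frozen path intact; the adapter only adds a small correction.
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 10, 768)
print(Adapter()(h).shape)  # torch.Size([2, 10, 768])
```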
@piotr7803 жыл бұрын
OK, but what is your conclusion? Your paper summary is really good, but what is your opinion on the source of the effectiveness of pretrained transformers? (I guess that language has the highest entropy, so it is the most challenging modality, so pretraining on it gives better results.)
@aleph05403 жыл бұрын
Does the number of degrees of freedom have something to do with the complexity of the tasks here? Or rather the success of these tasks?
@RandyArdywibowo3 жыл бұрын
Nice overview! Indeed, the layer norm training somewhat weakens the "Universal Computation Engine" claim. Also, what about ablations against simpler feature extractors like Fourier transforms, DCT, or wavelets replacing the transformer architecture? If the results for these simple feature extractors are only slightly worse, then it's a bit silly to pretrain on a huge language dataset to eke out a few percentage points of performance, don't you think?
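A sketch of what such a baseline could look like, assuming scipy and scikit-learn: truncated DCT coefficients as frozen "features" with only a linear probe trained on top. The data here is a random placeholder, so the printed accuracy is meaningless; the point is the shape of the pipeline.

```python
# Sketch: a cheap frozen feature extractor (truncated DCT) plus a trained
# linear probe, as a sanity-check baseline against a frozen transformer.
import numpy as np
from scipy.fft import dct
from sklearn.linear_model import LogisticRegression

def dct_features(x: np.ndarray, keep: int = 64) -> np.ndarray:
    # x: (n_samples, n_features); keep only the lowest-frequency coefficients.
    return dct(x, type=2, norm="ortho", axis=-1)[:, :keep]

rng = np.random.default_rng(0)
x_train = rng.normal(size=(512, 256))            # placeholder inputs
y_train = rng.integers(0, 10, size=512)          # placeholder labels
probe = LogisticRegression(max_iter=1000).fit(dct_features(x_train), y_train)
print("train accuracy:", probe.score(dct_features(x_train), y_train))
```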
@conduit2423 жыл бұрын
Bingo, enormous value is being hidden in the embedding training and positional encoding. Universal computation shouldn’t require that.
@sheggle3 жыл бұрын
I loved the paper, interested to see what you think
@user-rh8hi4ph4b3 жыл бұрын
The comparison to a randomly initialized frozen transformer, while still showing a significant difference, is close enough to FPT to really throw me off. Same thing with a random CNN of which only the batchnorms are trained. The idea that many supposedly difficult tasks can be solved reasonably well with random transformations, where only the scale and mean of some layers/scalars/etc. are tweaked, tells me we're really missing something. It's almost like machine intelligence is a matter of trying as many random things as you can and scaling the things that don't work down into oblivion. Are there any published experiments that compare a fully trained network to its initialized state and look at how much the parameters have actually changed, and which ones?
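A sketch of the kind of measurement that question asks for: snapshot the initialization, train, then report the relative change per parameter tensor. The tiny model and the omitted training step are placeholders.

```python
# Sketch: how far did each parameter tensor move from its initialization?
# Relative L2 change per tensor, computed after (hypothetical) training.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.LayerNorm(64), nn.Linear(64, 10))
init_state = copy.deepcopy(model.state_dict())  # snapshot before training

# ... train the model here ...

for name, p in model.named_parameters():
    p0 = init_state[name]
    rel_change = (p.detach() - p0).norm() / (p0.norm() + 1e-12)
    print(f"{name:20s} relative change: {rel_change.item():.4f}")
```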
@codeWithBatta3 жыл бұрын
Hey Yannic, how many books do you read per week? Or at least give me some idea (of which genre); I am getting a bit competitive with you :). Thanks for replying. Btw, do you know anyone doing similar stuff to you on YouTube?
@krzysztofwos18563 жыл бұрын
These transformers develop a universal basis for understanding the structure of the world. Language is the intellect's way of making sense of sensory perceptions, so the structure of all our sensory perceptions is encoded in the language. This is why we can imagine what certain words, e.g. "elated", feel like. It's a path through a space that activates memories of related sensory perceptions. Words move you from one state to another. So the language model develops a fundamental understanding of reality as perceived by the users of the language.
@0102030405Jacky3 жыл бұрын
I think that the model pretrained on language outperforms ViT or BiT because language has more complicated syntax, such as causality or recursion. I guess that a transformer pretrained on logical reasoning tasks might perform as well as FPT.
@dianchen90833 жыл бұрын
Is there a paper showing that "training only the batch norm layers of a randomly initialized CNN gives non-trivial results", which Yannic mentioned at around 15:45? Can someone please tell me where I could find a reference for that conclusion? Thanks!
@nathancooper10013 жыл бұрын
I totally agree with some of your intuition as to what these models are learning. However, I'm betting there is a huge overlap between "natural signals" and "computational primitives", especially if you are on the team of the universe itself being computable, since then the natural signal may just be a higher-level computation.
@hailking55882 жыл бұрын
Why does it make sense to train the input layer when training on new modalities?
@andrewminhnguyen94463 жыл бұрын
Re: fine-tuning self-attention and feedforward layers resulting in performance degradation, the authors state that they don't "[change] the optimization or learning rate scheme." I'm not sure it's a foregone conclusion that the network is overfitting. I would be curious to see if it continues to degrade even if they make a judicious choice of learning rate as they gradually unfreeze the layers from the top. Thanks, Yannic.
@slavamoiseev8033 жыл бұрын
It triggers some speculation about a new type of hardware that provides some kind of general inference for downstream tasks with minimal tuning.
@williechen18763 жыл бұрын
Waiting for their pretrained model
@adamtran57472 жыл бұрын
I love you Yannic.
@louis31953 жыл бұрын
Correct me if I am wrong: these researchers are trying to create software that reproduces the computation that some strange apes learned over millions of years fighting mammoths? So can this communication tool "language" contain all the information needed to understand how to survive in a world surrounded by mammoths (and other apes)? Did they try the other way around, from vision to language? Because I think vision is by far the most useful input for human beings (and so the one bringing the most important knowledge for surviving among apes and mammoths)?
@MrMIB9833 жыл бұрын
Amazing
@alonsomartinez95883 жыл бұрын
Yannic, could you do a video on "Predicting Video with VQVAE"??
@herp_derpingson3 жыл бұрын
14:55 Maybe it's not zero-shot. It is epsilon-shot ;). Also, I think you should add the #transformer tag to all videos related to transformers, because #pretrainedtransformers != #transformer and will probably not show up in the list/recommendations.
@silberlinie3 жыл бұрын
I think, Yannic, it is similar to what we call learning to learn in education.
@Dendus903 жыл бұрын
The next exercise that comes to my mind is time-series prediction. For instance, we have a timestamp represented as an n-dimensional vector. It would be great to see whether pretrained LM models can outperform LSTMs and transformers trained from scratch on such tasks. What do you guys think?
@conduit2423 жыл бұрын
Absolutely, let’s see it predict hierarchical step functions or simple “linearly growing volatility” non-stationary time series
@matthieulin3353 жыл бұрын
Is it possible that the model overfitted because it is smaller (the double descent phenomenon)?
@andres_pq3 жыл бұрын
Thinking a lot about those computational primitives, perhaps those are the ones that do the information routing altogether. It has been somewhat shown that self-attention by itself is not useful; it is needed along with skip connections and MLPs to make a transformer a superior model. Perhaps there can be another architecture that encompasses said primitive computations.
@jabowery3 жыл бұрын
Turing not found
@carlosxaviersoto57433 жыл бұрын
If you don't know what an attention layer is, I'm sure you'll find some video on youtube that explains it... 😏
@jeremykothe28473 жыл бұрын
But where???
@conduit2423 жыл бұрын
Pairwise distance. Done
@jeremykothe28473 жыл бұрын
@@conduit242 If you already know what it means, that works :P I'm hoping everyone watching this probably does by now...
@conduit2423 жыл бұрын
@@jeremykothe2847 I probably should have said weighted distance 😌
@larrybird37293 жыл бұрын
if an object has all of its components replaced does it remain fundamentally the same?🤯
@swordwaker77493 жыл бұрын
Hard pre-training task? What about programming language execution? Make a simple instruction set and let the model predict the result.
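A toy sketch of what such a pre-training corpus could look like: random programs over a tiny accumulator instruction set, serialized as text, with the model asked to predict the result after "=>". The instruction set is invented purely for illustration.

```python
# Sketch: generate (program, result) pairs for a made-up accumulator machine,
# as raw text for a "predict the execution result" pre-training task.
import random

OPS = {"ADD": lambda a, k: a + k, "SUB": lambda a, k: a - k, "MUL": lambda a, k: a * k}

def sample_program(length: int = 4) -> str:
    acc, steps = 0, []
    for _ in range(length):
        op, k = random.choice(list(OPS)), random.randint(1, 9)
        acc = OPS[op](acc, k)          # execute the instruction on the accumulator
        steps.append(f"{op} {k}")
    return " ; ".join(steps) + f" => {acc}"

for _ in range(3):
    print(sample_program())
# e.g. "MUL 7 ; ADD 3 ; SUB 5 ; ADD 2 => 0"
```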
@mwdcodeninja3 жыл бұрын
Finance immediately comes to mind. Anyone have an order flow stream they'd like to share? For Science!
@adamrak75603 жыл бұрын
Their claim is truly massive, and the 19-page article seems very short to support it. I think somebody will have to try it on much bigger datasets with much bigger networks too. This could be truly significant if it is true, but the article mostly demonstrates it on "toy" examples, so I am not fully convinced.
@conduit2423 жыл бұрын
Transformers are approximate Turing machines because natural language is Turing complete, so I'm not sure what this paper is proving besides saying that looking at various points in history in combination is valuable to Turing machines. This point is already well understood; there are only so many ways to approximate look-back and fractional windows 🤷🏻♂️
@adamrak75603 жыл бұрын
@@conduit242 They imply that trained transformers with most of the weights frozen are universal Turing machines. That is very different from knowing that the architecture is Turing complete.