I was crossing my fingers for this video. Thank you.
@ChocolateMilkCultLeader 2 years ago
Crazy how Yannic has a video on every topic I want to research
@tianyangchen1339 1 year ago
Thank you for your explanation! You always make complex things easy to understand, so great!
@alivecoding4995 1 year ago
Thanks so much, Yannic!
@adrianpetrescu8583 3 years ago
man, you have a way of telling the story that I find very easy to understand. It is easy for me to learn from you :) thanks
@hoaxuan7074 3 years ago
Max pooling, locality sensitive hashing parameter switching, ReLU (f(x)=x connect, f(x)=0 disconnect) are all switching. Convolution, weighted sums, fast transforms (FFT Hadamard) are all dot products. Locality sensitive hash = Random projection followed by binarization. Random projection = fixed pattern of randomly chosen sign flips followed by Hadamard transform. Repeat for better quality. 3 Elements: Dot product, switching, predicate for switch state (EG. x
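The random-projection LSH recipe sketched above can be written in a few lines. This is only a minimal illustration of the idea (names, dimensions and thresholds are my own, not from any particular paper): a fixed random sign-flip pattern, a fast Walsh-Hadamard transform, then binarization to get the hash bits.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def lsh_code(x, signs):
    """Random projection = fixed random sign flips followed by a Hadamard
    transform; binarizing the result gives a locality-sensitive hash."""
    return (fwht(x * signs) >= 0).astype(np.int8)

rng = np.random.default_rng(0)
d = 8
signs = rng.choice([-1.0, 1.0], size=d)      # fixed random sign-flip pattern
x = rng.standard_normal(d)
x_near = x + 0.01 * rng.standard_normal(d)   # a nearby point
x_far = rng.standard_normal(d)               # an unrelated point

# Nearby inputs should mostly agree in their hash bits.
print((lsh_code(x, signs) == lsh_code(x_near, signs)).mean())
```

Repeating the sign-flip + transform step, as the comment suggests, improves the quality of the random projection.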
@hoaxuan7074 3 years ago
3 Element theory gives you Fast Transform fixed filter bank neural nets.
@beattoedtli1040 3 years ago
At 6:20, check out the y axis! It definitely is flattening out...
@veedrac 3 years ago
While I can see how one can rationalize the results otherwise, it seems to me that the scaling differences between dense and Switch (or other MoE) models on downstream tasks, relative to their scaling on perplexity, are further evidence against the idea that these are just memorize-interpolators. One would, I think, expect that such memorization and interpolation would be more robust on average to MoE-style partitioning than if they were also learning more general reasoning. Yet while we see Switch-Base outperform T5-Large on perplexity, it underperforms on every downstream task except CB Trivia QA. As in, this seems like what you get if your parameter scaling was giving benefits predominantly through better memorization, and it seems of a distinctly different character.
@spicychiley 3 years ago
Re: "model parallelism has high communication costs." Yes and no. Standard data parallelism (aka layer-sequential execution) incurs the overhead of synchronizing all accelerators, reducing all gradients, doing the weight update, and distributing the updated parameters again. Model-parallel (aka layer-parallel, aka layer-pipelined) execution incurs the overhead of moving the hidden activations, but the weights are not moved. If moving weights is more expensive than moving activations, then you probably want to run using model-parallel execution. There are many cases where pipelining a model incurs the penalty of moving weights but avoids a lot of the overheads present in layer-sequential execution.

From Pipelined Backpropagation at Scale: Training Large Models without Batches (Kosson et al. 2020, arxiv.org/abs/2003.11666): "Zhang et al. (2019c) find that fine-grained pipelining can enable speedups of up to 3.5x in their setting. Li & Pedram (2017) and Chen et al. (2016) both report energy savings of up to 3x. Fine-grained pipelining can also enable efficient sparse processing which Chen et al. (2019) show can result in up to a 42.5x and an 11.3x improvement in throughput and energy efficiency, respectively."

In a recent white paper SambaNova shows how they plan to pipeline models. See Figure 4 here: sambanova.ai/wp-content/uploads/2020/12/RDA-Whitepaper.pdf. Cerebras has also talked about the benefits of pipelining models: www.cerebras.net/data-model-pipeline-parallel-training-neural-networks
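As a rough illustration of the trade-off described above, here is a back-of-envelope comparison with toy numbers of my own choosing (not figures from the paper or the cited works): data parallelism all-reduces a full gradient copy every step, while pipelined execution moves only the activations at each pipeline cut.

```python
# Toy cost model: data parallelism moves a gradient the size of the weights
# each step; pipelining moves activations at each cut between stages.

hidden = 4096                                   # model width
n_layers = 24
block_params = n_layers * 12 * hidden**2        # rough transformer param count (no embeddings)
batch_tokens = 2048                             # tokens in flight per step
bytes_per_val = 2                               # fp16/bf16

grad_bytes = block_params * bytes_per_val       # data parallel: all-reduce gradients
act_bytes = batch_tokens * hidden * bytes_per_val  # pipeline: activations across one cut

print(f"gradient all-reduce: {grad_bytes / 1e9:.1f} GB per step")
print(f"activation transfer: {act_bytes / 1e6:.1f} MB per cut")
```

With these (made-up) sizes the gradient traffic dwarfs the activation traffic, which is the regime the comment describes where pipelining wins.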
@hoaxuan7074 3 years ago
Very wrong models. Use Fast Transform neural nets with an O(n log(n)) compute cost per layer and 2n parameters. You can either go wow, 1 trillion parameters, or you can slap your palm to your forehead. It depends if you like watching Monster Trucks or reading a chemistry book.
@etopowertwon 1 year ago
"We are not going to have trillion parameters anytime soon." It took just 2 years to reach that soon.
@EnricoRos 3 years ago
I think the main takeaway for Switch-C is that it outperforms T5-XXL using 1/10th of the FLOPS (although blowing past 1T params), while the smaller Switch models get the best performance when matching T5's compute. They haven't tried both equal compute and 1T params at once.
@conchylicultor 3 years ago
Thank you for the summary, this was very informative. I was just wondering: how did they manage to train the router weights if they are only sending examples to a single expert?
@lucashadi5195 3 years ago
Maybe that's where the high expert dropout comes into play.
@WilliamFedus1 3 years ago
There is still a gradient through the selected expert. Therefore, the router can effectively up- or down-weight that expert relative to the others (perhaps akin to a policy gradient).
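A minimal numerical sketch of this point (my own toy construction, not code from the paper): with pure argmax routing the router logits receive zero gradient, but if the selected expert's output is scaled by the router probability of that expert, as in the Switch Transformer, a gradient flows back to the router.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy linear "experts"
x = rng.standard_normal(d)

def switch_layer(logits, scale_by_prob):
    """Top-1 routing. If scale_by_prob, multiply the selected expert's output
    by the router probability of that expert (Switch-Transformer style)."""
    p = np.exp(logits) / np.exp(logits).sum()
    i = int(np.argmax(p))
    y = experts[i] @ x
    return (p[i] * y if scale_by_prob else y).sum()  # scalar "loss"

def grad_fd(f, logits, eps=1e-5):
    """Finite-difference gradient of f with respect to the router logits."""
    g = np.zeros_like(logits)
    for k in range(len(logits)):
        e = np.zeros_like(logits)
        e[k] = eps
        g[k] = (f(logits + e) - f(logits - e)) / (2 * eps)
    return g

logits = np.array([0.5, 0.1, -0.2])
g_hard = grad_fd(lambda l: switch_layer(l, scale_by_prob=False), logits)
g_switch = grad_fd(lambda l: switch_layer(l, scale_by_prob=True), logits)
print(g_hard)    # zero: pure argmax gives the router no learning signal
print(g_switch)  # nonzero: scaling by p lets the router up/down-weight the expert
```

This is the sense in which the router can up- or down-weight the selected expert relative to the others, even though only one expert runs per token.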
@adrienforbu5165 3 years ago
Impressively well explained! Thank you Yannic!
@yaxiongzhao6640 3 years ago
So HN has a comment: news.ycombinator.com/item?id=26174038 (sorry if @thesz saw this, I did not ask for permission). The context is that one comment suggested that the Switch Transformer is parameter-inefficient, i.e., it uses too many parameters to achieve the performance that some other architecture would achieve with far fewer. When someone asked what the basis for that conclusion was, this comment provided the reasoning (actually from a different user than the original inefficiency comment). The gist is that TensorFlow does not provide the APIs for experimenting with different algorithms, quote: "researchers at Google cannot do IRLS (search provides IRLS only for logistic regression in Tensorflow), they cannot do Hessian-free optimization ([4], closed due lack of activity - notice the "we can't support RNN due to the WHILE loop" bonanza), etc. All due to the fact they have to use Tensorflow - it just does not support these things." Any comments? I actually cannot comment on TensorFlow's capabilities at all...
@herp_derpingson 3 years ago
16:04 I ran into a similar problem while implementing a similar thing for one of my projects. Here, how will the router know that it should have routed to FFN1 instead of FFN2? If we do hard routing, there is no "push" to change the class from any gradients. 31:00 I would recommend the mixed precision video from the official TensorFlow YouTube channel. It's pretty good.
@YannicKilcher 3 years ago
Good question, I think the fact that they have a uniform distribution regularizer in there means that every now and then a token is routed to a different expert, from which it might get better gradients. A bit like RL
@Rhannmah 3 years ago
Uh oh, this is getting out of hand! Transformers are crazy and I can't imagine what they can do with that many params... This is also amazing because it potentially gives common folk like me some hope to actually be able to run a reasonably sized transformer on local hardware.
@paulcurry8383 3 years ago
11:17 A feed-forward layer still relates each token to the other tokens; it's just not computed based on the sample like self-attention is. I guess each expert must have capacity >= 2 to make sense, unless they are FFing on the token vector only.
@sourabmangrulkar9105 3 years ago
Thanks for the great explanation 😄
@TechyBen 3 years ago
GPT-5 Password re-rememberer: [Complete the following text] "I forgot my password, it is..."
@shaxosYT 3 years ago
Yannic: "Clearly memorized by scraping hacked databases"
@TechyBen 3 years ago
@@shaxosYT Oooh. Good one. Unintended successful consequences of AI.
@lars0me 3 years ago
... very hard to remember.
@bernardoramos9409 3 years ago
... forgotten
@pastrop2003 3 years ago
So, the FF layer is essentially a one-dimensional convolution. In that case, what is its kernel size? 1? I still don't quite understand that part. Also, when you say "token", you mean a 768-dimensional vector, or whatever the embedding dimensionality is, right?
@YannicKilcher 3 years ago
Yes, true: a token is represented as its vector, and a FF is a 1D convolution with k=1.
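A quick check of that equivalence (a toy sketch with made-up dimensions): applying the same weight matrix to each token independently is exactly a kernel-size-1 convolution over the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_out = 5, 8, 16
tokens = rng.standard_normal((seq_len, d_in))   # one sequence of token vectors
W = rng.standard_normal((d_in, d_out))

# Position-wise feed-forward: the same W applied to every token independently.
ff_out = tokens @ W

# 1-D convolution with kernel size 1: at each position t, each output channel
# is a weighted sum over input channels of that single position only.
conv_out = np.stack([W.T @ tokens[t] for t in range(seq_len)])

print(np.allclose(ff_out, conv_out))  # the two are the same operation
```

Since the kernel covers only one position, no information flows between tokens, which is why routing tokens to different experts per position is possible at all.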
@ammarkov 3 years ago
>tfw in my MocapNET work I use a classifier that decides on using an ensemble trained on a subset of the problem (basically poor man's routing) and it was one of the reviewer complaints.. This is a fundamentally good idea, divide and conquer!
@thomasmuller7001 3 years ago
what's next, routing is all you need?
@mathematicalninja2756 3 years ago
Lol, don't give em ideas
@muhammadsaadmansoor7777 3 years ago
At what point can models start being able to make sense, i.e. start reasoning? How do we give a model reasoning power?
@pensiveintrovert4318 3 years ago
When every slightly unique concept gets its own distinct vector. Everything a person could say with the current state of the language, culture, reality is encoded in a state.
@bryanye4490 3 years ago
I think the feedforward is a fully connected linear layer, not a disjoint linear layer
@ratanrohith1013 11 months ago
Here after Mixtral 8x7B release!
@NoFunkGiven 3 years ago
Thank you for the high quality of your videos :)
@jahcane3711 3 years ago
Thanks Yannic!
@andreaswallin8862 3 years ago
Is Yannic mainly concerned with NLP application?
@YannicKilcher 3 years ago
no I'm just interested in whatever I think is cool :)
@russellcox3699 3 years ago
Doesn't the hard routing screw with differentiation?
@anshul5243 3 years ago
For automatic differentiation I believe it is analogous to a max pool operation
@Sogartar 3 years ago
@@anshul5243 I think it is some sort of hard attention, but on the parameters, not the input. It must use argmax, which has a derivative of 0 almost everywhere, except where 2 or more arguments share the maximum value, where it is undefined. That is not useful for gradient descent. Maybe they are doing something with random sampling to estimate the optimization step to take. I have not read the paper.
@florianhonicke5448 3 years ago
Thanks for another awesome video!!!
@hypegt6885 3 years ago
sparsity is all you need
@simonstrandgaard5503 3 years ago
Great explanation.
@fast_harmonic_psychedelic 3 years ago
This is just like how the brain works. There are different parts of the brain that specialize in different layers of information processing. They should give some of the FFNs the ability to handle visual data, audio data, etc., so it has more than just one form of perception. The road to consciousness is through the combination of multiple forms of perception of the world in the same network. Until now it's all done in separate networks, but that's just like separate people, not like the brain, which processes multiple dimensions of input; it's that multi-dimensional (sight, sound, touch, social communication, etc.) processing which combines to form a concept (as opposed to a concept composed of strictly natural language). You can be told what a chair is your whole life, but until you can touch a chair, see a chair, sit in a chair, make a chair, etc., you don't really know a chair - just the word chair and how it relates to other words. Consciousness is knowing the word chair AND seeing it AND having other forms of measurement - and then the combined concept of the chair, and the concept OF the concept OF the concept, is sent through a feedback loop for self-reflection - and only then do the conditions for consciousness emerge. And that's all that consciousness is - nothing more than multi-FFN feedback loops.
@kicckicc 3 years ago
The usual way to do huge-transformer model parallelism isn't layer-by-layer, but vertical (e.g. splitting a trainable variable across different machines). The layer-by-layer approach leads to high TPU idle time; the GPipe framework (arxiv.org/abs/1811.06965) described this and proposed a way to alleviate it. Of course, the performance of the vertical model split relies on fast communication between TPUs.
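The "vertical" split described above can be illustrated with a toy example (my own sketch, not GPipe code): each device holds a column shard of a weight matrix and computes its slice of the output, and the slices are concatenated to recover the full result.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_devices = 8, 12, 3
x = rng.standard_normal(d_in)
W = rng.standard_normal((d_in, d_out))

# "Vertical" model parallelism: each device holds a column shard of W
# and computes its slice of the output; results are concatenated.
shards = np.split(W, n_devices, axis=1)   # one shard per device
partial = [x @ Wk for Wk in shards]       # computed in parallel in practice
y_sharded = np.concatenate(partial)

print(np.allclose(y_sharded, x @ W))  # sharded result matches the full matmul
```

Because each device only needs the shared input x, no device ever holds the full weight matrix, at the cost of communicating the input and gathering the output slices.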
@cmucjpeter 3 years ago
How do they backprop the error if using argmax to switch experts?
@YannicKilcher 3 years ago
a bit like random exploration in RL
@jamiecampbell7673 3 years ago
What I thought when I watched this was "ok, so it's a CapsNet except instead of feeding the revisions to a deeper expert each expert has the final say". Is that accurate?
@adarob 3 years ago
Thanks for the video! Unfortunately your explanation of model parallelism is inaccurate. The way you explained it requires sequential execution of the layers (unless pipelining is used). Switch Transformer, T5, etc. splits each layer into N shards and processes them in parallel.
@beagle989 5 months ago
you have such a lovely voice
@snippletrap 3 years ago
Welp, I'm still happy with my "lonely one Colab in the corner", LOL.
@herp_derpingson 3 years ago
* plays a small lonely violin in the corner *
@yongquanhuang7111 3 years ago
Sounds like GShard with top-1 expert routing. What's the novelty?
@christospapadopoulos7894 3 years ago
Google brain published it, there's your novelty
@danielalorbi 3 years ago
@@christospapadopoulos7894 Say no more, you had me at Google
@TechyBen 3 years ago
Specific useful application, or (in the first few mins of the video) mentioned it being "stable"?
@mathematicalninja2756 3 years ago
Novelty is in the domain
@arajuiojuo 3 years ago
no examples/use cases?
@granttao7504 3 years ago
This is a model-parallelization paper.
@GuillermoValleCosmos 3 years ago
maybe the expert dropout works because each expert is trained on fewer data, effectively, so regularizing it more helps
@MrSchweppes 3 years ago
4:00 "We are not going to have trillion parameters models around anytime soon just yet" Don't you think OpenAI will release GPT-4 with more than 1 trillion parameters in 6-9 months? I think they will.
@mathematicalninja2756 3 years ago
Closed AI
@NextFuckingLevel 3 years ago
They will, but my prediction is near the end of 2021
@DamianReloaded 3 years ago
As I watched, I kinda felt like I was witnessing the first strokes of AGI. I suspect when we learn to make networks of these models collaborate with each other to solve general problems, we will have AGI.
@DamianReloaded 3 years ago
yas
@mathematicalninja2756 3 years ago
You mean like the neurona
@DamianReloaded 3 years ago
More like a society of mind
@DamianReloaded 3 years ago
Yeah. We still have a way to go, but it feels like we are a lot closer than 6 years ago. The refinement of the techniques and how AI researchers think about neural processes as second nature, imo, are accelerating everything. With these huge models there is a legitimate interest in creating powerful hardware to run them. I can imagine a transformer being used as a "memory tracker" that given a few data points can "predict" (remember) what happened before or in between.
@preethamgali3023 3 years ago
DeBERTa explanation please.
@simleek 3 years ago
Trillion parameters... okay, that's approaching the number of synapses in the human brain. ~800 trillion though, so it will probably still take a bit of time. Also, still needs better design, which is out there, but not all put together. Edit: I think a further improvement would be to have multiple of these switch-transformers switch between running on some dataset, or different types of data, and I think having it combined at the end with info of which transformer ran would help too.
@JamesAwokeKnowing 3 years ago
yep, the first goal is to get to the 'scale' of data processing that the brain is doing, mostly to rule out magic "if it were scaled" claims/hopes/thoughts; then the goal can focus on architecture for learning/acting. I'd say 20 years till a brain-scale computer could run on a robot like Boston Dynamics', though maybe in a Tesla in about 15 years (hi KIT). Current tech can scale to brain scale in a data center, but new tech will distribute the computation at the layer level in a more neuromorphic style (1 billion tiny processors physically connected in a pattern 'learned' in data-center training).
@thomasmuller7001 3 years ago
finally an architecture that feels more biologically plausible..
@Daniel-ih4zh 3 years ago
Why do you think so?
@thomasmuller7001 3 years ago
@@Daniel-ih4zh Information is routed carefully in the brain: you don't need your complete brain to make sense of this sentence. In normal transformers that's not the case, and every piece of information goes through all the computation. With this routing mechanism it only uses compute where it has precalculated its necessity. At least a bit ;) At least this is my intuition.
@JTMoustache 3 years ago
Reminds me of lambda layers
@pratik245 3 years ago
Sparsity is like the neural network being able to figure out by itself which two/many parameters can form a concept, which it cannot do while training; and after training you can't really figure out which parameter learned what, because they are just numbers. The only way is to match values from a lower concept to a higher concept - the whole-part theory Mr. Hinton was talking about. There is no precise way of doing it, because energy manifestations in our world are probabilistic and not absolute. But this idea is worthwhile to explore... at least for solid objects, but simply impossible for thought processes.
@pratik245 3 years ago
How is the token forwarded to each of the different FF layers? Deep fakes.
@beattoedtli1040 3 years ago
Experts or exberts?
@YannicKilcher 3 years ago
nice one
@ksy8585 3 years ago
The paper is somewhat unfriendly to read.
@DasGrosseFressen 3 years ago
It's starting to get ridiculous, no?
@brandonwickstead9159 3 years ago
The rate of advancement in AI is astounding. It wasn't long ago people thought Go would never be solved by computers, and now that's ez work.
@brandonwickstead9159 3 years ago
I believe I will see at least the path to AGI in my lifetime.
@DasGrosseFressen 3 years ago
@@brandonwickstead9159 meh, not sure what is hype and iteration of more power and parameters, and what is really ground-breaking work.
@johnpope1473 3 years ago
Pytorch model for those playing at home - github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/switch/experiment.py