I was crossing my fingers for this video. Thank you.
@ChocolateMilkCultLeader 2 years ago
Crazy how Yannic has a video on every topic I want to research
@tianyangchen1339 1 year ago
Thank you for your explanation! You always make complex things easy to understand, so great!
@alivecoding4995 1 year ago
Thanks so much, Yannic!
@adrianpetrescu8583 3 years ago
man, you have a way of telling the story that I find very easy to understand. It is easy for me to learn from you :) thanks
@hoaxuan7074 3 years ago
Max pooling, locality sensitive hashing parameter switching, ReLU (f(x)=x connect, f(x)=0 disconnect) are all switching. Convolution, weighted sums, fast transforms (FFT Hadamard) are all dot products. Locality sensitive hash = Random projection followed by binarization. Random projection = fixed pattern of randomly chosen sign flips followed by Hadamard transform. Repeat for better quality. 3 Elements: Dot product, switching, predicate for switch state (EG. x
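The random-projection LSH recipe sketched above can be written in a few lines. This is only a minimal illustration of the idea (names, dimensions and thresholds are my own, not from any particular paper): a fixed random sign-flip pattern, a fast Walsh-Hadamard transform, then binarization to get the hash bits.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def lsh_code(x, signs):
    """Random projection = fixed random sign flips followed by a Hadamard
    transform; binarizing the result gives a locality-sensitive hash."""
    return (fwht(x * signs) >= 0).astype(np.int8)

rng = np.random.default_rng(0)
d = 8
signs = rng.choice([-1.0, 1.0], size=d)      # fixed random sign-flip pattern
x = rng.standard_normal(d)
x_near = x + 0.01 * rng.standard_normal(d)   # a nearby point
x_far = rng.standard_normal(d)               # an unrelated point

# Nearby inputs should mostly agree in their hash bits.
print((lsh_code(x, signs) == lsh_code(x_near, signs)).mean())
```

Repeating the sign-flip + transform step, as the comment suggests, improves the quality of the random projection.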
@hoaxuan7074 3 years ago
3 Element theory gives you Fast Transform fixed filter bank neural nets.
@beattoedtli1040 3 years ago
At 6:20, check out the y axis! It definitely is flattening out...
@veedrac 3 years ago
While I can see how one can rationalize the results otherwise, it seems to me that the scaling differences between dense and Switch (or other MoE) models on downstream tasks, relative to their scaling on perplexity, are further evidence against the idea that these are just memorize-interpolators. One would, I think, expect that such memorization and interpolation would be more robust on average to MoE-style partitioning than if they were also learning more general reasoning. Yet while we see Switch-Base outperform T5-Large on perplexity, it underperforms on every downstream task except CB Trivia QA. As in, this seems like what you get if your parameter scaling was giving benefits predominantly through better memorization, and it seems of a distinctly different character.
@spicychiley 3 years ago
Re: "model parallelism has high communication costs." Yes and no. Standard data parallelism (aka layer-sequential execution) incurs the overhead of synchronizing all accelerators, reducing all gradients, doing the weight update, and distributing the updated parameters again. Model-parallel (aka layer-parallel, aka layer-pipelined) execution incurs the overhead of moving the hidden activations, but the weights are not moved. If moving weights is more expensive than moving activations, then you probably want to run using model-parallel execution. There are many cases where pipelining a model incurs the penalty of moving weights but avoids a lot of the overheads present in layer-sequential execution.

From Pipelined Backpropagation at Scale: Training Large Models without Batches (Kosson et al. 2020, arxiv.org/abs/2003.11666): "Zhang et al. (2019c) find that fine-grained pipelining can enable speedups of up to 3.5x in their setting. Li & Pedram (2017) and Chen et al. (2016) both report energy savings of up to 3x. Fine-grained pipelining can also enable efficient sparse processing which Chen et al. (2019) show can result in up to a 42.5x and an 11.3x improvement in throughput and energy efficiency, respectively."

In a recent white paper SambaNova shows how they plan to pipeline models. See Figure 4 here: sambanova.ai/wp-content/uploads/2020/12/RDA-Whitepaper.pdf. Cerebras has also talked about the benefits of pipelining models: www.cerebras.net/data-model-pipeline-parallel-training-neural-networks
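As a rough illustration of the trade-off described above, here is a back-of-envelope comparison with toy numbers of my own choosing (not figures from the paper or the cited works): data parallelism all-reduces a full gradient copy every step, while pipelined execution moves only the activations at each pipeline cut.

```python
# Toy cost model: data parallelism moves a gradient the size of the weights
# each step; pipelining moves activations at each cut between stages.

hidden = 4096                                   # model width
n_layers = 24
block_params = n_layers * 12 * hidden**2        # rough transformer param count (no embeddings)
batch_tokens = 2048                             # tokens in flight per step
bytes_per_val = 2                               # fp16/bf16

grad_bytes = block_params * bytes_per_val       # data parallel: all-reduce gradients
act_bytes = batch_tokens * hidden * bytes_per_val  # pipeline: activations across one cut

print(f"gradient all-reduce: {grad_bytes / 1e9:.1f} GB per step")
print(f"activation transfer: {act_bytes / 1e6:.1f} MB per cut")
```

With these (made-up) sizes the gradient traffic dwarfs the activation traffic, which is the regime the comment describes where pipelining wins.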
@hoaxuan7074 3 years ago
Very wrong models. Use Fast Transform neural nets with an O(n log(n)) compute cost per layer and 2n parameters. You can either go wow, 1 trillion parameters, or you can slap your palm to your forehead. It depends if you like watching Monster Trucks or reading a chemistry book.
@etopowertwon 1 year ago
"We are not going to have trillion parameters anytime soon." It took just 2 years to reach that soon.
@EnricoRos 3 years ago
I think the main takeaway for Switch-C is that it outperforms T5-XXL using 1/10th of the FLOPS (although blowing past 1T params), while the smaller Switch models get the best performance when matching T5's compute. They haven't tried both equal compute and 1T params at once.
@conchylicultor 3 years ago
Thank you for the summary, this was very informative. I was just wondering: how did they manage to train the router weights if they are only sending examples to a single expert?
@lucashadi5195 3 years ago
Maybe that's where the high expert dropout comes into play.
@WilliamFedus1 3 years ago
There is still a gradient through the selected expert. Therefore, the router can effectively up- or down-weight that expert relative to the others (perhaps akin to a policy gradient).
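A minimal numerical sketch of this point (my own toy construction, not code from the paper): with pure argmax routing the router logits receive zero gradient, but if the selected expert's output is scaled by the router probability of that expert, as in the Switch Transformer, a gradient flows back to the router.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy linear "experts"
x = rng.standard_normal(d)

def switch_layer(logits, scale_by_prob):
    """Top-1 routing. If scale_by_prob, multiply the selected expert's output
    by the router probability of that expert (Switch-Transformer style)."""
    p = np.exp(logits) / np.exp(logits).sum()
    i = int(np.argmax(p))
    y = experts[i] @ x
    return (p[i] * y if scale_by_prob else y).sum()  # scalar "loss"

def grad_fd(f, logits, eps=1e-5):
    """Finite-difference gradient of f with respect to the router logits."""
    g = np.zeros_like(logits)
    for k in range(len(logits)):
        e = np.zeros_like(logits)
        e[k] = eps
        g[k] = (f(logits + e) - f(logits - e)) / (2 * eps)
    return g

logits = np.array([0.5, 0.1, -0.2])
g_hard = grad_fd(lambda l: switch_layer(l, scale_by_prob=False), logits)
g_switch = grad_fd(lambda l: switch_layer(l, scale_by_prob=True), logits)
print(g_hard)    # zero: pure argmax gives the router no learning signal
print(g_switch)  # nonzero: scaling by p lets the router up/down-weight the expert
```

This is the sense in which the router can up- or down-weight the selected expert relative to the others, even though only one expert runs per token.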
@adrienforbu5165 3 years ago
Impressively well explained! Thank you Yannic!
@yaxiongzhao6640 3 years ago
So HN has a comment: news.ycombinator.com/item?id=26174038 (sorry if @thesz saw this, I did not ask for permission). The context is that one comment suggested that the Switch Transformer is parameter-inefficient, i.e., it uses too many parameters to achieve the performance that some other architecture would achieve with far fewer. When someone asked what the basis for that conclusion was, this comment provided the reasoning (actually from a different user than the original inefficiency comment). The gist is that TensorFlow does not provide the APIs for experimenting with different algorithms, quote: "researchers at Google cannot do IRLS (search provides IRLS only for logistic regression in Tensorflow), they cannot do Hessian-free optimization ([4], closed due lack of activity - notice the "we can't support RNN due to the WHILE loop" bonanza), etc. All due to the fact they have to use Tensorflow - it just does not support these things." Any comments? I actually cannot comment on TensorFlow's capabilities at all...
@herp_derpingson 3 years ago
16:04 I ran into a similar problem while implementing a similar thing for one of my projects. Here, how will the router know that it should have routed to FFN1 instead of FFN2? If we do hard routing, there is no "push" to change the class from any gradients. 31:00 I would recommend the mixed precision video from the official TensorFlow YouTube channel. It's pretty good.
@YannicKilcher 3 years ago
Good question, I think the fact that they have a uniform distribution regularizer in there means that every now and then a token is routed to a different expert, from which it might get better gradients. A bit like RL
@Rhannmah 3 years ago
Uh oh, this is getting out of hand! Transformers are crazy and I can't imagine what they can do with that many params... This is also amazing because it potentially gives common folk like me some hope to actually be able to run a reasonably sized transformer on local hardware.
@paulcurry8383 3 years ago
11:17 A feed-forward layer still relates each token to the other tokens; it's just not computed based on the sample like self-attention is. I guess each expert must have capacity >= 2 to make sense, unless they are FFing on the token vector only.
@sourabmangrulkar9105 3 years ago
Thanks for the great explanation 😄
@TechyBen 3 years ago
GPT-5 Password re-rememberer: [Complete the following text] "I forgot my password, it is..."
@shaxosYT 3 years ago
Yannic: "Clearly memorized by scraping hacked databases"
@TechyBen 3 years ago
@@shaxosYT Oooh. Good one. Unintended successful consequences of AI.
@lars0me 3 years ago
... very hard to remember.
@bernardoramos9409 3 years ago
... forgotten
@pastrop2003 3 years ago
So, the FF layer is essentially a one-dimensional convolution. In that case, what is its kernel size? 1? I still don't quite understand that part. Also, when you say "token", you mean a 768-dimensional vector, or whatever the embedding dimensionality is, right?
@YannicKilcher 3 years ago
Yes, true: a token is represented as its vector, and a FF is a 1D convolution with k=1.
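A quick check of that equivalence (a toy sketch with made-up dimensions): applying the same weight matrix to each token independently is exactly a kernel-size-1 convolution over the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_out = 5, 8, 16
tokens = rng.standard_normal((seq_len, d_in))   # one sequence of token vectors
W = rng.standard_normal((d_in, d_out))

# Position-wise feed-forward: the same W applied to every token independently.
ff_out = tokens @ W

# 1-D convolution with kernel size 1: at each position t, each output channel
# is a weighted sum over input channels of that single position only.
conv_out = np.stack([W.T @ tokens[t] for t in range(seq_len)])

print(np.allclose(ff_out, conv_out))  # the two are the same operation
```

Since the kernel covers only one position, no information flows between tokens, which is why routing tokens to different experts per position is possible at all.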
@ammarkov 3 years ago
>tfw in my MocapNET work I use a classifier that decides on using an ensemble trained on a subset of the problem (basically poor man's routing) and it was one of the reviewer complaints.. This is a fundamentally good idea, divide and conquer!
@thomasmuller7001 3 years ago
what's next, routing is all you need?
@mathematicalninja2756 3 years ago
Lol, don't give em ideas
@muhammadsaadmansoor7777 3 years ago
At what point can models start being able to make sense, i.e. start reasoning? How do we give a model reasoning power?
@pensiveintrovert4318 3 years ago
When every slightly unique concept gets its own distinct vector. Everything a person could say with the current state of the language, culture, reality is encoded in a state.
@bryanye4490 3 years ago
I think the feedforward is a fully connected linear layer, not a disjoint linear layer
@ratanrohith1013 11 months ago
Here after Mixtral 8x7B release!
@NoFunkGiven 3 years ago
Thank you for the high quality of your videos :)
@jahcane3711 3 years ago
Thanks Yannic!
@andreaswallin8862 3 years ago
Is Yannic mainly concerned with NLP application?
@YannicKilcher 3 years ago
no I'm just interested in whatever I think is cool :)
@russellcox3699 3 years ago
Doesn't the hard routing screw with differentiation?
@anshul5243 3 years ago
For automatic differentiation I believe it is analogous to a max pool operation
@Sogartar 3 years ago
@@anshul5243 I think it is some sort of hard attention, but on the parameters, not the input. It must use argmax, which has a derivative of 0 almost everywhere, except where 2 or more arguments share the maximum value, where it is undefined. That is not useful for gradient descent. Maybe they are doing something with random sampling to estimate the optimization step to take. I have not read the paper.
@florianhonicke5448 3 years ago
Thanks for another awesome video!!!
@hypegt6885 3 years ago
sparsity is all you need
@simonstrandgaard5503 3 years ago
Great explanation.
@fast_harmonic_psychedelic 3 years ago
This is just like how the brain works. There are different parts of the brain that specialize in different layers of information processing. They should give some of the FFNs the ability to handle visual data, audio data, etc., so it has more than just one form of perception. The road to consciousness is through the combination of multiple forms of perception of the world in the same network. Until now it's all done in separate networks, but that's just like separate people, not like the brain, which processes multiple dimensions of input; it's that multi-dimensional (sight, sound, touch, social communication, etc.) processing which combines to form a concept (as opposed to a concept composed of strictly natural language). You can be told what a chair is your whole life, but until you can touch a chair, see a chair, sit in a chair, make a chair, etc., you don't really know a chair - just the word chair and how it relates to other words. Consciousness is knowing the word chair AND seeing it AND having other forms of measurement - and then the combined concept of the chair, and the concept OF the concept OF the concept, is sent through a feedback loop for self-reflection - and only then do the conditions for consciousness emerge. And that's all that consciousness is - nothing more than multi-FFN feedback loops.
@kicckicc 3 years ago
The usual way to do huge-transformer model parallelism isn't layer-by-layer, but vertical (e.g. splitting a trainable variable across different machines). The layer-by-layer approach leads to high TPU idle time; the GPipe framework (arxiv.org/abs/1811.06965) described this and proposed a way to alleviate it. Of course, the performance of the vertical model split relies on fast communication between TPUs.
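The "vertical" split described above can be illustrated with a toy example (my own sketch, not GPipe code): each device holds a column shard of a weight matrix and computes its slice of the output, and the slices are concatenated to recover the full result.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_devices = 8, 12, 3
x = rng.standard_normal(d_in)
W = rng.standard_normal((d_in, d_out))

# "Vertical" model parallelism: each device holds a column shard of W
# and computes its slice of the output; results are concatenated.
shards = np.split(W, n_devices, axis=1)   # one shard per device
partial = [x @ Wk for Wk in shards]       # computed in parallel in practice
y_sharded = np.concatenate(partial)

print(np.allclose(y_sharded, x @ W))  # sharded result matches the full matmul
```

Because each device only needs the shared input x, no device ever holds the full weight matrix, at the cost of communicating the input and gathering the output slices.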
@cmucjpeter 3 years ago
How do they backprop the error if using argmax to switch experts?
@YannicKilcher 3 years ago
a bit like random exploration in RL
@jamiecampbell7673 3 years ago
What I thought when I watched this was "ok, so it's a CapsNet except instead of feeding the revisions to a deeper expert each expert has the final say". Is that accurate?
@adarob 3 years ago
Thanks for the video! Unfortunately your explanation of model parallelism is inaccurate. The way you explained it requires sequential execution of the layers (unless pipelining is used). Switch Transformer, T5, etc. splits each layer into N shards and processes them in parallel.
@beagle989 5 months ago
you have such a lovely voice
@snippletrap 3 years ago
Welp, I'm still happy with my "lonely one Colab in the corner", LOL.
@herp_derpingson 3 years ago
* plays a small lonely violin in the corner *
@yongquanhuang7111 3 years ago
Sounds like GShard with top-1 expert routing. What's the novelty?
@christospapadopoulos7894 3 years ago
Google brain published it, there's your novelty
@danielalorbi 3 years ago
@@christospapadopoulos7894 Say no more, you had me at Google
@TechyBen 3 years ago
Specific useful application, or (in the first few mins of the video) mentioned it being "stable"?
@mathematicalninja2756 3 years ago
Novelty is in the domain
@arajuiojuo 3 years ago
no examples/use cases?
@granttao7504 3 years ago
This is a model-parallelization paper.
@GuillermoValleCosmos 3 years ago
maybe the expert dropout works because each expert is trained on fewer data, effectively, so regularizing it more helps
@MrSchweppes 3 years ago
4:00 "We are not going to have trillion parameters models around anytime soon just yet" Don't you think OpenAI will release GPT-4 with more than 1 trillion parameters in 6-9 months? I think they will.
@mathematicalninja2756 3 years ago
Closed AI
@NextFuckingLevel 3 years ago
They will, but my prediction is near the end of 2021
@DamianReloaded 3 years ago
As I watched, I kinda felt like I was witnessing the first strokes of AGI. I suspect when we learn to make networks of these models collaborate with each other to solve general problems, we will have AGI.
@DamianReloaded 3 years ago
yas
@mathematicalninja2756 3 years ago
You mean like the neurona
@DamianReloaded 3 years ago
More like a society of mind
@DamianReloaded 3 years ago
Yeah. We still have a way to go, but it feels like we are a lot closer than 6 years ago. The refinement of the techniques and how AI researchers think about neural processes as second nature, imo, are accelerating everything. With these huge models there is a legitimate interest in creating powerful hardware to run them. I can imagine a transformer being used as a "memory tracker" that given a few data points can "predict" (remember) what happened before or in between.
@preethamgali3023 3 years ago
DeBERTa explanation please.
@simleek 3 years ago
Trillion parameters... okay, that's approaching the number of synapses in the human brain. ~800 trillion though, so it will probably still take a bit of time. Also, still needs better design, which is out there, but not all put together. Edit: I think a further improvement would be to have multiple of these switch-transformers switch between running on some dataset, or different types of data, and I think having it combined at the end with info of which transformer ran would help too.
@JamesAwokeKnowing 3 years ago
yep, the first goal is to get to the 'scale' of data processing that the brain is doing, mostly to rule out magic "if it were scaled" claims/hopes/thoughts; then the goal can focus on architecture for learning/acting. I'd say 20 years till a brain-scale computer could run on a robot like Boston Dynamics', though maybe in a Tesla in about 15 years (hi KIT). Current tech can scale to brain scale in a data center, but new tech will distribute the computation at the layer level in a more neuromorphic style (1 billion tiny processors physically connected in a pattern 'learned' in data-center training).
@thomasmuller7001 3 years ago
finally an architecture that feels more biologically plausible..
@Daniel-ih4zh 3 years ago
Why do you think so?
@thomasmuller7001 3 years ago
@@Daniel-ih4zh Information is routed carefully in the brain: you don't need your complete brain to make sense of this sentence. In normal transformers that's not the case, and every piece of information goes through all the computation. With this routing mechanism it only uses compute where it has precalculated its necessity. At least a bit ;) At least this is my intuition.
@JTMoustache 3 years ago
Reminds me of lambda layers
@pratik245 3 years ago
Sparsity is like the neural network being able to figure out by itself which two/many parameters can form a concept, which it cannot do while training; and after training you can't really figure out which parameter learned what, because they are just numbers. The only way is to match values from a lower concept to a higher concept - the whole-part theory Mr. Hinton was talking about. There is no precise way of doing it, because energy manifestations in our world are probabilistic and not absolute. But this idea is worthwhile to explore... at least for solid objects, but simply impossible for thought processes.
@pratik245 3 years ago
How is the token forwarded to each of the different FF layers? Deep fakes.
@beattoedtli1040 3 years ago
Experts or exberts?
@YannicKilcher 3 years ago
nice one
@ksy8585 3 years ago
The paper is somewhat unfriendly to read.
@DasGrosseFressen 3 years ago
It's starting to get ridiculous, no?
@brandonwickstead9159 3 years ago
The rate of advancement in AI is astounding. It wasn't long ago people thought Go would never be solved by computers, and now that's ez work.
@brandonwickstead9159 3 years ago
I believe I will see at least the path to AGI in my lifetime.
@DasGrosseFressen 3 years ago
@@brandonwickstead9159 meh, not sure what is hype and iteration of more power and parameters, and what is really ground-breaking work.
@johnpope1473 3 years ago
Pytorch model for those playing at home - github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/switch/experiment.py