Sparse Expert Models (Switch Transformers, GLaM, and more... w/ the Authors)

18,620 views

Yannic Kilcher

1 day ago

#nlp #sparsity #transformers
This video is an interview with Barret Zoph and William Fedus of Google Brain about Sparse Expert Models.
Sparse expert models have been hugely successful at distributing parts of models, mostly Transformers, across large arrays of machines, using a routing function to effectively route signals between them. This means that even though these models have a huge number of parameters, the computational load for a given signal does not increase, because the model is only sparsely activated. Sparse expert models such as Switch Transformers and GLaM can scale up to trillions of parameters and bring a number of desirable properties. We discuss everything from the fundamentals, history, strengths, and weaknesses up to the current state of the art of these models.
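To make the routing idea concrete, below is a minimal sketch of top-1 ("switch") routing in PyTorch. It is an illustration under assumptions, not the authors' implementation: the class name SwitchFFN, the layer sizes, and the Python loop over experts are invented for clarity, and a real system would also add a load-balancing auxiliary loss and shard the experts across devices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Top-1 routed mixture-of-experts feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # the routing function
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)  # (num_tokens, num_experts)
        weight, expert_idx = gates.max(dim=-1)     # pick the top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                 # tokens routed to expert e
            if mask.any():
                # Scale by the gate probability so the router receives gradients.
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

For example, applying SwitchFFN(d_model=512, d_ff=2048, num_experts=8) to a (num_tokens, 512) tensor touches only one expert's weights per token, so per-token compute stays roughly that of a single dense feed-forward layer even as num_experts, and thus the parameter count, grows.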
OUTLINE:
0:00 - Intro
0:30 - What are sparse expert models?
4:25 - Start of Interview
5:55 - What do you mean by sparse experts?
8:10 - How does routing work in these models?
12:10 - What is the history of sparse experts?
14:45 - What does an individual expert learn?
19:25 - When are these models appropriate?
22:30 - How comparable are sparse to dense models?
26:30 - How does the pathways system connect to this?
28:45 - What improvements did GLaM make?
31:30 - The "designing sparse experts" paper
37:45 - Can experts be frozen during training?
41:20 - Can the routing function be improved?
47:15 - Can experts be distributed beyond data centers?
50:20 - Are there sparse experts for other domains than NLP?
52:15 - Are sparse and dense models in competition?
53:35 - Where do we go from here?
56:30 - How can people get started with this?
Papers:
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (arxiv.org/abs/2101.03961)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (arxiv.org/abs/2112.06905)
Designing Effective Sparse Expert Models (arxiv.org/abs/2202.08906)
Links:
Merch: store.ykilcher.com
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
BitChute: www.bitchute.com/channel/yann...
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/2017636191
If you want to support me, the best thing to do is to share the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 27
@MsFearco 2 years ago
"We are not going to talk about the attention layer today. All you have to know is that attention is all you need." Was kinda expecting this :D
@alan2here 2 years ago
I've been hearing about a fractal/hierarchy of groups/clusters of neurones approach for years; it's nice to see it actually happen.
@oncedidactic 2 years ago
Glad to hear the guests comment that the primary motivation is engineering considerations. It really does seem like the whole concept of experts (in the light discussed) is to better route latent representations to compute. But they also make a great point that this must be the smart way forward in the longer term, instead of always shoving things through dense networks. I wonder if there is more opportunity than acknowledged in the experts' token tendencies regarding interpretability. In fact I'm surprised they expected anything other than basic switching behavior, which would be the seemingly obvious first optimum to train towards. (No?) Overall this seems like a really ripe area for research and maybe architecture innovations. Dense networks must be doing something similar, so why not make it explicit with experts and leverage that? Anyway, great interview as always, thanks Yannic and guests!
@shawnlin2718 2 years ago
I've been waiting for this video for about a year!
@CharlesVanNoland 2 years ago
Expert model experts, eh?
@user-hf1lo4ts5g 2 years ago
They probably have expertise about expert models... Experts of expert models with expertise about expert models...! EXPERTCEPTION
@oncedidactic 2 years ago
Yannic the only dude who can throw out an “ergo” like playing frisbee at the park. And the catch is effortless
@robmarks6800 1 year ago
Great to see OpenAI’s secret sauce explained so clearly
@Timotheeee1 8 months ago
This was invented by Google.
@alan2here 2 years ago
Is anyone else reminded of the cores from Portal 2 (2011)? "wanna go to space, did you know that mars is really big? space, SPACE"
@brll5733 2 years ago
Did anyone else feel they kinda talked around the questions? Barret especially?
@jabowery 2 years ago
There's something confused about the use of the word "sparse" in the context of expanded numbers of parameters for the same quantity of data. Overfitting is an obvious consequence: A sequential memory can be viewed as a bunch of "experts", each of which knows only about one bit in the data. Routing is memory address decoding. No compression of anything, no generalization/prediction hence no validation. When I think of "sparse models" it is relative to the quantity of observations up to the present point in time, hence it has a natural correspondence to Kolmogorov Complexity of the data -- the polar opposite of overfitting. Maybe one way of approaching this is to explicitly represent the lack of connections as 0-weight connections -- so it is a degenerate case of a stupendously huge dense network model -- and then ask how one would train such a stupendously huge dense network model to zero out those connections to a merely huge sparse network model.
@oncedidactic 2 years ago
I feel this; experts are like a proxy, a major shortcut for backprop to zero out all those weights. Seen another way, it pins all the gradient on the previous layer, tuning the latent representation to conform to the arbitrary expert partition of downstream parameter buckets. There's something about it that feels right in conjunction with lottery-ticket ideas.
@ZastrowOtto 2 years ago
What about group convolutions? They have been used to spread groups of filters (e.g. 4) across different machines. They did not have a routing function, but otherwise they seem related?
@intisarchy7059 2 years ago
What do you think of mixture-of-experts models for vision tasks?
@csababotos7004 2 years ago
How can it be that there's literally no one talking about the KAT? On the other hand, there was a DeepMind paper mentioned w.r.t. formalizing sparse-vs-dense comparisons; is there any pointer to that by any chance?
@alan2here 2 years ago
This is very Greg Egan, for example "Diaspora". I've said that before, but it's even more so now.
@Data_scientist_t3rmi 1 year ago
I'm quite confused; that's a lot of evolution in a short amount of time. When I look at Reformers, they were not used like Transformers for NLP tasks, and when I search for pre-trained Reformer models I cannot find any useful ones. Will Switch Transformers have the same destiny?
@amarnamarpan 1 year ago
Dr. Ashish Vaswani is a pioneer and nobody is talking about him. He is a scientist from Google Brain and the first author of the paper that introduced Transformers, the architecture that is the backbone of all other recent models.
@alan2here 2 years ago
An expert in knowing everyone's name
@novelspace 2 years ago
Expert automata
@alan2here 2 years ago
"Translate to English (United Kingdom)" come on KZbin, this comment is already in English.
@victorrielly4588 2 years ago
2017, wow, that is ridiculously old. Really really old indeed. (13:10). This concept predated the dinosaurs.
@alan2here 2 years ago
experts seem like neurones
@dansplain2393 3 months ago
Bruh
@gouravagrwal4282 2 years ago
I am first!