MoE LLMs with Dense Training for Better Performance

1,364 views

Tunadorable

1 day ago

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
arxiv.org/abs/...
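As a rough illustration of the title's "dense training, sparse inference" idea, here is a hedged PyTorch-style sketch in which every expert processes every token during training and only the top-k gate weights are kept at inference. The class name, shapes, and mixing details are my assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

class DenseTrainSparseInferMoE(torch.nn.Module):
    """Hypothetical MoE feed-forward layer: dense mixing of all experts
    during training, top-k gating at inference."""

    def __init__(self, d_model, d_hidden, n_experts, k=2):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_hidden),
                torch.nn.GELU(),
                torch.nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (batch, seq, d_model); per-token routing weights from the gate.
        probs = F.softmax(self.gate(x), dim=-1)                       # (B, S, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)      # (B, S, D, E)
        if self.training:
            # Dense: every expert sees every token; outputs mixed by the gate.
            return (outs * probs.unsqueeze(-2)).sum(dim=-1)
        # Sparse: keep only the top-k gate weights per token, zero the rest.
        # (For brevity this sketch still computes all experts and masks them;
        # a real implementation would dispatch tokens only to chosen experts.)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)
        mask = torch.zeros_like(probs).scatter(-1, topk_idx, topk_p)
        return (outs * mask.unsqueeze(-2)).sum(dim=-1)
```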
Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!
/ tunadorable
account.venmo....
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tuna...

Comments: 7
@marinepower 2 months ago
I think you might be able to apply this method to a VQ-VAE, which is typically weird to train since it uses a straight-through estimator (which gives incorrect / incomplete gradients).
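For readers unfamiliar with the straight-through estimator mentioned above, here is a minimal PyTorch-style sketch of VQ-VAE-style quantization; the names and shapes are illustrative assumptions, not from any particular codebase. The forward pass uses the quantized vector, while gradients are copied straight through to the encoder, which is why they come out approximate.

```python
import torch

def quantize_ste(z, codebook):
    # z: (batch, dim) encoder outputs; codebook: (num_codes, dim) -- hypothetical shapes.
    dists = torch.cdist(z, codebook)   # pairwise distances to every codebook entry
    idx = dists.argmin(dim=-1)         # nearest code per vector
    z_q = codebook[idx]                # quantized vectors
    # Straight-through estimator: the forward value is z_q, but the backward
    # pass treats the quantization as the identity, so gradients flow to z.
    return z + (z_q - z).detach()
```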
@BooleanDisorder 2 months ago
Btw, I was thinking about MoE multimodal models... I think there still isn't a true multimodal model yet, despite what OpenAI claims about 4o. I think they still process different modalities in different experts, so they don't truly get the richness that multimodality promises.
@npip99 2 months ago
MoE isn't split by modality.
@BooleanDisorder 2 months ago
@npip99 Why wouldn't it be?
@Tunadorable 2 months ago
So the experts in a given layer are chosen based on each individual token's state right after it leaves the attention mechanism, meaning each token can get a different combination of experts. Modalities are handled differently in different models, but usually you've got tokens for the image, different tokens for the text, etc. The experts are encouraged to activate on each token with roughly equal probability. It'd be an interesting question to test whether certain experts tend to group toward the tokens of one modality or another, and whether a given expert performs similar/related tasks when it's used on a text token as when it's used on a vision token. That would be an interesting mech-interp study to read, but for the latter idea I'm not sure how you'd go about classifying task similarity.
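For context on the per-token routing described in this reply, here is a hedged PyTorch-style sketch of a top-k token-level router with a crude load-balancing term; the class name, shapes, and auxiliary loss are illustrative assumptions, not taken from the paper or the video.

```python
import torch
import torch.nn.functional as F

class TopKRouter(torch.nn.Module):
    """Minimal per-token top-k router (hypothetical names and shapes)."""

    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (batch, seq, d_model) -- token states coming out of attention.
        # Each token is routed independently, so tokens from different
        # modalities are free to land on different experts.
        probs = F.softmax(self.gate(x), dim=-1)        # (batch, seq, n_experts)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)  # experts chosen per token
        # Crude load-balancing proxy: penalize deviation of the average
        # routing probability from the uniform 1/n_experts target, nudging
        # experts toward roughly equal usage across tokens.
        n_experts = probs.shape[-1]
        aux_loss = ((probs.mean(dim=(0, 1)) - 1.0 / n_experts) ** 2).sum()
        return topk_p, topk_idx, aux_loss
```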
@yorailevi6747 2 months ago
I am a little confused: you have the lines marked, but you're acting as if it's a live read and you'd never marked those lines to begin with... Are you using some automated tool to highlight key sentences?
@zacharykosove9048 1 month ago
He's probably thinking through what the paper means and explaining his thought process out loud.