MoE LLMs with Dense Training for Better Performance

1,364 views

Tunadorable

1 day ago

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
arxiv.org/abs/...
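As a rough illustration of the title's "dense training, sparse inference" idea, here is a hedged PyTorch-style sketch in which every expert processes every token during training and only the top-k gate weights are kept at inference. The class name, shapes, and mixing details are my assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

class DenseTrainSparseInferMoE(torch.nn.Module):
    """Hypothetical MoE feed-forward layer: dense mixing of all experts
    during training, top-k gating at inference."""

    def __init__(self, d_model, d_hidden, n_experts, k=2):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_hidden),
                torch.nn.GELU(),
                torch.nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (batch, seq, d_model); per-token routing weights from the gate.
        probs = F.softmax(self.gate(x), dim=-1)                       # (B, S, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)      # (B, S, D, E)
        if self.training:
            # Dense: every expert sees every token; outputs mixed by the gate.
            return (outs * probs.unsqueeze(-2)).sum(dim=-1)
        # Sparse: keep only the top-k gate weights per token, zero the rest.
        # (For brevity this sketch still computes all experts and masks them;
        # a real implementation would dispatch tokens only to chosen experts.)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)
        mask = torch.zeros_like(probs).scatter(-1, topk_idx, topk_p)
        return (outs * mask.unsqueeze(-2)).sum(dim=-1)
```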
Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo!
/ tunadorable
account.venmo....
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tuna...

Comments: 7
@marinepower 2 months ago
I think you might be able to apply this method to a VQ-VAE, which is typically weird to train since it uses a straight-through estimator (which gives incorrect / incomplete gradients).
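For readers unfamiliar with the straight-through estimator mentioned above, here is a minimal PyTorch-style sketch of VQ-VAE-style quantization; the names and shapes are illustrative assumptions, not from any particular codebase. The forward pass uses the quantized vector, while gradients are copied straight through to the encoder, which is why they come out approximate.

```python
import torch

def quantize_ste(z, codebook):
    # z: (batch, dim) encoder outputs; codebook: (num_codes, dim) -- hypothetical shapes.
    dists = torch.cdist(z, codebook)   # pairwise distances to every codebook entry
    idx = dists.argmin(dim=-1)         # nearest code per vector
    z_q = codebook[idx]                # quantized vectors
    # Straight-through estimator: the forward value is z_q, but the backward
    # pass treats the quantization as the identity, so gradients flow to z.
    return z + (z_q - z).detach()
```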
@BooleanDisorder 2 months ago
Btw, I was thinking about MoE multimodal models... I think there still isn't a true multimodal model yet, despite what OpenAI claims about 4o. I think they still process different modalities in different experts, so they don't truly get the richness that multimodality promises.
@npip99 2 months ago
MoE isn't split by modality.
@BooleanDisorder 2 months ago
@npip99 Why wouldn't it be?
@Tunadorable 2 months ago
So the experts in a given layer are chosen based on each individual token's state right after it leaves the attention mechanism, meaning each token can get a different combination of experts. Modalities are handled differently in different models, but usually you've got tokens for the image, different tokens for the text, etc. The experts are encouraged to activate on each token with roughly equal probability. It'd be an interesting question to test whether certain experts tend to group toward the tokens of one modality or another, and whether a given expert performs similar/related tasks when it's used on a text token as when it's used on a vision token. That would be an interesting mech-interp study to read, but for the latter idea I'm not sure how you'd go about classifying task similarity.
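For context on the per-token routing described in this reply, here is a hedged PyTorch-style sketch of a top-k token-level router with a crude load-balancing term; the class name, shapes, and auxiliary loss are illustrative assumptions, not taken from the paper or the video.

```python
import torch
import torch.nn.functional as F

class TopKRouter(torch.nn.Module):
    """Minimal per-token top-k router (hypothetical names and shapes)."""

    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (batch, seq, d_model) -- token states coming out of attention.
        # Each token is routed independently, so tokens from different
        # modalities are free to land on different experts.
        probs = F.softmax(self.gate(x), dim=-1)        # (batch, seq, n_experts)
        topk_p, topk_idx = probs.topk(self.k, dim=-1)  # experts chosen per token
        # Crude load-balancing proxy: penalize deviation of the average
        # routing probability from the uniform 1/n_experts target, nudging
        # experts toward roughly equal usage across tokens.
        n_experts = probs.shape[-1]
        aux_loss = ((probs.mean(dim=(0, 1)) - 1.0 / n_experts) ** 2).sum()
        return topk_p, topk_idx, aux_loss
```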
@yorailevi6747 2 months ago
I am a little confused: you have the lines marked, but you're acting as if it's a live read and you'd never marked those lines to begin with... Are you using some automated tool to highlight key sentences?
@zacharykosove9048 1 month ago
He's probably thinking through what the paper means and explaining his thought process out loud.