The problem with MoE isn't necessarily RAM, but memory bandwidth. You can think of it in terms of arithmetic intensity (FLOPs/byte). With batched inference, dense models have higher arithmetic intensity since the weights are shared across every token in the batch. With MoE you can't guarantee that behavior: in the worst case each token needs a different set of weights, i.e. minimal arithmetic intensity. And on modern hardware the bottleneck is memory bandwidth, not compute (which is why flash attention is effective). Merging segments like this is only going to exacerbate the bandwidth issue, especially if each token in the batch uses a different weighted combination of experts. One solution was proposed by the Switch Transformer (and is likely what GPT-4 does): the experts live on different GPUs and tokens are routed over NVLink rather than via data-dependent memory reads, but that also won't work for merging the FFN. This work is still interesting though, and applies more to summarization than generation (e.g. if you trained the experts without a fixed fallback dropout, maybe you could use a common FFN for generation, but use MoE for building the KV cache?).
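A back-of-envelope sketch of this arithmetic-intensity argument; the sizes, dtype, and the simple cost model here are all illustrative assumptions, not from the video:

```python
# Rough FLOPs/byte estimate for one FFN pass over a batch of tokens.
# Dense: every token reuses the same weights (experts_touched = 1).
# MoE worst case: each token reads a different expert (experts_touched = batch).

def ffn_intensity(batch, d_model, d_ff, experts_touched, dtype_bytes=2):
    """Arithmetic intensity (FLOPs/byte) under a toy cost model."""
    flops = 2 * batch * d_model * d_ff * 2                   # two matmuls, 2 FLOPs per MAC
    weight_bytes = experts_touched * 2 * d_model * d_ff * dtype_bytes
    act_bytes = batch * (2 * d_model + d_ff) * dtype_bytes   # input, output, hidden
    return flops / (weight_bytes + act_bytes)

dense = ffn_intensity(batch=64, d_model=4096, d_ff=16384, experts_touched=1)
moe_worst = ffn_intensity(batch=64, d_model=4096, d_ff=16384, experts_touched=64)
# dense comes out roughly 60x higher: the weights are amortized over the batch,
# while worst-case MoE stays near 1 FLOP per byte moved
```

Under this toy model the dense FFN's intensity grows with batch size while worst-case MoE does not, which is exactly the bandwidth problem described above.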
@metaprotium 3 months ago
I like your idea at 9:00, that would be useful for training large numbers of experts very sparsely
@Alice_Fumo 3 months ago
This is interesting. I also find the segment length of 256 surprising; it seems very long. A fixed segment length doesn't appear to be required by the approach, so it would make sense to have a maximum segment length and additionally split segments at semantic boundaries, e.g. each paragraph end, sentence end, or newline, with a minimum segment length of something like 16 tokens. As done in this paper, the segment boundaries don't care whether there was a semantic shift in the middle of the segment, which I'd assume is bad for expert specialization.
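A sketch of the variable-length segmenting being suggested here; the boundary rule, token ids, and the 16/256 limits are just illustrative:

```python
# Greedy variable-length segmenting: close a segment at the first semantic
# boundary token once min_len tokens have accumulated, or forcibly at max_len.

def split_segments(tokens, is_boundary, min_len=16, max_len=256):
    """Split a token stream into segments that prefer semantic boundaries."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        at_boundary = is_boundary(tok) and len(current) >= min_len
        if at_boundary or len(current) >= max_len:
            segments.append(current)
            current = []
    if current:                      # flush the trailing partial segment
        segments.append(current)
    return segments

# e.g. treat token id 0 as "end of sentence" (a made-up convention)
segs = split_segments([1] * 20 + [0] + [2] * 300, lambda t: t == 0)
# segments of length 21 (closed at the boundary), 256 (forced), and 44 (tail)
```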
@alfinal5787 3 months ago
Loving the more assertive tone.
@Tunadorable 3 months ago
i wish i knew what you meant but im pretty tone deaf
@alfinal5787 2 months ago
@@Tunadorable lol clever comeback
@farrael004 3 months ago
Where's my "oi" in the beginning of the video? 😢
@Tunadorable 3 months ago
haha sometimes i forget
@SinanAkkoyun 3 months ago
"lory" ~= "ooi"
@jeremywalsh5177 2 months ago
Please cover mixture of million experts
@Tunadorable 2 months ago
way ahead of u
@BooleanDisorder 2 months ago
MoE will also always end up with one modality per expert, even if you try (unlike now) to make the embeddings themselves truly multimodal. That's why you need a coherent single network without experts to get the most out of multimodality (true multimodal embeddings). For agentic behavior I think RAG is the way forward, where the model also learns to process certain info more or less heavily, as a kind of thought process. Plus a diffusion-type effect where it constructs the generation one part at a time and then gives a final generation when it's finished.
@Tunadorable 2 months ago
One of us must be confused, could you clarify? Experts are chosen by a simple linear-layer router at each token, without any separation by modality, and with an extra loss term that encourages roughly even activation. That means a given expert can and likely will be called some % of the time even on tokens from different modalities, since the experts have no architectural reason to separate by modality; they separate by tokens. Are you referring to some research I haven't heard of showing that experts naturally tend to specialize toward tokens of different modalities? If so I'd love to see it. But even then, I'm not sure how what you're saying would apply to a "multi-modal token", which I assume means a single vector that simultaneously encodes data from different modalities.
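For what it's worth, a minimal sketch of the routing setup described here, with random weights and a load-balancing term in the form the Switch Transformer paper uses (n_experts · Σ_e f_e · P_e); all sizes are made up:

```python
import numpy as np

# Token-level routing: one linear layer scores all experts per token.
rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 32, 16, 4
x = rng.normal(size=(n_tokens, d_model))          # token representations
W_router = rng.normal(size=(d_model, n_experts))  # the router IS this one matrix

logits = x @ W_router
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)             # softmax over experts
choice = probs.argmax(-1)                         # top-1 expert per token

# Load-balancing auxiliary loss: f_e = fraction of tokens routed to expert e,
# P_e = mean router probability for expert e; minimized at even usage.
f = np.bincount(choice, minlength=n_experts) / n_tokens
P = probs.mean(axis=0)
aux_loss = n_experts * float(f @ P)
```

Note there is nothing modality-aware anywhere: routing depends only on each token's vector, which is the point being made above.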
@BooleanDisorder 2 months ago
@@Tunadorable Sorry, I'll try to explain better. When I refer to “truly multimodal embeddings,” I'm thinking of a system where the representation of information from different modalities is deeply integrated, rather than simply concatenated or processed in parallel. While you're correct that experts are chosen without explicit separation by modality, in practice they often tend to specialize. This specialization can occur due to the inherent differences in the statistical properties of different modalities, even if not explicitly architected that way. This isn't something I know a paper for, but something I picked up somewhere; I can't prove it, but it has stuck in my mind. Make of that what you will. The router may indeed work at the token level, but the challenge lies in creating higher-level representations that genuinely blend information across modalities. Individual tokens, even if they contain information from multiple modalities, may not capture the complex interactions between modalities that we're after. So you'd want high-level representations to be “routed” rather than tokens. My point about a “coherent single network” is that it might be better suited to develop these truly integrated multimodal representations: without the separation into experts, the network might be forced to learn more generalized, cross-modal features. I believe RAG could be a promising approach for more flexible, context-aware processing. It could allow the system to dynamically adjust its focus on different types of information, similar to how humans shift attention between sensory inputs, and it would give the model a general way to learn new abilities in context. The idea of a diffusion-like process is that generation could be more iterative, refining the output by considering multiple modalities over several steps, rather than making a single pass through separated experts.
Maybe even have each iteration done by a separate but similar network trained for that step, rather than demanding a perfect output in one pass. One could stop at a certain network once a satisfactory answer has been reached. Maybe the network itself could be taught to judge that, with self-supervised autoregressed embeddings each iteration: like a grade per iteration, where it stops once it has reached “good enough”? Note that my idea of multiple networks is more akin to layers than experts, but where each "layer" here is its own network, albeit with somewhat different depths and so on. Edit: my inspiration for the several networks in parallel is the neocortical columns in the brain.
@Tunadorable 2 months ago
ah that makes more sense very cool
@mrpocock 3 months ago
Can we think of MoE models as being like dense networks trained with an extremely strong sparse/dropout process, but where this network ablation has very strong correlations rather than being sampled from white noise?
@Tunadorable 3 months ago
interesting hmmm. so the dropout process is dependent upon the input data (the correlations you mentioned) and because of that i don’t think it’d be possible to make any actual rigorous strictly mathematically definable connection between the two concepts. that being said the sparsity part of the connection is there. at some point (mid-late august?) i’ll be releasing a video on a paper called something like “a million experts” that i think you’d be interested in as it’s a bit closer to your description than regular MoE setups are
@mrpocock 3 months ago
@@Tunadorable Your observation about the dropout being data-dependent is valid. So we could probably formalise this as a function from some choosing layer L_c to a covariance matrix over per-weight dropout probabilities. That lets us get rid of the additive/sigmoid layer that re-integrates the individual agents entirely. There are probably tricks that can be played in how that covariance matrix is calculated from L_c that allow us to incrementally add new agents using "dead" weights from an otherwise dense layer or stack of layers. Sounds like a job for a first year PhD student to test out...
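A toy sketch of this framing, simplified to hard top-1 block gating rather than a full covariance matrix over per-weight dropout probabilities; all sizes and the gating rule are illustrative assumptions:

```python
import numpy as np

# An "MoE-as-structured-dropout" view: one dense weight matrix whose output
# columns are split into blocks, with a choosing layer L_c deciding which
# block survives for a given input. The mask is fully correlated within a
# block and input-dependent, not sampled from white noise.
rng = np.random.default_rng(1)
d_in, d_out, n_blocks = 8, 12, 4            # d_out split into 4 "expert" blocks
block_size = d_out // n_blocks
W_dense = rng.normal(size=(d_in, d_out))
L_c = rng.normal(size=(d_in, n_blocks))     # the choosing layer

def masked_forward(x):
    """Dense forward pass with a correlated, data-dependent dropout mask."""
    block = int((x @ L_c).argmax())         # hard top-1 choice of block
    mask = np.zeros(d_out)
    mask[block * block_size:(block + 1) * block_size] = 1.0
    return (x @ W_dense) * mask             # only one block's outputs survive

y = masked_forward(rng.normal(size=d_in))
# everything outside the chosen block is zeroed together, in lockstep
```

Replacing the hard argmax with probabilities drawn from a covariance over blocks would be one way to move toward the formalization sketched in the comment above.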