Sparsely-Gated Mixture-of-Experts Paper Review - 18 March, 2022

2,429 views

Numenta

Comments
@yarrowflower 2 years ago
Thank you so much for uploading this! I am champing at the bit to see some form of sparsity implemented in large neural nets running on GPUs or NPUs. (Especially in commercial settings!)
@hyunsunggo855 2 years ago
19:08 The order matters a bit. Doing kWTA afterwards (i.e. after the softmax) would result in the probabilities not summing to 1.0.
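[Editorial sketch, not from the talk: a minimal PyTorch illustration of this ordering point, with made-up gating dimensions. Keeping the top-k probabilities after the softmax leaves a total below 1.0, whereas the paper's ordering (top-k on the logits, then softmax over the survivors) yields a proper distribution.]

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8)  # hypothetical gating logits for 8 experts
k = 2

# Order A: softmax first, then keep the top-k probabilities.
probs = torch.softmax(logits, dim=-1)
kept = torch.topk(probs, k).values
print(kept.sum().item())  # < 1.0: the surviving probabilities no longer sum to 1

# Order B (the paper's ordering): keep the top-k logits, mask the rest
# to -inf, then softmax over the survivors.
idx = torch.topk(logits, k).indices
masked = torch.full_like(logits, float("-inf"))
masked[idx] = logits[idx]
gates = torch.softmax(masked, dim=-1)
print(gates.sum().item())  # 1.0 by construction
```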
@stardustsong1680 2 years ago
It's been quite a long time since I've seen an update from this channel.
@MattGorbet 2 years ago
At 58:20 to about 58:39 I was surprised everyone just moved on from the point analogizing cortical columns to experts. Maybe I misunderstood something, but couldn't the operation of the gating function - which is applied to huge batches of unknown, raw, generic data (think of it as all possible data in the world, from the point of view of the system) and decides which experts should handle it - be analogous to the actual physical connections in the brain that determine which 'experts', i.e. cortical columns, deal with or ignore specific inputs? So when you say "different cortical columns will process different parts of the visual field" (and similarly, I imagine input via touch is processed by different 'expert' cortical columns than input via sight)... is this biological 'filtering' not analogous in some way to the purpose of the gating function the authors are proposing? Every part of our brain does not process every single possible 'bit' of information from the world; in the brain's case the routing is done via biology and specific wiring, rather than an algorithm in software. But I think the parallel that was pretty much dismissed is still valid. Perhaps I've misunderstood - I'm still quite new to all this. Thanks so much for making these discussions public, it is really interesting.
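[Editorial sketch of the "gating function decides which experts handle which inputs" idea discussed above. This is an illustration only, not the authors' code: the layer sizes and names are made up, and it uses plain top-1 routing rather than the paper's noisy top-k gating.]

```python
import torch
import torch.nn as nn

d_model, n_experts = 16, 4
gate = nn.Linear(d_model, n_experts)  # gating network: scores each input per expert
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

x = torch.randn(32, d_model)             # a batch of generic inputs
scores = torch.softmax(gate(x), dim=-1)  # per-input expert probabilities
chosen = scores.argmax(dim=-1)           # routing decision: one expert per input

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    sel = chosen == e                    # inputs routed to expert e
    if sel.any():
        out[sel] = expert(x[sel]) * scores[sel][:, e:e + 1]  # scale by gate weight
```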
@subutaiahmad8208 2 years ago
It's a good question, and I can see why people might think we moved on too quickly there. The main reason we moved on quickly is that cortical columns embody a far more intricate and complex structure than the layer structure of the sparse-mixture paper. Each cortical column does process a different subset of the sensory input. Two visual cortical columns will process different subsets of the incoming image. More importantly, cortical columns have a diverse motif of recurrent connections, receive lots of feedback from other areas, and incorporate movement in complex ways. To get a sense of some aspects of this, you can look at our paper [1] below. The mapping-to-biology section in [1] details some of this with references to the neuroscience literature. Perhaps there are some really, really high-level connections between the two ideas, but to me it's so high-level that it quickly becomes meaningless. [1] A Theory of How Columns in the Neocortex Enable Learning the Structure of the World. www.frontiersin.org/articles/10.3389/fncir.2017.00081/full
@MattGorbet 2 years ago
@@subutaiahmad8208 Thanks for the reply. I will read this paper. My comment was indeed making a high-level, maybe even conceptual, link between filtering input for experts and the way the various inputs we perceive are 'pre-filtered' by our diverse senses. I can see this being meaningless in the context of optimizing the sparse-mixture algorithm... For me it was more the aha of realizing that not all input is created equal, and unlike AI systems we have multiple pre-filtered data streams automatically going to the 'right' experts via our varied sense organs. Thanks!