Show notes and transcript: www.dropbox.com/scl/fi/3lufge4upq5gy0ug75j4a/RANDALLSHOW.pdf?rlkey=nbemgpa0jhawt1e86rx7372e4&dl=0
@SapienSpace · a day ago
Fascinating talk, thank you for sharing! I could be completely wrong, but these ideas look a lot like an adaptive state classifier for an adaptive control system. The splines appear similar to linear fuzzy membership functions (fuzzy logic merges statistical math with semantic labels) with overlapping probabilistic states, and the centers of those membership functions can be adaptively adjusted to experience, as discussed in this video, using K-means clustering. Much like TRPO, PPO, GRPO, and the attention heads from the "Attention Is All You Need" Transformer paper, these all look like forms of adaptive state classification ("grokking") for an adaptive control system underlying the reasoning (and maybe even "consciousness") process. Combining state classification with RL likely produces a robust, capable system, especially as indicated in this video: "...after very, very long training." I need to explore this further.

Another fascination of mine is the use of the Greek letter theta throughout the deep learning literature. As always, I could be completely wrong (and crazy to think this), but I strongly suspect the origin is the angle of the pendulum in the 1983 IEEE "cart-pole" paper by Barto, Sutton, and Anderson, where they used reinforcement learning as an adaptive control system. There is an even simpler experiment using just a pole with no cart, in a 1997 master's thesis by an American student with a Chinese advisor, whose title begins "Reinforcement Learning: Experiments with State Classifiers...". This simple pole (a "useless machine") is described as the "simplest robot" in a fascinating YouTube lecture by Scott Kuindersma, one of Barto's students who went on to Boston Dynamics.

Although the speaker in this MLST talk seemed to steer people away from the older history of all this, I think that history is vital not only for understanding how this works but also for reconstructing it in a catastrophe, since history is a regular cycle of collapse and resurrection. It makes me understand how the Antikythera mechanism is so out of place in the archaeological record and why it was so challenging to reconstruct and understand its celestial purpose. Slowly, or maybe too quickly, we are beginning to understand the "magic" hat 🎩 of ☃️ "Frosty", though I hope the way this "magic hat" is made is not forever lost to history; it is vital to robustly record its history and core design! We live in amazing times. Again, fascinating talk and I really appreciate you sharing it, thank you!
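Here's a minimal sketch of the fuzzy-membership idea I'm gesturing at (purely hypothetical on my part, not anything from the talk): let K-means place the membership-function centers from "experience", then softly assign each new input to overlapping states.

```python
# Hypothetical sketch: fuzzy membership functions whose centers are adapted
# to experience with K-means (my own toy, not code from the talk or paper).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))               # "experience": observed states

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_               # adaptively placed centers

def memberships(x, centers, width=1.0):
    """Overlapping Gaussian memberships, normalized to sum to 1."""
    d2 = ((x - centers) ** 2).sum(axis=1)
    m = np.exp(-d2 / (2 * width ** 2))
    return m / m.sum()

print(memberships(np.array([0.3, -0.2]), centers))  # soft state assignment
```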
@oscbit · a day ago
I'm a simple man. I structure my interest around MLST uploads.
@CheapDeath96 · 23 hours ago
I'm a simple man. I don't need a microwave!
@luke.perkin.online · a day ago
Still watching, great video! Another popular paper from January was "Grokking at the Edge of Numerical Stability"... similar ideas, orthogonal gradient updates, and also a fix for softmax.
@Pingu_astrocat21 · a day ago
Was reading this paper and then this got uploaded🔥 thank you
@arowindahouse · a day ago
We need more Randalls
@makhalid1999 · 22 hours ago
His previous appearance, along with Yann LeCun, was one of the first episodes of MLST I watched :)
@EmileAI · a day ago
I've watched 3 episodes of MLST today haha. It's good to be a Patreon supporter, the content there is great as well.
@aitheignis · a day ago
I read the first paper in the references after I saw the video. It is such a good and eye-opening read, especially the visualization of the tessellation that arises from the overlapping hyperplanes partitioning the input space. Thank you for the great video and the interesting guest, as usual. I love the reference list you added in the video description. It is like a well-curated list of interesting papers I can just go to and find something worth reading in the sea of boring papers flooding the field right now due to its popularity.
@sravandanda2302 · 8 hours ago
Please upload more technical talks like these
@JudyMitchell-f6b · a day ago
What we are seeing here seems transferable to understanding high-functioning autism; it also seems related to the drastic neural pruning infants go through, and in general it shines a light on pedagogy and logic. I suppose this data should be of paramount importance to cognitive psychology and neurology across the board.
@redazzo · a day ago
Given the rate at which pedagogy moves, I'd expect this to be picked up 15 years from now, and applied in the next century ...😅
@RahulSam · 6 hours ago
French people talking about deep learning is my ASMR.
@yurona5155 · 20 hours ago
Aww, seems like we're slowly being deprived of the YOLO run thrill...;/ Great stuff, thank you both!
@jabowery · 10 hours ago
Science generally progresses from statistics to dynamics. Sometimes it takes centuries for this to get through to a field so it can finally "grok" its laws, but at least it got through to Newton eventually. How long did that take, counting from the kinematics of the ancient Greeks, who thought you had to keep pushing on something to make it move? I really think it's important not to get confused on this point when speaking of grokking. Yes, there is reason to believe some of what is going on here is just more parsimonious curve fitting, but that's one reason I often complain about people getting hung up on image processing, even though I cut my teeth on it back in 1991 with the first hardware-convolution Xilinx image processor doing image segmentation. Bear in mind that even with Transformers there is something akin to a "state space" that _may_ also be giving us something along the lines of a piecewise approximation of ODEs with feedback. Here's a thought experiment: what would have happened with the Tesla accident, where the car confused a truck with an overpass, if people had given half as much attention early on to video as to image processing?
@CYI3ERPUNK · a day ago
excellent analogy/metaphor
@franszdyb4507 · a day ago
10:11 I don't follow this claim. Clearly the partitions do change even far away from the data - when SGD adjusts some partitions close to the data, all the intersections with those partitions, no matter how far away, are affected. But why would they be affected in ways that extrapolate? What the NN looks like far away from the data is irrelevant to the training loss. There is no gradient information far from the data - all the changes that happen there are purely side-effects of minimizing training loss, i.e. getting a better fit close to the data.
@oncedidactic · a day ago
I think all he's trying to say is that, in comparison to K-means, there can be coupling between distant decision regions during the learning process, whereas in K-means you can rearrange cluster centroids all you want on the north side and it won't do anything to the south pole, because that algorithm isn't nested the way deep networks are. (I don't think he's talking about going off the data manifold; he's just talking about distant regions.)
@oncedidactic · a day ago
Oh sorry, I was mistaken above, responding to the immediately preceding statements. But it's a related idea about nested high-dimensional space. If your (affine) spline adjustments are tuned to the data manifold, the geometric extension of those adjustments will be like kaleidoscope replicas. Assuming the test data is somewhat similarly structured, it should largely conform, i.e. nested splines generalize well under the assumption that natural data exhibit tendencies. Whereas, for the reason I state above, K-means just can't do this by comparison. Think of the infinite polygon boundaries at the edge of a Voronoi decomposition. But yes, this is a claim about the behavior of NNs and maybe not a perfect intuition.
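A toy way to see the contrast I'm describing (my own sketch, not from the talk or paper): nudging one first-layer weight of a deep ReLU net changes the region partition even far from where the data lives, while moving one K-means centroid only re-labels points near it.

```python
# Hypothetical toy: region coupling in a ReLU net vs. locality in K-means.
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer ReLU net on 2-D inputs.
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)

def region_code(X, W1, b1, W2, b2):
    """Activation pattern = which linear region each point falls in."""
    h1 = np.maximum(X @ W1.T + b1, 0)
    h2 = np.maximum(h1 @ W2.T + b2, 0)
    return np.hstack([(h1 > 0), (h2 > 0)]).astype(int)

# Grid of points far from the origin (think: far from the training data).
far = rng.uniform(20, 30, size=(2000, 2))

before = region_code(far, W1, b1, W2, b2)
W1_nudged = W1.copy()
W1_nudged[0] += 0.2                          # "SGD step" on one first-layer row
after = region_code(far, W1_nudged, b1, W2, b2)

changed = np.any(before != after, axis=1).mean()
print(f"fraction of far-away points whose region changed: {changed:.2f}")
# K-means comparison: moving a centroid near the origin never changes the
# assignment of points that stay closest to some other, untouched centroid.
```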
@franszdyb4507 · 19 hours ago
@oncedidactic What do you mean by "kaleidoscope replicas" and "natural data exhibit tendencies"?
@oncedidactic · 14 hours ago
@franszdyb4507 Kaleidoscope: the image of an earlier layer's spline is split and projected into subsequent layers, very much like light splitting and bending in a prism, and this happens many times. The preceding decision boundaries are preserved and replicated, but also mutated. Thus any structure in the natural data manifold, which is embedded in the spline approximation, is extended to regions off the manifold. What I mean by "exhibits tendencies" is a very glib way of saying that natural data usually have structure that is repeated and recognizable, i.e. can largely be generalized.
@franszdyb4507 · 13 hours ago
@@oncedidactic Say the NN is fitting a periodic function. To do so, regions are placed closely together wherever there's data, and the function is nonlinear - here the function is nonlinear over the whole domain. Right outside the outermost region with a data point, this region differs only by a single neuron. Further out, the region next to that one differs by two neurons. Eventually, no matter how big the NN is, the number of neurons or "folds" of the input domain have gone down to zero, and the NN is linear, for all input sufficiently far from the data. But the function we're fitting is nonlinear everywhere. So I can see how we can have nonlinear behavior slightly outside the data. I can see how, if test data always lies close to a manifold that we've managed to fit, even slightly outside the manifold we have a good fit. But I don't see how we can extrapolate more than a short distance away from the data.
@TBOBrightonandHove · a day ago
Top stuff! Immediate takeaway: more computation can take you a long way toward improving the resilience of what is learned from a given dataset. Hmmm, Nvidia...
@oncedidactic · a day ago
Elastic origami…. in high dimensions
@munchingorange234 · a day ago
Lol
@tommys4809 · a day ago
Linear is the new nonlinear
@oncedidactic · 14 hours ago
😆
@azharalibhutto1209 · a day ago
Great 👍
@burnytech · a day ago
❤
@Interstellar00.00 · a day ago
Ask AGI ai to ai report Violation
@drhxa · a day ago
First
@ElieSanhDucos0 · a day ago
Second, but French too
@isajoha9962 · a day ago
Feels like the regions need relations (a nexus?) to understand connections for reasoning over patterns/symbols/translations. Clouseau vibe. 🔎 Cool, deep video. Reminds me of the XO drum VST plugin. Part 2: this might be very promising for overall consistent movie-sequence generation. 🤔 Overlaying a reasoning mechanism on top of a loop where the index runs from Context'First to Context'Last is kind of a challenge.