Deep Networks Are Kernel Machines (Paper Explained)

58,931 views

Yannic Kilcher


#deeplearning #kernels #neuralnetworks
Full Title: Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
Deep neural networks are often said to discover useful representations of the data. This paper challenges that prevailing view and suggests that, rather than representing the data, deep neural networks store superpositions of the training data in their weights and act as kernel machines at inference time. This is a theoretical paper with a main theorem and an accessible proof, and the result leads to many interesting implications for the field.
OUTLINE:
0:00 - Intro & Outline
4:50 - What is a Kernel Machine?
10:25 - Kernel Machines vs Gradient Descent
12:40 - Tangent Kernels
22:45 - Path Kernels
25:00 - Main Theorem
28:50 - Proof of the Main Theorem
39:10 - Implications & My Comments
Paper: arxiv.org/abs/2012.00152
Street Talk about Kernels: • Kernels!
ERRATA: I simplify a bit too much when I pit kernel methods against gradient descent. Of course, you can even learn kernel machines using GD, they're not mutually exclusive. And it's also not true that you "don't need a model" in kernel machines, as it usually still contains learned parameters.
Abstract:
Deep learning's successes are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features like other learning methods. We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel. This improved understanding should lead to better learning algorithms.
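In symbols, the main result discussed in the video can be summarized as follows (a paraphrase in the paper's notation, not a quote from the abstract):

    y(x) = \sum_i a_i \, K^{p}(x, x_i) + b,
    \qquad K^{p}(x, x_i) = \int \nabla_w f(x, w(t)) \cdot \nabla_w f(x_i, w(t)) \, dt,

where the integral runs along the path w(t) taken by gradient descent (in the continuous-time limit), b is the output of the model at initialization, and the a_i are path-averaged loss derivatives. As the paper's Remark 1 notes, the a_i themselves depend on x, which is why the equivalence is only to an "approximate" kernel machine in the classical sense.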
Author: Pedro Domingos
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 171
@YannicKilcher 3 years ago
ERRATA: I simplify a bit too much when I pit kernel methods against gradient descent. Of course, you can even learn kernel machines using GD, they're not mutually exclusive. And it's also not true that you "don't need a model" in kernel machines, as it usually still contains learned parameters.
@chriszhou4283 3 years ago
Another arXiv forever.
@23kl104 3 years ago
yes, would fully appreciate more theoretical papers. Keep up the videos man, they are gold
@Mws-uu6kc 3 years ago
I love how simply you explained such a complicated paper. Thanks!
@ish9862 3 years ago
Your way of explaining these difficult concepts in a simple manner is amazing. Thank you so much for your content.
@tpflowspecialist 3 years ago
Amazing generalization of the concept of a kernel in learning algorithms to neural networks! Thanks for breaking it down for us.
@florianhonicke5448 3 years ago
Thank you so much for all of your videos. I just found some time to finish the last one and here is the next video in the pipeline. The impact you have on the AI community is immense! Just think about how many people started in this field because of your videos, not to mention the multiplier effect of them educating their friends.
@YannicKilcher 3 years ago
Thanks a lot :)
@dalia.rodriguez 3 years ago
"A different way of looking at a problem can give rise to new and better algorithms because we understand the problem better" ❤
@andreassyren329 3 years ago
I think this paper is wonderful in terms of explaining the Tangent Kernel, and I'm delighted to see them showing that there is a kernel _for the complete model_, such that the model can be interpreted as a kernel machine with some kernel (the path kernel). It ties the whole Neural Tangent Kernel stuff together rather neatly. I particularly liked your explanation of training in relationship to the Tangent Kernel, Yannic. Nice 👍. I do think their conclusion, that this suggests that ANNs don't learn by feature discovery, is not supported enough. What I'm seeing here is that, while the path kernel _given_ the trajectory can describe the full model as a kernel machine, the trajectory it took to get it _depends on the evolution_ of the Tangent Kernel. So the Tangent Kernel changing along the trajectory essentially captures the idea of ANNs learning features, that they then use to train in future steps. The outcome of K_t+1 depends on K_t, which represents some similarity between data points. But the outcomes of the similarities given by K_t were informed by K_t-1. To me that looks a lot like learning features that drive future learning. With a kind of _prior_ imposed by the architecture, through the initial Tangent Kernel K_0. In short. Feature discovery may not be necessary to _represent_ a trained neural network. But it might very well be needed to _find_ that representation (or find the trajectory that got you there). In line with the fact that representability != learnability.
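To make the tangent kernel in this discussion concrete, here is a minimal sketch (not from the paper or the video; the toy model, parameter names and shapes are assumptions) of K_g(x, x') = ∇_w f(x, w) · ∇_w f(x', w) at a single point along training, written with JAX:

    import jax
    import jax.numpy as jnp

    def f(w, x):
        # toy two-layer scalar-output network (an assumption for illustration)
        return jnp.tanh(x @ w["W1"]) @ w["W2"]

    def tangent_kernel(w, x1, x2):
        # K_g(x1, x2) = <grad_w f(x1, w), grad_w f(x2, w)>
        g1 = jax.grad(f)(w, x1)   # gradients w.r.t. the parameters
        g2 = jax.grad(f)(w, x2)
        return sum(jnp.vdot(a, b) for a, b in
                   zip(jax.tree_util.tree_leaves(g1), jax.tree_util.tree_leaves(g2)))

    key1, key2 = jax.random.split(jax.random.PRNGKey(0))
    w = {"W1": 0.1 * jax.random.normal(key1, (3, 16)),
         "W2": 0.1 * jax.random.normal(key2, (16,))}
    print(tangent_kernel(w, jnp.ones(3), jnp.arange(3.0)))

Watching how this value changes from one training step to the next is the "evolution of the tangent kernel" the comment describes; accumulating it along the whole trajectory gives the path kernel.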
@nasrulislam1968 3 years ago
Oh man! You did a great service to all of us! Thank you! Hope to see more coming!
@minhlong1920 3 years ago
I'm working on NTK and I came across this video. Truly amazing explanation, it really clears things up for me!
@andrewm4894 3 years ago
Love this, Yannic does all the heavy lifting for me, but I still learn stuff. Great channel.
@al8-.W 3 years ago
Thanks for the great content, delivered on a regular basis. I am just getting started as a machine learning engineer and I usually find your curated papers interesting. I am indeed interested in more theoretical papers like this one, so you are welcome to share them. It would be a shame if the greatest mysteries of deep learning remained concealed just because the fundamental papers are not shared enough!
@dermitdembrot3091 3 years ago
Agree, good that Yannic isn't too shy to look into theory
@NethraSambamoorthi 3 years ago
@yannic kilcher - Your brilliance and simplicity are amazing.
@user-xs9ey2rd5h 3 years ago
I really liked this paper, puts neural networks in a completely different perspective.
@111dimka111 3 years ago
Thanks Yannic again for a very interesting review. I'll also give my five cents on this paper. I'll start with some criticism. The specific point of the paper's proof is to divide and multiply by the path kernel (almost at the end of the proof). This makes the coefficients a_i a function of the input, a_i(x), which as noted in Remark 1 is very different from a typical kernel formulation. This difference is not something minor, and I'll explain why. When you say that some model is a kernel machine and that it belongs to some corresponding RKHS defined via a kernel k(x_1, x_2), we can start to explore that RKHS, see what its properties are (mainly its eigen-decomposition), and from them deduce various model behaviours (its expressiveness and tendency to overfit). Yet the above division/multiplication step allows us to express a NN as a kernel machine of any kernel. Take some other, irrelevant kernel (not the path kernel) and use it similarly, and you will obtain the result that the NN is now a kernel machine of this irrelevant kernel. Hence, if we allow a_i to be x-dependent, then we can say that any sum of training terms is a kernel machine of an arbitrary kernel. Not a very strong statement, in my opinion. Now for the good parts: the paper's idea is very clear and simple, pushing the overall research domain more in the right direction of understanding the theory behind DL. Also, the form that a_i takes (the derivative weighted by the kernel and then normalized by the same kernel) may provide some benefits in future work (not sure). But mainly, as someone who worked on these ideas a lot during my PhD, I think papers like this one, which explain DL via tangent/path kernels and their evolution during the learning process, will eventually give us the entire picture of why and how NNs perform so well. Please review more papers like this :)
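For readers following that step of the proof: after the divide-and-multiply, the coefficients come out as (roughly, paraphrasing the paper's notation)

    a_i(x) = - \frac{\int L'(y_i^*, y_i(t)) \, K_g(x, x_i, t) \, dt}{\int K_g(x, x_i, t) \, dt},

with K_g(x, x_i, t) = \nabla_w f(x, w(t)) \cdot \nabla_w f(x_i, w(t)) the tangent kernel at time t, i.e. a tangent-kernel-weighted average of the loss derivative along the training path. Because the weights K_g(x, x_i, t) depend on x, so do the a_i, which is exactly the x-dependence the comment above objects to.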
@wojciechkulma7748 3 years ago
Great overview, many thanks!
@IoannisNousias 3 years ago
Thank you for your service sir!
@michelealessandrobucci614 3 years ago
Check this paper: Input similarity from the neural network perspective. It's exactly the same idea (but older)
@shawkielamir9935 2 years ago
Thanks a lot, this is a great explanation and I find it very useful. Great job !
@Sal-imm 3 years ago
Very good, pretty much a straightforward linear deduction.
@scottmiller2591 3 years ago
I think what the paper is saying is "neural networks are equivalent to kernel machines, if you confine yourself to using the path kernel." No connection to Mercer or RKHS, so even the theoretical applicability is only to the path kernel - no other kernels allowed, unless they prove that path kernels are universal kernels, which sounds complicated. I'm also not sanguine about their statement about breaking the bottleneck of kernel machines - I'd like to see a big O for their inference method and compare it to modern low O kernel machines. Big picture, however, I think this agrees with what most kernel carpenters have always felt intuitively.
@damienhenaux8359 3 years ago
I very much like this kind of video on mathematical papers. I would very much like a video like this one on Stéphane Mallat's paper Group Invariant Scattering (2012). And thank you very much for everything.
@master1588 3 years ago
This follows the author's hypothesis in "The Master Algorithm" that all machine learning algorithms (e.g. NN, Bayes, SVM, rule-based, genetic, ...) approximate a deeper, hidden algo. A Grand Unified Algorithm for Machine Learning.
@master1588 3 years ago
For example: lamp.cse.fau.edu/~lkoester2015/Master-Algorithm/
@herp_derpingson 3 years ago
@@master1588 The plot thickens :O
@adamrak7560 3 years ago
Or they are similar in a way because they are all universal, similar to universal Turing machines: they can each simulate each other. The underlying algorithms may be the original proof that NNs are universal approximators.
@paulcarra8275 3 years ago
About your comment in the video that the theorem applies only to the full-GD case: in fact it can be extended to SGD as well. You only need to add an indicator (in the sum of gradients over the training data) at each step to pick out the points that are sampled at that step (this is explained by the author in the video linked below). Regards
@arthdh5222 3 years ago
Hey Great video, what do you use for annotating on the pdf, also which software do you use for it? Thanks!
@YannicKilcher 3 years ago
OneNote
@morkovija 3 years ago
"I hope you can see the connection.." - bold of you to hope for that
@LouisChiaki 3 years ago
Very excited to see some real math and machine learning theory here!
@kaikyscot6968 3 years ago
Thank you so much for your efforts
@YannicKilcher 3 years ago
It's my pleasure
@kimchi_taco 3 years ago
* A NeuralNet trained with gradient descent is a special version of a kernel machine, of the form sum_i(a_i * K(x, x_i)) + b.
* It means a NeuralNet works well in the same way an SVM works well. The NeuralNet is even better because it doesn't need to compute the kernel (O(data*data)) explicitly.
* K(x, x_i) is the similarity score between the new prediction y of x and the training prediction y_i of x_i.
* The math is cool. I feel this derivation will be useful later.
@mrpocock 3 years ago
I can't help thinking of attention mechanisms as neural networks that rate timepoints as support vectors, with enforced sparsity through the unity constraint.
@syedhasany1809 3 years ago
One day I will understand this comment and what a glorious day it will be!
@OmanshuThapliyal 3 years ago
Very well explained. The paper itself is written so well that I could read it as a researcher from outside CS.
@pranitak 3 years ago
Hello. 👋😂
@OmanshuThapliyal 3 years ago
@@pranitak 😅
@mlearnxyz 3 years ago
Great news. We are back to learning Gram matrices.
@kazz811 3 years ago
Unlikely that this perspective is used for anything.
@sinaasadiyan 9 months ago
Great Explanation👍
@hoaxuan7074 3 years ago
The dot product is an associative memory if you meet certain mathematical requirements, especially relating to the variance equation for linear combinations of random variables. The more things it learns, the greater the angles between the input vectors and the weight vector. If it only learns one association, the angle should actually be zero and the dot product will provide strong error correction.
@abdessamad31649 3 years ago
I love your content, from Morocco!!! Keep it going.
@JTMoustache 3 years ago
Ooh baby ! That was a good one ☝🏼
@marvlnurban5102 2 years ago
The paper reminds me of a paper by Maria Schuld comparing quantum ML models with kernels. Instead of dubbing quantum ML models as quantum neural networks, she demonstrates that quantum models are mathematically closer to kernels. Her argument is that the dot product of the hilbert space in which you embed your (quantum) data implies the construction of a kernel method. As far as I understand the method you use to encode your classical bits into your qbits is effectively your kernel function. Now it seems like kernels connect deep neural networks to "quantum models" by encoding the superposition of the training data points..? - 2021 Schuld Quantum machine learning models are kernel methods - 2020 Schuld Quantum embedding for Machine learning
@MrDREAMSTRING 3 years ago
So basically an NN trained with gradient descent is equivalent to a function that computes the kernel operations across all the training data (and across the entire training path!), and obviously the NN runs so much more efficiently. That's pretty good, and a very interesting insight!
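As a rough illustration of the "across the entire training path" part, here is a hedged sketch (the toy model, names and shapes are assumptions, and the paper's continuous-time integral is replaced by a plain sum over full-batch gradient-descent steps) that accumulates the path kernel between one test point and every training point:

    import jax
    import jax.numpy as jnp

    def f(w, x):
        # toy scalar-output model (assumption for illustration)
        return jnp.tanh(x @ w["W1"]) @ w["W2"]

    def loss(w, X, Y):
        preds = jax.vmap(lambda x: f(w, x))(X)
        return 0.5 * jnp.mean((preds - Y) ** 2)

    def pytree_dot(g1, g2):
        return sum(jnp.vdot(a, b) for a, b in
                   zip(jax.tree_util.tree_leaves(g1), jax.tree_util.tree_leaves(g2)))

    def path_kernel(w, X, Y, x_test, lr=0.1, steps=100):
        # sum_t K_g(x_test, x_i) over the discrete gradient-descent trajectory
        acc = jnp.zeros(X.shape[0])
        for _ in range(steps):
            g_test = jax.grad(f)(w, x_test)
            acc = acc + jnp.array([pytree_dot(g_test, jax.grad(f)(w, x)) for x in X])
            w = jax.tree_util.tree_map(lambda p, g: p - lr * g, w, jax.grad(loss)(w, X, Y))
        return acc   # approximates K^p(x_test, x_i) up to the step-size factor

    key = jax.random.PRNGKey(0)
    w0 = {"W1": 0.1 * jax.random.normal(key, (3, 16)),
          "W2": 0.1 * jax.random.normal(key, (16,))}
    X, Y = jax.random.normal(key, (8, 3)), jnp.ones(8)
    print(path_kernel(w0, X, Y, jnp.ones(3)))

It also makes the efficiency point visible: evaluating the kernel-machine form needs the whole trajectory and every training example, while the trained network only needs its final weights.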
@pranavsreedhar1402 3 years ago
Thank you!
@rnoro 2 years ago
I think this paper is a translation of the NN formalism into a functional-analysis formalism. Simply speaking, the gradient-descent-on-a-loss-function framework is equivalent to a linear matching problem on a Hilbert space determined by the NN structure. The linearization process is characterized by the gradient scheme. In other words, the loss function on the sample space becomes a linear functional on the Hilbert space. That is all the paper is about, nothing more.
@twobob 1 year ago
Thanks. That was helpful
@Sal-imm 3 years ago
Now changing the weights in hypothetical sense means impaction reaction.
@herp_derpingson 3 years ago
24:30 I wonder if this path tracing thingy works not only for neural networks but also for t-SNE. Imagine a bunch of points in the beginning of t-SNE. We have labels for all points except for one. During the t-SNE optimization, all points move. The class of the unknown point is equal to the class of the point to which its average distance was the least during the optimization process.
41:57 I think it means we can retire kernel machines because deep networks are already doing that.
No broader impact statement? ;)
Regardless, perhaps one day we can have a proof like: "Since kernel machines cannot solve this problem and neural networks are kernel machines, it implies that there cannot exist any deep neural network that can solve this problem." Which might be useful.
@YannicKilcher 3 years ago
very nice ideas. Yes, I believe the statement is valid for anything trained with gradient descent, and probably with a bit of modifications you could even extend that to any sort of ODE-driven optimization algorithm.
@ashishvishwakarma5362 3 years ago
Thanks for the explanation. Could you please also attach the annotated paper link in the description of every video? It would be a great help.
@YannicKilcher 3 years ago
The paper itself is already linked. If you want the annotations that I draw, you'll have to become a supporter on Patreon or SubscribeStar :)
@jinlaizhang312 3 years ago
Can you explain the AAAI best paper 'Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting' ?
@YannicKilcher 3 years ago
thanks for the reference
@fmdj 2 years ago
Damn that was inspiring. I almost got the full demonstration :)
@moudar981 3 years ago
thank you for the very nice explanation. What I did not get is the dL/dy. So, L could be 0.5 (y_i - y_i*)^2. Its derivative is (y_i - y_i*). Does that mean that if a training sample x_i is correctly classified (the aforementioned term is zero), then it has no contribution to the formula? Isn't that counter intuitive? Thank you so much.
@YannicKilcher 3 years ago
a bit. but keep in mind that you integrate out these dL/dy over training. so even if it's correct at the end, it will pull the new sample (if it's similar) into the direction of the same label.
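Worked out for the squared error used in the question (a paraphrase of the paper's construction, not a quote): with L = 1/2 (y_i - y_i^*)^2 the weighting term is L' = y_i - y_i^*, and it enters the coefficient through an integral over the whole training path,

    a_i(x) \propto - \int \big(y_i(t) - y_i^*\big) \, K_g(x, x_i, t) \, dt,

so a training point that is fit correctly at the end contributes nothing at that final instant, but its earlier nonzero residuals along the path still contribute; that is the "integrate out over training" in the reply above.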
@LouisChiaki 3 years ago
Why is the gradient vector of y w.r.t. w not the tangent vector of the training trajectory of w in the plot? Shouldn't the update of w by gradient descent always be proportional to the gradient vector?
@YannicKilcher 3 years ago
true, but the two data points predicted are not the only ones. the curve follows the average gradient
@dhanushka5 7 months ago
Thanks
@andres_pq 3 years ago
Is there any code demonstration?
@socratic-programmer 3 years ago
Makes you wonder what effect something like skip connections or different network topologies has on this interpretation, or even something like the Transformer with the attention layer. Maybe that attention allows the network to more easily match against similar things and rapidly delegate tokens to functions that have already learnt to 'solve' that type of token?
@YannicKilcher 3 years ago
entirely possible
@kyuucampanello8446 1 year ago
Dot-product similarity with softmax is kind of 'equivalent' to a distance. So I guess it's a bit like when we calculate the velocity gradient of a particle with the SPH method, by applying a kernel function to the distances between neighbouring particles and multiplying by their velocities to build a velocity function of the distances. In the attention mechanism, it might be a function of the tokens' values corresponding to the similarities.
@joshuaclancy8093 3 years ago
So representations are constructed via aggregating input data... gasp!* But still, it's an interesting way of getting there. Correct me if I am wrong, but here is my overly simplified summary: inputs that have similar output change with respect to weight change are grouped together as training progresses, and that means we can approximate neural networks with kernel machines.
@anubhabghosh2028 2 years ago
At around 19 min mark of the video, the way you describe the similarity between kernels and gradient descent leads me to believe that what this paper is claiming is that neural networks don't really "generalize" on the test data but rather compares similarities with samples it has already seen in the training data. This can be perhaps a very bold claim by the authors or probably I am misunderstanding the problem. What do you think? P.S. Really appreciate this kind of slightly theoretical paper review on deep learning in addition to your other content as well.
@YannicKilcher 2 years ago
Isn't that exactly what generalization is? You compare a new sample to things you've seen during training?
@anubhabghosh2028 2 years ago
@@YannicKilcher Yes, intuitively I think so too. Like in the case of neural networks, we train a network, learn some weights and then use the trained weights to get predictions for unseen data. I think my confusion is about the explicit way they define this in the case of these kernels, where they compare the sensitivity of the new data vector with that of every single data point in the training set 😅.
@faidon-stelioskoutsourelak464 2 years ago
1) Despite the title, the paper never makes use in any derivation of the particulars of a NN and its functional form. Hence the result is not just applicable to NNs but to any differentiable model, e.g. a linear regression. 2) What is most puzzling to me is the path dependence. I.e., if you run your loss gradient descent twice from two different starting points which nevertheless converge to the same optimum, the path kernels, i.e. the impact of each training data point on the predictions, would in general be different. The integrand in the expression of the path kernels, though, involves dot products of gradients. I suspect that in the initial phases of training, i.e. when the model has not yet fit the data, these gradients would change quite rapidly (and quite randomly) and most probably these dot products would be zero or cancel out. Probably only close to the optimum will these gradients and dot products stabilize and contribute the most to the path integral. This behavior should be even more pronounced in stochastic gradient descent (intuitively). The higher the dimension of the unknown parameters, the more probable it is that these dot products are zero, even close to the optimum, unless there is some actual underlying structure that is discoverable by the model.
@dawidlaszuk 3 years ago
It isn't surprising that all function approximators, including NNs and kernel methods, are equivalent since they all... can approximate functions. However, the nice thing here is showing explicitly the connection between kernel methods and NNs, which allows easier knowledge transfer between the methods' domains.
@joelwillis2043 3 years ago
All numbers are equivalent since they all... are numbers.
@Guztav1337 3 years ago
@@joelwillis2043 No. 4 ⋦ 5
@joelwillis2043 3 years ago
@@Guztav1337 Now if only we had a concept of an equivalence relation we could formally use it on other objects instead of saying "equivalent" without much thought.
@WellPotential 3 years ago
Great explanation of this paper! Any thoughts why they didn't include dropout? Seems like if they added a dropout term in the dynamical equation that the result wouldn't reduce down to a kernel machine anymore.
@YannicKilcher 3 years ago
maybe to keep it really simple without stochasticity
@woowooNeedsFaith 3 years ago
3:10 - How about giving a link to this conversation in the description box?
@Pheenoh 3 years ago
kzbin.info/www/bejne/r5C1m6Z6fdFoj5Y
@lirothen 3 years ago
I work with the Linux kernel, so I too was VERY lost when he's referring to a kernel function in this different context. I was just about to ask too.
@YannicKilcher 3 years ago
I've added it to the description. Thanks for the suggestion.
@DamianReloaded 3 years ago
Learning the most general features and learning to generalize well ought to be the same thing.
@Sal-imm 3 years ago
Mathematically limit definition of a function (for e.g.) and comes out of a new conclusion, that might be heuristic.
@willwombell3045 3 years ago
Wow someone found a way to rewrite "Neural Networks are Universal Function Approximators" again.
@neoli8289 3 years ago
Exactly!!!
@gamerx1133 3 years ago
@chris k Yes
@Lee-vs5ez 3 years ago
😂
@olivierphilip1612 3 years ago
@chris k Any continuous function to be precise
@drdca8263 3 years ago
@chris k tl;dr: If by "all functions" you mean "all L^p functions" (or "all locally L^p functions"?) for some p in [1,infinity), then yes. (But technically, this isn't *all* functions from (the domain) to the real numbers, for which the question seems not fully defined, because in that case, what do we mean by "approximated by"?) needlessly long version: I was going to say "no, because what about the indicator function for a set which isn't measurable", thinking we would use the supremum norm for that (actual supremum, not supremum-except-neglecting-measure-zero-zets), in order to make talking about the convergence well-defined, but then I realized/remembered that under that criteria, you can't even approximate a simple step function using continuous functions (the space of bounded continuous functions is complete under the supremum norm), and therefore using the supremum norm can't be what you meant. Therefore, the actual issue here is that "all functions" is either too vague, or, if you really mean *all* functions from the domain to the real numbers, then this isn't compatible with the norm we are presumably using when talking about the approximation. If we mean "all L^p functions" for some p in [1,infinity) , then yes, because the continuous functions (or the continuous functions with compact support) are dense in L^p (at least assuming some conditions on the domain of these functions which will basically always be satisfied in the context we are talking about) . Original version of this comment: deleted without ever sending because I realized it had things wrong about it and was even longer than the "needlessly long version", which is a rewrite of it, which takes into account from the beginning things I realized only at the end of the original. I'm slightly trying to get better about taking the time to make my comments shorter, rather than leaving all the broken thought process on the way to the conclusion in the comment.
@veloenoir1507 3 years ago
If you can make a connection with the Kernel Path and a resource-efficient, general architecture this could be quite meaningful, no?
@hecao634 3 years ago
Hey Yannic, could you plz gently zoom in or zoom out in the following videos? I really felt dizzy sometimes especially when you derive formulas
@YannicKilcher 3 years ago
sorry, will try
@JI77469 3 years ago
At 39:30 "... The ai's and b depend on x..." So how do we know that our y actually lives in the RKHS, since it's not a linear combination of kernel functions!? If we don't, then don't you lose the entire theory of RKHS?
@YannicKilcher 3 years ago
true, I guess that's for the theoreticians to figure out :D
@drdca8263 3 years ago
I didn't know what RKHS stood for. For any other readers of this comment section who also don't : It stands for Reproducing kernel Hilbert space. Also, thanks, I didn't know about these and it seems interesting.
@G12GilbertProduction 3 years ago
12:57 Hypersurface with a extensive tensor line? That's so looks like Fresnelian.
@YannicKilcher 3 years ago
sorry that's too high for me :D
@fast_harmonic_psychedelic 3 years ago
The same thing can be said about the brain itself: neurons just store a superposition of the training data (the input from the senses when we were infants, weighed against the evolutionary "weights" stored in the DNA), and in everyday experience, whenever we see any object or motion, our neurons immediately compare that input to the weights they store in a complex language of dozens of neurotransmitters and calcium ions and try to find out what it best matches up with. The brain is a kernel machine. That doesn't depreciate its power, neither the brain's nor the neural networks'. They memorize; that doesn't mean they're not intelligent. Intelligence is not magic, it IS essentially just memorizing input.
@albertwang5974 3 years ago
A paper: one plus one is a merge of one and one!
@hoaxuan7074 3 years ago
With really small nets you can hope to more or less fully explore the solution space, say using evolutionary algorithms. There are many examples on YT. In small animals with a few hundred neurons you do see many specialized neurons with specific functions. In larger nets I don't think there is any training algorithm that can actually search the space of solutions to find any good solution. Just not possible ----- except there is a small sub-space of statistical solutions where each neuron responds to the general statistics of the neurons in the prior layer, each neuron being a filter of sorts. I'm not sure why I feel that sub-space is easier to search through? An advantage would be good generalization and avoiding many brittle over-fitted solutions that presumably exist in the full solution space. A disadvantage would be the failure to find short, compact, logical solutions that generalize well, should they exist.
@THEMithrandir09 3 years ago
I get that the new formalism is nice for future work, but isn't it intuitive that 'trainedmodel = initialmodel + gradients x learningrates'?
@YannicKilcher 3 years ago
true, but making the formal connection is sometimes pretty hard
@THEMithrandir09 3 years ago
@@YannicKilcher oh yes sure, this work is awesome, it reminds me of the REINFORCE paper. I just wondered why that intuition wasn't brought up explicitly. Maybe I missed it though, I didn't fully read the paper yet. Great video btw!
@eliasallegaert2582 3 years ago
This paper is from the author of "The Master Algorithm", where the big picture of multiple machine learning techniques is explored. Very interesting! Thanks Yannic for the great explanation!
@paulcurry8383 3 years ago
I don’t understand the claim that models don’t learn “new representations”. Do they mean that the model must use features of the training data (which I think is trivially true), or that the models store the training data without using any “features” and just the result of the points on the loss over GD? In the latter it seems that models can be seen as doing this, but it’s not well understood how they actually store a potentially infinitely sized Gram Matrix. I’m also tangentially interested in how SGD fits into this.
@YannicKilcher 3 years ago
yea I also don't agree with the paper's claim in this point. I think it's just a dual view. i.e. extracting useful representations is the same as storing an appropriate superposition of the training data.
@michaelwangCH 3 years ago
Deep learning is a further step in the evolution of ML. Kernel methods have been known since the early '60s. No surprise at all.
@michaelwangCH 3 years ago
Hi, Yannic. Thanks for the explanation, looking forward to your next video.
@mathematicalninja2756 3 years ago
Next paper: Extracting MNIST data from its trained model
@Daniel-ih4zh 3 years ago
Link?
@salimmiloudi4472 3 years ago
Isn't that what visualizing activation maps does?
@vaseline.555 3 years ago
Possibly Inversion attack? Deep leakage from gradients?
@kazz811 3 years ago
Nice review. Probably not a useful perspective though. SGD is critical obviously (ignoring variants like momentum, Adam which incorporate path history) but you could potentially extend this using the path integral formulation (popularized in Quantum mechanics though applies in many other places) by constructing it as an ensemble over paths for each mini-batch procedure, the loss function replacing the Lagrangian in Physics. The math won't be easy and it likely needs someone with higher level of skill than Pedro to figure that out.
@diegofcm6201 3 years ago
Thought the exact same thing. Even more like it when it states about “superposition of train data weighted by kernel path”. Reminds me a lot about wave functions
@diegofcm6201 3 years ago
It also looks like something from calculus of variations: 2 points (w0 and wf) connected by a curve that’s trying to optimize something
@guillaumewenzek4210 3 years ago
I'm not fond of the conclusion. The NN at inference doesn't have access to all the historical weights, and runs very differently from their kernel. For me, 'NN is a kernel' would imply that K only depends on the final weights. OTOH I've no issue if a_i is computed from all historical weights.
@YannicKilcher 3 years ago
You're correct, of course. This is not practical, but merely a theoretical connection.
@guillaumewenzek4210 3 years ago
Stupid metaphor: "Oil is a dinosaur". Yes there is a process that converts dinosaur into oil, yet they have very different properties. Can you transfer the properties/intuitions of a Gaussian kernel to this path kernel?
@YannicKilcher 3 years ago
Sure, both are a way to measure distances between data points
@vertonical 3 years ago
Yannic Kilcher is a kernel machine.
@veedrac 3 years ago
Are you aware of, or going to cover, Feature Learning in Infinite-Width Neural Networks?
@YannicKilcher 3 years ago
I'm aware of it, but I'm not an expert. Let's see if there's anything interesting there.
@frankd1156 3 years ago
wow...my head get hot a little bit lol
@drdca8263 3 years ago
The derivation of this seems nice, but, maybe this is just because I don't have any intuition for kernel machines, but I don't get the interpretation of this? I should emphasize that I haven't studied machine learning stuff in any actual depth, have only taken 1 class on it, so I don't know what I'm talking about If the point of kernel machines is to have a sum over i of (something that doesn't depend on x, only on i) * (something that depends on x and x_i) , then why should a sum over i of (something that depends on x and i) * (something that depends on x and x_i) be considered, all that similar? The way it is phrased in remark 2 seems to fit it better, and I don't know why they didn't just give that as the main way of expressing it? Maybe I'm being too literal. edit: Ok, upon thinking about it more, and reading more, I think I see more of the connection, maybe. the "kernel trick" involves mapping some set to some vector space, and then doing inner products there, and for that reason, taking the inner product that naturally appears here, and wanting to relate it to kernel stuff, seems reasonable. And so defining the K^p_{f,c}(x,x_i) , seems also probably reasonable. (Uh, does this still behave line an inner product? I think it does? Yeah, uh, if we send x to the function that takes in t and returns the gradient of f(x,w(t)) with respect to w, (where f is the thing where y = f(x,w) ), that is sending x to a vector space, specifically, vector valued functions on the interval of numbers t comes from, and if we define the inner product of two such vectors as being the integral over t of the inner product of the value at t, This will be bilinear, and symmetric, and anything with itself should have positive values whenever the vector isn't 0, so ok yes it should be an inner product. So, it does seem (to me, who, remember, I don't know what I'm talking about when it comes to machine learning) like this gives a sensible thing to potentially use as a kernel. How unfortunate then, that it doesn't satisfy the definition they gave at the start! Maybe there's some condition in which the K^g_{f,w(t)}(x,x_i) and the L'(y^*_i, y_i) can be shown to be approximately uncorrelated over time, so that the first term in this would approximately not depend on x, and so it would approximately conform to the definition they gave?
@YannicKilcher 3 years ago
your thoughts are very good, maybe you want to check out some of the literature of neural tangent kernel, because that's pretty much into this direction!
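Spelled out, the feature-map construction sketched in the comment above (a paraphrase, not taken from the paper's text): map each input to the curve of its parameter gradients along training,

    \Phi(x) = \big( t \mapsto \nabla_w f(x, w(t)) \big),
    \qquad K^{p}(x, x') = \langle \Phi(x), \Phi(x') \rangle = \int \nabla_w f(x, w(t)) \cdot \nabla_w f(x', w(t)) \, dt .

This inner product is symmetric and positive semi-definite, so the path kernel is a legitimate kernel; what breaks the classical kernel-machine definition is only that the coefficients a_i depend on x, as discussed in the comment.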
@jeremydy3340 3 years ago
Don't Kernel methods usually take the form of a weighted average of examples? sum( a_i * y_i * K ) . The method given here is quite different, and depends on the labels largely implicitly via how they change the path c(t) through weight space. It isn't clear to me that y( x' ) is at all similar to y_i, even if K(x', x_i) is large. And the implicit dependence through c(t) on all examples means (x_i, y_i) may be extremely important even if K(x', x_i) is small.
@YannicKilcher 3 years ago
it depends on the output, yes, but not on the actual labels. and yes, the kernel is complicated by construction, because it needs to connect gradient descent, so the learning path and the gradient are necessarily in there
@NeoKailthas 3 years ago
Now someone prove that humans are kernel machines
@nellatl 3 years ago
When you're broke with no money, the last thing you want to hear is "theoretically". Especially "theoretically" and "paper" in the same sentence.
@sergiomanuel2206 3 years ago
What happens if we do a training of just one step? It could be started from the last step of an existing training (this theorem doesn't require random weights at the start). In this case we don't need to store the whole training path 😎
@YannicKilcher 3 years ago
yes, but the kernel must still be constructed from all the gradient descent path, that includes the path to obtain the initialization in your case
@sergiomanuel2206 3 years ago
@@YannicKilcher First: wonderful videos, thank you!!! Second: correct me if I am wrong. The theorem doesn't tell us anything about the initialization weights. I am thinking about one-step training with w0 obtained from a previous training. If we do one step of gradient descent using the whole dataset, there is just one optimal path in the direction dw = -lr*dL/dw, and this training will lead us to w1. Using w0 and w1 we can build the kernel and evaluate it. I think it is correct because all the information about the training is in the last step (using the NN we can make predictions using just the last weights, w1).
@YannicKilcher 3 years ago
@@sergiomanuel2206 sounds reasonable. W0 also appears in the Kernel
@MachineLearningStreetTalk 3 years ago
Does this mean we need to cancel DNNs?
@herp_derpingson 3 years ago
...or Kernel Machines. Same thing.
@diegofcm6201 3 years ago
*are approximately
@YannicKilcher 3 years ago
Next up: Transformers are actually just linear regression.
@ooodragon94 3 years ago
1:31 you made me worse by letting me into these kind of topics... your explanation is brilliant..... plz keep me enlightened until I die....
@dontthinkastronaut 3 years ago
hi there
@mrjamess5659 3 years ago
I'm having multiple enlightenments while watching this video.
@raunaquepatra3966 3 years ago
Kernel is all you need 🤨
@shuangbiaogou437 3 years ago
I knew this 8 years ago, and I mathematically proved that a perceptron is just a dual form of a linear kernel machine. An MLP is just a linear kernel machine with its input being transformed.
@az8134 3 years ago
i thought we all know that ...
@paulcarra8275 3 years ago
Hi, first, the author gives a presentation here: kzbin.info/www/bejne/o2TFYaR7hq2fi9U&lc=UgwjZHYH9cRyuGmD6e14AaABAg. Second (repeating a comment made on that presentation), I was wondering why we should go through the whole learning procedure and not instead start at the penultimate step with the corresponding b and w's; wouldn't it save almost all the computational time? I mean, if the goal is not to learn the DNN but to get an additive representation of it (ignoring the non-linear transform "g" of the kernel machine). Regards
@erfantaghvaei3952 3 years ago
Pedro is cool guy, sad to see the hate on him for opposing the distorts surrounding datasets
@albertwang5974 3 years ago
I cannot understand why such kind of topic can be a paper.
@willkrummeck 3 years ago
why do you have parler?
@YannicKilcher 3 years ago
yea it's kinda pointless now isn't it :D
@marouanemaachou7875 3 years ago
First
@IoannisNousias 3 years ago
What an unfortunate choice of variable names. Every time I heard “ai is the average...” it threw me off. Too meta.
@fast_harmonic_psychedelic 3 years ago
I'm just afraid this will lead us BACK into kernel machines and programming everything by hand, resulting in much more robotic, calculator-esque models, not the AI models that we have. It's better to keep it in the black box. If you look inside, you'll jinx it and the magic will die, and we'll just have dumb calculator robots again.
@conduit242 3 years ago
“You just feed in the training data” blah blah, the great lie of deep learning. The reality is ‘encodings’ hide a great deal of sophistication, just like compression ensemble models. Let’s see a transformer take a raw binary sequence and match zpaq-5 at least on the Hutter Prize 🤷🏻‍♂️ choosing periodic encodings, stride models, etc are all the same. All these methods, including compressors, are compromised theoretically
@getowtofheyah3161 3 years ago
So boring who freakin’ cares
@SerBallister 3 years ago
Are you proud of your ignorance?
@getowtofheyah3161 3 years ago
@@SerBallister yes.
@sphereron 3 years ago
Dude, if you're reviewing Pedro Domingos's paper despite his reputation as a racist and sexist, why not use your platform to give awareness to Timnit Gebru's work on "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?". Otherwise your bias is very obvious here.