This Physics Technology Trains Machine Learning 10X Faster

2,641 views

CompuFlair

A day ago

Comments: 42
@CompuFlair 2 days ago
Hello friends. Here is the link to the Jupyter notebook presented in this video (on Google Colab): colab.research.google.com/drive/1jHJ-LevD34f1Rgf5YsuYaGD8SzD8hwNJ?usp=sharing
@vidal9747 2 days ago
I am convinced that if computer scientists and physicists talked to each other, we would both achieve incredible things.
@CompuFlair a day ago
This is encouraging. Thanks for that!
@llbrunollllbll9347 a day ago
Could you look into Liquid Neural Networks? They also use analytic solutions to speed up training. There might be complementary ideas in the air that could make your work even more impactful. Very interesting! Keep pushing forward!
@CompuFlair a day ago
Thanks for the comment. I'll take a look.
@turun_ambartanen 3 days ago
Stochastic gradient descent is used to train neural networks because it scales surprisingly well with the number of parameters/weights. For example, the higher the number of parameters, the fewer local minima and flat spots are present in the loss landscape. This was surprising to me, but it just pops out of the math in higher dimensions.

Also, neural networks are trained with activation functions whose derivatives can be calculated easily at any point. So getting the gradient, which for some mathematical functions can be quite costly, is actually dirt cheap for neural networks. The time complexity of the computation is the most important thing nowadays.

PS: I could not reproduce the results with the code shown on screen. If you have a repo with the Python code, that would be neat. Pro tip: seed the numpy RNG to make the code deterministic.
@CompuFlair 3 days ago
Yes, for sure, gradient descent is a great method that we can see in action today in LLMs, and no doubt it scales well. However, perturbation theory also scales very well. Its main advantage over gradient descent, I believe, is that it comes pre-computed: the whole point of numeric computation is to replace an impossible analytic calculation, and perturbation theory can reduce computation by providing an analytic solution with only a few unknown parts. Regarding the seed, I intentionally didn't set it, so that the speed advantage is tested under random conditions and we can be sure it's not accidental.
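To make this kind of speed comparison reproducible while still testing it fairly, one can seed the RNG (and vary the seed across runs) and time both routes. Here is a minimal sketch, using my own toy least-squares setup rather than the notebook's code, comparing a closed-form normal-equations solve against plain full-batch gradient descent:

```python
import time
import numpy as np

# Hypothetical toy problem (not the notebook's code): y = X @ beta_true + noise.
rng = np.random.default_rng(0)  # seeding makes the run reproducible
n, d = 100_000, 20
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form ("pre-computed") solve via the normal equations.
t0 = time.perf_counter()
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
t_closed = time.perf_counter() - t0

# Plain full-batch gradient descent on the same squared loss, for comparison.
t0 = time.perf_counter()
beta_gd = np.zeros(d)
lr = 1.0 / n  # step size scaled to the Hessian X.T @ X, whose eigenvalues are ~n here
for _ in range(200):
    grad = X.T @ (X @ beta_gd - y)  # gradient of 0.5 * ||X @ b - y||^2
    beta_gd -= lr * grad
t_gd = time.perf_counter() - t0

err_closed = np.linalg.norm(beta_closed - beta_true)
err_gd = np.linalg.norm(beta_gd - beta_true)
```

Re-running with several different seeds then checks that any timing gap is a property of the methods and not an artifact of one particular random draw.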
@2299momo 2 days ago
@@CompuFlair Any idea why he couldn't reproduce your results? I don't think your response should have glossed over that point. Share a set seed so the OP can check for an identical result.
@CompuFlair 2 days ago
@@2299momo Thanks for the reminder. I just pinned my comment with the link to my notebook on Google Colab. Please check that out.
@mircorichter1375 2 days ago
I think the most convincing argument for another training method is not speed but the possibility of escaping non-optimal local minima, saddle points, and the like. Any method that converges better to global optima is interesting. However, there is more: the optima that backpropagation finds are often "robust" (the term is not strictly defined yet, but it roughly means "stable under small deformations of the weights"). This is important, so any other method must have that property too.
@CompuFlair a day ago
Thanks for the comment. Totally agree that "any other method must have that property too."
@hjups 3 days ago
I think you are making a false assumption that this method can practically scale to arbitrary distributions. It seems feasible for simple linear regression with a Gaussian distribution, but what about non-Gaussian distributions, as in classification problems?

If I understand correctly, your recipe depends on 1) computing the partition function through perturbation approximations, and 2) solving the system of equations for the unknown parameters. How do you intend to compute the partition function for a transformer with 8 billion free parameters? Or, even in the simpler classification case, for a ResNet model with 50 million free parameters? Sure, you can in /theory/ do it, just like a Fourier series can theoretically approximate any compact or repeating function, but in practice it would be equivalent to solving an infinite series. And as for solving the system: this will either become far too expensive (where traditional SGD is faster), or you will need to approximate the solution using something like SGD.

My guess is that the issue comes down to the dataset itself, where you considered a very small problem. Even consider something as simple as MNIST: each data point is 784-dimensional, and you have 60k of them. That is not a computationally feasible system to solve directly.

That said, if you have some thoughts on how to deal with these more general cases, perhaps you should focus on the relatively simple classification problems of MNIST and then CIFAR10. Both can be solved with simple MLPs, but you would have to show that you can produce equivalent or better classification accuracy while also training faster than with SGD.
@CompuFlair 3 days ago
Thanks for the comment and the suggestions; they are mostly on my to-do list. Classification would be much easier to handle. For example, see this video where I derive the partition function of logistic regression: kzbin.info/www/bejne/fl7clYWkiL-Vr9k. Also, I have already shown the mathematical equivalence of neural nets and the Ising model in physics, where the non-Gaussian interactions are fully investigated; see the video here: kzbin.info/www/bejne/imecp5WDoNGSq6s

In general, I am not worried about the large dimensions or large datasets of neural nets, because perturbation theory was developed to handle infinite dimensions, and where it is used in practice, 1 billion events (spreadsheet rows) are created every second and the experiment goes on for months. So one can imagine how large a dataset it handles. Check this link out, for example: home.cern/science/computing/storage#:~:text=Up%20to%20about%201%20billion,out%20all%20of%20these%20events.
@ChaseFreedomMusician 3 days ago
This seems fine for basic predictions, but what about very large datasets with thousands of columns and millions or billions of rows? Or models that use state or sequences, like LLMs or time-series prediction? How would we apply this insight?
@CompuFlair 3 days ago
To know for sure, we have to give it a try. However, larger models and more complicated datasets have complicated parameter landscapes, which challenge the performance of minimization methods even more. So my guess is that this method outperforms the minimization approach by an even larger margin there.
@ChaseFreedomMusician 3 days ago
@@CompuFlair I get what you're saying, and sure, trying things out is always a good approach. But I think there's something fundamental here that we shouldn't overlook: non-linear relationships. Covariance matrices are great for capturing linear dependencies, but real-world data is rarely that straightforward, especially with large datasets or sequences.

Think about time series data or models like LLMs. The relationships between inputs and outputs in those cases aren't just simple "if X goes up, Y goes up" patterns. There's context, feedback loops, and interactions across time or space that linear approaches just can't capture.

Take text prediction as an example. A covariance-based method might pick up that "The" is often followed by "cat," but it'll completely miss that the verb in "The cat that chased the mouse" has to agree with "cat," not "mouse." That's a hierarchical dependency, and it's non-linear by nature. DNNs, especially with things like attention mechanisms, shine here because they don't just find pairwise relationships. They learn layers of abstraction: simple patterns in the first layer, combinations of patterns in the next, and so on. That's what lets them handle things like long-range dependencies in text or complex non-linear trends in time series.

I'm not saying covariance matrices can't be useful. They're fast, interpretable, and great for simpler problems. But when the data is messy, interactive, and non-linear, you're going to hit a wall where the relationships you need to model just don't fit into a linear framework. That's where DNNs pull ahead, even if they're computationally heavier.

So yeah, I'm all for trying it out. But I think the performance gap you're suggesting would actually get worse, not better, as the dataset gets more complex.
@CompuFlair 3 days ago
@@ChaseFreedomMusician Thanks for the comment. Non-linear interactions are the whole point of this video. Perturbation theory, with 300 years of history behind it, was developed precisely because of the non-linear interactions you rightly emphasized. I think performance gets much better in the presence of non-linear interactions, because we can still find analytic approximations using perturbation theory, and an analytic solution usually means a significant reduction in computation.
@ChaseFreedomMusician 3 days ago
@@CompuFlair Thanks for explaining. Perturbation theory is definitely powerful, and I can see how it might help approximate non-linear interactions. But I'm struggling to see how it connects to the example you gave. The method you described, using

cov = np.cov(data.T)
cov_inv = np.linalg.inv(cov)
beta = -cov_inv[:-1, -1] / cov_inv[-1, -1]

seems to focus purely on linear relationships. The covariance matrix captures how variables linearly co-vary, and its inverse isolates direct linear dependencies. Without explicitly adding higher-order terms or interaction features (like x^2 or x * y), it doesn't seem like this approach would naturally handle non-linear relationships. Am I missing a step where those non-linear interactions are introduced?

Also, on the computational side, inverting a large covariance matrix becomes challenging as the dataset scales. For thousands of columns or millions of rows, this step could become a bottleneck. In contrast, while DNNs are computationally heavy, they scale well across distributed systems and handle non-linearity without requiring feature engineering.

How would you incorporate perturbation theory into the method you showed to model non-linear relationships? Or are you suggesting a way to add those higher-order terms directly into the covariance matrix? I'm curious how this would look in practice.
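For what it's worth, the precision-matrix route can be checked against ordinary least squares on synthetic data. This is my own sketch, not the video's notebook; it uses the standard identity beta = -Lambda[:-1, -1] / Lambda[-1, -1], where Lambda is the inverse covariance (precision) matrix of the stacked [features, target] data (the exact indexing may differ from the snippet quoted above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: the last column of `data` is the target y = X @ b_true + noise.
n, d = 5000, 3
X = rng.normal(size=(n, d))
b_true = np.array([2.0, -1.0, 0.5])
y = X @ b_true + 0.1 * rng.normal(size=n)
data = np.column_stack([X, y])

# Precision-matrix route: for jointly Gaussian (x, y), the regression
# coefficients of y on x are beta = -Lambda[:-1, -1] / Lambda[-1, -1].
cov = np.cov(data.T)          # np.cov centers the data internally
cov_inv = np.linalg.inv(cov)
beta_prec = -cov_inv[:-1, -1] / cov_inv[-1, -1]

# Ordinary least squares on centered data, for comparison.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
```

By the block-inversion formula for the precision matrix, the two routes agree exactly (up to floating-point error), which is what makes this a purely linear estimator.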
@CompuFlair 3 days ago
@@ChaseFreedomMusician Yes, the example I presented is linear, but that is the essence of perturbation theory: solve the linear part first, then add corrections one by one. And yes, most of the time the probability (the F in the exp) explicitly contains x^3 and higher terms. These are the corrections perturbation theory was invented to handle approximately. So, in their presence, we have to calculate corrections to my code and add them; the covariance matrix alone won't be enough. The form of the corrections depends on the model; for linear regression, they are zero.

Regarding high dimensions and inverting a matrix in such large spaces, I'm not that worried, because in field theory, where we use perturbation theory, dimensions are not merely high, they are infinite, and we have techniques to find the inverse in infinite dimensions.

Regarding how we add corrections: we first ignore them and solve the linear version, which is what I did. Then we add the largest correction, the x^3 terms, and update the previous answer. Then we add the next-largest correction, the x^4 terms, and update again, and this loop continues until we are satisfied with the model's accuracy. This whole loop has systematic mathematical machinery that I am going to cover in future videos.
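As a concrete (and admittedly toy, one-dimensional) illustration of the "solve the Gaussian part first, then correct order by order" recipe described above, consider a weight exp(-F) with F = x^2/2 + lam*x^4. This example is mine, not from the video: the zeroth-order Gaussian moment is <x^2>_0 = 1, and the first-order correction in lam is -12*lam, which can be checked against direct numerical integration:

```python
import numpy as np

# Toy weight exp(-F) with F = x^2/2 + lam * x^4; lam is the small
# perturbation parameter.
lam = 0.005
x = np.linspace(-8.0, 8.0, 200_001)
w = np.exp(-x**2 / 2 - lam * x**4)

# "Exact" moment <x^2> by numerical integration (uniform grid, dx cancels).
m2_numeric = np.sum(x**2 * w) / np.sum(w)

# Zeroth order: solve only the quadratic (Gaussian) part, <x^2>_0 = 1.
m2_order0 = 1.0

# First order: <x^2> ~ <x^2>_0 - lam * (<x^6>_0 - <x^2>_0 * <x^4>_0),
# with Gaussian moments <x^4>_0 = 3 and <x^6>_0 = 15, giving 1 - 12*lam.
m2_order1 = 1.0 - 12.0 * lam
```

Each additional order in lam tightens the estimate further, which is the loop the comment describes: solve the linear part, then fold in the x^3, x^4, ... corrections one at a time.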
@volpir4672 3 days ago
you rock, this is great!!!!
@CompuFlair 3 days ago
Thanks! Glad you liked it.
@TheRayhaller 3 days ago
Great video, I really appreciate the python notebook example!
@CompuFlair 3 days ago
You're very welcome!
@drdca8263 3 days ago
3:40: [removed a remark saying essentially "I've heard the problem is more often the gradient being flat, rather than the gradient being zero with a positive second derivative"]

3:51: oh, never mind, you addressed it being flat as well.

3:54: computing the gradient is costly? This surprises me. Well, I suppose computing the loss for the entire dataset is computationally costly, but for the loss of a single datapoint, my impression was that it is roughly as expensive as 2x the forward pass? You run the forward pass, computing the gradient of each neuron with respect to its inputs and its parameters, evaluated at its current inputs and parameters, storing these for each layer as a pair of matrices, right? And for sigmoid or tanh, the nonlinearity's contribution to the gradient is readily computed from the activation, I thought?

23:25: hmm... one thing I'm a bit unclear on is how we determine, in general, how the J variables should fit into the F variable. Should it always be a J_i w_i term for each parameter w_i of the model? Or J_i x_i for each input x_i of the model? I saw you wrote something about combining the J with x in the derivation you showed...
@CompuFlair 3 days ago
"Computing the gradient is costly?" Well, we are comparing: costly with respect to what? We need to find first and second derivatives to find a minimum, and that is extra work.

"One thing I'm a bit unclear on is how we determine in general how the J variables should fit into the F variable." It must couple to the variable we are summing over, so that when we take a derivative with respect to J, that variable falls down from the argument of the exp and returns an expectation value.
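A small numerical illustration of this source-term trick (my own example, not from the video): attach a source J to the summed-over variable, form Z(J) = <exp(J*x)>, and differentiate log Z at J = 0 to pull the expectation value out of the exponent:

```python
import numpy as np

# Samples standing in for the summed-over variable x, with mean ~2.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=200_000)

def log_Z(J):
    # Generating function with source term J coupled to x.
    return np.log(np.mean(np.exp(J * x)))

# d log Z / dJ at J = 0 equals <x>: differentiating w.r.t. J pulls x
# down from the exponent. Approximate with a central finite difference.
h = 1e-4
mean_from_source = (log_Z(h) - log_Z(-h)) / (2 * h)
```

Higher derivatives of log Z at J = 0 give the higher cumulants the same way, which is why the books introduce J in the first place.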
@drdca8263 3 days ago
@ Ah, so whatever variables we want to take moments of. Makes sense, thanks! (Err… where by “moments” I also mean things like the expectation of e.g. (x_1^2 times x_3) , even if that might not technically be a “moment”(?)) I will need to think more about how that all works in the case of hidden layers.
@minecraftermad 2 days ago
How's the memory usage? AFAIK that's one of the main bottlenecks in larger ML problems.
@CompuFlair a day ago
That is a great question. Honestly, I haven't checked, but will keep an eye on it next time.
@NLPprompter 3 days ago
Hello sir, I'm wondering: how does the computational complexity of this perturbation-theory approach scale with increasing dataset size and model complexity (e.g., deeper neural networks)? Are there limitations where gradient descent becomes more efficient?
@CompuFlair 3 days ago
This is a great question to be explored. I don't have the answer, though. Just know that perturbation theory in field theory handles infinite-dimensional systems, and the datasets it is applied to are usually very large. But that is in physics; in ML we just need to explore it.
@NLPprompter 3 days ago
@CompuFlair Ah, I see... thank you for your kind reply. Such wonders... this always keeps me wanting to learn more. Again, thanks.
@CompuFlair 3 days ago
@@NLPprompter you are welcome
@bojanbernard180 2 days ago
You hint at Feynman diagrams as an example of perturbation theory. That works because you expand in a power series of the fine-structure constant (~1/137); it works worst for the strong force, etc. What kind of real-world problems could be formulated in a way that assures fast convergence? Otherwise there is little or no gain compared to SGD.
@CompuFlair 2 days ago
This is a great comment, and you have raised a critical point. We have the maximum-entropy principle in information theory (which applies to ML) and the second law of thermodynamics (which applies to physical systems), and both guarantee the existence of an equilibrium state. ML works only after the system has reached this state (otherwise the probability distribution would keep evolving after we collect data, and that data couldn't predict future events). In this equilibrium state, perturbations are small by definition. I have covered this in two of the earlier videos in this playlist.
@eloitorrents2439 3 days ago
Is this approach written up somewhere?
@CompuFlair 2 days ago
In physics, yes: any book on statistical field theory or quantum field theory. Applied to ML, I guess not. The trick is to convert the ML model to P = e^-F/Z, which is mathematically what those books start with, and then just follow what those books prescribe.
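As a sketch of that conversion for the simplest case (my own example, not from the video): linear regression with Gaussian noise already has the P = e^-F/Z form, with F the sum of squared residuals over 2*sigma^2 and Z the Gaussian normalization constant:

```python
import numpy as np

# For y = X @ beta + eps, eps ~ N(0, sigma^2), the data likelihood is
#   P(y | X, beta) = exp(-F) / Z,
#   F = ||y - X @ beta||^2 / (2 * sigma^2),   Z = (2*pi*sigma^2)^(n/2).
rng = np.random.default_rng(0)
n, d, sigma = 1000, 4, 0.5
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + sigma * rng.normal(size=n)

# Field-theory-style form: log P = -F - log Z.
F = np.sum((y - X @ beta) ** 2) / (2 * sigma**2)
log_Z = 0.5 * n * np.log(2 * np.pi * sigma**2)
log_P = -F - log_Z

# Cross-check against the sum of per-point Gaussian log-densities.
resid = y - X @ beta
log_P_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))
```

Here F is purely quadratic, so Z is exactly computable; non-Gaussian models add higher-order terms to F, which is where the perturbative expansion of Z comes in.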
@volpir4672 3 days ago
Are you on Twitter or Discord? Your work here is really good; it would be nice to discuss.
@CompuFlair 3 days ago
Not active there, but I can be reached here or on LinkedIn.