This is a game changer! (AlphaTensor by DeepMind explained)

  Рет қаралды 182,827

Yannic Kilcher

Yannic Kilcher

Күн бұрын

Пікірлер: 260
@YannicKilcher
@YannicKilcher 2 жыл бұрын
OUTLINE: 0:00 - Intro 1:50 - Sponsor: Assembly AI (link in description) 3:25 - What even is Matrix Multiplication? 6:10 - A very astounding fact 8:45 - Trading multiplications for additions 12:35 - Matrix Multiplication as a Tensor 17:30 - Tensor Decompositions 20:30 - A formal way of finding multiplication algorithms 31:00 - How to formulate this as a game? 39:30 - A brief primer on AlphaZero / MCTS 45:40 - The Results 48:15 - Optimizing for different hardware 52:40 - Expanding fundamental math 53:45 - Summary & Final Comments
@CosmiaNebula
@CosmiaNebula 2 жыл бұрын
TLDR: ## The problem Given two matrices A, B of size mxn, nxl, find a way to compute AB, such that we do as few scalar multiplications as possible (add and subtract are free). Reformulate the problem to make it amenable to AI: given the 3D "matrix multiplication tensor" T of size (mn)x(nl)x(ml), find a decomposition into T = ∑i ui⊗vi⊗wi, where each ui, vi, wi is a vector with entries from {-1, 0, 1}. If this decomposition has r terms, then we have a corresponding way to compute AB using only r scalar multiplications. ### TensorGame Now create the 1-player TensorGame: given a 3D tensor T, ask it to make moves of the form ui⊗vi⊗wi, where each ui, vi, wi is a vector with entries from {-1, 0, 1}. The state of the game after the move is (T - ui⊗vi⊗wi). The game ends either when the game state is 0, or if the player has played K rounds and thus "ran out of time" (where K is an upper bound to the rank of T). The reward at each step is (-1), to encourage the agent to play as fast as possible. If the agent runs out of time, then the reward for the last step is (-r), where r is an upper bound to the game state in the end. ### Agent The agent is a RL agent "AlphaTensor". It, like AlphaZero, uses a deep neural network to guide a Monte Carlo tree search (MCTS) planning procedure. It is trained on randomly generated 3D tensors constructed by adding randomly generated ∑i ui⊗vi⊗wi. TensorGames are easy to make like that. ## Results After some training, the resulting agent is able to play the TensorGame very well. When applied to matrix multiplication tensors, it found some ways to do better than previous human-found techniques. For example, when A, B, are both 2x2, the previous human-found technique is Strassen algorithm, taking 7 multiplications. The agent replicated it. When A, B are both 4x4, the previous best is 49 multiplications (recursive Strassen algorithm). The agent found a 47 multiplication method. By recursively applying the Strassen algorithm, one can compute (nxn) times (nxn) matrix multiplication in O(n^{log_2(7)}) = O(n^{2.8074}) time. I went through their table of results, and found that the best exponent they have found is the "47 scalar multiplications for 4x4 matrix multiplication". This one achieves asymptotic complexity of O(n^{log_4(47)}) = O(n^{2.777}). Consider what they achieved: It took humans 10 years to go from O(n^{2.8074}) in Strassen (1969) to O(n^{2.780}) in Bini, Capovani, Romani (1979). and they managed to do it by an entirely general RL search. I have no doubt that if they can find a way to cast the Coppersmith-Winograd algorithm as a game, the RL will out-perform it as well. I scoured their papers and couldn't find out how much time it took to train the agent, other than somewhere around 1 million steps?. By Google standards, I guess that means it took them a month. So, 1 month vs 10 years. BUT WAIT THERE"S MORE They also found the optimal algorithm for multiplying a nxn skew-symmetric matrix and a vector. Previous SOTA (2018), it was known to take at most n^2 multiplications, (see "Fast structured matrix computations: tensor rank and Cohn-Umans method"). They found a method taking only 0.5(n-1)(n+2) multiplications. This is obviously the asymptotically optimal number (since a nxn skew-symmetric matrix has ~ 0.5 n^2 different entries). BUT WAIT THERE'S MORE They ran the same algorithm but this time scoring its strategy based not on how many scalar multiplications it takes, but how much time it takes to run on a particular hardware (GPU or TPU). It found improvements as well, and these improvements are specific to GPU or TPU (using the optimized algorithm for GPU on TPU would yield a much lower improvement, and vice versa). > AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for.
@tlf4354
@tlf4354 2 жыл бұрын
@@CosmiaNebula Thanks for this TL;DR~
@tlf4354
@tlf4354 2 жыл бұрын
@@CosmiaNebula TL;DR 🖥 🧠 🔢 ⏫ 😱
@pouet4608
@pouet4608 2 жыл бұрын
@@CosmiaNebula but are the found algorithms proven afterwards? Because finding an algorithm is half the work. Proving it works, not only by some results but conceptually is the rest. I Don t understand how they can use discovered algorithms if they cannot proof it is true.
@CosmiaNebula
@CosmiaNebula 2 жыл бұрын
@@pouet4608 verifying the algorithms is trivial algebra
@ztangent3188
@ztangent3188 2 жыл бұрын
Actually when the fast matrix multiplication algorithm is applied recursively (to big matrix), not just the multiplication number reduces, so is the addition. This is because each sub-matrix multiplication requires super-linear O(m^(2+eps)) operations, while sub-matrix addition still takes O(m^2). So we are not trading multiplications for faster additions - both operations decrease.
@Neptutron
@Neptutron 2 жыл бұрын
Yes - this!! Please upvote the OP's comment! It should be emphasized that these improvements are not just constant factors - they improve the asymptotic complexity of matrix multiplication for certain matrix sizes in a practical way!
@nonamehere9658
@nonamehere9658 2 жыл бұрын
+1, Yannic should've at least mentioned Strassen's recursive matrix multiplications algorithm.
@dr.mikeybee
@dr.mikeybee 2 жыл бұрын
@@nonamehere9658 Yannick shows it in the paper.
@CosmiaNebula
@CosmiaNebula 2 жыл бұрын
For example, when A, B, are both 2x2, the previous human-found technique is Strassen algorithm, taking 7 multiplications. The agent replicated it. When A, B are both 4x4, the previous best is 49 multiplications (recursive Strassen algorithm). The agent found a 47 multiplication method. By recursively applying the Strassen algorithm, one can compute (nxn) times (nxn) matrix multiplication in O(n^{log_2(7)}) = O(n^{2.8074}) time. I went through their table of results, and found that the best exponent they have found is the "47 scalar multiplications for 4x4 matrix multiplication". This one achieves asymptotic complexity of O(n^{log_4(47)}) = O(n^{2.777}). Consider what they achieved: It took humans 10 years to go from O(n^{2.8074}) in Strassen (1969) to O(n^{2.780}) in Bini, Capovani, Romani (1979). and they managed to do it by an entirely general RL search. I have no doubt that if they can find a way to cast the Coppersmith-Winograd algorithm as a game, the RL will out-perform it as well.
@blackpepper2610
@blackpepper2610 2 жыл бұрын
no
@DRealHammer
@DRealHammer 2 жыл бұрын
-2 and 2 make sense as you can use a bitshift for the multiplication, so it is basically free and gives a bigger space of possible solutions.
@DRealHammer
@DRealHammer 2 жыл бұрын
@@h..h can you reformulate the question?
@AlexanderShamov
@AlexanderShamov 2 жыл бұрын
​@@h..h Why not? Just add 1 to the exponent, it's still much faster than general floating point multiplication.
@vladimirtchuiev2218
@vladimirtchuiev2218 2 жыл бұрын
I guess so is -4, 4, -8, 8 or any n with 2^n and -(2^n). But I guess the capacity to search in this space is more limited.
@DRealHammer
@DRealHammer 2 жыл бұрын
@@vladimirtchuiev2218 I think so too! However, my intuition also is, that the benefits of these high factors are not as great as adding the factors 2/-2.
@antopolskiy
@antopolskiy 2 жыл бұрын
thank you, Yannic. I love these paper explaining videos. it is really hard for me to focus on reading papers. You explain the main points very clearly in a fraction of time it would take me to read it.
@LarryPanozzo
@LarryPanozzo 2 жыл бұрын
We literally don’t have time to read every great paper these days. Summaries like these do save the day.
@apothum
@apothum 2 жыл бұрын
Production quality on this video was fantastic. For instance, the simple things like the blue background and circle are nice touches.
@cmilkau
@cmilkau 2 жыл бұрын
The reason that addition is "cheap" is not so much CPU's, these can do number multiplication in a single cycle, too. It's because the entries are usually matrices themselves (instead of numbers) and adding matrices is just adding their entries. Full story: Its when you use these results to multiply huge matrices. You do this using "block matrix multiplication". For instance, to multiply two 1000x1000 matrices, you can reinterprete these as 2x2 matrices, but each of the eight entries are 500x500 matrices ("blocks") themselves. Obviously, if you use a 2x2 multiplication formula, each multiplication in that formula actually corresponds to multiplying two 500x500 matrices (which you can do recursively applying your formula). The naive formula takes n×n×n multiplications to multiply two n×n matrices, but only n×n additions to add two such matrices. As you can see, this is a huge difference for large n. This difference stays huge even for "clever" versions of multiplication.
@cmilkau
@cmilkau 2 жыл бұрын
@@leeroyjenkins0 Ok, maybe, but I don't like emphasis put on that because it really isn't as relevant. The matrices that are actually relevant in state-of-the-art ANN have n in the thousands to tens of thousands. A direct Tensor decomposition would be in the order of millions to billions actions. So, the block multiplication case is the real relevant case. Of course, if you want to do actual graphics, 4x4 is your most common size and directly improving that can vastly improve the characteristics of your graphics chip or driver.
@cmilkau
@cmilkau 2 жыл бұрын
"A matrix is a square box of numbers" - AlphaTensor can also multiply rectangular matrices, and the paper gives an example. Of course 2x2 and maybe 4x4 matrices are of specific interest because they can be easily generalized and implemented in algorithms for large matrices.
@adityay525125
@adityay525125 2 жыл бұрын
Thank you Yannic, I missed these older style paper explainers
@BjarkeEbert
@BjarkeEbert 2 жыл бұрын
One thing I have wondered: we use multiplications because it's mathematically convenient. But what if we use some cheaper operations on the layer inputs, and "just learn" to parametrize them correctly. Who cares if it's exact multiplication - what matters is that the NN gets the job done. The learning could maybe work around the "incorrect but fast" multiplication
@JurekOK
@JurekOK 2 жыл бұрын
"Analogue computers" do that -- they exploit the non-linearity of silicon diode in the 0..1V region of conduction. As long as you have some nonlinear operation, you can have a neural network. Multiplication happens to be a convenient one for **digital** computers.
@Fordance100
@Fordance100 2 жыл бұрын
No, you have to go through nn to estimate it which would is composed a bunch of matrix multiplications, end up doing much more work than just doing the matrix multiplication straight.
@appletree6741
@appletree6741 2 жыл бұрын
You would need a “incorrect but fast” backprop as well, it’s not clear to me that backprop still works if you break multiplication
@mujtabaalam5907
@mujtabaalam5907 2 жыл бұрын
27:40 I think there's a mistake: the modification comes from the 5th column, not the 7th, because we're looking at the tensor where the top row of u is non-zero
@mgostIH
@mgostIH 2 жыл бұрын
I think it's also important besides the domain they applied it on (matrix multiplication algorithms), since it shows that reinforcement learning via deep learning is strong enough to explore extremely large discrete spaces to optimize. I can expect that in the future most NP-Hard problems we'll want to tackle will be attacked by these approaches rather than heuristics.
@Jiftoo
@Jiftoo 2 жыл бұрын
real
@AufBerghofNAM
@AufBerghofNAM 2 жыл бұрын
nphard has incomplete m-sets, but solutions are provable in post
@rxphi5382
@rxphi5382 2 жыл бұрын
As a new CS student this kind of content is very interesting, even tough I don't understand everything. It motivates me to keep learning. Thank you Yannic!
@w花b
@w花b 2 жыл бұрын
@Greg LeJacques w-what..?
@Anders01
@Anders01 2 жыл бұрын
That's a great result. I have been thinking of machine learning training other machine learning algorithms, but to improve other algorithms is also an interesting approach which now is shown to be practically possible.
@andrewdunbar828
@andrewdunbar828 2 жыл бұрын
Hey thanks for explaining this in a way I can almost understand. English tip for you in appreciation: "long-winded" is not about winding it's about the wind blowing. Dunno why.
@minimath5882
@minimath5882 2 жыл бұрын
Karatsuba's algorithm trades additions with multiplication with numbers back in 1963 and now DeepMind did with it matrices with AlphaTensors in 2022. Amazing
@NoNameAtAll2
@NoNameAtAll2 2 жыл бұрын
analog of karatsuba's algo for matrixes existed for decades already deep mind found even better one
@mildpass
@mildpass 2 жыл бұрын
I would have liked to hear your comments on the structured matrix multiplication part of the paper. This is potentially the biggest speedup to linear algebra presented in the paper. Their skew symmetric matrix algorithm beats the state of the art by a factor of 2. Not 2 multiply operations, half as many as the previous best! Unfortunately the structured matrix section of the paper read a bit vague to me. Particularly the "extrapolation" part of the skew symmetric algorithm to larger matrices looked like it was done manually by a human. Would be really cool to see an ML algorithm be able to automate extrapolation of those algorithms. To me that would be the holy grail of AI optimized linear algebra.
@iestynne
@iestynne 2 жыл бұрын
This kind of fine-grained explanation is such a great service - thank you!!
@GandalfTheBrown117
@GandalfTheBrown117 2 жыл бұрын
Thank you for your videos. You're the Agadmator of ML/AI.
@khoda81
@khoda81 2 жыл бұрын
I can already see articles titled "DeepMind made an ai that can make itself faster!".
@stevey7997
@stevey7997 2 жыл бұрын
Really great video, thanks! Just a small mistake at 27:48. You look at the wrong entry in U (but I see how that easily happens of course).
@robrobbins
@robrobbins 2 жыл бұрын
I tried to write a Python program to implement their algorithm for multiplying 4x4 matrices in 47 steps but I suspect they left out the subtractions in their algorithm. There were nothing but additions and multiplications which is unusual. The numbers I got were way too high. You can find two of the algorithms they found in the PDF accompanying the article.
@PrzemyslawDolata
@PrzemyslawDolata 2 жыл бұрын
The 47-step algorithm is for *modular arithmetic*. In this system all the numbers are "modulo p" for some number p, i.e. the numbers "wrap around". E.g. in modulo 5 arithmetic, 4+4 = 3 (it would be 8, but 8 mod 5 is 3). That's why subtractions aren't necessary - numbers will stay confined to the (0;p-1) range by definition. And that's why numbers blew up when you were doing normal arithmetic.
@NoNameAtAll2
@NoNameAtAll2 2 жыл бұрын
@@PrzemyslawDolata can't one just take really big p?
@MagicGonads
@MagicGonads 2 жыл бұрын
@@NoNameAtAll2 I take a really big p sometimes if I drink too much water
@StoianAtanasov
@StoianAtanasov 2 жыл бұрын
Does this mean that the small speedups in matrix multiplications compound to huge speedups when running a Deep Network model with hundreds of matrix multiplications?
@PieGuyX1000
@PieGuyX1000 2 жыл бұрын
Oh shiiii…
@vanderkarl3927
@vanderkarl3927 2 жыл бұрын
This almost sounds too good to be true. Improving the speed of matrix multiplication... that basically speeds up EVERYTHING nowadays.
@cmilkau
@cmilkau 2 жыл бұрын
The relationship between matrix multiplication and the Tensor T can be written as C = AB = A:T:B. Here, we use double contractions because the "axis" of T corresponding to A (the one indexing its entries a1,a2,a3,a4) is actually two axis (a1=a11, a2=a12, a3=a21, a4=a22) indexing the rows and columns of A, respectively. Same for B. The result has two remaining axis, indexing the rows and columns of the product C. Decomposition means A:T:B = A:(u1 ⊗ v1 ⊗ w1 + u2 ⊗ v2 ⊗ w2 + ...):B = A:(u1 ⊗ v1 ⊗ w1):B + A:(u2 ⊗ v2 ⊗ w2):B + ... Here u,v,w are actually matrices the same size as A,B,C, respectively. Now A:(u ⊗ v ⊗ w):B can actually be computed using a single multiplication. First, compute A:u, which adds/subtracts entries of A (u only has entries 1,-1 and 0). Second, compute v:B, acting in the same way. Multiply the two numbers to obtain the number m = (A:u)(v:B) = A:(u ⊗ v):B. Now you only have to multiply m and w to get m ⊗ w = A:(u ⊗ v ⊗ w):B. Because w also only has entries +1,-1 and 0, m ⊗ w essentially means replacing the +1's in w by m and the -1's by -m.
@videowatching9576
@videowatching9576 2 жыл бұрын
I appreciate the content, so when I feel that I am not having time well spent, then watching and learning about AI on this channel is super useful and interesting.
@AlericResident
@AlericResident 2 жыл бұрын
Error at 27:45 ... its the top row of U, not the bottom row, so column 5 causes the cancel, not column 7.
@hellokingston-z7g
@hellokingston-z7g 2 жыл бұрын
The progress is really far smaller than the hype. Those in the theory community are not quite excited because the improvement happens not in real numbers, but in a field called F2, which only contains 0 and 1 and is not used by most people. Putting this restriction aside, the modest improvement is with respect to a result more than 50 years ago..... So why cares. From a practical perspective, the value is even smaller. If you really want to do matrix multiplication in real life, you have to consider numerical stability, memory allocation, and a bunch of other issues, and this is why Strassen algorithm is not used in the first place. So the experiment is really unconvincing to me and those who have experience in high-performance computing. It is definitely a cool result, but really not as big as most people think.
@janasandeep
@janasandeep 2 жыл бұрын
"These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware" as per deepmind blog.
@johanlarsson9805
@johanlarsson9805 2 жыл бұрын
16:00 Glad that you caught your misstake.. I was thinking you are in the second "slice" so it should be a2
@Obe_Qwaet
@Obe_Qwaet 2 жыл бұрын
I'm actually a little bit mad right now because I proposed doing this during high school and college and was told that's not the correct way to multiply matricies. Seeing an academic paper on it... I'm not sure how to feel atm.
@HUEHUEUHEPony
@HUEHUEUHEPony 2 жыл бұрын
You had low IQ instructors
@AlexanderBukh
@AlexanderBukh 2 жыл бұрын
Some faster ways were found decades back, now they only automated finding new ones.
@AlexanderBukh
@AlexanderBukh 2 жыл бұрын
Also, isn't it always "the wrong way to do it" until someone comes forward and _proves_ them(sayers) wrong?
@kikc
@kikc 2 жыл бұрын
If I were you I would show this to every human that ever wronged you and then kill them brutally for costing your future
@rmajdodin
@rmajdodin 2 жыл бұрын
Next time you have a great idea, just implement it. People love to see sth working.
@nadavmihov
@nadavmihov 2 жыл бұрын
In 28:00 isn't it supposed to be the other collum because you check the first row in U?
@mattiasfagerlund
@mattiasfagerlund 2 жыл бұрын
I thought so too, but by pure luck it worked out.
@Wobbothe3rd
@Wobbothe3rd 2 жыл бұрын
Thanks for this video! Had just heard of this development but didn't understand it, this explains it well.
@i4m3rr0r
@i4m3rr0r 2 жыл бұрын
I was wondering a lot why multiplication is actually so important as naturally in hardware multiplying numbers isn't that much worse than adding them (in comparison to for example division). The magic seems to be when we start thinking about big matrices as blocks of smaller matrices, like a 20x20 matrix can be though of as a 4x4 matrix where each entry is a 5x5 matrix. And you can then apply your matrix multiplication algorithms in the same way, but now every multiplication which was a scalar multiplication will become a MATRIX multiplication, whereas an addition will just be matrix addition. And as matrix addition is so much cheaper than multiplication, it is really important to have as few of them as possible. So these savings are relevant in the asymptotic case where we get to multiply two very large matrices and instead of using the n^3 algorithm we split the matrices smartly into block to apply the more optimized algorithms.
@ScottSummerill
@ScottSummerill 2 жыл бұрын
Received October 2, 2021?! Published a year later?
@cmilkau
@cmilkau 2 жыл бұрын
We're using a 3-dimensional tensor here but it's actually 6 dimensions because each axis corresponds to an entry in one of the 3 matrices (factor A, factor B, product C), and an entry is identified by a row and a column.
@Zed_Oud
@Zed_Oud 2 жыл бұрын
Received October 2021 Published October 2022 A year and 3 days.
@carlosrivadulla8903
@carlosrivadulla8903 2 жыл бұрын
I asked about this in 2 minutes parpers yesterday but no one answered me. Thnx u are making a video. Such a change on fundamentals should have a cascade impact in all fields of tech and science.
@AlericResident
@AlericResident 2 жыл бұрын
Maybe they should use the diffuse algorithm to get the tensor decompositions instead of finding it column by column; that gave a tremendous improvement for making drawings too.
@siquod
@siquod 2 жыл бұрын
How much energy would have been saved if these multiplication algorithms had been discovered and universally implemented 3 years ago?
@Derrekito
@Derrekito 2 жыл бұрын
This is a fantastic coincidence! I was trying to figure out similar generalized forms you mentioned in the beginning, but from the perspective of a computer architecture data dependency issue several months ago. I shelved the idea because after a few weeks I found it difficult to justify the time to spend on it. This solution might be a much better fit if it really works! I'm so excited to read this paper and start work towards implementing it.
@maloxi1472
@maloxi1472 2 жыл бұрын
Are you a descendant of the great Rolf Landauer ?!
@Derrekito
@Derrekito 2 жыл бұрын
@@maloxi1472 I sometimes fantasize that I have some connection to the Landauer limit. Sadly, I'm unaware of any familial connection. 😅
@RamRachum
@RamRachum 2 жыл бұрын
Bro had so much fun multiplying matrices, he almost forgot to talk about the research.
@GantryG
@GantryG 2 жыл бұрын
Awesome, thanks for breaking that down for us! 😀
@Ou8y2k2
@Ou8y2k2 2 жыл бұрын
It's been so many decades that I don't even remember what a matrix is. Is it that movie with John Wick?
@oflasch
@oflasch 2 жыл бұрын
Nice to see another excellent paper explanation video. Thank you! 😊
@hichamalaoui8024
@hichamalaoui8024 2 жыл бұрын
Instead of getting multiplication of two numbers by converting to binary digits and doing long addition operation , we can define a table for simple numbers multiplication from 1x1 to 9x9
@blanamaxima
@blanamaxima 2 жыл бұрын
This tricks are typically marginally affecting performance. There are other factors in data-flow that have a huge impact. The 8b multiplier is not the one that consumes energy or space and this techniques make the hw less uniform and kill utilization.
@TylerThomas
@TylerThomas 2 жыл бұрын
"That was long winded" about explained how well I understand it now lol
@gaggix7095
@gaggix7095 2 жыл бұрын
Multiplication is usually not more expensive than addition in modern hardware. Hardware floating-point now commonly implements fused multiply-add (FMA) instructions, for example, the 2×2 matrix multiplication can be implemented using 4 regular multiplications, 4 fused multiply-adds, and no separate additions at all. Also numerical stability can be a issue with these faster algorithm. So at the end of the day the standard O(N^3) will remain the way matrix multiplication is implementated, at least for a long time.
@nullptrRL
@nullptrRL 2 жыл бұрын
Really cool info, and I think the paper does a good job of addressing that reducing multiplication is only one possible training goal. Another possible training goal is matrix multiplication compute time, so that you can optimize for what the hardware can do fastest.
@drdca8263
@drdca8263 2 жыл бұрын
For large integers, I would think that multiplication would have to be slower? Like, adding two n digit numbers can be done in O(n) time, and, if you allow two different input tapes for the two inputs, or interleave the digits of the two numbers, can even be done by a finite state machine, but surely this can’t be done for multiplication, right? Maybe for the numbers which we practically will compute with on the hardware we have, they might take the same time? Would that be true even on eg GPU?
@gaggix7095
@gaggix7095 2 жыл бұрын
@@drdca8263 computers use a fixed number of bits to represent numbers because the ALUs are built to perform addition, multiplication etc in costant time, if they accept n bits to add and multiply, then yes, if the numbers increase, more times will be needed to perform these operations.
@JensDoll
@JensDoll 2 жыл бұрын
To be precise, it depends on the definition of "expensive". You can speedup a multiplication for a fixed format like IEEE754. Simply by unrolling the algorithm for each bit directly into a circuit. But the complexity of the circuit and therefore the energy consumption may be higher.
@zyansheep
@zyansheep 2 жыл бұрын
@@gaggix7095 doesn't multiplication still take more clock cycles than addition? And there are definitely more transistors... Would a custom silicon implementation of matrix multiplication (like you find in tensor chips) really not benefit from this research?
@alefratat4018
@alefratat4018 2 жыл бұрын
That's great, but I would still bet that clever product quantization algorithms for approximated matmul are the way to go for faster deep learning.
@airazure2050
@airazure2050 2 жыл бұрын
Will new factorizations found in this paper actually make their own ways into existing optimized BLAS (Basic Linear Algebra Subprograms) libraries?
@mareko8940
@mareko8940 2 жыл бұрын
I don't think so. Matrix multiplication is mainly limited due to data loading as far as I'm aware, but please correct me if I'm wrong
@himanshupalve3454
@himanshupalve3454 2 жыл бұрын
I don't think so since BLAS libraries don't use strassen algorithm for Matrix multiplication
@nohrocks
@nohrocks 2 жыл бұрын
Thank you for making this easy to understand
@-E42-
@-E42- 2 жыл бұрын
amazing how they find out about this now, 1000 trillion matrix multiplications and thousands of math lib implementations later :D
@mailoisback
@mailoisback 2 жыл бұрын
So we trade multiplications with additions? But why are additions faster than multiplications? I think on modern CPUs this is no longer true.
@mohaofa1544
@mohaofa1544 2 жыл бұрын
try to multiple 2 numbers in binary system and add two numbers in binary system which one will take more operations to give desire result ? remember computers work in Binary system
@derasor
@derasor 2 жыл бұрын
Could an even more efficient method be discovered by a next iteration of AlphaTensor? That would be truly mindblowing, but if not (meaning there is a hard limit on the most basic operation of machine learning), then there are IMO 2 takes from this achievement. 1- There definitely won't be a fast take off when AgI is implemented, and all the fear-mongering on AI being an eXisTEnTiAL thREat was just the overblown conclusion to a very defficient thought experiment. 2- Hardware engineers have a golden opportunity to design machines that implement this algorithm in the lowest levels.
@muskyoxes
@muskyoxes 2 жыл бұрын
Let's train on 3SAT just in case P=NP (although, if it does, seems like we'd have to train a super long time)
@hola-kx1gn
@hola-kx1gn 2 жыл бұрын
Love these videos, thank you so much.
@miniminerx
@miniminerx 2 жыл бұрын
And now alpha tensor can run faster to find more improvements. That part is crazy
@sndengale
@sndengale 2 жыл бұрын
Was waiting for this!
@MenkoDany
@MenkoDany 2 жыл бұрын
What a time to be alive!
@AlexanderBukh
@AlexanderBukh 2 жыл бұрын
Hold on to your GPUs!
@MenkoDany
@MenkoDany 2 жыл бұрын
@@AlexanderBukh Funny you say that, yesterday I sold my 3080 to get a 4090
@MrJaggy123
@MrJaggy123 2 жыл бұрын
I love the home shopping network tunes in the sponsor part ❤️
@NoNameAtAll2
@NoNameAtAll2 2 жыл бұрын
so you got sponsored by voice-to-text company... and your video STILL has no subtitles???
@90benj
@90benj 2 жыл бұрын
I am a bit confused whene you talk about rank 1 tensor, because I thought rank 1 tensors are just vectors. Time to brush up my linear algebra 101 30:27 The tensor shown on the left is not a 3D tensor, it is a 4D rank 3 tensor, as far I as understand it. imagine one row there is one vector, so a rank 1 tensor. If a vector has 2 numbers, it's a 2D vector, if it has 3 it's a 3D vector, so a 3D rank 1 tensor. Just because visually it is represented as a 3D object doesn't make it mathematical 3D I think.
@Handelsbilanzdefizit
@Handelsbilanzdefizit 2 жыл бұрын
I think, with the dawn of quantumcomputers, such problems will be blown away. Because matrix multiplication consists of many scalarproducts (or dot-products). row * column = With the help of superposition and quantumparallelism, it will deliver results almost instantly.
@sator666666
@sator666666 2 жыл бұрын
Great explanation. Thank you.
@isaacandrewdixon
@isaacandrewdixon 2 жыл бұрын
AI accelerating AI. Every day we are closer to the singularity.
@michaelnurse9089
@michaelnurse9089 2 жыл бұрын
Strictly speaking we get closer but the whole idea of singularity is that every step gets easier on the path. Rather the real situation is that we have made 27 steps on a 100 step path and each step gets harder with costs ballooning exponentially in some way. In other words, there will be no singularity with evolution of neural nets as after 32 steps any further steps costs more than nation states can afford.
@isaacandrewdixon
@isaacandrewdixon 2 жыл бұрын
@@michaelnurse9089 We're so close. The rate of progress is rapidly increasing.
@drewduncan5774
@drewduncan5774 2 жыл бұрын
Is there any mention of numerical stability of the resulting algorithms in the paper? I've only skimmed it, but I couldn't find anything.
@TheSinTube
@TheSinTube 2 жыл бұрын
As far as i understand the resulting algorithm is a simple adition and multiplication of the elements in the matices and therefor there shouldn't be any stability issues.
@drewduncan5774
@drewduncan5774 2 жыл бұрын
@@TheSinTube There definitely can be. Strassen's two-step algorithm is less stable than the naive algorithm, which is why it is often not used in practice despite being asymptotically faster.
@MagicGonads
@MagicGonads 2 жыл бұрын
he mentioned only using coefficients in the set {-2,-1,0,1,2} for numerical stability
@drewduncan5774
@drewduncan5774 2 жыл бұрын
@@MagicGonads Strassen's algorithm in fact only uses {-1, 0 ,1}.
@TheJysN
@TheJysN 2 жыл бұрын
Amazing thanks for the explanation.
@cmilkau
@cmilkau 2 жыл бұрын
16:00 dude, it's c1 = a1b1 + a2b3, it's written literally under the picture...
@eafindme
@eafindme 2 жыл бұрын
Sometimes, what we need is to just get back to the basic and rediscover.
@zepho17
@zepho17 2 жыл бұрын
Does anyone know the pdf reader Yannic is using? I never have enough space on my margins
@kaeptnbaloo
@kaeptnbaloo 2 жыл бұрын
Getting teached from Niko Bellic about Alpha Tensor is quite exciting.
@kinngrimm
@kinngrimm 2 жыл бұрын
I would have been a bit more interested in actual usecases later on in the real world. (or speculations about them as these may not have been established yet) Things that came to mind formost are prediction models. Flight path calculations to safe fuel including temperature and density of current fligth hight environment, maybe with weather forcast data. Then shape, weight and engines used within a plane/copter and many other factors which may play a role, but i would suppose could overall be improved by this approach of alpha tensors. Then there is space flight with orbital dynamics. Maybe tests of tensile strength objects with different materials or combination of materials? Pressure and dynamic tests within fluids. ... next to endless seem reallife applications as tensors are usable to adress so many things.
@russelldicken9930
@russelldicken9930 2 жыл бұрын
Er . . . you lost me partway through, but I think I got the gist of it. I'm getting too old for this. The implications are huge. I just can't wait to see the compiler implementations!
@kylepena8908
@kylepena8908 2 жыл бұрын
Do you think we can use this to discover O(N^2) algos for matrix multiplies? I sure hope so.
@drdca8263
@drdca8263 2 жыл бұрын
I suspect the minimal possible exponent there is greater than 2.
@swordwaker7749
@swordwaker7749 2 жыл бұрын
There is an n^2*log(n) lower bound for n. This assumes you use an arithmetic circuit... OFC, which only supports multiplication and addition. This still couldn't rule out weird stuffs like logarithms and so on.
@Dron008
@Dron008 2 жыл бұрын
I guess it can be O(1) if you pre-calculate results of multiplications of all possible matrices and store them somewhere. Concatenation of 2 matrices will be an index in this huge array.
@kylepena8908
@kylepena8908 2 жыл бұрын
@@Dron008 well erm I guess. But what's your space complexity
@ruroruro
@ruroruro 2 жыл бұрын
@@Dron008 what you just described is still at least O(N^2) in the best case. You still have to read both N^2 sized matrices to find your precomputed result.
@RobotProctor
@RobotProctor 2 жыл бұрын
27:46 row 1 of U is what we need to look at.
@Username56291
@Username56291 2 жыл бұрын
can it be possible an example calculating manually this method? this is a gamechanger also for students in linear algebra, +is the code published?
@serta5727
@serta5727 2 жыл бұрын
This adds up 😉
@simdimdim
@simdimdim 2 жыл бұрын
Looking forward to compilers development in 2023
@drewduncan5774
@drewduncan5774 2 жыл бұрын
0:11 "I know that sounds a bit boring." This really made me cringe. I understand this is youtube, but at least some of the audience is intelligent enough to understand the importance of this.
@paxdriver
@paxdriver 2 жыл бұрын
Why the hell didn't those geniuses at numpy figure this out years ago?! 😭
@AlexanderBukh
@AlexanderBukh 2 жыл бұрын
They did, see the table towards the end of the video, the smaller matrices were optimized by people decades ago. This improves larger ones, and by only a few steps. 52:10
@paxdriver
@paxdriver 2 жыл бұрын
@@AlexanderBukh they did but they didn't find the most optimal ones 😜 47 steps vs 48 is a huge difference when working with hundreds of thousands of parameters across many layers and repeating. I'm surprised they didn't use numpy itself to find more optimizations for numpy. There are only what, like 7 or 11 of them or something, right?
@drdca8263
@drdca8263 2 жыл бұрын
The action space is larger than for go? Ok, so if each number entry is in {-2,-1,0,1,2} and each vector has n^2 entries, ah, ok, I was initially thinking only n entries, so I was thinking 4 because I was thinking 4x4 matrices. But the vectors representing coefficients of each matrix entry would have n^2 entries. Ok. So, for 2x2 matrices, then there would be (5^(2^2))^3 = (5^4)^3 = 5^12 possible moves, which... is more than 2^24 , which, ok, yeah that’s many more possible moves than there are in Go.
@MagicGonads
@MagicGonads 2 жыл бұрын
the general action space is of size 5^(3((nmp)^2)) where we are taking an n x m matrix by an m x p matrix
@bloomp7999
@bloomp7999 2 жыл бұрын
It's great, i thought mathematicians had already figured general rules of the limit of matrices, but it seems they dont
@farrael004
@farrael004 2 жыл бұрын
I guess this is more in the realm of computer science than just math (even though both are closely related). There are many tricks computer programs do to get the same mathematical results while being much more computationally efficient.
@bloomp7999
@bloomp7999 2 жыл бұрын
@@farrael004 ​ yes but in math there are no way around things have to be mathematically right, so that's cool ai can do math things right. because in comparison, ai generated art is the opposite of anything mathematically right, the outputs are always wobbely and whafky, and we can't find mathematical truth in it
@farrael004
@farrael004 2 жыл бұрын
@@bloomp7999 Something can be mathematically right in many different ways. Some ways are just more efficient. One of the goals of computer science is finding these efficient ways to make calculations.
@sergiomanuel2206
@sergiomanuel2206 2 жыл бұрын
At 27:52 you made a mistake, you should point to the first row of U not the last. The result is the same.
@u_luana.j
@u_luana.j 2 жыл бұрын
Not being offensive, but shouldn't it be the first of U, first of V and last of W, since he was looking at the corner (a1,b1,c4)?
@MrKrtek00
@MrKrtek00 2 жыл бұрын
Very good and simple demonstration of what is happening. so algebra is just a game...
@thcoura
@thcoura 2 жыл бұрын
Interesting approach. I didn't read the article yet . So I making my point based on the face value of your video. Here we go: I have concern about the total number of operations (+,*) dealing with float values and the residual error produced
@peterszilvasi752
@peterszilvasi752 2 жыл бұрын
They did not use float values for the operations. They constrained the entries to "discrete set of coefficients F (for example, F = {−2, −1, 0, 1, 2})". See the DRL for algorithm discovery section on page 48.
@nadahlberg
@nadahlberg 2 жыл бұрын
Can I just say that 3d image during the tensor decomposition bit would have been SO much less confusing if they made the individual slices the C values and used each slice to show the A, B pairs you need to multiply and add to get that C
@yonikremer4135
@yonikremer4135 2 жыл бұрын
Did they try finding the algorithem for large metrices?
@bloomp7999
@bloomp7999 2 жыл бұрын
What if we build matrices with each terms being the resultant formulas found, and run the ai on these new matrices
@carlosquinones7560
@carlosquinones7560 2 жыл бұрын
Man Google really loves tensors.
@jellyboy00
@jellyboy00 2 жыл бұрын
I wonder does that only work on sparse tensor
@yilei1051
@yilei1051 2 жыл бұрын
You wanted to explain a1*b1*c4 would be cancelled out, and started from a1*b4*c1, then later changed to a4*b4*c1... and it still worked out, good for you XD
@martinhsl68hw
@martinhsl68hw 2 жыл бұрын
Thank you for doing that for us
@JamesAwokeKnowing
@JamesAwokeKnowing 2 жыл бұрын
This is dangerous and you should be ashamed for enabling people to multiply so efficiently. Now any clown can multiply some really harmful things.
@sn0wbr33z3
@sn0wbr33z3 2 жыл бұрын
The end is near
@yashashgaurav
@yashashgaurav 2 жыл бұрын
Do we have to create these decompositions for all kinds of matrix sizes? Or is there a piece of math I'm missing?
@JurekOK
@JurekOK 2 жыл бұрын
Yes, decomposition for a matrix of given size only deals with matrix of that size. Consider the structure of the "u" tensor visible at 5:30. Its contents do not say anything about the source matrix of size 5. Moreover, there is more than one decomposition for each source matrix size.
@nathandfox
@nathandfox 2 жыл бұрын
Unfortunately, this kind of accelerated matrix multiplications that partition across both rows and columns will have poor numerical stability, so they are not that useful in practice.
@cedricvillani8502
@cedricvillani8502 2 жыл бұрын
The first solution for partial differential equations was a Matrix function that was developed by the founder of MathWorks, in fact the logo of math works is that very matrix 😮😮 better looking 👀 now but go back to the original logo.. and ya fun Snapple Fact 🎉 whatever
@IAMschizoaffective
@IAMschizoaffective 2 жыл бұрын
Nice mohawk I'm rocking one too bro turok for life🧬
@Frankthegravelrider
@Frankthegravelrider 2 жыл бұрын
Did you ever explain what you use to make your videos?
@LeftBoot
@LeftBoot 2 жыл бұрын
Could you please explain the relevance of '7' aka 'Rank-7'? You mention the fact around the 8m30s mark. Is it related to 'Complexity Theory'?
@NoNameAtAll2
@NoNameAtAll2 2 жыл бұрын
iirc, rank is another name for number of dimentions MxNxKxLxPxOxT sized tensor is of rank 7
@avantan1980
@avantan1980 2 жыл бұрын
It looks like Fast Fourier Transform (FFT) to me. Anyone can confirm this?
ChatGPT: This AI has a JAILBREAK?! (Unbelievable AI Progress)
31:55
Yannic Kilcher
Рет қаралды 437 М.
Visualizing transformers and attention | Talk for TNG Big Tech Day '24
57:45
Hilarious FAKE TONGUE Prank by WEDNESDAY😏🖤
0:39
La La Life Shorts
Рет қаралды 44 МЛН
Jaidarman TOP / Жоғары лига-2023 / Жекпе-жек 1-ТУР / 1-топ
1:30:54
The Most Important Algorithm Of All Time
26:33
Veritasium
Рет қаралды 9 МЛН
How AI Discovered a Faster Matrix Multiplication Algorithm
13:00
Quanta Magazine
Рет қаралды 1,5 МЛН
The moment we stopped understanding AI [AlexNet]
17:38
Welch Labs
Рет қаралды 1,5 МЛН
Fast Inverse Square Root - A Quake III Algorithm
20:08
Nemean
Рет қаралды 5 МЛН
When Optimisations Work, But for the Wrong Reasons
22:19
SimonDev
Рет қаралды 1,2 МЛН
Transformers (how LLMs work) explained visually | DL5
27:14
3Blue1Brown
Рет қаралды 4,2 МЛН
Deepmind AlphaZero - Mastering Games Without Human Knowledge
42:29
The Artificial Intelligence Channel
Рет қаралды 195 М.
MIT Introduction to Deep Learning | 6.S191
1:09:58
Alexander Amini
Рет қаралды 825 М.
The Full History of ChatGPT
26:55
Art of the Problem
Рет қаралды 1,1 МЛН