120x Faster Algorithm By Nested Loops

Рет қаралды 387,810

Күн бұрын

Пікірлер: 354

@MichaelAbramo Жыл бұрын

It takes a true legend to make a 15 minute demo on optimization that he probably spent a month making and then at the end say "but instead you can just use this library that's 8x faster"

@khhnator Жыл бұрын

but only if your problem is algebra... the thing is that you will be always having cache performance issues. is really a mine of free performance gains if you can optimize the hotspots of your code

@chrism3790 Жыл бұрын

Well, those libraries were created by someone who went through the optimization process in a much more thorough way than presented here, and that is interesting in and of itself.

@lesscommonsense1804 Жыл бұрын

True legends give the sauces.

@pithlyx Жыл бұрын

@@chrism3790 yea but optimizing for your specific use cases will be more performant than a catch all solution in most situations. additionally i think the video is also telling a story that can be applied to many sectors of optimization.

@chrism3790 Жыл бұрын

@@pithlyx I disagree with that. Most of the time, people overestimate their ability to hard core optimize things. It takes deep knowledge to truly extract every ounce of performance from your hardware. If you're working on well known problems, trying to do these things yourself is a waste of time that will produce suboptimal results, almost guaranteed. Ask yourself, would you be able to optimize matrix multiplication yourself, to the degree that it was done in this video? And remember, even the author admits there is more under the hood of the referenced implementation.

@DexterMorgan Жыл бұрын

I'm a prewatch viewer. I watch it once, write down notes, then watch it again and pretend I know the answers while crossing my arms and nodding my head with a smug look on my face. Classic.

@ThePrimeTimeagen Жыл бұрын

That is the way to do it

@YDV669 11 ай бұрын

A fellow after my heart. I do that with PVR. Once the thing watches the movie I told it to record, I feel I no longer have to and can get on with my day without too much waste of time.

@christophervsilvas Жыл бұрын

I’m doing full stack coming from a game development background. As a game developer I had to deal with this stuff all the time and so when I got my first full stack job I wrote everything like a high performance C++ algorithm even though it was just a typescript nodejs server. At first I thought I was always just a bad programmer and didn’t even know it but now I am realizing good code means different things for different contexts. In game dev it means minimal operations to achieve best result. In web it means minimal complexity and maximum maintainability by the broadest set of programmers. It’s almost like you have to program for the “worst” programmer in the “room”.

@boastfultoast Жыл бұрын

If you don't mind me asking - how did you swap from gave dev to full stack? What steps did you take to learn/fill in gaps for full stack

@christophervsilvas Жыл бұрын

@@boastfultoast Hey, the full story is a little complicated (I didn't exactly "swap", it's more like I got a good job from someone who didn't care what I knew, they just knew I could make what they need happen). Anyway, to answer your question I would really just say have a problem that you are interested in and solve it! Build something! I didn't really have the luxury of learning fullstack in the abstract. I started by creating a Shopify app with a Typescript NestJS backend, and NextJS admin ui front end. Redis for caching. MongoDB for db. You have to be okay with writing everything baddly for a while so you see the problems and go look for a solution. I can't tell you how many times I have suffered through ugly programming experiences only to find a feature I didn't know about that makes everything so much easier. After two years of development I am currently on our third major refactor. Jump in with a simple project and be okay with sucking and not knowing how much you suck. Side note: if you come from a typed language I would really recommend learning Typescript. It was a huge help for me to get started. (If you don't come from a typed language I would still recommend Typescript :D). I just recently "relearned" how important it can be to have a project to work on in order to learn. I saw the Primagen talking about Rust and decided to give it a try but couldn't really make heads or tail of it for a whole week... Luckily my boss told me to make a simple service for querying every zipcode in the USA against the AT&T and Verizon coverage maps. I started dissecting both AT&T and Verizon's APIs using Python as a quick mockup. Then I decided to try to create it in Rust. At first it was really really slow... but then it just started to fall into place and after 3 days I know Rust pretty well. That all may or may not be helpfull. If you have any more specfic questions, feel free to ask!

@doc8527 Жыл бұрын

I think you still got some wrong ideas from the video (not everything but how you view web and game in terms of difference). The video is the opposite of what you are doing. For a given "specific size of dataset" or "specific data structure", sometimes simply dumping the nested array might run faster (node.js in your example, V8 is doing some cache magic behind the scene), which is what a lot of devs are doing in web. Even though they don't realize that, and many in-place optimizations might lead to slow performance because the requirement change constantly, your old in-place optimization only works for specific size of data, then the stupid slow nested solution out-perform yours. The web is quite dynamic, so premature optimization is pretty bad. If you are just working on some local offline game, not the game server that has a drastically change of traffic all the time. Then your in-place optimization will be best because you know exactly what you will get. For web server development, a lot of times is hard to predict the incoming traffic or outgoing traffic (as db size grows). the stupidest slow solution can have the maximum reliability (regardless the size and structure ) and flexibility later if you need to make the change. The video is about optimization wouldn't work uniformly, size wise, structure wise, some works the best for the given range or given structure. I have the opposite experience from another side, I actually want to try out game development from fullstack (but traditionally just software development). I constant see game devs are doing over optimization without knowing the reason, and lead to unreadable and slow code, but they tend to so pretentious that they are the smartest, like those leetcode obsessives. Many are lack of computer science knowledge, but hey, game devs are really passionate (better than traditional software developers by a margin), I really really like them doing some creative thing. At the end is more case by case, person by person, I wouldn't generalize my experience to the entire field, whether it's game development or web development, whether who is smart or not, it's up to the specific area/problem you are working on.

@BeefIngot Жыл бұрын

I want to know this too. Not because I think its impossible at all, just curious. I would imagine for a motivated individual it just takes building your own project 0 user service to a reasonable enough quality to feel confident you know what is going on, front to back @@boastfultoast

@martinvuyk5326 Жыл бұрын

Yeah I also started on a more C performance track then fell on Web dev. It's absolutely about maintainability. But there are corners here and there where I still get a kick on optimization, especially now that I'm doing rather Data engineering on DB. I had to reimplements something with numpy the other day and it was not really as readable as before, but I'll take 22 seconds over an hour every day lol

@CattoFace Жыл бұрын

16:10 "transposing the walk" works without hurting performance only when it doesnt cause cache misses, earlier in the video, he did not account for cache size yet, so it caused misses. but now the block is in cache so the transposition doesnt cause extra misses.

@drooplug Жыл бұрын

Matt parker wrote a python script that he admitted wasn't very good, but it got thebjob done. It took something like 30 days to complete on his laptop. He shared it with his viewers and a number people rewrote it to run faster. I think the first rewrite reduced the rub time ti a few hours in python. Eventually, someone submitted code in another language that gor it down to something like 2 ms.

@GabrielSoldani Жыл бұрын

That’s pretty different, though. Parker used an inefficient algorithm. The community first found more efficient algorithms and then optimized it for cache hits and parallel execution. This video uses the same algorithm from start to finish.

@Z3rgatul 11 ай бұрын

@@GabrielSoldaniare you sure library doesn't use Strassen's algorithm?

@Hamstray 4 ай бұрын

@@GabrielSoldani yes, and he used an inefficient language.

@martijn3151 Жыл бұрын

The single most important take here: always profile your code. Don’t assume, measure.

@unbreakablefootage Жыл бұрын

Ye and look at the instructions generated

@DFPercush Жыл бұрын

-O3 or -Ofast , and -march=native are good for squeezing out the most performance out of your compiler on your test rig. If you control the server, or in source distros, you can leave it that way, but for public binary distributions it's good to have a well defined set of features to target, like plain x86, -mAVX, etc.

@martijn3151 Жыл бұрын

@DFPercush those compiler flags do come with a risk. I’ve seen games suddenly behaving differently when using Ofast for instance, which is definitely not something you want when being in optimization phase. There is a reason as to why optimization flags can be switched on and off; for some applications the shortcuts generated by the compiler are acceptable and the speed increase is worth it. For other applications, the speed increase isn’t all that much, but the risks are. So, be careful with using them would be my advise

@blindshellvideos 11 ай бұрын

what does the -march flag do anyway?@@DFPercush

@DFPercush 11 ай бұрын

@@blindshellvideos -m is used for a bunch of different options. I think it means "machine" or something like that. One of them is "arch" = "architecture". In other words what kind of CPU to compile for. This is how you can cross compile for other types of machines. But "native" means whatever processor the compiler is currently running on. This will allow it to take advantage of any advanced features you have on your machine like AVX, encryption acceleration hardware, etc. Otherwise, you can specify those features with flags like -mavx2. man gcc | grep -C 20 -- '-m' | less and look for x86 options.

@notapplicable7292 Жыл бұрын

Cache optimization has been really eye opening for me over the last year.

@petrkiyashko4248 Жыл бұрын

Same here, and also branch prediction and cpu pipelining lately

@alexhiatt3374 Жыл бұрын

same! I think people focus too much on "heap vs. stack" instead of "cached vs. uncached". the stack is way faster because it's basically always in cache, but heap usage doesn't necessarily need to be slow.

@astronemir Жыл бұрын

It’s also all wasted effort whenever cache layout size cpu instructions change.

@yegorzakharov8514 Жыл бұрын

Cache me daddy outside

@TheSnowMan2023 Жыл бұрын

@@Jim-gyswry i think it depends on the problem you are trying to solve and the framework/languages and packages/libraries you are using.

@sinom Жыл бұрын

Relating GB/s to Gflops is something that's often done in high performance multithreaded programming. If you look at a CPU that can handle 64Gflops. Every one of the floating operations will be using 2 8byte numbers and then outputting 1 8byte number. That would mean if every operation would require its own new data to be loaded the required throughout would be 128GB/s. But your actual bandwidth is probably, gonna be more like 8GB/s. So you somehow need to try fully utilizing all the flops with much less data. In matrix algorithms this is often done by basically loading tiles one after another. And then iterating over those. This will introduce more nested loops but it will also significantly reduce the amount of data loading required per floating point instructions, allowing for much higher utilization of the CPU. This is called tiling, and by now a lot of compilers do a simplified version of it on their own, but they don't know your exact flops and bandwidth specs, so if you do know them you can get out even more performance than your compiler can. Basically the exact same thing is also done when doing matrix maths on the GPU, since there there's also a huge bandwidth limit most of the time. Just in the GPU case you usually just use a library that already implemented it for you like Thrust or rocThrust

@BeefIngot Жыл бұрын

God, what a name those libraries have. You really just thrust that upon us.

@gauravnegi6857 Жыл бұрын

@@BeefIngotas a computer science noob I can't understand a lot, but that's really crazy amazing.

@IvandelMuntSoler 4 ай бұрын

@@gauravnegi6857 Its a way to say your crappy dual channel PC sucks no matter how fast your i9 or R9 is! We've been bottlenecked by dual channel since X58.

@bariole 4 ай бұрын

@@IvandelMuntSoler It is more latency issue though. His operations are fine grined, as he is reading 16 and than storing 8 bytes. We have been lantecy bottlenecked since 486 days. For reference ddr5 ram has latency about 14 ns, and edo ram in 1990 was about 60 ns.. Parallel chaneling and/or increase in word/page sizes is not helpful. The only thing which works is not using memory. And this video was about that.

@PythonPlusPlus Жыл бұрын

21 GB/s equates to 5.25 GFLOPS because floats are 4 bytes, and every operation has to load a float from memory. So 21/4 = 5.25

@markkalsbeek5883 7 ай бұрын

But it's multiplication, so you need to fetch 2 floats right?

@thewhitefalcon8539 4 ай бұрын

@@markkalsbeek5883 depends what you're doing. Sum up a big array? the running sum is in a register.

@DWal32 4 ай бұрын

you could multiply the same number, no?

@m4rt_ Жыл бұрын

16:40 if you are doing game programing and know what size the matrices will be, for example 4x4, you can just hardcode the multiplication. Since the size is set you don't need General Matrix Multiplication (GEMM), which is the algorithm he is optimizing. With GEMM you don't optimize for the specific size, you make something that can handle all sizes, but then may not have the best performance. When you know that the size will always be 4x4 you can make it do it the most optimal way.

@heavymetalmixer91 Жыл бұрын

Asking from a novice's perspective: Can you put an example in C++ here of how you would hardcode a 4x4 optimized multiplication?

@theairaccumulator7144 Жыл бұрын

@@heavymetalmixer91use a library like glm

@Ulchie Жыл бұрын

@@heavymetalmixer91 My assumption is that you just manually unroll any loops and have every multiplication laid out, probably using SIMD intrinsics to really optimize it.

@m4rt_ Жыл бұрын

@@heavymetalmixer91 I don't have a C++ version handy, but I have a Jai version. It should be relatively readable. (btw --- means that it is uninitialized). multiply :: (m: Matrix4, n: Matrix4) -> Matrix4 { result: Matrix4 = ---; result._11 = m._11*n._11 + m._12*n._21 + m._13*n._31 + m._14*n._41; result._21 = m._21*n._11 + m._22*n._21 + m._23*n._31 + m._24*n._41; result._31 = m._31*n._11 + m._32*n._21 + m._33*n._31 + m._34*n._41; result._41 = m._41*n._11 + m._42*n._21 + m._43*n._31 + m._44*n._41; result._12 = m._11*n._12 + m._12*n._22 + m._13*n._32 + m._14*n._42; result._22 = m._21*n._12 + m._22*n._22 + m._23*n._32 + m._24*n._42; result._32 = m._31*n._12 + m._32*n._22 + m._33*n._32 + m._34*n._42; result._42 = m._41*n._12 + m._42*n._22 + m._43*n._32 + m._44*n._42; result._13 = m._11*n._13 + m._12*n._23 + m._13*n._33 + m._14*n._43; result._23 = m._21*n._13 + m._22*n._23 + m._23*n._33 + m._24*n._43; result._33 = m._31*n._13 + m._32*n._23 + m._33*n._33 + m._34*n._43; result._43 = m._41*n._13 + m._42*n._23 + m._43*n._33 + m._44*n._43; result._14 = m._11*n._14 + m._12*n._24 + m._13*n._34 + m._14*n._44; result._24 = m._21*n._14 + m._22*n._24 + m._23*n._34 + m._24*n._44; result._34 = m._31*n._14 + m._32*n._24 + m._33*n._34 + m._34*n._44; result._44 = m._41*n._14 + m._42*n._24 + m._43*n._34 + m._44*n._44; return result; } and here is a simplified version of the struct Matrix4 Matrix4 :: struct { _11, _12, _13, _14 : float; _21, _22, _23, _24 : float; _31, _32, _33, _34 : float; _41, _42, _43, _44 : float; }

@oblivion_2852 Жыл бұрын

I have a strong suspicion that the simd registers being wider than the data... With the added latency would actually make it slower than just raw assembly on the memory addresses.

@emanuelfarauanu1760 Жыл бұрын

One of my tasks during an internship way back in 2018 was to do such optimisations in C++ on a Raspberry PI, I learnt so much from that. Really cool algs.

@tordjarv3802 Жыл бұрын

I think this level of optimization is more common in scientific computing than in game development since SC uses way more data (think peta to exa bytes) than in game development.

@0dsteel Жыл бұрын

Those probably use specialized hardware (for massively parallel computations), aka not the cpu. Cuda magic and kernels go brrrr

@tordjarv3802 Жыл бұрын

@@0dsteel I'm in that field, and yes accelerators are sometimes used such as GPU, but you will be surprised how much research code still runs on CPU only.

@0dsteel Жыл бұрын

@@tordjarv3802 can imagine how 10^n vs 10^(n+2) ticks of simulation per second makes a difference, games are pretty shy on things happening pre tick in comparison

@OREYG Жыл бұрын

Depends how you define scientific computing. In gamedev you usually have to crunch pretty considerate amount of data in 16ms, so such optimizations are mandatory. In scientific computing you are not bound by constraints of having to output something in real time. Then there are some numerically intensive applications that cannot use fastest available option (for example they require a certain accuracy). Then there are libraries, that are available to scientists, that already employ some of those techniques. What I'm trying to say, that in gamedev it's done as a necessity, while in scientific computing as an afterthought.

@0dsteel Жыл бұрын

@@OREYG Depends how you define constraints. A game engine doesn't have to produce 60+ frames a second, but it helps with sales. Take an engine that runs simulations. 1 second vs 2 minutes to simulate the same model for 1 year delta time is huge. Little Tommy experimenting with python scripts for his PhD thesis is not really what comes to my mind at least. --- Edit --- ;)

@sinom Жыл бұрын

No simply walking differently would mean you'd also have to walk differently on the other matrix so that would just shift the problem from one matrix to the other. The reason it works later on is because of that temporary copy being made.

@Impatient_Ape Жыл бұрын

The FORTRAN spaghetti code I used for research calculations back in the day used the LAPACK library, which, at the time, was the most optimized code for linear algebra in existence.

@abdullahahmad1642 Жыл бұрын

It still is in most cases, though there's a Rust library called Faer being developed that looks like it could give it a run for its money...

@IvandelMuntSoler 4 ай бұрын

Its still used.

@viniciuspjardim Жыл бұрын

In my understanding he only stopped transposing the matrix in the end because of the buffers. Without the buffers, a transposed copy of the matrix will lead to more cache hits. With the buffer we already have a copy. So he fuse two steps in one, transposing it into the buffer, so it's still transposed copy when accessing the buffer memory.

@gregorymorse8423 11 ай бұрын

Transpose load from cache vs transpose load from RAM. I thought the same.

@PhilfreezeCH Жыл бұрын

Last year I helped optimize the math library for my universities RISCV multi-core microprocessors (you can also buy commercial versions called GAP8 and GAP9). It was entirely this stuff, just with Fourier transforms and so on. We started with C code and by the end almost all inner loops were assembly, it was pretty fun honestly.

@manofacertainrage856 Жыл бұрын

Wow. I haven't seen this stuff for a while ... or felt the terror of implementing the first few optimizations for a grade. That's after two parallel programming classes in my master's that included OpenMP and many matrix multiplication algorithms. Took a CUDA course a few years afterward and some of the access/transpose problems are still a problem, but the CUDA libs are the answer there too. Awesome video!

@PiratePawsLive Жыл бұрын

This was awesome, I knew quite a bit about how memory and CPU's work. But the optimization was a bit of beauty even if the code looks like an aesthetic war crime xD. Love it. Now I'm interested in learning more about optimizing again :).

@aerocodes Жыл бұрын

the code aesthetics are really not good. I'm downvoting it

@chaselewis6534 Жыл бұрын

On the question for games does this scale for N=4 or just larger matrices. N=4 fits into 4 SIMD registers so you can do the entire set of operations actually in register. In fact you generally load both into register do a 4x4 transpose on one in register then do the multiply since transposing a 4x4 in register is only a couple instructions and lets you do all the multiply adds in parallel afterwards. So their is another level optimization at that point, but that is just because the matrices fit entirely in register. Once you get outside fitting everything in registers this is how GEMM works. There are more optimal algorithms though that get into fancier math that lets the matrix calculation scale below N^3 and more like N^2.4 or so.

@woolfel Жыл бұрын

it's a good video. Lots of people have talked about how bad cache hit miss affects performance, but the video explained it in a simple way that more people can understand. In the java world, lots of talks about mechanical sympathy have addressed this topic.

@tomaszrusnak5684 Жыл бұрын

Jako inżynier budownictwa pracuję na progtramach wykorzystujących obliczenia wytrzymałościowe właśnmie na macierzach, i im więcej danych (drobniej podzielona konstrukcja na elementy skończone) tym dokładniejsze wyniki otrzymujemy. Od ponad 10 lat obliczenia nie przyspieszają, i obliczanie macierzy o wymarach 200 000 x 200 000 elementów sąjuż wyzwaniem, nie mówiąc o tym, że wiele solverów jest napisachych bez wykorzystania wielu procesorów. Postanawiam na poważnie podjąć naukę programowania (nie tlko hobbistycznie jak do tej pory) aby rozpocząć w branży budowlanej optymalizację solverów. Świetny film.

@theoballet1519 Жыл бұрын

another optimization not mentioned here is using the super scalar nature of our processors by creaing a temp vector that stores the product of elements from the two lines, the multiplication will be performed in parrallel and the sums will be performed when the prior mult have been done. (this results in more code and more loops, but more perf too, especially for big lines since you wont have too much miss in branch prediction) this is an optimisation used by BLAS. another thing could be using INT representation since modern CPUs have more int functionnal units that floats.

@Kotfluegel Жыл бұрын

For small size matrices up to four dimensions, you usually employ specific types for each dimension, because then you can literally unroll all the loops. So, no need to use block matrices for graphic applications.

@m.sierra5258 Жыл бұрын

I optimized this very problem in a university competition, and won the competition. One thing they missed in this video: line stride. Not every memory can be put into ever cache line. In most cases, there is something like modulo 256 applied to the address to decide which cache bucket something gets put into. Sadly, a matrix of 1024 means that every item of a column gets put into the same cache bucket. So for us it increased throughout substantially to apply a padding of one cache entry to every matrix row. This has the effect that columns can be cached efficiently.

@ArkDEngal Жыл бұрын

loop ordering (locality, locality, locality) was so important that I had at least 3 classes in classes that covered it in college.

@heck-r 4 ай бұрын

Step 1: Spend a lot of time to do a crazy and very effective optimization Step 2: Forget to document it properly Step 3: Random normal colleague comes across it a month later: "Wow, this code is unreadable, matrix multiplication does not have to be this hard" * refactors it to the original slow version *

@enjoyart6194 4 ай бұрын

that is the a reality of the industry today, no joke.

@Hebruwu 4 ай бұрын

I did something like that during HPC class. It was fun. If you really want to get into HPC, matrix multiplication is just such an amazing example to learn on.

@B_dev 4 ай бұрын

i fukin love that he goes through all that just to end up with something thats 8x slower than just using a library lol

@filiformis Жыл бұрын

I encountered this stuff in my Computer Architecture class in college, but I haven't had the opportunity to use this knowledge it yet.

@leschopinesns100 Жыл бұрын

I've literally implemented vanilla GEMM in parallelism course at uni, multiplying big ass matrices.

@suede__ Жыл бұрын

That's a lot of computer science.

@zkreso Жыл бұрын

Can't wait to implement this optimization in the ERP software I write at work

@enjoyart6194 4 ай бұрын

That problem, is a classical question in an exam for Design & Analysis of Alghoritms students, in an Enginnering Science for IT Schools on Chile. Question 1: Write the brute force alghoritm for resolution of matrix multiplication; question 2: Then, write a performance optimization, considering the memory allocation limitations and the maxima, divide to reign; question 3: and calculate their computation cost and order of the algorihtm. 2 points by question for an evaluation over 1-7, with 7 the maximum... Aspiring to a 5 is for genius. If you obtain a 4, feel great, you pass.... if you just obtain a 3 or less, you fail. The 7 is only for God.... 1,2,3... 2 hours. Go Code.

@ward7576 Жыл бұрын

2:58 that's about the type of optimizations that a company, doing WordPress plugins, would ask you to do in your technical interview

@richardpowell1425 Жыл бұрын

The applications for large matrix multiplication include engineering and science. Important to get fast code.

@GRHmedia Жыл бұрын

This used to be the standard of what was taught in college when it came to programming. It is why they used to require you to learn ASM before you could take C and C++. A project I worked on 6 years ago was to create a large maze with many rooms attached about 40,000. The first version would take days to finish. So rewrote it and it was taking hours. The rewrote it again got it down to 30minutes. Another rewrite and I was down 8min, then 2.5 min, then 30seconds. Finally version 0.9seconds. That project was purely on a single thread at 3ghz a8-3850. I've seen much larger changes in performance stuff that would run weeks and months reduced to seconds. Most of those also ended up having a change in language used as well.

@alfred.clement Жыл бұрын

I was mind blown when I performed byte shuffling using x64 SIMD with the SSSE3 instruction set, specifically the pshufb instruction. I've eventually optimized reordering/transformation of complex data with vector instructions, resulting in a significant impact on real performance. It was about x88 faster... It matters A LOT, especially when you're dealing with low-latency operations.

@Blubb3rbub Жыл бұрын

Matrix Multiplication becomes even more interesting if you reach the "I can't reasonably fit my matrix into memory"-size of matrices and need to employ MPI and distributed memory.

@BleachWizz Жыл бұрын

5:13 - I think the idea is really to transpose the matrix in memory, we want the columns to be close together so when we read the first block in cache from the first multiplication all subsequent numbers possible for the next multiplications would be brought into cache. otherwise indices would be in intervals on the dimension of the matrix and you would read a lot less numbers at a time up to rereading memory every multiplication. 16:17 - the problem with walking is that you have to check if you CAN do that, in this case, localA and local B are both being completely brought into the cache, because now you're calculating different parts of the matrix at a time so the result needs to be completely fit in the cache as well or THIS PART will start causing problems.

@kitastro Жыл бұрын

9:14 I like that music that comes in

@zataritamods7499 Жыл бұрын

Transposing the matrix reminds me a lot of the concept of Swizzling, also doing the blocks reminds me a lot of the tiling method used by the ps2. It feels like this is generalizing the concept for matrix multiplication. Also really helps me conceptualize why someone would do something like this.

@7th_CAV_Trooper 11 ай бұрын

two guys having an argument over which compiled language is faster and this guy walks up, transposes some matrices, lights the computer on fire and then tells them it could be faster if he just walked the array backwards and walks out of the room.

@josephlabs Жыл бұрын

That’s crazy 😂. I aspire to be doing this one day soon.

@selasco 4 ай бұрын

this video remember me one time when i refactored a matrix that was using to show data to the user but was slow AF, my idea was to use 2 vectors (one as X and other as Y) and, every time the user change the data on the matrix, i just change the vectors (the original data, in the matrix, still the same forever). this save 50% performance and 50% memory use btw

@Peregringlk 11 ай бұрын

7:24 If you don't care about the order of elements of the array, you can remove an element at any position by swapping it with the last value and then pop. That way you don't need to "move" things around.

@alexcristache8830 Жыл бұрын

This is really cool. I had to do a very similar thing for one of my university courses called High Performance Computing. Basically simulate fluids using Lattice Boltzmann on a cluster of the university's supercomputer. Was done in C and had to pay attention to all the optimizations he talks about there. Memory access, cache, using OMP and so on. It was really cool, but extremely challenging. I guess this is used more in academia for different simulations like the one I mentioned.

@rogierderuijter7009 Жыл бұрын

Never seen prime this quite. Amazing video

@MrHaggyy Жыл бұрын

I´m doing automotive stuff. For small matrices, reorganizing doesn`t work that well. You load or create the matrix in the right order. What we do is "sliding" across the matrices by rotating pointers over the memory we are given. This allows you to fill your SIMD operation with the demand from multiple mathematical operations. Also, there are many hardware solvers for matrix multiplication, that can run in parallel to the CDU.

@josefpharma4714 Жыл бұрын

Transposing the memory layout of a matrix can improve cache hits a lot.

@thapr0digy Жыл бұрын

8:43 Prime realizes that this guys maths is blowing his mind.

@satibel 4 ай бұрын

both the most optimized (and non optimized*) code I wrote is a benchmark to see if what I wanted to do is even possible, I'm doing collision detection over thousands of entities the position is stored in 3 different ways, one normal way, and in the left side and right side of a simd less or equal comparison, so comparing collision is 1 instruction. * quad tree space partitioning is way more efficient as you don't scale in pyramid numbees but the same algorithm can be used on top of qsp (and it's the same in the worst case scenario where you have all collisions inside the smallest cell of your qsp.), so it doesn't affect the result, and you can also do the calculation on the gpu directly.

@TQuantP Жыл бұрын

I wanted to watch this video for a while now, just sitting there on a tab so old it wad beggining to gather dust. Thankfully now I can! Thanks Prime 😂

@besknighter Жыл бұрын

In games, we rarely use huge matrices, if ever. However, depending on the game, the amount of matrices multiplications itself is so so soooo big that it might be useful to optimize it this much.

@oliverer3 Жыл бұрын

This is the stuff that makes me question if I can even call myself a programmer. Any optimization I've ever had to do has been very basic stuff. I can't even begin to wrap my head around this. Then again I've never really done any compute-heavy programming.

@ArkhKGB Жыл бұрын

Blast from the past: had to do the same kind of things in engineering school. Try to optimize matrix multiplication, be happy with the results (same thing about finding cache sizes but no SIMD at the time). Then you whip out BLAS and it out performs your work one or two order of magnitude. And yeah, those big matrix multiplications are useful in most simulations for things like weather forecasting or fluid simulation.

@dominikmuller4477 6 ай бұрын

there have been recent improvements to the matrix multiplication algorithm. Just like multiplication of scalars has some convoluted improvements, you can multiply some sums of sub-matrices and get away with vastly fewer multiplications.

@resresres1 Жыл бұрын

modern programming at it's finest: Spend forever optimizing code, then find out there's a library for that and you wasted your time.

@mort44444 5 ай бұрын

Going thru the pain & learning is never a waste of time

@enjoyart6194 4 ай бұрын

Just when that library exist, if don't, then you really will know the art, math.

@perguto 8 күн бұрын

It's crazy that you can get that much performance without even touching the mathematical optimization side. While the naive matrix multiplication is O(N^3), you can get it down to something like O(N^2.3) with clever intermediate products

@Felix-ve9hs Жыл бұрын

I barely know anything about algorithms, programming, math, or optimization, but it's pretty impressive how knowledge about hardware can improve performance :)

@GRHmedia Жыл бұрын

The stated amount a CPU can do is based on if the program runs entirely inside cache. However if the program has to reach out to memory for new variables then it is limited by memory bandwidth.

@zayniddindev Жыл бұрын

In the first minute, I figured out that this video is not for me (yet) xD

@russelllapua4904 Жыл бұрын

Didn't even get that far 😂

@llothar68 Жыл бұрын

You are never to young to learn about cache locality. The matrix multiplication is just a use case, no need to understand that.

@shahbazkhan-ek7hp Жыл бұрын

28seconds😁

@nexovec Жыл бұрын

It's probably true that you don't have to transpose the matrix if it's float32 4x4, because the entire thing is 64 bytes, which is one cache line, so in that extremely common case it can actually be made even faster, you just load it into the SIMD registers directly from the definition, done.

@vladlu6362 Жыл бұрын

Except that, if you are running m/n matrices and not n/n matrices, it's not in a single cache line. It's only in the same cache line if you can guarantee that they are allocated all at once, and that's not always the casem

@nexovec Жыл бұрын

@@vladlu6362 I only meant 4x4 times 4x4, which is a pretty common case... What do you mean by allocated all at once. Never would I imagine anything else than having them contiguously in memory.

@kazedcat Жыл бұрын

@@nexoveccache alignment problem for example you have non vector data like the determinant that you need for your calculation over and over so you want them stored next to your 4x4 but the determinant value will cause your 4x4 to straddle two cacheline.

@nexovec Жыл бұрын

@@kazedcat Sure, always #align 64 or whatever the thing was.

@DFPercush Жыл бұрын

@@kazedcat That's the main advantage of a struct-of-arrays approach. The matrices can live in an alignas(64) array, and the det's or whatever can live in an alignas(4) array. You just have to write container.matrix[i] and container.det[i] instead of container[i].matrix and container[i].det .

@radialorbits 6 ай бұрын

5 nested loops is rookie numbers. I went 8 deep once for a crude brute force suspension geometry solver, plus another 6 nested if loops to test solution. Worked a charm😅

@ismaelvc3728 4 ай бұрын

The transposition can return a view, transposed indexes of the same data

@shapelessed Жыл бұрын

3:20 - Technically, if you have a 32-bit float, which is 4 bytes, you can squeeze about "5 TFLOPs" through 20GB/s memory bus, I think that's the analogy here...

@forinda Жыл бұрын

I have never seen @primeagen have this agree to almost everything. This guy is fluent in math

@omdevs Жыл бұрын

This is amazing!

@HyperFocusMarshmallow Жыл бұрын

Watched this video a while back. It’s really good.

@enviidakid Жыл бұрын

lmfaooo this video got me in tears right now

@aerocodes Жыл бұрын

This video made me feel dumb and I want to go and study linear algebra back again

@swannie1503 Жыл бұрын

I’ve had to do quite a bit of cache optimization. Rust is very good for it actually. I enjoy the problems. Similarity matrices with 100s of thousands of members features. It makes a huge difference.

@Shenepoy 8 ай бұрын

I did once optimize python code from 50min to 5min with some C binding libraries and multithreading + multiprocessing using 12cores/24threads cpu felt great, the best feeling I ever had after that hot garbage worked and later threw it at colleague to finish it

@ivanjelenic5627 Жыл бұрын

Also, if you know something about the matrices you can decompose them and do stuff more efficiently for some operations.

@HyperFocusMarshmallow Жыл бұрын

8:35 This is the point where you see in Primagen’s eyes how his brains got fried by maths.

@russelllapua4904 Жыл бұрын

This video has solidified that maths just isn't for me.

@KillianTwew 4 күн бұрын

1:08 Nah that was the Bible. GOT EM

@Hr1s7i Жыл бұрын

3:40 It's the same relation as the one between moments and velocity in mechanics. The CPU can perform that many billions of floating point operations in a single second. Meanwhile the RAM can "process" or store/cycle, that many in a single second. RAM is a capacitive hard wired reactive circuit, which means you have to wait for the rather long transient processes to do their thing. Couple that and the fact that you expend part of the bandwidth for the sake of communication (CPU's memory controller is the equivalent of a train station tossing trains at other stations, be it memory, gpu or in some cases SSDs) and you might end up with seemingly low performance under certain conditions. A good example of how this weakness can surface is when you operate on enormous active data sets. In the meantime, your PC can look like an absolute monster when playing a video game, where once you load the data necessary for the simulation to run, very little of it changes on the logic side of things. As a matter of fact, most game behaviours are de facto hard coded specifically to prevent overhead. In the meantime, driving a PID sim in Matlab can easily employ 4-5 cores when done in a non cheaty way.

@hannessteffenhagen61 Жыл бұрын

For small matrices the main thing is that parallelising the work for multiplying two matrices makes no sense anymore, the overhead of distributing the work would far exceed the actual work that needs to be done. Some of the other things mentioned in the video may still make sense though.

@rotors_taker_0h Жыл бұрын

Then you find yourself counting execution port's limits, micro-op cache, branchless logic, whatever. The very next level of optimization (for those curious enough) is to separate "logic" of transforming data (matrix multiplication or whatever) and "schedule" (correct blocking, SIMD, multithreading, etc) and let a compiler or special optimizer do it for you. And they can optimize _several_ operations in the best way. Google for stuff like Halide, Taichi, TVM, Tiramisu compiler, etc. This stuff may optimize for different hardware too (DSP, GPUs, etc). They make camera pipelines with that, neural networks, rendering, all kind of stuff. Computers are crazy fast and crazy difficult to program.

@sanderbos4243 Жыл бұрын

7:03 If you have a Set of 10 items, rather than transforming it into an array and doing O(n) removal by shifting everything over, you can just make it O(1) using the "swap-remove" `arr[i] = arr[--len];` that swaps the last element into the position of the removed element. Note this presumes it's a Set, so the order doesn't matter, but you said it was a Set. :)

@YumekuiNeru Жыл бұрын

was not the point that the O-notation does not take into account how the data structure is stored in memory so an array will probably be faster even if it naively says O(n) or whatever

@Marius426 Жыл бұрын

... unless you want to keep it sorted for binary search access, then you do have to move the elements. But again, for small enough sets it doesn't matter there either.

@sanderbos4243 Жыл бұрын

@@YumekuiNeru Sure, but my single-line suggested approach of removing items from the array is an improvement of the approach prime suggested where you're moving way more items. Why shift over say 5 items when you can just swap. It does depend on it being a Set, but that was prime's hypothetical.

@sanderbos4243 Жыл бұрын

@@Marius426 If your elements are integers and you want to find an element in it, you can do it in O(1) as well rather than binary search's O(log N). You just need to maintain a second array that is the same length, where you index it with the item you're looking for and it tells you what index it's at. No hashing required. This of course doesn't really matter for performance when dealing with 10 integers, but with more integers it can. I hear you say "but my elements often aren't integers", but you can let those integers be indices into an array of the structs or whatever you're looking for. Think of them as entity IDs. This costs you another level of indirection, but on the other hand swapping integers is faster than swapping large structs. But this is admittedly going quite far, just wanted to bring up the possibility. "We can solve any problem by introducing an extra level of indirection." "…except for the problem of too many levels of indirection,"

@deldrinov Жыл бұрын

I'm a novice in programming at best but I've encountered something similar once. I wanted to add attributes to each member of an array which themselves were in an array; pretty much extending 2D array if my lingo is correct. I did that in 2 nested for loops, and the initial code ran like crap. I swapped the for statements without any other changes and speed increased 40x.

@DFPercush Жыл бұрын

There's a similar thing in image processing. It's usually better to loop for (y=...) on the outer loop and for (x=...) in the inner loop. That way you're accessing memory contiguously in a horizontal manner and not skipping several cache lines on every pixel.

@johanngambolputty5351 Жыл бұрын

Did I see a Manjaro terminal, hell yeah! (oh and something like astronvim too, respect)

@markuszeller_official Жыл бұрын

Uncle Bob is going to die if sees these nested loops.

@BrainySmurf77 Жыл бұрын

I spent a large part of my PhD doing optimizations like this to my CFD codes so I could graduate faster.

@notspm9157 4 ай бұрын

Larger matrices it's not optimal to actually do all the multiplications like that. It's no longer an O(n^2) operation, instead there are tricks to reduce the amount of multiplications down to something like O(nlognlogn) can't remember exactly which is where I think the other one likely benefits from. It reduces the overall steps needed

@lameshithead 10 ай бұрын

11:47 this is called a native array, since it is fixed in memory and you cant push

@TutoPanographic Жыл бұрын

The optimization concept in every mind is to try to make is faster, smaller, more efficient, etc. while the real problem is to choose first the right algorithm. Thanks Donald. Also the video doesn't take care to tell that even if this low level optimizations work, they probably compile differently on different machines. And, every optimization is contextual, because every problem /need is different. And then there's reality. When the client code is not new, you know you'll witness schizophrenia...

@mubin-ansari Жыл бұрын

Prime without hoodie? This is going to be a great article

@StarLord7544 4 ай бұрын

Do pro coders do not use using namespace std?? I am asking cause I only used c++ for DSA and never used it for dev purpose. Although I some experience in dev with JS

@josephvictory9536 3 ай бұрын

The real takeaway for me is that i need a way to test my code's performance with more comprehensive and objective metrics than the tests i use now. And use those tests to validate my code and build some good healthy *measured* intuitions.

@gingeral253 Жыл бұрын

This cache optimization has went against the programming principles I had learned.

@lameshithead 10 ай бұрын

i just dont get it if you can avoid redundancy you use a adjastancy matrix nested loop, if not you either use matrix multiplication with native arrays, or you use tensors

@michaelt126 Жыл бұрын

Im pretty sure modern compilers should already use a combination of prefetch and blocking of large arrays without having to include additional libraries,, maybe this is not a standard for all languages, but its worth knowing prior to adding a bunch of redundant code.

@cblair1353 Жыл бұрын

Linear algebra is fucking magic.

@hkravch Жыл бұрын

Bro is using that star trek “inser technobabble” thing

@spamviking8591 Жыл бұрын

Some of the best devs I know suffer from premature optimization.

@ZM-dm3jg 7 ай бұрын

He missed an easy optimization, for loops are much faster in reverse. Cmon bruh, even a junior developer knows that you run for loops in reverse for maximum peeformance

@QayLikeKay 8 ай бұрын

Even small cache line sizes are 32 bytes long, so a 4x4 float matrix would be 16 bytes, 2 of them would be 32. So both matrices could fit in one cache line and the result another (in the worst case scenario, in a 64byte cache line system everything *could* potentially fit in one cache line). People in chat saying your wrong forget that pretty much everything in computing is a trade off and many optimizations come at the cost of some fixed amount of setup time (or a setup time that scales slower than the normal case). As one of my professors once said "Bubble sort beats small children" (english was not his first language), but he meant that bubble sort beats algorithms like quicksort for small child data sets as it can just go, it doesn't have to allocate extra buffers, or do anything else. That said "bubble sort beats small children" sticks in your brain way better.