Loop Tiling - HTTP 203

Рет қаралды 28,078

Күн бұрын

Пікірлер: 114

@lartme 5 жыл бұрын

The smallest "unit" that can be loaded or evicted in an Intel cache is 64-bytes, or a 16 pixel span of a 32-bit RGBA image row. When you write from the top-left to bottom-left of the destination image, you are accessing only one pixel of this 16-pixel chunk of the row. The typical L1 cache can only contain at the very most 512 of these 16-pixel chunks, best case (32Kb data cache). The performance increase comes from stopping before you get to the 512'th row and instead return to the top to fill out more of the 16-pixels in each cache line before they get evicted (stopping at some point before reaching 512 would be better because the cache replacement policy won't be perfect, and accessing the source image is also causing cache evictions). The above explains tile sizes above the 256-512 range perform badly in your graph, and why you would expect a ~10 speed up (it's worst case 15 more cache fetches without tiling). You might even get the same/better results from processing columns in the source image instead of tiles.

@dassurma 5 жыл бұрын

Thanks for that detailed write-up! I learned more on the topic after we recorded that video, but not as detailed as you wrote it down :D

@DenisTRUFFAUT 5 жыл бұрын

So bottleneck is not the number of CPU, rather the knowledge of amount of L1 cache, allowing developer to create properly sized jobs. Is there any JS function that gives the amount of L1 cache ?

@lartme 5 жыл бұрын

@@DenisTRUFFAUT Not that I'm aware of, but for desktops the L1 cache specifications don't vary too much. There are also techniques that exploit caches with different specifications without knowing the details of the cache ("cache oblivious" algorithms).

@victornpb 5 жыл бұрын

Isn’t tiling supposed to leverage SIMD instructions? Or even a tile is too big to be able to be transformed in a single instruction?

@youknowhu 5 жыл бұрын

Damn thanks for this explanation!

@CyberAcidPlanet 5 жыл бұрын

That browser must've been on the cutting edge of technology.

@sahilsatishkumar 5 жыл бұрын

It must be cutting the edges, but not cutting edge.

@JimCullen 5 жыл бұрын

I dunno, they said it's usually a very good browser.

@omg_look_behind_you 5 жыл бұрын

DO YOU GUYS MEAN INTERNET EXPLORER? HUH GUYS? CAN I PLAY WITH YOU?

@barcicle 5 жыл бұрын

come on, they just hit an edge case with that browser. i love how he says 'they hit corner case' instead

@pardal_bs 5 жыл бұрын

I see what you did there...

@hyyanabofakher6229 5 жыл бұрын

The most detailed video I have watched in here so far , The topic is super interesting, Great work , keep it up guys

@sahilsatishkumar 5 жыл бұрын

I have never seen Jake this bamboozled.

@iuliuvisovan7307 5 жыл бұрын

This is by far your best video so far. Good job. Keep them coming.

@marshal7591 5 жыл бұрын

I love watching this channel, even though I don't fully understand everything they're talking about the majority of the time. Keep up the good work! I'm learning more and more as I go :)

@Abhishek-dp5tc 3 жыл бұрын

That's the reason you love it because you don't fully understand

@JoeWong81 5 жыл бұрын

love the technicality of these explanations by Surma.

@abdullahtayel5405 5 жыл бұрын

It is called cache blocking, I felt like you first time I was introduced to it. Here is a really good article on this technique software.intel.com/en-us/articles/cache-blocking-techniques and to be honest I never thought that any web developer would find himself in a situation where he needs to implement this technique in his work :D

@stepans2167 3 жыл бұрын

Thanks for the link, it's a good read

@AmxCsifier 5 жыл бұрын

5:17 corner case... you mean Edge case?

@dassurma 5 жыл бұрын

Both work.

@AmxCsifier 5 жыл бұрын

@@dassurma it was a pun, i.e a joke LOL

@lkri7951 5 жыл бұрын

Thats amazing thing you guys shared, Even though this is Engineering (logical reasoning) it sometimes works more like intuition 😊

@sinaseirafi9445 5 жыл бұрын

This is the first video that I've seen from you guys, Really interesting. Thank you.

@lucacasonato 5 жыл бұрын

It doesn't run on a Mac... Safari runs on a Mac. Chrome runs on a Mac. Firefox runs on a Mac. Opera runs on a Mac. Internet Explorer and Edge don't. IE doesn't support WASM. IT MUST BE EDGE! Who would have thought?! :-P

@khai96x 5 жыл бұрын

Do not reveal the other browsers' identity. Well yes, but actually no.

@iuliuvisovan7307 5 жыл бұрын

Is he using a Mac? I thought maybe if he was using Windows, it would've been Safari.

@iuliuvisovan7307 5 жыл бұрын

Oh, I got the point in the video where he specifically said Mac. GG Edge

@dan-garden 5 жыл бұрын

Khải Hoàng We are allowed because we don’t represent Google, whether as they do. It’s Edge.

@youknowhu 5 жыл бұрын

As soon as they diplomatically said "But it's a very good browser", we knew it was Edge ;) Especially after Microsoft said it was forced to move to Chromium because Google kept intentionally breaking KZbin performance on Edge.

@lukemiles6813 5 жыл бұрын

Not sure if I'm missing something, but couldn't this be explained by the rotation? Reading the pixels a row at a time is much faster, but this means you're writing a column at a time for the result (so there are large gaps between the locations you're writing to). You want both fairly sequential reads and writes, so using tiling means that the write cache can cache a whole tile before writing it, rather than having to write single values spread out through memory. (edit): After re-listening through the bit about the two explanations, this might have been what you were saying actually, just in a fairly roundabout way.

@meldafert7619 5 жыл бұрын

yes, this is how i understand it too

@lukejagodzinski 5 жыл бұрын

I agree and I think it's a better answer to the problem than the one presented in the video. I think, what they're talking about is a little bit different. I remember when I was learning C++, it was emphasised that accessing array by index is very expensive, so that's why it's better to use pointers. Every time you do array[index] it has to go the beginning of the array in memory and move an array pointer to the given index. It's very expensive if you're jumping between rows. On the other hand, if you use pointer you're just responsible for moving it and just by having 2 pointers you can copy memory efficiently. However, it might also be a case that V8 does some optimisations when accessing array by index, so that problem would not apply here. It's just my intuition but I might be wrong about that.

@dassurma 5 жыл бұрын

@@lukejagodzinski Accessing an array by index is not more or less expensive than doing the pointer arithmetic manually. You might be thinking about vectors from the stdlib, which overload the index operator and do a lot more under the hood. Luke’s got it right. And that is what I meant, but wasn’t as coherent as I could have been. What I sadly only learned only after the video was recorded is that writes to memory go through the cache as well and can also cause the cache to get evicted (I was under the impression that only reads do that). So yes, keeping both reads and writes limited to fairly small memory area (for example by using tiling) reduces the chance of cache eviction which means a speedup.

@lukejagodzinski 5 жыл бұрын

@@dassurma so you say that V8 doesn't do extra jumps between memory cells when executing? array[100]; array[1000]; array[50]; I would assume that it would do something like: 1. Go to memory point where array starts 2. Move array pointer by 100 3. Go to memory point where array starts 4. Move array pointer by 1000 5. Go to memory point where array starts 6. Move array pointer by 50. With pointers it would just be: 1. Go to memory point where array starts 2. Move array pointer by 100 3. Move array pointer by 900 4. Move array pointer by -850 Also, as I think about that, tiling would actually not help with array access... So probably I'm wrong that it was the source of the problem. But still I think pointers would improve speed here

@dassurma 5 жыл бұрын

@@lukejagodzinski I was referring to C++ (because you said that array indexing is expensive). For JavaScript, it’s hard to tell what kind of assembler V8 would generate for that code as it depends on the types and what behavior the optimizer observers. But the optimization you outline is definitely within the capabilities of V8.

@sampathrg1 4 жыл бұрын

9:52 - Is that bug in the first line of the inner loop? says [newX, newY] = [newWidth - 1 - y, newY]. Shouldn't it be [newX, newY] = [newWidth - 1 - y, x]?

@frutiboy1 3 жыл бұрын

Btw, In your Progressively Loading Images episode we can see that images are loaded tile by tile in some really progressive formats...

@marcopfeiffer3032 3 жыл бұрын

Can I try an explanation? The reading is nice sequential but the writing isn't. Let's take a 4x4 pixel image which is just a single dimensional 16 item array/block of memory. When you rotate clockwise, then you copy 1 to 4, then you copy 2 to 8, then 3 to 12 and then 4 to 16. If you tile the 4x4 image into 2x2 images, then you copy 1 to 4, then 2 to 8 _but then_ 5 to 3 and 6 to 7. So the first 4 pixels in the tiled rotate used the read-buffer 1-6 and write buffer 4-8 (6+4=10 pixels in cache total). The first 4 pixels in the full scan used read-buffer 1-4 but then write buffer 4-16 (4+12=16 pixels in cache total, basically the entire image).

@k1ngjulien_ 5 жыл бұрын

Oh my god I loved this. JavaScript and low level cpu functionality are two of my favorite topics :D Thanks for making this. PS: Please tell your colleagues that work on GMail to finally make it a PWA so I don't have to use another program anymore :P

@jurgentreep 5 жыл бұрын

If you open your google chrome apps you can right click gmail and set it to open in a new window. This is almost the same as a pwa :) You can even create a shortcut for your desktop or start menu.

@mpoisot 2 жыл бұрын

Would a Hilbert curve be useful here for a more general solution, perhaps to avoid picking an explicit tile size?

@NuncNuncNuncNunc 4 жыл бұрын

I was on the edge of my seat hoping you would name the browser.

@dum.briyani 4 жыл бұрын

Sur-matram. Optimization levelled up.

@ben_lacy 5 жыл бұрын

Favorite episode to date! Really good stuff

@niklashigi 5 жыл бұрын

16:40, you forgot to link the podcast in the description. Very interesting video, btw!

@dassurma 5 жыл бұрын

I fixed it now! Thanks for letting me now

@martixy2 5 жыл бұрын

Cache locality? Cache locality! It's actually a pretty popular concept in performance critical code (e.g. game engines). I keep seeing it mentioned in a ton of GDC talks on architecture/optimization/related.

@YoussefELHOUTI 5 жыл бұрын

without tiling each write has to evict because you are rotating 90 degrees (x become y) when you do tiling, you don't need to evict foreach write and this why it's much faster

@hypersonic12 5 жыл бұрын

Fascinating video, guys! This was a great education, and entertaining at the same time. :)

@akinwunmioluwaseun3772 5 жыл бұрын

Actually drew my breathe in with Jake at 21:10

@OlleHellman 5 жыл бұрын

Very interesting! Thank you for doing your best explaining it.

@victornpb 5 жыл бұрын

Who doesn’t like a good browser war! The 90s is back guys

@kosnk 3 жыл бұрын

A great magical trick. Sadly, I still don't quite get it: some say in comments it's because the way the reading caching happens (better pattern prediction/cache flow through levels/cache eviction), others say that the write cache is also involved. I'll have to learn more on the topic, but at least knowing that such thing exists should help. Thanks! :)

@ZacharyBerger 5 жыл бұрын

Could you share the link to the Hacker News discussion?

@ZacharyBerger 5 жыл бұрын

Actually, found it myself: news.ycombinator.com/item?id=19166233

@LeviBuzolic 5 жыл бұрын

Super interesting video, great work! Entirely uncharted territory for me.

@aidanbrumsickle 5 жыл бұрын

If you wanted to write a parallel algorithm for rotation would you want to tile the image in the same way? that would seem to be useful for per-core caches, but then once separate cores are reading and writing from different locations it would seem to make shared cache performance worse. That's pure speculation on my part though. Perhaps a new side learning project is in order.

@emilemil1 5 жыл бұрын

I disagree that this is a "tools not rules" case. That this would be a performance issue is obvious from looking at the code, knowing the rules would let you catch it immediately. You're accessing a 2D array column by column. This *always* causes cache issues for large arrays, and images are very large. These are also obviously the instructions that will be run the most, by a significant margin, so it's worth thinking about from the start. In fact in C++ there is a whole paradigm dedicated to designing with the cache in mind. Look up data-oriented design.

@jakearchibald 5 жыл бұрын

Without tools, we wouldn't be able to confirm the result.

@emilemil1 5 жыл бұрын

Confirming isn't needed in obvious cases, it's just time wasted on unnecessary profiling. I mean do you profile before using binary search over linear search on an array of 2 million values? This is the same situation, just more obscure information.

@dassurma 5 жыл бұрын

@@emilemil1 If you had shown me the code before I learned about loop tiling, I’d have bet my ass that the new code was slower, even if maybe only marginally so. At some point you’ll have enough expertise in a field to be able to make these kind of calls without profiling and testing. For example: By now I know when using `will-change` will help and when it won’t. I don’t need to test, because _to me_ it’s obvious. But that doesn’t mean it’s obvious to everyone else and - more importantly - that others should just follow the advice “use will-change”. Even if you “know the rules” around `will-change`, there are a lot of scenarios where it will make things worse. So whenever I give advice around `will-change` I’d add “but measure and see if it actually helps”. It’s great if this case is obvious to you, but it wasn’t for me. We all have different backgrounds with different levels of knowledge. “Tools not rules” is a phrase that help us DevRels give advice in a way that everyone can use it while reducing the adverse effects of just listing rules.

@MrZyclops 5 жыл бұрын

*head explodes* great episode

@victornpb 5 жыл бұрын

Don’t tell me you had to reshoot this just to remove the browser name

@pflasterstrips7254 2 жыл бұрын

I expected this to be about code being parallelized by the compiler, and the outer tiling loops are better for that.

@JustJohnny 5 жыл бұрын

I've guided some clients that don't have photo optimization software to Squoosh and even I find myself using it for time to time because it does a better job than Photoshop. Any chance we'll see batch processing?

@dassurma 5 жыл бұрын

We are working on batch processing! (Currently implementing some lower-level building blocks that will pave the way for batch processing to land)

@Morgan_Davis 5 жыл бұрын

Can the optimal tile size be dynamically determined at runtime (with a small internal test of generated data) to avoid having to profile CPUs at all?

@dassurma 5 жыл бұрын

You could run a small benchmark on your target device, but I think that’s not worth the effort. 16 seems like a pretty safe number :)

@MacoveiVlad 5 жыл бұрын

Let's hope the speed gain is not based on speculative execution. If it is it might go away with a OS patch and leave everyone wandering what happened. Or worse, someone might find a exploit. This performance gains look really close to the metal. One would expect many layers of abstraction for something that runs in a browser.

@lakandoor1007 5 жыл бұрын

nice video about wasm :-)

@iuliuvisovan7307 5 жыл бұрын

10:15 Hey Jake, it's called an index :D

@lastdecember2496 5 жыл бұрын

Happy to hear that even google engineer not fully understand why their code working 😅

@dmitrisheley1998 5 жыл бұрын

now I'm intrigued... WHAT BROWSER WAS IT??

@areebjamaliam 5 жыл бұрын

Edge

@b3rakesh11 5 жыл бұрын

Thank you Google

@cristianberroeta9866 5 жыл бұрын

Great video, very interesting stuff! I would have thought that tiling improves performance because of some sort of concurrency (many tiles being processed at the same time), but that's not the case, right?

@justintaddei 5 жыл бұрын

The JavaScript code in the video is single-threaded, so no.

@anonymouse7074 5 жыл бұрын

Podcast not linked in the description 😂

@dassurma 5 жыл бұрын

I am not a smart man.

@ColinRichardson 5 жыл бұрын

Best video EVER

@NanobyteOnline 5 жыл бұрын

Conclusion: It works! Great.

@DenisTRUFFAUT 5 жыл бұрын

Is there a general rule for tiling ? Or a JS function that gives the number of cores ?

@dassurma 5 жыл бұрын

There is navigator.hardwareConcurrency

@ZacharyBerger 5 жыл бұрын

@@dassurma but number of cores doesn't matter here, its the cache size

@FlowertwigDotOrg 5 жыл бұрын

Actually useful info this time, thanks :)

@dassurma 5 жыл бұрын

Do I sense shade?

@apo34apo34 5 жыл бұрын

simply woah ^^

@furetosan 2 жыл бұрын

16:03 inaudible = AssemblyScript perhaps?

@b3rakesh11 5 жыл бұрын

Thank you

@GifCoDigital 5 жыл бұрын

My head hurts now. lol :)

@Marseleon 5 жыл бұрын

what was the second theory?

@b3rakesh11 5 жыл бұрын

Awesome

@nicolas.stucki 3 жыл бұрын

Tiling improves performance because it helps keep the destination matrix in the cache.

@arkanciscan 5 жыл бұрын

Does it rhyme with "kitchenette restorer"?

@recursiv 5 жыл бұрын

That one doesn't run wasm so no.

@GottZ 5 жыл бұрын

nice how they just described microsoft edge

@wmhilton-old 5 жыл бұрын

Any sufficiently advanced technology...

@franklemanschik_de 4 жыл бұрын

What happens here is simple you did the most basic optimization possible chunk your operations :)

@anonymouse7074 5 жыл бұрын

Name and shame!

@Benimation 5 жыл бұрын

Nobody knows you're talking about IE..

@hobbyturystaSEO 4 жыл бұрын

https 203 0:30

@pharmokan 4 жыл бұрын

what a horrible direction web development is going in

@panstromek 5 жыл бұрын

DOD comes to web ;)

@zaffarmughal5478 5 жыл бұрын

could be better with nice and decent. no need to be clown.

@dassurma 5 жыл бұрын

Where’d be the fun in that?