The smallest "unit" that can be loaded or evicted in an Intel cache is 64-bytes, or a 16 pixel span of a 32-bit RGBA image row. When you write from the top-left to bottom-left of the destination image, you are accessing only one pixel of this 16-pixel chunk of the row. The typical L1 cache can only contain at the very most 512 of these 16-pixel chunks, best case (32Kb data cache). The performance increase comes from stopping before you get to the 512'th row and instead return to the top to fill out more of the 16-pixels in each cache line before they get evicted (stopping at some point before reaching 512 would be better because the cache replacement policy won't be perfect, and accessing the source image is also causing cache evictions). The above explains tile sizes above the 256-512 range perform badly in your graph, and why you would expect a ~10 speed up (it's worst case 15 more cache fetches without tiling). You might even get the same/better results from processing columns in the source image instead of tiles.
@dassurma5 жыл бұрын
Thanks for that detailed write-up! I learned more on the topic after we recorded that video, but not as detailed as you wrote it down :D
@DenisTRUFFAUT5 жыл бұрын
So bottleneck is not the number of CPU, rather the knowledge of amount of L1 cache, allowing developer to create properly sized jobs. Is there any JS function that gives the amount of L1 cache ?
@lartme5 жыл бұрын
@@DenisTRUFFAUT Not that I'm aware of, but for desktops the L1 cache specifications don't vary too much. There are also techniques that exploit caches with different specifications without knowing the details of the cache ("cache oblivious" algorithms).
@victornpb5 жыл бұрын
Isn’t tiling supposed to leverage SIMD instructions? Or even a tile is too big to be able to be transformed in a single instruction?
@youknowhu5 жыл бұрын
Damn thanks for this explanation!
@CyberAcidPlanet5 жыл бұрын
That browser must've been on the cutting edge of technology.
@sahilsatishkumar5 жыл бұрын
It must be cutting the edges, but not cutting edge.
@JimCullen5 жыл бұрын
I dunno, they said it's usually a very good browser.
@omg_look_behind_you5 жыл бұрын
DO YOU GUYS MEAN INTERNET EXPLORER? HUH GUYS? CAN I PLAY WITH YOU?
@barcicle5 жыл бұрын
come on, they just hit an edge case with that browser. i love how he says 'they hit corner case' instead
@pardal_bs5 жыл бұрын
I see what you did there...
@hyyanabofakher62295 жыл бұрын
The most detailed video I have watched in here so far , The topic is super interesting, Great work , keep it up guys
@sahilsatishkumar5 жыл бұрын
I have never seen Jake this bamboozled.
@iuliuvisovan73075 жыл бұрын
This is by far your best video so far. Good job. Keep them coming.
@marshal75915 жыл бұрын
I love watching this channel, even though I don't fully understand everything they're talking about the majority of the time. Keep up the good work! I'm learning more and more as I go :)
@Abhishek-dp5tc3 жыл бұрын
That's the reason you love it because you don't fully understand
@JoeWong815 жыл бұрын
love the technicality of these explanations by Surma.
@abdullahtayel54055 жыл бұрын
It is called cache blocking, I felt like you first time I was introduced to it. Here is a really good article on this technique software.intel.com/en-us/articles/cache-blocking-techniques and to be honest I never thought that any web developer would find himself in a situation where he needs to implement this technique in his work :D
@stepans21673 жыл бұрын
Thanks for the link, it's a good read
@AmxCsifier5 жыл бұрын
5:17 corner case... you mean Edge case?
@dassurma5 жыл бұрын
Both work.
@AmxCsifier5 жыл бұрын
@@dassurma it was a pun, i.e a joke LOL
@lkri79515 жыл бұрын
Thats amazing thing you guys shared, Even though this is Engineering (logical reasoning) it sometimes works more like intuition 😊
@sinaseirafi94455 жыл бұрын
This is the first video that I've seen from you guys, Really interesting. Thank you.
@lucacasonato5 жыл бұрын
It doesn't run on a Mac... Safari runs on a Mac. Chrome runs on a Mac. Firefox runs on a Mac. Opera runs on a Mac. Internet Explorer and Edge don't. IE doesn't support WASM. IT MUST BE EDGE! Who would have thought?! :-P
@khai96x5 жыл бұрын
Do not reveal the other browsers' identity. Well yes, but actually no.
@iuliuvisovan73075 жыл бұрын
Is he using a Mac? I thought maybe if he was using Windows, it would've been Safari.
@iuliuvisovan73075 жыл бұрын
Oh, I got the point in the video where he specifically said Mac. GG Edge
@dan-garden5 жыл бұрын
Khải Hoàng We are allowed because we don’t represent Google, whether as they do. It’s Edge.
@youknowhu5 жыл бұрын
As soon as they diplomatically said "But it's a very good browser", we knew it was Edge ;) Especially after Microsoft said it was forced to move to Chromium because Google kept intentionally breaking KZbin performance on Edge.
@lukemiles68135 жыл бұрын
Not sure if I'm missing something, but couldn't this be explained by the rotation? Reading the pixels a row at a time is much faster, but this means you're writing a column at a time for the result (so there are large gaps between the locations you're writing to). You want both fairly sequential reads and writes, so using tiling means that the write cache can cache a whole tile before writing it, rather than having to write single values spread out through memory. (edit): After re-listening through the bit about the two explanations, this might have been what you were saying actually, just in a fairly roundabout way.
@meldafert76195 жыл бұрын
yes, this is how i understand it too
@lukejagodzinski5 жыл бұрын
I agree and I think it's a better answer to the problem than the one presented in the video. I think, what they're talking about is a little bit different. I remember when I was learning C++, it was emphasised that accessing array by index is very expensive, so that's why it's better to use pointers. Every time you do array[index] it has to go the beginning of the array in memory and move an array pointer to the given index. It's very expensive if you're jumping between rows. On the other hand, if you use pointer you're just responsible for moving it and just by having 2 pointers you can copy memory efficiently. However, it might also be a case that V8 does some optimisations when accessing array by index, so that problem would not apply here. It's just my intuition but I might be wrong about that.
@dassurma5 жыл бұрын
@@lukejagodzinski Accessing an array by index is not more or less expensive than doing the pointer arithmetic manually. You might be thinking about vectors from the stdlib, which overload the index operator and do a lot more under the hood. Luke’s got it right. And that is what I meant, but wasn’t as coherent as I could have been. What I sadly only learned only after the video was recorded is that writes to memory go through the cache as well and can also cause the cache to get evicted (I was under the impression that only reads do that). So yes, keeping both reads and writes limited to fairly small memory area (for example by using tiling) reduces the chance of cache eviction which means a speedup.
@lukejagodzinski5 жыл бұрын
@@dassurma so you say that V8 doesn't do extra jumps between memory cells when executing? array[100]; array[1000]; array[50]; I would assume that it would do something like: 1. Go to memory point where array starts 2. Move array pointer by 100 3. Go to memory point where array starts 4. Move array pointer by 1000 5. Go to memory point where array starts 6. Move array pointer by 50. With pointers it would just be: 1. Go to memory point where array starts 2. Move array pointer by 100 3. Move array pointer by 900 4. Move array pointer by -850 Also, as I think about that, tiling would actually not help with array access... So probably I'm wrong that it was the source of the problem. But still I think pointers would improve speed here
@dassurma5 жыл бұрын
@@lukejagodzinski I was referring to C++ (because you said that array indexing is expensive). For JavaScript, it’s hard to tell what kind of assembler V8 would generate for that code as it depends on the types and what behavior the optimizer observers. But the optimization you outline is definitely within the capabilities of V8.
@sampathrg14 жыл бұрын
9:52 - Is that bug in the first line of the inner loop? says [newX, newY] = [newWidth - 1 - y, newY]. Shouldn't it be [newX, newY] = [newWidth - 1 - y, x]?
@frutiboy13 жыл бұрын
Btw, In your Progressively Loading Images episode we can see that images are loaded tile by tile in some really progressive formats...
@marcopfeiffer30323 жыл бұрын
Can I try an explanation? The reading is nice sequential but the writing isn't. Let's take a 4x4 pixel image which is just a single dimensional 16 item array/block of memory. When you rotate clockwise, then you copy 1 to 4, then you copy 2 to 8, then 3 to 12 and then 4 to 16. If you tile the 4x4 image into 2x2 images, then you copy 1 to 4, then 2 to 8 _but then_ 5 to 3 and 6 to 7. So the first 4 pixels in the tiled rotate used the read-buffer 1-6 and write buffer 4-8 (6+4=10 pixels in cache total). The first 4 pixels in the full scan used read-buffer 1-4 but then write buffer 4-16 (4+12=16 pixels in cache total, basically the entire image).
@k1ngjulien_5 жыл бұрын
Oh my god I loved this. JavaScript and low level cpu functionality are two of my favorite topics :D Thanks for making this. PS: Please tell your colleagues that work on GMail to finally make it a PWA so I don't have to use another program anymore :P
@jurgentreep5 жыл бұрын
If you open your google chrome apps you can right click gmail and set it to open in a new window. This is almost the same as a pwa :) You can even create a shortcut for your desktop or start menu.
@mpoisot2 жыл бұрын
Would a Hilbert curve be useful here for a more general solution, perhaps to avoid picking an explicit tile size?
@NuncNuncNuncNunc4 жыл бұрын
I was on the edge of my seat hoping you would name the browser.
@dum.briyani4 жыл бұрын
Sur-matram. Optimization levelled up.
@ben_lacy5 жыл бұрын
Favorite episode to date! Really good stuff
@niklashigi5 жыл бұрын
16:40, you forgot to link the podcast in the description. Very interesting video, btw!
@dassurma5 жыл бұрын
I fixed it now! Thanks for letting me now
@martixy25 жыл бұрын
Cache locality? Cache locality! It's actually a pretty popular concept in performance critical code (e.g. game engines). I keep seeing it mentioned in a ton of GDC talks on architecture/optimization/related.
@YoussefELHOUTI5 жыл бұрын
without tiling each write has to evict because you are rotating 90 degrees (x become y) when you do tiling, you don't need to evict foreach write and this why it's much faster
@hypersonic125 жыл бұрын
Fascinating video, guys! This was a great education, and entertaining at the same time. :)
@akinwunmioluwaseun37725 жыл бұрын
Actually drew my breathe in with Jake at 21:10
@OlleHellman5 жыл бұрын
Very interesting! Thank you for doing your best explaining it.
@victornpb5 жыл бұрын
Who doesn’t like a good browser war! The 90s is back guys
@kosnk3 жыл бұрын
A great magical trick. Sadly, I still don't quite get it: some say in comments it's because the way the reading caching happens (better pattern prediction/cache flow through levels/cache eviction), others say that the write cache is also involved. I'll have to learn more on the topic, but at least knowing that such thing exists should help. Thanks! :)
@ZacharyBerger5 жыл бұрын
Could you share the link to the Hacker News discussion?
@ZacharyBerger5 жыл бұрын
Actually, found it myself: news.ycombinator.com/item?id=19166233
@LeviBuzolic5 жыл бұрын
Super interesting video, great work! Entirely uncharted territory for me.
@aidanbrumsickle5 жыл бұрын
If you wanted to write a parallel algorithm for rotation would you want to tile the image in the same way? that would seem to be useful for per-core caches, but then once separate cores are reading and writing from different locations it would seem to make shared cache performance worse. That's pure speculation on my part though. Perhaps a new side learning project is in order.
@emilemil15 жыл бұрын
I disagree that this is a "tools not rules" case. That this would be a performance issue is obvious from looking at the code, knowing the rules would let you catch it immediately. You're accessing a 2D array column by column. This *always* causes cache issues for large arrays, and images are very large. These are also obviously the instructions that will be run the most, by a significant margin, so it's worth thinking about from the start. In fact in C++ there is a whole paradigm dedicated to designing with the cache in mind. Look up data-oriented design.
@jakearchibald5 жыл бұрын
Without tools, we wouldn't be able to confirm the result.
@emilemil15 жыл бұрын
Confirming isn't needed in obvious cases, it's just time wasted on unnecessary profiling. I mean do you profile before using binary search over linear search on an array of 2 million values? This is the same situation, just more obscure information.
@dassurma5 жыл бұрын
@@emilemil1 If you had shown me the code before I learned about loop tiling, I’d have bet my ass that the new code was slower, even if maybe only marginally so. At some point you’ll have enough expertise in a field to be able to make these kind of calls without profiling and testing. For example: By now I know when using `will-change` will help and when it won’t. I don’t need to test, because _to me_ it’s obvious. But that doesn’t mean it’s obvious to everyone else and - more importantly - that others should just follow the advice “use will-change”. Even if you “know the rules” around `will-change`, there are a lot of scenarios where it will make things worse. So whenever I give advice around `will-change` I’d add “but measure and see if it actually helps”. It’s great if this case is obvious to you, but it wasn’t for me. We all have different backgrounds with different levels of knowledge. “Tools not rules” is a phrase that help us DevRels give advice in a way that everyone can use it while reducing the adverse effects of just listing rules.
@MrZyclops5 жыл бұрын
*head explodes* great episode
@victornpb5 жыл бұрын
Don’t tell me you had to reshoot this just to remove the browser name
@pflasterstrips72542 жыл бұрын
I expected this to be about code being parallelized by the compiler, and the outer tiling loops are better for that.
@JustJohnny5 жыл бұрын
I've guided some clients that don't have photo optimization software to Squoosh and even I find myself using it for time to time because it does a better job than Photoshop. Any chance we'll see batch processing?
@dassurma5 жыл бұрын
We are working on batch processing! (Currently implementing some lower-level building blocks that will pave the way for batch processing to land)
@Morgan_Davis5 жыл бұрын
Can the optimal tile size be dynamically determined at runtime (with a small internal test of generated data) to avoid having to profile CPUs at all?
@dassurma5 жыл бұрын
You could run a small benchmark on your target device, but I think that’s not worth the effort. 16 seems like a pretty safe number :)
@MacoveiVlad5 жыл бұрын
Let's hope the speed gain is not based on speculative execution. If it is it might go away with a OS patch and leave everyone wandering what happened. Or worse, someone might find a exploit. This performance gains look really close to the metal. One would expect many layers of abstraction for something that runs in a browser.
@lakandoor10075 жыл бұрын
nice video about wasm :-)
@iuliuvisovan73075 жыл бұрын
10:15 Hey Jake, it's called an index :D
@lastdecember24965 жыл бұрын
Happy to hear that even google engineer not fully understand why their code working 😅
@dmitrisheley19985 жыл бұрын
now I'm intrigued... WHAT BROWSER WAS IT??
@areebjamaliam5 жыл бұрын
Edge
@b3rakesh115 жыл бұрын
Thank you Google
@cristianberroeta98665 жыл бұрын
Great video, very interesting stuff! I would have thought that tiling improves performance because of some sort of concurrency (many tiles being processed at the same time), but that's not the case, right?
@justintaddei5 жыл бұрын
The JavaScript code in the video is single-threaded, so no.
@anonymouse70745 жыл бұрын
Podcast not linked in the description 😂
@dassurma5 жыл бұрын
I am not a smart man.
@ColinRichardson5 жыл бұрын
Best video EVER
@NanobyteOnline5 жыл бұрын
Conclusion: It works! Great.
@DenisTRUFFAUT5 жыл бұрын
Is there a general rule for tiling ? Or a JS function that gives the number of cores ?
@dassurma5 жыл бұрын
There is navigator.hardwareConcurrency
@ZacharyBerger5 жыл бұрын
@@dassurma but number of cores doesn't matter here, its the cache size
@FlowertwigDotOrg5 жыл бұрын
Actually useful info this time, thanks :)
@dassurma5 жыл бұрын
Do I sense shade?
@apo34apo345 жыл бұрын
simply woah ^^
@furetosan2 жыл бұрын
16:03 inaudible = AssemblyScript perhaps?
@b3rakesh115 жыл бұрын
Thank you
@GifCoDigital5 жыл бұрын
My head hurts now. lol :)
@Marseleon5 жыл бұрын
what was the second theory?
@b3rakesh115 жыл бұрын
Awesome
@nicolas.stucki3 жыл бұрын
Tiling improves performance because it helps keep the destination matrix in the cache.
@arkanciscan5 жыл бұрын
Does it rhyme with "kitchenette restorer"?
@recursiv5 жыл бұрын
That one doesn't run wasm so no.
@GottZ5 жыл бұрын
nice how they just described microsoft edge
@wmhilton-old5 жыл бұрын
Any sufficiently advanced technology...
@franklemanschik_de4 жыл бұрын
What happens here is simple you did the most basic optimization possible chunk your operations :)
@anonymouse70745 жыл бұрын
Name and shame!
@Benimation5 жыл бұрын
Nobody knows you're talking about IE..
@hobbyturystaSEO4 жыл бұрын
https 203 0:30
@pharmokan4 жыл бұрын
what a horrible direction web development is going in
@panstromek5 жыл бұрын
DOD comes to web ;)
@zaffarmughal54785 жыл бұрын
could be better with nice and decent. no need to be clown.