C++ cache locality and branch predictability

Рет қаралды 85,499

mCoding

Күн бұрын

Пікірлер: 150

@spaghettihair 3 жыл бұрын

Why does the ternary expression at the end not count as branching code?

@robinm7079 3 жыл бұрын

With a true branch, the statements that execute afterward the branch are different depending on the result of the condition being checked. So with the first example, the if statement, we could have added a bunch more lines of code in the first block (after p *= 2 * x), that wouldn't be executed if the condition was false. With the ternary expression, there's still a condition being checked, but it doesn't effect the statements afterward, the processor knows what's going to happen next and can start loading the next instruction, there's no uncertainty

@mCoding 3 жыл бұрын

What I was trying to say here was that the logic could be refactored where the contents of the if are the same with just a different constant. In that case, the ternary can be implemented with a value array instead of a branch. Note that the expression (x > thresh) is a comparison, not in itself a jump. So you can do arr[x > thresh] to read from index 0 to set factor to 1.5, or read from index 1 to set factor to 2.0, which is what clang does. Technically the compiler could use a branch or not use a branch for either the ternary or the if statement as the two are equivalent. I was presenting the ternary as a kind of visual stand-in to superficially show the branch being removed, which is of course not how C++ actually works. I see now that this storytelling only caused additional confusion so here is the actual assembly to keep in mind: godbolt.org/z/zoGWxcoeb. Notice the line mulsd xmm2, qword ptr [8*rdx + .LCPI0_1] that indexes into a 2-element in-binary array .LCPIO_1 instead of using a conditional jump. In the link, clang produces an identical binary whether you use the ternary or the if and it does not use a branch. On the same code gcc produces identical binary whether you use the ternary or the if, except it uses a branch for both.

@MithicSpirit 3 жыл бұрын

@@mCoding on gcc was it compiled with or without optimizations (-O2 flag)?

@mCoding 3 жыл бұрын

Both at -O3.

@sebastianmestre8971 3 жыл бұрын

The thing that branch prediction predicts is not the outcome of comparisions, but the destination of jumps. With a plain if statement, a naive compiler might emit a comparison followed by a conditional jump, whose destination the CPU will try to predict at runtime. With a simple ternary like that, a compiler might have an easier time compiling it down to a single conditional move instruction, which is not followed by any jumps. This way, there is no branch (no different destinations for jumps) in the loop, even though a conditional DOES occur. (there are other ways to compile stuff into conditional-jump-free code, like using look up tables, etc)

@Broniath 3 жыл бұрын

My first thought seeing this code: This is some weird ass python library, it almost looks like C++. Several seconds later: Oh.......

@QuantumHistorian 3 жыл бұрын

Always nice to get something a little closer to the metal than python from time to time

@ronyWeeb 3 жыл бұрын

It's a proper coding channel. Every other coding channel's I had seen are like 'pop culture' coding. But this is different. Great work man.

@mCoding 3 жыл бұрын

Thanks for your kind words!

@10e999 3 жыл бұрын

I completly agree with you. This is a real software engineering channel. Not recycled dogmatic advices.

@creator-link 2 жыл бұрын

What are some pop culture coding channels? I want to watch some

@tannerted 2 жыл бұрын

Agreed. Also look at Jonathan Blow, Molly Rocket, and Cherno. All good channels like this one.

@LennyTheSniper 3 жыл бұрын

I watched the video 10 times. Now, my code is going 1024 times faster. Thanks a lot man!

@szaszm_ 3 жыл бұрын

I love the simplicity of your sorting algorithm implementations.

@mCoding 3 жыл бұрын

Thanks, I stole all of them from various cppreference docs.

@Spiderboydk 3 жыл бұрын

Great point. When optimizing, *always* measure, people! Merely picking a particular algorithm, because it's supposed to be fast, is nothing more than superstition. You *can't* know without measuring.

@mCoding 3 жыл бұрын

Always always always. "If you aren't measuring, you don't actually care about performance."

@Yazan_Majdalawi 2 жыл бұрын

@@mCoding What are the tool/s you used in this video?

@mytech6779 3 жыл бұрын

C++20 has added the attribute keywords Likely and Unlikely so the compiler can do some of its own predictive branching optimizations. It make me wonder if CPUs will start to include an instruction to make even more use of this in a few years.

@Qbe_Root 2 жыл бұрын

PowerPC conditional branch instructions include hints for prediction, so beq+ means the branch is likely and beq- means it's unlikely. I assume other architectures have the same feature but I'm not familiar with them

@bronkolie 3 ай бұрын

@@Qbe_Root RISC-V specifies that programs should be compiled so that forward branches are most likely taken, and backward branches most likely not. So it's similar in a way

@philipp6589 3 жыл бұрын

Since you are doing Python and C++ topics, would be nice to see some video about developing C/C++ extension modules for Python. Is there any easy-to-use framework which allows just-in-time compilation of C++ code snippets inside of Python?

@mCoding 3 жыл бұрын

I don't know about JIT for C++ code, but there is Jax, which JIT compiles Python to XLA and offers a numpy-like interface. There's also Cython, which allows you to embed C and C++ into Python, but this is pre-compiled, not JIT.

@FobazF 3 жыл бұрын

Check Numba

@pranavnyavanandi9710 2 жыл бұрын

What is this just in time compilation? Do you mean an interpreter? Are there any other types? I thought compilation was just compilation.

@wChris_ 3 жыл бұрын

it would be nice to have shown the Compiler Explorer output for the branch predicting example, as a ternary is technically a branch that might get optimized into a conditional move

@mCoding 3 жыл бұрын

See the pinned comment!

@brunoais 3 жыл бұрын

9:20 In some AMD (I don't know the latest intel yet) CPU, the CPU actually computes both multiplications at the same time and then waits for the condition to finish to choose which one to use. I'm unsure if it was intoduced in Zen or only Zen2

@spicybaguette7706 2 жыл бұрын

I believe It's called speculative execution, but this leads to some very funky side-effects that can be abused by attacks such as spectre and meltdown

@Windeycastle 11 ай бұрын

I loved to see std::sort laughing at all other implementations. It's a good call to use the standard library when you can, as it's often more optimised then what you'd write yourself (with some exceptions).

@fabricio5p 3 жыл бұрын

Man, I wish you did more of these low level optimization videos. I don't work with Python but I subscribed just because of these types of videos

@mCoding 3 жыл бұрын

More on the way!

@lefttraces7207 3 жыл бұрын

I love how I only know two languages and these happen to be the only two you cover, haha. Great video as always!

@mCoding 3 жыл бұрын

Lucky both of us!

@ciCCapROSTi 2 жыл бұрын

What is missing is that YOU (the coder) yourself should prefer branchless code. Your example could have been written as an arithmetic formula. The compilers are smart, but we can help it. It's even better at optimizing algebra than optimizing branches.

@prashantrana1089 3 жыл бұрын

this guys offer so good content for free. this guy is awesome

@efenestration 11 ай бұрын

these will age like wine. thanks for the vid

@mauricio-poppe 3 жыл бұрын

Awesome, thank you for an educational video straight to the point.

@mCoding 3 жыл бұрын

Very welcome!

@dcorderoch 3 жыл бұрын

"depending on ... the time of day" at first I laughed, then I thought... depending on where the computer is, the temperature could affect the performance... nighttime might be better for performance

@robertbrummayer4908 3 жыл бұрын

Excellent video as usual, James!

@mCoding 3 жыл бұрын

Thanks again!

@johnjoyce 3 жыл бұрын

Nice soft intro to practical examples of performance tuning.

@yash1152 Жыл бұрын

1:11 > _"sorting small things is actually very common"_ i didnt know. thanks

@Swedishnbkongu 2 жыл бұрын

How much can the compiler do loop reordering, or arithmetic reordering (for more ILP), or reducing the number of branches? I've been taught to trust the optimizer more than myself. If i can confirm in compiler explorer that it is different when i micro optimize, maybe I'll do it

@aratakarkosh9588 3 жыл бұрын

Holy smokes I already knew it, first time on this channel :DD

@mCoding 3 жыл бұрын

You're becoming wiser!

@mk-mc8yx 2 жыл бұрын

In another article I read that we shouldn't be optimizing our code for branch-prediction. If that's true why should one be concerned about it, if so it seems like a good to know thing. If that's not correct, are there any good coding practices that optimize our code to take advantage of branch prediction ? Btw, I found out this channel just few days ago. Very informative and lots to learn for me from here. Thankyou very much.

@mCoding 2 жыл бұрын

Most articles that recommend this are not recommending that you avoid optimizing branches, but rather that there are most likely other more important things you should focus on first, and that you should profile your code to find out what is actually slow before blindly trying to optimize branches. Of course, in a high performance systems not otherwise limited by network or io, eventually heap allocations, branches mispredictions, and cache misses start to become the dominating factors that you need to optimize.

@kyyrabb1121 3 жыл бұрын

Please do more c++ videos!

@rotbjorn 3 жыл бұрын

Interesting video! What program did you use to view the graphs & plots and what library did you use to create them?

@mCoding 3 жыл бұрын

The plots are viewable in any modern browser. I made them in Python using plotly express. You can see the code for making the plots on github!

@harveyroper5526 3 жыл бұрын

great video, it would have been really useful if you had gone into a little more detail on why cache locality works, ie storing multiple addresses into a single cache line. other than that i really enjoyed seeing a high level application of this concept

@ProjectPhysX 3 жыл бұрын

9:12 how about the ternary operator ?: or bit masking?

@Yupppi 11 ай бұрын

And this is why for example's intel's algorithm is far better than writing loops by yourself (the I assume famous video of matrix multiplication algorithm taking it from like 2000 ms to 2 ms). I love branchless solutions too, they feel like magic.

@TDskirvin 3 жыл бұрын

very informative & succinct, thank you

@mCoding 3 жыл бұрын

You're very welcome!

@danielbelsky8836 Жыл бұрын

That was sooo clear!! Thank You!!

@kinershah464 Жыл бұрын

How does the type of cache (k-way or associative or k-eay set associative) play any role in performance? Or does it?

@Zooiest Жыл бұрын

Can you make a video about vectorization and/or multithreading?

@danielrhouck Жыл бұрын

1:00 I tried to read the example code for the length of the vector, but unless it’s 0 I don’t know the benchmarking library enough to have a guess. But I think I was on the right track and the vectors were too short.

@kivicode 3 жыл бұрын

Great video as always! Only one question: what editor theme are you using?

@mCoding 3 жыл бұрын

Monokai Pro Theme

@veritas7010 Жыл бұрын

Would be awesome if the use of compiler opt -O 0,1,2,3 was benched

@Khushpich 3 жыл бұрын

Great video as always.

@mCoding 3 жыл бұрын

I appreciate that!

@louisfghbvc 8 ай бұрын

impressive about cache locality...

@alexander3293 3 жыл бұрын

love this kind of videos!!

@arnabthakuria2243 2 ай бұрын

what font is this. Looks like Berkley Mono

@empireempire3545 2 жыл бұрын

Thats great, but how to benchmark where your code is actually having problems with cache?

@mCoding 2 жыл бұрын

Great question and unfortunately i don't have a great answer for you, at least not a definitive one. Generally you want to start with a higher level view of your application. Benchmark/profile the thing as a whole to get a feel for which parts are slow. If you have certain end to end tasks such as producing a complex value or rendering a frame, create benchmarks for those individual tasks you care about and profile them individually. This might involve engineering benchmark code into your application itself (turned on or off with a macro). The profile should help you determine at least broadly where any slowness is. Then you can start to hone in on those slow parts. Look at them and ask yourself WHY are they slow? What data is going in and out, how is it laid out in memory, are your branches predicatable, do calculations that happen near eachother live near each other in memory, and does as much data for any given calculation fit inside a cache line? You can also use tools like perf (linux only, but im sure windows has something similar). Your hardware has counters that track how many cache misses, branch mispredictipns, etc your program caused, and perd will report these numbers to you. You can then use perf and other such tools to get a feel for whether cache or branch predictions are the problem or something else, like a deeper architectural issue or inherently slow algorithm. In general there is no one answer and no one tool that works in all cases, it's just something you get better at with experience. Hope this helps.

@hievery9678 3 жыл бұрын

But is it always worth code readability? I mean, getting x2 performance is cool and all, but maintaing sometimes almost unreadable code because of such optimizations may be even worse. @mCoding May be you have some production experience with them? Have you used them there? How your coworkers reacted to such weird shananigans with code? Or you tried to use them only when readability wasn't suffering?

@mCoding 3 жыл бұрын

I would say that writing code with good cache locality and branch predictability rarely involves any degradation in readability. There are exceptions, particularly when performance is highly critical, e.g. in electronic trading, that you might design your whole application around these properties, causing readability to suffer. However, in those cases the readability of the code was already chosen to be second priority to the performance due to the nature of the code.

@timog7358 3 ай бұрын

amazing video

@BenjaminWheeler0510 3 жыл бұрын

I’m confused. What are all these curly braces??? Is this some weird Python fork?

@mCoding 3 жыл бұрын

Not sure if joking 🤔. This is C++.

@padnomnidprenon9672 3 жыл бұрын

Im not the only one who told to myself : "Wow new word auto and var declaration, what is this package". And then i saw static void

@adicide9070 2 жыл бұрын

@@mCoding yeah must have been joking. and a good one :D python fork :D:D:D

@pengchengwu447 9 ай бұрын

std::sort uses insertion sort for small sizes too I believe, why is it faster than yours?

@puddleglum5610 3 жыл бұрын

One note -- you should never optimize your code prematurely! A lot of times, it creates more bugs and development overhead than its worth. Optimization should be for when you're having performance issues. But, nonetheless, super great video! This brings me back to my old Computer Architecture classes in college.

@venkateshhariharan4341 3 жыл бұрын

great video bro, helped alot thank you so much ❤️❤️❤️

@mCoding 3 жыл бұрын

Glad to hear it!

@---hw6up 3 жыл бұрын

Why do you usr auto for type deduction in function declarations? Looks infernal

@mCoding 3 жыл бұрын

std::vector::iterator is too long to fit on screen

@---hw6up 3 жыл бұрын

@@mCoding Didn't realise the sorting was done using iterators. My bad

@venkateshhariharan4341 3 жыл бұрын

how these results will be affected when using with constexpr for ordered input?, probably get faster I don't know, I will be trying that soon. thanks for letting us know these amazing concepts with practical examples

@mCoding 3 жыл бұрын

constexpr could allow you to do some of the computation at compile-time, so that would be a big improvement over doing it at runtime, but depending on the use case this wouldn't necessarily be possible. As of C++20, vector is not constexpr. However, it could be possible with array.

@thiagosannafreiresilva4366 3 жыл бұрын

So normally how much will compilers flag and / or optimize this sort of thing? I'm just moving from interpreted to compiled languages and I don't quite have a grasp yet to the extent of optimization done by compilers.

@mCoding 3 жыл бұрын

Compilers are fairly decent at optimizing branches away when possible. A compiler will almost never do an optimization like exchanging ijk for ikj in a for loop though because without knowledge of the whole program (e.g. we were computing the sum of integers which is commutative) changing the order of things might change the behavior of the program. Basic things like removing unused variables and not recomputing things that it can prove haven't changed will usually be done. I highly recommend Prof Leiserson's course on performance engineering in C (available on youtube, it's mit ocw) if you want to learn more. Also any CppCon talk.

@thiagosannafreiresilva4366 3 жыл бұрын

@@mCoding thanks!

@szaszm_ 3 жыл бұрын

@@mCoding GCC can perform the ijk -> ikj exchange with -floop-interchange (enabled at -O3) or with -floop-nest-optimize (which includes even more nested loop optimizations) if compiled with --with-isl.

@marcusmors8485 3 жыл бұрын

Good video! Thanks

@mCoding 3 жыл бұрын

Glad you liked it!

@aadityabhetuwal5990 2 жыл бұрын

Which IDE do you use?

@lightify97 3 жыл бұрын

Please make a C++ course in the same style.

@ItzSwiftyBoy_Gaming 2 жыл бұрын

Damn, you're coding genius.

@10e999 3 жыл бұрын

10:22 How is the line 11 not a branch?

@Yazan_Majdalawi 2 жыл бұрын

See the pinned comment :)

@colin398 Жыл бұрын

Pretty sure std::sort defaults to some type of insertion sort for small sizes

@echoptic775 3 жыл бұрын

Ik curious what are ur thought about rust lang?

@mCoding 3 жыл бұрын

Haven't tried it but I've heard it has a lot of nice properties

@mytech6779 3 жыл бұрын

Only 8M of cache? My 386 had 4M of RAM. (About $1000 of RAM adjusted for inflation.) And it could run MSwindows.

@irwainnornossa4605 Жыл бұрын

{ belongs to a new line! We say „2 times“, not „2 x“!

@aadithyavarma 3 жыл бұрын

Hi, I love your video. Are these principles applicable for Python also?

@mCoding 3 жыл бұрын

Yes but to a much lesser degree. Because every object in Python is a pointer, practically every access of an object behaves like the random order access example and is a cache miss, which is one of the reasons for Python's general slowness.

@switchblade6226 Жыл бұрын

@@mCoding WARNING: Resurrecting old comment section to expand a bit on that :P Modern CPUs do preform unconditional branch target prediction via multi-level prediction tables (more info can be seen in Agner Fog's optimization papers), and those do help with dynamic languages' pointer hell, but obviously only to a degree. Also, hypothetically, if you JIT python you would be able to potentially take advantage of conditional branch prediction too. And if you have an extremely sophisticated JIT that would flatten and simplify python structures, that can improve cache locality. But that would require some genie to make a miracle JIT happen lol

@Pa_Nic 3 жыл бұрын

If you watch this video 1000 times, you can break RSA-4096

@mCoding 3 жыл бұрын

Shhhh :)

@andrewf8366 3 жыл бұрын

Why didn't you plot more points in your comparison? You mentioned that the largest "kink" was at 8MB... but you also didn't have another point until 16, so it's not as good a demonstration as could be used if you had more points to show.

@TheJackal917 3 жыл бұрын

Nice japanese explanations, man. Cool.

@netfri25 3 жыл бұрын

what ide do you use?

@mCoding 3 жыл бұрын

This is CLion.

@netfri25 3 жыл бұрын

@@mCoding thanks a lot!

@ssspud3545 3 жыл бұрын

I didn't even notice it was 7 hours long until the closing notes... I was recommended this after never being recommended anything from pyrocynical in over a year while being subscribed. I have to admit, these documentary type videos that review or cover a TV show or piece of media are extremely popular now and basically equal free watch time. I mean you have all the work cut out for you. Just watch the show and have a comprehensive knowledge about it and writing skills; which you can learn in college or even a US highschool(I meant college in Europe and in the US because community colleges are basically the same concept in both regions).

@mCoding 3 жыл бұрын

7 hours long? I think you might have commented on the wrong video.

@falxie_ 3 жыл бұрын

I really should probably go to college so I can learn techniques like this

@chainingsolid 3 жыл бұрын

Just use some of the terms from this video and google. Much can be learned online and testing stuff your self. If your curious message me and I'll dig up some info (aka more youtube videos).

@ThePC007 3 жыл бұрын

10:29 Ternary operators are just syntactic sugar for an if-else statement. They don't eliminate branching at all. That said, given that booleans are just 0s and 1s this code could probably be optimized to a branchless multiplication, which is what I believe eliminated the branching in this case.

@mCoding 3 жыл бұрын

See the pinned comment where I explain what I was trying to suggest here!

@subnumeric 2 жыл бұрын

IEEE 754 floating point numbers are NOT commutative! You've used templates in the get_sum function, so depending on the data this implementation might return very incorrect sums. This is in fact what the unsafe -Ofast optimizations do (in gcc), as in some cases it doesn't produce significant errors and the speedup is worth it. I know averaging isn't really the focus of this video, but I felt like i had to mention it.

@mCoding 2 жыл бұрын

Yes I'm aware of the that these implementations are not general purpose production ready. Indeed, even implementing the average of 2 numbers correctly is a monumental task, so I hope you unserstand why I wouldn't want to attempt this kind of thing unless it is was a key point of the video. But yes, always watch out for noncommitativity hiding in traditionally commutative operators like +!

@subnumeric 2 жыл бұрын

@@mCoding I do understand, it's just that these sort of "details" get you thrown out of university exams, and much worse, neuter rocket defense systems and cost people their lives. I simply felt obligated to mention it, which I in turn hope you understand. Alas, this is indeed a confusing university-level topic and you have to simplify it into a 10 minute video, so I suppose details being cut is to be expected. Please, do not misunderstand my capitalized "NOT" and the exclamation mark as some harsh critique, but rather as a "passionately highlighted" footnote in a hefty textbook. I appreciate the effort you've put into you video, which is clearly demonstrated by your response to a comment on a several years old video. =)

@gregorymorse8423 3 жыл бұрын

?: is branching so is && and ||. int c=x > thresh; 2&c|1.5&!c would be the branch less form. Which is how the compiler would optimize it effectively. Also small sizes being faster for less complex algorithm is a known constant threshold which is hard to precisely compute given the complexity of the state machine in the processor. You engineer micro optimizations based on data sizes. Merge sort is not as fast as an in place merge sort, and you didnt specify this detail

@mCoding 3 жыл бұрын

See the pinned comment for what I was trying to convey with the ternary. I'm also curious why you would call a 2x performance difference that I demonstrated across multiple different problems a micro optimization.

@gregorymorse8423 3 жыл бұрын

@@mCoding yes I suppose table vs. Bitwise logic is debatable depending on exactly what is being optimized. But if there are side effects on the true or false expression, control flow required!!

@axe863 Жыл бұрын

Cache me outside .... really bro

@CAMOBAP795 3 жыл бұрын

Interesting should I thing about Branch Prediction when I work with High-level languages (Python, Java)? P.S. Few years ago I saw some presentation about "cache locality aware" coding in Java