Why does the ternary expression at the end not count as branching code?
@robinm70793 жыл бұрын
With a true branch, the statements that execute afterward the branch are different depending on the result of the condition being checked. So with the first example, the if statement, we could have added a bunch more lines of code in the first block (after p *= 2 * x), that wouldn't be executed if the condition was false. With the ternary expression, there's still a condition being checked, but it doesn't effect the statements afterward, the processor knows what's going to happen next and can start loading the next instruction, there's no uncertainty
@mCoding3 жыл бұрын
What I was trying to say here was that the logic could be refactored where the contents of the if are the same with just a different constant. In that case, the ternary can be implemented with a value array instead of a branch. Note that the expression (x > thresh) is a comparison, not in itself a jump. So you can do arr[x > thresh] to read from index 0 to set factor to 1.5, or read from index 1 to set factor to 2.0, which is what clang does. Technically the compiler could use a branch or not use a branch for either the ternary or the if statement as the two are equivalent. I was presenting the ternary as a kind of visual stand-in to superficially show the branch being removed, which is of course not how C++ actually works. I see now that this storytelling only caused additional confusion so here is the actual assembly to keep in mind: godbolt.org/z/zoGWxcoeb. Notice the line mulsd xmm2, qword ptr [8*rdx + .LCPI0_1] that indexes into a 2-element in-binary array .LCPIO_1 instead of using a conditional jump. In the link, clang produces an identical binary whether you use the ternary or the if and it does not use a branch. On the same code gcc produces identical binary whether you use the ternary or the if, except it uses a branch for both.
@MithicSpirit3 жыл бұрын
@@mCoding on gcc was it compiled with or without optimizations (-O2 flag)?
@mCoding3 жыл бұрын
Both at -O3.
@sebastianmestre89713 жыл бұрын
The thing that branch prediction predicts is not the outcome of comparisions, but the destination of jumps. With a plain if statement, a naive compiler might emit a comparison followed by a conditional jump, whose destination the CPU will try to predict at runtime. With a simple ternary like that, a compiler might have an easier time compiling it down to a single conditional move instruction, which is not followed by any jumps. This way, there is no branch (no different destinations for jumps) in the loop, even though a conditional DOES occur. (there are other ways to compile stuff into conditional-jump-free code, like using look up tables, etc)
@Broniath3 жыл бұрын
My first thought seeing this code: This is some weird ass python library, it almost looks like C++. Several seconds later: Oh.......
@QuantumHistorian3 жыл бұрын
Always nice to get something a little closer to the metal than python from time to time
@ronyWeeb3 жыл бұрын
It's a proper coding channel. Every other coding channel's I had seen are like 'pop culture' coding. But this is different. Great work man.
@mCoding3 жыл бұрын
Thanks for your kind words!
@10e9993 жыл бұрын
I completly agree with you. This is a real software engineering channel. Not recycled dogmatic advices.
@creator-link2 жыл бұрын
What are some pop culture coding channels? I want to watch some
@tannerted2 жыл бұрын
Agreed. Also look at Jonathan Blow, Molly Rocket, and Cherno. All good channels like this one.
@LennyTheSniper3 жыл бұрын
I watched the video 10 times. Now, my code is going 1024 times faster. Thanks a lot man!
@szaszm_3 жыл бұрын
I love the simplicity of your sorting algorithm implementations.
@mCoding3 жыл бұрын
Thanks, I stole all of them from various cppreference docs.
@Spiderboydk3 жыл бұрын
Great point. When optimizing, *always* measure, people! Merely picking a particular algorithm, because it's supposed to be fast, is nothing more than superstition. You *can't* know without measuring.
@mCoding3 жыл бұрын
Always always always. "If you aren't measuring, you don't actually care about performance."
@Yazan_Majdalawi2 жыл бұрын
@@mCoding What are the tool/s you used in this video?
@mytech67793 жыл бұрын
C++20 has added the attribute keywords Likely and Unlikely so the compiler can do some of its own predictive branching optimizations. It make me wonder if CPUs will start to include an instruction to make even more use of this in a few years.
@Qbe_Root2 жыл бұрын
PowerPC conditional branch instructions include hints for prediction, so beq+ means the branch is likely and beq- means it's unlikely. I assume other architectures have the same feature but I'm not familiar with them
@bronkolie3 ай бұрын
@@Qbe_Root RISC-V specifies that programs should be compiled so that forward branches are most likely taken, and backward branches most likely not. So it's similar in a way
@philipp65893 жыл бұрын
Since you are doing Python and C++ topics, would be nice to see some video about developing C/C++ extension modules for Python. Is there any easy-to-use framework which allows just-in-time compilation of C++ code snippets inside of Python?
@mCoding3 жыл бұрын
I don't know about JIT for C++ code, but there is Jax, which JIT compiles Python to XLA and offers a numpy-like interface. There's also Cython, which allows you to embed C and C++ into Python, but this is pre-compiled, not JIT.
@FobazF3 жыл бұрын
Check Numba
@pranavnyavanandi97102 жыл бұрын
What is this just in time compilation? Do you mean an interpreter? Are there any other types? I thought compilation was just compilation.
@wChris_3 жыл бұрын
it would be nice to have shown the Compiler Explorer output for the branch predicting example, as a ternary is technically a branch that might get optimized into a conditional move
@mCoding3 жыл бұрын
See the pinned comment!
@brunoais3 жыл бұрын
9:20 In some AMD (I don't know the latest intel yet) CPU, the CPU actually computes both multiplications at the same time and then waits for the condition to finish to choose which one to use. I'm unsure if it was intoduced in Zen or only Zen2
@spicybaguette77062 жыл бұрын
I believe It's called speculative execution, but this leads to some very funky side-effects that can be abused by attacks such as spectre and meltdown
@Windeycastle11 ай бұрын
I loved to see std::sort laughing at all other implementations. It's a good call to use the standard library when you can, as it's often more optimised then what you'd write yourself (with some exceptions).
@fabricio5p3 жыл бұрын
Man, I wish you did more of these low level optimization videos. I don't work with Python but I subscribed just because of these types of videos
@mCoding3 жыл бұрын
More on the way!
@lefttraces72073 жыл бұрын
I love how I only know two languages and these happen to be the only two you cover, haha. Great video as always!
@mCoding3 жыл бұрын
Lucky both of us!
@ciCCapROSTi2 жыл бұрын
What is missing is that YOU (the coder) yourself should prefer branchless code. Your example could have been written as an arithmetic formula. The compilers are smart, but we can help it. It's even better at optimizing algebra than optimizing branches.
@prashantrana10893 жыл бұрын
this guys offer so good content for free. this guy is awesome
@efenestration11 ай бұрын
these will age like wine. thanks for the vid
@mauricio-poppe3 жыл бұрын
Awesome, thank you for an educational video straight to the point.
@mCoding3 жыл бұрын
Very welcome!
@dcorderoch3 жыл бұрын
"depending on ... the time of day" at first I laughed, then I thought... depending on where the computer is, the temperature could affect the performance... nighttime might be better for performance
@robertbrummayer49083 жыл бұрын
Excellent video as usual, James!
@mCoding3 жыл бұрын
Thanks again!
@johnjoyce3 жыл бұрын
Nice soft intro to practical examples of performance tuning.
@yash1152 Жыл бұрын
1:11 > _"sorting small things is actually very common"_ i didnt know. thanks
@Swedishnbkongu2 жыл бұрын
How much can the compiler do loop reordering, or arithmetic reordering (for more ILP), or reducing the number of branches? I've been taught to trust the optimizer more than myself. If i can confirm in compiler explorer that it is different when i micro optimize, maybe I'll do it
@aratakarkosh95883 жыл бұрын
Holy smokes I already knew it, first time on this channel :DD
@mCoding3 жыл бұрын
You're becoming wiser!
@mk-mc8yx2 жыл бұрын
In another article I read that we shouldn't be optimizing our code for branch-prediction. If that's true why should one be concerned about it, if so it seems like a good to know thing. If that's not correct, are there any good coding practices that optimize our code to take advantage of branch prediction ? Btw, I found out this channel just few days ago. Very informative and lots to learn for me from here. Thankyou very much.
@mCoding2 жыл бұрын
Most articles that recommend this are not recommending that you avoid optimizing branches, but rather that there are most likely other more important things you should focus on first, and that you should profile your code to find out what is actually slow before blindly trying to optimize branches. Of course, in a high performance systems not otherwise limited by network or io, eventually heap allocations, branches mispredictions, and cache misses start to become the dominating factors that you need to optimize.
@kyyrabb11213 жыл бұрын
Please do more c++ videos!
@rotbjorn3 жыл бұрын
Interesting video! What program did you use to view the graphs & plots and what library did you use to create them?
@mCoding3 жыл бұрын
The plots are viewable in any modern browser. I made them in Python using plotly express. You can see the code for making the plots on github!
@harveyroper55263 жыл бұрын
great video, it would have been really useful if you had gone into a little more detail on why cache locality works, ie storing multiple addresses into a single cache line. other than that i really enjoyed seeing a high level application of this concept
@ProjectPhysX3 жыл бұрын
9:12 how about the ternary operator ?: or bit masking?
@Yupppi11 ай бұрын
And this is why for example's intel's algorithm is far better than writing loops by yourself (the I assume famous video of matrix multiplication algorithm taking it from like 2000 ms to 2 ms). I love branchless solutions too, they feel like magic.
@TDskirvin3 жыл бұрын
very informative & succinct, thank you
@mCoding3 жыл бұрын
You're very welcome!
@danielbelsky8836 Жыл бұрын
That was sooo clear!! Thank You!!
@kinershah464 Жыл бұрын
How does the type of cache (k-way or associative or k-eay set associative) play any role in performance? Or does it?
@Zooiest Жыл бұрын
Can you make a video about vectorization and/or multithreading?
@danielrhouck Жыл бұрын
1:00 I tried to read the example code for the length of the vector, but unless it’s 0 I don’t know the benchmarking library enough to have a guess. But I think I was on the right track and the vectors were too short.
@kivicode3 жыл бұрын
Great video as always! Only one question: what editor theme are you using?
@mCoding3 жыл бұрын
Monokai Pro Theme
@veritas7010 Жыл бұрын
Would be awesome if the use of compiler opt -O 0,1,2,3 was benched
@Khushpich3 жыл бұрын
Great video as always.
@mCoding3 жыл бұрын
I appreciate that!
@louisfghbvc8 ай бұрын
impressive about cache locality...
@alexander32933 жыл бұрын
love this kind of videos!!
@arnabthakuria22432 ай бұрын
what font is this. Looks like Berkley Mono
@empireempire35452 жыл бұрын
Thats great, but how to benchmark where your code is actually having problems with cache?
@mCoding2 жыл бұрын
Great question and unfortunately i don't have a great answer for you, at least not a definitive one. Generally you want to start with a higher level view of your application. Benchmark/profile the thing as a whole to get a feel for which parts are slow. If you have certain end to end tasks such as producing a complex value or rendering a frame, create benchmarks for those individual tasks you care about and profile them individually. This might involve engineering benchmark code into your application itself (turned on or off with a macro). The profile should help you determine at least broadly where any slowness is. Then you can start to hone in on those slow parts. Look at them and ask yourself WHY are they slow? What data is going in and out, how is it laid out in memory, are your branches predicatable, do calculations that happen near eachother live near each other in memory, and does as much data for any given calculation fit inside a cache line? You can also use tools like perf (linux only, but im sure windows has something similar). Your hardware has counters that track how many cache misses, branch mispredictipns, etc your program caused, and perd will report these numbers to you. You can then use perf and other such tools to get a feel for whether cache or branch predictions are the problem or something else, like a deeper architectural issue or inherently slow algorithm. In general there is no one answer and no one tool that works in all cases, it's just something you get better at with experience. Hope this helps.
@hievery96783 жыл бұрын
But is it always worth code readability? I mean, getting x2 performance is cool and all, but maintaing sometimes almost unreadable code because of such optimizations may be even worse. @mCoding May be you have some production experience with them? Have you used them there? How your coworkers reacted to such weird shananigans with code? Or you tried to use them only when readability wasn't suffering?
@mCoding3 жыл бұрын
I would say that writing code with good cache locality and branch predictability rarely involves any degradation in readability. There are exceptions, particularly when performance is highly critical, e.g. in electronic trading, that you might design your whole application around these properties, causing readability to suffer. However, in those cases the readability of the code was already chosen to be second priority to the performance due to the nature of the code.
@timog73583 ай бұрын
amazing video
@BenjaminWheeler05103 жыл бұрын
I’m confused. What are all these curly braces??? Is this some weird Python fork?
@mCoding3 жыл бұрын
Not sure if joking 🤔. This is C++.
@padnomnidprenon96723 жыл бұрын
Im not the only one who told to myself : "Wow new word auto and var declaration, what is this package". And then i saw static void
@adicide90702 жыл бұрын
@@mCoding yeah must have been joking. and a good one :D python fork :D:D:D
@pengchengwu4479 ай бұрын
std::sort uses insertion sort for small sizes too I believe, why is it faster than yours?
@puddleglum56103 жыл бұрын
One note -- you should never optimize your code prematurely! A lot of times, it creates more bugs and development overhead than its worth. Optimization should be for when you're having performance issues. But, nonetheless, super great video! This brings me back to my old Computer Architecture classes in college.
@venkateshhariharan43413 жыл бұрын
great video bro, helped alot thank you so much ❤️❤️❤️
@mCoding3 жыл бұрын
Glad to hear it!
@---hw6up3 жыл бұрын
Why do you usr auto for type deduction in function declarations? Looks infernal
@mCoding3 жыл бұрын
std::vector::iterator is too long to fit on screen
@---hw6up3 жыл бұрын
@@mCoding Didn't realise the sorting was done using iterators. My bad
@venkateshhariharan43413 жыл бұрын
how these results will be affected when using with constexpr for ordered input?, probably get faster I don't know, I will be trying that soon. thanks for letting us know these amazing concepts with practical examples
@mCoding3 жыл бұрын
constexpr could allow you to do some of the computation at compile-time, so that would be a big improvement over doing it at runtime, but depending on the use case this wouldn't necessarily be possible. As of C++20, vector is not constexpr. However, it could be possible with array.
@thiagosannafreiresilva43663 жыл бұрын
So normally how much will compilers flag and / or optimize this sort of thing? I'm just moving from interpreted to compiled languages and I don't quite have a grasp yet to the extent of optimization done by compilers.
@mCoding3 жыл бұрын
Compilers are fairly decent at optimizing branches away when possible. A compiler will almost never do an optimization like exchanging ijk for ikj in a for loop though because without knowledge of the whole program (e.g. we were computing the sum of integers which is commutative) changing the order of things might change the behavior of the program. Basic things like removing unused variables and not recomputing things that it can prove haven't changed will usually be done. I highly recommend Prof Leiserson's course on performance engineering in C (available on youtube, it's mit ocw) if you want to learn more. Also any CppCon talk.
@thiagosannafreiresilva43663 жыл бұрын
@@mCoding thanks!
@szaszm_3 жыл бұрын
@@mCoding GCC can perform the ijk -> ikj exchange with -floop-interchange (enabled at -O3) or with -floop-nest-optimize (which includes even more nested loop optimizations) if compiled with --with-isl.
@marcusmors84853 жыл бұрын
Good video! Thanks
@mCoding3 жыл бұрын
Glad you liked it!
@aadityabhetuwal59902 жыл бұрын
Which IDE do you use?
@lightify973 жыл бұрын
Please make a C++ course in the same style.
@ItzSwiftyBoy_Gaming2 жыл бұрын
Damn, you're coding genius.
@10e9993 жыл бұрын
10:22 How is the line 11 not a branch?
@Yazan_Majdalawi2 жыл бұрын
See the pinned comment :)
@colin398 Жыл бұрын
Pretty sure std::sort defaults to some type of insertion sort for small sizes
@echoptic7753 жыл бұрын
Ik curious what are ur thought about rust lang?
@mCoding3 жыл бұрын
Haven't tried it but I've heard it has a lot of nice properties
@mytech67793 жыл бұрын
Only 8M of cache? My 386 had 4M of RAM. (About $1000 of RAM adjusted for inflation.) And it could run MSwindows.
@irwainnornossa4605 Жыл бұрын
{ belongs to a new line! We say „2 times“, not „2 x“!
@aadithyavarma3 жыл бұрын
Hi, I love your video. Are these principles applicable for Python also?
@mCoding3 жыл бұрын
Yes but to a much lesser degree. Because every object in Python is a pointer, practically every access of an object behaves like the random order access example and is a cache miss, which is one of the reasons for Python's general slowness.
@switchblade6226 Жыл бұрын
@@mCoding WARNING: Resurrecting old comment section to expand a bit on that :P Modern CPUs do preform unconditional branch target prediction via multi-level prediction tables (more info can be seen in Agner Fog's optimization papers), and those do help with dynamic languages' pointer hell, but obviously only to a degree. Also, hypothetically, if you JIT python you would be able to potentially take advantage of conditional branch prediction too. And if you have an extremely sophisticated JIT that would flatten and simplify python structures, that can improve cache locality. But that would require some genie to make a miracle JIT happen lol
@Pa_Nic3 жыл бұрын
If you watch this video 1000 times, you can break RSA-4096
@mCoding3 жыл бұрын
Shhhh :)
@andrewf83663 жыл бұрын
Why didn't you plot more points in your comparison? You mentioned that the largest "kink" was at 8MB... but you also didn't have another point until 16, so it's not as good a demonstration as could be used if you had more points to show.
@TheJackal9173 жыл бұрын
Nice japanese explanations, man. Cool.
@netfri253 жыл бұрын
what ide do you use?
@mCoding3 жыл бұрын
This is CLion.
@netfri253 жыл бұрын
@@mCoding thanks a lot!
@ssspud35453 жыл бұрын
I didn't even notice it was 7 hours long until the closing notes... I was recommended this after never being recommended anything from pyrocynical in over a year while being subscribed. I have to admit, these documentary type videos that review or cover a TV show or piece of media are extremely popular now and basically equal free watch time. I mean you have all the work cut out for you. Just watch the show and have a comprehensive knowledge about it and writing skills; which you can learn in college or even a US highschool(I meant college in Europe and in the US because community colleges are basically the same concept in both regions).
@mCoding3 жыл бұрын
7 hours long? I think you might have commented on the wrong video.
@falxie_3 жыл бұрын
I really should probably go to college so I can learn techniques like this
@chainingsolid3 жыл бұрын
Just use some of the terms from this video and google. Much can be learned online and testing stuff your self. If your curious message me and I'll dig up some info (aka more youtube videos).
@ThePC0073 жыл бұрын
10:29 Ternary operators are just syntactic sugar for an if-else statement. They don't eliminate branching at all. That said, given that booleans are just 0s and 1s this code could probably be optimized to a branchless multiplication, which is what I believe eliminated the branching in this case.
@mCoding3 жыл бұрын
See the pinned comment where I explain what I was trying to suggest here!
@subnumeric2 жыл бұрын
IEEE 754 floating point numbers are NOT commutative! You've used templates in the get_sum function, so depending on the data this implementation might return very incorrect sums. This is in fact what the unsafe -Ofast optimizations do (in gcc), as in some cases it doesn't produce significant errors and the speedup is worth it. I know averaging isn't really the focus of this video, but I felt like i had to mention it.
@mCoding2 жыл бұрын
Yes I'm aware of the that these implementations are not general purpose production ready. Indeed, even implementing the average of 2 numbers correctly is a monumental task, so I hope you unserstand why I wouldn't want to attempt this kind of thing unless it is was a key point of the video. But yes, always watch out for noncommitativity hiding in traditionally commutative operators like +!
@subnumeric2 жыл бұрын
@@mCoding I do understand, it's just that these sort of "details" get you thrown out of university exams, and much worse, neuter rocket defense systems and cost people their lives. I simply felt obligated to mention it, which I in turn hope you understand. Alas, this is indeed a confusing university-level topic and you have to simplify it into a 10 minute video, so I suppose details being cut is to be expected. Please, do not misunderstand my capitalized "NOT" and the exclamation mark as some harsh critique, but rather as a "passionately highlighted" footnote in a hefty textbook. I appreciate the effort you've put into you video, which is clearly demonstrated by your response to a comment on a several years old video. =)
@gregorymorse84233 жыл бұрын
?: is branching so is && and ||. int c=x > thresh; 2&c|1.5&!c would be the branch less form. Which is how the compiler would optimize it effectively. Also small sizes being faster for less complex algorithm is a known constant threshold which is hard to precisely compute given the complexity of the state machine in the processor. You engineer micro optimizations based on data sizes. Merge sort is not as fast as an in place merge sort, and you didnt specify this detail
@mCoding3 жыл бұрын
See the pinned comment for what I was trying to convey with the ternary. I'm also curious why you would call a 2x performance difference that I demonstrated across multiple different problems a micro optimization.
@gregorymorse84233 жыл бұрын
@@mCoding yes I suppose table vs. Bitwise logic is debatable depending on exactly what is being optimized. But if there are side effects on the true or false expression, control flow required!!
@axe863 Жыл бұрын
Cache me outside .... really bro
@CAMOBAP7953 жыл бұрын
Interesting should I thing about Branch Prediction when I work with High-level languages (Python, Java)? P.S. Few years ago I saw some presentation about "cache locality aware" coding in Java
@coder4362 жыл бұрын
7:30 Actually no
@trag1czny3 жыл бұрын
discord gang 🤙🤙
@lunarmagpie43053 жыл бұрын
Discord gang
@Miparwo21 күн бұрын
I downvoted because you didn't even bothered to draw what you are trying to explain.