Branchless Programming: Why "If" is Sloowww... and what we can do about it!

Рет қаралды 1,529,196

Күн бұрын

Пікірлер: 3 500

@jeremyseay 4 жыл бұрын

I love this, but I'm reminded of Kernighan's law: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

@microcolonel 4 жыл бұрын

Brian misses the big-brain move here: just rewrite your code instead of debugging it.

@funkyflames7430 4 жыл бұрын

Work twice as long. Checkmate

@rickarmbruster8788 4 жыл бұрын

@this senator sometimes rewriting is a decent opportunity that you see your brain varies in qualaty of output each time :D

@mr_confuse 4 жыл бұрын

@@rickarmbruster8788 I notice that in most of my projects, I do the basic structure then add more to it come back to the basic structure see what a mess it is CTRL + A delete and rewrite it....

@rickarmbruster8788 4 жыл бұрын

@@mr_confuse u will get a hang of nice design eventually ^^

@szymoniak75 4 жыл бұрын

DISCLAIMER FOR ALL BEGINNER PROGRAMMERS! Do not attempt it at home! Unless you really, really have to. In most cases you won't outsmart your compiler. For high level programming, you should focus on keeping your code clean and easy to read and maintain. Usually you'll notice speedup in your code by using correct algorithms and data structures, not by doing micro-optimizations. And if speed is your concern - benchmark first! Do not try optimizing code that may not need it.

@alexreeve 4 жыл бұрын

Well, doesnt the first example in the video show exactly that? I mean that goes for all optimization methods, only use them when you have verified that you actually need them, (and then verify again) :)

@davidm.johnston8994 4 жыл бұрын

Thank you for saying that. That's what I was thinking, I was very intrigued by all of this additional complexity (from a readability standpoint)... Wondering if I could really outsmart a compiler by doing these techniques "by hand".

@Merthalophor 4 жыл бұрын

Donald Knuth: Premature optimization is the root of all evil. This technique is only valuable in highly specific situations. If you're not sure if it applies to you, it doesn't.

@brianbarefootburns3521 4 жыл бұрын

This whole video is off limits for beginner programmers and comedy for experienced programmers. That piece of paper under his wired USB mouse is another telling sign.

@LeTonyJr1 4 жыл бұрын

Yes, this. How many times the code needs to run is a big consideration. If a function needs to loop through a dataset that numbers in the billions, it’s a good sign that micro-improvements like these might be of interest.

@Mackinstyle 3 жыл бұрын

I LOVE how you introduced the concept by showing a way to manually optimize and how it actually fails to be better than the straightforward if-based version. That's such an important point to share up front any time we talk about clever optimization.

@edwardmacnab354 2 жыл бұрын

except for his last example which was light years faster as hand coded

@damaomiX 2 жыл бұрын

It just applies to c/cpp, for most other languages that don't have these kind of tricky optimizations, branchless is always faster (jit languages have jit optimization but in real scenarios the branch prediction miss a lot). Not to mention that in the video there're circumstances that branchless version did out performs branched version.

@ocoolwow Жыл бұрын

@@edwardmacnab354oh you poor fool

@go-away-5555 Жыл бұрын

⁠branchless generally applies when you're talking about assembly. C/C++ compiles to assembly, but so do many other languages and some languages like C# can be compiled to assembly even though typically they are compiled into a virtual-machine bytecode (using il2cpp and then a c++ compiler). For a purely interpreted language like python, branch prediction and branchless code barely comes into play for the developer. When do any operation in python, the actual python runtime is almost always running dozens or hundreds of lines of C (compiled to assembly) that aren't related to your code in any way except for ensuring that you're requesting valid operations. And that itself usually requires tons of branching. If you add 2 numbers, the python runtime treats those as full objects. First it checks that it's not operating on null objects, it then ensures the wrapped value is a number or a type with the addition operator defined. If it is a number, it will call the C function it has for addition. If it's not a number type, it will evaluate the PyFunctionObject representing the python code of the function, which itself has like 100 lines of C just dedicated to passing the argument in properly as a Python variable. The most optimized way to write Python code is to minimize your time spent evaluating lines of Python code. When running a purely interpreted language like Python or Ruby, the actual operations you request to do are about 1% of the actual work going on in the CPU. For a JIT'd language like Java, Javascript, regular C#, it really depends on how good the bytecode compiler and the JIT are. There are 2 steps here that can potentially optimize your code, the first compilation step to bytecode can definitely do many of the same optimizations a C++ compiler can, and the JIT can help pick the correct cpu instruction that can handle several equivalent lines of a higher level language (like the first function example). The Java compiler and JIT are quite good, sometimes better than current C or C++ compilers. At least in synthetic benchmarks of calculating prime numbers, sorting arrays, or etc repeatedly. For example, Java is capable of generating code that uses SIMD or AVX like in the final example in the video. Basically it doesn't work in a purely interpreted language because the runtime system itself has many branches. And for other languages, it depends on the code. If you're writing a branchless algorithm, it wouldn't hurt to benchmark it against the "naive" branching version of that algorithm. And benchmarking these system specific optimizations may even produce different results on different CPU architectures. For example, some DSP devices have no branch prediction because they need to have strict guarantees on consistent realtime delivery. It's a slower architecture on average, but you can calculate a guaranteed number of clock cycles some sequence of instructions will take. A branch predicting CPU can't make those same guarantees.

@asaskald Жыл бұрын

I'm sure he just LOVEs it too. We all LOVE it.

@Caesim9 4 жыл бұрын

I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you ise conditional statements -> sometimes your code runs faster sometimes slower (specific branches just are faster/ slower), the attacker can get info on your private key. So all cryptographic sound functions must use branchless programming. Very interesting topic

@JayJayYUP 4 жыл бұрын

Do they use branchless for the most part? Or are they in violation of this?

@georgionic 4 жыл бұрын

They could use constant time operations

@flyingsayon 4 жыл бұрын

Why would you do that? Can't you use a timer to achieve the same result (not disclosing the computation time)?

@thenameipicked 4 жыл бұрын

@sayon It is very difficult to actually accomplish this in practice. If your delay is a random value, then you can repeat your request multiple times and use the average time. Delaying so your total time is fixed is really difficult: You have to know in advance the maximum amount of time your algorithm could possibly take, and even with that knowledge, wait() is complicated and imprecise, and it is definitely possible that what you did before calling wait() will still affect the total time.

@flyingsayon 4 жыл бұрын

@@thenameipicked i am talking about fixed delay, yes, not a random value. Besides how are you hoping to achieve fixed time with just balancing the computations when the process can be interrupted at any time? We usually have preemptive multitasking in our systems.

@RyanTosh 4 жыл бұрын

When a normal person says something is slow: 5-10 minutes, maybe an hour or two When a programmer says something is slow: 2 nanoseconds

@AntonioNoack 4 жыл бұрын

well, normal people wouldn't think about doing it a billion times ;)

@soblakismyself 4 жыл бұрын

When those instructions run multiple times throughout a code that 2ns starts to makes a difference.

@spicybaguette7706 4 жыл бұрын

When you get a cache miss: This little manoeuvre is going to cost us 51 nanoseconds

@nyrahl593 4 жыл бұрын

@@spicybaguette7706 for an android that is an eternity...

@SamGarcia 4 жыл бұрын

those build up... and then you have Internet Explorer

@kylek.3689 3 жыл бұрын

Cryptographic libraries also use branchless programming a lot. This is to prevent timing side channel attacks, where the timing of an operation can provide a hint as to what was encrypted.

@barmetler 4 жыл бұрын

So what we've learnt: If you write normal code and normal patterns that everybody knows, most likely it will be a pattern that the developers of gcc (or other compilers) thought of. That way you have a higher chance of the compiler doing good things.

@anand.suralkar 3 жыл бұрын

Yup

@QuantumConundrum 3 жыл бұрын

@InSomnia DrEvil Real truth. I tell people on my team this all the time. If it's not hard, actually try it and extrapolate. Rarely (but it happens) you'll get different behavior past a data size threshold. And even then... usually you can infer that (or predict this would happen)

@ГеоргиГеоргиев-с3г 3 жыл бұрын

all the codes shown missed the fact that you could just remove bit 5 when in range shaving 3+ additional instructions with a well placed xor eax, 20h(stored in r8d instead of -1)(times char's ID stored in the multi data case) since you don't really care for the original data or you would back it up(the more complex code still used sub 32 and some other data shuffle junk). So sometimes it's better to assemble it yourself.

@RobBCactive 3 жыл бұрын

The program I optimised for gcc did a good job of making bottom level branchless if statements, but it didn't stop my carefully optimised C being 100x faster than the natural C++ implementation using stdlib. It was well worth increasing processing complexity.

@nickscurvy8635 3 жыл бұрын

It still matters. Compilers are great but not perfect. There are still seemingly trivial and arcane situations where the compiler will be confounded and unable to apply proper optimizations on code. It won't be noticeable if ur writing a user application because those misses will be fairly infrequent. If you are writing a linear algebra library for physics calculations then you are gonna be performing millions of these instructions repeatedly in a very short period and the misses become noticeable. Advocating against premature optimization doesn't mean all optimization is premature.

@MultiMrAsd 4 жыл бұрын

I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU and not the CPU. GPUs have multiple cores that share a instruction pipeline, where each core runs the same instruction but with other numbers. That often leads to both sides of a branch being executed and then one is discarded. I have seen performance improvements above 100x by going branchless!

@peterbonnema8913 4 жыл бұрын

Didn't know that. Thanks. Good point.

@nistarok123 4 жыл бұрын

So, vertex/fragment shaders for opengl, for example?

@Luxalpa 4 жыл бұрын

@@nistarok123 Yes, exactly. Avoid branches in GPU code as much as possible. In fact I think there are some shader languages that don't even allow the if statement.

@Kino-Imsureq 4 жыл бұрын

@@JeredDanielson me too ;)

@scottmichaud 4 жыл бұрын

Note: GPUs these days won't take the performance hit if all threads in a warp (NVIDIA), wavefront (AMD), etc. (others) go down the same branch... whichever branch that is. So let's say you have a complicated material depending on whether or not you're shadowed -- you will only traverse both paths if your warp/wavefront corresponds to a block of pixels that overlaps with an edge of the shadow. If the block is entirely shadowed or entirely illuminated, then it will only run the shadowed or unshadowed path, respectively. This is 32 threads in NVIDIA, 64 in AMD... although they do reserve the right to change.

@randyscorner9434 Жыл бұрын

I've designed CPUs for over 30 years and we pay a LOT of attention breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that what seems to be avoiding branches ... don't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat the compiler optimization that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such in GPUs, but even then compilers are pretty smart. The most efficient code is when tasks are parallelized so that multiple operations can be done based off of one or a small number of decisions. In trying to avoid branches, the cost is paid in non-maintainable code and is very high. If you really want to improve efficiency, don't rely on multiple software packages which actually may include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, PERL, etc.) waste much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required. In one example, we recoded a numeric-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive Beagle-Bone Black instead of a MAC. On the MAC it consumed 35% of the processor. On the BBB it used 15% of a much less powerful processor by vectorizing the operations. These are the sorts of changes that matter.

@kingacrisius Жыл бұрын

I mean he literally looked at the decompiled assembly code to show how the branching and branchless code affected whether or not the assembly decompilation was branching. In the first case, he showed that the compiler sometimes understands the purpose of code and makes great optimizations that already remove the branches, but also shows how in other, more complex code, the branchless code was much faster than the branching code. So I really don't think he or the video missed this.

@lobovutare Жыл бұрын

cmov is probably not branchless as claimed in the video. On the micro architectural level is still needs to branch.

@BogdanBaudis Жыл бұрын

@@kingacrisius The thing is that looking at the assembly may or may not tell what actually CPU does. There are some cores (old ADI's SHARCs, Microchip's DSCs and few other) that have such predictable deterministic pipeline you can actually count the clocks and if you couple it to SRAM, then you have whole story. But modern IA64 or Cortex-A cores do so many things on so many levels, that without an intimate knowledge of what core actually does a manual optimizations is just a hopeless exercise. Silicon vendors want their wares to shine so they contribute to the compilers development heavily. Unless you are in the know in these matters, the best you can is to optimize for a particular platform, test it, and then freeze SW AND HW design. Sometimes this is what is needed and acceptable. But if you think such code will survive on other platforms or onthe future revisions of a given platform you may be for a rude surprise.

@kingacrisius Жыл бұрын

@@BogdanBaudis Well regardless he timed it too. Like it is so clear when the branchless programming is better, because he timed it and shows how the assembly is faster. I seriously don't understand where all of this is coming from.

@kingacrisius Жыл бұрын

@@BogdanBaudis And obviously some code may not be portable, but he was coding C++, which is almost always portable. And that doesn't matter anyways for this discussion.

@christianbarnay2499 4 жыл бұрын

05:00 The lesson here: never over-optimize your code beforehand. Start by writing natural and simple code. Let the compiler do its tricks. Check the result and only look for optimizations in the parts that the compiler didn't already optimize.

@tetramaximum 4 жыл бұрын

I think the lesson is: know your hardware.

@scottmichaud 4 жыл бұрын

@@tetramaximum Both are valuable lessons. Another one is "know your data". Imagine a super-inefficient O(n!)-scaling algorithm. If you can prove that your data must always be n

@tetramaximum 4 жыл бұрын

@@scottmichaud agree

@SleepyMongoose 4 жыл бұрын

Very true, the more generic your code, the more likely the compiler will understand what you are doing and perform optimization.

@scottmichaud 4 жыл бұрын

@@SleepyMongoose For the things that it can. It's your job when it comes to effectively using overall data structures, etc.

@rorytulip9343 4 жыл бұрын

Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.

@andrewsprojectsinnovations6352 3 жыл бұрын

"I can teach a monkey to do what you do. What I can't teach the monkey is when NOT to do it." - one of my father's chiropractic professors But yeah. Branchless programming is just one of many tools in our arsenal, not a silver bullet. Having a hammer doesn't make every problem a nail. Same here.

@aviralsood8141 3 жыл бұрын

@@andrewsprojectsinnovations6352 Maybe he should have taught your father not to become a chiropractic, much more helpful.

@maxineyoung3236 2 жыл бұрын

As a student & beginner, this is certainly correct. Was ready and excited to lmao.

@unformedvoid2223 2 жыл бұрын

It's useful when you do shader programming where branching is very expensive. I had to «invent» some of those examples by myself when did shader programming. Great video!

@SeanCMonahan Жыл бұрын

Same! It was an interesting experience refactoring some branched OpenCL code to use algebraic solutions to replace the branches.

@programaths 4 жыл бұрын

Shaders. Branchless techniques are mandatory in those!

@plazmotech5969 4 жыл бұрын

When I first heard people saying "avoid branching in shaders" I had no idea what they were talking about. "How could you ever possibly avoid branching???" I thought. Now I see how. Thanks!

@galandilvogler8577 4 жыл бұрын

It must be said, though, that even shaders compilers are getting better and better, and you could use simple ifs which don't end up in branching. For example, if you write: if(x>=0.8) {y=x} else {y=0} one might be tempted to optimize by writing: y = x * step(0.8,x) And yet, at least in HLSL, the compiler creates the same low level code for both.

@plazmotech5969 4 жыл бұрын

@@galandilvogler8577 Good to know! But still, best to make sure...

@clepirelli 4 жыл бұрын

This is a common misconception. IIRC, uniform branching is 100% fine, the only problem with shaders branching is that if a heavy branch is followed by one invocation, the whole warp will stall. This is problematic, but does not mean branchless techniques are better as branchless techniques would just calculate both branches, meaning you'd actually have the whole warp calculate those heavy calculations. See shadertoy (dot) com/view/wlsGDl and twitter (dot )com/bgolus/status/1235254923819802626

@plazmotech5969 4 жыл бұрын

@@clepirelli Neat! Thank you

@frogge6443 3 жыл бұрын

Your mousepad is a masterpiece.

@robertoaguiar6230 3 жыл бұрын

high albedo for that max dpi specs. Don't tell gamers, or the prices will blast up

@evanmagill9114 3 жыл бұрын

@@robertoaguiar6230 I haven't thought about the albedo consideration there... I've seen a lot of very dark color mousepads... Hmmm But I suppose it would be albedo in the specific range the mouse actually uses. Infrared?

@kaanozk Жыл бұрын

Don't look at the finger pointing towards the moon or you will miss the heavenly glory

@vanlepthien6768 2 жыл бұрын

As a long time software developer, in virtually every case, code maintainability is far more important than speed. I've written programs that processed millions of transactions - and branching was the least of the performance concerns. Paging and I/O typically are much more a problem. Except in toy problems, memory management will be a much larger concern. Introducing repeatedly executed multiplication generally is not good for performance. "And"-ing 01011111 with ASCII lower case converts to uppercase as well as subtraction, and to my mind, is much clearer. Unless there is a genuine need, don't make your code ugly to eke out a few cycles.

@henrikskott Жыл бұрын

I agree, except I'm convinced we all need to learn this the hard way. I haven't met any developer who did not try to be clever in the first 20 years of their career. Took me 30 years to start understanding this myself, but I need to really try all the fun things before I do it the hard way.

@gorak9000 Жыл бұрын

@@henrikskott I had a CS prof once tell a story of how he spent so much time trying to optimize the code as he wrote it that it took him weeks to write the code, and when he ran it for the first time, the performance was terrible. He ran it through a profiler, and the part he optimized took virtually no time to execute anyways, and it was some other part he didn't even think of that was taking all of the time. His message was "just write the code sensibly the first time, then run it, and only then if there are performance issues do you optimize anything, once you know exactly what part needs to be optimized. That's definitely a winning strategy! Don't waste time optimizing things that don't matter anyways

@jacobzimmerman3492 Жыл бұрын

Once you understand cache hierarchy , you realize a heap access can be equivalent to MANY branches

@akenteva Жыл бұрын

Good point. I found the video interesting and informative, but part of me was wondering "in what context would these techniques genuinely be useful?". While I'm sure there are things out there that require this level of optimization, typically it makes more sense to prioritize readability. Also, great idea with the AND-ing! Very elegant way to solve it, at least in the case where the string only has letters. If non-letter symbols are present though, you'd need to add more logic to handle that.

@Josus Жыл бұрын

it could be even better to implement/improve this kind of optimizations in the compiler itself, this way you can focus on the elegant code and gain a boost in maintainability while the compiler does the hard job, of this kind of very low level optimizations

@willofirony 4 жыл бұрын

Awesome! Branchless programming would be even faster if the CPU had an instruction that filled all the bits of a register with the zero flag. These "computer science approach to C++ programming" videos are getting really good and I suspect are unique on youTube. You are improving (possibly, even saving) our mental health with these conceptual challenges and I, for one, am so grateful.

@WhatsACreel 4 жыл бұрын

Thanks mate! I certainly wish the instruction set had a few things that are missing too! Cheers for watching :)

@vyor8837 4 жыл бұрын

@@WhatsACreel could ask intel and AMD to add that...

@halbgefressen9768 4 жыл бұрын

setz eax ror eax, 1 sar eax, 31

@CarrotCakeMake 4 жыл бұрын

Nothing in this video had anything to do with computer science. It was all computer architecture.

@RAMIdotGG 4 жыл бұрын

every community needs a Michael Keegan :) refreshing to read a wholesome honest opinion

@C5FlyingSquad 4 жыл бұрын

I think more people need to see your videos. Although they're specialised, they feel valuable for non-systems programmers like myself.

@gangstasteve5753 4 жыл бұрын

this is a very underrated youtube channel. i learned assembly from him.

@ChrisM541 3 жыл бұрын

Brilliant explanation. The following bombshell caught my eye... 12:19 "So it just goes to show that you don't automatically get speed from jumping over to assembly...you have to code it well" - very well said ;-) Wil every language above assembly, it's a shame that we have so little control over the final machine code output, even with C/C++. Lesson: 1) Never trust your compiler 2) Become an expert in assembly language and learn to e.g. inline where needed.

@krytharn 4 жыл бұрын

Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.

@abhinoorsinghpannu2993 4 жыл бұрын

Same goes for cuda programming also

@WhatsACreel 4 жыл бұрын

So true! GPU's hate branching maybe even more than CPU's! They call the conditional execution 'predicate' in CUDA.

@dakrontu 3 жыл бұрын

Yeah but that inefficiency of computing both branches and discarding one is the passport to being able to run the many parallel streams so you still gain.

@josiahmanson 3 жыл бұрын

A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long, and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.

@IceExtremeGamers 3 жыл бұрын

Doubt it will work with non ASCII characters, though.

@Henrix1998 3 жыл бұрын

@@IceExtremeGamers none of this works for non-ascii

@IceExtremeGamers 3 жыл бұрын

@@Henrix1998 sadly

@josiahmanson 3 жыл бұрын

@@IceExtremeGamers His whole discussion was assuming ASCII as a toy example. Yes, the small lookup table will not work for general multi-byte Unicode (especially variable length encodings like UTF-8), but Unicode is more complex and none of the other stuff described in the video, like SIMD, would work either. For 8-bit character encodings, the table is actually far more general than anything based on conditionals such as conditional moves. The table approach would also be usable for 16-bit character encodings, as 64 KiB is not a lot of memory nowadays, and the parts of the table that would actually be used in practice would almost certainly fit in L1D$.

@igorstasenko9183 3 жыл бұрын

@@josiahmanson yeah i was waiting from author to present a final-final version of algorithm which using lookup tables.. but video ended :)

@sylviaelse5086 3 жыл бұрын

In some architectures, multiplication takes more cycles to execute than less demanding arithmetic. In the first example if a < b usually comes out to be the same (whether true or false), then branch prediction will avoid the delays inherent in "if", and the result could be faster than using multiplication. The architecture may even be incapable of calculating the true/false result other than by the use of branches. In any case, unless the function is called intensively, the benefits, or otherwise, of the branchless approach will be invisible.

@notgabby604 Жыл бұрын

You can get by 4 or by 8 the speed for some numeric code using assembly (eg. fast Walsh Hadamard transform, neural network code.) Sometimes the compiler can find better assembly than you can. If you need performant code try both ways. GCC found much better code than me for this in FreeBasic. sub subrandom1(result as single ptr,x as single ptr,n as ulongint) dim as ulong ptr resultptr=cast(ulong ptr,result) dim as ulong ptr xptr=cast(ulong ptr,x) dim as ulong r=&hE8FE90EB,rs,shift=18 for i as ulongint=0 to n-1 step 4 rs+=r resultptr[i]=xptr[i] xor ((rs xor (rs shl shift)) and &h80000000) rs+=r resultptr[i+1]=xptr[i+1] xor ((rs xor (rs shl shift)) and &h80000000) rs+=r resultptr[i+2]=xptr[i+2] xor ((rs xor (rs shl shift)) and &h80000000) rs+=r resultptr[i+3]=xptr[i+3] xor ((rs xor (rs shl shift)) and &h80000000) next end sub

@khhnator 4 жыл бұрын

I'm surprised with the amount of negative comments, he is not telling people do always code like this... there is a difference between time critical code and not. get the damn profile out and find where your program is spending most of their time, and THEN you consider doing this or not. the only negative thing about optimization is that it costs development time. the idea that "the computers will be fast enough" is just being naive. effective optimization can be is the difference between a product that reaches some of the market or most of the market or a program that can be extended for the decades to come or something that needs to be redone in the next few years.

@PhilippeCarphin 4 жыл бұрын

@kingofallcrypto Yeah, I can definitely imagine seeing stuff like this in first-year students code. In one project, they had to keep track of 16 boolean values on a microcontroller that had way more than enough memory for what we were doing. They wanted to encode them in the 16 bits of a uint16_t (16 bit integer type). It's natural for them to think that it's a good idea since we did tell them to use uint8_t over uint16_t when the number they needed to store was only going from 0 to 10. But the reason we told them that was mainly as a reason to talk about architecture and explain that since the microcontroller had an 8bit bus and discuss stuff. It did give me a chance to have a good discussion about balancing programmer effort with execution speed. In the uint8_t vs uint16_t, there are no downsides and there is a theoretical benefit in speed. In the store 16 booleans in a uint16_t, it's going to be a complete pain in the a** to store and retrieve the actual booleans (if we don't consider that the memory saving has a runtime cost), we're introducing the possibility for errors and we have more than enough memory by a wide margin for that project.

@dincerekin 4 жыл бұрын

time critical code should not be run on a micro

@iVTECInside 4 жыл бұрын

@kingofallcrypto , I would hope that it would be obvious to anyone that ended up at this video.

@wusulus 4 жыл бұрын

You try to make it seem that it is worth it everytime, while it cerntainly isn't. This is the kind of thing you would use if you used every other trick in the book and need it for some one in a million project.

@webkar 4 жыл бұрын

@@dincerekin Think about what you said next time when your ESP/ABS saves you from skidding into a tree.

@psun256 4 жыл бұрын

I never really thought about CPU caching and "predictions" affecting conditional statements, this is super cool!

@konstantinkh 4 жыл бұрын

When you get into serious depths of optimization, you start looking at things like branch predictability. Like, there are situations when you chose to write a less optimized version of algorithm, because it has more consistent depth, and that means the loops are going to repeat a similar number of times, making branches on them more predictable, resulting in faster execution.

@psun256 4 жыл бұрын

@@konstantinkh So things like for loops?

@konstantinkh 4 жыл бұрын

@@psun256 Yeah. Because on assembly level, every loop is just a branch that sends the execution to beginning of the loop. So the CPU will hit the same branch over and over again as it's running the loop. And modern CPUs have very clever ways to get a heuristic for these branches. Like, if you have a very tight loop in some algorithm that always runs for, say, 10 iterations, your CPU can actually learn to predict that every 10th time it's not going to jump and start caching instructions from the correct branch. At that point, it's like the branch isn't even there! If the number of iterations varies, then the prediction is worse. Best your CPU can do in these cases is "usually, jump is taken," so it will always cache instructions for repeating the loop, and will miss and have to flush the pipeline whenever the loop finishes, which is still better than not having prediction at all, but not optimal. That's one of the reasons why an algorithm with consistent iteration counts can sometimes perform better than algorithm where iteration count varies. Of course, there are also optimizations your compiler can do if it can "guess" the number of iterations, which can make even bigger difference. We are getting into territory of optimizations where you _really_ need to have a good reason for going after, though, as you'll spend a lot of time testing, timing, and looking at disassembly, and you shouldn't be doing that until you've fixed all other timing problems. The only real world example I have where we've spent time going after this kind of stuff is character animation in games. When you have hundreds of bones in hundreds of skeletons for which you have to update the transform matrices 60+ times per second while leaving enough time for CPU to do everything else it needs to, you start looking at things like cache and branching prediction optimizations.

@psun256 4 жыл бұрын

@@konstantinkh Huh, that's pretty neat. Thanks for sharing!

@A.Martin 3 жыл бұрын

@@konstantinkh It would be great if you could actually tell your cpu which branch to load by default.

@RFC-3514 3 жыл бұрын

For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong. And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P

@Theineluctable_SOME_CANT 3 жыл бұрын

Yep!

@WorBlux 3 жыл бұрын

Multiply by 1 or 0 and you get an early exit on modern ALU's. With pipe-lining and SIMD, you come out ahead if you stack a large enough vector up. 8 ops/cyle + constant 3 cycle multiply penalty. If you have 800 elements, you are basically doing that multiply in 1/8 a cycle. So even with 2 multiplies and an add, it's still over twice as fast before any branch time or penalties is involved. And then AVX512 and SVE2/Cray vectors has predicated SIMD ops where you can create a predicate mask and then conditionally perform a vector op. CompareLT V1 and V2 to create mask than copy V1 into V2 using the mask to predicate each lane. the element from v1 is only copied of the LT mask is true.

@Theineluctable_SOME_CANT 3 жыл бұрын

@@WorBlux can I hire you on for my next big programming job?

@WorBlux 3 жыл бұрын

@@Theineluctable_SOME_CANT Keeping myself informed about cpu microarchetecture and programming is a hobby of mine. Actual programming not so much. Being outside and producing something I can touch is more rewarding for me.

@Theineluctable_SOME_CANT 3 жыл бұрын

@@WorBlux yes, for me also. Machine microarchitechture is quite interesting to me.

@MM-24 4 жыл бұрын

I want to buy this guy a mouse pad and maybe a second monitor...

@Bobby.Kristensen 4 жыл бұрын

He seems to be stuck in the 90s.

@stevecarter8810 4 жыл бұрын

Nobody's stopping you

@lour.222 4 жыл бұрын

Goes to show it's what you know, not what you own.

@MM-24 4 жыл бұрын

@@stevecarter8810 no they aren't, that's good to know

@MM-24 4 жыл бұрын

@@lour.222 I think its a combination, good tools in the hands of a craftsmen sort of thing

@brandonchurch6025 4 жыл бұрын

This is really good fundamental information for some of us "spoiled" high-level programmers who don't typically think beyond compilation

@ponypapa6785 4 жыл бұрын

Agreed, but highly dangerous and misleading information for all the others, as a disclaimer that this is basically "experts only" is missing. And seeing what inexperianced programmers try to accomplish, this can lead to a lot of technical debt, as it will be misunderstood, badly used and then mixed with branching techniques . So yeah, fun :D

@juangalton999 5 ай бұрын

Agreed 100%. As a Javascript programmer, I noticed many of my peers don't seem to care much about how the language works. I know there are many performance decreases with it being a dynamic/weakly typed language with a just-in-time compilation. So whatever edge I can eek out in performance I will take. Plus I eventually want to learn more than one language. But making sure I REALLY understand JS first.

@fie1329 Жыл бұрын

I did not expect a character like yours in a programming video! You do a great job on keeping things interesting and a bit funny while making good use of the 20 minutes with practicle examples. Really good video!

@StructOfArrays 4 жыл бұрын

For whatever it's worth... Smaller_Branchless produces identical assembly as a branched version when using clang 10 and compiling with -O2. In GCC 8.3 Smaller_Branchless is flat out worse than the branched version - branches been optimized out by the compiler. In any case for the branched versions the awesome compiler optimized away the branches. So for the noobs out there, if you're doing something like this, always check the compiler output and measure measure measure.

@Guztav1337 4 жыл бұрын

“in 99% cases the compiler is smarter“

@Henrik_Holst 4 жыл бұрын

Not only that but Smaller_Branchless is not actually branchless, conditionals are just hidden branches. Writing a * (a < b) still forces the compiler to add a test and jump aka branch to perform the a

@0LoneTech 4 жыл бұрын

@@Henrik_Holst No, branches are distinct from conditionals. In this case, we see it transferring a condition flag into a general purpose register value using SETc instructions, because x86 has a condition flag register; not all architectures do. Conditional instructions like x86 CMOV or AVR SBRS operate by cancelling the retire step of a single instruction, which doesn't divert the instruction decoding; that's a cheaper form of branching. What you say would apply to architectures where the condition test is embedded only within branch instructions, as if you removed the CMP* instructions and had to use B* instructions in Nios II. That would be a typical case where a compiler might have a different optimization to avoid the issue, like using the sign bit from a subtraction.

@Guztav1337 4 жыл бұрын

@Isaac Beizsley It, indeed, won't turn your O(n^2) into O(n). But it will fix your branches better than most people in most cases. (The example OP made). It indeed won't do any simd. However, if you are prepared to go assembly just to speed up the code, it is worth assuming you know what you are doing. Like in the video, where he does indeed outsmart the compiler.

@Guztav1337 4 жыл бұрын

@Isaac Beizsley Yeah that's annoying. I want to say that the world would be a little bit greener if it wasn't for all bad code. lol

@TheAguydude 4 жыл бұрын

Note that the branch predictor is generally able to predict branches accurately for loops, since they almost always take the same path. Thus, getting rid of a loop branch is usually not of much concern, for purposes of avoiding branch prediction failures. Of course, there's still value in improving loop iteration code, since it still represents instructions that are run on ever iteration.

@TheGandorX 4 жыл бұрын

That's why JIT (Java) is such a good idea. The run time / JIT platform applies the optimizations best suited for the hardware.

@garretgang8349 4 жыл бұрын

@@TheGandorX Unless you have a limited amount of parallelization that doesn't allow JIT to run on a separate physically parallel thread. Than it about doubles your run time.

@daantimmer 3 жыл бұрын

Only if you have sorted data. Once you enter the territory of true random data the branch predictor practically doesn't exist anymore. Since it simply won't be able to predict. Now, some CPU's, (don't pin me on which ones exactly) can execute two branches simultaneously in parallel while still processing the condition. Once the condition output is known the output of the parallel processes is used and the other data path is thrown away. This has quite some prerequists though, but it works pretty well.

@andywolan 3 жыл бұрын

@@daantimmer Ya, I think that behavior was the basis of the Spector and Meltdown CPU bugs discovered a few years back. (Correct me if I'm wrong.)

@daantimmer 3 жыл бұрын

@@andywolan I won't be able to correct you even if you are wrong because I never really dived in to those issues :-)

@tolkienfan1972 3 жыл бұрын

1. Make sure your branch targets are aligned to 16 byte boundaries. 2. If you're comparing to upper AND lower bounds, (whether masking or branching) you can subtract the lower bound and do a single compare instead. 3. When writing assembly pay attention to instruction dependencies, and interleave independent instructions. E.g. load 1st chat, load 2nd char, compare 1st, gen mask, compare 2nd char, and so on. 4. You can often apply these techniques directly in C++, possibly with intrinsics. ... Just my 2c

@connoisseurofcookies2047 3 жыл бұрын

I remember Bjarne Stoustrup saying something to the effect of 'The compiler will write better assembly than a human ever could' and in most cases, especially larger ones, that holds true.

@christopherjoseph651 3 жыл бұрын

The compiler can only write assembly as good as the person who wrote the compiler can write assembly. Your statement is mostly true for people who don't understand assembly and the way CPUs actually work

@CottidaeSEA 3 жыл бұрын

@@christopherjoseph651 Let any of the people who made the compiler try to write the same program in assembly as in C++ and see which one performs faster and with fewer bugs. It doesn't matter if you understand a CPU, the assembly code from the C++ compiler will be far better than any manually written assembly. Because the people who make the compiler only do translations and optimizations, they do not write entire systems with it. Because that's stupid.

@dionyzus2909 3 жыл бұрын

@@christopherjoseph651 Not really, in the largest scales this statement holds absolute true. There are limitations to what humans can do, a large project in pure assembly would be just plain impossible for a human to do, it's unrealistic.

@christopherjoseph651 3 жыл бұрын

@@CottidaeSEA Did you not watch the entire video? He proved that a human who understands the instruction set of the CPU and who understands the purpose of the high level code can fully utilize the instruction set and hardware of the CPU. He literally beat the compiler by 45x! If the compiler is so superior why didn't it write the code using AVX? Exactly, a human wrote the rules that the translations and optimizations follow. So a complier is only going to write code as good as the person who wrote the compiler. This has nothing to do with program scale. The point is a human wrote the compiler so the complier can only write optimized code as well as the person who wrote the compiler. If that person doesn't realize how to perform an optimized task in assembly then the compiler won't be able to optimize the code either. What, do you think we're living in a world of Skynet where computers build more sophisticated computes and write their own programs?

@christopherjoseph651 3 жыл бұрын

@@dionyzus2909 This is a childish statement written by someone who believes in the new education model in which older languages are out dated and inferior and therefore shouldn't be taught. We don't need new, more complex languages to solve new and more complicated problems. To this day, when a program needs to be optimized for speed, it is written by a human using inline assembly. Even the guy who made this video understands that you should look at your disassembled code and be capable of understanding what the compiler is doing. Do you need to write a large program that is not time or resource critical in assembly? No, there's no need for that, but the statement that a compiler will write better assembly than a human ever could is COMPLETELY FALSE!

@bandogbone3265 3 жыл бұрын

I was doing this back in the late 1970's, on a Cray-1 supercomputer with pipelined vector registers. Complex single-instruction operations were broken down into a sequence of simpler operations, each performed by a separate functional subunit. After one clock cycle, the first functional subunit was free to begin operating on the second element on the vector, while the second functional subunit began operating on the intermediate result in the first element of the vector that was left there by the first functional subunit. Complex operations like floating point division could then produce a result every clock cycle, even though the operation could take a total of six or more clock cycles to perform on a single operand beginning-to-end. Like stations on a car assembly line, the game was to keep as many functional subunits busy as possible at any given time. We would carefully map out each tick of each component of the CPU on a big chart before implementing in assembler. The thing was blazing fast, even by today's standards. Nice vid!

@duanebonas8630 3 жыл бұрын

Class explanation. I Love it . U know what u going on about.

@spx0 2 жыл бұрын

Great

@treelibrarian7618 Жыл бұрын

this is still the game, with 5 logic and 5 memory execution ports available for out-of-order processing (execution is determined by data dependency order), writing code to keep all of them busy with chains of non-dependent code in each loop to mitigate latency over iteration - although division is one thing that isn't pipelined. I guess division is rare enough that it's not worth the extra silicon.

@danielstory2761 Жыл бұрын

The iPhone 6 is capable of 20 times more FLOPS than the cray-1

@GregMoress Жыл бұрын

That's called 'Pipelining', and I'm sure you knew that. It's a form of parallel processing.

@SpiritmanProductions 3 жыл бұрын

This will seem trivial to some, but for anyone planning to do arithmetic with the result of a boolean expression in .NET should bear in mind that 'True' equates to -1, rather than 1, as in C++. EDIT: I'll just add that any non-zero value (positive or negative) equates to 'True' in a conditional statement, and zero equates to 'False'.

@TurboXray 3 жыл бұрын

Someone needs to explain why that was chosen hahah.

@Vacuon 3 жыл бұрын

-1 (int) is just max_uint, aka 0xFFFF... you can do (boolVal & 1), but it gets tedious haha And I've never used a system where true is -1 myself but if I had to guess I'd say it's because the designers wanted logical operators to work on boolean values, e.g.: not false = true (~0 = -1) true or x = true (-1 | x = -1) etc, etc, etc.

@TurboXray 3 жыл бұрын

@@Vacuon I've worked with plenty of systems where -1 can be true, but it's not the ONLY truth value. As in, 0 is false and everything else is 'truthy'. ~ (BoolValAsNegativeOne) is just BoolValAsNegativeOne XOR -1, vs BoolValAsOne XOR 0x01. It doesn't appear to show negate -1 as an advantage. I mean, for either method you still have to auto-demote to a bool type to perform the operation. And it's the same amount of assembly instructions to invert; load the value, perform the operation. I'm SURE they had a reason for doing it. The thing about having a TRUE as a negative -1, means you can't index an array via an evaluation myArray[a>=b].. well I guess it works if your language supports negative indexing (like Python).

@SeanCMonahan Жыл бұрын

Oh, god, I bet that has led to more than a few bizarre bugs.

@vladimirarnost8020 Жыл бұрын

@@SeanCMonahan Not sure about C# but I wish C/C++ also used -1 (aka ~0) for 'true'. It would make very practical bit masks. For example, the video uses the slow integer multiplication (imul) instruction where a simple bitwise AND with a proper mask (0x0...000 or 0xF...FFF) would do instead. Conveniently, 0xF...FFF = -1U. Also, 0x0...000 and 0xF...FFF are just complementary values. You get the opposite one just by flipping all bits. x86 CPUs even have an instruction for that (NOT). ~0x0...000 = 0xF...FFF. All expressions consisting only of 'magic numbers' like this are pre-computed by the compiler. So if 'true' was -1, the min(a, b) function could be expressed as: return a & (a < b) | b & !(a < b); Since 'true' is 1, we have to flip the sign: return a & -(a < b) | b & -!(a < b); Or leave it to the compiler because the compiler writers know well how to compute a minimum of 2 numbers properly on your platform. Use 'std::min()/max()' for branch-less programming. The CMOVxx instructions have been with us for a very long time (since Pentium Pro in the mid-1990s) so they can be taken for granted now. :)

@giladreich810 4 жыл бұрын

KZbin should pay you x10 times more to each view you get for the underrated quality content you publish. Love your channel! ❤ I was recently getting more into this subject myself, especially reading more about the subject of how different compilers handles N/RVO (Return Value Optimization) and Copy Elision to reduce copying when all the move semantics were introduced in C++11. Would be also a great idea for the next video to understand more about optimization and what we should not worry about when we write our code.

@scottmichaud 4 жыл бұрын

C++17 mandated that RVO (not NRVO) is performed. If you're allowed to use C++17 (or later) then the compiler is not allowed to perform intermediate constructors and destructors anymore, so there shouldn't be any differences between compilers or optimization levels. You can see the edge cases here: en.cppreference.com/w/cpp/language/copy_elision

@berndeckenfels 4 жыл бұрын

A good example for branch less is in Crypto, where data dependent code execution can lead to side channel leaks, therefore you look for brnachless and constant time algorithms. So for example to compare two byte arrays with hash results you would add the xors of each byte and not break the loop on the first difference.

@inx1819 4 жыл бұрын

add random sleeps in the code and you're alright /s

@yvrelna 4 жыл бұрын

@@inx1819 Actually you don't want to add random sleeps. If you just sleep randomly, all operations will be extended by roughly the same amount of time, so you can still statistically figure out which operations takes longer than the other. Instead, what you want to do is to sleep longer if the operation finished faster and sleep less when the operation are slower. You can do this by deciding beforehand an operation how long an operation should take. For example, you might decide that an operation should take 1 second, so you do the operation, and if there's any remaining time, you sleep for the remainder. That means no matter which branch the code takes, they'll always take 1 second. Now, this isn't perfect, there's also power analysis because doing an actual operation likely costs you more power than sleeping, so this is enough if your attacker is remote (e.g. webserver). But if you need to consider a local attacker that has full physical access to the hardware (e.g. smart cards), you might have to busy sleep.

@gutoguto0873 3 жыл бұрын

@@yvrelna you didn’t see his /s

@Zeuts85 3 жыл бұрын

This sort of optimization seems like what you'd do only for high performance real-time applications--like video games for instance--and only then as a third-pass after you'd first written the code in much more human-readable format. My coding bootcamp instructors drilled into me the mantra: First make it correct, then make it clean, then make it fast. Ideally by the time you reach 'make it clean' you've got good tests that pin down the functionality of the code before you refactor. After that you can come up with fast, clever implementations that still pass the tests. You shouldn't ever need to debug the fast clever implementations if you've restricted yourself to good, pure functions and written your tests correctly.

@saudude2174 Жыл бұрын

bootcamp instructors prepare people for no brain web dev jobs. This is something you will never use in such a work environment. It's mostly for advanced, low level programmers.

@ncot_tech 4 жыл бұрын

I’ve been doing Z80 assembly recently, and seeing all these fancy conditional instructions and more than 8 registers is crazy 😀

@WhatsACreel 4 жыл бұрын

Long live the ZX Spectrum :)

@protonjinx 4 жыл бұрын

cheater! 6502 only had 3! spoiled brat.

@rongarza9488 3 жыл бұрын

Early on in my programming career I noticed that there was usually a relationship between speed and instructions -- time and space. Plus, you can throw in the time it will take to debug clever code.

@Vacuon 3 жыл бұрын

If is not *necessarly* slow, if your condition is going to the same place 98% of the time, your branch predictor will be very accurate, and the pipeline will rarely be broken.

@musaran2 Жыл бұрын

Even that is not free either. Branch predictors tend to have limited memory, relying on it somewhere hurts branches elsewhere. But then cache is limited too, so maller code is often better.

@iXPilot Жыл бұрын

Funny that I got recommendation of this video after watching a few presentations about the Elbrus CPU, which doesn't have the branch predictor at all. Now I understand what kind of hoops the developers of it's compiler and the porters of JVM had to jump through...

@volchonokilliR 4 жыл бұрын

There's [[likely]] and [[unlikely]] attributes in C++20, they can help in some cases

@vyor8837 4 жыл бұрын

And that's why you update your compiler and SDK...

@neur303 4 жыл бұрын

Also there is profile guided optimization, but sometimes you have paths that are hard to predict, because the patterns are too elaborate.

@user-tr8kr1jd2o 4 жыл бұрын

Vyor GCC has had macros for it for years

@leo69105 4 жыл бұрын

Have you tried if these help? Afaik, the branch hints are x86 only and no longer available in x86-64 mode. Not sure about arm.

@HermanWillems 4 жыл бұрын

@@neur303 For example i have a branch that says High Alarm. (So x value is higher than y) I as a programmer know that one of the two almost never gets executed. Does this mean i can "guide" the compiler in to say hey branch predictor. Pls choose this one first for your prediction!! The other branch will almost never hit. Is this possible ? Especially if you need to check this for like hundreds of thousands of signals each cycle ? [[likely]] to be lower, [[unlikely]] to be higher than Y.

@dana_t0ebdot 4 жыл бұрын

Got recommended this, watched it all, now time to find even more good content you have

@jonathanross6260 2 жыл бұрын

Branchless programming is often about taking a step back and finding a simpler approach. In the second example, I would just take d[i] & 0xDF; it keeps the upper case characters intact, but flips the 6th bit for lower case letters making them upper.

@WhatsACreel 2 жыл бұрын

Yes, that will change some non-letter characters too. If those characters aren’t in your text, then AND away my friend! It will certainly be faster!! The example is not super important, just the technique of masking in SIMD to perform multiple branches without actually branching. Cheers for watching :)

@thewhitefalcon8539 Жыл бұрын

could try something like d[i] ^= ((d[i] - 26) < 'a')

@bidanfullko1 2 ай бұрын

I was just about to write this, but it’s obvious someone beat me to it (years ago)! :D

@donjindra 4 жыл бұрын

In time-critical code know your cpu. I once rewrote a graphics library in Z80 assembler and sped it up 10x by unrolling loops and counting cpu cycles for every bit of code. There are all sorts of tricks you can use to improve speed as any good programmer knows.

@donjindra 3 жыл бұрын

@jshowa o No, tools of the trade don't necessarily make one a good programmer. But pretty much only good programmers know these things. And there is nothing unmaintainable in these designs. Again, one just has to be wise about implementation. But that's the case in all languages, including those that advertise themselves as nearly fool-proof. No language prevents or even hinders unmaintainable code.

@donjindra 3 жыл бұрын

@jshowa o Yes, I watched the video. I also have 40 years of experience. You repeat what I've heard all too often. It's true compilers are very good at many things. It does not follow that they are best at everything. Experience has conclusively demonstrated that compiled code is not always the quickest code or the smallest code and there are still cases where maximum efficiency is required.

@kevinstefanov2841 4 ай бұрын

I would love to get your email, discord or linkedin or whatever and chat with you about the various assembly optimization tricks you could teach others! I'm an operating systems developer myself with 3 years of experience and I'm just now starting to really get into assembly language (I use an internal closed-source C-like language at work which is so old, it doesn't have an easy to use print function, so often the best way for me to debug the runtime is to freeze the operating system at certified compiler-emitted assembly instructions and look at registers and memory locations hahaha) and hardware architecture and optimizations enabled by a programmer's understanding thereof. My current side project is 6000 lines of mostly algorithmic C code (custom BigInt and cryptography library to be used for a secure chat app) and I'm sure there would be a lot for us to talk about and a ton for me to learn from you!

@mohrizkyk 3 жыл бұрын

"If you want to be most people, if you want to be an ordinary dev with common skillset, this video is not for you." - Peerple -

@metamud8686 3 жыл бұрын

The primary rule of optimization is left out here, which is to FIRST MEASURE aka 'profile' to see which part of your code the CPU is spending most of its time in. Once you know where that is, you try to find out any algorithm / complexity improvements (take your algorithm from O(N2) to O(N) or even O(1)). Then finally, after all that is done, you may think about 'intervening', checking the assembly and seeing if you can "do better yourself". Introducing obfuscating code like a*(a>b)+b*(b>=a) leads to bugs and unnecessary rabbit holes for your future self when you get called in to debug the mess you made.

@Walrusbonzo 4 жыл бұрын

Loved the reference to Al Di Meola, one of my all time favourite guitarists. Great video too, right up my street.

@TheCroustade 3 жыл бұрын

Some processors own a conditional selection instruction, which at source level corresponds to a>b ? a:b. branchless programming is a must in time critical applications where worst case execution time prevails. This skill is appreciated in avionics design companies.

@medielectro Ай бұрын

The smaller_branchless( ) does not have any branching instruction. But... its code will be larger and will take more space (memory) and more time to execute. 😂 This happens because although we do not see any "branching" in the instruction return( a*(a

@gregorym6270 2 жыл бұрын

You write return a*(a

@ChrisStavros 2 ай бұрын

You're correct. Neither the video maker nor the adoring commentators understand what a branch is, and are just jerking off trying to eliminate short jumps. Fucking brutal.

@XenXenOfficial 2 ай бұрын

People have done it before, here's an example I found online, If you wanna know if it's smaller like in the video just put a ! on the function call: int isGt(int a, int b) { int diff = a ^ b; diff |= diff >> 1; diff |= diff >> 2; diff |= diff >> 4; diff |= diff >> 8; diff |= diff >> 16; diff &= ~(diff >> 1) | 0x80000000; diff &= (a ^ 0x80000000) & (b ^ 0x7fffffff); return !!diff; }

@youuuuuuuuuuutube 3 жыл бұрын

Branchless works great for GPU programming (GLSL etc). It's actually worth complexifying the algorithm just to minimize the branching (but not always).

@jcm2606 2 жыл бұрын

Really depends on whether the branch is divergent or not. Modern GPUs have some level of branch prediction built in and will try their best to only take one path through the branch while leaving the other path alone, so if you know for a fact that the branch is very convergent (ie it's using a uniform/constant value, or it's based on data that's convergent in nature), then there's nothing wrong with branching (except on ancient GPUs). However, if the branch is very divergent (ie it's based on data that's divergent in nature), *then* you should try to implement the same logic in a branchless manner, as the GPU is forced to take *both* paths through the branch, with half of the warp being disabled through each branch.

@Bunnokazooie 9 ай бұрын

Love when experts have a down-to-earth attitude like this. You rock man.

@originaltonywilk 4 жыл бұрын

Good to know some programmers are interested in this stuff... thought it was a dying art :) Did a lot of stuff back in late 90's with good old MMX instructions to process video in real time (which was tricky back then). Even now some clever optimization can make a huge difference - like being able to adjust photo editing filters in realtime with a slider (e.g. Affinity Photo) or waiting 1/2 a second or more for each tweak.

@TheEkkas 3 жыл бұрын

Very nice. Not often we see real code optimisation. First time I’ve seen this. Love it. Smaller is faster. Not only your data, but your code blocks also need to move in/out of cpu caches so avoiding a possible branch outside of L1 size will add up and makes sense (now). This is a great technique for ultra-performance code. The algorithm at the end is a nice added bonus. Subscribed.

@valtergomes5808 2 жыл бұрын

In my opinion, the main reason to avoid too branchy code is readability and maintaining time consuming. But for this issue, a clean code with well designed patterns will save you a bunch of 'ifs', which automatically make you code easier to understand and easier to be optimized by the compiler. Branchless programming the way it is presented here has a proper place and time.. if you need it you will know..

@BlankBrain 3 жыл бұрын

Back in the late '70s, I came across a set of algorithms that computed Julian and Gregorian dates using only double precision arithmetic. Leap years and Y2K worked without modification. Operating systems and databases frequently got leap years and daylight savings time calculations wrong. It took a lot of time an effort to recover. The software that I wrote using the arithmetic functions only failed once, and that was when someone compiled them single precision. It made me a big fan of using arithmetic functions to perform logic.

@Websitedr 3 жыл бұрын

I can see these techniques being applied for things that require a lot of calculations to be done as fast as possible. The issue is when it comes to other people looking at the code without fully understanding why it's written this way.

@eomoran 2 жыл бұрын

Aw jeez, if only you could write a comment annotating it /s

@TheLordoftheDarkness Жыл бұрын

Skill issues

@TheBypasser Ай бұрын

a) In fact, a "branch" is not just a conditional jump, it is every occasion your code gets "derailed", e.g. your instruction address register (usually called a Program Counter, or PC) gets modified. An interrupt, a function call (with either a static or a dynamic address), a function return, an interrupt return, just a locked-in-place goto-style jump - those all cause your program counter to take any value instead of advancing a fixed number of bytes. b) The branch-less approach is a pretty scary thing as, first, it is very hardware-bound, say your code based on the fact a C bool is either 0 or 1 may cause a real mess of register setups on many RISC cores. Skipping one single mov instruction is usually not a big deal performance-wise tho. c) The uppercase example is suboptimal. You could do "bool cnd1 = (ch >= 'a'); bool cnd2 = (ch

@frigzy3748 3 жыл бұрын

You can replace 32 in C with 'a' - 'A' just to not have magic numbers. Great vid - thanks!

@shanehebert396 3 жыл бұрын

Although not the purpose of this video, the old school trick for toupper is to bitwise AND each byte with a mask (~(0x20)), presupposing the arrays are all ASCII characters in the range of A-Z/a-z. Of course, once out of ASCII, this isn't the right method.

@frigzy3748 3 жыл бұрын

@@shanehebert396 I never thought of it - thanks!

@micahwelf 2 ай бұрын

3:44 - That code is C syntax. The syntax uses a comparison operator to return a number equivalent to Boolean. However, there is no guarantee of how the syntax will translate to whatever CPU is chosen, especially when using a language whose compiler you didn't study or write. Thank you for showing the concept, but I wish you would explain up front that this is a C-syntax feature, not a reliable technique in programming. (it would set the context better and be less misleading to the naive viewer)

@Kyojin743 9 ай бұрын

I’m an undergraduate in CpE and I was working on an autonomous robot, and I had this really over engineered way to do course correction. I watched this video, and just today I was looking over my system and noticed that the techniques in your video were perfect for this application. I reduced it from a 50 line solution of ugly nested if else’s about 4 deep at max, to 2 lines. Thank you!

@Chumbawumba-l6q 2 ай бұрын

What did it bring?

@Kyojin743 2 ай бұрын

@ My robot had incredibly straight lines when driving forward, but there must’ve been something with going backwards as it kinda curved a little. Passed with an A tho!

@russellthorburn9297 4 жыл бұрын

It seems to me that it's a choice of evils: 1. Easy to read but much slower code. 2. Optimized but heavily obfuscated code. Option 2 is WAY above my pay grade :-) so I'd go with #1 unless I had no other choice. Still, your video was extremely well done. Even I was able to follow what you were saying and I'm merely an intermediate level programmer. That's definitely worth a "Subscribe".

@tragedyofwind 4 жыл бұрын

this is why programming back in the very old days, they were really hard to read because scientists optimised the program to do things when the clock cycle is like below thousand Hz. However, the raise in computation power, mean there is no need to write optimal code for the machine, when the machine can do more work for you. so human programmer can spend less time to write less complex code, so that we could produce more programs for more purpose. This is a trade off, it really depend what type of work you were doing.

@Dji-gurda_Jdi-druga 4 жыл бұрын

I would say you should determine bottlenecks and use #2 in them and #1 everywhere else.

@retrozvoc6189 4 жыл бұрын

How about a middleware that's readable to you and which gets JIT-compiled to whatever CPU you wanna run it on?

@DoomRater 4 жыл бұрын

I dunno Option 2 is what compilers are for. Or in my case, precompilers since I work predominantly in a compiled language without certain niceties and use a precompiler that translates the extra control statements I want to use into standard controls that actually exist. Also look into the MOVuscator for more evidence that optimizations can be often offloaded onto the compiler. Edit: I said all that BEFORE WATCHING THE VIDEO WHERE HE SAYS COMPILERS HELP

@PhilippeCarphin 4 жыл бұрын

"Before, we used programmers to save on computing resources, now we use computing resources to save on programmer time." Don't remember who said that. And like the others say, that doesn't mean #2 should never be used, just that you should only use it for measured bottlenecks that are worth it. Even if you identify a bottleneck but your program doesn't need to be fast, then you still don't need to optimize.

@Louis_Marcotte Жыл бұрын

I didn't even know branchless coding was a thing. I just assumed "if" was necessary to test a condition and didn't really question it, seemed reasonable enough. You've opened my eyes. I'm not gonna use branchless coding yet, most of it flew WAY over my head, but knowing why it's bad is still good

@LocrianDorian Жыл бұрын

It really isn't that difficult if you think about it, for comparison replacement, a>b itself is a mathematical expression which returns 1 if it's true, so if you do x=a*(a>b), that will mean x=a if a is larger than b. You can use that same principle for anything. That said, I think the main lesson learned from this video is that you shouldn't try to be too clever because most of the time the compiler will do a better job than you ever will.

@Louis_Marcotte Жыл бұрын

@@LocrianDorian ye nah I get the algebra, I just don't understand how converting "if(a>b) return a" to "return a * (a>b)" removes a comparison, you're still comparing a to b

@LocrianDorian Жыл бұрын

@@Louis_Marcotte It doesn't remove the comparison, it removes the branch. The thing that slows down the CPU is not the comparison itself, it's the "if" statement.

@ChrisStavros 2 ай бұрын

@@LocrianDorian He's right, you're wrong. Eliminating jumps to minimize cache miss is not what branchlessness is about. Whether you're using an if or multiplying with the result of a comparison, you are creating a branch for the CPU's branch predictor: either the comparison will be true or false, and your multiplication result depends on that. It will either predict the branch result correctly or not, resulting in either a speedup or wasted work. (a * (a > b)) creates a branch for the branch predictor no different from an if.

@bidanfullko1 2 ай бұрын

However, if the data is predictable, we can use indirect addressing, hash tables, and other branchless techniques. This is likely the limit of optimization in modern contexts.

@duncanbeauch9598 Жыл бұрын

I remember learning all these topics knowing it would speed up code but I didn’t know to what degree. It was very informative to walk through the same example for direct comparison

@Bregylais 4 жыл бұрын

Somebody gift this man a mouse-pad.

@vinzer72frie 4 жыл бұрын

"The CIA wants you to believe its only if then else if then else"

@gumbo64 4 жыл бұрын

Run over all the glow-in-the-darks

@anant6778 4 жыл бұрын

The FBI wants to have a word about switch statements

@VulcanOnWheels 4 жыл бұрын

What is this quote from? If you quote something from the video you're commenting on, then please include the time where you found it.

@SiisKolkytEuroo 4 жыл бұрын

@@VulcanOnWheels kzbin.info/www/bejne/baavq5SBob-Gh7M it's in that video at around 5m 40s

@SiisKolkytEuroo 4 жыл бұрын

But it's worth it watching the entire video because Terry Davis is a legend

@markwilliamhumphries 3 жыл бұрын

You can toggle case on an ASCII alpha (from lower to upper, or vice versa) by xoring it with the space character (i.e. 0x20).

@Reginoid Жыл бұрын

yes, he did it with -32 (0x20) in that example

@adamp9553 4 жыл бұрын

For unsigned ranges like 'a' to 'z', a smaller range check I use is (value-lower+0u>range), where 0u forces the comparison to be unsigned and range would be the 'z'-'a', that any overflow would also result in an unsigned value out of range; instead of two compares where both values would have to accounted for, one subtract and one compare.

@RMLLcrazy 4 жыл бұрын

This seems like a common enough thing that the compiler should be able to provide the optimisation. Isn't that the case?

@tolkienfan1972 4 жыл бұрын

I like explicit casts in such cases, with comments so people reading the code have no cognitive load

@rinrin4711 4 жыл бұрын

It will probably be slower since you changed one comparison and logical operator to one subtraction. Comparison is a subtraction without memory assignment, so cmp is faster.

@JayVal90 4 жыл бұрын

I used the vector technique for a mapping algorithm at my last job, using the Accelerate framework on iOS. Got a HUGE performance increase, although I never thought about the branchless part of it. I was just happy that I had an algorithm that could convert tens of thousands of lat/lon pairs to rasterized mercator tiles of 32-bit indexes into an array of values for each mercator tile pixel in the same time that it took the naive, pointer-chasing for each vector method to process 10 lat/lon pairs. I was working with 300x speedups.

@flossless200 4 жыл бұрын

Cool story bro

@arnoio8355 4 жыл бұрын

You must slay at parties

@patrickthepure 4 жыл бұрын

I observed a similar situation in BVH tree traversal. For a raytracing program, my first approach was, storing BVH data with pointers and dynamic memory, and traversing it recursively. It was a very elegant and OOP styled code. Then I had to make it run on GPU, so re-made it in a branchless way, I stored all its data in contiguous memory, and used stackless BVH traversal algorithm. Yep... 100x speedups when run on CPU, 1000x speedups when run on GPU.

@fyfaenihelvete 4 жыл бұрын

That's pretty cool

@JayVal90 4 жыл бұрын

Garfieluigi Yes. Especially parties full of people with a high appreciation for the technical.

@rawbits Жыл бұрын

I love the SIMD part. I never got deep into it, sadly. I believe there is one more thing to mention in correlation with the topic: branch prediction. Often, a simple if statement is used to iterate through values and perform different actions based on them. You might find a significant improvement simply by sorting the data so branch prediction can kick in and lessen cache misses. It would be interesting to see all you mentioned to perform in this scenario.

@bobbobbity463 4 жыл бұрын

Very interesting video. Another technique you didn't show (and is a lot easier for beginners) is caching. In the case of toUpperCase, you could prepopulate an array of 255 elements filled with zeroes and a couple of -32s. You sacrifice a bit of memory for great gains in speed. Also, depending on the language you use, you can store function references in an array and call funcArray[a>b](a,b) to reduce the amount of branching. Or even funcArray[a%4](args) for more options.

@Hydrozoa 4 жыл бұрын

Interesting video! I'll trust the compiler to optimize my readable code over trying to outsmart it, though :b

@matsv201 4 жыл бұрын

Well... I think the ultimate lesson is to speed test your code

@nguyenvu8262 4 жыл бұрын

I think there are many things to do to optimize speed. This should not be high on the list but it's good to know that it's there. If a code is branch-heavy, this probably would help more. Speed up 1ms code 1000 times doesn't have that much impact on performance as a whole.

@judewestburner 3 жыл бұрын

CPU's are cheaper than developers

@jimwinchester339 11 ай бұрын

Some of the best branchless code I've seen, back before the SIMD instruction families came along, involved doing the key comparison, and then using SBB to extend the carry flag into . If was AX, CWD would extend that into DX. Then is DX was complemented, you'd have two mask registers, one guaranteed to be all 0's, the other all 1's. Then each alternative was calculated: one masked w/ AX; the other with DX. Finally, DX was OR'd into AX for the final result in AX.

@NetAndyCz 4 жыл бұрын

Interesting stuff, I never dug too deep into programming so I never even saw the instructions that go the CPU/GPU, would love to learn more about the actual assembly code

@johnnisshansen 3 жыл бұрын

In these 19 minuts of video my computer could raise to uppercase all books written the last 10 years.

@zwz.zdenek 3 жыл бұрын

Sounds like you have a pretty neat OCR and also pirated a lot of books. But really, unless your corpus is already in plaintext and loaded in RAM, other bottlenecks will make the conversion speed irrelevant.

@johnnisshansen 3 жыл бұрын

@@zwz.zdenek well, my point was that computers are cheep and humans rarely have to care optimizing the code. But ok reducing branches can sometimes be worthwhile.

@A.Martin 3 жыл бұрын

@@johnnisshansen its most important where massive amounts of calculations are being done, like I dunno, finding large primes to have very optimized code. For regular programs less effort is put into optimizing now than was done back in like the 80s or 90s, because computer power is so much higher that it is usually not needed. They do stuff near enough instantaneously anyway.

@TheSkepticSkwerl 3 жыл бұрын

Never assume optimization is achieved. It comes down to your use case and needs. If your need to look through 50gb databases or you need to talk to 100k computers any and all tricks can help. Granted when talking to 100k computers, network is a big bottleneck. But I've been able to take 20 day scripts and convert them to a few hours. Optimization can and will matter in certain scenarios.

@Argosh 3 жыл бұрын

That kind of thinking lead us to page loading times that were worse than what we could achieve in the 90ies...

@joekelly7505 Жыл бұрын

This approach makes 2 assumptions: 1. that the code is running on a deeply pipelined architecture in which the pipeline gets squashed at every decision, and 2. that compiler optimization isn't already adjusting for 1. The first assumption isn't bad, necessarily, but I've found when using a higher level language, that making assumptions about the underlying architecture (esp. without taking 2. into account) can lead to unreadable and hard to debug code; the only exception I would make is using the `volatile` keyword to prevent key pieces of shared state from being cached. If I was writing assembly, I'd be deeply concerned about making the code as efficient as possible. But since I write in C, C++, and other C like languages, I've come to trust that the optimizer will emit code that takes the underlying architecture into account. And programming in high-level languages we have enough troubling wrestling with algorithmic / asymptotic complexity (e.g. avoiding O(n^2) loops etc.)

@GuruEvi 4 жыл бұрын

For the conversion from upper to lower, wouldn’t it be faster to just do an AND on the byte with a register with the value of the inverse of 32 in binary (6th bit set to 0). So loop over value && 11011111; depending on your architecture this conversion could be done in a single operation.

@dimdimich 2 жыл бұрын

String may contain non-alphabetic characters and they should not be 'converted', i suppose.

@Tolg Жыл бұрын

This sort of optimization is very hard to get right and often leads to premature optimization. For tight loops on higher level languages, you should check your byte code. And for lower level languages you’re the disassembly. However, don’t forget about the third thing…. The CPU execution pipeline. There are also many optimization each CPU can take. So while one disassembly looks slower than the other, it may not always be true as the CPU might be able to optimize one better than the other. So it’s very tricky and really ends up, requiring a lot of testing on specific CPUs. I wish there was a platform we had where are you can benchmark your code against hundreds of different CPU types (with emulation) and see the results.

@supericeg5913 Жыл бұрын

Thanks for showing this! I now use this every day in my game in C and it's a great extra bit of fun problem solving!

@Toksyuryel 3 жыл бұрын

While these techniques aren't very useful for most programmers, they are extremely useful for compiler programmers. The more patterns they can recognize and replace, the better everything is for everyone.

@vladimirarnost8020 Жыл бұрын

Or library developers, e.g. providing the fastest primitives like memcpy, strcpy, strlen, etc. is essential for C/C++ performance. However all the ugliness and complexity should stay locked inside the library implementation and covered with extensive unit tests and benchmarks.

@peterSobieraj Жыл бұрын

I disagree a little. Thinking that most programers shouldn't care about a performance lead us to situation today. That you need 2x2GHz CPU just to start windows. Just to display desktop background image, some icons, and do nothing while waiting for user to click something. And 10 seconds of website loading is considered normal. And fast typing programmer can type faster than average IDE can display text.

@Toksyuryel Жыл бұрын

@@peterSobieraj The goal is that top level programmers shouldn't *have* to think about performance because the low level stuff takes care of it for them. I agree that we aren't there yet though.

@stuffedk 4 жыл бұрын

I would be very interested to see a video on the low level differences between x86 and ARM. Especially when it comes to SIMD and such

@WhatsACreel 4 жыл бұрын

They are very different! I'd love to make some ARM ASM videos, the Neon instructions are great :)

@SerBallister 4 жыл бұрын

ARM32 (not sure about newer ARM64) has a really interesting instruction set that encourages branchless programming. You can predicate every opcode on a flag condition (e.g. LE/GT/EQ/etc).

@totalermist 4 жыл бұрын

@@SerBallister That's still a branch, though - just merged into an instruction. That's also why ARM64 got rid of them and only supports them explicitly with branch, select, and compare instructions (for performance reasons, no less!).

@SerBallister 4 жыл бұрын

@@totalermist I thought they differed by not using branch prediction so did not come with misprediction penalties. The none executed opcode behave more like NOPs. I didn't study the internals that much. Its an interesting instruction set with some interesting opportunities for optmisation

@Gonzakoable 4 жыл бұрын

@@WhatsACreel Bonus points if you make an ASMR version of the vidz

@DeadCatX2 3 жыл бұрын

Two things I would add. 1) Conditional move et al still create dependency chains which can slow down throughput. 2) Speculative execution vulnerabilities like Spectre, as well as side-channel attacks on the execution time of a piece of code, can make branchless algorithms more useful.

@BharatAcharyaEducation 3 жыл бұрын

Dear teacher, Respect! Awesome delivery. I teach microprocessors and may want to argue on the overhead compared to the low loss in productivity around this topic. Pentium introduced branch prediction and the higher processors made it even better. Modem scenarios give 98% accuracy in branch prediction algorithms. So all we lose is basically the flushing due to 2% of the rogue code. Isn’t a whole new way of writing code? To be honest, the first example of smaller number would actually take more time to execute due to the multiplications involved. Yeah I teach ARM7 too and it incorporates conditional branching along with every instruction but just branch instructions. We have conditional mov, conditional add etc. it does make it more efficient no doubt but unless there are too many rogue singular branches that fail during branch predictions, the overhead doesn’t weigh up with the loss of efficiency. Hope I’ve made my point clear. Nonetheless, taking nothing away from your awesome work. Keep them coming. Best wishes, always.

@AndrewOxenburgh 3 жыл бұрын

I'm a long time programming practitioner, and I loved it. I'm at the other end of the chain, if you will. Python, JS, various DB's and assorted Web Technologies, but this is really interesting. Haven't touched 86 Assembler since the 90's, so that was interesting. Loved the SIMD stuff. That's all new. Liked his points about premature optimization. The keyword is *premature*. You don't ignore optimization, but be wary of it

@EliKennedy Жыл бұрын

I'm a javascript developer and I feel like a child that's accidentally gone to university instead of kindergarten

@scene2much 3 жыл бұрын

You had me at :"Assembly Code". Subscribed. This is a great topic for driver, and library, and bottle-neck algorithm implementors. idea zero: surface atomic conditionals e.g. 'mov if " from machine-code up to the high-level language (yes "Branchless C/C++", "Branchless Python", etc) If the compiler (and libraries) aren't implementing branchless methods, you may be simply optimizing one engine on a 10 engine super tanker. branchless compilers would deprecate (warn/error) branching where a branchless construct will do. If your compiler pushes you to code branchless, and the Standard Template Library, and other supporting libraries are branchless, then all the code will be pulling to the same drumbeat. With or without a "branchless compiler": I'd never code this way for the prototype. I don't like the idea of "try it and THEN see how much it helps". I prefer to respond to ALARMS, or a post prototype-review of code with a "branchless" eye. Your code profiler should be able to let you know if the cache is being blown and reloaded too often. If they don't, that's a market opportunity.

@donald-parker 4 жыл бұрын

I thought it was funny when he said "The compiler couldn't figure out what we were trying to do". Same as most people trying to read that code. There are lots of compiler tricks for optimizing - removing branches is just one of them. And optimizing for speed is not the only type of optimization a compiler can do. My own intuition is to strive to make your code readable first. If you are not meeting speed requirements, reconsider your algorithm first. As a very last resort, consider using assembler. I have trouble imagining a scenario where trying to come up with some contorted approach in a high level language that is hard to understand, maintain, or predict what the performance implications would be (without looking at disassembled code, which seems kind of like "why bother") would be the preferred approach. BTW - the same high level code can be compiled into different native instructions for various reasons (e.g. different compilers, different target processors, different compiler options, ...), so it seems like you would need to buy into disassembling your code EVERY time one of these variables changes to see if it is still optimized. Of course, it is not uncommon for the writer of the code to not really know when one of these things changes in their build environment. In general, I would say if you start trying to second guess your compiler, you are at best wasting your time, and at worst, creating a maintenance nightmare.

@Antoinetheman 4 жыл бұрын

To be fair, I think all of these points are taken into account by the video. He's not saying you should do this stuff everywhere, only in cases where you need extreme speed, and especially in SIMD scenarios. Nobody's doing SIMD unless they need speed.

@georganatoly6646 4 жыл бұрын

I believe you may have meant to say assembly instead of "... consider using assembler" to refer to an assembly language.

@shiuido359 4 жыл бұрын

Remember that not everyone is writing code for webservers and PCs. If you work on high level applications, the it's hard to imagine, but there are people writing for low level applications where cycles and bytes are scarce.

@christopherjoseph651 3 жыл бұрын

@@shiuido359 EXACTLY! 99.999% of people that write code are obtuse to the fact that low level programming is still relevant and necessary in certain applications. The live under the belief that assembly is old, out dated, incapable and doesn't need to be taught because they learned java or python.

@siriusblack9999 4 жыл бұрын

i would like to add that, from my testing with actually benchmarking branchless code, especially with "trivial" examples like the smaller than function you used as a first example, and with compiler optimizations enabled, there's actually a lot of cases where branchless is "at best equal" to branched code, and it'll even flip entirely based on the compiler used. staying with the smaller than example, without optimization the branchless version is 10% slower in clang, but 10% faster in gcc, while with optimization they're exactly equal. The second example works mainly because the branchless form can be vectorized, such that the 1024 elements can be processed in just ~300 cycles. I've tried to apply this branchless approach to my own project (doing some work with big integers. I changed only the division operator to work branchless) and it increased overall compile time of my tests 10-fold (from about 20 seconds to over 3.5 minutes) and nearly quadrupled the runtime for my long-running testcase, calculating 10, 8-byte prime numbers, from 2259 ms to over 8000 ms. Trying a branchless approach has singlehandedly undone all of the optimization i've done since first getting it working, fortunately i have backups. The reason for this is that, while it's true that mispredicted branches are slow, it only takes about 10-20 cycles to flush the buffer and switch branches, while on the other hand branchless needs to perform the work required to evaluate both branches, plus some extra overhead, so even for trivial situations it's quite difficult to actually get benefit unless your branchless code allows the compiler to vectorize the work. which is what happened for the ToUpper function. In more complex scenarios it only gets worse, since if there is more extensive work that needs to be done in each branch, the work for both branches needs to be fully executed before the return value can be determined. There are some genuine use cases for branchless programming, as noted in another comment it may be important in cryptography despite being slower, and there are situations in which branchless is actually faster, but unless you can confidently say that the branch predictor mispredicts more than normal, or the branchless form will benefit from vectorization, it will usually perform equal to branching at best. CPU's are sufficiently complex beasts that unless you actually benchmark what you're trying, it's really not possible to predict what will perform better, for instance the speed gain in the ToUpper function from earlier flips with -O1 on GCC, becoming 50% slower than the naiive branched version, whereas phrasing the condition as a binary and rather than a logical and is 20% faster.

@luis_musik 3 жыл бұрын

the places where branchless is faster is usually because it allows the compiler to see optimizations (usually SIMD) that it otherwise wouldn't have found. iirc -O1 doesn't do SIMD optimizations so that's why the results flip. personally I've seen cases where replacing branches with boolean checks allowed the compiler to do some pretty crazy things like replacing entire loops with bitwise operations resulting in over 100x speedups

@kaiduwu 3 жыл бұрын

Holy fuck I started reading and hit read more, what a fucking paragraph

@MariusSchwendtmayer Ай бұрын

In embedded systems I often inline boolean conditions as ints and use 2D matrix of function pointers so: true + true for the "x" index and true + false for the "y" translates into executing the code at index [2, 1]

@nerdydrew6818 3 жыл бұрын

this reminds me so much of how The Cherno explains c++. Good thing to know that the compiler at times knows best :D

@WhatsACreel 3 жыл бұрын

Cherno is a legend!! We certainly share the accent :)

@omri9325 4 жыл бұрын

Should've mentioned loop unrolling which is also a kind of branch removing method.

@programaths 4 жыл бұрын

He did, but it was really swift! "Well, there is always the end of the loop except if your code [move and shows code repeated]".

@nmeans73 4 жыл бұрын

Loop unrolling is very powerful, but the technique is limited to applications where the loops have a know/fixed iteration count at compile time. Loops with a dynamic iteration count or count based on the size of a dynamically sized container (like a vector) cannot be unrolled since the compiler has no way of knowing how many times to copy the code in the body of the loop. Luckily, the branch prediction hardware in modern CPUs is really good (like 90+ % hit rate) at predicting the correct branches for looping statements since loops are such a commonly encountered structure.

@BitwiseMobile 4 жыл бұрын

Crap, should have read comments *again* . I had the same comment. It's very much a branch removing method which was usually the reason to do it. You removed the branching and compare at the end of the loop which would shave several clock cycles. The draw back is more code bloat though and if you have limited memory that could become a real problem after awhile. I remember spending weeks trying to tweak a mobile port of some Tom Clancy game (this was back in '01/'02 and I can't remember the title). I kept running out of memory and I ended up having to roll back up some loops :). I had to make some other "creative" changes in the game play to compensate for some frame stuttering caused by that, but in the end it was worth it because the game just didn't want to fit on that device despite me having at least 10k overhead beyond the device limit. This was a problem with a lot of early devices. They might have say 200k on the spec sheet, but there is some process somewhere using 10k of that and so you really only have 190k. Yeah, I had to work with that much RAM at times on some form factors and the concept of phone manufacturers was it's a phone first before anything else, so calls always have priority. This could reek havoc on your carefully planned memory map when the handset gets a call during the most memory intensive point. We would have some weird bugs for sure and they almost always were RAM related. When Apple invented the iPhone I knew mobile gaming was going to change forever and in a good way.

@BitwiseMobile 4 жыл бұрын

EDIT: I want to say it was Splinter Cell maybe? That could have been a later port, but I want to think it was Splinter Cell. If you played that on Qualcom handset or more specifically a BREW handset - Kyocera, LG, and a few other manufacturers were still making BREW phones back then - I might have been the one to port it, so you can blame me for any shitty gameplay as a result :P.

@Quarky_ 4 жыл бұрын

One of the reasons people should try to use ranged for. However I find it hard to move away from indices if you are doing some mathematical operation, particularly when the index is part of the calculation. Probably choosing a better data structure (maybe a struct) instead of pod would help here, but that has its own tradeoffs.

@beautifulw999 Жыл бұрын

Absolutely fascinating. I had never, ever heard of this technique. Thanks for posting!

@teropiispala2576 3 жыл бұрын

Nice to find out that someone is still interested about performance. I started serious programming in 80's and did lots on embedded assembler programming in 90's. Computers are fast today but still it feels crime to write slow code. Have given up assembler for compatibility and readability reasons but with time critical code always check compiled results and make performance tests. Parallel pipelined execution is so complex nowadays that sometimes it takes long to understand why some implementation works as fast as it do. Sadly web based techniques and scripted languages spreading into areas they shouldn't. For many modern programmer, any operation is fast if it takes less than second even if being done right, time could be nanosecond level.

@ImperatorZed 2 жыл бұрын

Optimizing isn't always worth increased complexity, so it's often a tradeoff choice. Benchmarking seems to be the way to go, I need to learn how to do that.

@teropiispala2576 2 жыл бұрын

@@ImperatorZed I’m not talking about hacker optimizations but higher level architectural choises. Web technologies I mentioned, are extremely slow compared to natively executed code which is tailored into purpose. They also create hidden complexity, which makes the code unreliable. Applications are being made using complex ready made components which functionalities don’t typically fit well on purpose. They bring huge amount of new failure paths, which are mostly unhandled, or make proper error handling extremely complex. Programming errors are typically caught as runtime exceptions. Components have complex dependency network on other components, which are constantly upgraded, partly because security vulnerabilies from functions which are not needed in most cases. Full system is never tested and applications work only in happy-day scenario principle. Often developer’s answer for a problem is, try another browser or it works in my machine. So, my point is, mostly self-written functional code is several orders of magnitude simpler and faster when written in traditional C or C++ compared to current web technologies. The evilness lies in easiness to use those complex components and difficulty to use simple approach. You use what you got works well in this.

@atackhelikopter4303 Жыл бұрын

13:30 improvement idea: use an array that has the length of 255 and place basically the whole ASCII table there but make it so that the portion a-z has the values A-Z this should make it faster because you aren't comparing anything, you just replacing information with existing one

@shinyhappyrem8728 Жыл бұрын

*256

@atackhelikopter4303 Жыл бұрын

@@shinyhappyrem8728 it stops at null so you never have to take into account the null character lol

@godnyx117 Жыл бұрын

Thank you! I'm writing a low level library to use instead of libc for my programming language and this tricks will come in handly!

@SaidMetiche-qy9hb 4 жыл бұрын

Really interesting, optimization is often neglected

@svnhddbst8968 4 жыл бұрын

it's fun to look into python for optimization. sometimes the most readable code is only mediocre for timing, where the really verbose code that could be 3 times as long and look like a foreign language could be three times as fast.

@yScribblezHD 4 жыл бұрын

svnhddbst I think optimizing python code is more or less a pipe dream, considering anything highly performant should just be written in some other language. Writing python code that's short and simple is what makes it fun heh.

@xarlysx 4 жыл бұрын

When I started learning four years ago, I had this constant feeling that there was something wrong with conditionals but I didn't know enough to be able to even ask the question. And now finally I have my answer.

@thomasmaughan4798 3 жыл бұрын

There is nothing wrong with conditionals. The author has shown a way to HIDE the conditionals from the source code but the fact is a conditional is inescapable in the object code. It is also very fast, typically 4 clock cycles to execute a conditional. The problem with JUMP instructions is the difficulty of debugging, and removing obvious branches from source code simply makes it harder to debug. Optimizing your code requires some knowledge of how long each instruction requires; logical instructions (and, or, compare) typically are very fast requiring 4 clocks. Arithmetic instructions add and subtract take considerably longer, up to 100 clocks, and multiplication becomes non-deterministic but can take hundreds of clocks. So using multiplication to avoid a simple 4 clock compare instruction is incredibly inefficient. It is this inefficiency that led to the development of "Branch Prediction", suppose there's a multiplication followed by a conditional. It is going to take such a long time to calculate the item to be compared that the processor supposes that the compare could go either way and starts loading the instructions from main memory that follow both possible branches. The CPU is vastly faster than main memory. Then when it gets there, figures out which branch to take, discards the branch not taking but the discarding takes zero time as far as impact on your program. In that scenario, using the simple approach MIGHT not even be faster since is means more of your program is going to be waiting on main memory anyway. But the CPU will certainly run cooler and consume less electricity if you use simple instructions rather than these hugely complex instructions.

@oliver_twistor Жыл бұрын

I know this comment is old, and that you probably have learned a lot more since you wrote it, but be aware that this micro-optimisation is for specific usecases only. For the vast amounts of production code you'll write, if statements are much better. Your coworkers will thank you for it, because they won't have to spend their precious time trying to understand what you have written. I have worked with many "clever" coders and I usually end up deleting it all and rewrite it myself, because I can't be bothered taking precious time trying to understand what the code does.

@bluehornet6752 3 ай бұрын

I don't about going all-branchless, but this definitely shows me a couple of things: First, readability means as much (if not more) than speed, in many cases; and second, the C++ compiler can generate better assembly code than I can.

@joshuap7406 3 жыл бұрын

I never liked the if statement, but this takes it to a whole new level...

@CottidaeSEA 3 жыл бұрын

Conditionals are always necessary. The difference is that branchless programming doesn't create two different scenarios. return (a * (a > b) + b * (a

@KuraIthys 4 жыл бұрын

Ah, branch prediction. Yeah, I see. I guess I spend too much time with ancient hardware that doesn't do anything fancy like that. You do a branch on a 6502, the CPU does that pretty quickly. Then again, since there's little to no pipelining (there is some limited precursor to a pipeline which is why the cycle counts are so low; basically it does instruction decoding and execution in parallel for multiple instructions), no cache, and nothing like branch prediction... A branch is about as fast as anything else... Ah well. The 6502 is an evolutionary dead end. Not because it couldn't have been extended to higher bit depths (65816) and faster clock speeds (current record for a 65816 FPGA core is 280 mhz, roughly). No, the problem... Which can be seen indirectly with the whole Zero page/Direct page idea, is that it's a Register + Memory design. If an instruction takes two operands, in almost all cases one of them is from memory, and one is a register. It also does single cycle memory access. (compare with the slightly later and more prolific 68000 which takes 4 cycles to access memory) For a given CPU clockspeed, the 6502 requires memory that can be accessed in one cycle. The 65816 does even worse, demanding half cycle access. On the face of it there's nothing explicitly wrong with this design. The problem is that the trend in technology (even to this day) is that CPU performance has outpaced memory performance. (at least, for affordable memory) This is why the CPU cache is a thing. Not because anyone especially wants the added complexity, but because a small amount of very fast cache memory + much slower main memory is much more practical and affordable than having your entire memory system operating at speeds equal to or faster than your CPU... Branch prediction I guess is a side effect of deep CPU pipelines, which in turn are a way of reducing instruction cycle count without actually making instructions faster. Want a single cycle multiply instruction? Well, either you make some complex circuit that can actually do that in a single cycle, or you build a pipeline where 10 instructions are in different stages of execution at the same time. The net effect is that each instruction takes 1 cycle... ... If the pipeline doesn't stall... At this point I'm curious what the maximum speed of a CPU would currently be if it had no pipelines and no cache memory. I suppose you can predict that from memory speeds. If we can deal with the logic used for DDR, then I guess we get something akin to 2-4 ghz to play with, so it could be worse... But there's almost certainly some major caveats involved here...

@m.p.jallan2172 4 жыл бұрын

KuraIthys good post, i am a 6502 hobbyist and watched this video for a laugh at the semantics of "branch less programming", i get the feeling tools are so easy now programmers like to seek some form of division from the less technically inclined people who write software.

@Badspot 4 жыл бұрын

It's not just memory bandwidth, it's latency. A CPU with no cache today would be catastrophically slow because an L1 cache read takes ~1ns and a ddr read is ~100ns. I've actually wondered about doing the opposite, making a CPU with huge amounts of cache that only uses main memory to load entirely new programs. Never have to leave the cpu die.

@yScribblezHD 4 жыл бұрын

If you're talking speed of a single instruction without pipelining, of course it wouldn't change from its pipelined version if you run the instruction with the pipeline flushed. Pipelining increases throughput, but there's no reason to expect the latency to be any different.

@KuraIthys 4 жыл бұрын

@@Badspot Good point. I'm certainly no CPU designer, so I'm sure there was more to it. Though as to your idea... With a large enough cache you basically end up back where you started. If your cache is so large that you never need to access main memory, then what you've basically done is turn the cache into the system memory, and system memory into something akin to a RAMdisk... 100 ns latency... hmmh... I admit I didn't think it was still that bad... I mean, a Super Nintendo was already demanding ROM access latency of 110 nanoseconds or more... That would imply for mainstream RAM we've made no real progress with access latency in about 25+ years...

@tatianabasileus 4 жыл бұрын

@@KuraIthys Some high-speed DDR3 and DDR4 can reach 10ns latency for the first word, though delivering the whole cache line can still take 90-100ns.

@paulkossik Ай бұрын

I remember in retrospect that the main influence on my first code was whether I could read it. I was taught to design first in pseudo-code and my code looked very similar to my sketch. This video is about making your code branchless, but it displays pretty well how non-beginner code looks. The main influence on individual lines is scraping against the limits of the hardware.