Branchless Programming: Why "If" is Sloowww... and what we can do about it!

1,423,642 views

Creel

Support What's a Creel? on Patreon: / whatsacreel
Office merch store: whats-a-creel-3.creator-sprin...
FaceBook: / whatsacreel
In this video we look at branchless programming. This is a technique to gain speed in our high- and low-level programming by avoiding branching code as much as possible.
Software used to make this vid:
Visual Studio 2019 Community: www.visualstudio.com/downloads/
OBS: obsproject.com/
Davinci Resolve 16: www.blackmagicdesign.com/prod...
OpenOffice: www.openoffice.org/
Gimp: www.gimp.org/

Comments: 3,300
@jeremyseay 3 years ago
I love this, but I'm reminded of Kernighan's law: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
@microcolonel 3 years ago
Brian misses the big-brain move here: just rewrite your code instead of debugging it.
@funkyflames7430 3 years ago
Work twice as long. Checkmate
@rickarmbruster8788 3 years ago
@this senator Sometimes rewriting is a decent opportunity to see that your brain varies in quality of output each time :D
@mr_confuse 3 years ago
@@rickarmbruster8788 I notice that in most of my projects, I do the basic structure, then add more to it, come back to the basic structure, see what a mess it is, Ctrl+A, delete, and rewrite it...
@rickarmbruster8788 3 years ago
@@mr_confuse You'll get the hang of nice design eventually ^^
@RyanTosh 3 years ago
When a normal person says something is slow: 5-10 minutes, maybe an hour or two.
When a programmer says something is slow: 2 nanoseconds.
@AntonioNoack 3 years ago
well, normal people wouldn't think about doing it a billion times ;)
@soblackismyself 3 years ago
When those instructions run multiple times throughout the code, that 2 ns starts to make a difference.
@spicybaguette7706 3 years ago
When you get a cache miss: This little manoeuvre is going to cost us 51 nanoseconds
@nyrahl593 3 years ago
@@spicybaguette7706 for an android that is an eternity...
@SamGarcia 3 years ago
those build up... and then you have Internet Explorer
@Mackinstyle 2 years ago
I LOVE how you introduced the concept by showing a way to manually optimize and how it actually fails to be better than the straightforward if-based version. That's such an important point to share up front any time we talk about clever optimization.
@edwardmacnab354 1 year ago
Except for his last example, which was light-years faster hand-coded.
@damaomiX 1 year ago
This mainly applies to C/C++. For most other languages that don't have these kinds of tricky optimizations, branchless is always faster (JIT languages have JIT optimization, but in real scenarios branch prediction misses a lot). Not to mention that in the video there are circumstances where the branchless version did outperform the branched version.
@ocoolwow 9 months ago
@@edwardmacnab354 Oh, you poor fool.
@go-away-5555 8 months ago
Branchless generally applies when you're talking about assembly. C/C++ compiles to assembly, but so do many other languages, and some languages like C# can be compiled to assembly even though typically they are compiled into a virtual-machine bytecode (using il2cpp and then a C++ compiler).

For a purely interpreted language like Python, branch prediction and branchless code barely come into play for the developer. When you do any operation in Python, the actual Python runtime is almost always running dozens or hundreds of lines of C (compiled to assembly) that aren't related to your code in any way except for ensuring that you're requesting valid operations. And that itself usually requires tons of branching. If you add 2 numbers, the Python runtime treats those as full objects. First it checks that it's not operating on null objects, then it ensures the wrapped value is a number or a type with the addition operator defined. If it is a number, it will call the C function it has for addition. If it's not a number type, it will evaluate the PyFunctionObject representing the Python code of the function, which itself has like 100 lines of C just dedicated to passing the argument in properly as a Python variable. The most optimized way to write Python code is to minimize your time spent evaluating lines of Python code. When running a purely interpreted language like Python or Ruby, the actual operations you request are about 1% of the actual work going on in the CPU.

For a JIT'd language like Java, JavaScript, or regular C#, it really depends on how good the bytecode compiler and the JIT are. There are 2 steps here that can potentially optimize your code: the first compilation step to bytecode can definitely do many of the same optimizations a C++ compiler can, and the JIT can help pick the correct CPU instruction that can handle several equivalent lines of a higher-level language (like the first function example). The Java compiler and JIT are quite good, sometimes better than current C or C++ compilers, at least in synthetic benchmarks of calculating prime numbers, sorting arrays, etc. repeatedly. For example, Java is capable of generating code that uses SIMD or AVX like in the final example in the video.

Basically, it doesn't work in a purely interpreted language because the runtime system itself has many branches. And for other languages, it depends on the code. If you're writing a branchless algorithm, it wouldn't hurt to benchmark it against the "naive" branching version of that algorithm. And benchmarking these system-specific optimizations may even produce different results on different CPU architectures. For example, some DSP devices have no branch prediction because they need strict guarantees on consistent real-time delivery. It's a slower architecture on average, but you can calculate a guaranteed number of clock cycles some sequence of instructions will take. A branch-predicting CPU can't make those same guarantees.
@asaskald 7 months ago
I'm sure he just LOVEs it too. We all LOVE it.
@randyscorner9434 9 months ago
I've designed CPUs for over 30 years and we pay a LOT of attention to breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that what seems to be avoiding branches... doesn't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat the compiler optimization that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such as in GPUs, but even then compilers are pretty smart.

The most efficient code is when tasks are parallelized so that multiple operations can be done based off of one or a small number of decisions. In trying to avoid branches, the cost is paid in non-maintainable code and is very high. If you really want to improve efficiency, don't rely on multiple software packages which may actually include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, Perl, etc.) wastes much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required.

In one example, we recoded a numeric-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive BeagleBone Black instead of a Mac. On the Mac it consumed 35% of the processor. On the BBB it used 15% of a much less powerful processor by vectorizing the operations. These are the sorts of changes that matter.
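A minimal sketch of the kind of recasting described above (function names and data are illustrative, not the poster's actual piano code): a branchy per-sample loop versus a straight-line loop that the compiler can auto-vectorize.

    #include <cstddef>
    #include <vector>

    // Branchy per-sample mix: the if inside the hot loop makes auto-vectorization harder.
    // Assumes all three vectors have the same length.
    void mix_branchy(std::vector<float>& out, const std::vector<float>& osc,
                     const std::vector<float>& gain) {
        for (std::size_t i = 0; i < out.size(); ++i)
            if (gain[i] != 0.0f)
                out[i] += osc[i] * gain[i];
    }

    // Straight-line version: always multiply-add. A zero gain contributes nothing
    // (for finite inputs), so the result is the same, and the loop maps cleanly
    // onto SIMD lanes when compiled with optimization enabled.
    void mix_vectorized(std::vector<float>& out, const std::vector<float>& osc,
                        const std::vector<float>& gain) {
        for (std::size_t i = 0; i < out.size(); ++i)
            out[i] += osc[i] * gain[i];
    }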
@kingacrisius 8 months ago
I mean, he literally looked at the disassembly to show whether the branching and branchless code produced branching assembly. In the first case, he showed that the compiler sometimes understands the purpose of code and makes great optimizations that already remove the branches, but he also showed how in other, more complex code, the branchless code was much faster than the branching code. So I really don't think he or the video missed this.
@lobovutare 8 months ago
cmov is probably not branchless as claimed in the video. On the microarchitectural level it still needs to branch.
@BogdanBaudis 8 months ago
@@kingacrisius The thing is that looking at the assembly may or may not tell you what the CPU actually does. There are some cores (old ADI SHARCs, Microchip DSCs and a few others) that have such a predictable, deterministic pipeline that you can actually count the clocks, and if you couple that to SRAM then you have the whole story. But modern IA64 or Cortex-A cores do so many things on so many levels that, without an intimate knowledge of what the core actually does, manual optimization is just a hopeless exercise. Silicon vendors want their wares to shine, so they contribute heavily to compiler development. Unless you are in the know in these matters, the best you can do is optimize for a particular platform, test it, and then freeze the SW AND HW design. Sometimes this is what is needed and acceptable. But if you think such code will survive on other platforms or on future revisions of a given platform, you may be in for a rude surprise.
@kingacrisius 8 months ago
@@BogdanBaudis Well, regardless, he timed it too. Like, it is so clear when the branchless programming is better, because he timed it and showed how the assembly is faster. I seriously don't understand where all of this is coming from.
@kingacrisius 8 months ago
@@BogdanBaudis And obviously some code may not be portable, but he was coding C++, which is almost always portable. And that doesn't matter anyways for this discussion.
@barmetler 3 years ago
So what we've learnt: If you write normal code and normal patterns that everybody knows, most likely it will be a pattern that the developers of gcc (or other compilers) thought of. That way you have a higher chance of the compiler doing good things.
@anandsuralkar2947 2 years ago
Yup
@insomnia20422 2 years ago
Or when in doubt, just measure the performance and use the thing that's faster.
@QuantumConundrum 2 years ago
@@insomnia20422 Real truth. I tell people on my team this all the time. If it's not hard, actually try it and extrapolate. Rarely (but it happens) you'll get different behavior past a data size threshold. And even then... usually you can infer that (or predict this would happen)
@user-zn4pw5nk2v 2 years ago
All the code shown missed the fact that you could just clear bit 5 when the character is in range, shaving 3+ additional instructions with a well-placed xor eax, 20h (stored in r8d instead of -1, times the char's ID in the multi-data case), since you don't really care about the original data, or you would back it up (the more complex code still used sub 32 and some other data-shuffle junk). So sometimes it's better to assemble it yourself.
@RobBCactive 2 years ago
For the program I optimised, gcc did a good job of making the bottom-level if statements branchless, but it didn't stop my carefully optimised C from being 100x faster than the natural C++ implementation using stdlib. It was well worth increasing the processing complexity.
@szymoniak75 3 years ago
DISCLAIMER FOR ALL BEGINNER PROGRAMMERS! Do not attempt it at home! Unless you really, really have to. In most cases you won't outsmart your compiler. For high level programming, you should focus on keeping your code clean and easy to read and maintain. Usually you'll notice speedup in your code by using correct algorithms and data structures, not by doing micro-optimizations. And if speed is your concern - benchmark first! Do not try optimizing code that may not need it.
@alexreeve 3 years ago
Well, doesn't the first example in the video show exactly that? I mean, that goes for all optimization methods: only use them when you have verified that you actually need them (and then verify again) :)
@davidm.johnston8994 3 years ago
Thank you for saying that. That's what I was thinking, I was very intrigued by all of this additional complexity (from a readability standpoint)... Wondering if I could really outsmart a compiler by doing these techniques "by hand".
@Merthalophor 3 years ago
Donald Knuth: Premature optimization is the root of all evil. This technique is only valuable in highly specific situations. If you're not sure if it applies to you, it doesn't.
@brianbarefootburns3521 3 years ago
This whole video is off limits for beginner programmers and comedy for experienced programmers. That piece of paper under his wired USB mouse is another telling sign.
@LeTonyJr1 3 years ago
Yes, this. How many times the code needs to run is a big consideration. If a function needs to loop through a dataset that numbers in the billions, it’s a good sign that micro-improvements like these might be of interest.
@frogge6443 3 years ago
Your mousepad is a masterpiece.
@robertoaguiar6230 2 years ago
High albedo for those max-DPI specs. Don't tell gamers, or the prices will blast up.
@evanmagill9114 2 years ago
@@robertoaguiar6230 I haven't thought about the albedo consideration there... I've seen a lot of very dark color mousepads... Hmmm But I suppose it would be albedo in the specific range the mouse actually uses. Infrared?
@kaanozk 3 months ago
Don't look at the finger pointing towards the moon or you will miss the heavenly glory
@kylek.3689 2 years ago
Cryptographic libraries also use branchless programming a lot. This is to prevent timing side channel attacks, where the timing of an operation can provide a hint as to what was encrypted.
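For illustration, a minimal constant-time comparison in C++ (a sketch of the general idea, not code from any particular cryptographic library):

    #include <cstddef>
    #include <cstdint>

    // Compares two equal-length byte buffers without branching on their contents.
    // The loop always runs to the end and only ORs differences together, so the
    // running time does not depend on where (or whether) the buffers differ.
    bool equal_constant_time(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
        std::uint8_t diff = 0;
        for (std::size_t i = 0; i < n; ++i)
            diff |= static_cast<std::uint8_t>(a[i] ^ b[i]);
        return diff == 0;
    }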
@Caesim9 3 years ago
I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you use conditional statements, sometimes your code runs faster and sometimes slower (specific branches just are faster/slower), and the attacker can get info on your private key. So all cryptographically sound functions must use branchless programming. Very interesting topic.
@Photo-Jay 3 years ago
Do they use branchless for the most part? Or are they in violation of this?
@georgionic 3 years ago
They could use constant time operations
@flyingsayon 3 years ago
Why would you do that? Can't you use a timer to achieve the same result (not disclosing the computation time)?
@thenameipicked 3 years ago
@sayon It is very difficult to actually accomplish this in practice. If your delay is a random value, then you can repeat your request multiple times and use the average time. Delaying so your total time is fixed is really difficult: You have to know in advance the maximum amount of time your algorithm could possibly take, and even with that knowledge, wait() is complicated and imprecise, and it is definitely possible that what you did before calling wait() will still affect the total time.
@flyingsayon 3 years ago
@@thenameipicked I am talking about a fixed delay, yes, not a random value. Besides, how are you hoping to achieve fixed time by just balancing the computations when the process can be interrupted at any time? We usually have preemptive multitasking in our systems.
@MultiMrAsd 3 years ago
I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU rather than the CPU. GPUs have multiple cores that share an instruction pipeline, where each core runs the same instruction but on different data. That often leads to both sides of a branch being executed and then one being discarded. I have seen performance improvements above 100x by going branchless!
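As a rough illustration of what "going branchless" looks like in that setting (written here as C++ for consistency; real shader code would use GLSL/HLSL and its built-in select/mix functions): compute both candidate values and blend with a 0-or-1 factor instead of branching.

    // Branchless select: both sides are evaluated, then blended by a factor that
    // is expected to be exactly 0.0f or 1.0f. On GPUs, where all threads in a
    // warp/wavefront execute the same instructions, this avoids paying for both
    // paths of a divergent branch.
    float select_blend(float take_first, float if_true, float if_false) {
        return if_true * take_first + if_false * (1.0f - take_first);
    }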
@peterbonnema8913 3 years ago
Didn't know that. Thanks. Good point.
@nistarok123 3 years ago
So, vertex/fragment shaders for opengl, for example?
@Luxalpa 3 years ago
@@nistarok123 Yes, exactly. Avoid branches in GPU code as much as possible. In fact I think there are some shader languages that don't even allow the if statement.
@Kino-Imsureq 3 years ago
@@JeredDanielson me too ;)
@scottmichaud 3 years ago
Note: GPUs these days won't take the performance hit if all threads in a warp (NVIDIA), wavefront (AMD), etc. (others) go down the same branch... whichever branch that is. So let's say you have a complicated material depending on whether or not you're shadowed -- you will only traverse both paths if your warp/wavefront corresponds to a block of pixels that overlaps with an edge of the shadow. If the block is entirely shadowed or entirely illuminated, then it will only run the shadowed or unshadowed path, respectively. This is 32 threads in NVIDIA, 64 in AMD... although they do reserve the right to change.
@vanlepthien6768 1 year ago
As a long time software developer, in virtually every case, code maintainability is far more important than speed. I've written programs that processed millions of transactions - and branching was the least of the performance concerns. Paging and I/O typically are much more a problem. Except in toy problems, memory management will be a much larger concern. Introducing repeatedly executed multiplication generally is not good for performance. "And"-ing 01011111 with ASCII lower case converts to uppercase as well as subtraction, and to my mind, is much clearer. Unless there is a genuine need, don't make your code ugly to eke out a few cycles.
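In code, the AND trick looks roughly like this (assuming ASCII input that is known to contain only letters; anything else would be mangled, which is why a range check or lookup table is used in practice):

    #include <string>

    // ASCII upper- and lower-case letters differ only in bit 5 (0x20), so
    // clearing that bit with AND 0x5F (binary 0101'1111) maps 'a'..'z' onto
    // 'A'..'Z' and leaves 'A'..'Z' unchanged.
    void to_upper_letters_only(std::string& s) {
        for (char& c : s)
            c &= 0x5F;   // 'a' (0x61) -> 'A' (0x41)
    }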
@henrikskott 9 months ago
I agree, except I'm convinced we all need to learn this the hard way. I haven't met any developer who did not try to be clever in the first 20 years of their career. Took me 30 years to start understanding this myself, but I need to really try all the fun things before I do it the hard way.
@gorak9000 9 months ago
@@henrikskott I had a CS prof once tell a story of how he spent so much time trying to optimize the code as he wrote it that it took him weeks to write it, and when he ran it for the first time, the performance was terrible. He ran it through a profiler, and the part he had optimized took virtually no time to execute anyway; it was some other part he hadn't even thought of that was taking all of the time. His message was: "Just write the code sensibly the first time, then run it, and only then, if there are performance issues, do you optimize anything, once you know exactly what part needs to be optimized." That's definitely a winning strategy! Don't waste time optimizing things that don't matter anyway.
@jacobzimmerman3492 9 months ago
Once you understand the cache hierarchy, you realize a heap access can be equivalent to MANY branches.
@akenteva 9 months ago
Good point. I found the video interesting and informative, but part of me was wondering "in what context would these techniques genuinely be useful?". While I'm sure there are things out there that require this level of optimization, typically it makes more sense to prioritize readability. Also, great idea with the AND-ing! Very elegant way to solve it, at least in the case where the string only has letters. If non-letter symbols are present though, you'd need to add more logic to handle that.
@Josus 9 months ago
It could be even better to implement/improve these kinds of optimizations in the compiler itself. That way you can focus on elegant code and gain a boost in maintainability while the compiler does the hard job of this kind of very low-level optimization.
@connoisseurofcookies2047 3 years ago
I remember Bjarne Stroustrup saying something to the effect of 'The compiler will write better assembly than a human ever could', and in most cases, especially larger ones, that holds true.
@christopherjoseph651 3 years ago
The compiler can only write assembly as good as the person who wrote the compiler can write assembly. Your statement is mostly true for people who don't understand assembly and the way CPUs actually work
@CottidaeSEA 3 years ago
@@christopherjoseph651 Let any of the people who made the compiler try to write the same program in assembly as in C++ and see which one performs faster and with fewer bugs. It doesn't matter if you understand a CPU, the assembly code from the C++ compiler will be far better than any manually written assembly. Because the people who make the compiler only do translations and optimizations, they do not write entire systems with it. Because that's stupid.
@dionyzus2909 3 years ago
@@christopherjoseph651 Not really; at the largest scales this statement holds absolutely true. There are limitations to what humans can do; a large project in pure assembly would be just plain impossible for a human to do, it's unrealistic.
@christopherjoseph651 2 years ago
@@CottidaeSEA Did you not watch the entire video? He proved that a human who understands the instruction set of the CPU and who understands the purpose of the high-level code can fully utilize the instruction set and hardware of the CPU. He literally beat the compiler by 45x! If the compiler is so superior, why didn't it write the code using AVX? Exactly: a human wrote the rules that the translations and optimizations follow. So a compiler is only going to write code as good as the person who wrote the compiler. This has nothing to do with program scale. The point is a human wrote the compiler, so the compiler can only write optimized code as well as the person who wrote the compiler. If that person doesn't realize how to perform an optimized task in assembly, then the compiler won't be able to optimize the code either. What, do you think we're living in a world of Skynet where computers build more sophisticated computers and write their own programs?
@christopherjoseph651 2 years ago
@@dionyzus2909 This is a childish statement written by someone who believes in the new education model in which older languages are outdated and inferior and therefore shouldn't be taught. We don't need new, more complex languages to solve new and more complicated problems. To this day, when a program needs to be optimized for speed, it is written by a human using inline assembly. Even the guy who made this video understands that you should look at your disassembled code and be capable of understanding what the compiler is doing. Do you need to write a large program that is not time- or resource-critical in assembly? No, there's no need for that, but the statement that a compiler will write better assembly than a human ever could is COMPLETELY FALSE!
@christianbarnay2499 3 years ago
05:00 The lesson here: never over-optimize your code beforehand. Start by writing natural and simple code. Let the compiler do its tricks. Check the result and only look for optimizations in the parts that the compiler didn't already optimize.
@tetramaximum 3 years ago
I think the lesson is: know your hardware.
@scottmichaud 3 years ago
@@tetramaximum Both are valuable lessons. Another one is "know your data". Imagine a super-inefficient O(n!)-scaling algorithm. If you can prove that your data must always be n
@tetramaximum 3 years ago
@@scottmichaud agree
@SleepyMongoose 3 years ago
Very true, the more generic your code, the more likely the compiler will understand what you are doing and perform optimization.
@scottmichaud 3 years ago
@@SleepyMongoose For the things that it can. It's your job when it comes to effectively using overall data structures, etc.
@programaths 3 years ago
Shaders. Branchless techniques are mandatory in those!
@plazmotech5969 3 years ago
When I first heard people saying "avoid branching in shaders" I had no idea what they were talking about. "How could you ever possibly avoid branching???" I thought. Now I see how. Thanks!
@galandilvogler8577 3 years ago
It must be said, though, that even shader compilers are getting better and better, and you can use simple ifs which don't end up branching. For example, if you write:
    if (x >= 0.8) { y = x; } else { y = 0; }
one might be tempted to optimize by writing:
    y = x * step(0.8, x);
And yet, at least in HLSL, the compiler creates the same low-level code for both.
@plazmotech5969 3 years ago
@@galandilvogler8577 Good to know! But still, best to make sure...
@clepirelli 3 years ago
This is a common misconception. IIRC, uniform branching is 100% fine; the only problem with shader branching is that if a heavy branch is taken by even one invocation, the whole warp will stall. This is problematic, but it does not mean branchless techniques are better, as branchless techniques would just calculate both branches, meaning you'd actually have the whole warp calculate those heavy calculations. See shadertoy (dot) com/view/wlsGDl and twitter (dot )com/bgolus/status/1235254923819802626
@plazmotech5969 3 years ago
@@clepirelli Neat! Thank you
@josiahmanson 3 years ago
A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long, and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.
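A sketch of that idea (illustrative only; whether a given standard library actually implements toupper this way is implementation-specific):

    #include <array>

    // 256-entry table: identity for every byte except 'a'..'z', which map to
    // 'A'..'Z'. Conversion is then a single table load per character, with no
    // comparisons or branches in the hot loop.
    struct UpperTable {
        std::array<unsigned char, 256> map{};
        constexpr UpperTable() {
            for (int i = 0; i < 256; ++i)
                map[i] = static_cast<unsigned char>(i);
            for (int c = 'a'; c <= 'z'; ++c)
                map[c] = static_cast<unsigned char>(c - 'a' + 'A');
        }
    };

    inline constexpr UpperTable kUpperTable{};

    inline unsigned char to_upper_lut(unsigned char c) {
        return kUpperTable.map[c];
    }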
@IceExtremeGamers 3 years ago
Doubt it will work with non ASCII characters, though.
@Henrix1998 3 years ago
@@IceExtremeGamers none of this works for non-ascii
@IceExtremeGamers 3 years ago
@@Henrix1998 sadly
@josiahmanson 3 years ago
@@IceExtremeGamers His whole discussion was assuming ASCII as a toy example. Yes, the small lookup table will not work for general multi-byte Unicode (especially variable length encodings like UTF-8), but Unicode is more complex and none of the other stuff described in the video, like SIMD, would work either. For 8-bit character encodings, the table is actually far more general than anything based on conditionals such as conditional moves. The table approach would also be usable for 16-bit character encodings, as 64 KiB is not a lot of memory nowadays, and the parts of the table that would actually be used in practice would almost certainly fit in L1D$.
@igorstasenko9183 3 years ago
@@josiahmanson Yeah, I was waiting for the author to present a final-final version of the algorithm using lookup tables... but the video ended :)
@RFC-3514 2 years ago
For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong. And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P
@Theineluctable_SOME_CANT 2 years ago
Yep!
@WorBlux 2 years ago
Multiply by 1 or 0 and you get an early exit on modern ALUs. With pipelining and SIMD, you come out ahead if you stack a large enough vector up: 8 ops/cycle + a constant 3-cycle multiply penalty. If you have 800 elements, you are basically doing that multiply in 1/8 of a cycle. So even with 2 multiplies and an add, it's still over twice as fast before any branch time or penalties are involved. And then AVX-512 and SVE2/Cray vectors have predicated SIMD ops where you can create a predicate mask and then conditionally perform a vector op: compare V1 and V2 with LT to create a mask, then copy V1 into V2 using the mask to predicate each lane; the element from V1 is only copied if the LT mask is true.
@Theineluctable_SOME_CANT 2 years ago
@@WorBlux can I hire you on for my next big programming job?
@WorBlux 2 years ago
@@Theineluctable_SOME_CANT Keeping myself informed about CPU microarchitecture and programming is a hobby of mine. Actual programming, not so much. Being outside and producing something I can touch is more rewarding for me.
@Theineluctable_SOME_CANT 2 years ago
@@WorBlux Yes, for me also. Machine microarchitecture is quite interesting to me.
@rorytulip9343 3 years ago
Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.
@andrewsprojectsinnovations6352 3 years ago
"I can teach a monkey to do what you do. What I can't teach the monkey is when NOT to do it." - one of my father's chiropractic professors But yeah. Branchless programming is just one of many tools in our arsenal, not a silver bullet. Having a hammer doesn't make every problem a nail. Same here.
@aviralsood8141 3 years ago
@@andrewsprojectsinnovations6352 Maybe he should have taught your father not to become a chiropractor, much more helpful.
@maxineyoung3236 2 years ago
As a student & beginner, this is certainly correct. Was ready and excited to lmao.
@brandonchurch6025 3 years ago
This is really good fundamental information for some of us "spoiled" high-level programmers who don't typically think beyond compilation
@ponypapa6785 3 years ago
Agreed, but highly dangerous and misleading information for all the others, as a disclaimer that this is basically "experts only" is missing. And seeing what inexperienced programmers try to accomplish, this can lead to a lot of technical debt, as it will be misunderstood, badly used and then mixed with branching techniques. So yeah, fun :D
@unformedvoid2223 2 years ago
It's useful when you do shader programming, where branching is very expensive. I had to «invent» some of those examples by myself when I did shader programming. Great video!
@SeanCMonahan 9 months ago
Same! It was an interesting experience refactoring some branched OpenCL code to use algebraic solutions to replace the branches.
@bandogbone3265 3 years ago
I was doing this back in the late 1970's, on a Cray-1 supercomputer with pipelined vector registers. Complex single-instruction operations were broken down into a sequence of simpler operations, each performed by a separate functional subunit. After one clock cycle, the first functional subunit was free to begin operating on the second element on the vector, while the second functional subunit began operating on the intermediate result in the first element of the vector that was left there by the first functional subunit. Complex operations like floating point division could then produce a result every clock cycle, even though the operation could take a total of six or more clock cycles to perform on a single operand beginning-to-end. Like stations on a car assembly line, the game was to keep as many functional subunits busy as possible at any given time. We would carefully map out each tick of each component of the CPU on a big chart before implementing in assembler. The thing was blazing fast, even by today's standards. Nice vid!
@duanebonas8630 2 years ago
Class explanation. I love it. You know what you're going on about.
@spx0 2 years ago
Great
@treelibrarian7618 9 months ago
this is still the game, with 5 logic and 5 memory execution ports available for out-of-order processing (execution is determined by data dependency order), writing code to keep all of them busy with chains of non-dependent code in each loop to mitigate latency over iteration - although division is one thing that isn't pipelined. I guess division is rare enough that it's not worth the extra silicon.
@danielstory2761 9 months ago
The iPhone 6 is capable of 20 times more FLOPS than the cray-1
@GregMoress 9 months ago
That's called 'Pipelining', and I'm sure you knew that. It's a form of parallel processing.
@MM-24 3 years ago
I want to buy this guy a mouse pad and maybe a second monitor...
@Bobby.Kristensen 3 years ago
He seems to be stuck in the 90s.
@stevecarter8810 3 years ago
Nobody's stopping you
@lour.222 3 years ago
Goes to show it's what you know, not what you own.
@MM-24 3 years ago
@@stevecarter8810 no they aren't, that's good to know
@MM-24 3 years ago
@@lour.222 I think it's a combination; good tools in the hands of a craftsman sort of thing.
@khhnator 3 years ago
I'm surprised by the amount of negative comments. He is not telling people to always code like this... there is a difference between time-critical code and not. Get the damn profiler out and find where your program is spending most of its time, and THEN you consider doing this or not. The only negative thing about optimization is that it costs development time. The idea that "the computers will be fast enough" is just naive. Effective optimization can be the difference between a product that reaches some of the market or most of the market, or a program that can be extended for decades to come versus something that needs to be redone in the next few years.
@PhilippeCarphin 3 years ago
@kingofallcrypto Yeah, I can definitely imagine seeing stuff like this in first-year students' code. In one project, they had to keep track of 16 boolean values on a microcontroller that had way more than enough memory for what we were doing. They wanted to encode them in the 16 bits of a uint16_t (16-bit integer type). It's natural for them to think that it's a good idea, since we did tell them to use uint8_t over uint16_t when the number they needed to store was only going from 0 to 10. But the reason we told them that was mainly as an excuse to talk about architecture, explain that the microcontroller had an 8-bit bus, and discuss related things. It did give me a chance to have a good discussion about balancing programmer effort with execution speed. In the uint8_t vs uint16_t case, there are no downsides and there is a theoretical benefit in speed. In the store-16-booleans-in-a-uint16_t case, it's going to be a complete pain in the a** to store and retrieve the actual booleans (even if we don't consider that the memory saving has a runtime cost), we're introducing the possibility for errors, and we have more than enough memory by a wide margin for that project.
@dincerekin 3 years ago
time critical code should not be run on a micro
@iVTECInside 3 years ago
@kingofallcrypto , I would hope that it would be obvious to anyone that ended up at this video.
@wusulus 3 years ago
You try to make it seem like it is worth it every time, while it certainly isn't. This is the kind of thing you would use if you had used every other trick in the book and need it for some one-in-a-million project.
@webkar 3 years ago
@@dincerekin Think about what you said next time when your ESP/ABS saves you from skidding into a tree.
@Vacuon 2 years ago
"If" is not *necessarily* slow: if your condition is going to the same place 98% of the time, your branch predictor will be very accurate, and the pipeline will rarely be broken.
@musaran2 8 months ago
Even that is not free. Branch predictors tend to have limited memory; relying on it somewhere hurts branches elsewhere. But then cache is limited too, so smaller code is often better.
@iXPilot 8 months ago
Funny that I got a recommendation for this video after watching a few presentations about the Elbrus CPU, which doesn't have a branch predictor at all. Now I understand what kind of hoops the developers of its compiler and the porters of the JVM had to jump through...
@youuuuuuuuuuutube 2 years ago
Branchless works great for GPU programming (GLSL etc). It's actually worth complexifying the algorithm just to minimize the branching (but not always).
@jcm2606 1 year ago
Really depends on whether the branch is divergent or not. Modern GPUs have some level of branch prediction built in and will try their best to only take one path through the branch while leaving the other path alone, so if you know for a fact that the branch is very convergent (ie it's using a uniform/constant value, or it's based on data that's convergent in nature), then there's nothing wrong with branching (except on ancient GPUs). However, if the branch is very divergent (ie it's based on data that's divergent in nature), *then* you should try to implement the same logic in a branchless manner, as the GPU is forced to take *both* paths through the branch, with half of the warp being disabled through each branch.
@StructOfArrays 3 years ago
For whatever it's worth... Smaller_Branchless produces assembly identical to a branched version when using Clang 10 and compiling with -O2. In GCC 8.3, Smaller_Branchless is flat-out worse than the branched version, whose branches have been optimized out by the compiler. In any case, for the branched versions the awesome compiler optimized away the branches. So for the noobs out there: if you're doing something like this, always check the compiler output, and measure, measure, measure.
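For context, the pair being compared is along these lines (reconstructed from the video's function names; the exact bodies may differ slightly):

    // Straightforward branching minimum.
    int Smaller(int a, int b) {
        if (a < b)
            return a;
        return b;
    }

    // "Branchless" minimum via boolean arithmetic. Whether this actually beats
    // the version above depends entirely on what the compiler emits (often a
    // single cmov for the branching version), so check the output and measure.
    int Smaller_Branchless(int a, int b) {
        return a * (a < b) + b * (b >= a);
    }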
@Guztav1337 3 years ago
“In 99% of cases the compiler is smarter.”
@Henrik_Holst 3 years ago
Not only that but Smaller_Branchless is not actually branchless, conditionals are just hidden branches. Writing a * (a < b) still forces the compiler to add a test and jump aka branch to perform the a
@0LoneTech 3 years ago
@@Henrik_Holst No, branches are distinct from conditionals. In this case, we see it transferring a condition flag into a general purpose register value using SETc instructions, because x86 has a condition flag register; not all architectures do. Conditional instructions like x86 CMOV or AVR SBRS operate by cancelling the retire step of a single instruction, which doesn't divert the instruction decoding; that's a cheaper form of branching. What you say would apply to architectures where the condition test is embedded only within branch instructions, as if you removed the CMP* instructions and had to use B* instructions in Nios II. That would be a typical case where a compiler might have a different optimization to avoid the issue, like using the sign bit from a subtraction.
@Guztav1337 3 years ago
@Isaac Beizsley It indeed won't turn your O(n^2) into O(n). But it will fix your branches better than most people in most cases (the example OP made). It indeed won't do any SIMD. However, if you are prepared to go to assembly just to speed up the code, it is worth assuming you know what you are doing. Like in the video, where he does indeed outsmart the compiler.
@Guztav1337 3 years ago
@Isaac Beizsley Yeah that's annoying. I want to say that the world would be a little bit greener if it wasn't for all bad code. lol
@willofirony 3 years ago
Awesome! Branchless programming would be even faster if the CPU had an instruction that filled all the bits of a register with the zero flag. These "computer science approach to C++ programming" videos are getting really good and I suspect are unique on YouTube. You are improving (possibly even saving) our mental health with these conceptual challenges and I, for one, am so grateful.
@WhatsACreel 3 years ago
Thanks mate! I certainly wish the instruction set had a few things that are missing too! Cheers for watching :)
@vyor8837 3 years ago
@@WhatsACreel could ask intel and AMD to add that...
@halbgefressen9768 3 years ago
setz al
ror eax, 1
sar eax, 31
@CarrotCakeMake 3 years ago
Nothing in this video had anything to do with computer science. It was all computer architecture.
@RAMIdotGG 3 years ago
every community needs a Michael Keegan :) refreshing to read a wholesome honest opinion
@Zeuts85 2 years ago
This sort of optimization seems like what you'd do only for high performance real-time applications--like video games for instance--and only then as a third-pass after you'd first written the code in much more human-readable format. My coding bootcamp instructors drilled into me the mantra: First make it correct, then make it clean, then make it fast. Ideally by the time you reach 'make it clean' you've got good tests that pin down the functionality of the code before you refactor. After that you can come up with fast, clever implementations that still pass the tests. You shouldn't ever need to debug the fast clever implementations if you've restricted yourself to good, pure functions and written your tests correctly.
@saudude2174 8 months ago
bootcamp instructors prepare people for no brain web dev jobs. This is something you will never use in such a work environment. It's mostly for advanced, low level programmers.
@insomnia20422 2 years ago
Boss: "What did you do all day?"
Me: "Well, optimizing some code, you know..."
@krytharn 3 years ago
Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.
@abhinoorsinghpannu2993 3 years ago
Same goes for cuda programming also
@WhatsACreel 3 years ago
So true! GPUs hate branching maybe even more than CPUs! They call the conditional execution 'predicate' in CUDA.
@dakrontu 3 years ago
Yeah, but that inefficiency of computing both branches and discarding one is the passport to being able to run the many parallel streams, so you still gain.
@psun256 3 years ago
I never really thought about CPU caching and "predictions" affecting conditional statements, this is super cool!
@konstantinkh 3 years ago
When you get into serious depths of optimization, you start looking at things like branch predictability. Like, there are situations where you choose to write a less optimized version of an algorithm because it has a more consistent depth, and that means the loops are going to repeat a similar number of times, making branches on them more predictable and resulting in faster execution.
@psun256 3 years ago
@@konstantinkh So things like for loops?
@konstantinkh 3 years ago
@@psun256 Yeah. Because on assembly level, every loop is just a branch that sends the execution to beginning of the loop. So the CPU will hit the same branch over and over again as it's running the loop. And modern CPUs have very clever ways to get a heuristic for these branches. Like, if you have a very tight loop in some algorithm that always runs for, say, 10 iterations, your CPU can actually learn to predict that every 10th time it's not going to jump and start caching instructions from the correct branch. At that point, it's like the branch isn't even there! If the number of iterations varies, then the prediction is worse. Best your CPU can do in these cases is "usually, jump is taken," so it will always cache instructions for repeating the loop, and will miss and have to flush the pipeline whenever the loop finishes, which is still better than not having prediction at all, but not optimal. That's one of the reasons why an algorithm with consistent iteration counts can sometimes perform better than algorithm where iteration count varies. Of course, there are also optimizations your compiler can do if it can "guess" the number of iterations, which can make even bigger difference. We are getting into territory of optimizations where you _really_ need to have a good reason for going after, though, as you'll spend a lot of time testing, timing, and looking at disassembly, and you shouldn't be doing that until you've fixed all other timing problems. The only real world example I have where we've spent time going after this kind of stuff is character animation in games. When you have hundreds of bones in hundreds of skeletons for which you have to update the transform matrices 60+ times per second while leaving enough time for CPU to do everything else it needs to, you start looking at things like cache and branching prediction optimizations.
@psun256 3 years ago
@@konstantinkh Huh, that's pretty neat. Thanks for sharing!
@A.Martin 3 years ago
@@konstantinkh It would be great if you could actually tell your cpu which branch to load by default.
@fie1329 9 months ago
I did not expect a character like yours in a programming video! You do a great job of keeping things interesting and a bit funny while making good use of the 20 minutes with practical examples. Really good video!
@SpiritmanProductions 2 years ago
This will seem trivial to some, but anyone planning to do arithmetic with the result of a boolean expression in .NET should bear in mind that 'True' equates to -1, rather than 1 as in C++. EDIT: I'll just add that any non-zero value (positive or negative) equates to 'True' in a conditional statement, and zero equates to 'False'.
@TurboXray 2 years ago
Someone needs to explain why that was chosen hahah.
@Vacuon 2 years ago
-1 (int) is just max_uint, aka 0xFFFF... You can do (boolVal & 1), but it gets tedious, haha. And I've never used a system where true is -1 myself, but if I had to guess, I'd say it's because the designers wanted logical operators to work on boolean values, e.g.:
    not false = true (~0 = -1)
    true or x = true (-1 | x = -1)
etc, etc, etc.
@TurboXray 2 years ago
@@Vacuon I've worked with plenty of systems where -1 can be true, but it's not the ONLY truth value. As in, 0 is false and everything else is 'truthy'. ~(BoolValAsNegativeOne) is just BoolValAsNegativeOne XOR -1, vs BoolValAsOne XOR 0x01, so negative-one-as-true doesn't appear to be an advantage there. I mean, for either method you still have to demote to a bool type to perform the operation, and it's the same number of assembly instructions to invert: load the value, perform the operation. I'm SURE they had a reason for doing it. The thing about having TRUE as -1 means you can't index an array via an evaluation, myArray[a>=b]... well, I guess it works if your language supports negative indexing (like Python).
@SeanCMonahan 9 months ago
Oh, god, I bet that has led to more than a few bizarre bugs.
@vladimirarnost8020 8 months ago
@@SeanCMonahan Not sure about C#, but I wish C/C++ also used -1 (aka ~0) for 'true'. It would make for very practical bit masks. For example, the video uses the slow integer multiplication (imul) instruction where a simple bitwise AND with a proper mask (0x0...000 or 0xF...FFF) would do instead. Conveniently, 0xF...FFF = -1U. Also, 0x0...000 and 0xF...FFF are just complementary values: you get the opposite one just by flipping all bits, and x86 CPUs even have an instruction for that (NOT). ~0x0...000 = 0xF...FFF. All expressions consisting only of 'magic numbers' like this are pre-computed by the compiler.
So if 'true' were -1, the min(a, b) function could be expressed as:
    return a & (a < b) | b & !(a < b);
Since 'true' is 1, we have to flip the sign:
    return a & -(a < b) | b & -!(a < b);
Or leave it to the compiler, because the compiler writers know well how to compute a minimum of 2 numbers properly on your platform. Use 'std::min()/max()' for branch-less programming. The CMOVxx instructions have been with us for a very long time (since the Pentium Pro in the mid-1990s), so they can be taken for granted now. :)
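A compilable version of that mask idea (a sketch of the commenter's expression, with the 0/1 comparison result negated into an all-ones/all-zeros mask; std::min remains the sensible default):

    #include <cstdint>

    // -(a < b) is 0xFFFFFFFF when a < b and 0 otherwise, so the AND/OR picks
    // exactly one of the two operands without a multiply or a branch in the
    // source. What the compiler actually emits may of course differ.
    std::int32_t min_masked(std::int32_t a, std::int32_t b) {
        const std::int32_t mask = -static_cast<std::int32_t>(a < b);
        return (a & mask) | (b & ~mask);
    }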
@C5FlyingSquad 3 years ago
I think more people need to see your videos. Although they're specialised, they feel valuable for non-systems programmers like myself.
@gangstasteve5753 3 years ago
This is a very underrated YouTube channel. I learned assembly from him.
@donjindra 3 years ago
In time-critical code know your cpu. I once rewrote a graphics library in Z80 assembler and sped it up 10x by unrolling loops and counting cpu cycles for every bit of code. There are all sorts of tricks you can use to improve speed as any good programmer knows.
@donjindra 2 years ago
@jshowa o No, tools of the trade don't necessarily make one a good programmer. But pretty much only good programmers know these things. And there is nothing unmaintainable in these designs. Again, one just has to be wise about implementation. But that's the case in all languages, including those that advertise themselves as nearly fool-proof. No language prevents or even hinders unmaintainable code.
@donjindra 2 years ago
@jshowa o Yes, I watched the video. I also have 40 years of experience. You repeat what I've heard all too often. It's true compilers are very good at many things. It does not follow that they are best at everything. Experience has conclusively demonstrated that compiled code is not always the quickest code or the smallest code and there are still cases where maximum efficiency is required.
@rongarza9488 2 years ago
Early on in my programming career I noticed that there was usually a relationship between speed and instructions -- time and space. Plus, you can throw in the time it will take to debug clever code.
@tolkienfan1972 2 years ago
1. Make sure your branch targets are aligned to 16-byte boundaries.
2. If you're comparing to upper AND lower bounds (whether masking or branching), you can subtract the lower bound and do a single compare instead (see the sketch below).
3. When writing assembly, pay attention to instruction dependencies, and interleave independent instructions. E.g. load 1st char, load 2nd char, compare 1st, gen mask, compare 2nd char, and so on.
4. You can often apply these techniques directly in C++, possibly with intrinsics.
... Just my 2c
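A small illustration of tip 2 (using the ASCII lower-case range as in the video; the bounds are just for the example):

    // Two comparisons: lower and upper bound checked separately.
    bool is_lowercase_two_compares(char c) {
        return c >= 'a' && c <= 'z';
    }

    // One comparison: subtract the lower bound first. If c < 'a', the result
    // wraps around to a large unsigned value, so a single "< 26" covers both bounds.
    bool is_lowercase_one_compare(char c) {
        return static_cast<unsigned char>(c - 'a') < 26u;
    }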
@TheAguydude 3 years ago
Note that the branch predictor is generally able to predict branches accurately for loops, since they almost always take the same path. Thus, getting rid of a loop branch is usually not of much concern for purposes of avoiding branch-prediction failures. Of course, there's still value in improving loop iteration code, since it still represents instructions that are run on every iteration.
@TheGandorX 3 years ago
That's why JIT (Java) is such a good idea. The run time / JIT platform applies the optimizations best suited for the hardware.
@garretgang8349 3 years ago
@@TheGandorX Unless you have a limited amount of parallelization that doesn't allow the JIT to run on a separate, physically parallel thread. Then it about doubles your run time.
@daantimmer 2 years ago
Only if you have sorted data. Once you enter the territory of truly random data, the branch predictor practically doesn't exist anymore, since it simply won't be able to predict. Now, some CPUs (don't pin me down on which ones exactly) can execute two branches simultaneously in parallel while still processing the condition. Once the condition's output is known, the output of the correct path is used and the other data path is thrown away. This has quite some prerequisites, though, but it works pretty well.
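The effect is easy to reproduce with the classic "sorted array" micro-benchmark (a hypothetical sketch, not code from the video): the loop below is identical either way, but it typically runs several times faster after std::sort because the branch becomes almost perfectly predictable.

    #include <cstdint>
    #include <vector>

    // Sums only the elements >= 128. With random bytes the branch is ~50/50 and
    // mispredicts constantly; with sorted input the predictor is nearly perfect.
    std::int64_t sum_large(const std::vector<std::uint8_t>& data) {
        std::int64_t sum = 0;
        for (std::uint8_t v : data)
            if (v >= 128)
                sum += v;
        return sum;
    }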
@andywolan 2 years ago
@@daantimmer Ya, I think that behavior was the basis of the Spectre and Meltdown CPU bugs discovered a few years back. (Correct me if I'm wrong.)
@daantimmer 2 years ago
@@andywolan I won't be able to correct you even if you are wrong because I never really dived in to those issues :-)
@volchonokilliR 3 years ago
There are [[likely]] and [[unlikely]] attributes in C++20; they can help in some cases.
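Usage is straightforward (a sketch; handle_rare_error is a made-up placeholder). The attributes don't remove the branch, they just tell the compiler which side to treat as the hot, fall-through path:

    int handle_rare_error(int value) { return -value; }  // hypothetical slow path

    int process(int value) {
        if (value >= 0) [[likely]] {
            return value * 2;                // expected common case
        } else [[unlikely]] {
            return handle_rare_error(value); // expected rare case
        }
    }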
@vyor8837 3 years ago
And that's why you update your compiler and SDK...
@neur303 3 years ago
Also there is profile guided optimization, but sometimes you have paths that are hard to predict, because the patterns are too elaborate.
@user-tr8kr1jd2o 3 years ago
Vyor GCC has had macros for it for years
@leo69105 3 years ago
Have you tried whether these help? AFAIK, the branch hints are x86-only and no longer available in x86-64 mode. Not sure about ARM.
@HermanWillems 3 years ago
@@neur303 For example, I have a branch that says High Alarm (so the x value is higher than y). I as a programmer know that one of the two almost never gets executed. Does this mean I can "guide" the compiler, as in: hey branch predictor, please choose this one first for your prediction!! The other branch will almost never hit. Is this possible? Especially if you need to check this for like hundreds of thousands of signals each cycle? [[likely]] to be lower, [[unlikely]] to be higher than Y.
@sylviaelse5086 3 years ago
In some architectures, multiplication takes more cycles to execute than less demanding arithmetic. In the first example if a < b usually comes out to be the same (whether true or false), then branch prediction will avoid the delays inherent in "if", and the result could be faster than using multiplication. The architecture may even be incapable of calculating the true/false result other than by the use of branches. In any case, unless the function is called intensively, the benefits, or otherwise, of the branchless approach will be invisible.
@notgabby604 10 months ago
You can get 4x or 8x the speed for some numeric code using assembly (e.g. the fast Walsh-Hadamard transform, neural network code). Sometimes the compiler can find better assembly than you can. If you need performant code, try both ways. GCC found much better code than me for this in FreeBasic:

    sub subrandom1(result as single ptr, x as single ptr, n as ulongint)
        dim as ulong ptr resultptr = cast(ulong ptr, result)
        dim as ulong ptr xptr = cast(ulong ptr, x)
        dim as ulong r = &hE8FE90EB, rs, shift = 18
        for i as ulongint = 0 to n-1 step 4
            rs += r
            resultptr[i] = xptr[i] xor ((rs xor (rs shl shift)) and &h80000000)
            rs += r
            resultptr[i+1] = xptr[i+1] xor ((rs xor (rs shl shift)) and &h80000000)
            rs += r
            resultptr[i+2] = xptr[i+2] xor ((rs xor (rs shl shift)) and &h80000000)
            rs += r
            resultptr[i+3] = xptr[i+3] xor ((rs xor (rs shl shift)) and &h80000000)
        next
    end sub
@metamud8686 2 years ago
The primary rule of optimization is left out here, which is to FIRST MEASURE, aka 'profile', to see which part of your code the CPU is spending most of its time in. Once you know where that is, you try to find any algorithm/complexity improvements (take your algorithm from O(N²) to O(N) or even O(1)). Then finally, after all that is done, you may think about 'intervening', checking the assembly and seeing if you can "do better yourself". Introducing obfuscating code like a*(a>b)+b*(b>=a) leads to bugs and unnecessary rabbit holes for your future self when you get called in to debug the mess you made.
@Bregylais 3 years ago
Somebody gift this man a mouse-pad.
@danilotoebdot5616 3 years ago
Got recommended this, watched it all, now time to find even more good content you have
@BlankBrain 3 years ago
Back in the late '70s, I came across a set of algorithms that computed Julian and Gregorian dates using only double-precision arithmetic. Leap years and Y2K worked without modification. Operating systems and databases frequently got leap years and daylight savings time calculations wrong. It took a lot of time and effort to recover. The software that I wrote using the arithmetic functions only failed once, and that was when someone compiled them single precision. It made me a big fan of using arithmetic functions to perform logic.
@valtergomes5808 2 years ago
In my opinion, the main reason to avoid overly branchy code is readability and maintenance time. But for that, clean code with well-designed patterns will save you a bunch of 'ifs', which automatically makes your code easier to understand and easier for the compiler to optimize. Branchless programming the way it is presented here has a proper place and time... if you need it, you will know.
@russellthorburn9297
@russellthorburn9297 3 жыл бұрын
It seems to me that it's a choice of evils: 1. Easy to read but much slower code. 2. Optimized but heavily obfuscated code. Option 2 is WAY above my pay grade :-) so I'd go with #1 unless I had no other choice. Still, your video was extremely well done. Even I was able to follow what you were saying and I'm merely an intermediate level programmer. That's definitely worth a "Subscribe".
@tragedyofwind
@tragedyofwind 3 жыл бұрын
this is why programs back in the very old days were really hard to read: scientists optimised the program by hand, back when the clock speed was, like, below a thousand Hz. However, the rise in computation power means there is no need to write machine-optimal code when the machine can do more work for you, so programmers can spend less time writing less complex code and we can produce more programs for more purposes. It's a trade-off; it really depends on what type of work you are doing.
@Dji-gurda_Jdi-druga
@Dji-gurda_Jdi-druga 3 жыл бұрын
I would say you should determine bottlenecks and use #2 in them and #1 everywhere else.
@retrozvoc6189
@retrozvoc6189 3 жыл бұрын
How about a middleware that's readable to you and which gets JIT-compiled to whatever CPU you wanna run it on?
@DoomRater
@DoomRater 3 жыл бұрын
I dunno, Option 2 is what compilers are for. Or in my case, precompilers, since I work predominantly in a compiled language without certain niceties and use a precompiler that translates the extra control statements I want to use into standard controls that actually exist. Also look into the movfuscator for more evidence that optimizations can often be offloaded onto the compiler. Edit: I said all that BEFORE WATCHING THE VIDEO WHERE HE SAYS COMPILERS HELP
@PhilippeCarphin
@PhilippeCarphin 3 жыл бұрын
"Before, we used programmers to save on computing resources, now we use computing resources to save on programmer time." Don't remember who said that. And like the others say, that doesn't mean #2 should never be used, just that you should only use it for measured bottlenecks that are worth it. Even if you identify a bottleneck but your program doesn't need to be fast, then you still don't need to optimize.
@johnnisshansen
@johnnisshansen 3 жыл бұрын
In these 19 minutes of video my computer could convert to uppercase all the books written in the last 10 years.
@zwz.zdenek
@zwz.zdenek 3 жыл бұрын
Sounds like you have a pretty neat OCR and also pirated a lot of books. But really, unless your corpus is already in plaintext and loaded in RAM, other bottlenecks will make the conversion speed irrelevant.
@johnnisshansen
@johnnisshansen 3 жыл бұрын
@@zwz.zdenek well, my point was that computers are cheap and humans rarely have to care about optimizing the code. But OK, reducing branches can sometimes be worthwhile.
@A.Martin
@A.Martin 3 жыл бұрын
@@johnnisshansen it's most important where massive amounts of calculations are being done - like, I dunno, finding large primes - to have very optimized code. For regular programs less effort is put into optimizing now than back in the 80s or 90s, because computing power is so much higher that it is usually not needed; they do their work near enough instantaneously anyway.
@TheSkepticSkwerl
@TheSkepticSkwerl 3 жыл бұрын
Never assume optimization is achieved. It comes down to your use case and needs. If you need to look through 50 GB databases or you need to talk to 100k computers, any and all tricks can help. Granted, when talking to 100k computers, the network is a big bottleneck. But I've been able to take 20-day scripts and convert them to a few hours. Optimization can and will matter in certain scenarios.
@Argosh
@Argosh 3 жыл бұрын
That kind of thinking led us to page loading times that were worse than what we could achieve in the '90s...
@frigzy3748
@frigzy3748 3 жыл бұрын
You can replace 32 in C with 'a' - 'A' just to not have magic numbers. Great vid - thanks!
@shanehebert396
@shanehebert396 3 жыл бұрын
Although not the purpose of this video, the old school trick for toupper is to bitwise AND each byte with a mask (~(0x20)), presupposing the arrays are all ASCII characters in the range of A-Z/a-z. Of course, once out of ASCII, this isn't the right method.
@frigzy3748
@frigzy3748 3 жыл бұрын
@@shanehebert396 I never thought of it - thanks!
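Two quick sketches of the ideas in this thread, assuming ASCII (the mask version is only safe when every byte really is a letter, as noted above):

#include <cstddef>

// Branchless version with the constant spelled out: 'a' - 'A' == 32 in ASCII.
char to_upper1(char c) {
    return c - ('a' - 'A') * ((c >= 'a') & (c <= 'z'));
}

// Mask version: clearing the 0x20 bit maps 'a'..'z' onto 'A'..'Z' and leaves 'A'..'Z' alone,
// but it will mangle digits, punctuation, etc., hence the letters-only precondition.
void to_upper_letters_only(char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        s[i] &= ~0x20;           // same as s[i] & 0xDF
}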
@duncanbeauch9598
@duncanbeauch9598 9 ай бұрын
I remember learning all these topics knowing it would speed up code but I didn’t know to what degree. It was very informative to walk through the same example for direct comparison
@Walrusbonzo
@Walrusbonzo 3 жыл бұрын
Loved the reference to Al Di Meola, one of my all time favourite guitarists. Great video too, right up my street.
@berndeckenfels
@berndeckenfels 3 жыл бұрын
A good example for branchless code is in crypto, where data-dependent code execution can lead to side-channel leaks, so you look for branchless and constant-time algorithms. For example, to compare two byte arrays of hash results, you would add up the XORs of each byte and not break the loop on the first difference.
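A hedged sketch of such a constant-time comparison (not taken from any particular crypto library):

#include <cstddef>
#include <cstdint>

// Returns true iff the two n-byte buffers are equal.
// The loop always runs to the end and takes the same path regardless of where
// (or whether) the buffers differ, so its timing reveals nothing about the data.
bool equal_constant_time(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
    std::uint8_t diff = 0;
    for (std::size_t i = 0; i < n; ++i)
        diff |= a[i] ^ b[i];     // accumulate differences, never break early
    return diff == 0;
}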
@inx1819
@inx1819 3 жыл бұрын
add random sleeps in the code and you're alright /s
@yvrelna
@yvrelna 3 жыл бұрын
@@inx1819 Actually you don't want to add random sleeps. If you just sleep randomly, all operations will be extended by roughly the same amount of time on average, so you can still statistically figure out which operations take longer than others. Instead, what you want to do is sleep longer if the operation finished faster and sleep less when the operation is slower. You can do this by deciding beforehand how long an operation should take. For example, you might decide that an operation should take 1 second, so you do the operation, and if there's any remaining time, you sleep for the remainder. That means no matter which branch the code takes, it always takes 1 second. Now, this isn't perfect: there's also power analysis, because doing actual work likely costs more power than sleeping, so this is enough if your attacker is remote (e.g. a web server). But if you need to consider a local attacker with full physical access to the hardware (e.g. smart cards), you might have to busy-sleep.
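A rough sketch of that fixed-time-budget idea (illustrative only; the function and budget names are invented, and a real implementation would also have to worry about clock resolution and, as noted, power analysis):

#include <chrono>
#include <thread>

// Run op(), then pad with sleep so the caller always observes roughly the same duration.
template <class Op>
void run_with_fixed_budget(Op op, std::chrono::milliseconds budget) {
    auto start = std::chrono::steady_clock::now();
    op();
    std::this_thread::sleep_until(start + budget);  // returns immediately if we already overran
}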
@gutoguto0873
@gutoguto0873 3 жыл бұрын
@@yvrelna you didn’t see his /s
@Websitedr
@Websitedr 2 жыл бұрын
I can see these techniques being applied for things that require a lot of calculations to be done as fast as possible. The issue is when it comes to other people looking at the code without fully understanding why it's written this way.
@eomoran
@eomoran 2 жыл бұрын
Aw jeez, if only you could write a comment annotating it /s
@TheLordoftheDarkness
@TheLordoftheDarkness Жыл бұрын
Skill issues
@Bunnokazooie
@Bunnokazooie 14 күн бұрын
Love when experts have a down-to-earth attitude like this. You rock man.
@vinzer72frie
@vinzer72frie 3 жыл бұрын
"The CIA wants you to believe its only if then else if then else"
@gumbo64
@gumbo64 3 жыл бұрын
Run over all the glow-in-the-darks
@anant6778
@anant6778 3 жыл бұрын
The FBI wants to have a word about switch statements
@VulcanOnWheels
@VulcanOnWheels 3 жыл бұрын
What is this quote from? If you quote something from the video you're commenting on, then please include the time where you found it.
@SiisKolkytEuroo
@SiisKolkytEuroo 3 жыл бұрын
@@VulcanOnWheels kzbin.info/www/bejne/baavq5SBob-Gh7M it's in that video at around 5m 40s
@SiisKolkytEuroo
@SiisKolkytEuroo 3 жыл бұрын
But it's worth it watching the entire video because Terry Davis is a legend
@insertoyouroemail
@insertoyouroemail 3 жыл бұрын
Very interesting! Another good example for when you want to write branchless code is when writing shaders for the GPU.
@overloader7900
@overloader7900 3 жыл бұрын
Why though?
@jarrad2000
@jarrad2000 3 жыл бұрын
PLCs are very often programmed branchless as well, mostly because many of the PLC programming languages don't support branches, or support them only rudimentarily. They're also typically programmed by people with a background in electrical engineering and little to no programming experience, who like to use things like ladder logic. There are languages similar to assembly (IL) and Pascal (ST), though.
@insertoyouroemail
@insertoyouroemail 3 жыл бұрын
@@overloader7900 Because on a GPU, the instructions inside an if-then block are often executed even for the data where the test doesn't pass; the shader still pays for both sides.
@xinaesthetic
@xinaesthetic 3 жыл бұрын
Overloader7 you have many-many threads running parallel and if they don’t all follow the same path that can be problematic.
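In CPU-side terms, the shader-friendly pattern is roughly "evaluate both sides and blend with a mask", much like GLSL's mix()/step(). A sketch, not actual shader code:

// Both "branches" are evaluated for every element; a 0/1 mask picks the result,
// so all lanes of a GPU warp keep executing the same instructions.
float select_branchless(float a, float b, bool take_a) {
    float mask = take_a ? 1.0f : 0.0f;     // typically compiled without a jump
    return mask * a + (1.0f - mask) * b;
}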
@jonathanross6260
@jonathanross6260 2 жыл бұрын
Branchless programming is often about taking a step back and finding a simpler approach. In the second example, I would just take d[i] & 0xDF; it keeps the upper-case characters intact but clears the 6th bit for lower-case letters, making them upper.
@WhatsACreel
@WhatsACreel 2 жыл бұрын
Yes, that will change some non-letter characters too. If those characters aren’t in your text, then AND away my friend! It will certainly be faster!! The example is not super important, just the technique of masking in SIMD to perform multiple branches without actually branching. Cheers for watching :)
@alexc4924
@alexc4924 8 ай бұрын
could try something like d[i] ^= (((unsigned)(d[i] - 'a') < 26) << 5), i.e. toggle bit 5 only when the byte is a lower-case letter
@jakeplonk888
@jakeplonk888 9 ай бұрын
Absolutely fascinating. I had never, ever heard of this technique. Thanks for posting!
@giladreich810
@giladreich810 3 жыл бұрын
KZbin should pay you 10x more per view for the underrated quality content you publish. Love your channel! ❤ I was recently getting more into this subject myself, especially reading about how different compilers handle N/RVO (Return Value Optimization) and copy elision to reduce copying, now that move semantics were introduced in C++11. It would also be a great idea for the next video to dig more into optimization and what we should not worry about when we write our code.
@scottmichaud
@scottmichaud 3 жыл бұрын
C++17 mandated that RVO (not NRVO) is performed. If you're allowed to use C++17 (or later) then the compiler is not allowed to perform intermediate constructors and destructors anymore, so there shouldn't be any differences between compilers or optimization levels. You can see the edge cases here: en.cppreference.com/w/cpp/language/copy_elision
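A small illustration of that guarantee, assuming a C++17 compiler (the type here is invented for the example):

#include <vector>

struct Big {
    std::vector<int> data;
    Big() : data(1'000'000) {}
    Big(const Big&) = delete;   // no copy available at all...
    Big(Big&&) = delete;        // ...and no move either
};

// Fine since C++17: the returned prvalue is constructed directly in the caller's
// storage, so the deleted copy/move constructors are never needed.
Big make_big() { return Big{}; }

int main() {
    Big b = make_big();         // still the same single Big object, no intermediates
    return b.data.empty();      // 0
}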
@Toksyuryel
@Toksyuryel 3 жыл бұрын
While these techniques aren't very useful for most programmers, they are extremely useful for compiler programmers. The more patterns they can recognize and replace, the better everything is for everyone.
@vladimirarnost8020
@vladimirarnost8020 8 ай бұрын
Or library developers, e.g. providing the fastest primitives like memcpy, strcpy, strlen, etc. is essential for C/C++ performance. However all the ugliness and complexity should stay locked inside the library implementation and covered with extensive unit tests and benchmarks.
@peterSobieraj
@peterSobieraj 8 ай бұрын
I disagree a little. Thinking that most programmers shouldn't care about performance has led us to the situation we have today: you need a 2x2 GHz CPU just to start Windows, just to display a desktop background image and some icons and then do nothing while waiting for the user to click something. And 10 seconds of website loading is considered normal. And a fast-typing programmer can type faster than the average IDE can display the text.
@Toksyuryel
@Toksyuryel 8 ай бұрын
@@peterSobieraj The goal is that top level programmers shouldn't *have* to think about performance because the low level stuff takes care of it for them. I agree that we aren't there yet though.
@markwilliamhumphries
@markwilliamhumphries 2 жыл бұрын
You can toggle case on an ASCII alpha (from lower to upper, or vice versa) by xoring it with the space character (i.e. 0x20).
@Reginoid
@Reginoid 8 ай бұрын
yes, he did it with -32 (0x20) in that example
@pierrevillemaire-brooks4247
@pierrevillemaire-brooks4247 2 жыл бұрын
Happy Holidays man ! ... and thanks for the lesson ;-)
@adamp9553
@adamp9553 3 жыл бұрын
For unsigned ranges like 'a' to 'z', a smaller range check I use is (value - lower + 0u > range), where 0u forces the comparison to be unsigned and range is 'z' - 'a', so that any underflow also results in an unsigned value out of range; instead of two compares where both bounds have to be accounted for, it's one subtract and one compare.
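In the in-range form, that trick might look like this (a sketch; it relies on unsigned wrap-around and assumes ASCII):

// One compare instead of two: if c < 'a', the subtraction wraps to a huge
// unsigned value, which also fails the <= test.
bool is_lower(char c) {
    return static_cast<unsigned>(c - 'a') <= 'z' - 'a';
}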
@RMLLcrazy
@RMLLcrazy 3 жыл бұрын
This seems like a common enough thing that the compiler should be able to provide the optimisation. Isn't that the case?
@tolkienfan1972
@tolkienfan1972 3 жыл бұрын
I like explicit casts in such cases, with comments so people reading the code have no cognitive load
@rinrin4711
@rinrin4711 3 жыл бұрын
It will probably be slower since you changed one comparison and logical operator to one subtraction. Comparison is a subtraction without memory assignment, so cmp is faster.
@originaltonywilk
@originaltonywilk 3 жыл бұрын
Good to know some programmers are interested in this stuff... thought it was a dying art :) Did a lot of stuff back in late 90's with good old MMX instructions to process video in real time (which was tricky back then). Even now some clever optimization can make a huge difference - like being able to adjust photo editing filters in realtime with a slider (e.g. Affinity Photo) or waiting 1/2 a second or more for each tweak.
@Louis_Marcotte
@Louis_Marcotte 9 ай бұрын
I didn't even know branchless coding was a thing. I just assumed "if" was necessary to test a condition and didn't really question it, seemed reasonable enough. You've opened my eyes. I'm not gonna use branchless coding yet, most of it flew WAY over my head, but knowing why it's bad is still good
@LocrianDorian
@LocrianDorian 9 ай бұрын
It really isn't that difficult if you think about it, for comparison replacement, a>b itself is a mathematical expression which returns 1 if it's true, so if you do x=a*(a>b), that will mean x=a if a is larger than b. You can use that same principle for anything. That said, I think the main lesson learned from this video is that you shouldn't try to be too clever because most of the time the compiler will do a better job than you ever will.
@Louis_Marcotte
@Louis_Marcotte 9 ай бұрын
​@@LocrianDorian ye nah I get the algebra, I just don't understand how converting "if(a>b) return a" to "return a * (a>b)" removes a comparison, you're still comparing a to b
@LocrianDorian
@LocrianDorian 9 ай бұрын
@@Louis_Marcotte It doesn't remove the comparison, it removes the branch. The thing that slows down the CPU is not the comparison itself, it's the "if" statement.
@ChrisM541
@ChrisM541 2 жыл бұрын
Brilliant explanation. The following bombshell caught my eye... 12:19 "So it just goes to show that you don't automatically get speed from jumping over to assembly...you have to code it well" - very well said ;-) With every language above assembly, it's a shame that we have so little control over the final machine code output, even with C/C++. Lesson: 1) Never trust your compiler 2) Become an expert in assembly language and learn to e.g. inline where needed.
@TheCroustade
@TheCroustade 3 жыл бұрын
Some processors have a conditional selection instruction, which at the source level corresponds to a > b ? a : b. Branchless programming is a must in time-critical applications where worst-case execution time prevails. This skill is appreciated in avionics design companies.
@NetAndyCz
@NetAndyCz 3 жыл бұрын
Interesting stuff. I never dug too deep into programming, so I never even saw the instructions that go to the CPU/GPU; would love to learn more about the actual assembly code.
@supericeg5913
@supericeg5913 4 ай бұрын
Thanks for showing this! I now use this every day in my game in C and it's a great extra bit of fun problem solving!
@Rudolf215
@Rudolf215 2 жыл бұрын
I love your videos so much!! Keep up the great work!
@atackhelikopter4303
@atackhelikopter4303 9 ай бұрын
13:30 improvement idea: use an array that has a length of 255 and place basically the whole ASCII table in it, but make it so that the a-z portion holds the values A-Z. This should make it faster because you aren't comparing anything, you're just replacing each byte with a precomputed one.
@shinyhappyrem8728
@shinyhappyrem8728 9 ай бұрын
*256
@atackhelikopter4303
@atackhelikopter4303 9 ай бұрын
@@shinyhappyrem8728 it stops at null so you never have to take into account the null character lol
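A sketch of the lookup-table idea from this thread (256 entries, as the reply notes; the table is built once and the hot loop is a plain load per byte):

#include <array>
#include <cstddef>
#include <cstdint>

// 256-entry table: kUpper[c] is c upper-cased if c is 'a'..'z', otherwise c itself.
static const std::array<std::uint8_t, 256> kUpper = [] {
    std::array<std::uint8_t, 256> t{};
    for (unsigned c = 0; c < 256; ++c)
        t[c] = (c >= 'a' && c <= 'z') ? std::uint8_t(c - 32) : std::uint8_t(c);
    return t;
}();

void to_upper_lut(std::uint8_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        s[i] = kUpper[s[i]];    // one load per byte, no comparisons in the hot loop
}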
@TheEkkas
@TheEkkas 2 жыл бұрын
Very nice. Not often we see real code optimisation. First time I’ve seen this. Love it. Smaller is faster. Not only your data, but your code blocks also need to move in/out of cpu caches so avoiding a possible branch outside of L1 size will add up and makes sense (now). This is a great technique for ultra-performance code. The algorithm at the end is a nice added bonus. Subscribed.
@JosephSaintClair
@JosephSaintClair 2 жыл бұрын
Glad I saw this. Took me back nearly 30 years. Thanks 🙏
@lucisetumbrae
@lucisetumbrae 3 жыл бұрын
Great video! Loved the random Al di Meola reference in assembly language
@Tolg
@Tolg 8 ай бұрын
This sort of optimization is very hard to get right and often leads to premature optimization. For tight loops in higher-level languages you should check your bytecode, and for lower-level languages, the disassembly. However, don't forget about the third thing... the CPU execution pipeline. There are also many optimizations each CPU can make, so while one disassembly looks slower than another, it may not actually be: the CPU might be able to optimize one better than the other. It's very tricky and really ends up requiring a lot of testing on specific CPUs. I wish there was a platform where you could benchmark your code against hundreds of different CPU types (with emulation) and see the results.
@joshuap7406
@joshuap7406 3 жыл бұрын
I never liked the if statement, but this takes it to a whole new level...
@CottidaeSEA
@CottidaeSEA 3 жыл бұрын
Conditionals are always necessary. The difference is that branchless programming doesn't create two different scenarios. return (a * (a > b) + b * (a <= b));
@miricaley4988
@miricaley4988 Жыл бұрын
Very cool video, thanks a lot, been learning a lot from you lately. Hope you can finish your neural network series in the future - that series is actually the one that got me to back you on Patreon - and I hope you can bring up some mobile hacking material in the future as well. Might be too much to ask, but I am really looking forward to your new shit man, you have a good one.
@reinux
@reinux 2 жыл бұрын
Something weirdly satisfying about listening to someone talk Assembly with an Australian accent.
@donald-parker
@donald-parker 3 жыл бұрын
I thought it was funny when he said "The compiler couldn't figure out what we were trying to do". Same as most people trying to read that code. There are lots of compiler tricks for optimizing - removing branches is just one of them. And optimizing for speed is not the only type of optimization a compiler can do. My own intuition is to strive to make your code readable first. If you are not meeting speed requirements, reconsider your algorithm first. As a very last resort, consider using assembler. I have trouble imagining a scenario where trying to come up with some contorted approach in a high level language that is hard to understand, maintain, or predict what the performance implications would be (without looking at disassembled code, which seems kind of like "why bother") would be the preferred approach. BTW - the same high level code can be compiled into different native instructions for various reasons (e.g. different compilers, different target processors, different compiler options, ...), so it seems like you would need to buy into disassembling your code EVERY time one of these variables changes to see if it is still optimized. Of course, it is not uncommon for the writer of the code to not really know when one of these things changes in their build environment. In general, I would say if you start trying to second guess your compiler, you are at best wasting your time, and at worst, creating a maintenance nightmare.
@Antoinetheman
@Antoinetheman 3 жыл бұрын
To be fair, I think all of these points are taken into account by the video. He's not saying you should do this stuff everywhere, only in cases where you need extreme speed, and especially in SIMD scenarios. Nobody's doing SIMD unless they need speed.
@georganatoly6646
@georganatoly6646 3 жыл бұрын
I believe you may have meant to say assembly instead of "... consider using assembler" to refer to an assembly language.
@shiuido359
@shiuido359 3 жыл бұрын
Remember that not everyone is writing code for web servers and PCs. If you work on high-level applications it's hard to imagine, but there are people writing for low-level applications where cycles and bytes are scarce.
@christopherjoseph651
@christopherjoseph651 2 жыл бұрын
@@shiuido359 EXACTLY! 99.999% of people that write code are oblivious to the fact that low-level programming is still relevant and necessary in certain applications. They live under the belief that assembly is old, outdated, incapable and doesn't need to be taught, because they learned Java or Python.
@ncot_tech
@ncot_tech 3 жыл бұрын
I’ve been doing Z80 assembly recently, and seeing all these fancy conditional instructions and more than 8 registers is crazy 😀
@WhatsACreel
@WhatsACreel 3 жыл бұрын
Long live the ZX Spectrum :)
@protonjinx
@protonjinx 3 жыл бұрын
cheater! 6502 only had 3! spoiled brat.
@NielMalan
@NielMalan 3 жыл бұрын
So good to see that at least some coders are working on improving performance.
@Wylie288
@Wylie288 9 ай бұрын
This doesn't really improve performance in the real world, lol. It only works on this very specific, bad compiler and a very specific, outdated instruction set. In the real world, you use a modern CPU where comparisons do not take anywhere near this much time, and you use compilers that do this for you IF it's actually faster. Most of the time it isn't. This is ONLY useful for GLSL shaders. That's this video's only real-world application.
@NielMalan
@NielMalan 9 ай бұрын
@@Wylie288 My comment was intended to be read ironically. I don't know enough about processors and compilers to have a conversation with you about it, but I am a victim of slow software. This makes me doubt the ability and willingness of coders to study and improve their code's performance. I hate the attitude of "modern CPUs are fast so code performance doesn't matter." If a comparison only takes a microsecond, a million comparisons takes a second. If your code could have avoided those million comparisons you've wasted a second of my time, and I hate you. (Why I hate you: I have ADHD, and every time my computer hesitates it disturbs or disrupts my focus, and I pay dearly in energy and patience to regain focus. Some software (notably Evernote) I cannot use at all, because it is simply too slow and therefore takes too much of my energy.)
@godnyx117
@godnyx117 9 ай бұрын
Thank you! I'm writing a low-level library to use instead of libc for my programming language and these tricks will come in handy!
@ImHencke
@ImHencke 3 жыл бұрын
Imagine if YandereDev saw this. EDIT: It's a joke, chill out you angry weebs.
@vinzer72frie
@vinzer72frie 3 жыл бұрын
This video showed up on my recommended after watching yanderedev videos youtube is so smart lol
@scottmichaud
@scottmichaud 3 жыл бұрын
​@@vinzer72frie and @Henrik :: I'm a programmer that's helped with optimizing Yandere Simulator. --- I should note that the types of optimizations mentioned in this video are *hot path* optimizations. You're referring to one script that's executed ~80 times per frame, and its execution time, as measured by the profiler, is under 1ms on a typical, modern ~4 GHz CPU. If this was your bottleneck, then you would be in the ~1000 FPS range (because 1ms = 1/1000th of a second). This measurement is the sum of all 80 NPCs, and it also includes the time that the script is spent blocked on heavy calls into C++ (ex: some when some NPCs need to adjust their animation weights). --- Performance mostly depends on data... and game logic doesn't have a whole lot of data flowing through it (when compared to rendering thousands of objects, thousands of physics objects interacting, hundreds of animated objects that each have dozens of bones, etc.). People put a lot of emphasis on code, but optimizing your data (how much you load and how it is laid out) is usually where you will get the huge wins. --- Also, when you're in the realm of Object-Oriented Programming, one L2 cache miss is typically 10x to 100x slower than a branch, depending on whether the branch prediction was a success. This obviously varies from CPU-to-CPU, but it's a good order of magnitude comparison for modern x86-64ish CPUs. A correct branch is ~2 cycles. An incorrect branch is ~15-20 cycles for AMD Ryzen / Intel -Lake CPUs ( See Agner Chapter 3.8 and 3.13: www.agner.org/optimize/microarchitecture.pdf ). A cold memory fetch (reading memory that's not in cache) is ~200 cycles. That's not too big of an issue for something that happens dozens of times per frame, though, because a 4 GHz CPU does ~67 million cycles per frame at 60 FPS (per core). In other words, you would need to do ~1.5 million branches on your main thread, and have them all be incorrectly predicted by the CPU, to eat up just half of your 60FPS frame budget. --- The actual performance issues with Yandere Simulator are mostly related to animations and physics... and we've been fixing them up based on priority and availability. You can see over a 2x performance increase between January 2019 and July 2020 (assuming you have a discrete GPU and shadows are disabled -- neither of which have anything to do with game scripts). It's not uncommon to see a gaming PC with a good (ballpark ~GTX 1060-or-better) GPU and a gaming CPU get 80-110 FPS when shadows are disabled. --- Unfortunately that "if/else" meme fools a lot of people, though. It's not, and has never been the issue for that game. I mean if performance got over 2x better, and the game scripts haven't changed, then it clearly could not have been the issue. Over 50% of all execution time, good and wasteful, was removed... yet the if/else statements are still there.
@NihongoWakannai
@NihongoWakannai 3 жыл бұрын
@@scottmichaud nice comment, too bad not really anyone is going to see this. I'm a gamedev myself and seeing the way people rag on yandere dev is pretty often kinda cringe. It's very obviously a lot of script kiddies or people with no experience just jumping on a bandwagon without actually knowing what to be mad at. Did the game have optimisation problems? Sure. Is it because of some random out of context if/else chain? probably not.
@cortexauth4094
@cortexauth4094 3 жыл бұрын
@@scottmichaud I have been studying graphics libraries and trying to make an engine, meanwhile looking at the code of games, and it always baffled me what was supposed to be wrong with all those if/else - I can't imagine an RPG without them. This cleared up the doubts.
@scottmichaud
@scottmichaud 3 жыл бұрын
@@cortexauth4094 Yup. No problem. You want to have as much of your hardware at max capacity as possible (until there's no more work for the frame, then idle in one big chunk). When a branch is mispredicted, it needs to flush the bad work. That wastes "low tens" of cycles. ~20 cycles is nothing when you're concerned about work that happens once every 10 milliseconds (100 FPS). That's 0.0000025% (1/40,000,000th) of your budget. That could be a problem if you're in a loop of thousands or millions of objects, though, and the CPU can't figure out your pattern. Game engines will mostly have problems with L2 cache misses, though, which are 10x worse than a mispredicted branch. This means that work that needs to operate on many things (ex: thousands) should be laid out linearly in memory so your next calculations are on the same cache line, and, when you leave your cache line, you land on the start of the next cache line. The CPU watches your memory accesses and it will speculatively load cache lines based on what it thinks you will need. One of the biggest problems with memory accesses is dynamic inheritance. The way to allow multiple different child classes to fit on an array is to *not* have them contiguous in memory (because they're all different sizes based on the child class). There's a *lot* of different discussions that can go on here... but, yeah, it depends on the data you have flowing through it. Being wasteful on the order of the millionths of a 100 FPS budget might be okay if you're talking about hundreds of things... but, on the other hand, it is wasteful. IMO the biggest part is knowing what your waste is, and making a conscious judgment to waste (rather than being blindsided by the profiler months or years later).
@yuryschkatula9026
@yuryschkatula9026 3 жыл бұрын
Fashion: you have to have that glass table
Optical mouse: I hate this... Not again!
@horowitzhill6480
@horowitzhill6480 3 жыл бұрын
funny thing is, i've used an optical (laser? don't really remember) mouse that had *no problem* working on a glass surface really weird
@tunahankaratay1523
@tunahankaratay1523 3 жыл бұрын
@@horowitzhill6480 That mouse probably uses infrared. Although glass is transparent to the visible spectrum, it reflects infrared. This way the heat energy at home can't escape through windows.
@horowitzhill6480
@horowitzhill6480 3 жыл бұрын
@@tunahankaratay1523 pretty sure 1) glass behaves the same with infrared and visible 2) if it did reflect, the mouse would not work, as it would be like using it on a mirror
@tunahankaratay1523
@tunahankaratay1523 3 жыл бұрын
@@horowitzhill6480 No. It's pretty common knowledge that glass reflects infrared. Otherwise it would just let the heat energy as infrared out. But most of the heat loss through windows happen at the edges, because it's hard to perfectly seal the glass. That's why window companies are focusing on the seal, not the glass itself. Also they probably engineered it further to make it work with glass, I don't think they just replaced the LED with an infrared one and called it a day.
@christopherpepin6059
@christopherpepin6059 3 жыл бұрын
@@horowitzhill6480 Not every thing that reflects light acts like a mirror. Every object that is not transparent reflects light, it's how you see them in the first place. Glass most definitely reflects IR. If you ever saw one of those IR booths at a science museum it is why people's glasses appear as dark spots. The glass prevents the IR from passing through it.
@eterpaykugml4751
@eterpaykugml4751 3 жыл бұрын
Really neat. Optimizing for speed by sacrificing some memory is an excellent trade-off in modern systems.
@DeadCatX2
@DeadCatX2 2 жыл бұрын
Two things I would add. 1) Conditional move et al still create dependency chains which can slow down throughput. 2) Speculative execution vulnerabilities like Spectre, as well as side-channel attacks on the execution time of a piece of code, can make branchless algorithms more useful.
@SaidMetiche-qy9hb
@SaidMetiche-qy9hb 3 жыл бұрын
Really interesting, optimization is often neglected
@svnhddbst8968
@svnhddbst8968 3 жыл бұрын
it's fun to look into python for optimization. sometimes the most readable code is only mediocre for timing, where the really verbose code that could be 3 times as long and look like a foreign language could be three times as fast.
@yScribblezHD
@yScribblezHD 3 жыл бұрын
svnhddbst I think optimizing python code is more or less a pipe dream, considering anything highly performant should just be written in some other language. Writing python code that's short and simple is what makes it fun heh.
@bobbobbity463
@bobbobbity463 3 жыл бұрын
Very interesting video. Another technique you didn't show (and is a lot easier for beginners) is caching. In the case of toUpperCase, you could prepopulate an array of 255 elements filled with zeroes and a couple of -32s. You sacrifice a bit of memory for great gains in speed. Also, depending on the language you use, you can store function references in an array and call funcArray[a>b](a,b) to reduce the amount of branching. Or even funcArray[a%4](args) for more options.
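A hedged sketch of that function-table dispatch (handler names invented for illustration):

// Two handlers selected by the 0/1 result of the comparison - no if/else at the call site.
int on_not_greater(int a, int b) { return b; }
int on_greater(int a, int b)     { return a; }

using Handler = int (*)(int, int);
static const Handler handlers[2] = { on_not_greater, on_greater };

int pick(int a, int b) {
    return handlers[a > b](a, b);   // (a > b) is 0 or 1, indexing the table directly
}

Note that the indirect call has its own cost and blocks inlining, so as always it is worth measuring against the plain branch.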
@bluebukkitdev8069
@bluebukkitdev8069 2 жыл бұрын
Thank you for this video! KZbin recommended it and I am glad it did. You earned a like.
@Kyojin743
@Kyojin743 6 күн бұрын
I’m an undergraduate in CpE and I was working on an autonomous robot, and I had this really over engineered way to do course correction. I watched this video, and just today I was looking over my system and noticed that the techniques in your video were perfect for this application. I reduced it from a 50 line solution of ugly nested if else’s about 4 deep at max, to 2 lines. Thank you!
@GuruEvi
@GuruEvi 3 жыл бұрын
For the conversion from lower to upper, wouldn't it be faster to just AND the byte with the inverse of 32 in binary (the 6th bit set to 0)? So loop over value & 0b11011111 (0xDF); depending on your architecture this conversion could be done in a single operation.
@dimdimich
@dimdimich 2 жыл бұрын
String may contain non-alphabetic characters and they should not be 'converted', i suppose.
@mohrizkyk
@mohrizkyk 2 жыл бұрын
"If you want to be most people, if you want to be an ordinary dev with common skillset, this video is not for you." - Peerple -
@Hiphopasaurus
@Hiphopasaurus 3 жыл бұрын
Great video! One caveat, the 'CMOVcc' instruction was introduced with the Intel P6 microarchitecture (Pentium Pro+), so if you're building for generic 32-bit platform, gcc will still branch: 'CMP, Jcc' ...
@teropiispala2576
@teropiispala2576 3 жыл бұрын
Nice to find out that someone is still interested in performance. I started serious programming in the '80s and did lots of embedded assembler programming in the '90s. Computers are fast today, but it still feels like a crime to write slow code. I've given up assembler for compatibility and readability reasons, but with time-critical code I always check the compiled result and run performance tests. Parallel, pipelined execution is so complex nowadays that sometimes it takes a long time to understand why some implementation runs as fast as it does. Sadly, web-based techniques and scripted languages are spreading into areas they shouldn't. For many modern programmers, any operation is fast if it takes less than a second, even though, done right, the time could be at the nanosecond level.
@ImperatorZed
@ImperatorZed Жыл бұрын
Optimizing isn't always worth increased complexity, so it's often a tradeoff choice. Benchmarking seems to be the way to go, I need to learn how to do that.
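A bare-bones starting point for that kind of benchmarking (a sketch using std::chrono; the names are invented, and a real harness would add warm-up runs, repetitions and something stronger to keep the optimizer from deleting the work):

#include <chrono>
#include <cstdio>

// Time `iterations` calls of f and return the total in milliseconds.
template <class F>
double time_ms(F f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    volatile int sink = 0;     // keeps the loop from being optimized away entirely
    double ms = time_ms([&] { sink = sink + 1; }, 1'000'000);
    std::printf("%.3f ms\n", ms);
}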
@teropiispala2576
@teropiispala2576 Жыл бұрын
@@ImperatorZed I'm not talking about hacker optimizations but higher-level architectural choices. The web technologies I mentioned are extremely slow compared to natively executed code tailored to the purpose. They also create hidden complexity, which makes the code unreliable. Applications are being made from complex ready-made components whose functionality typically doesn't fit the purpose well. They bring a huge number of new failure paths, which are mostly unhandled, or make proper error handling extremely complex. Programming errors are typically caught as runtime exceptions. Components have a complex dependency network on other components, which are constantly upgraded, partly because of security vulnerabilities in functions that aren't even needed in most cases. The full system is never tested and applications work only on the happy-day-scenario principle. Often the developer's answer to a problem is: try another browser, or it works on my machine. So, my point is, mostly self-written functional code is several orders of magnitude simpler and faster when written in traditional C or C++ compared to current web technologies. The evilness lies in how easy it is to use those complex components and how difficult it is to take the simple approach. You use what you've got that works well for this.
@JayVal90
@JayVal90 3 жыл бұрын
I used the vector technique for a mapping algorithm at my last job, using the Accelerate framework on iOS. Got a HUGE performance increase, although I never thought about the branchless part of it. I was just happy that I had an algorithm that could convert tens of thousands of lat/lon pairs to rasterized mercator tiles of 32-bit indexes into an array of values for each mercator tile pixel in the same time that it took the naive, pointer-chasing for each vector method to process 10 lat/lon pairs. I was working with 300x speedups.
@flossless200
@flossless200 3 жыл бұрын
Cool story bro
@arnoio8355
@arnoio8355 3 жыл бұрын
You must slay at parties
@patrickthepure
@patrickthepure 3 жыл бұрын
I observed a similar situation in BVH tree traversal. For a raytracing program, my first approach was, storing BVH data with pointers and dynamic memory, and traversing it recursively. It was a very elegant and OOP styled code. Then I had to make it run on GPU, so re-made it in a branchless way, I stored all its data in contiguous memory, and used stackless BVH traversal algorithm. Yep... 100x speedups when run on CPU, 1000x speedups when run on GPU.
@fyfaenihelvete
@fyfaenihelvete 3 жыл бұрын
That's pretty cool
@JayVal90
@JayVal90 3 жыл бұрын
Garfieluigi Yes. Especially parties full of people with a high appreciation for the technical.
@xarlysx
@xarlysx 3 жыл бұрын
When I started learning four years ago, I had this constant feeling that there was something wrong with conditionals but I didn't know enough to be able to even ask the question. And now finally I have my answer.
@thomasmaughan4798
@thomasmaughan4798 3 жыл бұрын
There is nothing wrong with conditionals. The author has shown a way to HIDE the conditionals from the source code, but the fact is a conditional is inescapable in the object code. It is also very fast, typically 4 clock cycles to execute a conditional. The problem with JUMP instructions is the difficulty of debugging, and removing obvious branches from source code simply makes it harder to debug. Optimizing your code requires some knowledge of how long each instruction takes; logical instructions (and, or, compare) are typically very fast, requiring 4 clocks. Arithmetic instructions like add and subtract take considerably longer, up to 100 clocks, and multiplication becomes non-deterministic but can take hundreds of clocks. So using multiplication to avoid a simple 4-clock compare instruction is incredibly inefficient. It is this inefficiency that led to the development of "branch prediction": suppose there's a multiplication followed by a conditional. It is going to take such a long time to calculate the item to be compared that the processor supposes the compare could go either way and starts loading the instructions from main memory that follow both possible branches. The CPU is vastly faster than main memory. Then when it gets there, it figures out which branch to take and discards the branch not taken, but the discarding takes zero time as far as your program is concerned. In that scenario, using the simple approach MIGHT not even be faster, since it means more of your program is going to be waiting on main memory anyway. But the CPU will certainly run cooler and consume less electricity if you use simple instructions rather than these hugely complex instructions.
@oliver_twistor
@oliver_twistor 9 ай бұрын
I know this comment is old, and that you probably have learned a lot more since you wrote it, but be aware that this micro-optimisation is for specific usecases only. For the vast amounts of production code you'll write, if statements are much better. Your coworkers will thank you for it, because they won't have to spend their precious time trying to understand what you have written. I have worked with many "clever" coders and I usually end up deleting it all and rewrite it myself, because I can't be bothered taking precious time trying to understand what the code does.
@lautaro736
@lautaro736 3 жыл бұрын
Awesome, thanx! I will be patching my assembly code with this from now on
@juanmacias5922
@juanmacias5922 Жыл бұрын
"trying to out code the compiler" is one of the most badass things I've ever heard haha
@lake5044
@lake5044 3 жыл бұрын
Nice video. But remember, unless something is actually a bottleneck in your program's performance, you never want to waste time and effort squeezing 1ms out of it. Always profile first, optimize last. (Obviously, this doesn't apply to study/benchmark stuff.)
@krisitof3
@krisitof3 3 жыл бұрын
“99% the compiler is smarter“ - yes, the compiler tries to be, but the main problem with compiler optimizations is variance, and variance depending on things out of your control. And there IS a lot of variance, probably more than you would think. Watch the talk "Performance Matters" by Emery Berger - it goes in-depth on this and is a great talk! As a general rule of thumb, compiler optimization is nice to have but you should not rely on it to make your code fast. Branchless programming is really valuable if you explicitly know what you are doing. Otherwise it probably won't matter for you (for simpler stuff) Ps. Thanks for the more advanced topics, great video as always 😀
@oflameo8927
@oflameo8927 3 жыл бұрын
We need some compiler tutorials then so we know how to use them optimally.
@user-wi8ts6te5d
@user-wi8ts6te5d 2 жыл бұрын
"You should not rely on the compiler to make your code fast." Do you compile all your code with - O0?
@FHBStudio
@FHBStudio 8 ай бұрын
For a lot of if/else statements, I sometimes add the conditions as powers of 2 and then switch on that. So for example, if I have 4 conditions, that's 1*c0 + 2*c1 + 4*c2 + 8*c3, and then I switch through cases 0-15. You can make it readable using comments.
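A sketch of that pattern with invented condition names; dense case values let the compiler emit a single jump table instead of nested branches:

// Pack four independent booleans into one 0..15 index, then branch once on it.
void dispatch(bool c0, bool c1, bool c2, bool c3) {
    switch (c0 * 1 + c1 * 2 + c2 * 4 + c3 * 8) {
        case 0:  /* none of the conditions hold */      break;
        case 1:  /* only c0 */                          break;
        /* ... cases 2 to 14, one comment per combination ... */
        case 15: /* all four conditions hold */         break;
    }
}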
@protheu5
@protheu5 9 ай бұрын
Great video, very comprehensible.