DISCLAIMER FOR ALL BEGINNER PROGRAMMERS! Do not attempt it at home! Unless you really, really have to. In most cases you won't outsmart your compiler. For high level programming, you should focus on keeping your code clean and easy to read and maintain. Usually you'll notice speedup in your code by using correct algorithms and data structures, not by doing micro-optimizations. And if speed is your concern - benchmark first! Do not try optimizing code that may not need it.
@alexreeve4 жыл бұрын
Well, doesn't the first example in the video show exactly that? I mean, that goes for all optimization methods: only use them when you have verified that you actually need them (and then verify again) :)
@davidm.johnston89944 жыл бұрын
Thank you for saying that. That's what I was thinking, I was very intrigued by all of this additional complexity (from a readability standpoint)... Wondering if I could really outsmart a compiler by doing these techniques "by hand".
@Merthalophor4 жыл бұрын
Donald Knuth: Premature optimization is the root of all evil. This technique is only valuable in highly specific situations. If you're not sure if it applies to you, it doesn't.
@brianbarefootburns35214 жыл бұрын
This whole video is off limits for beginner programmers and comedy for experienced programmers. That piece of paper under his wired USB mouse is another telling sign.
@LeTonyJr14 жыл бұрын
Yes, this. How many times the code needs to run is a big consideration. If a function needs to loop through a dataset that numbers in the billions, it’s a good sign that micro-improvements like these might be of interest.
@Caesim94 жыл бұрын
I think it should be noted that branchless programming is VERY important in cryptography. It turns out that if you use conditional statements, your code sometimes runs faster and sometimes slower (specific branches are just faster or slower), and an attacker can use that to get info on your private key. So all cryptographically sound functions must use branchless programming. Very interesting topic
@JayJayYUP4 жыл бұрын
Do they use branchless for the most part? Or are they in violation of this?
@georgionic4 жыл бұрын
They could use constant time operations
@flyingsayon4 жыл бұрын
Why would you do that? Can't you use a timer to achieve the same result (not disclosing the computation time)?
@thenameipicked4 жыл бұрын
@sayon It is very difficult to actually accomplish this in practice. If your delay is a random value, then you can repeat your request multiple times and use the average time. Delaying so your total time is fixed is really difficult: You have to know in advance the maximum amount of time your algorithm could possibly take, and even with that knowledge, wait() is complicated and imprecise, and it is definitely possible that what you did before calling wait() will still affect the total time.
@flyingsayon4 жыл бұрын
@@thenameipicked I am talking about a fixed delay, yes, not a random value. Besides, how are you hoping to achieve fixed time just by balancing the computations when the process can be interrupted at any time? We usually have preemptive multitasking in our systems.
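For a concrete picture of what "branchless" means in the cryptographic context discussed above, here is a sketch of a constant-time buffer comparison, a common pattern in crypto libraries (the function name is mine, not from any particular library):

```cpp
#include <cstddef>
#include <cstdint>

// Compare two n-byte buffers without an early exit: the loop always runs
// all n iterations and never branches on the data, so the running time does
// not leak the position of the first mismatch.
int ct_equal(const uint8_t* a, const uint8_t* b, size_t n) {
    uint8_t diff = 0;
    for (size_t i = 0; i < n; ++i)
        diff |= (uint8_t)(a[i] ^ b[i]);  // accumulate any differing bits
    return diff == 0;                    // single data-independent test
}
```

Compare this with a naive `memcmp`-style loop that returns at the first mismatch: the early return is exactly the data-dependent timing the thread above is worried about.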
@Mackinstyle3 жыл бұрын
I LOVE how you introduced the concept by showing a way to manually optimize and how it actually fails to be better than the straightforward if-based version. That's such an important point to share up front any time we talk about clever optimization.
@edwardmacnab3542 жыл бұрын
except for his last example, which was light-years faster when hand-coded
@damaomiX2 жыл бұрын
That just applies to C/C++. For most other languages that don't have these kinds of tricky optimizations, branchless is always faster (JIT languages have JIT optimization, but in real scenarios branch prediction misses a lot). Not to mention that even in the video there are circumstances where the branchless version did outperform the branched version.
@ocoolwow Жыл бұрын
@@edwardmacnab354 oh you poor fool
@go-away-5555 Жыл бұрын
branchless generally applies when you're talking about assembly. C/C++ compiles to assembly, but so do many other languages, and some languages like C# can be compiled to assembly even though they are typically compiled into virtual-machine bytecode (using il2cpp and then a C++ compiler). For a purely interpreted language like Python, branch prediction and branchless code barely come into play for the developer. When you do any operation in Python, the actual Python runtime is almost always running dozens or hundreds of lines of C (compiled to assembly) that aren't related to your code in any way except for ensuring that you're requesting valid operations. And that itself usually requires tons of branching. If you add 2 numbers, the Python runtime treats those as full objects. First it checks that it's not operating on null objects, then it ensures the wrapped value is a number or a type with the addition operator defined. If it is a number, it will call the C function it has for addition. If it's not a number type, it will evaluate the PyFunctionObject representing the Python code of the function, which itself has like 100 lines of C just dedicated to passing the argument in properly as a Python variable. The most optimized way to write Python code is to minimize your time spent evaluating lines of Python code. When running a purely interpreted language like Python or Ruby, the actual operations you request are about 1% of the actual work going on in the CPU.
For a JIT'd language like Java, JavaScript, or regular C#, it really depends on how good the bytecode compiler and the JIT are. There are 2 steps here that can potentially optimize your code: the first compilation step to bytecode can definitely do many of the same optimizations a C++ compiler can, and the JIT can help pick the correct CPU instruction to handle several equivalent lines of a higher-level language (like the first function example). The Java compiler and JIT are quite good, sometimes better than current C or C++ compilers, at least in synthetic benchmarks of repeatedly calculating prime numbers, sorting arrays, and the like. For example, Java is capable of generating code that uses SIMD or AVX like in the final example in the video. Basically, branchless code doesn't help in a purely interpreted language because the runtime system itself has many branches. And for other languages, it depends on the code. If you're writing a branchless algorithm, it wouldn't hurt to benchmark it against the "naive" branching version of that algorithm. And benchmarking these system-specific optimizations may even produce different results on different CPU architectures. For example, some DSP devices have no branch prediction because they need strict guarantees on consistent realtime delivery. It's a slower architecture on average, but you can calculate a guaranteed number of clock cycles some sequence of instructions will take. A branch-predicting CPU can't make those same guarantees.
@asaskald Жыл бұрын
I'm sure he just LOVEs it too. We all LOVE it.
@jeremyseay4 жыл бұрын
I love this, but I'm reminded of Kernighan's law: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”
@microcolonel4 жыл бұрын
Brian misses the big-brain move here: just rewrite your code instead of debugging it.
@funkyflames74304 жыл бұрын
Work twice as long. Checkmate
@rickarmbruster87884 жыл бұрын
@this senator sometimes rewriting is a decent opportunity to see that your brain varies in quality of output each time :D
@mr_confuse4 жыл бұрын
@@rickarmbruster8788 I notice that in most of my projects, I do the basic structure, then add more to it, come back to the basic structure, see what a mess it is, Ctrl+A, delete, and rewrite it...
@rickarmbruster87884 жыл бұрын
@@mr_confuse u will get a hang of nice design eventually ^^
@barmetler4 жыл бұрын
So what we've learnt: If you write normal code and normal patterns that everybody knows, most likely it will be a pattern that the developers of gcc (or other compilers) thought of. That way you have a higher chance of the compiler doing good things.
@anandsuralkar29473 жыл бұрын
Yup
@QuantumConundrum3 жыл бұрын
@InSomnia DrEvil Real truth. I tell people on my team this all the time. If it's not hard, actually try it and extrapolate. Rarely (but it happens) you'll get different behavior past a data size threshold. And even then... usually you can infer that (or predict this would happen)
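As a concrete illustration of the "write the normal pattern" point in this thread: both functions below compute the same minimum, and at -O2 gcc/clang typically compile the plain ternary to branch-free code (a cmov) anyway, so the hand-rolled arithmetic version usually buys nothing. (Function names are mine; this is a sketch, not the video's exact code.)

```cpp
// The obvious version: compilers recognize this pattern and usually emit
// a conditional move rather than a jump.
int smaller(int a, int b) {
    return a < b ? a : b;
}

// Hand-rolled "branchless" version: two multiplies and an add to select.
// Correct, but rarely faster than what the compiler does with the ternary.
int smaller_multiply(int a, int b) {
    return a * (a < b) + b * (b <= a);
}
```

If in doubt, compare the two on a compiler explorer and benchmark — exactly the "actually try it and extrapolate" advice above.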
@ГеоргиГеоргиев-с3г3 жыл бұрын
All the code shown missed the fact that you could just flip bit 5 when in range, shaving 3+ additional instructions with a well-placed xor eax, 20h (stored in r8d instead of -1, times the char's ID in the multi-data case), since you don't really care about the original data, or you would back it up (the more complex code still used sub 32 and some other data-shuffle junk). So sometimes it's better to assemble it yourself.
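In C++ terms, the bit-5 trick this comment describes looks roughly like the sketch below (the range test uses the unsigned-subtract idiom; the function name is mine):

```cpp
// ASCII letters differ only in bit 5 ('a' - 'A' == 0x20), so XOR-ing with
// 0x20 flips the case of a letter. The range check is required: applying
// the XOR to non-letters would corrupt them.
char toggle_case(char c) {
    // Fold to lowercase with |0x20, then one unsigned compare covers both bounds.
    bool is_letter = (unsigned)((c | 0x20) - 'a') < 26u;
    return is_letter ? (char)(c ^ 0x20) : c;
}
```

A compiler can turn the selection into a conditional move or mask, which is the branch-free behavior the comment is after.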
@RobBCactive3 жыл бұрын
On the program I optimised, gcc did a good job of making the bottom-level if statements branchless, but that didn't stop my carefully optimised C being 100x faster than the natural C++ implementation using the standard library. It was well worth increasing the processing complexity.
@nickscurvy86353 жыл бұрын
It still matters. Compilers are great but not perfect. There are still seemingly trivial and arcane situations where the compiler will be confounded and unable to apply proper optimizations to code. It won't be noticeable if you're writing a user application, because those misses will be fairly infrequent. If you are writing a linear algebra library for physics calculations, then you are going to be performing millions of these instructions repeatedly in a very short period, and the misses become noticeable. Advocating against premature optimization doesn't mean all optimization is premature.
@randyscorner9434 Жыл бұрын
I've designed CPUs for over 30 years, and we pay a LOT of attention to breaks in code streams. Branch prediction, speculative execution, and even register reassignments are all about minimizing branch and pipeline impacts. What you may be missing in the above examples is that things that seem to avoid branches... don't. Every conditional test is a branch. Every conditional assignment is a branch. It is hard to beat the compiler optimization that understands the pipeline of the processor it's targeting. Clever manual decimation can be effective for long pipelines, such as in GPUs, but even then compilers are pretty smart. The most efficient code is when tasks are parallelized so that multiple operations can be done based off of one or a small number of decisions. In trying to avoid branches, the cost is paid in non-maintainable code, and it is very high. If you really want to improve efficiency, don't rely on multiple software packages which may actually include interpreted languages somewhere underneath the shiny API. Layering of packages and the use of interpreted languages (Python, Perl, etc.) waste much of the increasing performance of processors. Yes, it means recompiling for a new processor, but one does that if efficiency is required. In one example, we recoded a numeric-heavy program that was synthesizing frequencies for an electronic piano. Recasting it to vector operations allowed the program to run on a very inexpensive BeagleBone Black instead of a Mac. On the Mac it consumed 35% of the processor. On the BBB it used 15% of a much less powerful processor by vectorizing the operations. These are the sorts of changes that matter.
@kingacrisius Жыл бұрын
I mean he literally looked at the decompiled assembly code to show how the branching and branchless code affected whether or not the assembly decompilation was branching. In the first case, he showed that the compiler sometimes understands the purpose of code and makes great optimizations that already remove the branches, but also shows how in other, more complex code, the branchless code was much faster than the branching code. So I really don't think he or the video missed this.
@lobovutare Жыл бұрын
cmov is probably not branchless as claimed in the video. At the microarchitectural level it still needs to branch.
@BogdanBaudis Жыл бұрын
@@kingacrisius The thing is that looking at the assembly may or may not tell you what the CPU actually does. There are some cores (old ADI SHARCs, Microchip DSCs, and a few others) that have such a predictable, deterministic pipeline that you can actually count the clocks, and if you couple that to SRAM, then you have the whole story. But modern IA64 or Cortex-A cores do so many things on so many levels that without intimate knowledge of what the core actually does, manual optimization is just a hopeless exercise. Silicon vendors want their wares to shine, so they contribute heavily to compiler development. Unless you are in the know in these matters, the best you can do is optimize for a particular platform, test it, and then freeze the SW AND HW design. Sometimes this is what is needed and acceptable. But if you think such code will survive on other platforms or on future revisions of a given platform, you may be in for a rude surprise.
@kingacrisius Жыл бұрын
@@BogdanBaudis Well regardless he timed it too. Like it is so clear when the branchless programming is better, because he timed it and shows how the assembly is faster. I seriously don't understand where all of this is coming from.
@kingacrisius Жыл бұрын
@@BogdanBaudis And obviously some code may not be portable, but he was coding C++, which is almost always portable. And that doesn't matter anyways for this discussion.
@MultiMrAsd4 жыл бұрын
I want to add that the advantage of branchless programming is many times bigger if you are programming for the GPU and not the CPU. GPUs have multiple cores that share an instruction pipeline, where each core runs the same instruction but on different data. That often leads to both sides of a branch being executed and then one being discarded. I have seen performance improvements above 100x by going branchless!
@peterbonnema89134 жыл бұрын
Didn't know that. Thanks. Good point.
@nistarok1234 жыл бұрын
So, vertex/fragment shaders for opengl, for example?
@Luxalpa4 жыл бұрын
@@nistarok123 Yes, exactly. Avoid branches in GPU code as much as possible. In fact I think there are some shader languages that don't even allow the if statement.
@Kino-Imsureq4 жыл бұрын
@@JeredDanielson me too ;)
@scottmichaud4 жыл бұрын
Note: GPUs these days won't take the performance hit if all threads in a warp (NVIDIA), wavefront (AMD), etc. (others) go down the same branch... whichever branch that is. So let's say you have a complicated material depending on whether or not you're shadowed -- you will only traverse both paths if your warp/wavefront corresponds to a block of pixels that overlaps with an edge of the shadow. If the block is entirely shadowed or entirely illuminated, then it will only run the shadowed or unshadowed path, respectively. This is 32 threads in NVIDIA, 64 in AMD... although they do reserve the right to change.
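To make the warp/wavefront behaviour in this thread concrete, here is a plain-C++ sketch (function name and constants are invented for illustration) of the "compute both paths, select by condition" pattern a GPU effectively applies to a divergent warp:

```cpp
#include <array>
#include <cstddef>

// Each array element stands in for one "lane" of a warp. When lanes
// disagree on the condition, the hardware evaluates BOTH paths for every
// lane and a mask selects the surviving result per lane.
std::array<float, 4> shade(const std::array<float, 4>& light) {
    std::array<float, 4> out{};
    for (std::size_t i = 0; i < light.size(); ++i) {
        float lit      = light[i] * 2.0f;   // "illuminated" path
        float shadowed = light[i] * 0.5f;   // "shadowed" path
        bool in_shadow = light[i] < 0.25f;
        // Both paths were computed; the condition only selects.
        out[i] = in_shadow ? shadowed : lit;
    }
    return out;
}
```

As the comment above notes, real hardware skips one path entirely when every lane in the warp agrees on the condition; the double cost only shows up on divergent warps.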
@christianbarnay24994 жыл бұрын
05:00 The lesson here: never over-optimize your code beforehand. Start by writing natural and simple code. Let the compiler do its tricks. Check the result and only look for optimizations in the parts that the compiler didn't already optimize.
@tetramaximum4 жыл бұрын
I think the lesson is: know your hardware.
@scottmichaud4 жыл бұрын
@@tetramaximum Both are valuable lessons. Another one is "know your data". Imagine a super-inefficient O(n!)-scaling algorithm. If you can prove that your data must always be n
@tetramaximum4 жыл бұрын
@@scottmichaud agree
@SleepyMongoose4 жыл бұрын
Very true, the more generic your code, the more likely the compiler will understand what you are doing and perform optimization.
@scottmichaud4 жыл бұрын
@@SleepyMongoose For the things that it can. It's your job when it comes to effectively using overall data structures, etc.
@kylek.36893 жыл бұрын
Cryptographic libraries also use branchless programming a lot. This is to prevent timing side channel attacks, where the timing of an operation can provide a hint as to what was encrypted.
@RyanTosh4 жыл бұрын
When a normal person says something is slow: 5-10 minutes, maybe an hour or two.
When a programmer says something is slow: 2 nanoseconds.
@AntonioNoack4 жыл бұрын
well, normal people wouldn't think about doing it a billion times ;)
@soblakismyself4 жыл бұрын
When those instructions run multiple times throughout a program, those 2 ns start to make a difference.
@spicybaguette77064 жыл бұрын
When you get a cache miss: This little manoeuvre is going to cost us 51 nanoseconds
@nyrahl5934 жыл бұрын
@@spicybaguette7706 for an android that is an eternity...
@SamGarcia4 жыл бұрын
those build up... and then you have Internet Explorer
@programaths4 жыл бұрын
Shaders. Branchless techniques are mandatory in those!
@plazmotech59694 жыл бұрын
When I first heard people saying "avoid branching in shaders" I had no idea what they were talking about. "How could you ever possibly avoid branching???" I thought. Now I see how. Thanks!
@galandilvogler85774 жыл бұрын
It must be said, though, that even shader compilers are getting better and better, and you can use simple ifs which don't end up branching. For example, if you write: if(x>=0.8) {y=x} else {y=0} one might be tempted to optimize by writing: y = x * step(0.8,x) And yet, at least in HLSL, the compiler creates the same low-level code for both.
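A plain-C++ analog of the two HLSL forms above, with step() written out by hand, just to show they compute the same value (names are mine):

```cpp
// step(edge, x) is 1.0 when x >= edge, else 0.0 (as in HLSL/GLSL).
float step_fn(float edge, float x) { return x >= edge ? 1.0f : 0.0f; }

// The if-based form...
float with_if(float x) { return x >= 0.8f ? x : 0.0f; }

// ...and the multiply-by-step form: same result for every input.
float with_step(float x) { return x * step_fn(0.8f, x); }
```

Which one the GPU actually executes branch-free is up to the shader compiler, which is exactly the comment's point.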
@plazmotech59694 жыл бұрын
@@galandilvogler8577 Good to know! But still, best to make sure...
@clepirelli4 жыл бұрын
This is a common misconception. IIRC, uniform branching is 100% fine; the only problem with shader branching is that if a heavy branch is taken by even one invocation, the whole warp will stall. This is problematic, but it does not mean branchless techniques are better, as branchless techniques would just calculate both branches, meaning you'd actually have the whole warp do those heavy calculations. See shadertoy (dot) com/view/wlsGDl and twitter (dot )com/bgolus/status/1235254923819802626
@plazmotech59694 жыл бұрын
@@clepirelli Neat! Thank you
@vanlepthien67682 жыл бұрын
As a long-time software developer, I'd say that in virtually every case code maintainability is far more important than speed. I've written programs that processed millions of transactions - and branching was the least of the performance concerns. Paging and I/O are typically much more of a problem. Except in toy problems, memory management will be a much larger concern. Introducing repeatedly executed multiplication is generally not good for performance. "And"-ing 01011111 with ASCII lower case converts to uppercase just as well as subtraction does, and to my mind is much clearer. Unless there is a genuine need, don't make your code ugly to eke out a few cycles.
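The AND-ing this comment mentions, as a short sketch (only safe once the byte is already known to be a letter; the function name is mine):

```cpp
// 0x5F is 01011111 in binary: AND-ing clears bit 5 (and bit 7), so ASCII
// 'a'..'z' (0x61..0x7A) map onto 'A'..'Z' (0x41..0x5A) with no branch and
// no subtraction. Note this mangles non-letter bytes ('0' & 0x5F == 0x10),
// so callers must check the range first.
char to_upper_letter(char c) {
    return (char)(c & 0x5F);
}
```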
@henrikskott Жыл бұрын
I agree, except I'm convinced we all need to learn this the hard way. I haven't met any developer who did not try to be clever in the first 20 years of their career. Took me 30 years to start understanding this myself, but I need to really try all the fun things before I do it the hard way.
@gorak9000 Жыл бұрын
@@henrikskott I had a CS prof once tell a story of how he spent so much time trying to optimize the code as he wrote it that it took him weeks to write the code, and when he ran it for the first time, the performance was terrible. He ran it through a profiler, and the part he optimized took virtually no time to execute anyways, and it was some other part he didn't even think of that was taking all of the time. His message was "just write the code sensibly the first time, then run it, and only then, if there are performance issues, do you optimize anything, once you know exactly what part needs to be optimized." That's definitely a winning strategy! Don't waste time optimizing things that don't matter anyways.
@jacobzimmerman3492 Жыл бұрын
Once you understand the cache hierarchy, you realize a heap access can be equivalent to MANY branches.
@akenteva Жыл бұрын
Good point. I found the video interesting and informative, but part of me was wondering "in what context would these techniques genuinely be useful?". While I'm sure there are things out there that require this level of optimization, typically it makes more sense to prioritize readability. Also, great idea with the AND-ing! Very elegant way to solve it, at least in the case where the string only has letters. If non-letter symbols are present though, you'd need to add more logic to handle that.
@Josus Жыл бұрын
It could be even better to implement/improve this kind of optimization in the compiler itself. That way you can focus on the elegant code and gain a boost in maintainability while the compiler does the hard job of these very low-level optimizations.
@rorytulip93434 жыл бұрын
Leading with a negative example was smart - I think a lot of beginners would have tried applying this to homework assignments that really didn't need it if you hadn't.
@andrewsprojectsinnovations63523 жыл бұрын
"I can teach a monkey to do what you do. What I can't teach the monkey is when NOT to do it." - one of my father's chiropractic professors But yeah. Branchless programming is just one of many tools in our arsenal, not a silver bullet. Having a hammer doesn't make every problem a nail. Same here.
@aviralsood81413 жыл бұрын
@@andrewsprojectsinnovations6352 Maybe he should have taught your father not to become a chiropractor, much more helpful.
@maxineyoung32362 жыл бұрын
As a student & beginner, this is certainly correct. Was ready and excited to lmao.
@khhnator4 жыл бұрын
I'm surprised by the amount of negative comments; he is not telling people to always code like this... there is a difference between time-critical code and not. Get the damn profiler out and find where your program is spending most of its time, and THEN consider doing this or not. The only negative thing about optimization is that it costs development time. The idea that "the computers will be fast enough" is just naive. Effective optimization can be the difference between a product that reaches some of the market and one that reaches most of the market, or between a program that can be extended for decades to come and something that needs to be redone in the next few years.
@PhilippeCarphin4 жыл бұрын
@kingofallcrypto Yeah, I can definitely imagine seeing stuff like this in first-year students' code. In one project, they had to keep track of 16 boolean values on a microcontroller that had way more than enough memory for what we were doing. They wanted to encode them in the 16 bits of a uint16_t (16-bit integer type). It's natural for them to think that it's a good idea, since we did tell them to use uint8_t over uint16_t when the number they needed to store was only going from 0 to 10. But the reason we told them that was mainly as an excuse to talk about architecture, explain that the microcontroller had an 8-bit bus, and discuss related things. It did give me a chance to have a good discussion about balancing programmer effort with execution speed. In the uint8_t vs uint16_t case, there are no downsides and there is a theoretical benefit in speed. In the 16-booleans-in-a-uint16_t case, it's going to be a complete pain in the a** to store and retrieve the actual booleans (even setting aside that the memory saving has a runtime cost), we're introducing the possibility for errors, and we have more than enough memory by a wide margin for that project.
@dincerekin4 жыл бұрын
time critical code should not be run on a micro
@iVTECInside4 жыл бұрын
@kingofallcrypto , I would hope that it would be obvious to anyone that ended up at this video.
@wusulus4 жыл бұрын
You try to make it seem that it is worth it every time, while it certainly isn't. This is the kind of thing you would use if you've used every other trick in the book and need it for some one-in-a-million project.
@webkar4 жыл бұрын
@@dincerekin Think about what you said next time when your ESP/ABS saves you from skidding into a tree.
@frogge64433 жыл бұрын
Your mousepad is a masterpiece.
@robertoaguiar62303 жыл бұрын
high albedo for that max dpi specs. Don't tell gamers, or the prices will blast up
@evanmagill91143 жыл бұрын
@@robertoaguiar6230 I haven't thought about the albedo consideration there... I've seen a lot of very dark color mousepads... Hmmm But I suppose it would be albedo in the specific range the mouse actually uses. Infrared?
@kaanozk9 ай бұрын
Don't look at the finger pointing towards the moon or you will miss the heavenly glory
@josiahmanson3 жыл бұрын
A fast branchless way to calculate ToUpper is to use a lookup table. The table is 256 bytes long, and easily fits in L1 cache, so character conversion takes a single instruction and will be a fast memory access. I think this is what is done in the standard library.
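A sketch of the lookup-table approach this comment describes (the table and function names are mine; the comment's guess about the standard library is the commenter's, not verified here):

```cpp
#include <array>

// Build the 256-entry table once: identity everywhere except 'a'..'z',
// which map to 'A'..'Z'. After that, every conversion is a single
// data-independent load from a table that easily fits in L1 cache.
static const std::array<unsigned char, 256> kUpper = [] {
    std::array<unsigned char, 256> t{};
    for (int i = 0; i < 256; ++i)
        t[i] = (unsigned char)i;
    for (int c = 'a'; c <= 'z'; ++c)
        t[c] = (unsigned char)(c - 32);
    return t;
}();

char table_to_upper(char c) {
    return (char)kUpper[(unsigned char)c];  // one load, no branch on the data
}
```

As the replies note, this only covers single-byte encodings; Unicode case mapping is a different problem entirely.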
@IceExtremeGamers3 жыл бұрын
Doubt it will work with non ASCII characters, though.
@Henrix19983 жыл бұрын
@@IceExtremeGamers none of this works for non-ascii
@IceExtremeGamers3 жыл бұрын
@@Henrix1998 sadly
@josiahmanson3 жыл бұрын
@@IceExtremeGamers His whole discussion was assuming ASCII as a toy example. Yes, the small lookup table will not work for general multi-byte Unicode (especially variable length encodings like UTF-8), but Unicode is more complex and none of the other stuff described in the video, like SIMD, would work either. For 8-bit character encodings, the table is actually far more general than anything based on conditionals such as conditional moves. The table approach would also be usable for 16-bit character encodings, as 64 KiB is not a lot of memory nowadays, and the parts of the table that would actually be used in practice would almost certainly fit in L1D$.
@igorstasenko91833 жыл бұрын
@@josiahmanson yeah, I was waiting for the author to present a final-final version of the algorithm using lookup tables... but the video ended :)
@MM-244 жыл бұрын
I want to buy this guy a mouse pad and maybe a second monitor...
@Bobby.Kristensen4 жыл бұрын
He seems to be stuck in the 90s.
@stevecarter88104 жыл бұрын
Nobody's stopping you
@lour.2224 жыл бұрын
Goes to show it's what you know, not what you own.
@MM-244 жыл бұрын
@@stevecarter8810 no they aren't, that's good to know
@MM-244 жыл бұрын
@@lour.222 I think its a combination, good tools in the hands of a craftsmen sort of thing
@RFC-35143 жыл бұрын
For someone who started coding in assembly before pipelining and speculative execution were a thing, and when multiplications were super expensive, the idea of multiplying by a boolean to make a selection always feels slightly wrong. And a part of me still wants to replace every multiplication with shifts + adds, or look-up tables. :-P
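The shift-and-add habit mentioned above, for the record (modern compilers perform this strength reduction themselves when it's profitable, so today it is mostly of historical interest; the function name is mine):

```cpp
// x * 10 decomposed into shifts and an add: 10x == 8x + 2x.
// Valid here for non-negative x; left-shifting negative ints is
// undefined behavior before C++20.
int times10_shift(int x) {
    return (x << 3) + (x << 1);
}
```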
@Theineluctable_SOME_CANT3 жыл бұрын
Yep!
@WorBlux3 жыл бұрын
Multiply by 1 or 0 and you get an early exit on modern ALUs. With pipelining and SIMD, you come out ahead if you stack a large enough vector up: 8 ops/cycle plus a constant 3-cycle multiply penalty. If you have 800 elements, you are basically doing that multiply in 1/8 of a cycle. So even with 2 multiplies and an add, it's still over twice as fast before any branch time or penalties are involved. And then AVX-512 and SVE2/Cray vectors have predicated SIMD ops where you can create a predicate mask and then conditionally perform a vector op: CompareLT V1 and V2 to create a mask, then copy V1 into V2 using the mask to predicate each lane. The element from V1 is only copied if the LT mask is true.
@Theineluctable_SOME_CANT3 жыл бұрын
@@WorBlux can I hire you on for my next big programming job?
@WorBlux3 жыл бұрын
@@Theineluctable_SOME_CANT Keeping myself informed about cpu microarchetecture and programming is a hobby of mine. Actual programming not so much. Being outside and producing something I can touch is more rewarding for me.
@Theineluctable_SOME_CANT3 жыл бұрын
@@WorBlux yes, for me also. Machine microarchitechture is quite interesting to me.
@unformedvoid22232 жыл бұрын
It's useful when you do shader programming, where branching is very expensive. I had to «invent» some of those examples by myself when I did shader programming. Great video!
@SeanCMonahan Жыл бұрын
Same! It was an interesting experience refactoring some branched OpenCL code to use algebraic solutions to replace the branches.
@krytharn4 жыл бұрын
Branching in a shader on the GPU is also a thing, but for a different reason. Here, the GPU runs the code in lockstep, which means that the ALU runs the same instruction for multiple pixels or vertices at once. To stay synchronised, it runs both branches of the conditional, then throws away the result that did not meet the conditional. So, in general, it will have to run all the code all the time and branching does not give any advantage. Note: if all pixels meet the same condition, only one of the branches is executed.
@abhinoorsinghpannu29934 жыл бұрын
Same goes for cuda programming also
@WhatsACreel4 жыл бұрын
So true! GPU's hate branching maybe even more than CPU's! They call the conditional execution 'predicate' in CUDA.
@dakrontu3 жыл бұрын
Yeah but that inefficiency of computing both branches and discarding one is the passport to being able to run the many parallel streams so you still gain.
@bandogbone32653 жыл бұрын
I was doing this back in the late 1970's, on a Cray-1 supercomputer with pipelined vector registers. Complex single-instruction operations were broken down into a sequence of simpler operations, each performed by a separate functional subunit. After one clock cycle, the first functional subunit was free to begin operating on the second element on the vector, while the second functional subunit began operating on the intermediate result in the first element of the vector that was left there by the first functional subunit. Complex operations like floating point division could then produce a result every clock cycle, even though the operation could take a total of six or more clock cycles to perform on a single operand beginning-to-end. Like stations on a car assembly line, the game was to keep as many functional subunits busy as possible at any given time. We would carefully map out each tick of each component of the CPU on a big chart before implementing in assembler. The thing was blazing fast, even by today's standards. Nice vid!
@duanebonas86302 жыл бұрын
Class explanation. I Love it . U know what u going on about.
@spx02 жыл бұрын
Great
@treelibrarian7618 Жыл бұрын
this is still the game, with 5 logic and 5 memory execution ports available for out-of-order processing (execution is determined by data dependency order), writing code to keep all of them busy with chains of non-dependent code in each loop to mitigate latency over iteration - although division is one thing that isn't pipelined. I guess division is rare enough that it's not worth the extra silicon.
@danielstory2761 Жыл бұрын
The iPhone 6 is capable of 20 times more FLOPS than the cray-1
@GregMoress Жыл бұрын
That's called 'Pipelining', and I'm sure you knew that. It's a form of parallel processing.
@StructOfArrays4 жыл бұрын
For whatever it's worth... Smaller_Branchless produces assembly identical to a branched version when using clang 10 and compiling with -O2. In GCC 8.3, Smaller_Branchless is flat out worse than the branched version - the branches were optimized out by the compiler. In any case, for the branched versions the awesome compiler optimized away the branches. So for the noobs out there: if you're doing something like this, always check the compiler output and measure, measure, measure.
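In the spirit of "measure, measure, measure", a minimal timing harness might look like the sketch below (illustrative only: a serious benchmark also needs warm-up, many repetitions, and protection against the optimizer deleting the work; the function name is mine):

```cpp
#include <chrono>

// Time a callable over `iters` iterations and return average ns per call.
template <typename F>
double ns_per_call(F f, int iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Run both the branched and the branchless candidate through this (and inspect the generated assembly) before deciding which one to keep.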
@Guztav13374 жыл бұрын
"in 99% of cases the compiler is smarter"
@Henrik_Holst4 жыл бұрын
Not only that, but Smaller_Branchless is not actually branchless; conditionals are just hidden branches. Writing a * (a < b) still forces the compiler to add a test and jump (i.e. a branch) to perform the a < b comparison.
@0LoneTech4 жыл бұрын
@@Henrik_Holst No, branches are distinct from conditionals. In this case, we see it transferring a condition flag into a general purpose register value using SETc instructions, because x86 has a condition flag register; not all architectures do. Conditional instructions like x86 CMOV or AVR SBRS operate by cancelling the retire step of a single instruction, which doesn't divert the instruction decoding; that's a cheaper form of branching. What you say would apply to architectures where the condition test is embedded only within branch instructions, as if you removed the CMP* instructions and had to use B* instructions in Nios II. That would be a typical case where a compiler might have a different optimization to avoid the issue, like using the sign bit from a subtraction.
@Guztav13374 жыл бұрын
@Isaac Beizsley It, indeed, won't turn your O(n^2) into O(n). But it will fix your branches better than most people in most cases. (The example OP made). It indeed won't do any simd. However, if you are prepared to go assembly just to speed up the code, it is worth assuming you know what you are doing. Like in the video, where he does indeed outsmart the compiler.
@Guztav13374 жыл бұрын
@Isaac Beizsley Yeah that's annoying. I want to say that the world would be a little bit greener if it wasn't for all bad code. lol
@tolkienfan19723 жыл бұрын
1. Make sure your branch targets are aligned to 16-byte boundaries.
2. If you're comparing to upper AND lower bounds (whether masking or branching), you can subtract the lower bound and do a single compare instead.
3. When writing assembly, pay attention to instruction dependencies, and interleave independent instructions. E.g. load 1st char, load 2nd char, compare 1st, gen mask, compare 2nd char, and so on.
4. You can often apply these techniques directly in C++, possibly with intrinsics.
... Just my 2c
@sylviaelse50863 жыл бұрын
In some architectures, multiplication takes more cycles to execute than less demanding arithmetic. In the first example if a < b usually comes out to be the same (whether true or false), then branch prediction will avoid the delays inherent in "if", and the result could be faster than using multiplication. The architecture may even be incapable of calculating the true/false result other than by the use of branches. In any case, unless the function is called intensively, the benefits, or otherwise, of the branchless approach will be invisible.
@notgabby604 Жыл бұрын
You can get 4x or 8x the speed for some numeric code using assembly (e.g. fast Walsh-Hadamard transform, neural network code). Sometimes the compiler can find better assembly than you can. If you need performant code, try both ways. GCC found much better code than me for this in FreeBASIC:

sub subrandom1(result as single ptr, x as single ptr, n as ulongint)
    dim as ulong ptr resultptr = cast(ulong ptr, result)
    dim as ulong ptr xptr = cast(ulong ptr, x)
    dim as ulong r = &hE8FE90EB, rs, shift = 18
    for i as ulongint = 0 to n-1 step 4
        rs += r
        resultptr[i] = xptr[i] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+1] = xptr[i+1] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+2] = xptr[i+2] xor ((rs xor (rs shl shift)) and &h80000000)
        rs += r
        resultptr[i+3] = xptr[i+3] xor ((rs xor (rs shl shift)) and &h80000000)
    next
end sub
@willofirony4 жыл бұрын
Awesome! Branchless programming would be even faster if the CPU had an instruction that filled all the bits of a register with the zero flag. These "computer science approach to C++ programming" videos are getting really good and I suspect are unique on youTube. You are improving (possibly, even saving) our mental health with these conceptual challenges and I, for one, am so grateful.
@WhatsACreel4 жыл бұрын
Thanks mate! I certainly wish the instruction set had a few things that are missing too! Cheers for watching :)
@vyor88374 жыл бұрын
@@WhatsACreel could ask intel and AMD to add that...
@halbgefressen97684 жыл бұрын
setz eax
ror eax, 1
sar eax, 31
@CarrotCakeMake4 жыл бұрын
Nothing in this video had anything to do with computer science. It was all computer architecture.
@RAMIdotGG4 жыл бұрын
every community needs a Michael Keegan :) refreshing to read a wholesome honest opinion
@TheAguydude4 жыл бұрын
Note that the branch predictor is generally able to predict branches accurately for loops, since they almost always take the same path. Thus, getting rid of a loop branch is usually not of much concern for the purpose of avoiding branch-prediction failures. Of course, there's still value in improving loop iteration code, since it still represents instructions that run on every iteration.
@TheGandorX4 жыл бұрын
That's why JIT (Java) is such a good idea. The run time / JIT platform applies the optimizations best suited for the hardware.
@garretgang83494 жыл бұрын
@@TheGandorX Unless you have a limited amount of parallelization that doesn't allow the JIT to run on a separate, physically parallel thread. Then it about doubles your run time.
@daantimmer3 жыл бұрын
Only if you have sorted data. Once you enter the territory of truly random data, the branch predictor practically doesn't exist anymore, since it simply won't be able to predict. Now, some CPUs (don't pin me down on which ones exactly) can execute two branches simultaneously in parallel while still processing the condition. Once the condition output is known, the output of the correct path is used and the other data path is thrown away. This has quite some prerequisites though, but it works pretty well.
@andywolan2 жыл бұрын
@@daantimmer Ya, I think that behavior was the basis of the Spectre and Meltdown CPU bugs discovered a few years back. (Correct me if I'm wrong.)
@daantimmer2 жыл бұрын
@@andywolan I won't be able to correct you even if you are wrong because I never really dived in to those issues :-)
@metamud86862 жыл бұрын
The primary rule of optimization is left out here, which is to FIRST MEASURE, aka 'profile', to see which part of your code the CPU is spending most of its time in. Once you know where that is, you try to find any algorithmic/complexity improvements (take your algorithm from O(N²) to O(N) or even O(1)). Then finally, after all that is done, you may think about 'intervening': checking the assembly and seeing if you can "do better yourself". Introducing obfuscating code like a*(a>b)+b*(b>=a) leads to bugs and unnecessary rabbit holes for your future self when you get called in to debug the mess you made.
@brandonchurch60254 жыл бұрын
This is really good fundamental information for some of us "spoiled" high-level programmers who don't typically think beyond compilation
@ponypapa67854 жыл бұрын
Agreed, but highly dangerous and misleading information for all the others, as a disclaimer that this is basically "experts only" is missing. And seeing what inexperienced programmers try to accomplish, this can lead to a lot of technical debt, as it will be misunderstood, badly used, and then mixed with branching techniques. So yeah, fun :D
@juangalton9992 ай бұрын
Agreed 100%. As a JavaScript programmer, I've noticed many of my peers don't seem to care much about how the language works. I know there are many performance decreases from it being a dynamically/weakly typed language with just-in-time compilation. So whatever edge I can eke out in performance, I will take. Plus I eventually want to learn more than one language. But making sure I REALLY understand JS first.
@connoisseurofcookies20473 жыл бұрын
I remember Bjarne Stroustrup saying something to the effect of 'The compiler will write better assembly than a human ever could', and in most cases, especially larger ones, that holds true.
@christopherjoseph6513 жыл бұрын
The compiler can only write assembly as good as the person who wrote the compiler can write assembly. Your statement is mostly true for people who don't understand assembly and the way CPUs actually work
@CottidaeSEA3 жыл бұрын
@@christopherjoseph651 Let any of the people who made the compiler try to write the same program in assembly as in C++ and see which one performs faster and with fewer bugs. It doesn't matter if you understand a CPU, the assembly code from the C++ compiler will be far better than any manually written assembly. Because the people who make the compiler only do translations and optimizations, they do not write entire systems with it. Because that's stupid.
@dionyzus29093 жыл бұрын
@@christopherjoseph651 Not really; at the largest scales this statement holds absolutely true. There are limitations to what humans can do. A large project in pure assembly would be just plain impossible for a human to do; it's unrealistic.
@christopherjoseph6513 жыл бұрын
@@CottidaeSEA Did you not watch the entire video? He proved that a human who understands the instruction set of the CPU, and who understands the purpose of the high-level code, can fully utilize the instruction set and hardware of the CPU. He literally beat the compiler by 45x! If the compiler is so superior, why didn't it write the code using AVX? Exactly: a human wrote the rules that the translations and optimizations follow. So a compiler is only going to write code as good as the person who wrote the compiler. This has nothing to do with program scale. The point is a human wrote the compiler, so the compiler can only write optimized code as well as the person who wrote it. If that person doesn't know how to perform an optimized task in assembly, then the compiler won't be able to optimize the code either. What, do you think we're living in a world of Skynet where computers build more sophisticated computers and write their own programs?
@christopherjoseph6513 жыл бұрын
@@dionyzus2909 This is a childish statement written by someone who believes in the new education model in which older languages are outdated and inferior and therefore shouldn't be taught. We don't need new, more complex languages to solve new and more complicated problems. To this day, when a program needs to be optimized for speed, it is written by a human using inline assembly. Even the guy who made this video understands that you should look at your disassembled code and be capable of understanding what the compiler is doing. Do you need to write a large program that is not time- or resource-critical in assembly? No, there's no need for that, but the statement that a compiler will write better assembly than a human ever could is COMPLETELY FALSE!
@Zeuts852 жыл бұрын
This sort of optimization seems like what you'd do only for high performance real-time applications--like video games for instance--and only then as a third-pass after you'd first written the code in much more human-readable format. My coding bootcamp instructors drilled into me the mantra: First make it correct, then make it clean, then make it fast. Ideally by the time you reach 'make it clean' you've got good tests that pin down the functionality of the code before you refactor. After that you can come up with fast, clever implementations that still pass the tests. You shouldn't ever need to debug the fast clever implementations if you've restricted yourself to good, pure functions and written your tests correctly.
@saudude2174 Жыл бұрын
bootcamp instructors prepare people for no brain web dev jobs. This is something you will never use in such a work environment. It's mostly for advanced, low level programmers.
@C5FlyingSquad4 жыл бұрын
I think more people need to see your videos. Although they're specialised, they feel valuable for non-systems programmers like myself.
@gangstasteve57534 жыл бұрын
this is a very underrated youtube channel. i learned assembly from him.
@volchonokilliR4 жыл бұрын
There's [[likely]] and [[unlikely]] attributes in C++20, they can help in some cases
@vyor88374 жыл бұрын
And that's why you update your compiler and SDK...
@neur3034 жыл бұрын
Also there is profile guided optimization, but sometimes you have paths that are hard to predict, because the patterns are too elaborate.
@user-tr8kr1jd2o4 жыл бұрын
Vyor GCC has had macros for it for years
@leo691054 жыл бұрын
Have you tried if these help? Afaik, the branch hints are x86 only and no longer available in x86-64 mode. Not sure about arm.
@HermanWillems4 жыл бұрын
@@neur303 For example, I have a branch that says High Alarm (so x value is higher than y). I as a programmer know that one of the two almost never gets executed. Does this mean I can "guide" the compiler, as in: hey branch predictor, please choose this one first for your prediction, the other branch will almost never hit? Is this possible, especially if you need to check this for hundreds of thousands of signals each cycle? [[likely]] to be lower, [[unlikely]] to be higher than y.
@gregorym62702 жыл бұрын
You write return a*(a
@ChrisStavros5 күн бұрын
You're correct. Neither the video maker nor the adoring commentators understand what a branch is, and are just jerking off trying to eliminate short jumps. Fucking brutal.
@psun2564 жыл бұрын
I never really thought about CPU caching and "predictions" affecting conditional statements, this is super cool!
@konstantinkh4 жыл бұрын
When you get into serious depths of optimization, you start looking at things like branch predictability. Like, there are situations when you chose to write a less optimized version of algorithm, because it has more consistent depth, and that means the loops are going to repeat a similar number of times, making branches on them more predictable, resulting in faster execution.
@psun2564 жыл бұрын
@@konstantinkh So things like for loops?
@konstantinkh4 жыл бұрын
@@psun256 Yeah. Because on assembly level, every loop is just a branch that sends the execution to beginning of the loop. So the CPU will hit the same branch over and over again as it's running the loop. And modern CPUs have very clever ways to get a heuristic for these branches. Like, if you have a very tight loop in some algorithm that always runs for, say, 10 iterations, your CPU can actually learn to predict that every 10th time it's not going to jump and start caching instructions from the correct branch. At that point, it's like the branch isn't even there! If the number of iterations varies, then the prediction is worse. Best your CPU can do in these cases is "usually, jump is taken," so it will always cache instructions for repeating the loop, and will miss and have to flush the pipeline whenever the loop finishes, which is still better than not having prediction at all, but not optimal. That's one of the reasons why an algorithm with consistent iteration counts can sometimes perform better than algorithm where iteration count varies. Of course, there are also optimizations your compiler can do if it can "guess" the number of iterations, which can make even bigger difference. We are getting into territory of optimizations where you _really_ need to have a good reason for going after, though, as you'll spend a lot of time testing, timing, and looking at disassembly, and you shouldn't be doing that until you've fixed all other timing problems. The only real world example I have where we've spent time going after this kind of stuff is character animation in games. When you have hundreds of bones in hundreds of skeletons for which you have to update the transform matrices 60+ times per second while leaving enough time for CPU to do everything else it needs to, you start looking at things like cache and branching prediction optimizations.
@psun2564 жыл бұрын
@@konstantinkh Huh, that's pretty neat. Thanks for sharing!
@A.Martin3 жыл бұрын
@@konstantinkh It would be great if you could actually tell your cpu which branch to load by default.
@youuuuuuuuuuutube3 жыл бұрын
Branchless works great for GPU programming (GLSL etc). It's actually worth complexifying the algorithm just to minimize the branching (but not always).
@jcm26062 жыл бұрын
Really depends on whether the branch is divergent or not. Modern GPUs have some level of branch prediction built in and will try their best to only take one path through the branch while leaving the other path alone, so if you know for a fact that the branch is very convergent (ie it's using a uniform/constant value, or it's based on data that's convergent in nature), then there's nothing wrong with branching (except on ancient GPUs). However, if the branch is very divergent (ie it's based on data that's divergent in nature), *then* you should try to implement the same logic in a branchless manner, as the GPU is forced to take *both* paths through the branch, with half of the warp being disabled through each branch.
@SpiritmanProductions2 жыл бұрын
This will seem trivial to some, but anyone planning to do arithmetic with the result of a boolean expression in .NET should bear in mind that 'True' equates to -1, rather than 1 as in C++. EDIT: I'll just add that any non-zero value (positive or negative) equates to 'True' in a conditional statement, and zero equates to 'False'.
@TurboXray2 жыл бұрын
Someone needs to explain why that was chosen hahah.
@Vacuon2 жыл бұрын
-1 (int) is just max_uint, aka 0xFFFF... you can do (boolVal & 1), but it gets tedious haha And I've never used a system where true is -1 myself but if I had to guess I'd say it's because the designers wanted logical operators to work on boolean values, e.g.: not false = true (~0 = -1) true or x = true (-1 | x = -1) etc, etc, etc.
@TurboXray2 жыл бұрын
@@Vacuon I've worked with plenty of systems where -1 can be true, but it's not the ONLY truth value. As in, 0 is false and everything else is 'truthy'. ~ (BoolValAsNegativeOne) is just BoolValAsNegativeOne XOR -1, vs BoolValAsOne XOR 0x01. It doesn't appear to show negate -1 as an advantage. I mean, for either method you still have to auto-demote to a bool type to perform the operation. And it's the same amount of assembly instructions to invert; load the value, perform the operation. I'm SURE they had a reason for doing it. The thing about having a TRUE as a negative -1, means you can't index an array via an evaluation myArray[a>=b].. well I guess it works if your language supports negative indexing (like Python).
@SeanCMonahan Жыл бұрын
Oh, god, I bet that has led to more than a few bizarre bugs.
@vladimirarnost8020 Жыл бұрын
@@SeanCMonahan Not sure about C# but I wish C/C++ also used -1 (aka ~0) for 'true'. It would make very practical bit masks. For example, the video uses the slow integer multiplication (imul) instruction where a simple bitwise AND with a proper mask (0x0...000 or 0xF...FFF) would do instead. Conveniently, 0xF...FFF = -1U. Also, 0x0...000 and 0xF...FFF are just complementary values. You get the opposite one just by flipping all bits. x86 CPUs even have an instruction for that (NOT). ~0x0...000 = 0xF...FFF. All expressions consisting only of 'magic numbers' like this are pre-computed by the compiler. So if 'true' was -1, the min(a, b) function could be expressed as: return a & (a < b) | b & !(a < b); Since 'true' is 1, we have to flip the sign: return a & -(a < b) | b & -!(a < b); Or leave it to the compiler because the compiler writers know well how to compute a minimum of 2 numbers properly on your platform. Use 'std::min()/max()' for branch-less programming. The CMOVxx instructions have been with us for a very long time (since Pentium Pro in the mid-1990s) so they can be taken for granted now. :)
@berndeckenfels4 жыл бұрын
A good example for branchless code is in crypto, where data-dependent code execution can lead to side-channel leaks; therefore you look for branchless and constant-time algorithms. So, for example, to compare two byte arrays with hash results, you would accumulate the XORs of each byte pair and not break the loop on the first difference.
@inx18194 жыл бұрын
add random sleeps in the code and you're alright /s
@yvrelna4 жыл бұрын
@@inx1819 Actually you don't want to add random sleeps. If you just sleep randomly, all operations will be extended by roughly the same amount of time, so you can still statistically figure out which operations takes longer than the other. Instead, what you want to do is to sleep longer if the operation finished faster and sleep less when the operation are slower. You can do this by deciding beforehand an operation how long an operation should take. For example, you might decide that an operation should take 1 second, so you do the operation, and if there's any remaining time, you sleep for the remainder. That means no matter which branch the code takes, they'll always take 1 second. Now, this isn't perfect, there's also power analysis because doing an actual operation likely costs you more power than sleeping, so this is enough if your attacker is remote (e.g. webserver). But if you need to consider a local attacker that has full physical access to the hardware (e.g. smart cards), you might have to busy sleep.
@gutoguto08733 жыл бұрын
@@yvrelna you didn’t see his /s
@ChrisM5412 жыл бұрын
Brilliant explanation. The following bombshell caught my eye... 12:19 "So it just goes to show that you don't automatically get speed from jumping over to assembly...you have to code it well" - very well said ;-) With every language above assembly, it's a shame that we have so little control over the final machine code output, even with C/C++. Lessons: 1) Never trust your compiler. 2) Become an expert in assembly language and learn to e.g. inline where needed.
@Vacuon2 жыл бұрын
'If' is not *necessarily* slow: if your condition goes to the same place 98% of the time, your branch predictor will be very accurate, and the pipeline will rarely be broken.
@musaran2 Жыл бұрын
Even that is not free. Branch predictors tend to have limited memory; relying on it somewhere hurts branches elsewhere. But then cache is limited too, so smaller code is often better.
@iXPilot Жыл бұрын
Funny that I got recommendation of this video after watching a few presentations about the Elbrus CPU, which doesn't have the branch predictor at all. Now I understand what kind of hoops the developers of it's compiler and the porters of JVM had to jump through...
@rongarza94882 жыл бұрын
Early on in my programming career I noticed that there was usually a relationship between speed and instructions -- time and space. Plus, you can throw in the time it will take to debug clever code.
@Websitedr3 жыл бұрын
I can see these techniques being applied for things that require a lot of calculations to be done as fast as possible. The issue is when it comes to other people looking at the code without fully understanding why it's written this way.
@eomoran2 жыл бұрын
Aw jeez, if only you could write a comment annotating it /s
@TheLordoftheDarkness Жыл бұрын
Skill issues
@valtergomes58082 жыл бұрын
In my opinion, the main reasons to avoid overly branchy code are readability and time-consuming maintenance. But for that issue, clean code with well-designed patterns will save you a bunch of 'ifs', which automatically makes your code easier to understand and easier for the compiler to optimize. Branchless programming the way it is presented here has a proper place and time; if you need it, you will know.
@Bregylais4 жыл бұрын
Somebody gift this man a mouse-pad.
@jonathanross62602 жыл бұрын
Branchless programming is often about taking a step back and finding a simpler approach. In the second example, I would just take d[i] & 0xDF; it keeps the upper case characters intact, but flips the 6th bit for lower case letters making them upper.
@WhatsACreel2 жыл бұрын
Yes, that will change some non-letter characters too. If those characters aren’t in your text, then AND away my friend! It will certainly be faster!! The example is not super important, just the technique of masking in SIMD to perform multiple branches without actually branching. Cheers for watching :)
@thewhitefalcon8539 Жыл бұрын
could try something like d[i] ^= ((d[i] - 26) < 'a')
@TheCroustade3 жыл бұрын
Some processors have a conditional select instruction, which at the source level corresponds to a > b ? a : b. Branchless programming is a must in time-critical applications where worst-case execution time prevails. This skill is appreciated in avionics design companies.
@russellthorburn92974 жыл бұрын
It seems to me that it's a choice of evils: 1. Easy to read but much slower code. 2. Optimized but heavily obfuscated code. Option 2 is WAY above my pay grade :-) so I'd go with #1 unless I had no other choice. Still, your video was extremely well done. Even I was able to follow what you were saying and I'm merely an intermediate level programmer. That's definitely worth a "Subscribe".
@tragedyofwind4 жыл бұрын
this is why programs from the very old days were really hard to read: scientists optimized them for machines whose clock rates were tiny by today's standards. However, the rise in computing power means there is no need to write machine-optimal code when the machine can do more work for you, so human programmers can spend less time writing less complex code and produce more programs for more purposes. This is a trade-off; it really depends on what type of work you were doing.
@Dji-gurda_Jdi-druga4 жыл бұрын
I would say you should determine bottlenecks and use #2 in them and #1 everywhere else.
@retrozvoc61894 жыл бұрын
How about a middleware that's readable to you and which gets JIT-compiled to whatever CPU you wanna run it on?
@DoomRater4 жыл бұрын
I dunno, Option 2 is what compilers are for. Or in my case, precompilers, since I work predominantly in a compiled language without certain niceties and use a precompiler that translates the extra control statements I want to use into standard controls that actually exist. Also look into the M/o/Vfuscator for more evidence that optimizations can often be offloaded onto the compiler. Edit: I said all that BEFORE WATCHING THE VIDEO WHERE HE SAYS COMPILERS HELP
@PhilippeCarphin4 жыл бұрын
"Before, we used programmers to save on computing resources, now we use computing resources to save on programmer time." Don't remember who said that. And like the others say, that doesn't mean #2 should never be used, just that you should only use it for measured bottlenecks that are worth it. Even if you identify a bottleneck but your program doesn't need to be fast, then you still don't need to optimize.
@frigzy37483 жыл бұрын
You can replace 32 in C with 'a' - 'A' just to not have magic numbers. Great vid - thanks!
@shanehebert3963 жыл бұрын
Although not the purpose of this video, the old school trick for toupper is to bitwise AND each byte with a mask (~(0x20)), presupposing the arrays are all ASCII characters in the range of A-Z/a-z. Of course, once out of ASCII, this isn't the right method.
@frigzy37483 жыл бұрын
@@shanehebert396 I never thought of it - thanks!
@BlankBrain3 жыл бұрын
Back in the late '70s, I came across a set of algorithms that computed Julian and Gregorian dates using only double-precision arithmetic. Leap years and Y2K worked without modification. Operating systems and databases frequently got leap year and daylight-saving-time calculations wrong, and it took a lot of time and effort to recover. The software I wrote using the arithmetic functions only failed once, and that was when someone compiled them in single precision. It made me a big fan of using arithmetic functions to perform logic.
@donjindra4 жыл бұрын
In time-critical code know your cpu. I once rewrote a graphics library in Z80 assembler and sped it up 10x by unrolling loops and counting cpu cycles for every bit of code. There are all sorts of tricks you can use to improve speed as any good programmer knows.
@donjindra3 жыл бұрын
@jshowa o No, tools of the trade don't necessarily make one a good programmer. But pretty much only good programmers know these things. And there is nothing unmaintainable in these designs. Again, one just has to be wise about implementation. But that's the case in all languages, including those that advertise themselves as nearly fool-proof. No language prevents or even hinders unmaintainable code.
@donjindra3 жыл бұрын
@jshowa o Yes, I watched the video. I also have 40 years of experience. You repeat what I've heard all too often. It's true compilers are very good at many things. It does not follow that they are best at everything. Experience has conclusively demonstrated that compiled code is not always the quickest code or the smallest code and there are still cases where maximum efficiency is required.
@kevinstefanov2841Ай бұрын
I would love to get your email, discord or linkedin or whatever and chat with you about the various assembly optimization tricks you could teach others! I'm an operating systems developer myself with 3 years of experience and I'm just now starting to really get into assembly language (I use an internal closed-source C-like language at work which is so old, it doesn't have an easy to use print function, so often the best way for me to debug the runtime is to freeze the operating system at certified compiler-emitted assembly instructions and look at registers and memory locations hahaha) and hardware architecture and optimizations enabled by a programmer's understanding thereof. My current side project is 6000 lines of mostly algorithmic C code (custom BigInt and cryptography library to be used for a secure chat app) and I'm sure there would be a lot for us to talk about and a ton for me to learn from you!
@adamp95534 жыл бұрын
For unsigned ranges like 'a' to 'z', a smaller range check I use is (value - lower + 0u > range), where 0u forces the comparison to be unsigned and range would be 'z' - 'a'; any underflow also results in an unsigned value out of range. Instead of two compares, where both bounds have to be accounted for, it's one subtract and one compare.
@RMLLcrazy4 жыл бұрын
This seems like a common enough thing that the compiler should be able to provide the optimisation. Isn't that the case?
@tolkienfan19724 жыл бұрын
I like explicit casts in such cases, with comments so people reading the code have no cognitive load
@rinrin47114 жыл бұрын
It will probably be slower since you changed one comparison and logical operator to one subtraction. Comparison is a subtraction without memory assignment, so cmp is faster.
@mohrizkyk2 жыл бұрын
"If you want to be most people, if you want to be an ordinary dev with common skillset, this video is not for you." - Peerple -
@ncot_tech4 жыл бұрын
I’ve been doing Z80 assembly recently, and seeing all these fancy conditional instructions and more than 8 registers is crazy 😀
@WhatsACreel4 жыл бұрын
Long live the ZX Spectrum :)
@protonjinx4 жыл бұрын
cheater! 6502 only had 3! spoiled brat.
@fie1329 Жыл бұрын
I did not expect a character like yours in a programming video! You do a great job of keeping things interesting and a bit funny while making good use of the 20 minutes with practical examples. Really good video!
@Bunnokazooie7 ай бұрын
Love when experts have a down-to-earth attitude like this. You rock man.
@omri93254 жыл бұрын
Should've mentioned loop unrolling, which is also a kind of branch-removal method.
@programaths4 жыл бұрын
He did, but it was really swift! "Well, there is always the end of the loop except if your code [move and shows code repeated]".
@nmeans734 жыл бұрын
Loop unrolling is very powerful, but the technique is limited to applications where the loops have a known/fixed iteration count at compile time. Loops with a dynamic iteration count, or a count based on the size of a dynamically sized container (like a vector), cannot be fully unrolled, since the compiler has no way of knowing how many times to copy the code in the body of the loop. Luckily, the branch prediction hardware in modern CPUs is really good (90+% hit rate) at predicting the correct branches for looping statements, since loops are such a commonly encountered structure.
@BitwiseMobile4 жыл бұрын
Crap, should have read comments *again*. I had the same comment. It's very much a branch-removing method, which was usually the reason to do it: you removed the branch and compare at the end of the loop, which would shave several clock cycles. The drawback is more code bloat, though, and if you have limited memory that could become a real problem after a while. I remember spending weeks trying to tweak a mobile port of some Tom Clancy game (this was back in '01/'02 and I can't remember the title). I kept running out of memory, and I ended up having to roll some loops back up :). I had to make some other "creative" changes in the gameplay to compensate for some frame stuttering caused by that, but in the end it was worth it, because the game just didn't want to fit on that device despite me having at least 10k overhead beyond the device limit. This was a problem with a lot of early devices. They might have, say, 200k on the spec sheet, but there is some process somewhere using 10k of that, so you really only have 190k. Yeah, I had to work with that much RAM at times on some form factors, and the concept among phone manufacturers was that it's a phone first before anything else, so calls always have priority. This could wreak havoc on your carefully planned memory map when the handset gets a call at the most memory-intensive point. We would have some weird bugs for sure, and they were almost always RAM related. When Apple invented the iPhone, I knew mobile gaming was going to change forever, and in a good way.
@BitwiseMobile4 жыл бұрын
EDIT: I want to say it was Splinter Cell maybe? That could have been a later port, but I want to think it was Splinter Cell. If you played that on a Qualcomm handset, or more specifically a BREW handset - Kyocera, LG, and a few other manufacturers were still making BREW phones back then - I might have been the one to port it, so you can blame me for any shitty gameplay as a result :P.
@Quarky_4 жыл бұрын
One of the reasons people should try to use range-based for. However, I find it hard to move away from indices if you are doing some mathematical operation, particularly when the index is part of the calculation. Probably choosing a better data structure (maybe a struct) instead of POD would help here, but that has its own tradeoffs.
@donald-parker4 жыл бұрын
I thought it was funny when he said "The compiler couldn't figure out what we were trying to do". Same as most people trying to read that code.

There are lots of compiler tricks for optimizing - removing branches is just one of them. And optimizing for speed is not the only type of optimization a compiler can do. My own intuition is to strive to make your code readable first. If you are not meeting speed requirements, reconsider your algorithm first. As a very last resort, consider using assembler. I have trouble imagining a scenario where the preferred approach would be some contorted construct in a high-level language that is hard to understand, maintain, or predict the performance of (without looking at disassembled code, which seems kind of like "why bother").

BTW - the same high-level code can be compiled into different native instructions for various reasons (e.g. different compilers, different target processors, different compiler options, ...), so it seems like you would need to buy into disassembling your code EVERY time one of these variables changes to see if it is still optimized. And it is not uncommon for the writer of the code to not really know when one of these things changes in their build environment.

In general, I would say if you start trying to second-guess your compiler, you are at best wasting your time, and at worst creating a maintenance nightmare.
@Antoinetheman4 жыл бұрын
To be fair, I think all of these points are taken into account by the video. He's not saying you should do this stuff everywhere, only in cases where you need extreme speed, and especially in SIMD scenarios. Nobody's doing SIMD unless they need speed.
@georganatoly66464 жыл бұрын
I believe you may have meant to say assembly instead of "... consider using assembler" to refer to an assembly language.
@shiuido3594 жыл бұрын
Remember that not everyone is writing code for webservers and PCs. If you only work on high-level applications then it's hard to imagine, but there are people writing for low-level applications where cycles and bytes are scarce.
@christopherjoseph6513 жыл бұрын
@@shiuido359 EXACTLY! 99.999% of people that write code are oblivious to the fact that low-level programming is still relevant and necessary in certain applications. They live under the belief that assembly is old, outdated, incapable, and doesn't need to be taught, because they learned Java or Python.
@tessasmith3426 Жыл бұрын
Use early return to avoid *else* and branches: int Smaller(int a, int b) { if (a < b) return a; return b; }
@vinzer72frie4 жыл бұрын
"The CIA wants you to believe its only if then else if then else"
@gumbo644 жыл бұрын
Run over all the glow-in-the-darks
@anant67784 жыл бұрын
The FBI wants to have a word about switch statements
@VulcanOnWheels4 жыл бұрын
What is this quote from? If you quote something from the video you're commenting on, then please include the time where you found it.
@SiisKolkytEuroo4 жыл бұрын
@@VulcanOnWheels kzbin.info/www/bejne/baavq5SBob-Gh7M it's in that video at around 5m 40s
@SiisKolkytEuroo4 жыл бұрын
But it's worth it watching the entire video because Terry Davis is a legend
@giladreich8104 жыл бұрын
KZbin should pay you 10x more for each view you get, for the underrated quality content you publish. Love your channel! ❤ I was recently getting more into this subject myself, especially reading about how different compilers handle N/RVO (Return Value Optimization) and copy elision to reduce copying, now that move semantics were introduced in C++11. It would also be a great idea for the next video to cover more about optimization and what we should not worry about when we write our code.
@scottmichaud4 жыл бұрын
C++17 mandated that RVO (not NRVO) is performed. If you're allowed to use C++17 (or later) then the compiler is not allowed to perform intermediate constructors and destructors anymore, so there shouldn't be any differences between compilers or optimization levels. You can see the edge cases here: en.cppreference.com/w/cpp/language/copy_elision
@jimwinchester3398 ай бұрын
Some of the best branchless code I've seen, back before the SIMD instruction families came along, involved doing the key comparison and then using SBB to extend the carry flag into a register-wide mask. If the result was in AX, CWD would extend that into DX. Then if DX was complemented, you'd have two mask registers, one guaranteed to be all 0's, the other all 1's. Then each alternative was calculated: one masked w/ AX; the other with DX. Finally, DX was OR'd into AX for the final result in AX.
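The same mask idea can be rendered in portable C++ (a hypothetical sketch of the technique, not the original 16-bit assembly; the function name is made up):

```cpp
#include <cstdint>

// Portable cousin of the SBB/CWD trick described above: turn the 0/1 result
// of a comparison into an all-0s or all-1s mask, use the mask and its
// complement to select between the two alternatives, then OR the masked
// values together.
std::int32_t select_min(std::int32_t a, std::int32_t b) {
    // 0u - 1 wraps to 0xFFFFFFFF; 0u - 0 stays 0.
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    std::uint32_t r = (static_cast<std::uint32_t>(a) & mask) |
                      (static_cast<std::uint32_t>(b) & ~mask);
    return static_cast<std::int32_t>(r);
}
```

On x86 a compiler will often turn either this or a plain `std::min` into a single CMOV anyway, so the manual form mainly matters where you must guarantee no branch (e.g. constant-time code).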
@Toksyuryel3 жыл бұрын
While these techniques aren't very useful for most programmers, they are extremely useful for compiler programmers. The more patterns they can recognize and replace, the better everything is for everyone.
@vladimirarnost8020 Жыл бұрын
Or library developers, e.g. providing the fastest primitives like memcpy, strcpy, strlen, etc. is essential for C/C++ performance. However all the ugliness and complexity should stay locked inside the library implementation and covered with extensive unit tests and benchmarks.
@peterSobieraj Жыл бұрын
I disagree a little. Thinking that most programmers shouldn't care about performance has led us to the situation today: you need a 2x2GHz CPU just to start Windows - just to display a desktop background image and some icons, and do nothing while waiting for the user to click something. And 10 seconds of website loading is considered normal. And a fast-typing programmer can type faster than the average IDE can display text.
@Toksyuryel Жыл бұрын
@@peterSobieraj The goal is that top level programmers shouldn't *have* to think about performance because the low level stuff takes care of it for them. I agree that we aren't there yet though.
@originaltonywilk4 жыл бұрын
Good to know some programmers are interested in this stuff... thought it was a dying art :) Did a lot of stuff back in late 90's with good old MMX instructions to process video in real time (which was tricky back then). Even now some clever optimization can make a huge difference - like being able to adjust photo editing filters in realtime with a slider (e.g. Affinity Photo) or waiting 1/2 a second or more for each tweak.
@markwilliamhumphries3 жыл бұрын
You can toggle case on an ASCII alpha (from lower to upper, or vice versa) by xoring it with the space character (i.e. 0x20).
@Reginoid Жыл бұрын
yes, he did it with -32 (0x20) in that example
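A minimal sketch of the XOR trick from this thread (it assumes the input is already known to be an ASCII letter; applied to other bytes it just mangles them):

```cpp
// Toggle the case of an ASCII letter by flipping bit 5 (0x20, which is
// also the code for the space character): 'A' (0x41) ^ 0x20 == 'a' (0x61),
// and 'a' ^ 0x20 == 'A'. No comparison, no branch.
char toggle_case(char c) {
    return static_cast<char>(c ^ 0x20);
}
```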
@JayVal904 жыл бұрын
I used the vector technique for a mapping algorithm at my last job, using the Accelerate framework on iOS. Got a HUGE performance increase, although I never thought about the branchless part of it. I was just happy that I had an algorithm that could convert tens of thousands of lat/lon pairs to rasterized mercator tiles of 32-bit indexes into an array of values for each mercator tile pixel in the same time that it took the naive, pointer-chasing for each vector method to process 10 lat/lon pairs. I was working with 300x speedups.
@flossless2004 жыл бұрын
Cool story bro
@arnoio83554 жыл бұрын
You must slay at parties
@patrickthepure4 жыл бұрын
I observed a similar situation in BVH tree traversal. For a raytracing program, my first approach was, storing BVH data with pointers and dynamic memory, and traversing it recursively. It was a very elegant and OOP styled code. Then I had to make it run on GPU, so re-made it in a branchless way, I stored all its data in contiguous memory, and used stackless BVH traversal algorithm. Yep... 100x speedups when run on CPU, 1000x speedups when run on GPU.
@fyfaenihelvete4 жыл бұрын
That's pretty cool
@JayVal904 жыл бұрын
Garfieluigi Yes. Especially parties full of people with a high appreciation for the technical.
@Walrusbonzo4 жыл бұрын
Loved the reference to Al Di Meola, one of my all time favourite guitarists. Great video too, right up my street.
@Tolg Жыл бұрын
This sort of optimization is very hard to get right and often leads to premature optimization. For tight loops in higher-level languages, you should check your bytecode, and for lower-level languages, the disassembly. However, don't forget about the third thing... the CPU execution pipeline. There are also many optimizations each CPU can make. So while one disassembly looks slower than the other, it may not always be true, as the CPU might be able to optimize one better than the other. So it's very tricky and really ends up requiring a lot of testing on specific CPUs. I wish we had a platform where you can benchmark your code against hundreds of different CPU types (with emulation) and see the results.
@lake50444 жыл бұрын
Nice video. But remember, unless something is actually a bottleneck in your program's performance, you never want to waste time and effort squeezing 1ms out of it. Always profile first, optimize last. (Obviously, this doesn't apply to study/benchmark stuff.)
@KuraIthys4 жыл бұрын
Ah, branch prediction. Yeah, I see. I guess I spend too much time with ancient hardware that doesn't do anything fancy like that. You do a branch on a 6502, the CPU does that pretty quickly. Then again, since there's little to no pipelining (there is some limited precursor to a pipeline which is why the cycle counts are so low; basically it does instruction decoding and execution in parallel for multiple instructions), no cache, and nothing like branch prediction... A branch is about as fast as anything else... Ah well. The 6502 is an evolutionary dead end. Not because it couldn't have been extended to higher bit depths (65816) and faster clock speeds (current record for a 65816 FPGA core is 280 mhz, roughly). No, the problem... Which can be seen indirectly with the whole Zero page/Direct page idea, is that it's a Register + Memory design. If an instruction takes two operands, in almost all cases one of them is from memory, and one is a register. It also does single cycle memory access. (compare with the slightly later and more prolific 68000 which takes 4 cycles to access memory) For a given CPU clockspeed, the 6502 requires memory that can be accessed in one cycle. The 65816 does even worse, demanding half cycle access. On the face of it there's nothing explicitly wrong with this design. The problem is that the trend in technology (even to this day) is that CPU performance has outpaced memory performance. (at least, for affordable memory) This is why the CPU cache is a thing. Not because anyone especially wants the added complexity, but because a small amount of very fast cache memory + much slower main memory is much more practical and affordable than having your entire memory system operating at speeds equal to or faster than your CPU... Branch prediction I guess is a side effect of deep CPU pipelines, which in turn are a way of reducing instruction cycle count without actually making instructions faster. Want a single cycle multiply instruction? 
Well, either you make some complex circuit that can actually do that in a single cycle, or you build a pipeline where 10 instructions are in different stages of execution at the same time. The net effect is that each instruction takes 1 cycle... ... If the pipeline doesn't stall... At this point I'm curious what the maximum speed of a CPU would currently be if it had no pipelines and no cache memory. I suppose you can predict that from memory speeds. If we can deal with the logic used for DDR, then I guess we get something akin to 2-4 ghz to play with, so it could be worse... But there's almost certainly some major caveats involved here...
@m.p.jallan21724 жыл бұрын
KuraIthys good post. I am a 6502 hobbyist and watched this video for a laugh at the semantics of "branchless programming"; I get the feeling tools are so easy now that programmers like to seek some form of division from the less technically inclined people who write software.
@Badspot4 жыл бұрын
It's not just memory bandwidth, it's latency. A CPU with no cache today would be catastrophically slow because an L1 cache read takes ~1ns and a ddr read is ~100ns. I've actually wondered about doing the opposite, making a CPU with huge amounts of cache that only uses main memory to load entirely new programs. Never have to leave the cpu die.
@yScribblezHD4 жыл бұрын
If you're talking speed of a single instruction without pipelining, of course it wouldn't change from its pipelined version if you run the instruction with the pipeline flushed. Pipelining increases throughput, but there's no reason to expect the latency to be any different.
@KuraIthys4 жыл бұрын
@@Badspot Good point. I'm certainly no CPU designer, so I'm sure there was more to it. Though as to your idea... With a large enough cache you basically end up back where you started. If your cache is so large that you never need to access main memory, then what you've basically done is turn the cache into the system memory, and system memory into something akin to a RAMdisk... 100 ns latency... hmmh... I admit I didn't think it was still that bad... I mean, a Super Nintendo was already demanding ROM access latency of 110 nanoseconds or more... That would imply for mainstream RAM we've made no real progress with access latency in about 25+ years...
@tatianabasileus4 жыл бұрын
@@KuraIthys Some high-speed DDR3 and DDR4 can reach 10ns latency for the first word, though delivering the whole cache line can still take 90-100ns.
@jifi9866 Жыл бұрын
Hello everyone I am newbie in programming. Thanks KZbin for gathering all experts here.
@bobbobbity4634 жыл бұрын
Very interesting video. Another technique you didn't show (and is a lot easier for beginners) is caching. In the case of toUpperCase, you could prepopulate an array of 255 elements filled with zeroes and a couple of -32s. You sacrifice a bit of memory for great gains in speed. Also, depending on the language you use, you can store function references in an array and call funcArray[a>b](a,b) to reduce the amount of branching. Or even funcArray[a%4](args) for more options.
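A hedged sketch of the table idea above (using 256 entries so every possible byte value has a slot; here the table stores the translated character itself rather than a 0 / -32 delta, and the names are made up):

```cpp
#include <array>
#include <string>

// Branchless ToUpper via a 256-entry lookup table: one memory read per
// character, no comparisons in the hot loop.
std::array<unsigned char, 256> make_upper_table() {
    std::array<unsigned char, 256> t{};
    for (int c = 0; c < 256; ++c)                 // identity mapping by default
        t[c] = static_cast<unsigned char>(c);
    for (int c = 'a'; c <= 'z'; ++c)              // remap only the lowercase range
        t[c] = static_cast<unsigned char>(c - 32);
    return t;
}

void to_upper(std::string& s) {
    static const auto table = make_upper_table(); // built once
    for (char& c : s)
        c = static_cast<char>(table[static_cast<unsigned char>(c)]);
}
```

Whether this beats the arithmetic version depends on the cache: the table is 256 bytes, so it stays hot in L1, but unlike the SIMD approach it still processes one byte at a time.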
@Louis_Marcotte Жыл бұрын
I didn't even know branchless coding was a thing. I just assumed "if" was necessary to test a condition and didn't really question it, seemed reasonable enough. You've opened my eyes. I'm not gonna use branchless coding yet, most of it flew WAY over my head, but knowing why it's bad is still good
@LocrianDorian Жыл бұрын
It really isn't that difficult if you think about it, for comparison replacement, a>b itself is a mathematical expression which returns 1 if it's true, so if you do x=a*(a>b), that will mean x=a if a is larger than b. You can use that same principle for anything. That said, I think the main lesson learned from this video is that you shouldn't try to be too clever because most of the time the compiler will do a better job than you ever will.
@Louis_Marcotte Жыл бұрын
@@LocrianDorian ye nah I get the algebra, I just don't understand how converting "if(a>b) return a" to "return a * (a>b)" removes a comparison, you're still comparing a to b
@LocrianDorian Жыл бұрын
@@Louis_Marcotte It doesn't remove the comparison, it removes the branch. The thing that slows down the CPU is not the comparison itself, it's the "if" statement.
@ChrisStavros5 күн бұрын
@@LocrianDorian He's right, you're wrong. Eliminating jumps to minimize cache miss is not what branchlessness is about. Whether you're using an if or multiplying with the result of a comparison, you are creating a branch for the CPU's branch predictor: either the comparison will be true or false, and your multiplication result depends on that. It will either predict the branch result correctly or not, resulting in either a speedup or wasted work. (a * (a > b)) creates a branch for the branch predictor no different from an if.
@edminchau8113 жыл бұрын
If you look at the assembly code the branches are still there, just using cmovl or setle or setl to handle the condition within the instruction. What the transistors are actually doing within those instructions is just doing the branch in a very efficient way, but the condition/branch is still there.
@SaidMetiche-qy9hb4 жыл бұрын
Really interesting, optimization is often neglected
@svnhddbst89684 жыл бұрын
it's fun to look into python for optimization. sometimes the most readable code is only mediocre for timing, where the really verbose code that could be 3 times as long and look like a foreign language could be three times as fast.
@yScribblezHD4 жыл бұрын
svnhddbst I think optimizing python code is more or less a pipe dream, considering anything highly performant should just be written in some other language. Writing python code that's short and simple is what makes it fun heh.
@siriusblack99994 жыл бұрын
I would like to add that, from my testing of actually benchmarking branchless code - especially with "trivial" examples like the smaller-than function you used as a first example, and with compiler optimizations enabled - there are actually a lot of cases where branchless is "at best equal" to branched code, and it'll even flip entirely based on the compiler used. Staying with the smaller-than example: without optimization the branchless version is 10% slower in clang, but 10% faster in gcc, while with optimization they're exactly equal. The second example works mainly because the branchless form can be vectorized, such that the 1024 elements can be processed in just ~300 cycles. I've tried to apply this branchless approach to my own project (doing some work with big integers; I changed only the division operator to work branchless) and it increased the overall compile time of my tests 10-fold (from about 20 seconds to over 3.5 minutes) and nearly quadrupled the runtime for my long-running test case, calculating ten 8-byte prime numbers, from 2259 ms to over 8000 ms. Trying a branchless approach singlehandedly undid all of the optimization I've done since first getting it working; fortunately I have backups. The reason for this is that, while it's true that mispredicted branches are slow, it only takes about 10-20 cycles to flush the pipeline and switch branches, while on the other hand branchless code needs to perform the work required to evaluate both branches, plus some extra overhead. So even for trivial situations it's quite difficult to actually get a benefit unless your branchless code allows the compiler to vectorize the work, which is what happened for the ToUpper function. In more complex scenarios it only gets worse, since if there is more extensive work to be done in each branch, the work for both branches needs to be fully executed before the return value can be determined.
There are some genuine use cases for branchless programming; as noted in another comment it may be important in cryptography despite being slower, and there are situations in which branchless is actually faster. But unless you can confidently say that the branch predictor mispredicts more than normal, or that the branchless form will benefit from vectorization, it will usually perform equal to branching at best. CPUs are sufficiently complex beasts that unless you actually benchmark what you're trying, it's really not possible to predict what will perform better. For instance, the speed gain in the ToUpper function from earlier flips with -O1 on GCC, becoming 50% slower than the naive branched version, whereas phrasing the condition as a bitwise AND rather than a logical AND is 20% faster.
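The last point above - writing the range check with a bitwise AND rather than a short-circuiting logical AND - can be sketched like this (a hypothetical illustration; as the comment says, the actual speedup depends heavily on compiler and flags):

```cpp
// Short-circuiting && can compile to a branch: the second comparison only
// runs if the first passed. Bitwise & evaluates both comparisons
// unconditionally and combines the 0/1 results, which the compiler can
// often vectorize.
bool is_lower_branchy(unsigned char c) {
    return c >= 'a' && c <= 'z';
}

bool is_lower_bitwise(unsigned char c) {
    return (c >= 'a') & (c <= 'z');
}
```

Both forms return the same value for every input; only the generated code differs, and only a benchmark on your target settles which one wins.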
@luis_musik3 жыл бұрын
the places where branchless is faster is usually because it allows the compiler to see optimizations (usually SIMD) that it otherwise wouldn't have found. iirc -O1 doesn't do SIMD optimizations so that's why the results flip. personally I've seen cases where replacing branches with boolean checks allowed the compiler to do some pretty crazy things like replacing entire loops with bitwise operations resulting in over 100x speedups
@kaiduwu2 жыл бұрын
Holy fuck I started reading and hit read more, what a fucking paragraph
@DeadCatX23 жыл бұрын
Two things I would add. 1) Conditional move et al still create dependency chains which can slow down throughput. 2) Speculative execution vulnerabilities like Spectre, as well as side-channel attacks on the execution time of a piece of code, can make branchless algorithms more useful.
@krisitof34 жыл бұрын
“99% of the time the compiler is smarter.“ Yes, the compiler tries to be, but the main problem with compiler optimizations is variance - and variance depending on stuff out of your control. And there IS a lot of variance, probably more than you would think. Watch the talk “Performance Matters” by Emery Berger; it goes in-depth on this and is a great talk! As a general rule of thumb, compiler optimization is nice to have, but you should not rely on it to make your code fast. Branchless programming is really valuable if you explicitly know what you are doing. Otherwise it probably won't matter for you (for simpler stuff). Ps. Thanks for the more advanced topics, great video as always 😀
@oflameo89274 жыл бұрын
We need some compiler tutorials then so we know how to use them optimally.
@محمدآلعطية-ظ9ظ3 жыл бұрын
"You should not rely on the compiler to make your code fast." Do you compile all your code with -O0?
@dana_t0ebdot4 жыл бұрын
Got recommended this, watched it all, now time to find even more good content you have
@duncanbeauch9598 Жыл бұрын
I remember learning all these topics knowing it would speed up code but I didn’t know to what degree. It was very informative to walk through the same example for direct comparison
@GuruEvi4 жыл бұрын
For the conversion to uppercase, wouldn't it be faster to just do an AND on the byte with the inverse of 32 in binary (6th bit set to 0)? So loop over value & 0b11011111; depending on your architecture this conversion could be done in a single operation.
@dimdimich2 жыл бұрын
String may contain non-alphabetic characters and they should not be 'converted', i suppose.
@TheEkkas2 жыл бұрын
Very nice. Not often we see real code optimisation. First time I’ve seen this. Love it. Smaller is faster. Not only your data, but your code blocks also need to move in/out of cpu caches so avoiding a possible branch outside of L1 size will add up and makes sense (now). This is a great technique for ultra-performance code. The algorithm at the end is a nice added bonus. Subscribed.
@atackhelikopter4303 Жыл бұрын
13:30 improvement idea: use an array of length 255 and place basically the whole ASCII table in it, but make it so that the a-z portion holds the values A-Z. This should make it faster because you aren't comparing anything, you're just replacing each character via a lookup.
@shinyhappyrem8728 Жыл бұрын
*256
@atackhelikopter4303 Жыл бұрын
@@shinyhappyrem8728 it stops at null so you never have to take into account the null character lol
@teropiispala25763 жыл бұрын
Nice to find out that someone is still interested in performance. I started serious programming in the 80's and did lots of embedded assembler programming in the 90's. Computers are fast today, but it still feels like a crime to write slow code. I have given up assembler for compatibility and readability reasons, but with time-critical code I always check the compiled results and make performance tests. Parallel pipelined execution is so complex nowadays that sometimes it takes a long time to understand why some implementation works as fast as it does. Sadly, web-based techniques and scripted languages are spreading into areas they shouldn't. For many modern programmers, any operation is fast if it takes less than a second, even though, if done right, the time could be at the nanosecond level.
@ImperatorZed2 жыл бұрын
Optimizing isn't always worth the increased complexity, so it's often a tradeoff. Benchmarking seems to be the way to go; I need to learn how to do that.
@teropiispala25762 жыл бұрын
@@ImperatorZed I'm not talking about hacker optimizations but higher-level architectural choices. The web technologies I mentioned are extremely slow compared to natively executed code that is tailored to the purpose. They also create hidden complexity, which makes the code unreliable. Applications are being made using complex ready-made components whose functionality typically doesn't fit the purpose well. They bring a huge number of new failure paths, which are mostly unhandled, or make proper error handling extremely complex. Programming errors are typically caught as runtime exceptions. Components have a complex dependency network on other components, which are constantly upgraded, partly because of security vulnerabilities in functions that are not needed in most cases. The full system is never tested, and applications work only on the happy-day-scenario principle. Often the developer's answer to a problem is: try another browser, or "it works on my machine". So, my point is, mostly self-written functional code is several orders of magnitude simpler and faster when written in traditional C or C++, compared to current web technologies. The evil lies in how easy it is to use those complex components and how difficult it is to take the simple approach, so you end up using what you're given.
@karoshi23 жыл бұрын
Meanwhile in reality: Geek A: Look, I sped this code up by 200 times! (Already disappointed by life) geek B: How long did it take to code? A: Only 3 days! B: What's it gonna save? A: 200 times! B: No, in computing time. How many hours of computing time? A: Over the whole lifetime of this program (maths) ... maybe 10 seconds? And yes, I've been both A and B in my career.
@micahwelf7 күн бұрын
3:44 - That code is C syntax. The syntax uses a comparison operator that returns a number equivalent to a Boolean. However, there is no guarantee of how the syntax will translate to whatever CPU is chosen, especially when using a language whose compiler you didn't study or write. Thank you for showing the concept, but I wish you would explain up front that this is a C-syntax feature, not a universally reliable technique in programming. (It would set the context better and be less misleading to a naive viewer.)
@redjr2424 жыл бұрын
Since upper and lower case ASCII characters differ only in bit 5 (0x20), can't you just AND each character with 0xdf?
@WhatsACreel4 жыл бұрын
You can use AND, yes. But you have to make sure to only AND the lower case letters. I don't think it saves any time since VPAND and VPADD both run at the same speed. Excellent idea still, and I'm certain my code is not the fastest possible, so it would be worth exploring other instruction :)
@redjr2424 жыл бұрын
@@WhatsACreel Very good point. I forgot that we might be given non alphabetic characters.
@davidmartensson2734 жыл бұрын
But this correlation between upper and lower case only works for ASCII, which means that for any international solution it does not work at all, since the language-unique chars are not spaced 32 numbers apart.
@user-zu1ix3yq2w4 жыл бұрын
@@davidmartensson273 You don't mean like UTF-8, do you?
@davidmartensson2734 жыл бұрын
@@user-zu1ix3yq2w Yes, for example. But when used inside a program it's often expanded to 16-bit chars or 32-bit structures for simplicity; having different sizes for different chars makes optimizations very difficult ;), at least as far as I have seen. C# uses 16-bit chars, but does have surrogate pairs for some chars that do not fit into 16 bits.
@Hydrozoa4 жыл бұрын
Interesting video! I'll trust the compiler to optimize my readable code over trying to outsmart it, though :b
@matsv2014 жыл бұрын
Well... I think the ultimate lesson is to speed test your code
@nguyenvu82624 жыл бұрын
I think there are many things you can do to optimize for speed. This should not be high on the list, but it's good to know that it's there. If code is branch-heavy, this would probably help more. Speeding up 1ms of code 1000 times doesn't have that much impact on performance as a whole.
@judewestburner3 жыл бұрын
CPU's are cheaper than developers
@supericeg591311 ай бұрын
Thanks for showing this! I now use this every day in my game in C and it's a great extra bit of fun problem solving!
@johnnisshansen3 жыл бұрын
In these 19 minutes of video my computer could have raised to uppercase all the books written in the last 10 years.
@zwz.zdenek3 жыл бұрын
Sounds like you have a pretty neat OCR and also pirated a lot of books. But really, unless your corpus is already in plaintext and loaded in RAM, other bottlenecks will make the conversion speed irrelevant.
@johnnisshansen3 жыл бұрын
@@zwz.zdenek well, my point was that computers are cheap and humans rarely have to care about optimizing the code. But OK, reducing branches can sometimes be worthwhile.
@A.Martin3 жыл бұрын
@@johnnisshansen It's most important to have very optimized code where massive amounts of calculations are being done - like, I dunno, finding large primes. For regular programs, less effort is put into optimizing now than was done back in the 80s or 90s, because computing power is so much higher that it is usually not needed. They do their stuff near enough instantaneously anyway.
@TheSkepticSkwerl3 жыл бұрын
Never assume optimization is achieved. It comes down to your use case and needs. If you need to look through 50GB databases or you need to talk to 100k computers, any and all tricks can help. Granted, when talking to 100k computers, the network is a big bottleneck. But I've been able to take 20-day scripts and convert them to a few hours. Optimization can and will matter in certain scenarios.
@Argosh3 жыл бұрын
That kind of thinking led us to page loading times that are worse than what we could achieve in the 90s...
@stuffedk4 жыл бұрын
I would be very interested to see a video on the low level differences between x86 and ARM. Especially when it comes to SIMD and such
@WhatsACreel4 жыл бұрын
They are very different! I'd love to make some ARM ASM videos, the Neon instructions are great :)
@SerBallister4 жыл бұрын
ARM32 (not sure about newer ARM64) has a really interesting instruction set that encourages branchless programming. You can predicate every opcode on a flag condition (e.g. LE/GT/EQ/etc).
@totalermist4 жыл бұрын
@@SerBallister That's still a branch, though - just merged into an instruction. That's also why ARM64 got rid of them and only supports them explicitly with branch, select, and compare instructions (for performance reasons, no less!).
@SerBallister4 жыл бұрын
@@totalermist I thought they differed by not using branch prediction, so they did not come with misprediction penalties - the non-executed opcodes behave more like NOPs. I didn't study the internals that much. It's an interesting instruction set with some interesting opportunities for optimisation.
@Gonzakoable4 жыл бұрын
@@WhatsACreel Bonus points if you make an ASMR version of the vidz
@joekelly7505 Жыл бұрын
This approach makes 2 assumptions: 1. that the code is running on a deeply pipelined architecture in which the pipeline gets squashed at every decision, and 2. that compiler optimization isn't already adjusting for 1. The first assumption isn't bad, necessarily, but I've found when using a higher-level language that making assumptions about the underlying architecture (esp. without taking 2. into account) can lead to unreadable and hard-to-debug code; the only exception I would make is using the `volatile` keyword to prevent key pieces of shared state from being cached. If I were writing assembly, I'd be deeply concerned about making the code as efficient as possible. But since I write in C, C++, and other C-like languages, I've come to trust that the optimizer will emit code that takes the underlying architecture into account. And programming in high-level languages, we have enough trouble wrestling with algorithmic/asymptotic complexity (e.g. avoiding O(n^2) loops etc.)
@TheFoxfiend 4 years ago
Huh, neat. I think unless I really need to optimize run time, I'm going to stick to more human-readable code.
@thomasmaughan4798 3 years ago
Exactly. Debugging a program can be extremely difficult if the programmer thought he was being clever.
@joshuap7406 3 years ago
I never liked the if statement, but this takes it to a whole new level...
@CottidaeSEA 3 years ago
Conditionals are always necessary. The difference is that branchless programming doesn't create two different scenarios. return (a * (a > b) + b * (a <= b)); // branchless max
@scene2much 3 years ago
You had me at "Assembly Code". Subscribed. This is a great topic for driver, library, and bottleneck-algorithm implementors.

Idea zero: surface atomic conditionals, e.g. "mov if", from machine code up to the high-level language (yes, "Branchless C/C++", "Branchless Python", etc.). If the compiler (and libraries) aren't implementing branchless methods, you may simply be optimizing one engine on a 10-engine supertanker. Branchless compilers would deprecate (warn/error) branching where a branchless construct will do. If your compiler pushes you to code branchless, and the Standard Template Library and other supporting libraries are branchless, then all the code will be pulling to the same drumbeat.

With or without a "branchless compiler": I'd never code this way for the prototype. I don't like the idea of "try it and THEN see how much it helps". I prefer to respond to ALARMS, or to a post-prototype review of the code with a "branchless" eye. Your code profiler should be able to let you know if the cache is being blown and reloaded too often. If it doesn't, that's a market opportunity.
@NetAndyCz 4 years ago
Interesting stuff. I never dug too deep into programming, so I never even saw the instructions that go to the CPU/GPU. Would love to learn more about the actual assembly code.
@nerdydrew6818 3 years ago
This reminds me so much of how The Cherno explains C++. Good to know that the compiler at times knows best :D
@WhatsACreel 3 years ago
Cherno is a legend!! We certainly share the accent :)
@rawbits 1 year ago
I love the SIMD part. I never got deep into it, sadly. I believe there is one more thing to mention in connection with this topic: branch prediction. Often a simple if statement is used to iterate through values and perform different actions based on them. You might find a significant improvement simply by sorting the data first, so branch prediction can kick in and mispredictions become rare. It would be interesting to see how everything you mentioned performs in this scenario.
@BharatAcharyaEducation 2 years ago
Dear teacher, respect! Awesome delivery. I teach microprocessors and may want to argue about the overhead compared to the small loss in productivity around this topic. The Pentium introduced branch prediction and later processors made it even better. Modern branch prediction algorithms reach ~98% accuracy, so all we lose is the flushing caused by the 2% of rogue branches. Is that worth a whole new way of writing code? To be honest, the first example (the smaller-number one) would actually take more time to execute due to the multiplications involved. I teach ARM7 too, and it incorporates conditional execution with every instruction, not just branch instructions: we have conditional mov, conditional add, etc. It does make things more efficient, no doubt, but unless there are many rogue singular branches that keep failing branch prediction, the gain doesn't outweigh the loss in readability. Hope I've made my point clear. Nonetheless, taking nothing away from your awesome work. Keep them coming. Best wishes, always.
@AndrewOxenburgh 2 years ago
I'm a long-time programming practitioner, and I loved it. I'm at the other end of the chain, if you will: Python, JS, various DBs and assorted web technologies, but this is really interesting. Haven't touched x86 assembler since the 90s, so that was interesting. Loved the SIMD stuff; that's all new. Liked his points about premature optimization. The keyword is *premature*. You don't ignore optimization, but be wary of it.
@nolanfaught6974 4 years ago
SIMD does give enormous speed gains, especially when doing operations on sparse arrays
@Kyojin743 7 months ago
I'm an undergraduate in CpE and I was working on an autonomous robot, and I had a really over-engineered way to do course correction. I watched this video, and just today I was looking over my system and noticed that the techniques in your video were perfect for this application. I reduced it from a 50-line solution of ugly nested if-elses, about 4 deep at max, to 2 lines. Thank you!
@Chumbawumba-l6q 2 days ago
What did it bring?
@Kyojin743 1 day ago
@ My robot had incredibly straight lines when driving forward, but there must’ve been something with going backwards as it kinda curved a little. Passed with an A tho!
@ovaskerri 3 years ago
Haven't read this suggestion in other comments: with the upper- and lower-case example there is actually a faster way in single execution: just loop through the string/chars and clear the 6th bit of each char, regardless of upper or lower case. In your 3rd example you write "if (condition && condition): d[i] -= 32"; why not just write d[i] &= ~(1 << 5)?
@antonf.9278 1 year ago
People hate it when their numbers get mapped to control characters. The most important part of any optimisation is that correctness is preserved.