Optimising Code - Computerphile

137,922 views

Computerphile

A day ago

You can optimise for speed, power consumption or memory use & tiny changes can have a negligible or huge impact, but what should you optimise and most importantly, when? Dr Steve Bagley has an example!
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com
Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com

Comments: 393
@bentpen2805 5 months ago
It’s always cool to compile two versions of the same algorithm to assembly, and see they’re identical
@KilgoreTroutAsf 5 months ago
Hardly the case. Unless we are talking about shuffling the code around, inlining constants, or factoring things out of a loop.
@jordixboy 5 months ago
Not "hardly the case" — the compiler will apply all of those optimizations by default. Unless you make significant changes, it makes sense
@christopherg2347 5 months ago
@@KilgoreTroutAsf Yes, a modern compiler can and will do *ALL* of the above. That is its job.
@TheArrowedKnee 5 months ago
@@christopherg2347 Modern compilers truly are mind-numbingly impressive
@christopherg2347 5 months ago
@@TheArrowedKnee They are written by people a few levels smarter than me, with more understanding of the hardware than I could ever learn. They are no different than a library as far as tools go.
@bartekkowalski 5 months ago
To save others 10 minutes of their life: at 3:58 the text encoded in binary is "OK so this isn't strictly speaking the same text as the stuff on the left. Oh well :) :)" At 5:17 the text encoded in binary is "Right, are you really converting all this binary back to asci? Well done you! - Sean"
@Petertronic 5 months ago
Well done you!
@jongyon7192p 5 months ago
Well done you!
@mellowyellow7523 5 months ago
I can't help but notice the binary in the background
@bartekkowalski 5 months ago
@@mellowyellow7523 I also decoded the brighter part of it, but my computer accidentally got unplugged before I could save the decoded content.
3 months ago
I hadn't even noticed the second text; I got to "RIGH" on the right and somehow thought I had made a mistake, because it seemed unpronounceable, so I typed it into an online binary translator.
@mausmalone 5 months ago
One fun thing in optimization is the work that Kaze Emanuar is doing on the N64. To make a long story short - he's optimizing for speed BUT one of the biggest bottlenecks for the N64 is RAM access. The CPU doesn't have direct access to the RAM, so when it requests a read from RAM it first goes to a local cache, and if that page of memory isn't cached, it has to request the (essentially) GPU to copy the entire page of RAM into the CPU's local cache. This goes for both data and instructions, which go to separate caches. What he's found is that one of the best ways to get good performance out of the N64 is to get your code to fit into (and be aligned into) a single page of memory. If you can do that, the CPU will hum away happily at 93MHz (which was a lot for the time!). But if you keep calling functions that are located in different pages, you'll frequently have to stall the CPU to wait for those pages to be moved into cache.
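The details above are specific to the N64's memory system, but the general idea — keep the hot path's code together and aligned so it fits in one cached block — can be sketched with GCC attributes in C. `hot`, `section` and `aligned` are real GCC function attributes; the section name `.text.hot_loop`, the 64-byte alignment and the linker-script placement it would need are assumptions for illustration only.

```c
/* Hypothetical sketch: ask GCC to keep the hot functions together and aligned,
 * so the inner loop stays inside one cache/page-sized block of code.
 * The section name ".text.hot_loop" is made up; a matching linker script
 * (not shown) would have to place and align that section. */
#define HOT_CODE __attribute__((hot, section(".text.hot_loop"), aligned(64)))

HOT_CODE static int step_entity(int x)
{
    return x * 3 + 1;                   /* stand-in for real per-entity work */
}

HOT_CODE int update_all(const int *values, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += step_entity(values[i]);  /* both functions live in the same section */
    return sum;
}
```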
@joshbracken5450 5 months ago
Yeah, I love his videos. Discovered him last week and I binged them all. I don't know why, but it's so satisfying for me to see how far you can go with efficiency and how much performance you can claw back from code work alone. Absolutely amazing.
@lucbloom 5 months ago
Kaze is awesome! Love when he deep dives into optimization strategies.
@The_Pariah 5 months ago
Pretty sure you and I watched the exact same YouTube video on how N64 graphics work.
@Malik_Attiq 5 months ago
yeah...now the same approach is used by data oriented design.
@Zadster 5 months ago
Something very similar was done in the days of the BBC Micro and the iconic game, Elite (published 1984). The 6502 has something called page zero, the first 256 bytes of RAM, which can be accessed very quickly with short instructions. There are also similar benefits to your code if it can fit inside 1 memory page itself. Elite used some incredibly cunning memory access algorithms involving these 2 optimisations which have only relatively recently been properly documented. It is, of course, far from the only 6502 software to use these speed-ups, but it is notable for getting a 3D space flight sim and universe sim into 22kB.
@Zullfix 5 months ago
1:36 The quote "premature optimization is the root of all evil" is severely misused nowadays as an excuse to write very slow or just plain bad code on the first (and often only) pass. The quote was initially said to discourage developers from inlining assembly inside their C applications before they even had an MVP. But nowadays it's an excuse for developers to write poor, inefficient code that becomes the architecture of the application, making meaningful changes require a large refactoring. If you can write good, fast code from the start without going overboard into inlined assembly, what excuse do you have not to?
@nan0s500 4 months ago
Premature pessimization is the root of all evil
@fabricelealch A month ago
Quotes are the root of all evil.
@Rowlesisgay A month ago
Optimizing the proof of concept you've finished and documented well isn't premature in the slightest, and lacking any optimization makes people associate you with Adobe. I think Adobe either missed this part or wanted to be associated with itself.
@akashgarg9776 4 days ago
Eh, it's still true though. I remember being lectured by my advisor at my job when I'd spent hours on an O(n^2) algorithm and wanted to make it O(n), and my advisor said that quote and was like: look, it needs to work first, then we worry about that difference.
@brunoramey50 5 months ago
- Optimise for CPU speed
- Optimise for memory usage
- Optimise for power consumption
- Optimise for maintainability
- Optimise for developer time
Make your educated choice!
@christopherg2347 5 months ago
Personally: Wait until you are in testing so you actually see what needs optimisation.
@ME0WMERE 5 months ago
luckily these aren't all mutually exclusive
@infinitecrayons 5 months ago
Glad someone high up in the comments mentioned optimising for dev time, too. Sometimes you're not just trying to make something run a bit more efficiently, you're also trying to finish the next piece of software sooner to bring someone that benefit sooner.
@andrewharrison8436 5 months ago
Maintainability first; that actually cuts down on bugs, so it rolls into developer time. Then you can take your pick (or spade or crowbar, as preferred).
@edmondhung6097 4 months ago
CPU speed can be bought with money. Memory size can be bought with money. Power consumption is not a concern on a non-battery device. Now make your choice between:
- Optimise for maintainability
- Optimise for developer time
@Adeith 5 months ago
As someone who has to teach juniors to optimize games: these are the types of optimization that are the least important, and it is very rare that you actually bother with them. The part he scoffed at also kinda glossed over a very important point: the reason you don't just write it "right from the beginning" is that you don't only do trade-offs between speed, memory and battery, you also do them against maintainability, which these kinds of optimization ruin and which is the thing you should be optimizing for while writing it the first time around. Also, by far the most important optimizations are choosing the correct algorithm, data structure and architecture. Those optimizations are almost never premature and can give speed-ups of many orders of magnitude, compared to the micro-optimizations shown that might give 2x at most.
@Rodrigo-me6nq 5 months ago
Exactly, he brushes off premature optimization then goes straight to prematurely optimizing memcpy, one of the most optimized routines in the CRT. Real-world implementations of memcpy do everything he mentioned and much more.
@JackMott 5 months ago
Yeah it makes me twitch when people do the premature optimization quote. Getting the basic memory layout of your data right from the start is important, as it will be hard to adjust that later.
@dmail00 5 months ago
Was looking for a comment like this. To write fast code you need to consider your data from the start not afterwards.
@pierreollivier1 5 months ago
Yeah, he didn't even mention SIMD, or how to break dependency chains to give the CPU more room for speculative execution, or how to optimise cache line usage by using a DOD (data-oriented design) approach. Those are where the bottleneck usually occurs; if you look at what the CPU is doing most of the time in poorly optimised software, it is waiting to get data from the cache.
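As a rough illustration of the SIMD point, here is a minimal sketch of a wide copy using SSE2 intrinsics (x86 only). It is not a real memcpy — no overlap handling, unaligned loads for simplicity — just the shape of "move 16 bytes per instruction" that the comment is talking about.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes, 16 at a time where possible. Sketch only: a production memcpy
 * also handles overlap, alignment and large-block strategies. */
void copy_sse2(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);  /* unaligned 16-byte load */
        _mm_storeu_si128((__m128i *)d, v);                /* unaligned 16-byte store */
        s += 16; d += 16; n -= 16;
    }
    while (n--)           /* tail: remaining 0-15 bytes */
        *d++ = *s++;
}
```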
@JackMott 5 months ago
But it's a beginner video; you'd kinda need a few hours to delve into that stuff.
@LordKibblesTheHeroGothamNeeds 5 months ago
A note on the idea of not optimizing until it's working. For context, I write embedded C for microcontrollers in time-critical systems that have safety requirements (aerospace and such). With that in mind, the software architecture needs to be designed with some degree of optimization in mind, but that aside, I agree: we do not optimize until a functional block is complete. The biggest benefit we get from doing this is readability and maintainability; it is just as important that a piece of code can be understood as it is that it works. I have worked with a lot of legacy code that was written without oversight and focused on hand-coded optimization from the start. The result is code that is hard to follow and is prone to mistakes when it is picked up by another developer.
@AileTheAlien 2 months ago
The most fun part, is when you get into arguments about what "readable" means. 😅
@Schnorzel1337 5 months ago
One huge trick John Carmack taught: if a piece of code is written poorly and you know it, but at the moment the input size is so small that it doesn't matter, give it a nice living comment with, for example, an assert. Example: you have an array of unsorted numbers and you want to find whether 3 numbers add up to zero - the 3-Sum problem. If your first approach takes O(n^3) instead of the optimal O(n^2), it's fine, go ahead. Then put down: assert arr.length
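Sketched in C, the "living comment" idea above might look like the following. The 1000-element threshold and function name are made-up examples for illustration, not Carmack's actual numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Deliberately naive O(n^3) 3-sum check. The assert is the "living comment":
 * it documents the assumption that n stays small, and it fires loudly the day
 * that assumption stops being true. (The threshold of 1000 is a made-up example.) */
bool has_three_sum_zero(const int *arr, size_t n)
{
    assert(n <= 1000 && "naive O(n^3) 3-sum: rewrite as O(n^2) before using larger inputs");
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            for (size_t k = j + 1; k < n; k++)
                if (arr[i] + arr[j] + arr[k] == 0)
                    return true;
    return false;
}
```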
@Slarti 5 months ago
Always optimise for debugging. Someone is going to have to fix your code at some point - always write the code so it is easier to step through and fix it.
@jakeezetci 5 months ago
This really depends on your field of work. Nobody is going to look through my code that computes solar magnetic energy - people just import the function and get their number. If they want to know what happens inside, they can check the formulae in my article.
@rudiklein 5 months ago
@@jakeezetci Assuming you've implemented your formulas correctly in your code. I might want to inspect the code itself.
@rudiklein 5 months ago
True! You want to optimize for readability and maintainability too.
@SimGunther 5 months ago
If only compilers had a super-optimiser pipeline to magically take readable & debuggable code and turn it into the most optimized code possible in release build mode...
@Wyvernnnn 5 months ago
Code is run more often than it is read, and read more often than it is written. Optimize for user experience, then for readability, then for yourself.
@LunarcomplexMain 5 months ago
It's also helpful to just make the program first without thinking about optimization at all, because once you have it finished (or whatever part you're working on), any changes you make afterwards while trying to optimize can be tested immediately against the code you've already written.
@jakeezetci 5 months ago
yeah that’s exactly what’s said in the video
@sttate 5 months ago
You say that like he didn't say it six times in the video.
@solhsa 5 months ago
Modern compilers are pretty insane. Writing benchmarks is difficult because the compiler may realize what you're doing and optimize your payload away.
@atomgutan8064 5 months ago
That is actually really funny lol. They just optimize too well beyond our human understanding.
@Me__Myself__and__I 5 months ago
@@atomgutan8064 It's still very much understandable by humans, if anyone had that much time to spend. The problem is nearly every CPU is different these days. They have different sets of instructions or extra instructions. They have different timings for the individual instructions. Some have advanced caching and jump prediction. So the problem is that it would take a human a vast amount of time to fully understand all the ins and outs of a single CPU - but your software may deploy to many different types of CPUs! The companies who write optimizing compilers hire bunches of people to specialize in such things and embed that knowledge into the compiler these days. Also remember that since CPUs are so much faster these days (GHz) and there is lots of available memory, a compiler can use vastly more resources to find the optimal code than was feasible years ago.
@mytech6779 5 months ago
@@atomgutan8064 They are actually very obvious and easy optimizations, which is why it can be a trick to stop them. Maybe you want to test some algorithm with a fixed input for consistent speed results, but the compiler is like "Hey, these inputs are all constant values, so the result is never going to change; I'll just substitute the final answer for the entire algorithm and print it to screen." The solution isn't that complex though: just fetch the inputs at runtime from a separate file or pipe them in from stdin. This way the compiler and optimizer do not know the values and must assume they are truly variable.
@lucbloom 5 months ago
The sheer subtle effect of v = *ptr; vs v += *ptr; in benchmark loops is a good reminder of your need to pay attention.
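A tiny benchmark skeleton, sketched in C, that illustrates both of the points above: feed the workload values the compiler can't see, and keep the result live so the loop can't be deleted as dead code. The loop body and names are just stand-ins for whatever is being measured.

```c
#include <stdio.h>
#include <stdlib.h>

/* Benchmark skeleton: the iteration count comes from argv (so it can't be
 * constant-folded) and the result is accumulated and printed (so the loop
 * can't be removed as dead code). If the body were `v = data;` and v were
 * never used afterwards, an optimizer could legally delete the whole loop. */
int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 1000000;
    long v = 0;
    for (long i = 0; i < n; i++)
        v += i ^ (i >> 3);             /* stand-in for the work being measured */
    printf("checksum: %ld\n", v);      /* using the result keeps the work alive */
    return 0;
}
```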
@mikep3226 5 months ago
I am reminded of a story I was told about a Fortran compiler being written in the 1970s for a brand new computer (the DG Eclipse). They spent lots of time working on getting all the optimizations they could in. They had run lots of incremental tests as it was developed, but the first big test was writing a large Fortran program which had all the types of coding it was supposed to optimize and then running that through the full compiler. The problem was the compiler produced a resulting binary that didn't compute anything, until one of the other engineers (the one I heard the story from, _not_ working on the compiler) remarked, "your program as written produces no output, so the optimizer noticed that _none_ of the code was relevant and optimized it all away!"
@TheIncredibleAverage 5 months ago
Software can be, and most often is, orders of magnitude more complex than in 1974, when Knuth wrote that. I would go so far as to say optimization in software means something different now than it did then, since we can do a lot of optimization without writing assembly or messing with the hardware. You could make a wise optimization choice before you even start writing by choosing a package or library with no dependencies, for example. In other words, optimization can't be considered a single step of the process anymore. I can't fathom waiting until I'm close to finishing a project before thinking about efficiency.
@christopherg2347 5 months ago
Where did you find "until I'm close to finishing"? Why in blazes are you testing your code that late?
@johnbennett1465 5 months ago
I am disappointed that he didn't even mention the memory alignment problem. At best, unaligned access is a significant performance hit. At worst, it fails. I don't know if any current computers have the limitation, but I have worked on computers that require all accesses to be aligned to the data size.
@amigalemming 4 months ago
He'd have done better to tell people to just call memcpy.
@rmsgrey 5 months ago
Just watching the video, I can see two obvious (but mutually incompatible) optimisations (in terms of number of instructions per loop) before trying loop unrolling:
- rather than counting up until R2 matches R3, copy R3 into R2 before the loop (if you want to preserve the original byte count for some reason) and use a single decrement-and-compare-to-0 instruction rather than the separate increment and comparison instructions.
- rather than having a separate loop counter at all, calculate what the final value for one of the pointers should be and compare that pointer with that value.
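The second suggestion — compare a pointer against a precomputed end address instead of keeping a separate counter — looks roughly like this in C. A sketch of the video's byte copy only, with none of its other optimisations applied.

```c
#include <stddef.h>

/* Byte copy with no separate counter: the destination pointer itself is the
 * loop variable, compared against a precomputed end address. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char *end = dst + n;  /* one extra instruction before the loop... */
    while (dst != end)             /* ...saves the counter's increment-and-compare inside it */
        *dst++ = *src++;
}
```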
@AnttiBrax 5 months ago
Remember kids, you are only allowed to quote Knuth if you are going to profile and optimise your code later. If you don't, it's just an excuse for writing sloppy code.
@Me__Myself__and__I 5 months ago
YES! Almost no one profiles and optimizes later anymore. And because of teachings like this, very few write decent code that isn't horribly slow with terrible algorithms/data structures. This Knuth quote really needs to die; it was for a long-past time when people actually cared about machine instructions and loop unrolling.
@clickrick 5 months ago
"Make it right before you make it fast." (P.J.Plauger) I was taught this in one of the first classes in my CompSci course back in the 70s.
@Me__Myself__and__I 4 months ago
@@clickrick Because it made sense in the 70s. And because, even with that said, there was a baseline of reasonable quality (aka "making it right") that was just expected. Newer developers don't have that same quality baseline; they write inefficient, unscalable and unmaintainable garbage and use quotes like this as justification. If professional coders back in the 80s had written code that was this poor, they'd have been unemployed.
@matankabalo 5 months ago
Very interesting video, Can't wait for the next one!
@CarlosFernandez14 5 months ago
it's funny how we go back to paper on Computerphile videos lol; always cool to learn these concepts Dr. Steve's way.
@erikhgt6020 5 months ago
Thanks for the great critical questions by the cameraman
@Me__Myself__and__I 5 months ago
Quite a number of really good comments here calling out how this video covers all the wrong things and is off base. Glad to see so many people understand that and are willing to speak up.
@kenchilton 5 months ago
I am usually concerned with optimizing for reliability, maintainability, and testability. Performance means little when code is fragile, because running code is generally faster than crashed code. Walking that line between writing compact (performance and resource optimized) code and clear code is an art form in itself.
@lyndog 5 months ago
I agree with you. Broken code has either a time complexity of infinity (it doesn't run or never finishes) or a memory complexity that is effectively infinity. Unoptimised code is always less in both dimensions.
@mikep3226 5 months ago
I had a job once that I described as being the final optimization pass of the compiler. In the mid-70s a friend of mine (Rodger Doxsey, bio on wikipedia) was one of the PIs for an X-Ray astronomy satellite (SAS-3). Once every 90 minutes they got a large downlink of data from the observations in that orbit and had to analyze it quickly to decide if they wanted to change the orders for the next orbit. The problem was the analysis was all done by Fortran code written by astrophysicists, and the data was all packed in 4, 5 and 6 bit fields in larger data words. So, as the first real Computer person to look at the code, the first optimization was to just know how bit arithmetic worked and improve the Fortran in the inner loop a bunch (and, IIRC that halved the time it took for the program to run). But then I looked at the generated assembly code and realized that by better use of the machine code bit shift/mask features, it could be improved much more (IIRC, a factor of 5 this time). They were very grateful for the extra time that gave them to think about what the data actually meant.
@adamburry 5 months ago
There was a missed opportunity here. Based on the example, I was expecting you to circle back to the point about making your code correct before optimising it. This code fails for a certain case of overlap between src and dst; it has the potential to overwrite your data before you've had a chance to move it.
@kierengracie6883 5 months ago
That's what memmove is for ;)
@uuu12343 5 months ago
I think everyone in the comment section has to remember: this video is meant as a general introduction to the necessary rules and purpose of code optimization, NOT code optimization for C (using GCC, gdb and assembly). These notes are meant to apply to code optimization for, say, Rust, Go, Python, etc. So while memcpy might make optimization via manual control easier, Python and Rust do not have memcpy. This uses manual approaches to fully understand the flow, not the tools.
@johncochran8497 5 months ago
Nice. Although he missed a few points. Copying 4 at a time is good, but it also introduces potential alignment issues. Some processors can only access larger chunks of data at natural alignments. And for many that can handle unaligned accesses, the unaligned access is slower.
Another thing missed: the basic copy loop was like
while (i < n) { *p++ = *q++; i++; }
That increment of i is mostly wasted effort; it is only there to keep track of how many bytes have been copied. Additionally, comparing the pointers directly is pretty easy. So how about:
limit = p + n;
while (p < limit) { *p++ = *q++; }
Now we've eliminated the increment of i and instead are simply using the unavoidable increment of the pointers themselves to control the loop. So with his ARM example, we now have 3 opcodes per byte instead of the 5. Of course, the optimizations illustrated in the video can also be used as well.
@fishsayhelo9872 5 months ago
Speaking of unrolling, one of my favorite C "techniques" for doing so has got to be Duff's device, which makes clever use of some C language features to implement loop unrolling at runtime, similarly to what's shown in the video.
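For readers who haven't met it, a memcpy-flavoured variant of Duff's device looks roughly like this. It's a sketch, not Duff's original (which wrote to a fixed output register rather than incrementing the destination), and it assumes count > 0.

```c
#include <stddef.h>

/* Duff's device: the switch jumps into the middle of the unrolled do/while to
 * handle the count % 8 leftover bytes, then the loop runs count / 8 full rounds.
 * Assumes count > 0. Modern compilers usually unroll a plain loop just as well. */
void duff_copy(unsigned char *to, const unsigned char *from, size_t count)
{
    size_t n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```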
@styleisaweapon 5 months ago
It's not loop unrolling that Duff's device gives you, it's a jump table.
@kierengracie6883 5 months ago
It's barely readable and obfuscating. Just use memcpy which will do that anyway. memcpy does not copy 1 byte at a time if it can get away with it. If memcpy does not do that (check the assembly output) you can write a much more readable version using inline assembly anyway.
@SimGunther 5 months ago
​@@styleisaweaponTechnically it's a simple jump table that takes care of N mod 8 iterations on data before N div 8 iterations on the rest of the data. Nowadays, it's reversed when you let the compiler optimize the loop using SIMD instructions.
@balijosu 5 months ago
🤮
@styleisaweapon 5 months ago
@@kierengracie6883 While I agree that it's "barely readable", saying "just use memcpy" tells us you only think Duff's device is for memory copying, which is such gross ignorance that maybe, just maybe... hush.
@AssasinZorro 5 months ago
I feel like the game "Human Resource Machine" gives a great introduction to programs and optimization. Your video is complementary to the game itself.
@adam_fakes 5 months ago
My first Software Engineering lecturer (30 years ago) taught me this motto "Make it work, Make it better"
@bigutubefan2738 5 months ago
Steve Bagley is one of the most underrated people on the Internet.
@Me__Myself__and__I 5 months ago
Not based on this video. This is completely wrong, focuses on extremely outdated things and would be useless in a real world development situation.
@MechMK1 5 months ago
The best thing to optimize for is readability and maintainability. I've seen people write "optimal" code before, which was ~20% faster than baseline, but at the cost of maintainability. These days, for most applications, maintainability is key.
Also, regarding "the right algorithm": caching is so often overlooked. Trading off memory usage for execution time can be extremely valuable, especially if the computation is expensive or requires IO.
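As a toy illustration of that memory-for-time trade, here is a minimal memoisation sketch in C. The expensive_metric function, the 4096-entry direct-mapped table and all the names are made up for the example.

```c
#include <math.h>

/* Trade memory for time: remember results of an expensive pure function.
 * "expensive_metric" and the 4096-entry direct-mapped cache are illustrative only. */
static double expensive_metric(int key)
{
    double acc = 0.0;
    for (int i = 1; i <= 100000; i++)        /* stand-in for costly computation or IO */
        acc += sin((double)key / i);
    return acc;
}

static int    cache_key[4096];
static double cache_val[4096];
static char   cache_set[4096];

double cached_metric(int key)
{
    unsigned slot = (unsigned)key % 4096;
    if (cache_set[slot] && cache_key[slot] == key)
        return cache_val[slot];              /* hit: no recomputation */
    double v = expensive_metric(key);        /* miss: compute once and remember */
    cache_key[slot] = key;
    cache_val[slot] = v;
    cache_set[slot] = 1;
    return v;
}
```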
@elpapichulo4046 5 months ago
Hard disagree
@MechMK1 5 months ago
@@elpapichulo4046 Would you mind stating why?
@Richardincancale 5 months ago
Two other optimisation targets you might consider:
1. Optimise for stability - particularly in programs operating in real-time environments like transaction processing or process control... avoiding technologies that can lead to memory leaks etc.
2. Optimise for maintainability - for long-lived code that may be maintained by a separate team, the ability to do things simply and correctly, without any tricks or nooks and crannies, will be more valuable in the long term.
@amorphant 5 months ago
Those aren't optimizations.
@trapexit 5 months ago
The ARM has the ability to read multiple words into and out of memory/registers which could also be used to improve cycles per byte of a copy routine.
@Yupppi 5 months ago
My favourite topic. I love beautiful, aesthetic and efficient code. Sometimes beauty and magic don't go well together, sometimes they do (there are people who work on standard libraries who say that, generally, the simpler the solution, the better it is). Then you see matrix multiplication optimization.
Sean's question about whether it isn't better to design it well at first is, in my opinion, great for drawing out the difference between good design and optimization. Because at times optimization is definitely not "good design" in terms of, say, readability. And the implementation details shouldn't really be part of the design talk, I don't think. For the general idea, yes, but the optimizations come from seeing what the actual code is like. Like Sean Parent's classic "that's a rotate" speech at GoingNative 2013.
I think the good way to explain the compiler and optimizing is that the compiler can only optimize as much as you allow it to: if you keep a layer of mystery over everything in the code, the compiler can't deduce what is behind the curtains and has to consider the worst case possible. If you write it smarter, the compiler can just look at the code and see "oh, you declared that const, that constexpr, you wrote loops or avoided loops where I can just eliminate them and skip calculations, just saving the answer", etc. I.e. if you give the compiler enough information about your code (don't go declaring types where you should allow the compiler to deduce them), it can do magic tricks that you couldn't even match with clever bit-manipulation tricks. The compiler WILL outsmart you if you give it an opportunity. But, for example, deciding which type of pointer you use, which kind of virtual functions you make, and how you pass your pointers or pass by value/reference will improve things manually. And using vectors and STL algorithms. Remove loops and branches. Remove unnecessary copies and allocations. And don't force the compiler to do something unnecessary by being overly explicit; don't make an int pointer just to be smart, because the pointer takes more resources. Of course these apply more to C++ than C, they're obviously different beasts.
Matt Godbolt has fantastic demonstrations with his Compiler Explorer of how you shouldn't try to outsmart the compiler but work with it. Jason Turner's Commodore 64 game in C++17 is also an impressive demonstration of zero-overhead abstractions and compiler magic. Bjarne Stroustrup also had that article about linked lists vs vectors and how, despite a lot of testing and despite being a very unintuitive result, vectors came out on top almost always. And switch magic...
@avramcs 5 months ago
I think another point, regarding the question asked at 2:05, is that who really knows the "right way" to build something? It's hard to build something the right way at the start because you are trying to turn your mental abstractions of your program ideas into actual code.
@cidercreekranch 5 months ago
Your definition of optimal reminds me of the definition for recursion in the Devil's DP Dictionary. Recursion: noun, see Recursion.
@TheDeanosaurus 5 months ago
At an enterprise level there's another optimization layer which is the human element; maintenance, reusability, and extensibility. There are times in our projects we often forego true computational optimization for readability or ease of use of a certain API. I think from a business standpoint that usually comes first (fortunately or unfortunately) because it affects the amount of resources required to solve a problem, especially given the higher order languages that already either have some optimizations built into the compiler/translation layer or, on the other end, are so abstracted away that computational optimization isn't possible. Not to say this isn't still thought about, we consider O(n) daily even if we don't directly solve for it, it's just either second nature or not worth the additional engineering time.
@mytech6779 5 months ago
Optimizing those human items is often not in direct conflict with performance optimization. There is no need to make a poorly commented, confused, intertwined mess when writing a better list of instructions; that tends to be a sign of someone who really doesn't know what is going on, just throwing things at the wall to see what sticks. At the other end, general overly verbose codebase bloat usually hurts all types of optimization, human and machine.
@christopherg2347 5 months ago
Absolutely "When I was working on Visual Studio Tools For Office we did comparatively little making the customization framework code _run_ faster because our tests showed that it typically ran fast enough to satisfy customers. But we did an enormous amount of work making the framework code _load_ faster, because our research showed that Office power users were highly irritated by noticable-by-humans delays when loading customized documents for the first time. " "Which is faster?", Eric Lippert, "Part the fifth: What is this “faster” you speak of?"
@TheDeanosaurus 5 months ago
@@christopherg2347 Which in essence IS optimization for loading. We have made similar decisions in iOS development to actively avoid dynamic linking because it bloats application launch times. There's newer functionality that allows for dynamic linking during debug builds and then merges those libraries into a single (or multiple depending on how configured) binary which makes release builds take longer but launch times much faster. But again that's a different kind of optimization, sometimes we'll write code and abstract away several layers of an operation to make it more testable. Does this make it run slower in production? Possibly? At a scale above a few nanoseconds? Probably not, so we don't even think to optimize for runtime performance in that instance, we optimize "for stability" instead by ensuring test coverage is there. That ultimately saves support time, saves our maintenance time, and would hopefully save consumers' time by working the first time. Even 30 seconds having to relaunch an app from a crash is more time (and bits) burned than a slightly less optimal but unstable piece of code.
@ericon.7015 5 months ago
I've arrived at the same conclusion by experience, while coding multiple migration scripts that had to be ready to use as soon as possible. Soon enough I realized that first I had to make the code work and do the job, and only then do the optimisation. At least then you know where you could optimise. Otherwise you will not be optimal and will lose precious time.
@kierengracie6883 5 months ago
memcpy will most likely beat any hand-written attempt you write, since it has already been highly optimised for the platform you compile for. You would have been better off explaining other optimisation methods like caching recent results, using lookup tables, avoiding branches if possible, utilising the cache, etc. Also stuff like: don't calculate the length of a vector if you can use the length squared instead.
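The length-squared trick mentioned at the end looks like this in C — a sketch, assuming a simple 3-component vector type. Because the square root is monotonic, comparing squared lengths gives the same answer without ever calling sqrt.

```c
/* Comparing distances via squared lengths: same ordering, no sqrt calls. */
typedef struct { float x, y, z; } Vec3;

static float length_sq(Vec3 v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

/* Is a closer to p than b is? No square roots needed. */
int closer(Vec3 p, Vec3 a, Vec3 b)
{
    Vec3 da = { a.x - p.x, a.y - p.y, a.z - p.z };
    Vec3 db = { b.x - p.x, b.y - p.y, b.z - p.z };
    return length_sq(da) < length_sq(db);
}
```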
@kwzu 5 months ago
It's probably more as an example in the video, but there are a few times when memcpy is better handwritten, namely when it's a known-to-be-small byte count and the compiler still calls libc (since that call takes some time, especially if there's a cache miss)
@atomgutan8064 5 months ago
Yeah not calling sqrt when just comparing lengths of 2 vectors probably saves a lot of time when you have to do collisions.
@gregorymorse8423 5 months ago
@@kwzu Compilers inline functions like memcpy with optimizations on, specifically to avoid that. Small sizes or remainder sizes presumably are optimized out when inlining as well...
@gregorymorse8423 5 months ago
@atomgutan8064 Square root is a unary operation that takes one clock cycle. It's literally as expensive as integer addition. The only penalty is the extra clock cycle each for loading and retrieving from the x87 FPU. You should probably learn square root algorithms before talking ignorantly about how expensive they are, when they aren't.
@kierengracie6883 5 months ago
@@gregorymorse8423 It's even faster when you don't use sqrt though... fastest code is no code. You can compare lengths of 2 vectors by comparing the magnitude squared. When it comes to collision you should probably partition your space so you don't compare every object to every other one as well.
@UncleKennysPlace 5 months ago
When we wrote code for the USAF in the mid-90s, each of us was assigned, beyond our workstations, the least powerful computer in use at the base. Everything needed to run properly on that lowly PC; "it runs on my PC" would get you chastised severely. I learned to write tiny code; now I can write verbose code, and the compiler does the hard work.
@Robstafarian 5 months ago
The sight gags are always appreciated.
@GRHmedia 5 months ago
Power is usually solved by also solving for performance. The fewer cycles something takes, the less power it uses. The less time it runs, the less power it draws over time. After that, it is more efficient to configure the hardware to run at a lower power level than to try to make the code use less power. If you need to switch between one core and 64 cores in some low-power situation, then you could handle that in code. However, if you only ever need 1 core out of 64, it is best just to adjust the hardware to that need.
Performant code tends to solve other issues too. Performant code is usually smaller - fewer lines. Fewer lines means fewer chances for bugs. It also makes the code easier to maintain and reuse in the future, because there is less for someone new to understand. Fewer lines also usually translate to fewer machine instructions in most cases, which makes fitting in memory easier. Granted, compiler flags can also change this, but if it is smaller to start with, it will usually end up smaller whatever the compiler flags are set to.
@Tawnos_ 5 months ago
@13:52: The computer scientist and the mathematician square off. The mathematician answers "you've got a problem with odd numbers", recognizing that anything indivisible by 4 is indivisible by 2, twice. The computer scientist replies "you've got a problem with odd numbers or numbers that are not a multiple of 4", missing that their added comment was included in the original criterion. It's a great reminder to me (a computer engineer) to take time to process what the other person said before replying.
@petermolineus3905 5 months ago
"Optimization" - optimization for readability and understandability should be first goal.
@Wyld1one 5 months ago
Other types: complexity, understandability, portability (cross-platform), testing. Sometimes the debug libraries _contain_ errors, thus causing problems when profiling, optimizing and validating. Bugs, errors and slowdowns can also be encapsulated (built in) in the compiled versions of libraries as well.
@Wyld1one 5 months ago
Do you need to optimize at all? CPUs are our general way to compute things. What if you don't need to compute at all - what if you could just look up the results? Generate a table once, look it up after that. What if there's a shortcut computation? If there's a shortcut computation, there's probably also a shortcut lookup. If you need more precise results, that's the time to make more precise lookup tables.
How important is it to be optimizing in the first place? Are you doing trillions of operations and you need them in a fraction of a second? Well, yes. If you're doing one operation every year, well, who cares.
@johnbennett1465 5 months ago
@@Wyld1one For some problems, tables are a great optimization. For others, the table would require a memory device larger than the observable universe. I have worked with both cases.
@TroZ_Games 5 months ago
One other optimization that you could do before loop unrolling is getting rid of the counter (i in the pseudocode). Have one line before the loop calculate the address of the last byte to copy (src address plus the byte count). Then in the loop you don't have to do i++, just compare the src pointer to that calculated last byte to copy; if they are equal, exit the loop. For the copying-four-bytes-at-a-time version, that would make it one byte per instruction.
@christopherg2347 5 months ago
I can highly recommend the Article "Which is faster?" by Eric Lippert. After a giant like Knuth, he is one of the biggest authorities on the matters of programming.
@xybersurfer 5 months ago
I'm not that impressed by Eric Lippert, honestly. It's probably his bad takes on the StackOverflow website that gave me a bad impression.
@RipVanFish09 5 months ago
Seeing that old printer paper brought back some memories.
@mikoajzabinski3569 5 months ago
In the second assembly program, shouldn't the add instruction have #4 at the end?
@federicomoya4918 5 months ago
Great content, thank you!
@ChrisWalshZX 5 months ago
Being a lifelong Z80 coder, these techniques are quite familiar. I don't know if using the stack pointer is feasible for fast data transfer on a modern CPU?
@sidpatel77 5 months ago
Bro is using a literal notepad as a text editor, absolute chad
@andrewharrison8436 5 months ago
The main optimisation is, in my opinion, readability. Comments are free at run time once the compiler has removed them. Comments are priceless at 4am when things have gone pear-shaped. Actually, comments may avoid that 4am phone call altogether by forcing you to actually think about what you are doing when writing the program.
@RealCadde 5 months ago
About the bit on using the compiler to optimize your code: most times it's not a matter of badly compiled code but a bad approach to the problem. Say a function needs the first million prime numbers... You wouldn't calculate the first million prime numbers on each function call. You would store a table of the first known million prime numbers somewhere so they can be looked up as they are needed. I.e., not CalculatePrime(N) but rather FetchPrime(N). No compiler I've heard of can figure this optimization out for you. It can only make the CalculatePrime() function run faster; it can't re-design your program to fetch a prime from a table.
The best example of optimization I can recall is that of the game Factorio. Their code was optimized but it wasn't enough. They had to consider where the resources they accessed lived in memory - was it in slow RAM or fast cache memory? They re-designed their code so that as much as possible was living as close to the CPU core as possible when they needed it. Apparently, no amount of compiler optimization could do this for them, even though modern processors are supposedly good at organizing what can live in cache vs RAM. They changed it so that when something existed in cache, they did everything they needed to do to that memory in one batch, and only then would they discard the parts of that close-to-CPU data that they knew they wouldn't be needing for a while. Especially when it came to pathfinding and fluid network calculations.
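A minimal sketch of the table idea described above, reusing the commenter's hypothetical CalculatePrime/FetchPrime naming; the table size, the naive build step and the lazy initialisation are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_PRIMES 1000000                /* illustrative table size */
static int  prime_table[N_PRIMES];
static bool table_ready = false;

/* Slow path: naive trial division, run once to fill the table
 * (a real build might bake this table into a file at compile time). */
static void BuildPrimeTable(void)
{
    int count = 0;
    for (int candidate = 2; count < N_PRIMES; candidate++) {
        bool is_prime = true;
        for (int d = 2; d * d <= candidate; d++)
            if (candidate % d == 0) { is_prime = false; break; }
        if (is_prime)
            prime_table[count++] = candidate;
    }
    table_ready = true;
}

/* Fast path: FetchPrime(i) is a table lookup, not a recomputation. */
int FetchPrime(size_t i)
{
    if (!table_ready)
        BuildPrimeTable();
    return prime_table[i];
}
```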
@christopherg2347 5 months ago
97% means it does not apply 3% of the time. Plenty of other games had no such issues.
@S_t_r_e_s_s 5 months ago
Premature optimizing is definitely what I struggled with the most at first in software development.
@KilgoreTroutAsf 5 months ago
You were learning. Nothing wrong with that.
@christopherg2347 5 months ago
"Premature optimisation. They say it happens to 4 out of 5 programmers 😉"
@Takyodor2 5 months ago
@@christopherg2347 Yeah it happens to roughly 5 out of 4 programmers. Hold on, it seems I swapped a couple of variables somewhere, dang it this code is completely unreadable!
@christopherg2347 5 months ago
@@Takyodor2 "The two hardest problems in programming are: Cache invalidations, naming things and off-by-one errors."
@Takyodor2 5 months ago
@@christopherg2347 🤣 I love that one!
@reecelawson2403 5 months ago
Would you be able to make a video explaining what virtual cores are please?
@tpobrienjr 5 months ago
The quote from Knuth also applies to databases: over-normalization.
@feandil666 5 months ago
This saying about not optimising too early was coined at a time when programs were mostly small algorithms implemented in C. It doesn't apply to a bigger system, especially with asynchronous and latent operations, where you really have to think about optimisation early enough not to create an unoptimisable mess. For instance, if you use OO you'll never get the performance you can get with a data-driven approach.
@Me__Myself__and__I 5 months ago
I love seeing people call this out. I've been saying this for a decade, and for a long time no one else seemed to be saying such things. Regarding OO, everything has its place. OO is really excellent at dealing with complex intertwined data that needs to adhere to certain rules. Trying to do such things without OO can be a nightmare. But if you're dealing with vast quantities of simple data, OO can add a lot of overhead just from allocating, deallocating and fragmenting your memory. Right tool for the job.
@Kniffel101 5 months ago
For most modern CPUs it's less likely that you need ASM optimization; poor cache utilization is the biggest bottleneck in most cases. Until we have something like 512MB+ L3 caches on consumer CPUs, which on the other hand isn't too far into the future, at least for desktops! =P
@KilgoreTroutAsf 5 months ago
When we finally get 512MB caches, bad programmers will already have figured out a way to use an extra 2GB of memory for a simple quicksort
@Kniffel101 5 months ago
That might unfortunately be the case, yeah... @@KilgoreTroutAsf
@christopherg2347 5 months ago
@@KilgoreTroutAsf Which is avoided by using a existing implementation, instead of prematurely optimizing a custom one.
@LarsHHoog 5 months ago
Solving the right problem is my professional mantra, because a good solution to the wrong problem is a waste of time. That said, Z80 assembly tweaking is a hobby when coding for the ZX Spectrum.
@CoolJosh3k 5 months ago
I'd have thought that for a divide by 4, where 4 is a constant, the compiler would optimise it to a shift operation for us?
@fghsgh 5 months ago
As a programmer who usually codes in assembly... gosh, compilers are really stupid sometimes. A lot of the time you have to write the code in C using weird builtins like __builtin_ctz or SIMD stuff, and it'll be as good as in assembly, but those are the kinds of optimisations compilers do miss. And in the case of older or embedded architectures, you can really save a _lot_, because these architectures are hard to optimise for. But it makes sense that compilers aren't perfect, btw. We're at the point where adding more optimisations to compilers will noticeably slow down compilation times. It's a tradeoff.
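For anyone curious, __builtin_ctz is a real GCC/Clang builtin (count trailing zeros). A small sketch of the kind of use the comment has in mind — iterating over the set bits of a mask:

```c
#include <stdio.h>

/* Visit each set bit of a mask using __builtin_ctz (count trailing zeros),
 * which typically compiles to a single instruction on x86/ARM.
 * The builtin is undefined for a zero argument, hence the loop condition. */
static void visit_set_bits(unsigned mask)
{
    while (mask != 0) {
        int bit = __builtin_ctz(mask);   /* index of the lowest set bit */
        printf("bit %d is set\n", bit);
        mask &= mask - 1;                /* clear that bit */
    }
}

int main(void)
{
    visit_set_bits(0xB4u);  /* bits 2, 4, 5, 7 */
    return 0;
}
```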
@realhet 5 months ago
And what remains when you take out optimizing from programming? The never ending process of decoding business requirements and translating them to code. I consider myself lucky if optimizing is part of the business requirements, because I found it an enjoyable puzzle, but sadly that's a rare opportunity.
@Me__Myself__and__I 5 months ago
Good. Always make it part of the business requirements and factor it into your estimates. The more time you spend thinking about how to write good, optimal code, the better you'll get at coding and the less time such things will take you later. It's a skill, and if you exercise that skill you'll be a better coder than most others.
@FalcoGer 5 months ago
@8:25 That program has a flaw when the source and destination overlap partially: it will overwrite the bytes it is going to copy later.
@SykikXO 5 months ago
17:25 For anyone wondering, it's str r5, [r0] instead of [r1].
@mountp1391 4 months ago
Thank you!
@FrancisFjordCupola 5 months ago
ARM assembly is so nice.
@sabriath 5 months ago
You should always check the pointer locations before copying memory... I know that's not part of the video explanation, but it's necessary to know which direction to copy the data. For example, if you have an array of 1000 bytes and you are copying from positions 2-1000 into 0-998, left to right will work... but say you are expanding the array and copying the data from 2-998 to 4-1000 - then you have a problem, because positions 6-1000 will contain the corrupt data of positions 2-3 repeated nonstop. So if the write position is higher in memory than the read position, you have to add the count to both positions and subtract with each loop, working backwards.
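That direction check is essentially what the standard memmove does. A rough C sketch, for illustration only — in practice, call the real memmove:

```c
#include <stddef.h>

/* Overlap-safe byte copy, in the spirit of memmove: copy forwards when the
 * destination starts below the source, backwards otherwise, so source bytes
 * are never overwritten before they are read. */
void copy_overlap_safe(unsigned char *dst, const unsigned char *src, size_t n)
{
    if (dst < src) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];             /* forward copy */
    } else if (dst > src) {
        for (size_t i = n; i > 0; i--)
            dst[i - 1] = src[i - 1];     /* backward copy */
    }
    /* dst == src: nothing to do */
}
```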
@jaffarbh 5 months ago
One optimisation strategy I use for my backend-heavy Django website is to make it fit on a Raspberry Pi 3! This way, I am forced to optimise it for speed as well as memory (the Pi has under 1GB of usable RAM).
@xybersurfer 5 months ago
that's a pretty cool strategy. i've been interested in doing the same with my applications. i think there is a lot of deep stuff to be learned that way, depending on how far you go
@user-cp1dn3ns7k 5 months ago
Do not worry, the editor guy, I read the covers👏🏻
@casperghst42 5 months ago
1) write readable code
2) make sure it works
3) if possible, optimize it
@JGnLAU8OAWF6 5 months ago
Only if needed and only after profiling it.
@christopherg2347 5 months ago
I would say "3) Optimize only actual issues revealed during testing"
@wmrieker 3 months ago
These are more the optimizations that a compiler can do behind the scenes. Most application programmers are concerned with picking the optimal algorithm. No compiler (AFAIK) is going to optimize which type of sort is best for the data you have.
@komolunanole8697 5 months ago
Always write readable/maintainable code first. If performance is an issue, profile and optimize. Also sad there was no mention of CPU magic like branch prediction/speculative execution, instruction pipelining, cache effects, etc. Optimizing for "lower number of instructions" may sound reasonable but is hardly the right metric on modern hardware.
@Amonimus 5 months ago
My best tip is to make it readable. You don't need the code to be efficient, but if you can tell where anything is you can fix it when necessary.
@milasudril 5 months ago
It may not optimize this code unless you say restrict, due to potential aliasing.
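A small sketch of that aliasing point: with C99's restrict, the compiler may assume the two pointers don't overlap, which is what frees it to reorder, widen or vectorise the copy. The function name is made up for illustration.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src might overlap, which
 * blocks reordering/vectorising the loads and stores. With restrict, the
 * programmer promises no overlap, and the optimizer can copy in wide chunks.
 * (Violating that promise is undefined behaviour.) */
void copy_restrict(unsigned char *restrict dst,
                   const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```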
@TheStevenWhiting 5 months ago
2:06 Much like what Cliff Harris of Positech Games said when he was optimising his game.
@amigalemming 4 months ago
An experienced assembly programmer would never count upwards, but always downwards. It saves you the comparison instruction, because decrement always includes a test against a zero result. It's not even counterintuitive - just do not call the variable n, but numBytesStillToCopy, or so. However, modern optimizers can detect and eliminate superfluous upward counters.
@wholenutsanddonuts5741 5 months ago
Can you do an episode on distillation as it applies to neural nets? I’d really appreciate that. TY!
@LudwigvanBeethoven2 5 months ago
Profiling before optimization saves a lot of time because then you know which parts are slow and need to be faster
@deepakr8261 5 months ago
And on ARM (and some other archs like RISC-V as well) you want to write your loop to count down to zero instead of counting up to the loop count, as they have decrement-and-branch-on-zero style idioms that are faster than loading a value from a register, comparing two registers and deciding on the branch. RISC-V even has a register (x0) dedicated to holding the value 0. But again, do all these fancy optimizations make a real difference if you apply them in places which are not hot spots in your code? Probably not.
@TheWyrdSmythe 5 months ago
1. Make it work. 2. Make it work right. 3. Make it work fast.
@pierreabbat6157 5 months ago
What happens if you move four bytes at a time 1000 bytes from address 59049 to 65536?
@statphantom 5 months ago
It's probably also worth mentioning that bit-shifting right doesn't always act as a divide by 2, especially with negative numbers. Most current versions of C/C++ (and I think Java) follow a round-toward-zero convention for integer division, but older ones may not, and some other languages may not either.
@mytech6779 5 months ago
The behavior of a bit shift has little to do with the high-level language; it is determined by hardware. Normally it is just truncation, but there may be differences in how the new most significant bit is filled for negatives (keeping note of signed vs unsigned).
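A concrete illustration of the difference both comments are circling, assuming the usual two's-complement arithmetic right shift (which is what mainstream compilers produce for signed types, even though the C standard leaves right-shifting negative values implementation-defined). This mismatch is also why compilers can't replace a signed division by a bare shift.

```c
#include <stdio.h>

int main(void)
{
    int a = -7;
    /* Since C99, integer division truncates toward zero: -7 / 2 == -3.
     * An arithmetic right shift rounds toward negative infinity instead:
     * -7 >> 1 == -4 on typical two's-complement machines. */
    printf("-7 / 2  = %d\n", a / 2);   /* -3 */
    printf("-7 >> 1 = %d\n", a >> 1);  /* usually -4 */

    unsigned int b = 20;
    printf("20u / 4 = %u, 20u >> 2 = %u\n", b / 4, b >> 2);  /* identical for unsigned */
    return 0;
}
```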
@oresteszoupanos 5 months ago
Sean, of course we can read the book titles 🙂
@francis_the_cat9549 5 months ago
The mindset of "oh, it's just 0.1 seconds and it only gets run once" is why we have to wait dozens of seconds (yes, that's a lot - computers are FAST) for programs to start
@JackMott 5 months ago
Usually when you optimize for speed you are also optimizing for power usage. Race to sleep.
@josefjelinek 5 months ago
Nowadays, rather than going down to the CPU instruction level, much bigger speed benefits generally come from optimizing the memory layout of the data structures used. And that can be done at a higher level than CPU instructions. What good is saving 4 instructions taking a few clock cycles when you have a cache miss and wait hundreds of cycles on a simple read from main memory... Not optimizing for CPU cache usage/hits (several levels) and prefetch is the biggest blind spot of videos like this one, IMHO.
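A tiny example of that data-layout point in C: if a hot loop only touches one field, a structure-of-arrays layout keeps each fetched cache line full of useful data. The Particle fields and names here are made up for illustration.

```c
#include <stddef.h>

/* Array-of-structures: iterating over just `alive` drags every particle's
 * position/velocity through the cache as well. */
struct ParticleAoS { float x, y, z, vx, vy, vz; int alive; };

int count_alive_aos(const struct ParticleAoS *p, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += p[i].alive;      /* 4 useful bytes out of each 28-byte struct */
    return count;
}

/* Structure-of-arrays: the same loop now reads a dense array of ints,
 * so every cache line fetched is entirely useful data. */
struct ParticlesSoA {
    float *x, *y, *z, *vx, *vy, *vz;
    int   *alive;
};

int count_alive_soa(const struct ParticlesSoA *p, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += p->alive[i];
    return count;
}
```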
@snoopyjc 5 months ago
Does the data still have to be aligned to load/store 4 bytes at a time? Back in the day, that was also a limitation of doing it 4 at a time.
@anianii 5 months ago
No, but it's faster if it is. There are operations for aligned and unaligned data. The aligned one will almost always be more efficient and take less time
@AnExPor 4 months ago
You can also optimize for readability and testability.
@hurktang 5 months ago
It's much clearer when you also start talking about optimisation for readability and optimisation for productivity. When you frame it like this, you never have to try to justify why sometimes optimizing is not necessary. It's always optimized; it's always about deciding which optimisation you go for and how deep you want to take it.
@250bythepark 5 months ago
I always thought a great example of optimisation is the fast inverse square root function that was developed for Quake 3
@Wyvernnnn 5 months ago
Which turns out to be slower on almost all hardware these days because of the dedicated sqrt processor instructions :')
@vst-name 5 months ago
Modern built-in algorithms are faster nowadays. Besides, back then it was more of a hack due to language limitations.
@Elesario 5 months ago
Sometimes you're optimising for development time rather than code performance
@simonclark8290 5 months ago
Optimising code for speed IS optimising for power. Software that executes a task faster allows the processor to sleep for longer and that reduces power.
@steelcock 5 months ago
thanks a lot
@Kosmokraton 5 months ago
you're welcome
@1.4142 5 months ago
optimising optimization
@Xilefian 5 months ago
Optimising `memcpy`! Traditionally you don't want to ever do that, but I actually wrote a fast 32-bit ARM `memcpy` for the GBA that uses 12 registers for a whopping 48 bytes copied per iteration when in ideal conditions. Each iteration is just 4 instructions:
```arm
.Lloop_48:
    subs r2, r2, #48
    ldmiage r1!, {r3-r14}
    stmiage r0!, {r3-r14}
    bgt .Lloop_48
```
I had to set the CPU mode to FIQ to free up the stack and link registers for an extra 8 bytes to copy per iteration
@himselfe 2 months ago
The quote "premature optimisation is the root of all evil" is the root of all evil. The quote is outdated and lacks important context from the original passage, and is used by people to excuse writing bad code. It is far better to practice doing it correctly the first time than it is to adopt the attitude of "good enough, we'll fix it in post". Looking at the software industry, there's hardly a pandemic of optimisation going on. Efficient code is an exception not the rule, and that holds the world back in many ways.
@not_a_human_being 5 months ago
As a pythonista, I simply MUST stress, that "readability matters". Optimising for readability is a thing (not necessarily the same as PEP 8).
@guotesuoze 5 months ago
The fact that you write a program in C, C++, etc. is in itself an optimization. But it's not a premature optimization. So the hard part here is discerning a premature from a non-premature optimization. Maybe there aren't hard rules for this, but you wouldn't write a software renderer in Python when writing a game engine and later optimize it to OpenGL/Vulkan and C++.
@davidgillies620 5 months ago
The memcpy function has been aggressively optimised for decades. These days, it won't even run in userspace, but will do all the work in the kernel. With modern architectures it can be written to use DMA, and quite possibly to not do any copying at all, via the MMU. In general, as long as you haven't made a completely boneheaded algorithm choice, the compiler's optimisations will be better than yours.
@RealCadde 5 months ago
2:20 He is right, it's better to make the program modifiable. If you just make it work but the code is a bloody mess, you are going to have to re-write a lot of code when you finally get around to trying to optimize it. Meaning you not only develop the same program TWICE, but now you also have to replicate how the horribly written program works in the new optimized version, minus the slowdowns or resource expenses. Premature optimization might be the root of all evil, but badly written code is like trying to find a way to divide by zero.
EDIT: I personally tend to do "budget programming", and by that (a term I just made up myself) I mean that every feature of my program has a budget. It can take 10 ms to execute, but not more than that, or I will most definitely have a frame drop on my target hardware at 60 FPS when it coincides with everything else that eats another 6.666 ms. Or this feature must not use more than 100 MB of RAM, since the rest of the program only uses 100 MB of RAM in total. I don't want runaway memory usage this early on, because refactoring later to use less just means I've wasted my time developing the first version and have to redo everything anyway.
I've never concerned myself with power usage, because I am not developing for phones or laptops or server farms etc. It's all speed and memory usage and, in some cases, network latency/bandwidth. I plan ahead of time what my goals for performance are, and if a newly added feature misses those goals, I optimize it right away, because developing on top of unoptimized features just means I have to change/refactor more code later on anyway to meet the goals. It's as if I include execution time and resource usage in my unit tests.
@amorphant 5 months ago
Don't let the comments confuse you. You don't "optimize for maintainability." A lot of people are conflating big-O complexity with code maintainability. Yes, you should always write your code to be maintainable from the start, but optimization you don't need to knock yourself out on from the get-go. Optimization specifically means reducing the big-O complexity of time, memory usage, or power usage, say from exponential to logarithmic. Big-O does not apply to the concept of human readability/maintainability.