CPUs Are Out of Order - Computerphile

Views: 187,707


Comments: 332

@flamencoprof 6 years ago

I'm 67 yo. I'm amazed that the training I was given in the 80's on early microprocessors, combined with the fun I had writing op-codes for my Commodore 64, enabled me to follow this. Thanks for the instruction!

@marrygrim199 6 years ago

Coool!

@kricku 6 years ago

"I'm 67 yo" is now one of my favorite phrases

@Chris47368 4 years ago

You were always ahead of your time 😀

@christopherlawley1842 4 years ago

you are not alone

@ShawSumma 6 years ago

He is a really slow C compiler. I’ll stick to my usual command, GCC.

@ShawSumma 6 years ago

Very verbose also

@leberkassemmel 6 years ago

And he only supports ARM! And not Open Source!

@klaxoncow 6 years ago

I don't know. He's telling us what he's doing and explaining it all, so doesn't that technically make him an open source compiler?

@leberkassemmel 6 years ago

Or someone just set the -v flag.

@moshly64 6 years ago

JSR $DEADBEEF

@Xulfer 6 years ago

"...that we talked about in the caching video, many years ago" *cuts to video clip of the same shirt*

@Ozzymandiyas 6 years ago

Consistency is something that is sorely needed on YT.

@BeoandIsa 6 years ago

not the same shirt, look closer...

@ryke_masters 6 years ago

Not actually the same shirt, but there is more than a passing resemblance...

@felipemartins6433 4 years ago

_tom scott wants to know your location_

@chswin 3 years ago

That’s how you know he is the real deal…

@tarcal87 6 years ago

"using the Computerphile paper in a *radically* different orientation" such a rebel :D

@bananya6020 4 years ago

7:35

@NikiHerl 6 years ago

I have a request / maybe constructive feedback: I think it would be neat if you could update / create new Computerphile playlists. There are tons of videos I'd like to rewatch, but it's a bit of a pain to look for them one by one. Specifically, I'd want to rewatch all the explanations of exploits/security breaches, for example.

@dDAMKErkk 5 months ago

6 years ago; no reply, which amounts to saying “I am wrong”.

@Thompson8200 6 years ago

I'd love to see an explanation of 'side-channels' and how you turn a timing of a memory operation into a specific value from memory.

@talleddie81 6 years ago

The timing is not turned into a value. The timing of the operation is used to determine whether the CPU read the value from the cache or main memory.

@mduckernz 6 years ago

talleddie81 And, to add further to this, if you know when sensitive (eg. kernel) operations are being executed, you can figure out where they're actually stored, bypassing ASLR. This takes some time, as it's a pretty noisy side channel, but can be pretty effective, as it may take many such probing operations to gather data but as billions of operations are executed per second, it actually doesn't take much real time to get some interesting data, and the longer you do it, the more precisely you can hone in on your target address.

@Thompson8200 6 years ago

Since a computer might have 16+ GB of RAM how do you even start to get an idea of where in the memory you need to be looking if all you know is that it did have to hit the RAM due to the timing?

@talleddie81 6 years ago

As Matthew Ducker said, it is possible to break the ASLR. What you then can figure out is where the user data and kernel data are stored in RAM. As far as figuring out what specific data is stored at each address, that is a very difficult and complicated topic. As far as your original question, the timing is only used to determine where the data came from. Knowing that the data came from the cache can be a clue to an attacker that the data was from a previous operation. In the case of an attack, this previous operation could be a memory read forced by the attacker that should not have occurred.
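
To make the question in this thread concrete, here is a minimal simulation of how a cached-or-not timing difference can end up encoding a value. It is a toy model, not an exploit: `ToyCache`, the latencies, and the 256-line probe array are all assumptions made up for illustration, and the "victim" simply touches the line whose index equals a secret byte, which is the pattern public Spectre/Meltdown write-ups describe.

```python
# Toy model of a flush-then-time probe. Everything here is simulated;
# the latencies are assumed round numbers, not measurements.
HIT_CYCLES = 4
MISS_CYCLES = 200


class ToyCache:
    """Hypothetical cache that only remembers which lines are present."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def access(self, line):
        """Return the simulated latency of touching `line`, then cache it."""
        if line in self.lines:
            return HIT_CYCLES
        self.lines.add(line)
        return MISS_CYCLES


def victim(cache, secret_byte):
    # The secret is used as an index, so it decides WHICH line gets cached.
    cache.access(secret_byte)


def recover(cache):
    # Time every candidate line; the single fast one reveals the index.
    timings = [cache.access(line) for line in range(256)]
    return min(range(256), key=timings.__getitem__)


cache = ToyCache()
cache.flush()
victim(cache, 42)
guess = recover(cache)
print(guess)
```

So the timing itself is only a hit-or-miss signal, as the replies say; the value comes from knowing which of the 256 probed lines was the hit. A real attack also has to deal with noise, prefetchers, and repeated measurements, none of which this sketch models.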

@Thompson8200 6 years ago

Thanks for the replies!

@erikengheim1106 4 years ago

Nice job! I had to click through a few explanations before I got to this one. Went straight to the point and kept me engaged, without getting buried in technical details.

@martinkunev9911 6 years ago

Assuming integers, some more time can be saved if the multiplication is done earlier (it can run in parallel with load instructions).
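
The claim can be checked with a rough timing model. The sketch below assumes a hypothetical in-order machine with one load/store unit and one ALU, made-up latencies (load 3, add 1, multiply 3 cycles), and an expression of the same flavour as the one in the video, `a + b + c + d * e`; the exact code and timings from the video are not reproduced here.

```python
from collections import namedtuple

# Assumed toy latencies in cycles; real values depend on the CPU and the cache.
LOAD, ADD, MUL = 3, 1, 3

Instr = namedtuple("Instr", "text dest srcs unit latency")


def ld(dest, var):
    return Instr(f"LDR {dest}, {var}", dest, (), "LS", LOAD)


def alu(op, dest, a, b, latency):
    return Instr(f"{op} {dest}, {a}, {b}", dest, (a, b), "ALU", latency)


def run_in_order(program):
    """Cycles needed when instructions issue strictly in program order."""
    ready = {}                        # register -> cycle its value is available
    unit_free = {"LS": 0, "ALU": 0}   # each unit handles one instruction at a time
    next_issue = 0                    # at most one instruction issues per cycle
    finish = 0
    for ins in program:
        start = max([next_issue, unit_free[ins.unit]]
                    + [ready[s] for s in ins.srcs])
        done = start + ins.latency
        ready[ins.dest] = done
        unit_free[ins.unit] = done
        next_issue = start + 1
        finish = max(finish, done)
    return finish


NAIVE = [
    ld("r0", "a"), ld("r1", "b"), alu("ADD", "r0", "r0", "r1", ADD),
    ld("r2", "c"), alu("ADD", "r0", "r0", "r2", ADD),
    ld("r3", "d"), ld("r4", "e"), alu("MUL", "r3", "r3", "r4", MUL),
    alu("ADD", "r0", "r0", "r3", ADD),
]

MUL_EARLY = [
    ld("r3", "d"), ld("r4", "e"), ld("r0", "a"),
    alu("MUL", "r3", "r3", "r4", MUL),
    ld("r1", "b"), ld("r2", "c"),
    alu("ADD", "r0", "r0", "r1", ADD), alu("ADD", "r0", "r0", "r2", ADD),
    alu("ADD", "r0", "r0", "r3", ADD),
]

print(run_in_order(NAIVE), run_in_order(MUL_EARLY))
```

Under these assumptions the multiply-early order finishes in fewer cycles, because the ALU works on `d * e` while the load/store unit is still fetching the remaining operands. Different latencies would change the numbers, though probably not the direction.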

@larryg2320 6 years ago

Since Dr. B is right-handed I would like to recommend that the camera be located over his left shoulder instead of his right. Love the shows.

@edmundkorley8892 6 years ago

Thank you for mitigating the screeching sound of the markers!

@scatterlogical 6 years ago

I think the unfortunate situation (like any security) is that this is not a pure computing problem, but a human one. Imagine how much more efficient computers and networks could be without the overhead of dealing with untrustworthy influences. :/

@carlosgarza31 6 years ago

A hardware bug that allows user-level programs access to kernel space, or to other user-level processes' memory address space, defeats the purpose of having virtual memory security in the first place. We should all be outraged that speculative branch prediction doesn't block cache memory writes on instructions that failed the branch prediction. From what I can tell, engineers were well aware of this problem but ignored it because they assumed it would be difficult to exploit, given what seemed to be the random nature of cache page reading and writing, and because of the extra cost of blanking out a cache page or blocking the writing of that cache page during a failed branch prediction. People wanted faster recovery during a failed branch prediction for marketing their CPUs. Now they've got more marketing by allowing them to sell Spectre/Meltdown-proof CPUs.

@DavidHamby-ORF-48 6 years ago

Nicely presented. I thought of the CDC 7600 designed by Seymour Cray as you were using the Acorn RISC machine in your example. The 7600 was superscalar & pipelined with a multiply unit, divide unit, adder, load/store unit, all 60 bit floating point. Integer operations were 48 bits using the same units but exponent fixed at zero. The Fortran compiler did critical path scheduling of expression evaluation in code generation. An instruction word stack handled decode and issue. Tight loops fit in the IWS and executed without instruction fetch.

@magnum333 6 years ago

What a great channel, thank you for this.

@KnightRiderDDR 6 years ago

It's really funny how for the past 20 years no one mentioned this issue, but now when it is known the comment section of every video about Meltdown and Spectre is full of experts on the matter.

@gordonrichardson2972 6 years ago

When the x86 architecture started out more than 40 years ago, the design was entirely open, and exploiting flaws was trivially easy. Security features have been added in layers over the last few decades, while maintaining backward compatibility of instruction sets and memory addressing modes. At the same time numerous enhancements have been added, all adding to overall complexity. This is not how you would design a secure CPU from the ground up, and it does not surprise me when vulnerabilities proliferate. Trading off speed and convenience against security and robustness is seldom a winning strategy. On a personal note, some of us old fogeys were around 20-30 years ago, writing low-level machine code and understanding how the CPU worked, and well aware of (some of) the vulnerabilities.

@0xCAFEF00D 6 years ago

Well, I have a similar surprise. Not that people know about it, but that I've seen multiple new, popular, programmer-friendly sources on pipelining and how it works just this year, before Spectre and Meltdown. It's an odd coincidence and I wonder what the catalyst is. Maybe it's just me being human and seeing patterns where there are none. But CppCon had a talk covering it just now in 2017 and I can't recall any other talks that have. I've watched those a lot. I was introduced to this in 2013-2014, I think.

@gordonrichardson2972 6 years ago

One popular issue that underlies this is the simple question: Why should I upgrade to an expensive new CPU, when due to heat dissipation limits, the maximum clock speed is pretty much the same as last year's model? Moore's Law has not ended, but it continues to be implemented in ways that are not obvious to the layperson. With previous generations of processors the differences were large and quantifiable. Now it's all about cache size, incremental improvements, and reduced power consumption. IMO discussing these fundamental factors has forced the topic of speculative execution into the public consciousness, whereas it was previously known only to a limited number of geeks...

@KnightRiderDDR 6 years ago

It is kind of strange how this issue was revealed when, according to some, we have reached the limit of traditional CPUs (silicon chips). If it is not a mere coincidence, I can speculate and say that now that silicon chips can't get more powerful at the same rate as before, CPU makers will have to find another way to pitch us their new products: "Look at our new CPU. It is not more powerful than our previous ones, but it has a new architecture and is not vulnerable to Meltdown and Spectre, so you better buy it!" But this is ONLY speculation. I have my doubts that Intel would be willing to lose so much stock value over this.

@HenryLoenwind 6 years ago

This general issue has been known for a long time---cryptographic processors are hardened against it. Those things aren't used because they are faster (often they are not), but because they take extra measures against a variety of out-of-band timing attacks. This is just the first time someone looked for and found a way to exploit it on a general purpose CPU with usable results instead of just some academic "oh, interesting". (Also, add media hype.)

@sebastiankumlin9542 5 years ago

It's just amazing how much time goes into making these videos. Thank you!

@SparxableTunes 6 years ago

Dr. Bagley always delivers to the forefront of my curiosities. I hope to be an example of one of the individuals who may never see the footsteps of higher education, and who can nevertheless prove ourselves veritable complements to the field of computer science.

@tsmupdatertsm7633 6 years ago

Thanks a lot for your work! I really like those videos with Dr. Bagley. He explains everything very well. And the deep level of how computers work is very interesting.

@debanikdawn7009 6 years ago

"I'm out of order?! You're out of order! The CPUs are out of order!"

@bwzes03 6 years ago

Debanik Dawn If I was half the CPU I used to be, I'd take a pipeline to this place! Out of order ? Who do you think you are talking to? I've been around you know!

@SproutyPottedPlant 6 years ago

The Orona lift(elevator) is out of service! Out of order! press the alarm button

@leberkassemmel 6 years ago

Anyone noticed the CD hanging out of the right iMac?

@billparsons3341 6 years ago

Anyone notice that he was wearing the same shirt in the cache flashback video from a few years ago?

@kigtod 6 years ago

Bill Parsons yes - Looks like Sean's continuity briefing paid off.

@appychd 6 years ago

Very well explained

@bananya6020 4 years ago

tl;dr: optimization isn't all about using the fewest instructions, it's about using them in the right order and sometimes using a less "efficient" instruction to achieve parallelization so you can use as much of the CPU's power at once as possible.

@Revan12345678 6 years ago

Another thing that I noticed, that wasn't mentioned in the video, is that reordering the code also frees up registers for reuse. For example (6:55) load r0; load r1; add r0 = r0+r1; load r2;

@gordonrichardson2972 6 years ago

Valid point, but that opens up a whole new layer of complexity...

@Revan12345678 6 years ago

Hey, if designing a CPU was easy, everyone would be doing it xD

@KohuGaly 6 years ago

Yes, Intel CPUs can actually do this. However, they typically do the exact opposite. Consider this code: ... add r0 = r0+r1; load r1; ... Notice that the load instruction needs to wait for the add instruction to finish, because they use the same register. An Intel CPU will simply use a different free register in the load instruction and adjust the rest of the code accordingly: ... add r0 = r0+r1; load r2; ...
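
What this reply describes is usually called register renaming. A minimal sketch of the idea follows; the `rename` function, the `p0`, `p1`, ... physical register names, and the tuple encoding of instructions are assumptions for illustration, and real hardware also recycles physical registers, which is left out here.

```python
def rename(program):
    """Give every write a fresh physical register.

    `program` is a list of (dest, srcs) pairs using architectural register
    names. The result uses hypothetical physical names p0, p1, ...
    """
    mapping = {}   # architectural register -> physical register holding it now
    counter = 0

    def fresh():
        nonlocal counter
        name = f"p{counter}"
        counter += 1
        return name

    renamed = []
    for dest, srcs in program:
        for s in srcs:
            if s not in mapping:       # value that was live before this snippet
                mapping[s] = fresh()
        new_srcs = tuple(mapping[s] for s in srcs)
        mapping[dest] = fresh()        # later readers of `dest` see the new copy
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("r0", ("r0", "r1")),   # add r0 = r0 + r1
    ("r1", ()),             # load r1  (reuses r1 while the add may still need it)
    ("r2", ("r0", "r1")),   # add r2 = r0 + r1
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, the load writes to a different physical register than the one the first add reads, so the two no longer have to wait for each other, while the final add still picks up the newest copies of `r0` and `r1`.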

@JoshuaHillerup 6 years ago

I'm confused why the processor would ever do the optimizing, instead of a combination of the compiler/interpreter (for the particular bit of code) and the OS (for different processes and whatnot) doing all the optimizing, since those actually have all the information about what will be run.

6 years ago

Actually they do not have all the information about what will be run (if that were the case, it could speed programs a lot). You have to take into account the dynamic factors. Values in cache, branch prediction, utilization of individual cores (hyperthreading) etc. all affect the program execution severely and they're very hard to predict during compilation (although compilers of course try to do their best and you can help them with profile guided optimization).

@gordonrichardson2972 6 years ago

The processor is the only one that knows whether an item has been fetched from memory previously, and is in the cache, which provides a huge speedup. The compiler cannot possibly know the contents of the cache, although it should do some optimisation of its own. BTW, modern software can be rather inefficient, and if it weren't for fast CPUs, things would sometimes go very slowly...

@radarspace 6 years ago

That's exactly how Intel's Itanium CPUs work.

@zeikjt 6 years ago

If the compiler or interpreter were to try to do it, that would be what's known as a premature optimization, because you'd be optimizing for an assumed/theoretical CPU instead of knowing what it's actually capable of. It could be that your optimizations work well for a select few or even a great number of CPUs on the market today, but tomorrow will come and new CPUs will be released, and your modified code could very well perform worse on those. You should just let the CPU itself do what it knows it can do.

@JoshuaHillerup 6 years ago

ZeikJT if your compiler knows what CPU it will run on (and given the size of actual executable machine code versus the size of storage there's no reason not to include all existing CPUs), then it can target all of them. If a new CPU is built you can recompile your code to make it the most optimized.

@gideonmaxmerling204 4 years ago

with programs like these, many modern CPUs will send a few memory fetch requests one after the other. while the CPU is waiting for the memory it usually does other tasks. when the memory arrives, it might arrive out of order (out of order as in, you get b, then a, then d, then c) so it will compute the calculations by the order of arrival.

@ifell3 6 years ago

It's mind-blowing to think how much stuff is written and executed just for something easy that we all take for granted!!

@DanielMarrable 6 years ago

I would like to see him explain hyper-threading

@flyball1788 1 year ago

Spent my life on the H/W side of the fence as a developer, and have NEVER understood why problems like this, which could be addressed by having architecture-specific compilers written once and used once to generate optimised code, are always moved into H/W creating massive complexity (and hence bugs that turn up months later and can't be retro-fixed) and burning power on every single execution cycle on every single machine every single time it runs that bit of code. I agree that, usually, generalisation = slow and optimisation = complex, but surely it's only logical to put the complexity into that part of the system that can be easily changed when problems arise (as they always do with complexity) and which only entail effort/energy/time once at the start of the process. For H/W, the KISS mantra reigns supreme and complexity should be reserved for those things that can't be done up-front.

@47Mortuus 1 year ago

FYI - the way this fictional CPU executes the code also uses Instruction-Level-Parallelism. I don't think there is any useful CPU design that has either but not both, which means they go hand in hand.

@Treviath 6 years ago

Needs a follow up video on how the bugs work themselves

@nullptr. 6 years ago

Thanks for explaining how that works! great editing

@skyler114 4 years ago

Literally programming a queue problem for an assignment as I'm listening to this

@Disthron 6 years ago

Super scaler? There was a Sega arcade hardware called the Sega Super Scaler. Though I think that was referring to its ability to scale sprites though. Look at games like After Burner, Outrun and Thunder Blade. Just to name a few.

@luckyluckydog123 6 years ago

BTW I think the first Intel CPU with out-of-order (as well as speculative) execution was the Pentium Pro from 1995. The original Pentium (1993) didn't support those features, AFAIK.

@jasondoe2596 6 years ago

I think the Pro was indeed the first Intel with speculative execution, not sure about out-of-order. *edit:* apparently both

@snkline 6 years ago

The original Pentium was superscalar but didn't support OOE, that is correct. In the P5's case it had two execution units that could execute instructions in parallel, but it didn't make any decisions more complicated than "Can I execute the next instruction in the second pipeline or not". The Pentium tried to pair off instructions. Pairs could enter both pipelines, while unpaired instructions could only enter the primary pipeline.

@WanderAway 6 years ago

While we're here, may I suggest another video on how adders/multipliers are built in the CPU itself? Maybe explain the difference between ripple carry adders and carry lookaheads and that kind of stuff :D

@KipIngram 8 months ago

I think we took a misstep in processor design decades ago. Modern processors have become so complex that no one person can understand all of them (I mean really, REALLY understand - down to the gate level of what's going on in all cases). As a result, we wind up with things like Spectre/Meltdown and so on, which happen because the left hand doesn't know what the right hand is doing. What we chose to do decades ago was to add complex logic to our cores, in an effort to get them to execute code faster. We've gotten to the point where all that stuff represents more of the logic on the chips than the actual compute logic does. What we should have done instead was to embrace the multi-core idea much, MUCH sooner. We should have kept our cores dirt simple, and just piled more and more and more of them onto the chip. Use ALL of the logic for the business of computing. Of course, this would have required us to face multi-thread programming much sooner than we otherwise did, but we've wound up having to face it anyway. If we'd just swallowed that pill sooner then we would NOT have processors that no one can understand and I wager that we would have much more secure, reliable systems that didn't plague us with all of the difficulties that our current processors do. You can't really say "That wouldn't have worked as well," because we DON'T KNOW. Software would have evolved in a different way, and we don't have the software we'd have gotten from that other path, so we don't really know where we'd be on overall performance at this point. We let the tail wag the dog at every turn, though, and now we are where we are. I don't know if there will ever be a way out. Generally speaking, though, I oppose letting whatever body of legacy software we happen to have "at the moment" dictate how we design future hardware. The hardware design should lead, and the software design should follow.

@momokoko8811 6 years ago

If the assembly was originally written in the optimal order, will the CPU's useless attempt to reorder them cause an overhead?

@gordonrichardson2972 6 years ago

Not likely. During design and testing the CPU will be optimised to avoid this kind of wastage. Modern processors actually have huge amounts of overhead, but this is all geared towards the fastest outcome. Low-power alternative processors that have less overhead, continue to be available for specialised applications.

@MichaelQuantum 6 years ago

If people would compile their own software, you could do all this optimization with the compiler, and CPUs could be a lot simpler, with much less power draw, while still being just as fast in the final execution.

@awirstam 6 years ago

Maybe an off-topic question about CPUs. The question is about the time frame of around 1998 - 2006. Was the PowerPC actually faster than the x86, as Apple always stated, even though the clock frequency was a lot lower?

@Brutaltronics 6 years ago

the whole freaking system is out of order!

@RoboBoddicker 6 years ago

Cause when you stick your hand into a pile of goo that was your BEST FRIEND'S FACE, you don't know what to do!!

@alancurssow9030 6 years ago

I like this guy, thank you very much for your time - very informative

@eldebo99 6 years ago

The color palette at 5:53, the left side, with example line "01 LDR R0, a", is challenging to read by my color-deficient eyes. Please reconsider that particular font / background color combo.

@powel5451 6 years ago

William Hebert no

@colt4547 6 years ago

Excellent video. Thank you!

@velvetsniper 6 years ago

you guys really should do a video together with level1techs

@linawhatevs8389 6 years ago

12:40 actually, instruction 8 (MUL) could happen earlier, during 6 and 7. It still wouldn't be faster than the reordered code, though.

@policyprogrammer 6 years ago

At the end of this video he says something that I think is correct, but the entire tech media has gotten wrong about Spectre / Meltdown, perhaps because the people who wrote Spectre and Meltdown papers got it wrong themselves. Spectre is a class of attacks that takes advantage of speculative execution. The attack concept does NOT rely on out-of-order execution. It could very well be that OOO machines make it easier, or that only the OOO processors run far enough ahead into the speculative path to pull this attack off, but conceptually, Spectre is a speculation issue, not an OOO issue.

@gordonrichardson2972 6 years ago

Probably true, but AFAIK all CPUs that run speculative execution, also run out-of-order execution. The reality is likely to be messy...

@policyprogrammer 6 years ago

Well, in PC-land, it all went OOO with the Pentium Pro, but the Pentium Classic and its variations had a branch predictor. But it also only had a 5-stage pipeline and dual issue, only one of which could handle a load. You know that to "surface" data, the Meltdown code example requires the ability to get "far enough ahead" to do a speculative load followed by a second speculative load whose address depends on the value loaded in the first. I don't think that's possible in a short pipeline without many execution units, so older processors probably are not subject to this exploit. OTOH, there may be modern in-order processors that have deeper pipelines and are superscalar, with an LS unit and two ALUs, that could be exploited. Some of the more modern ARM processors might qualify. ARM11 implementations are 8 and 9 deep. I think most (all?) of the modern ARM "A" cores are OOO, but I would not be surprised to see that some architectural licensees have built their own cores that are deep, SS, but not OOO. In MIPS-land, it may be similar.

@pontuz2 6 years ago

Is there any overhead in the CPU by re-ordering the instructions during OOE?

@FrodorMov 6 years ago

Well the CPU, or the execution of instructions is not used for reordering. Within the CPU, obviously, some component is required to analyze instructions and dependencies to re-order them. Obviously this costs some area on the chip, and energy, but in the end it should make execution faster.

@pontuz2 6 years ago

Thanks for the reply. Now that I think about it, the overhead of a potential re-order (+ new execution time) obviously has to be smaller than the original execution time in order to actually enhance the performance.

@gordonrichardson2972 6 years ago

The main benefit of out-of-order execution is not to re-order the instructions, but to ensure that the CPU doesn't sit idle while waiting for data to be fetched from memory. In almost all cases there is something else useful that can be done, rather than doing nothing!

@xponen 6 years ago

What if we re-order the instruction ourselves? would the CPU still do the re-ordering part?

@BrianCairns 6 years ago

In short, yes. Out-of-order designs typically require *much* more die area compared to an in-order design, and they also tend to use more power. In-order designs need higher clocks to have the same performance as an out-of-order design, but they still tend to be more efficient for low-medium performance levels. For the highest performance, you just can't clock an in-order design any higher (or it becomes inefficient to do so), and an out-of-order design is better. There are a number of modern, medium-performance in-order designs for exactly this reason, most notably the ARM Cortex-A53, which is the primary core used in virtually every low-end and mid-range smartphone (because of cost). The Cortex-A53 is also paired with higher-performance cores in higher-end smartphones, which allows the higher-power out-of-order cores to shut off when the phone is idle or under light loads (ARM calls this big.LITTLE; there's also a new version called DynamIQ).

@ITR 6 years ago

So you're saying they're not CPU aligned? Do we have to talk about parallel universes?

@JoQeZzZ 6 years ago

Wouldn't it be more beneficial to do the multiplying first? Because surely a MULT takes more time than an ADD?

@mduckernz 6 years ago

Joris Not necessarily. It depends on the particular values. Some multiplications can be done in a single cycle. Notably, power-of-two multiplications (for integers, anyway) will just be converted to bit-shifts (a single cycle operation), but there are still others that may also take only a single cycle. Divisions are worse (again, except powers of two, which are just bit-shifts for integers), particularly modular division. These can take many cycles. The implementations of ALUs have many complex tricks to allow for very fast execution - I recommend reading more about them! :)
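
The power-of-two point can be checked directly. The helper names below are made up for illustration, and in practice this rewrite is typically something a compiler does (strength reduction) rather than a guarantee about any particular multiplier circuit.

```python
def mul_pow2(x, k):
    """Multiply an integer by 2**k using a left shift."""
    return x << k


def div_pow2(x, k):
    """Floor-divide an integer by 2**k using an arithmetic right shift."""
    return x >> k


print(mul_pow2(13, 3), 13 * 8)    # same value both ways
print(div_pow2(104, 3), 104 // 8)
print(div_pow2(-7, 1), -7 // 2)   # Python's >> rounds toward minus infinity
```

One caveat: in C, signed division truncates toward zero, so a plain right shift is not a drop-in replacement for negative values, and compilers emit a small fix-up around the shift.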

@nikoerforderlich7108 6 years ago

In this particular case it would! If you fetch d and e first, you can do the multiplication while a, b and c are being fetched.

@JoQeZzZ 6 years ago

Guy Maor yeah, so he showed how the processor would use OOE to speed things up. If it had been done right, it would choose to do the multiplication first in most cases (since a multiplication consists of bit shifts and adding instead of just adding 2 numbers). This would mean that at the end of the line it would have to wait on an ADD instead of a MULT, which would speed the whole process up slightly.

@thejedijohn 6 years ago

Great Video!!! I still have some questions: What part of the CPU looks at the instructions and evaluates a better order to execute them in? How does that not take more time than just executing in the order they were given? And do compilers like GCC rearrange the order first, or is it usually the cpu's job. If the C compiler does rearrange the order, can it inform the CPU that it's already been optimized, and to not waste time checking?

@PaulsPubAndBrew 6 years ago

Why wouldn't it take more cycles to analyze and determine an optimal order than you'd save by using that new order? Does the compiler that originally compiled the code handle this? Or is this truly on the fly?

@gordonrichardson2972 6 years ago

Modern CPUs have sufficiently complex hardware to analyse instructions several steps before they are actually executed. IMO the example chosen is simplistic, and not a good example of how pipelining works in practice.

@irenef22 6 years ago

Great explanation. Thanks.

@ms-ex8em 4 years ago

Did Lander have sound too?? Thanks.

@jamma246 6 years ago

My knowledge of how a physical processor actually works is low, but I am a mathematician by trade and find this optimisation procedure quite interesting. So I don't know if what I'm about to say actually makes sense. But: The two sets of instructions in this video only differed in the order of operations. The only data that seems to be needed to run the code in the theoretically most efficient way possible is what dependencies there are between the instructions; whether they can be run concurrently; and the timings that the processes take. I guess the rub is that the latter isn't really deterministic (or perhaps they are, up to a reasonable margin of error?). Still: is a simple on-the-fly optimisation (that is actually implemented at the moment) essentially one which chooses processes that allow other concurrent ones? If module A of the processor is awaiting a new instruction, then first it looks at those available, then prioritises those which allow, say, for a computation on a currently unused module B (which is perhaps prioritised as a slower component of the processor)... and so on in a similar fashion? I guess the mathematical structure I have in my mind is a kind of dependency tree which forms part of the data of the instructions, perhaps with some other weights so as to incentivise some processes (those which take place on slower components of the processor). Lots of gaps here, but I find this optimisation problem theoretically quite interesting and would like to know the current state of the art. It reminds me a lot of FP, where because of lazy evaluation you can ensure that functions are performed in an order so as to not have superfluous operations. Sounds like similar ideas could be useful here.
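
The structure described here, a dependency graph with weights, matches a textbook technique called list scheduling with critical-path priority. The sketch below is one way to write it down, under stated assumptions: two hypothetical units ("LS" and "ALU"), made-up latencies, and an instruction set for an expression like `a + b + c + d * e`. It is not a description of how any real CPU picks instructions; hardware schedulers tend to use much simpler rules.

```python
from functools import lru_cache

# name -> (unit, latency in cycles, names it depends on); all values assumed.
INSTRS = {
    "ld_a": ("LS", 3, ()),
    "ld_b": ("LS", 3, ()),
    "ld_c": ("LS", 3, ()),
    "ld_d": ("LS", 3, ()),
    "ld_e": ("LS", 3, ()),
    "add1": ("ALU", 1, ("ld_a", "ld_b")),
    "add2": ("ALU", 1, ("add1", "ld_c")),
    "mul": ("ALU", 3, ("ld_d", "ld_e")),
    "add3": ("ALU", 1, ("add2", "mul")),
}


def critical_path(instrs):
    """Longest latency-weighted path from each instruction to the end."""
    users = {n: [m for m, (_, _, deps) in instrs.items() if n in deps]
             for n in instrs}

    @lru_cache(maxsize=None)
    def length(n):
        return instrs[n][1] + max((length(u) for u in users[n]), default=0)

    return {n: length(n) for n in instrs}


def list_schedule(instrs):
    """Each cycle, every free unit starts the ready instruction with the
    longest remaining critical path. Returns (start cycles, finish cycle)."""
    prio = critical_path(instrs)
    start_at, done_at = {}, {}
    unit_free = {"LS": 0, "ALU": 0}
    cycle = 0
    while len(start_at) < len(instrs):
        for unit in unit_free:
            if unit_free[unit] > cycle:
                continue                      # unit still busy
            ready = [n for n, (u, _, deps) in instrs.items()
                     if u == unit and n not in start_at
                     and all(d in done_at and done_at[d] <= cycle for d in deps)]
            if ready:
                pick = max(ready, key=prio.__getitem__)
                start_at[pick] = cycle
                done_at[pick] = cycle + instrs[pick][1]
                unit_free[unit] = done_at[pick]
        cycle += 1
    return start_at, max(done_at.values())


starts, finish = list_schedule(INSTRS)
for name in sorted(starts, key=starts.get):
    print(starts[name], name)
print("finished at cycle", finish)
```

With these weights the scheduler fetches `d` and `e` first, because the multiply sits on the longest path, and overlaps the multiply with the remaining loads. The weights play the role of the "incentives" mentioned above; the non-determinism of real memory latency is exactly what this static model leaves out.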

@jonahansen 6 years ago

Very well explained!

@jaywye 3 years ago

How does an out-of-order CPU work? Is there a separate module that reorders instructions?

@y__h 6 years ago

On the serious note though, rather than superscalar architecture, isn't it more effective if we put two pipelines in the CPU and both of them sharing the same execution units?

@postvideo97 6 years ago

Yoppy Halilintar This is what SMT does I believe.

@galier2 6 years ago

Short answer. No.

@mduckernz 6 years ago

postvideo97 In a sense, yes, except that there is only one pipeline. While one thread has a particular execution unit tied up - say, waiting for data to arrive from main memory, which can take hundreds of operations in CPU-time... note that this would occur due to a failure in branch prediction; normally, it would have already noticed ahead of time that this data would be required and requested it in advance already, so that it would already be in cache or even a register, unless it predicted wrongly that it wouldn't be required - you can instead execute operations for a different thread that doesn't need that data, or that execution unit.

@jasondoe2596 6 years ago

Yoppy Halilintar, two pipelines sharing the same execution units is pretty much _the opposite_ of what you want, because delays during the execution and complex dependencies would "stall" _both_ of them. *edit:* Matthew is right, that's not what SMT (aka hyperthreading for Intel) does.

@jasondoe2596 6 years ago

Guy Maor, how does multicore "share the same execution units" ?!

@dharma6662013 6 years ago

Wouldn't the time taken by the CPU to re-order the instructions wipe out any time gained by being able to perform those instructions in parallel? In other words, re-ordering the instructions makes it quicker to do them, but you waste time re-ordering before you can start.

@mduckernz 6 years ago

dharma6662013 No, as this is usually performed by the decoder. The ALU and L/S units aren't yet involved. At this stage they will also perform things like checking to see whether data needed by the decoded operation requires data not in cache - if it's not, it will be prefetched, so that it is in cache when it's needed later. This is also where branch prediction comes in - if a branch hasn't been executed yet, it doesn't know whether data used by each branch will be needed, so it will gather the data for the operations involved in the branch it predicts will be taken based on previous behaviour. It may also perform speculative execution (this depends on the design of the specific CPU implementation)

@dharma6662013 6 жыл бұрын

Please forgive my ignorance, but that just seems to "kick the can down the road". Something, somewhere, has to spend time re-ordering things. The result is that the CPU can run things faster. How do we know, and how do we measure, how the time spent re-ordering compares to the time saved *by re-ordering*?

@vringar9792 6 жыл бұрын

dharma6662013 I would assume that chip designers and their respective companies have done quite some testing on this. You might want to look up which generation of chips was the first one to implement such a thing and how much faster they got.

@vringar9792 6 жыл бұрын

dharma6662013 tl;dr: thinking about how long something might take is faster than doing it.

@DFPercush 6 жыл бұрын

CPUs have an instruction prefetch where the next instructions are loaded into cache before they are executed, usually in 16-byte segments. That gets into branch prediction, and what happens if you jump to a different address. But the main takeaway regarding instruction reordering, and pipelining in general, is that it can be done _combinationally_, meaning by a logic circuit that does not use clock cycles but acts as a direct function on its own. As soon as you feed in the input, given some gate delays, the output appears on the other side. For the purposes of this discussion, just think of it as being an instant process. It's a very long and complicated "if" statement that happens all at once in hardware.
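
To put rough numbers on the question in this thread, here is a toy cycle-count model in Python. It is only a sketch, not how any real CPU is built. Everything in it is an assumption made for illustration: one load/store unit and one ALU, a 3-cycle load and a 1-cycle ALU operation, unique destination names standing in for register renaming, and an instruction sequence that is a guess at what a compiler might emit for `a + b + c + d * e`. The reordering decision itself is treated as free, in line with the "instant" hardware check described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    dest: str        # value produced (unique per instruction in this toy)
    srcs: tuple      # values that must be ready before issue
    unit: str        # "LS" (load/store) or "ALU"
    latency: int     # assumed cycles until the result is available

def run(program, out_of_order):
    """Return the cycle at which the last result becomes available."""
    ready_at = {}                      # value name -> cycle it is ready
    unit_free = {"LS": 0, "ALU": 0}    # cycle at which each unit is idle again
    pending = list(program)
    cycle = 0
    while pending:
        # In-order: only the oldest instruction may issue.
        # Out-of-order: any pending instruction may issue, oldest first.
        candidates = list(pending) if out_of_order else pending[:1]
        for ins in candidates:
            operands_ready = all(
                ready_at.get(s, float("inf")) <= cycle for s in ins.srcs
            )
            if operands_ready and unit_free[ins.unit] <= cycle:
                unit_free[ins.unit] = cycle + ins.latency
                ready_at[ins.dest] = cycle + ins.latency
                pending.remove(ins)
        cycle += 1
    return max(ready_at.values())

LOAD, ALU = 3, 1   # assumed latencies, chosen only for illustration

PROGRAM = [
    Instr("load a", "ra", (), "LS", LOAD),
    Instr("load b", "rb", (), "LS", LOAD),
    Instr("add",    "t1", ("ra", "rb"), "ALU", ALU),
    Instr("load c", "rc", (), "LS", LOAD),
    Instr("add",    "t2", ("t1", "rc"), "ALU", ALU),
    Instr("load d", "rd", (), "LS", LOAD),
    Instr("load e", "re", (), "LS", LOAD),
    Instr("mul",    "t3", ("rd", "re"), "ALU", ALU),
    Instr("add",    "t4", ("t2", "t3"), "ALU", ALU),
]

in_order_cycles = run(PROGRAM, out_of_order=False)
out_of_order_cycles = run(PROGRAM, out_of_order=True)
print("in-order:", in_order_cycles, "out-of-order:", out_of_order_cycles)
```

In this model the out-of-order run finishes in fewer cycles, but the gain is small because the single load/store unit is the bottleneck. Changing the assumed latencies or adding units changes the picture, which is roughly how such trade-offs are explored in simulation.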

@mrblue728 6 жыл бұрын

This is such relaxing stuff for my high-level language oriented brain.

@TheDuckofDoom. 6 жыл бұрын

And now we move to multi core cache management and prefetching?

@peterbustin2683 5 жыл бұрын

Really very interesting! Thank you..

@joshhayes3433 6 жыл бұрын

Having a link to the caching video would be pretty cool.

@gordonrichardson2972 6 жыл бұрын

It's from 2015: kzbin.info/www/bejne/bHvTfXdphbp0kM0

@Computerphile 6 жыл бұрын

+Josh Hayes kzbin.info/www/bejne/bHvTfXdphbp0kM0

@retop56 6 жыл бұрын

Great video.

@Johanniscool 6 жыл бұрын

The best part is when he uses the computerphile paper in a radically different orientation

@qwmf05gcpt42 6 жыл бұрын

How will they make future CPUs?

@ms-ex8em 4 жыл бұрын

Hello did Lander ever have sound at all?? Thanks.

@dichebach 6 жыл бұрын

Interesting stuff!

@ThorkilKowalski 6 жыл бұрын

I think the 386 was the first commercial superscalar processor.

@mikeklaene4359 6 жыл бұрын

Speculative execution is NOT the problem. The fact that another process can access the results of the execution IS the problem. The WALL between separate processes is not being enforced.

@gbhall 6 жыл бұрын

Hmm interesting, I was unaware of the actual implementation of orders.

@sicksock435446 6 жыл бұрын

This video taught me how to play the game Silicon Zeros...

@Roxor128 6 жыл бұрын

Thanks for reminding me I need to put in some more time on that. I'm still in the early piece-of-cake puzzles. Well, they certainly are compared to where I got up to in TIS-100 and Shenzhen I/O.

@richardmiklos 6 жыл бұрын

Can the clones execute Order 66, while the CPU is executing these instructions? I mean they don't depend on each other or anything.

@antoineroquentin2297 6 жыл бұрын

Joke's on me, I'm still using an in-order CPU (D2700)

@rcookie5128 6 жыл бұрын

Super informative!!

@kevincozens6837 6 жыл бұрын

Nice explanation of "out of order" execution. I knew you were going to make one minor mistake, not that it matters for the point discussed in this video. You threw in multiply as the operation before that last variable. You didn't take into account the typical order of operations. The multiply would be executed first.

@simonnomis123321 6 жыл бұрын

Shouldn't you run B,C, and D at the beginning so the multiply can run at the same time as W?

@RPG_ash 6 жыл бұрын

Very interesting, thanks.

@KX36 6 жыл бұрын

You're out of order! You're out of order! The whole CPU is out of order! They're out of order!

@halistinejenkins5289 6 жыл бұрын

a man's man

@av733 6 жыл бұрын

All of that just to say the main point at the end?! I was lost for the first 10 minutes.

@HerrLavett 6 жыл бұрын

Nice! Thank you!

@VivekYadav-ds8oz 3 жыл бұрын

Does the CPU decide all this in real time? How does it do all this?! Isn't it just supposed to be an electromechanical part? If no software intervention occurs here, this might as well be black magic to me.

@gogokowai 3 жыл бұрын

I have the same question. I'm having trouble imagining how it could possibly be faster to make a bunch of checks on multiple instructions and cache states than it would be to just perform the add/multiply.

@lucianodebenedictis6014 6 жыл бұрын

Take a shot every time he says "load store unit"

@wherestheshroomsyo 6 жыл бұрын

4:20 that "c" is moving! What? Did that happen in editing?

@-42-47 6 жыл бұрын

Interesting, though it sounded like CPUs were out of order rather than (still) being out of order.

@isaak.studio 6 жыл бұрын

Is it (a+b+c+d)*e or a+b+c+(d*e)?
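
Assuming the expression on the paper is written as `a + b + c + d * e` (that exact spelling is an assumption here), C gives `*` higher precedence than `+` and groups `+` left to right, so it means `a + b + c + (d * e)`. Python follows the same rule for these two operators, so its parser can stand in as a quick check. The helper name `grouping` is made up for this sketch.

```python
import ast

def grouping(expr: str) -> str:
    """Return expr with every binary operation fully parenthesised."""
    symbols = {ast.Add: "+", ast.Mult: "*"}

    def show(node):
        if isinstance(node, ast.BinOp):
            op = symbols[type(node.op)]
            return f"({show(node.left)} {op} {show(node.right)})"
        if isinstance(node, ast.Name):
            return node.id
        raise ValueError("only names, + and * are handled in this sketch")

    return show(ast.parse(expr, mode="eval").body)

grouped = grouping("a + b + c + d * e")
print(grouped)   # (((a + b) + c) + (d * e))
```

The multiply is bound to `d` and `e` only, so it does not depend on any of the additions and can be done whenever both operands have been loaded.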

@avrohomhousman5958 4 жыл бұрын

is this the same as pipelining? It sounds very similar.

@MatkatMusic 6 жыл бұрын

man, talk about a fantastic breakdown of the topic!

@rafaelrui7457 6 жыл бұрын

Do you have a PATREON page to collaborate?

@gordonrichardson2972 6 жыл бұрын

I found the code in this example a bit simplistic; it just covers simple pipelines of algebraic instructions. The real meat of speculative execution comes with branch prediction and caching, and this is where the whole Spectre issue pops up. These complexities are only briefly hinted at in the last 30 seconds of the video.

@CGoody564 6 жыл бұрын

It still introduces the basic premise of how it functions. Complication is usually not a great way to introduce an idea to those who are unfamiliar with it.

@p_serdiuk 6 жыл бұрын

He has already done a video detailing the exploits themselves.

@vladomaimun 6 жыл бұрын

They should make a part 2

@NikolaosSkordilis 6 жыл бұрын

I believe he covered these in the (previous) Spectre & Meltdown video, which was specifically about speculative execution and branch prediction. The point of this video was to explain out-of-order and compare to in-order execution.

@NSLikeableHuman 6 жыл бұрын

Gordon Richardson The title talks about superscalarity, which is the first step to branch prediction and then speculative execution. Can’t fit all three successive topics in a 15-minute video! ;)

@nO_d3N1AL 6 жыл бұрын

I thought it took less than 100 nanoseconds to get data from main memory, not 200. How can we calculate this? Basing it on 4200 MHz RAM.

@overwrite_oversweet 6 жыл бұрын

For DDR4 4200 RAM with a CAS latency of 19 cycles, the time required to fetch the first word, assuming the appropriate row is already activated, is about 9 ns (19 cycles of the 2100 MHz memory clock). However, each _sequential_ word after that would only need 0.24 ns to fetch, meaning 4 contiguous words would only require about 9.8 ns and 8 would require only about 10.7 ns. Of course, if the next required word is in another column, you would have to wait the full CAS latency again, and if it's in another *row*, then you'll need to wait even longer, as your RAM will need to be issued the Precharge command, and then the Active command on the correct row before the next Read command can be issued. The ALU, OTOH, would usually only need one CPU clock cycle to complete whatever it's doing, especially for a simple operation like addition or multiplication, which is on the order of 0.24 ns. Some ALUs can even do multiple such operations in a single cycle, and if you were using floating point instead of integers, it is relatively common to do multiply and add in one operation.
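
The arithmetic above can be sketched in a few lines of Python. This is a simplified model, not a full DRAM timing calculation: it assumes the row is already open, counts only the CAS latency plus one transfer slot per additional word, and ignores precharge, activate and memory-controller delays. The function name and parameters are made up for this example.

```python
def ddr_read_time_ns(transfer_rate_mts: float, cas_cycles: int, words: int) -> float:
    """Time to read `words` contiguous words from an already open row."""
    clock_mhz = transfer_rate_mts / 2           # DDR: two transfers per clock cycle
    cycle_ns = 1000 / clock_mhz                 # length of one memory clock cycle
    first_word_ns = cas_cycles * cycle_ns       # CAS latency is counted in clock cycles
    per_transfer_ns = 1000 / transfer_rate_mts  # each further word takes one transfer slot
    return first_word_ns + (words - 1) * per_transfer_ns

for n in (1, 4, 8):
    print(n, "word(s):", round(ddr_read_time_ns(4200, 19, n), 2), "ns")
```

Under these assumptions the first word arrives after roughly 9 ns. The latency a program actually sees is considerably higher, since opening a row and the trip through the cache hierarchy and memory controller are not modelled here.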

@Schnack21 6 жыл бұрын

Shouldn't we perform the multiplication first in this equation anyway?

@ulinvega 6 жыл бұрын

I didn't understand a thing but that's cool.

@ze_rubenator 6 жыл бұрын

CrashCourse has a crashcourse on computer science, they go very in depth about how CPUs work and what assembly code does, but they still keep it brief and simple enough so it's very easy to follow for the layman. Provided you have the attention span and pay attention.

@FrodorMov 6 жыл бұрын

A programmer expects the lines of code he writes to be 'executed' by the CPU sequentially. It turns out, though, that CPUs move them around and execute them 'out of order' because that's faster to do. Yet to any outside observer (the programmer) it still looks like the code (which is translated into machine instructions) happens sequentially.

@rykehuss3435 6 жыл бұрын

Fz Does this apply to, let's say, 3D applications? For example RTS video games, where some of the things need to be executed sequentially and multi-threading is of no help. Noob here.

@FrodorMov 6 жыл бұрын

OOO happens in any program that runs on a CPU that supports it (pretty much everything). In the case of a video game, which consists of CPU + GPU parts, the CPU part will therefore be executed out-of-order. FYI, this isn't something that the programmer can control. It just happens in hardware.
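
The reason reordering stays invisible can be sketched with a tiny Python example. Two instructions may swap places only if neither reads or writes what the other writes. The three-instruction "program", the value names and the `independent` helper are all made up for illustration, and real hardware tracks these dependencies in circuitry rather than in code.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def execute(program, regs):
    """Run (op, dest, src1, src2) tuples in the given order; return the final values."""
    regs = dict(regs)
    for op, dest, x, y in program:
        regs[dest] = OPS[op](regs[x], regs[y])
    return regs

def independent(i, j):
    """True if instructions i and j may swap places without changing the result."""
    _, di, xi, yi = i
    _, dj, xj, yj = j
    return di != dj and di not in (xj, yj) and dj not in (xi, yi)

start = {"a": 2, "b": 3, "d": 4, "e": 5}
program = [
    ("add", "t1", "a", "b"),     # t1 = a + b
    ("mul", "t2", "d", "e"),     # t2 = d * e, does not touch t1
    ("add", "t3", "t1", "t2"),   # t3 needs both earlier results
]
swapped = [program[1], program[0], program[2]]

print(independent(program[0], program[1]))   # first two can trade places
print(independent(program[1], program[2]))   # the last add must wait for the mul
same = execute(program, start) == execute(swapped, start)
print(same)
```

Both orders leave the same values behind, which is all the programmer can observe.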

@ACTlVISION 6 жыл бұрын

Same and I had a final exam on this a month ago...

@RowenStipe 6 жыл бұрын

7:35 We've gone from landscape to portrait !

@terrahertz5284 6 жыл бұрын

I didn't see any initial Clear Carry.

@HappyBeezerStudios 6 жыл бұрын

For the code shown here, a second load/store unit would speed up the execution immensely.

@StanislavPozdnyakov 3 жыл бұрын

Did he say that the ARM architecture is assumed?

@DanRoxtar 6 жыл бұрын

Damn that shirt is fly

@MrKinir 6 жыл бұрын

His shirts are magnificent. He's the embodiment of British style. Like, weird funky shirts. Reminds me of James May.

@SproutyPottedPlant 6 жыл бұрын

He is very fly!

@luppa79 6 жыл бұрын

If you like British guys wearing funky shirts, you should also watch Curious Droid videos.

@SproutyPottedPlant 6 жыл бұрын

Out of service out of order please press the alarm button!

@KipIngram 8 ай бұрын

This all should be done by the compiler (or the programmer) - investing logic to "correct" a less than optimal code sequence that has to be present in EVERY CHIP and operate EVERY TIME you run your program? That's just clearly not the best answer. I recognize it gave the hardware designers all kinds of opportunity to feel like they're clever, but it's a waste of resources.