How CPUs do Out Of Order Operations - Computerphile

81,012 views

Computerphile

Days ago

Comments: 221
@momomi104 6 months ago
A better example than a sausage factory is a film production: all the scenes are filmed out of order, based on locations, crew availability, etc. But as long as the editing puts the scenes in the script's order (the program), the result for the viewer will be acceptable.
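The "film out of order, edit into script order" idea can be sketched in a few lines of Python. This is only an illustration: the function name and the use of plain integers as instruction ids are assumptions, not anything from the video.

```python
def retire_log(completion_order):
    """For each finished instruction, list what can retire in program order."""
    done = set()
    next_to_retire = 0          # oldest instruction that has not retired yet
    log = []
    for instr in completion_order:
        done.add(instr)         # the instruction finished executing ("scene filmed")
        batch = []
        # Retire strictly in program order ("editing into script order").
        while next_to_retire in done:
            batch.append(next_to_retire)
            next_to_retire += 1
        log.append(batch)
    return log

# Instructions 0..4 finish executing in the order 2, 0, 1, 4, 3.
log = retire_log([2, 0, 1, 4, 3])
```

Instruction 2 finishes first but has to wait until 0 and 1 are done before it can retire, just as a scene shot early still sits in its scripted place in the final cut.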
@MattGodbolt 6 months ago
Nice analogy!
@johngaughan9399 6 months ago
Let's throw branch prediction into the mix. You have two units, each filming different scenes that take place at different times in the movie. Director/writer/producer make a script change, now the second unit has to re-record a new scene and disregard what they already filmed.
@klaxoncow 6 months ago
Ah, we'll just fix it in post, right?
@borisdorofeev5602 1 month ago
Fantastic analogy
@MattGodbolt 6 months ago
I see folks are speculating on the topic already ;)
@MattGodbolt 6 months ago
And the people posting "first", "second" are displaying... out of order ;D
@giveaway4002 6 months ago
hiii Mat, thanks for Compiler Explorer... i really love it. u r my inspiration
@MenaceInc 6 months ago
Your puns, much like nearly all of your talks and tools, are great 👏
6 months ago
The whole time you were talking about "as long as nobody notices", my brain was going "Spectre! Spectre! Spectre!" So, nice that you did indeed point it out at the end. I find it hard to explain to people who are technical enough to understand a little bit, but not deeply technical, how it is possible that the same bug existed in essentially every mainstream CPU on the planet, including both RISC and CISC ones, both big (e.g. POWER) and small (e.g. Cortex) ones, in Intel and AMD and Apple silicon, among others. And the answer is that it exploits one of the fundamental "tricks" that make *every* CPU fast, which is basically the trick you are explaining here. Of course, this is only the beginning, as the new GoFetch vulnerability, which exploits data-dependent prefetching, has shown.
@VivekYadav-ds8oz 6 months ago
WAIT WHAT THE FUDGE YOU ARE THEEE MATT GODBOLT!! WE HAVE A CELEBRITY OVER HERE OMG
@gojohnniegogo 6 months ago
Bonus points for a Winamp reference. Topical too since the source code is being released in September!
@mgancarzjr 6 months ago
It really whips the llama's....
@MAGNETO-i1i 6 months ago
What, really? Why are they releasing it?
@user-qf6yt3id3w 6 months ago
It's crazy how Tomasulo's algorithm for out-of-order execution was first used in the IBM System/360 Model 91, released in 1967, predating the microprocessor.
@logantcooper6 6 months ago
We stand on the shoulders of giants.
@frankbucciantini388 6 months ago
Years later someone finally explaining the Spectre / Meltdown bug properly. This is why software mitigations make the CPUs affected much slower: they limit if not disable entirely this kind of mechanism.
@jaffarbh 6 months ago
Simply brilliant. I would highly suggest covering Intel's "Itanium" CPU, and why it went awfully wrong.
@DubioserKerl 6 months ago
Hang on, I know that logo on the shirt.... is that...? Yes, it IS Godbolt!
@linuxguy1199 6 months ago
Men of culture, we meet again.
@bloated_complacency 3 months ago
I have taken uni courses here in the States that are not as clear and concise as this video has been. Thank you so much for this breakdown, it has added some much needed perspective into my computer science endeavors; keep up the great work!
@RKelleyCook 6 months ago
Love me the robots where W == waiting and W == working.
@MattGodbolt 6 months ago
Yeah... I didn't think that through properly did I. Last minute changes never a good idea. Hopefully you get the idea though?
@chitlitlah 6 months ago
I was waiting for P to go from processing to paused.
@IceMetalPunk 6 months ago
@@MattGodbolt Always allocate at least two bytes to your status strings 😉
@klaxoncow 6 months ago
@@IceMetalPunk Strings? How wasteful. What you need is an enum. 0 = empty, 1 = ready, 2 = waiting, 3 = working, 4 = completed, 5 = retired. Only 3 bits needed (with a couple of statuses to spare - I already used one to represent "empty", as a null reference to say "ignore this row, as we've not filled it out with data yet"). And if we need a human-readable string then we can have an array of strings: ['empty', 'ready', 'waiting', 'working', 'completed', 'retired'], using the status as the index. And you don't need the English language to start all the words you want to use with different letters at any point.
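A minimal sketch of that enum-plus-lookup-table idea in Python. The names `Status`, `STATUS_NAMES` and `describe` are made up for illustration; the numeric values and labels are the ones listed in the comment.

```python
from enum import IntEnum

class Status(IntEnum):
    EMPTY = 0       # row not filled in yet
    READY = 1
    WAITING = 2
    WORKING = 3
    COMPLETED = 4
    RETIRED = 5

# Human-readable names, indexed by the numeric status value.
STATUS_NAMES = ["empty", "ready", "waiting", "working", "completed", "retired"]

def describe(status: Status) -> str:
    """Look up the display name using the status value as the index."""
    return STATUS_NAMES[status]

# The largest value is 5 (binary 101), so 3 bits are enough,
# and the codes 6 and 7 are left unused.
BITS_NEEDED = max(Status).bit_length()
```

With this layout "waiting" and "working" are told apart by value (2 versus 3), so it no longer matters that both start with W.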
@golangismyjam 5 months ago
@@klaxoncow this is why the high-level language devs don't get much respect. They are severely lacking in fundamentals because they generally just know how to make a website.
@giveaway4002 6 months ago
please, please more cpu videos....
@godnyx117 6 months ago
Agree! Low level, hardware related videos are awesome and very valuable!
@unvergebeneid 6 months ago
The times where a sqrt would take 100 cycles are long gone. These days it's more on the order of 16-20 cycles. Still slow compared to a multiply or add but in the ballpark of a division.
@JonBrase 6 months ago
20 cycles is still enough time for ~100 instructions to pile into the ROB behind the sqrt at typical pipeline widths (plus whatever was already behind the sqrt when it started executing).
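The arithmetic behind that estimate, as a rough sketch. The issue width of 5 instructions per cycle is an assumption chosen to match the ~100 figure, not a number from the video, and the function name is made up.

```python
def instructions_piled_up(latency_cycles: int, issue_width: int) -> int:
    """Instructions that can enter the reorder buffer while one slow op executes."""
    return latency_cycles * issue_width

# Assumed values: a 20-cycle sqrt and a front end issuing 5 instructions per cycle.
backlog = instructions_piled_up(latency_cycles=20, issue_width=5)
```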
@JobvanderZwan 4 months ago
The timing of this video is interesting: Lynn Conway, the computer scientist who invented some of the fundamental techniques for out-of-order execution in the 1960s, passed away at the beginning of June.
@marcwinner567 6 months ago
Thanks so much for these videos. Matt is truly a great teacher of these concepts!
@nicksamek12 6 months ago
This series has been fantastic so far!
@SimGunther 6 months ago
Modern x86 CPUs will have a RISC kind of pipeline at the microcode level despite the base ISA not implying such pipelining. So to answer the main question, technically the CPU is doing (N cores * pipeline phases / average cycles per instruction) things all at once, but that's not a neat answer since hyperthreading is possible and not all instructions take equally long: latency from prediction rollbacks, cache locality/coherency, and write interference needs to be considered, among many other things.
@surters 6 months ago
N cores * issue width * average micro ops per instruction (max execution units) etc. etc. hundreds of limitations.
@monad_tcp 6 months ago
This concept of RISC/CISC is an outdated idea. All modern CPUs are implemented using microcode, which is technically RISC, and on top of it we have a CISC ISA; they're all hybrid. Yes, even the simpler ones, like those that use RISC-V, end up having even smaller microcode for implementation; it's just convenient.
@trevinbeattie4888 6 months ago
@@monad_tcp I wouldn't consider microcode a type of RISC since that layer typically isn't exposed at the programmable instruction level and only microchip designers who are working with something like Verilog actually use that level of coding.
@ArneChristianRosenfeldt 6 months ago
MIPS exposes the pipeline as the branch delay slot. RISC-V kicked it out again. We are back at an 8086-like ISA. MIPS already hid any data hazards of the pipeline and would rather wait for the cache or main memory. Ah, you mean that a load instruction is kinda the start of a pipeline. Next comes a compute instruction. Yeah.
@thewhitefalcon8539 6 months ago
There isn't a fixed number of pipeline phases.
@IceMetalPunk 6 months ago
As a web dev, this reminds me a ton of database transactions. You can update many columns in many rows in a database table at once, but if one fails, the transaction fails as a whole; it's "undone" because the changes never get committed to the database in the first place.
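The transaction behaviour described above can be tried with Python's built-in `sqlite3` module. This is just a sketch of the analogy: the `accounts` table and its values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        # This violates NOT NULL, so the whole transaction is abandoned,
        # including the update to alice that had already "executed".
        conn.execute("UPDATE accounts SET balance = NULL WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass

# Nothing from the failed transaction is visible afterwards.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The update to alice's row plays the role of a speculatively executed instruction: the work was done, but because the transaction never committed, no one outside ever sees it.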
@BarafuAlbino 6 months ago
What is above, so is below. What is below, so is above.
@gekko434 5 months ago
Thanks Computerphile, I've been loving this series on how CPUs work. Absolutely fascinating, even for a layperson like me.
@chaoslab 6 months ago
Coding assembly on the first few Archimedes machines was very interesting, multiple logical options with instructions (like conditional execution and barrel shifting).
@bemk 6 months ago
This is something you don't generally need, but when you do need it, it's very useful to have. Had a bug the other day in an embedded system where the file system chip would put data on the bus despite its driver never asking for it, because an RSA calculation somehow triggered a speculative read in the chip's DMA region. Debugging was infuriating, 'cause every time you'd change some code to do an inspection, or would even interfere with a debugger, you'd interfere with the pipeline, causing the bug to disappear. Only reason we ended up finding the solution was due to some very kind and skilled people in some forums all over the internet and an erratum for the chip.
@andrewharrison8436 6 months ago
Parallel processing can be "fun" to debug. I once had an exit process that started failing. It ran perfectly in debug because waiting for me to respond to the prompt gave the parallel clean up process time to complete. My fix was a bodge: I shuffled the order of my code till it worked. My best guess as to the cause of the failures was that a release of the compiler or operating system had done something to the timings.
@bemk 6 months ago
@@andrewharrison8436 been there, done that, joined the club, got the t-shirt. At least it's a problem you can solve with some semaphores though... That said, the feeling you get when you actually solve an issue like this. Very satisfying
@zwanz0r 6 months ago
Very nice episode! Great explanation with the todo board 😊. A great follow-up would be how modern CPUs prevent Spectre-like attacks, because I assumed the cache would also be reset after branch prediction failed.
@Evan-bjc4w 6 months ago
"Well I feel like this is a i don't know if you've ever watched Qi where the big bell goes-" *"BRRRRRRRR"*
@RecycleBin0 6 months ago
lego can do almost anything
@Evan-bjc4w 6 months ago
@@RecycleBin0 It's GR8BRIK's old logo
@goshisanniichi 6 months ago
I remember from a computer engineering course long ago, that there was a branch "prediction" scheme with superscalar processors where the prediction part was skipped altogether. Because the processor was capable of doing multiple things at once, it could just process both branches simultaneously and throw out the one that wasn't needed at the point when the branch was finally processed. I don't think it was actually ever used and if it was then not much because actual attempts at prediction are still better in most cases.
@techmage89 6 months ago
Some GPUs actually do this! They run hundreds or thousands of threads over the same code, but execute groups of threads in lockstep, so if a group of threads encounters a branch and they don't all take the same path, they will all run both branches and then mask out the results from the wrong branch. It's basically a way of fitting branching into a SIMD pipeline.
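A toy version of that "run both branches, then mask" approach, using plain Python lists as stand-ins for the lanes of a thread group. The function name and the absolute-difference example are assumptions for illustration; real GPUs do this in hardware with an execution mask.

```python
def predicated_abs_diff(a, b):
    """Lane-wise |a - b|, computed by running both branches and masking."""
    mask = [x >= y for x, y in zip(a, b)]        # the branch condition, per lane
    taken = [x - y for x, y in zip(a, b)]        # result if the condition holds
    not_taken = [y - x for x, y in zip(a, b)]    # result if it does not
    # Select per lane; the result from the wrong branch is simply discarded.
    return [t if m else n for m, t, n in zip(mask, taken, not_taken)]

result = predicated_abs_diff([5, 1, 7], [2, 4, 7])
```

Every lane pays for both subtractions, which is the throughput cost of keeping all lanes in lockstep.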
@jeromethiel4323 6 months ago
Winamp, it whips the llama's ass! ^-^ What a blast from the past.
@scaredyfish 6 months ago
I like the desktop background - where's it from?
@morwar_ 4 months ago
One time I asked a question on Matt's YouTube channel and it was answered.
@jeromethiel4323 6 months ago
I just learned not that long ago, that Cray computers invented a lot of what you were talking about in this video. Multiple streams, pipelining (not the idea, making it better), out of order execution, et al. It's why Cray was able to hold on to the title of supercomputer for so long. They were doing things nobody else at the time could. And now it's part of pretty much every modern processor. It was a heady time.
@JonnyPowell 6 months ago
it was a Crayzy time
@johngaughan9399 6 months ago
For much of computing history, microcomputers (i.e. Intel+AMD) followed the lead of mainframes (i.e. Cray+IBM). Modern NASes allow the user to hot-swap hard drives. Back in the 1970s, mainframes allowed the user to hot swap CPUs and RAM. I still can't do that in an AMD64 system in 2024.
@R.B. 6 months ago
I can see how Cray multicore systems could schedule a process across multiple cores, but I don't think it is the same as multicore CPUs of today. Memory access and caching would be significant in an SMP way. At the processor level you can take advantage of shared L1 cache on a single thread. I think Cray was more about distributed threads, but each thread wasn't context switching. What's interesting to me about this is that it suggests that the core of a CPU could get into a deadlock at a lower level than potential software deadlocks between threads.
@Darkknight512 6 months ago
This really kicks the llama's ass.
@loudej 6 months ago
Well and truly kicked indeed
@masterdjon 4 months ago
Really great video and really great series. I would like to submit a suggestion: when talking about parallelism, I think you would have been a bit easier to follow with a tick/frame number. I've been following you for years. Keep up the great work!
@Omnifarious0 6 months ago
5:00 - I think the first to do "pseudo-assembly" was Knuth with MIX. Though, his "pseudo-assembly" was perhaps more rigorous than what you intended.
@unvergebeneid 6 months ago
14:31 "Honey, why are you shouting 'Pee! Pee!' into your phone?" "No, mum! I'm shouting 'p' _at_ my phone." "Well whatever it is you do with pee and your phone, it is rather odd, isn't it, and I'd prefer for you to do it in your room."
@TymexComputing 6 months ago
No Byte Bug puzzle here? Jane Street has addressed today's Numberphile BugByte puzzle 1-24 as Computerphile puzzle.
@TymexComputing 6 months ago
src="/static/img/new/computerphile-puzzle-2024.png"
@stephenwhite506 6 months ago
In a single cycle, modern CPUs are performing thousands of XORs in parallel for every cache line check and can do this in parallel for each of the instruction, data or address translation caches. So, the answer should be thousands.
@adfaklsdjf 6 months ago
thanks for mentioning spectre at the end :)
@axelBr1 6 months ago
It's amazing what can be implemented using logic gates etched into silicon.
@MissNorington 6 months ago
CPUs are actually much faster than in this example. They can see ahead of time that you are squaring the numbers, so there is no need to square root at the end. The robots inside the CPU are probably trained with captcha as well, which is still legal when this comment was written
@FutureAIDev2015 6 months ago
Jump instructions have entered the chat
@esra_erimez 6 months ago
😂
@ScottLovenberg 6 months ago
I know Goto when I see their alt!
@veers0r 6 months ago
Love the compiler explorer. :)
@YaofuZhou 6 months ago
I guess if you throw security considerations into the mix, things quickly become super complicated ;)
@eliasross4576 6 months ago
Interestingly for concurrent programming, out of order operations like stores can be visible to other threads since the pipeline executor won’t have visibility into what another pipeline executor is doing.
@MattGodbolt 6 months ago
It very much depends on your CPU architecture. x86 makes some pretty strong guarantees (for normal loads and stores). And speculative stores shouldn't be visible to other threads under any circumstance that I'm aware of.
@The_Pariah 6 months ago
To this day, I STILL use WinAmp. Best music player ever.
@user-dv5gm2gc3u 6 months ago
Once heard Jim Keller talk about this in an interview. Kinda insane what's happening in a CPU nowadays.
@MartinLindsay 6 months ago
One nitpick on the description of the "retirement" stage you described, and I apologize as I realize it may be a detail intentionally not visited here. While it is simple to say that it goes down in order committing things back out of the cpu, it only prevents committing if there is an unresolved conditional branch before the instruction. Things otherwise can commit back to main memory or cache out of order, and is one reason that atomics use acquire/release semantics to assert control over the ordering.
@rogerlevasseur397 6 months ago
Let's dive into Very Long Instruction Word (VLIW) computer architecture (Intel's Itanium and Multiflow's minisuper), where the compiler determines from its analysis which instructions run in parallel together. This simplifies the pipeline.
@borchen0 6 months ago
Do modern compilers produce machine code that helps the CPU with this process? A sort of preprocess, so less stuff has to be undone?
@marsovac 4 months ago
A compiler does not have the runtime values that trigger branching. Also there is no mechanism in the ISA to tell the CPU "you're most likely to go this way" even if you knew it. But the CPU does keep track of it in repeated branches by itself; that is what branch prediction does.
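One classic textbook way for hardware to "keep track of it in repeated branches" is a two-bit saturating counter per branch. The sketch below is that textbook scheme, not a description of what any particular CPU does (real predictors are far more elaborate); the class and function names are made up.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 2  # start at "weakly taken" (an arbitrary choice)

    def predict(self) -> bool:
        return self.counter >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the actual outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

def count_mispredictions(outcomes) -> int:
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch: taken 9 times, then falls through once; the loop runs 3 times.
loop_branch = ([True] * 9 + [False]) * 3
misses = count_mispredictions(loop_branch)
```

Because one surprise only nudges the counter from "strongly taken" to "weakly taken", the predictor gets the loop exit wrong once per loop but does not also mispredict the first iteration of the next run.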
@henrycobb 6 months ago
When a machine is out of order it needs to get fixed. C.f. the Spectre and Meltdown bugs.
@Samhain__UK 1 month ago
I was hoping for more in this series. Anything in the (OoO) pipeline?
@unvergebeneid 6 months ago
You just know he chose a whiteboard because he was under the illusion that he'd just erase the old values in the cells instead of making a complete mess by repeatedly crossing things out 😄
@discoisdead8504 6 months ago
Nice vid about cpu internals 👍
@davidhand9721 6 months ago
How is it writing values to that table, then checking those values to see if it's ready, without going through at least one clock cycle? How are they managing synchronization?
@ukyoize 6 months ago
I wonder why there isn't "Don't predict this branch, or predict it THIS way" for security.
@froop2393 6 months ago
Where can I get such a cool compiler explorer hoodie? btw: great tool, really love it!
@Kalernor 6 months ago
Which area of computer science is this? Where can I learn more about it? Also, what is this topic called? Out of order operations? Again, where can I learn more about it?
@axelanderson2030 6 months ago
Google and literature piracy
@Modvivek 6 months ago
Amazing video ❤
@Razzbow 4 months ago
Yeah, a CPU can only do one thing at a time. That's why modern CPUs are composed of many separate processors which themselves break tasks down into primitive blocks which can be parsed and dispatched in parallel.
@kayakMike1000 6 months ago
Depends on how many hardware thread cores it has. RISC-V calls these harts.
@EdgyNumber1 6 months ago
Remember the early days when XBox programmers slated PS3 CELL because it was 'too slow?' It took a while to shake off the old way of doing things and really jump on board the idea of parallel computing. Are modern x86 processors able to handle true parallelism these days or have they simply had to bodge something extra onto what is ancient architecture?
@Ascended_BUP 19 days ago
How do the operations get decoded faster than the robots are carrying them out? I thought all operations in a computer happened more or less at the same speed as dictated by the clock cycle (with the exception of things like a square root, for example, which is actually many operations one after the other). I haven't watched the other videos in this series, so apologies if my question was already answered in one of them.
@RealCadde 6 months ago
The answer to the question changes based on definition. If you subscribe to the notion that computers can do as many things as there are cores in the system, then computers can do thousands of things every clock cycle, assuming the computer has a graphics card with that many cores. If you also subscribe to the notion that each core does many things at the same time on every clock cycle, well then you have many thousands of things happening every clock cycle. And if you then look at a "super computer" (a rack of computers all tied together), well, now they do millions to billions of things, billions of times per second. But realistically, computers (it doesn't matter how many cores they have, how many memories they have, how their pipelines are set up) can only really do ONE thing per clock cycle. At the end of the day, there NORMALLY is a single INPUT and a single OUTPUT. You can do one INPUT every clock cycle and you can do one OUTPUT every clock cycle. Yes, there are many cases where inputs and outputs are completely separated from each other. They don't cross-communicate in ANY way, they could just as well be separate computers doing separate things. But then I'd argue that you should have asked "how many things can N number of computers do" and the answer would be N things. So WHY can't many inputs and outputs be handled by a single computer at the same time, all the time? Because the inputs and outputs, somewhere in the process, need to "talk" to each other or the operating system or a shared storage or the network. Once they do, they are bottlenecked by that interaction. Just because a modern computer can work on many different tasks at the same time, it still isn't truly completely parallel. The universe is truly parallel, there are no bottlenecks in that sense. So are, I would argue at least, quantum computers. Digital computers are only multitasking MID process, but at the end of the day everything comes together at a single input and output point.
@bity-bite 6 months ago
If a CPU had two cores, it means it has four threads, which means it can do 4 things at the same time, right? Would it mean the CPU accepts 4 inputs and with 4 outputs?
@landsgevaer 6 months ago
A computer can simultaneously display a cursor, rotate its fan, make a mouse click noise, stimulate my retina, heat the room, annoy the cat, support a geranium, etc...
@RealCadde 6 months ago
@@landsgevaer In a time slice...
@RealCadde 6 months ago
@@bity-bite A CPU can have 1, 2, 3, 4... cores. Doesn't matter though because it still depends on definition. I see each core as a separate computer.
@trevinbeattie4888 6 months ago
If you go to the microarchitecture level (as suggested in the video), a single CPU core can do as many operations as it has processing units for. If you go down further to the transistor level, a CPU can do millions of gate operations at once. It all depends on what “things” you’re counting.
@shadamethyst1258 6 months ago
Obviously when the branch prediction fails, a lot of these instructions need to be thrown out and redone, but is there anything stopping a CPU core from taking instructions from both branches, adding it to this out of order table, and letting both get executed until it knows which branch is taken and can evict all of the instructions from the wrong branch?
@nayjames123 6 months ago
Given you'd have to create a new instruction stream for every branch you see, and how frequently you come across branches, you'd end up needing a huge number of instruction streams that are nearly all going to be discarded. The cost of adding all these streams is also going to be very high and cause bottlenecks. To help throughput when there are branch mispredictions, CPU vendors created things like hyper-threaded cores, where if one stream of instructions has a mispredict, the other stream will have less contention for execution units and so will likely be able to run a little quicker.
@jeromethiel4323 6 months ago
I thought about this a while ago, and realized that it would fail pretty quickly, just based on how often branches happen. The real issue is that the processor is already so much faster than the memory that the memory is the weakest link here, not the processor. If you tried to fill both branches of the pipeline, that's twice the memory bandwidth. If it's all in cache, you might be okay, but can you guarantee that? You cannot. Especially when your pipeline is as deep as it is in modern CPUs.
@Revoker1221 6 months ago
What you're describing is predicated execution, which tends to happen a lot in vectorised computing and pipelines (think SIMD or GPUs), where both sides of a branch are executed but only one side of the branch is kept at a time. An advantage of this approach is that you get to keep the speed boost of vectorised computing, but at the cost of reduced throughput as extra code is run. If you're interested in learning more, you can read "Conversion of control dependence to data dependence" for one of the first places where this idea was introduced and fleshed out, or for something more modern, give "ISPC: A SPMD compiler for high performance CPU programming" a look over (or any other modern-day SIMD tutorial; most SIMD these days comes with instructions purpose-built for this kind of predicated filtering).
@ArneChristianRosenfeldt 6 months ago
@@jeromethiel4323 Branch prediction also fails pretty fast based on how often branches happen. Or is there some "bunching" of successful predictions?
@henryprickett5899 6 months ago
Usually successful predictions happen in a row. 99% hit rate looks more like 1 million hits followed by 10k misses than 100 hits followed by 1 miss.
@phasm42 6 months ago
Interesting that Itanium tried to move this complexity out of the CPU and into the compiler, but ultimately keeping this stuff in the hardware (x86) won.
@ProjectPhysX 6 months ago
My computer does 3584 instructions at once. GPU SIMT magic.
@Tahgtahv 6 months ago
So, mentioned at the end of the video were issues that sounded like they might affect timing. (eg, something unexpected was in a cache, or got bumped out of a cache.) However, how do you account for actual memory side effects? For example, there are some chips that automatically advance an address on reads/writes. How do you ensure that access to such a chip is strictly in order, with no speculative/out of order access?
@henryprickett5899 6 months ago
You only write at the retire stage (writes are in order) and you cache "writes" to another buffer so that out of order reads get the right result at the right time. Beyond that, cache coherency protocols keep it safe between processors.
@henryprickett5899 6 months ago
That is to say, you cache out of order writes before retirement locally, so that out of order reads see the correct values and don't need to wait for writes to retire.
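A toy model of that local buffering, assuming a single core and a dict standing in for memory. `StoreBuffer` and its method names are invented for this sketch; it only models the logic, not timing or cache coherency.

```python
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory    # committed state, visible to everyone else
        self.pending = []       # (address, value) pairs in program order, not yet retired

    def store(self, address, value):
        # The write is only recorded locally until it retires.
        self.pending.append((address, value))

    def load(self, address):
        # Forward from the youngest pending store to this address, if any.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def retire_oldest(self):
        # Writes reach memory strictly in program order.
        address, value = self.pending.pop(0)
        self.memory[address] = value

    def squash(self):
        # Mis-speculation: drop the pending writes; memory was never touched.
        self.pending.clear()

memory = {0x10: 1}
buffer = StoreBuffer(memory)
buffer.store(0x10, 7)
early_read = buffer.load(0x10)     # sees the pending write
memory_before = memory[0x10]       # memory still holds the old value
buffer.retire_oldest()
memory_after = memory[0x10]
```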
@Lion_McLionhead 6 months ago
So what's the Apple M3 & raspberry pi 5's biggest weapon?
@NatePerdomo 6 months ago
I'm usually lost about 45 seconds into these videos. Still love 'em, though.
@surters 6 months ago
Hey, just read the entire "Computer Architecture: A Quantitative Approach" and you will understand!!!
@VandalIO 6 months ago
Is there a video on how an ALU does barrel shifting?
@AnotherPointOfView944 6 months ago
several ways, depending on the number of bits/word. Easiest way is lookup table.
@user-qf6yt3id3w 6 months ago
If you have a bunch of registers of the input shifted by 0 bits, 1 bit, up to 32 or 64 bits, and then just select the one you need, you've got a barrel shifter. It's simple but it obviously takes up a lot of area.
@VandalIO 6 months ago
@@user-qf6yt3id3w I know how that works! But how'd you implement that in pure logic? Just try designing an 8-bit barrel shifter with logic gates and multiplexers, it gets crazy complicated.
@VandalIO 6 months ago
@@AnotherPointOfView944 lookup table is cheating 😂
@user-qf6yt3id3w 6 months ago
@@VandalIO On an IC the wiring is not too bad because you can run the bit lines diagonally. Obviously you can't get around the fact that an n-bit barrel shifter needs an n*n array of cells. Or even 2n*n if you want expanding shifts, which is handy in graphics. On the other hand the cell doesn't need to be a flip-flop. Actually, if you read the Wiki article, there's a better way to do it which is better for wider devices: "The very fastest shifters are implemented as full crossbars, in a manner similar to the 4-bit shifter depicted above, only larger. These incur the least delay, with the output always a single gate delay behind the input to be shifted (after allowing the small time needed for the shift count decoder to settle; this penalty, however, is only incurred when the shift count changes). These crossbar shifters require however n² gates for n-bit shifts. Because of this, the barrel shifter is often implemented as a cascade of parallel 2×1 multiplexers instead, which allows a large reduction in gate count, now growing only with n × log n; the propagation delay is however larger, growing with log n (instead of being constant as with the crossbar shifter). For an 8-bit barrel shifter, two intermediate signals are used which shifts by four and two bits, or passes the same data, based on the value of S[2] and S[1]. This signal is then shifted by another multiplexer, which is controlled by S[0]"
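The multiplexer cascade from that quote can be modelled in software as one conditional shift per bit of the shift count. This is only a behavioural sketch of a logical left shift (a rotate would wrap the bits instead); the function name and the `width` parameter are assumptions for illustration.

```python
def barrel_shift_left(value: int, shift: int, width: int = 8) -> int:
    """Logical left shift built from log2(width) mux stages."""
    mask = (1 << width) - 1
    value &= mask
    stages = (width - 1).bit_length()    # 3 stages for an 8-bit word
    for stage in reversed(range(stages)):
        amount = 1 << stage              # shift by 4, then 2, then 1
        if (shift >> stage) & 1:         # S[stage] acts as the mux select line
            value = (value << amount) & mask
        # Otherwise this stage passes the data through unchanged.
    return value

shifted = barrel_shift_left(0b00010110, 3)
```

Each stage is one row of 2-to-1 multiplexers, so an 8-bit shifter needs three rows rather than an 8×8 crossbar. Bits of `shift` above the stage count are ignored in this sketch.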
@seanharricharan7602 5 months ago
What does he actually mean when he says he has 4 CPUs on his laptop with 10 Units each?
@ScottLovenberg 6 months ago
"I'll give myself infinite registers." - hold every possible outcome and just keep paying the memory wall tax. Now it's a physics problem.
@ScottLovenberg 6 months ago
Wait.... We can't pack them close enough to use this solution. Nevermind. I just moved the memory wall to everywhere and introduced a singularity to the least efficient solution. My bad, guys.
@solhsa 6 months ago
@@ScottLovenberg See also: SSA form, example: spir-v.
@Tony_Indiana 5 months ago
Who/where do we send our money to so we keep on getting more videos? According to my recent polling, myself and others would be most interested in LLMs. Though anything amazing is OK. We just need to know where to send the cash/xmr/"coin".
@NatePerdomo 6 months ago
What is his purple buffalo desktop wallpaper image?
@MattGodbolt 6 months ago
It's one of the default Ubuntu ones; a Minotaur I think?
@74_Green 6 months ago
@@MattGodbolt Mantic Minotaur 23.10
@3rdalbum 6 months ago
He'll want to upgrade that before too long, doesn't it go end of life in a few days?
@deltamico 6 months ago
If this parallelism is built in, does the Bend language achieve anything?
@solhsa 6 months ago
Compilers can do larger scale things, and discard operations altogether, which is way harder on this level.
@martinbakker7615 6 months ago
Hate these kinds of questions because they are never specific enough. E.g., he says in the '80s computers did one thing at a time. Reading bits out of memory still was 8 at a time.
@werds1392 6 months ago
He’s talking about one clock cycle
@ArneChristianRosenfeldt 6 months ago
Bit logic is a Vector Computer Right there. Archimedes had a 32bit ALU in 1987. And 32 bit memory bus. 68k had 16 bit in the 70s, but needs 4 cycles for everything.
@toby9999 6 months ago
Depends on how you define one thing... 8bits OR is it one byte...
@TiT8851 6 months ago
Great, but what happens to the pipeline when an interrupt occurs (like a syscall, or a time slice expiring in the OS)?
@trevinbeattie4888 6 months ago
An interrupt is basically a jump with some extra steps to save the current CPU state and switch to a new context, so I imagine it has to go through the pipeline just like any other part of the program. Whatever is currently in the pipeline may either finish going through or be discarded, depending on how interrupt handling is implemented.
@TiT8851 6 months ago
Thank you.
@dakiloth 6 months ago
Is it possible for a processor to calculate both sides of a branch?
@solhsa 6 months ago
Yes, and I think I've read about some CPU architectures doing that... might have been Itanium, but I'm not sure. It's wasteful, though.
@stevefan8283 6 months ago
7:12 but can't this example be implemented with a fused multiply-add and a hardware square root instead...?
@trevinbeattie4888 6 months ago
It’s only a model
@kevinscales 6 months ago
The CPU he is emulating doesn't have that, no
@solhsa 6 months ago
You might as well have a single instruction that calculates distance. It's just an example.
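For readers following this thread, here is a minimal sketch of why the distance example suits an out-of-order demonstration: the two multiplies do not depend on each other, so they can start in the same cycle. The operation names (`t1`, `t2`, `t3`, `d`), the latency numbers, and the assumption of unlimited execution units are all made up for illustration; they are not taken from the video or from any real CPU.

```python
# Toy dataflow model: an operation starts as soon as all its inputs are ready.
# Latencies are illustrative assumptions, not measurements of real hardware.
LATENCY = {"mul": 3, "add": 1, "sqrt": 15}

# (result, operation, inputs) for d = sqrt(x*x + y*y); names are hypothetical.
PROGRAM = [
    ("t1", "mul", ("x", "x")),
    ("t2", "mul", ("y", "y")),
    ("t3", "add", ("t1", "t2")),
    ("d", "sqrt", ("t3",)),
]


def schedule(program, latency, inputs=("x", "y")):
    """Return {result: (start_cycle, done_cycle)} assuming unlimited units."""
    ready = {name: 0 for name in inputs}  # program inputs are ready at cycle 0
    times = {}
    for result, op, sources in program:
        start = max(ready[s] for s in sources)  # wait for the slowest input
        done = start + latency[op]
        ready[result] = done
        times[result] = (start, done)
    return times


def serial_cycles(program, latency):
    """Cycles needed if every operation waited for the previous one to finish."""
    return sum(latency[op] for _, op, _ in program)


if __name__ == "__main__":
    for name, (start, done) in schedule(PROGRAM, LATENCY).items():
        print(f"{name}: starts at cycle {start}, done at cycle {done}")
    print("strictly serial:", serial_cycles(PROGRAM, LATENCY), "cycles")
```

In this model the saving comes only from overlapping the two multiplies; the add and the square root still form a dependency chain that no amount of reordering can shorten.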
@JohnSmith-op7ls 6 months ago
The correct answer is one. Just because multi-core CPUs can do more than one at a time doesn’t change that because they’re really just multiple CPUs with some shared resources, on the same wafer. It’s a more compact version of multi-processor motherboards. Each core is a CPU and doing one thing at a time. We’ve bastardized the meaning of CPU for simplicity when it comes to multi-core chips, but logically they are not a single unit, they are multiple units working in tandem.
@professorx4047 6 months ago
Well, he said computer, not CPU
@ArneChristianRosenfeldt 6 months ago
Isn’t it interesting how you can lock() an object in Java on one core, and the other cores respect this despite the widespread use of cache? How did that work with multiple CPUs, like the two SH2s in the Sega 32X? The Jaguar and the PS3 Cell only have scratchpad memory. JRISC in the Jaguar can access shared memory, but lacks instructions for locks. I think that two of the 3 (4) processors were not allowed to write to the same object. Only a producer-consumer pipeline was allowed. So, in a queue, each party only writes to its own pointer: read or write. Each party reads the other pointer to check if the queue is full or empty.
@nathanbolstad9567 6 months ago
You can only write to a RAM cell from one source. The bus gets locked. CPUs have had a test-and-set instruction basically forever. The instruction locks the bus until complete. Locks in high-level languages mark that RAM cell (logically) as volatile, which forces the CPU to skip the cache and go out to memory every time to ensure the lock can be acquired. As a result, locking has greater overhead and a greater effect than code that doesn't require locking/semaphores. Locking hurts parallelism too. As a result, in recent years, there has been a fair amount of work on basic data structures and the like that can be implemented without locks to address the pain of locking and scalability.
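The queue @ArneChristianRosenfeldt describes above, where each side writes only its own pointer and reads the other side's pointer to test for full or empty, can be sketched roughly as follows. This is only an illustration of the index-ownership idea: the class and method names are hypothetical, and Python does not model the memory barriers or atomic loads and stores that a real implementation shared between two processors would need.

```python
class SpscQueue:
    """Single-producer, single-consumer ring buffer (illustrative sketch).

    Only the producer ever writes `write`; only the consumer ever writes
    `read`. Each side reads the other's index solely to check full/empty.
    """

    def __init__(self, capacity):
        # One slot is left unused so "full" and "empty" look different.
        self.slots = [None] * (capacity + 1)
        self.read = 0   # owned by the consumer
        self.write = 0  # owned by the producer

    def push(self, item):
        """Producer side: return False if the queue is full."""
        next_write = (self.write + 1) % len(self.slots)
        if next_write == self.read:
            return False  # full: consumer has not caught up yet
        self.slots[self.write] = item
        self.write = next_write  # publish only after the slot is filled
        return True

    def pop(self):
        """Consumer side: return None if the queue is empty."""
        if self.read == self.write:
            return None  # empty: nothing published yet
        item = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        return item
```

Because neither side ever writes the other's index, no lock or test-and-set is needed for correctness in this two-party case, which is why the pattern suited hardware without lock instructions.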
@ArneChristianRosenfeldt 6 months ago
@nathanbolstad9567 volatile in C just prevents the value from being „cached“ in a register. A binary compiled for my 386sx doesn’t know about cache. The only thing is that writes to video RAM or sound queues are so large that they flush any write-back buffer. x86 has ports to signal that the value has to go out. Quake light maps were calculated on a 4-CPU Pentium board. No test-and-set instruction. So, like on game consoles of the time, lockless data structures were common. The Amiga broke the 68k TAS instruction because the blitter and the CPU were never supposed to write into the same data structure.
@illford 5 months ago
Weird quotation marks, what language uses them? Real question, not anything malicious
@gregorymorse8423 6 months ago
Square root is a one tick unary operation as fast as an add. Common misconception. Trig instructions are slowest on modern CPUs
@TheEulerID 5 months ago
It was most certainly not true in the 1980s that computers could do only one thing at a time. There were mainstream SMP computers around from the early 1960s.
@ve3uom 6 months ago
OMG I suddenly miss Winamp!
@floppy8568 5 months ago
This guy has 4 10-core CPUs in his laptop
@humansizedaperture 5 months ago
@M0rn1n6St4r 5 months ago
4 weeks since Computerphile's last video? What's going on? Are they done making videos? ¯\_(ツ)_/¯
@joshd79 6 months ago
Winamp!
@BurgerKingHarkinian 6 months ago
It really whips the llama's ass!
@nathanbolstad9567 6 months ago
And it's now going "open sauce" (sic).
@christopherlawley1842 6 months ago
Still using it
@zxuiji 6 months ago
7:07 For the game example you gave, this would actually be way more than you need. Just multiply √0.5 against t1 + t3 and you'll get a close enough (if not exact) result. √0.5 just needs to be done once at the start of the game and stored for later use.
@drooplug 6 months ago
RIP Winamp.
@nathanbolstad9567 6 months ago
Nope. It's being released as open source (recent news!) Expect it to reappear soon!
@nathanbolstad9567 6 months ago
24 Sep 2024 is the announced release date of the source!
@VaughanMcAlley 6 months ago
I wonder how much higher-level scripting languages benefit from this. Presumably a fair bit, as the interpreters are mostly just C programs…
@PhilipMurphy8Extra 6 months ago
Computerphile would never air on UK TV, it's way too smart for the TV broadcasters these days
@alex84632 6 months ago
Why don't compilers do this with CPU cores to automatically make every program multi-threaded? Too much overhead?
@solhsa 6 months ago
There is actually some research into "multi-core single threading", but as far as I know it never got anywhere.
@kazedcat 6 months ago
The problem is branch prediction. Unless you are running and calculating real values, you cannot do branch prediction at compile time.
@ElvenSpellmaker 6 months ago
Spectre and Meltdown want to fight!
@badcrab7494 6 months ago
Before watching, Speculative computation?
@MrJleonp 6 months ago
That's an acceleration method; they will probably talk about how CPUs aren't really multitasking, just assigning CPU cycles to different tasks.
@phill6859 22 days ago
The question is too vague. What do you mean by "computer" and what do you mean by "things"? Do you just mean the CPU and running instructions? Because that isn't what you asked.
@esra_erimez 6 months ago
26th?
@snowballeffect7812 6 months ago
love the QI reference lol
@dipi71 6 months ago
21:34 Of course we notice - branch prediction uses more power and generates more waste heat. It can also rip nice holes into your cyber security. It angers me to state all that in the first place.
@vadrif-draco 5 months ago
Damn, Winamp
@var67 6 months ago
Has a whiteboard which is easy to wipe, still crosses out all the old words...
@ehfik 6 months ago
Ahh, Winamp.
@love_exegence 6 months ago
He’s cute
@yootoobvyooer 6 months ago
AI will bring back VLIW.
@KingJellyfishII 6 months ago
What makes you say that?
@yootoobvyooer 6 months ago
@KingJellyfishII Efficiency. There is a cost to figuring out operation order, and doing it every time at run time is not as efficient as doing it once at compile time.
@syjwg 6 months ago
Instead of two inputs, I'm sure a computer in the future will handle three inputs. Something like 1 + 2 + 3. That would be awesome.
@pudy2487 6 months ago
Three-operand instructions are as old as x86 is, with LEA being the earliest and most important among them. Most vector instructions nowadays take 3 operands as well, so that a separate destination register may be specified.
@Disguised_Hawk 6 months ago
Please use the correct terms for things you are explaining. Like, don't call a processor that has more than one core a computer with multiple CPUs. The computer still has one CPU that contains multiple cores, which individually are hyperthreading the instructions. Not naming things correctly, or at all, makes the thing you explain very hard to relate to other information that one might have on this topic. Therefore people will learn less because they can't connect it. Naming cores CPUs makes people who already know what you are talking about either confused or suspicious of your research. PLEASE just name things the CORRECT way. Otherwise the main part of this video is better than the beginning, thankfully
@KingJellyfishII 6 months ago
The problem is that "CPU" is quite an ill-defined term. Most people use it to refer to the chip or even whole unit that contains one or more cores, but by a computer science definition each core is in fact its own CPU. It's simply an inconsistency in naming between the "layperson" and the "academic", for lack of better words.
@Roomsaver 4 months ago
@KingJellyfishII Why would the whole unit not be the CPU? It’s the central processing UNIT, after all
@KingJellyfishII 4 months ago
@Roomsaver CPU isn't defined by being a central processing unit (what does that even mean, anyway?). That's just a quick description. The reason is, in the design of one multi-core processor, each core is its own CPU. It could operate separately from the other cores as its own standalone CPU. This is done commercially to some extent: when producing a silicon die for a processor, sometimes there's a defect in one or more cores, so they're simply switched off and the die can still be used, albeit with a lower core count.
@savagesarethebest7251 4 months ago
Hyper-Threading is Intel terminology, and it doesn't really correspond to physical cores on the CPU chip.