I like the little anecdote at the end about the ray tracer and changing a test to gain a big speed boost.
@MattGodbolt · 9 months ago
Thanks! It was a real head scratcher! I tried to post a link to where I discussed it more, but it seems to have been filtered... if you look around I've got a talk where I go into more detail.
@NoNameAtAll2 · 9 months ago
@@MattGodbolt can you give the name of the video to search for?
@MattGodbolt · 9 months ago
@@NoNameAtAll2 I tried that with my first reply, which seemed to get removed, but... "path tracing three ways" should get it. The bit is near the end :)
@luketurner314 · 9 months ago
@@MattGodbolt Is it the one with v=HG6c4Kwbv4I (that's a YouTube video ID that goes at the end of the URL)?
@3rdalbum · 9 months ago
I don't write any software that's time-critical, or where there aren't a million other uncontrollable slowdowns (my stuff runs on shared cloud services), but I found this anecdote really opened up my horizons and my way of thinking about programming.
@axelBr1 · 9 months ago
In awe of the people who came up with such simple ideas to do branch prediction. And the people who work out how to build that logic in silicon, so that it can run in one clock tick, are gods!
@jeromethiel4323 · 9 months ago
A lot of this type of stuff can be laid at the feet of Cray computers. They invented a lot of this type of tech back in the 70's and 80's. The fact that this is in pretty much ALL modern CPUs is a feat of engineering. I may be wrong on details, but I am pretty sure Cray did groundbreaking work on branch prediction, deep pipelining, out-of-order execution, and the like.
@olhoTron · 9 months ago
Coming up with the idea is "easy"; actually implementing it is the hard part.
@monoamiga · 8 months ago
@@jeromethiel4323 Cray was an unquestionable genius. Often underappreciated IMHO.
@kevincozens6837 · 9 months ago
Kudos to the host for tending to ask very good questions about the topic being discussed.
@rudiklein · 9 months ago
Being able to explain a complex technical subject in a way I can understand is an amazing skill.
@aaronr.9644 · 9 months ago
I've been programming for a very long time but I didn't realise how sophisticated these branch predictors could get. The idea that it can compute a simple hash in a single clock cycle and use that to capture patterns is fascinating. Now that makes me want to go look into the details of some of these open CPU designs :)
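For anyone curious, here's a toy sketch (every size and name is invented for illustration) of the kind of table such a "hash the branch address with recent history" predictor keeps. Real predictors are far more elaborate and are built from wires, not loops:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy gshare-style predictor: 2-bit saturating counters indexed by a hash
// of the branch address and the last 10 branch outcomes. Sizes are arbitrary.
class ToyPredictor {
    std::array<std::uint8_t, 1024> counters{}; // each counter holds 0..3
    std::uint16_t history = 0;                 // recent outcomes, one bit each

    std::size_t index(std::uint32_t pc) const {
        return (pc ^ history) & (counters.size() - 1); // cheap "hash"
    }

public:
    // Predict "taken" when the counter is in the upper half (2 or 3).
    bool predict(std::uint32_t pc) const { return counters[index(pc)] >= 2; }

    // Learn the real outcome: nudge the counter, shift the history.
    void update(std::uint32_t pc, bool taken) {
        std::uint8_t& c = counters[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
        history = static_cast<std::uint16_t>(((history << 1) | (taken ? 1 : 0)) & 0x3FF);
    }
};
```

After a branch has gone the same way a couple of dozen times, `predict()` settles on that direction; the two-bit counter means a single surprise doesn't immediately flip the prediction.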
@JamesHarr · 6 months ago
This was an amazing explanation of branch prediction. I've been in tech for more of my life than not, and I've known that branch prediction was a thing but could not fathom how it worked, even after some reading online. This made it approachable. Thank you :)
@eloniusz · 9 months ago
Branch prediction is why there are a lot of algorithms that work faster on sorted data, even when the order of the elements theoretically doesn't matter to the algorithm.
@custard131 · 9 months ago
I'm sure it helps, but it's not the sole reason. One of the big benefits of sorted data is that it allows binary search. The best low-tech example would be something like a phone book or dictionary: you can jump to the middle page, know whether the thing you're searching for is earlier or later in the data, discard the other half, then repeat the process. If you had a list of everyone alive on Earth, it would only take 33 steps to look up anyone, compared to a worst case of 8 billion steps if the list wasn't sorted. Having the list sorted makes it ~200 million times faster, even without any fancy CPU tricks.
There are also things with nested loops where you can gain performance by aligning the loop with how the data is stored in memory. E.g. in graphics programming, when storing per-pixel data it can make a big difference whether you loop over it row by row or column by column. I guess branch prediction comes into that, though I thought it was more down to the memory/storage controllers than the CPU pipeline.
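The halving argument above can be sketched with a hypothetical helper that just counts probes; each probe halves the remaining range, so ~log2(n) probes suffice (33 for 8 billion entries):

```cpp
#include <cstddef>
#include <vector>

// Count how many probes binary search makes before finding the target
// (or exhausting the range). Each probe halves the search interval.
int binary_search_steps(const std::vector<int>& sorted, int target) {
    int steps = 0;
    std::size_t lo = 0, hi = sorted.size();
    while (lo < hi) {
        ++steps;
        std::size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == target) return steps;
        if (sorted[mid] < target) lo = mid + 1;
        else                      hi = mid;
    }
    return steps; // target absent: steps is the full halving count
}
```

For 1024 sorted entries this never needs more than about 11 probes, versus up to 1024 comparisons for an unsorted scan.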
@henriquealrs · 8 months ago
Anybody else amazed by the fact that Matt wrote the Fibonacci sequence in x86 and just knew the size of the instructions?
@570rm-8 · 5 months ago
love this explanation - plain and simple!
@baltakatei · 9 months ago
Side note: Branch prediction is incompatible with keeping memory secret. Disable branch prediction when handling secrets.
@pierreabbat6157 · 9 months ago
In goes branch prediction, out comes secret.
@nefex99 · 9 months ago
Very cool - great, understandable explanation!
@rafaelrezende7953 · a month ago
No flame but I do not know why the dust under the TV bothered me this much to comment. Great video!
@clehaxze · 9 months ago
I realized this is Godbolt!!!!
@ArneChristianRosenfeldt · 9 months ago
IBM managed to slow down their mainframe using branch prediction. How often do you have a JMP (else branch)? DSPs just had zero-overhead loop instructions, similar to the one in the 80186: at the start of the loop you have an instruction whose immediate value says from where to jump back to here. It only needs a single register, not a table, and works on inner loops. And then there is hyper-threading, where you fill the pipeline with the lower-priority thread instead; no need for speculation or attack vectors. The ARM7TDMI in the GBA had a flag to instruct it to follow up-branches. But how does it know there is a branch before decoding? So it still needs a cache: 1 bit for each memory address to remember an up-branch. At least this is well documented, and the compiler can optimize for it. Even with this nice prediction: why not follow both paths with a ratio, one cycle to advance this path, 3 for the other? Stop at LOAD/STORE to avoid hacks or inconsistencies. The PS3 showed the way: more cores, no CISC-like optimization per core. Similar today with GPUs and their RISC-V cores.
@MateoPlavec · 9 months ago
I'm _predicting_ that the one-character change was from a short-circuit && to a bitwise &. The former might be compiled as two branch instructions, while the latter as only one.
@MattGodbolt · 9 months ago
Bingo. Well, in this exact case a || to a |. And it wasn't 100% effective; sometimes the compiler still decided it was going to use two branches.
@custard131 · 9 months ago
I seem to have missed the original, but this guy seems great at explaining CPU stuff. Any chance of a further video about how the Spectre class of vulnerabilities fits into this? (My limited understanding is that there are a few more things going on in between, but that seems like the extreme example of branch prediction going wrong.)
@scaredyfish · 9 months ago
What I don’t quite understand, and this is perhaps because the metaphor breaks down, is what is the decoding robot actually doing? It takes a piece of information, and ‘decodes’ it into a different piece of information? But why is this information understood by the next robot where the original information wasn’t? I presume this has something to do with determining which physical circuitry actually executes the instruction, but I can’t really visualise how that happens.
@TheUglyGnome · 9 months ago
The decoder can, for example, work out which bits of the instruction are the memory address, the register address, the ALU operation code, etc. Then it forwards these bits to the right units of the processor. In some other processor implementation the decoder could just check the operation code and make a microcode jump to the microcode address handling this instruction.
@tomrollet154 · 9 months ago
It's hard to explain, as in a modern design it works a bit differently. But to keep it simple: the initial piece of information is just a pack of 1s and 0s. The branch predictor is going to predict whether it's a branch, whether it needs to be taken, and where to. It does not even have to read the 1s and 0s to know the instruction is a branch; everything is done based on previously seen branches, using tables to track things. The later decode stage is used to transform this pack of bits into a series of actions to do. This sets up what needs to be done to execute the instruction. E.g. for an addition: where to get the two values to add, where to put the result...
@MattGodbolt · 9 months ago
It's a fair question. In older chips the decoding was often straightforward, often implemented as a kind of ROM called a PLA that mapped opcodes to sets of things to do, cycle by cycle. In modern CPUs like x86s, the instructions are very complex and difficult to decode, and they actually decode into a whole other sequence of smaller operations called "micro-ops". Maybe if we get time we will go over that in one of these videos! There's complexity there too!
@kazedcat · 9 months ago
Instructions pack as much information into as few bits as possible; decoders unpack this information. For a simple CPU it does something like converting the binary-coded add instruction into an "on" signal to the execution hardware that performs the add operation. In modern CPUs, instructions are now very complex, needing multiple steps to execute, so the decoder breaks the complex instructions down into multiple simpler instructions called microcode. It can also do the reverse: fuse multiple instructions into one micro-op.
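To illustrate the "unpacking" idea with a completely made-up 16-bit instruction format (real ISAs are messier, and hardware does this with wiring rather than shifts):

```cpp
#include <cstdint>

// Hypothetical 16-bit format: bits [15:12] opcode, [11:8] destination
// register, [7:4] source register, [3:0] unused.
struct Decoded {
    std::uint8_t opcode; // which operation (e.g. 1 might mean ADD)
    std::uint8_t dest;   // which register receives the result
    std::uint8_t src;    // which register supplies the input
};

// Decoding is just pulling the packed fields apart so each one can be
// routed to the unit that needs it (register file, ALU, ...).
Decoded decode(std::uint16_t instr) {
    return Decoded{
        static_cast<std::uint8_t>((instr >> 12) & 0xF),
        static_cast<std::uint8_t>((instr >> 8) & 0xF),
        static_cast<std::uint8_t>((instr >> 4) & 0xF),
    };
}
```

So the raw bits 0x1230 come out as "opcode 1, destination register 2, source register 3"; information the fetch robot had no reason to care about, but the execute robot needs.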
@bjryan19 · 9 months ago
As a software developer I'm wondering how you optimize for branch prediction when the CPU is effectively a black box. I guess you can only speculate that you are getting branches wrong, or maybe there is a CPU facility to record branch prediction hits and misses?
@thewhitefalcon8539 · 9 months ago
Modern Intel CPUs are chock-full of performance-related counters, actually.
@Stdvwr · 9 months ago
VTune
@MattGodbolt · 9 months ago
On Linux `perf stat -- program goes here` like `perf stat -- ls -l` (or whatever). I had that cued up to demo but the conversation went in a slightly different direction :)
@mcg6762 · 9 months ago
As long as your branch has some kind of pattern to it, the CPU will do decent prediction. If the branch is completely random the CPU will miss half of the time and you are better off trying to rewrite the code to branchless, for example by using the conditional move instruction. You can often persuade a compiler to produce branchless code by using the C/C++ ? : operator in clever ways.
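For instance, something like this (a toy example; whether the compiler actually emits a cmov for any of these depends on the compiler and optimization level):

```cpp
// Branchy version: an unpredictable comparison here can cost a pipeline flush
// roughly half the time on random data.
int max_branchy(int a, int b) {
    if (a > b) return a;
    return b;
}

// The ?: form is a natural candidate for a conditional move (cmov) on x86.
int max_ternary(int a, int b) {
    return a > b ? a : b;
}

// Fully branchless: turn the comparison into an all-ones/all-zeros mask and
// select with bitwise ops (relies on two's-complement representation).
int max_masked(int a, int b) {
    int mask = -(a > b);             // -1 (all ones) if a > b, else 0
    return (a & mask) | (b & ~mask);
}
```

All three return the same result; the difference is only in how the selection is expressed to the compiler, and the masked form guarantees no branch at the cost of readability.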
@KydEv · 9 months ago
In C++ you can add [[likely]] and [[unlikely]] attributes in the code. The compiler is then supposed to do that optimization for you, if it wants to (basically).
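A small C++20 sketch of what that looks like (the function is made up; the attributes are only hints about which path to lay out as the fall-through, and the compiler may ignore them):

```cpp
// [[likely]] / [[unlikely]] (C++20) hint at the expected branch direction so
// the compiler can lay out the hot path as straight-line code. They do not
// program the CPU's runtime predictor.
int parse_digit(char c) {
    if (c >= '0' && c <= '9') [[likely]] {
        return c - '0'; // expected hot path: most input is digits
    }
    return -1; // rare error path
}
```

Note these are static, compile-time hints; as Matt mentions elsewhere in the thread, modern CPUs ignore instruction-level hints and rely on their own live history instead.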
@Illogical. · 2 months ago
I don't understand the example at the end. What was the difference between the before and after? I understand C and assembly if that helps explain it.
@hyperion6483 · 9 months ago
If we can decode that an instruction is a branch well ahead of the execution step that will decide whether to take it, isn't it possible to fill a second pipeline in parallel as soon as we know that this instruction is a branch that could be taken? Then, when we come to the execution step that decides whether to take it, we only need to decide whether to stay on the current pipeline or switch to the second one we filled in parallel.
@ArneChristianRosenfeldt · 9 months ago
Yeah, that's the only way to recover from a wrong prediction. Anything below this does more harm than good. You still don't want to leak speculative LOADs and STOREs to the outside. Memory-mapped IO?
@DemonixTB · 9 months ago
Yes, this is called speculative execution. Instead of taking one branch, the CPU executes both and discards the one it wasn't supposed to take. In today's CPUs this only happens when the CPU has no other work to do, which can be quite often when waiting for memory operations, or even just when waiting for the comparison instruction to finish, which can take a while given how deeply pipelined the CPU is.
@anata.one.1967 · 6 months ago
What happens when the predictor makes the fetcher fetch both branches? If it sees a branch at an address that is not in the table, does that speed up the processor?
@R.B. · 9 months ago
Two thoughts. First, when does it make sense to just add a couple more robots to the middle of the pipeline, so that you have two pipelines in effect? That way you aren't ever flushing; you are simply deciding which assembly line to continue processing, so you throw away some work, but it doesn't stall the process. Second, at what point will we start to see neural networks used for branch prediction? It seems like you could use backpropagation to apply weights for recognizing branch patterns.
@arghnews · 7 months ago
AMD's CPUs have used a neural network for branch prediction for a long time now, as you suggest (google it).
@vadrif-draco · 9 months ago
Is that ray tracing video mentioned at the end soon to be released? Can't find it via search by name.
@felixschwarz4699 · 6 months ago
Does that mean the first iteration of any loop is a bit slower than the following iterations?
@tambourine_man · 9 months ago
I wanna know about that black screen in the background showing followers, stocks, etc. That looks like a cool project.
@MattGodbolt · 9 months ago
It's a Tidbyt showing some standard things plus some website stats
@uttarandas · 9 months ago
Thanks, I needed this.
@pmmeurcatpics · 9 months ago
The part where the branch predictor increments/decrements the probability of each branch prediction reminded me of JITs, which were also covered recently on Computerphile. Do I understand correctly that this branch prediction adjustment happens at runtime too? Or could the program be dry-run a couple of times during the compilation process to preconfigure the branch predictor somehow? It's a fascinating piece of technology either way :)
@MattGodbolt · 9 months ago
The branch predictor is entirely live, based on the current run and history of the program. Some older Intel chips did let compilers place some branch hints, but they have been removed, as... to decode the hints you need to have already fetched and decoded the instructions... by which time it's probably too late :)
@MattGodbolt · 9 months ago
But the ideas are similar, yes. Just even more micro-level than the tricks JITs pull
@pmmeurcatpics · 9 months ago
@@MattGodbolt thank you for taking the time to answer! Have been loving the series :)
@prakharrai1090 · 9 months ago
Wonderful!
@musmuk5350 · 9 months ago
Excellent video, thank you
@SimGunther · 9 months ago
Wouldn't it be cool to submit in-memory programs to a RAM pipeline, much like shader programs can be submitted to a GPU pipeline? That might be something we have to do to prevent Spectre-like bugs by design.
@scaredyfish · 9 months ago
Programmable branch prediction? The idea makes my head spin!
@paulsaulpaul · 9 months ago
Unroll your loops?
@moritzmayer9436 · 9 months ago
Pipelining is hard stuff, but very well explained. 😊
@Zenas521 · 9 months ago
My take away: Branch Prediction: When I see this, I will give you that, noted.
@Stdvwr · 9 months ago
How do you go from 2 ifs to 1 if in 1 byte?
@mss664 · 9 months ago
&& and || operators will short circuit, which means that in expression "foo() && bar()", bar will be called only when foo returns true. Replacing them with bitwise & or | will unconditionally evaluate both sides, removing the branch. Compilers can sometimes optimize those for you, if the operations are cheap and evaluating the right-hand side won't affect the program's behavior. For example, a branch in (x > 0 && x < 10) can be optimized out, but a branch in (p != NULL && *p == 42) can't and shouldn't be, because dereferencing a null pointer would crash the program.
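A minimal sketch of that transformation (the range check is just an example):

```cpp
// Short-circuit form: if x <= 0 the second comparison never runs, which
// typically means two conditional branches.
bool in_range_shortcircuit(int x) {
    return x > 0 && x < 10;
}

// Bitwise form: both comparisons always evaluate and their 0/1 results are
// ANDed, removing a branch. Only safe here because both sides are cheap and
// side-effect-free.
bool in_range_bitwise(int x) {
    return (x > 0) & (x < 10);
}
```

The two functions return identical results for every input; they differ only in whether the evaluation of the second comparison is conditional.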
@kenjinks5465 · 9 months ago
I recall simple memories like this being used in artificial life in the 90s to find apples around trees... Animat? The MIT Artificial Life publication.
@Roxor128 · 9 months ago
NOP isn't strictly doing nothing; it does something that _changes_ nothing. On x86, NOP is equivalent to "XCHG AX, AX", which is just swapping register AX with itself. No change, but still doing something. 8 opcodes are used for instructions that swap one of the general-purpose registers with AX, one of which just happens to correspond to using AX as the nominated register, and that one gets the name NOP instead of what it actually does.
@OzeCovers · 9 months ago
Couldn’t a neural network be implemented for this? Edit: Turns out it can be: neural branch prediction
@kazedcat · 9 months ago
Yes it can. AMD is using perceptrons as fast predictors for their Zen processors. But the misprediction rates are high, so they are also supplementing them with a slower but more accurate predictor.
@deepak.rocks. · 9 months ago
Nice 👍
@prettypic444 · a month ago
“Basement full of nonsense” would be a great band name
@authentic6825 · 9 months ago
Godbolt!
@photon2724 · 9 months ago
They have basically figured out how to make a machine-learning reinforcement-learning prediction model that runs in a SINGLE tick!
@ArneChristianRosenfeldt · 9 months ago
But probably this thing again is split up into 3 pipeline stages for some reason. Like, look at MIPS and tell me how register-based instructions need more than 3 stages! MIPS says: LOAD needs exactly two more cycles and two more stages. This is obviously not correct once a cache is involved.
@yuehuang3419 · 9 months ago
It is just as likely to be off to the left, right?
@ivonakis · 9 months ago
Thank you. It's just a little less dark magic now.
@todayonthebench · 9 months ago
Branch prediction is a thing I have started considering a bit of a relic of its time. I suspect it will be gone in the near future, since it isn't actually that useful in practice since the introduction of out-of-order execution. (I also feel that this comment is exceptionally short, and only people who have thoroughly studied out-of-order execution will catch my drift. Just decode both sides; execution can't keep up with the decoder, so interleaving it for a few tens of cycles is meaningless as far as the instruction queue/buffer is concerned. One won't get bubbles during this process, and if one does, then lengthen the queue/buffer to give execution more scraps to mull over as the decoder gets ahead again.) Now, if one doesn't use out-of-order execution and has a lengthy pipeline, then yes, prediction is very useful. (Unless one also cares about constant time; then prediction and out-of-order execution are both one's nemesis.)
@MattGodbolt · 9 months ago
But out-of-order execution pretty much relies 100% on accurate branch prediction! I hope to cover that (and indeed the reason I've done BP is to lay the groundwork for future videos that cover OoO).
@Lion_McLionhead · 9 months ago
I figured they always simultaneously executed both branches until something wrote to memory or the branch was fully known.
@3rdalbum · 9 months ago
Schrödinger's CPU
@MattGodbolt · 9 months ago
Given there's usually a branch every 4 to 6 instructions and the pipeline can be tens of instructions long, it quickly gets out of hand: each branch would bifurcate again and again... it's better (currently!) to guess and commit to the guess.
@zxuiji · 9 months ago
That ray-trace thing is better done with a collision map though? You're already drawing every object into 3D space; just note the id of a triangle in a collision map for it and have the ray look up the cells directly. There's no comparing of "is it to the right or left", it's just "what do I load here?", where the default id (0) just loads a value with no effect against the light.
@aidanthompson5053 · 9 months ago
2:03
@sophiamarchildon3998 · 8 months ago
11:00 That's currying.
@KipIngram · 9 months ago
I don't like your example: the predictive robot should be able to recognize an UNCONDITIONAL jump. I feel like that should be within the capabilities of a fetch unit. Unconditional calls as well. I understand calls raise some delicate issues, but after all, the fetch unit is the one that knows what the return address is going to be. The execute unit shouldn't have any awareness at all of where in memory the instructions it's executing have come from. In a properly "clean" design that would mean the fetch unit would "own" the return stack. Modern software strategies make that problematic; just one example of how we haven't followed our best possible path. We really shouldn't be mingling "fetch-relevant" and "execution-relevant" information in a single data structure.
@ryan-heath · 9 months ago
I think I missed how it is known that the prediction failed. Who is keeping tabs on the predictor? 😅
@thewhitefalcon8539 · 9 months ago
When the execution step at the end of the line processes the branch instruction properly, it compares the proper answer to the prediction. If they don't match, then it pulls the horn and dumps the conveyor belt, same as before.
@ryan-heath · 9 months ago
@@thewhitefalcon8539 yes, I have seen the vid but it still doesn't click. The predictor can give the wrong address to go to based on previous behavior. Is code being executed/evaluated before it is really executed? If you catch my drift 😅 The first 100 times it predicted right, but the 101st time its prediction is wrong. Which address is being executed at that specific time?
@MNbenMN · 9 months ago
@@ryan-heath I don't think of it as executing an address directly, ever; it's executing whatever is in the pipeline (presuming the pipeline is loaded correctly). The steps are abstracted so the CPU can proceed faster from the cached instructions in the pipeline, not pulling from an addressed memory location, which would take longer to fetch than it does to execute, IIRC. It is the jump instruction being executed that reveals whether the pipeline has loaded the correct prediction. In the infinite-loop example it can't be predicted wrong after 100 loops, so that example doesn't directly address this; but if it was a conditional branch instead of an unconditional jump, it would be the execution of the conditional branch that reveals whether the prediction is correct.
@ryan-heath · 9 months ago
@@MNbenMN hmm, I think I get what you are saying. So the cache contains the instruction and the address it was loaded from. The branch instruction can now check whether the needed address is already loaded in the cache; if it is not, the prediction was at fault.
@MNbenMN · 9 months ago
@@ryan-heath That sounds about right for the extent of the explanation in this video, as far as I understand it. However, modern implementations of branch prediction and caching are more sophisticated and complex, with parallel threads, to the point of unintentionally introducing the Spectre exploit vulnerabilities, and I am no expert on CPU architecture at that level of detail.
@muhammadsiddiqui2244 · 3 months ago
The sound of the markers is "awesome"
@tiagotiagot · 9 months ago
I wonder how many years until this task is done by a built-in LLM-like predictor that is training in real time or one/few-shotting it....
@orlandomoreno6168 · 9 months ago
An LLM is overkill. You can embed a NN and do backpropagation / Hebb's rule in hardware.
@tiagotiagot · 9 months ago
@@orlandomoreno6168 Next-token prediction seems like the perfect skill for this task; at the speed things have been progressing, it should be a matter of years at most before LLMs can predict CPU operations faster than CPUs can run them natively. I forget which one it was, but recently one of the normal LLMs trained on human language was shown to be able to learn machine code from in-context demonstrations, and demonstrated the ability to replicate the behavior of a Turing machine. Imagine what one trained specifically on CPU operations, running on a specialized ASIC, might achieve in a few years. Edit: I found it, I think; it was Claude 3 Opus.
@MoonCrab00 · 9 months ago
If a human could read the matrix like Neo he would be the closest.
@bobhadababyitsaboy5765 · 4 months ago
Here after the Windows update with AMD branch prediction optimizations.
@BooleanDisorder · 9 months ago
I'm interested in so many odd subjects. 😢
@itzhexen0 · 9 months ago
They have 2.39 Million subscribers.
@saurabhjha8733 · 4 months ago
First robot has sharingan
@itzhexen0 · 9 months ago
Oh look it predicts just like a human does. Must be sentient.
@thewhitefalcon8539 · 9 months ago
There are lots of different ideas to make predictors, from simple to complex. The most basic ones just have a table of like 16 slots and they write down which direction the branch went last time and overwrite the oldest one. Some AMD CPUs use small neural networks.
@omegahaxors9-11 · 9 months ago
AI researchers be like
@IceMetalPunk · 9 months ago
It doesn't "predict just like a human does". Neural networks *do* predict like a human does. No need to be sassy just because you don't understand the differences in their mechanisms.
@itzhexen0 · 9 months ago
@@IceMetalPunk It's a joke. Anyone who predicts is basically a crazy person.
@itzhexen0 · 9 months ago
@@IceMetalPunk Also I can be however I want. If you don't like it then that's your problem. You don't know me.
@The_Pariah · 9 months ago
Not a great video... I gave up on it before halfway. I just don't like how this guy is trying to convey his message.