CppCon 2017: Chandler Carruth “Going Nowhere Faster”

101,165 views

CppCon
CppCon.org
-
Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/CppCon/CppCon2017
-
You care about the performance of your C++ code. You have followed basic patterns to make your C++ code efficient. You profiled your application or server and used the appropriate algorithms to minimize how much work is done and the appropriate data structures to make it fast. You even have reliable benchmarks to cover the most critical and important parts of the system for performance. But you're profiling the benchmark and need to squeeze even more performance out of it... What next?
This talk dives into the performance and optimization concerns of the important, performance-critical loops in your program. How do modern CPUs execute these loops, and what influences their performance? What can you do to make them faster? How can you leverage the C++ compiler to do this while keeping the code maintainable and clean? What optimization techniques do modern compilers make available to you? We'll cover all of this and more, with piles of code, examples, and even a live demo.
While the talk will focus somewhat on x86 processors and the LLVM compiler, everything will be broadly applicable, and basic mappings for other processors and toolchains will be discussed throughout. However, be prepared for a lot of C++ code and assembly.
-
Chandler Carruth: Google, Software Engineer
Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
-
Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments: 81
@eauxpie 6 years ago
If only he dug into that last question, we might have known about Spectre that much earlier.
@henke37 4 years ago
Maybe not spectre in general, but certainly the variant with stale cache entry loads.
@VivekYadav-ds8oz 2 years ago
What a butterfly effect would that have been lol
@mywtfmp3 6 years ago
Chandler helps me realize that I know nothing about benchmarking.
@osere6432 3 years ago
In his 2018 talk he discusses the last question in great detail
@andik70 3 years ago
Do you have a link?
@shadowtuber1589 1 year ago
kzbin.info/www/bejne/lZeagGZ_m66FaM0
@kamilziemian995 1 year ago
This talk from 2018 is: CppCon 2018: Chandler Carruth “Spectre: Secrets, Side-Channels, Sandboxes, and Security”. kzbin.info/www/bejne/lZeagGZ_m66FaM0
@jackwang3397 2 years ago
I have no idea about assembly language or the further details but still feel the passion of the lecturer and the audience. I am satisfied and cheer like everyone else did at 35:10 during my lunch break. Good job!
@hanneshauptmann 6 years ago
I am a simple man, I see a good talk, I press like.
@dipi71 6 years ago
... and write an encouraging comment to do the same. So do I. Cheers!
@kamilziemian995 1 year ago
How much I want to have 1/1000 of Carruth's knowledge about compilers.
@ldmnyblzs 6 years ago
That command prompt is just pure madness!
@Zeturic 4 years ago
20:48 Going from .01 to .04 is a 300% increase in cache misses (i.e. 4x as many), not .03%. When you look at it that way, dramatic changes in performance aren't that surprising.
@movax20h 4 years ago
The problem is he is counting L1 dcache misses relative to L1 dcache loads within one run. Because of the dcache misses his program runs slower, does fewer iterations, and therefore fewer dcache loads overall. A clearer way to measure would be to fix the number of iterations so each benchmark does the same number of loads overall, then compare between the runs. But yes, the 4x observation from the relative measurements is a good one. A 0.01% cache miss rate doesn't mean that 0.01% of your TIME was spent on misses; it is just a counter. L1 dcache misses are roughly 20 times more expensive than hits, so if you weight them properly you will see real, sensible info. You just need to know how to interpret these numbers.
@emanuellandeholm5657 1 year ago
I'm not a C++ developer (I have a background in C), and I don't really know x86, but these talks by Chandler Carruth are so interesting to me. This is like crack! :D
@victornoagbodji 6 years ago
amazing as always!
@bluecircle56 5 years ago
"I don't even know. Magic and unicorns?" Great talk!
@ZeroPlayerGame 5 years ago
The reason for the tight-loop speedup is that the branch is never mispredicted after the first execution, so it is predicted extremely well and the memory store is avoided entirely.
@movax20h 4 years ago
The memory store is not avoided; it still happens. The reason it is super fast is that the store is to a cache line that is already in L1 dcache (we just loaded it), and due to something called store forwarding (not important here, but it works behind the scenes to resolve load-after-store problems quickly, even on an L1 dcache miss). The probable reason the code runs at equal speed is something called op fusion: I think the CPU recognizes the jump-by-a-small-relative-offset over a move and essentially replaces it with a conditional store.
@JackAdrianZappa 5 years ago
The most important thing you can get out of this talk is that Magic and Unicorns keeps the processor from crashing! :D
@abc3631 4 years ago
That's what I love about Chandler's talks: he gets to the nub of the topic, and hard ones at that, rather than just glossing over.
@tobiasfuchs7016 6 years ago
10:00 Ohmygosh I'm using the same nvim setup, shell and even colorscheme as Chandler Carruth, I will never change my ~/.config again!
@orbital1337 6 years ago
What's the colorscheme? Looks pretty nice.
@leonhrad 6 years ago
pretty sure it's jellybeans
@victornoagbodji 6 years ago
yes please, give away the colorscheme and nvim setup : )
@ericcurtin2812 6 years ago
Vanilla vim, dark background for life lol. My .vimrc:
syntax off
set background=dark
set mouse=a
set hlsearch
@christophrcr 6 years ago
In the clamp example, after the first iteration, all values will already be clamped. I guess that's why branch prediction will always be right after some point.
@AlexeySalmin 6 years ago
TSO doesn't allow reordering anything past a store. Therefore a regular load won't flush the store buffer because it's being reordered ahead of buffered writes, not past them.
@llothar68 6 years ago
Dammit, he always gets the nicest hardware to play with. Hey boss, can you hear me? I want an AMD EPYC workstation too.
@N00byEdge 6 years ago
So what if he used $255 instead of %ebx on the cmove too??
@CornedBee 6 years ago
You can't. cmov only works between registers.
@davidjulitz7446 4 years ago
Hmmm, cmovxx should be the right choice. Could it be that cmovxx only performs worse because it can only use registers? It looks to me like moving an immediate made the performance difference here. However, modern processors are obviously too complex to know the best optimization for every particular case.
@warrenhenning8064 3 years ago
6:08 did anything ever come of the Efficiency Sanitizer?
@FalcoGer 11 months ago
36:30 use std::inner_product. I'm sure it is well optimized and does what you want without you re-implementing it. Running std::inner_product on vectors with 65535 elements with random values was around 3 times faster than calling the dotproduct function. (27.2ms vs 75.3ms on my machine)
@chris3kpl 6 years ago
I wonder if simple masking with 0xff, or a cast to char, would be faster in the clamp example, given the special case of clamping to 255.
@MatthewChaplain 6 years ago
Yes, but you would get a modulo instead of a clamp. e.g. 257 & 0xFF is 1, not 255.
@chris3kpl 6 years ago
Yes, you're right! :)
@aqg7vy 6 years ago
What is that vim setup??
@PeterFoxtrott 4 years ago
Could someone explain what he means by modern processors no longer having forward branch prediction? What else would a branch predictor do other than predict forward branches? And does that mean it no longer exists on modern processors? 32:06
@movax20h 4 years ago
There are two branch predictors at a global level in the CPU: a 'static' predictor and a dynamic predictor. The static predictor is used when the code is completely fresh and has never run before (cold code). It usually only takes into account whether a jump is forward or backward, whether it is a small or big jump, and whether it is a relative or absolute jump.

Usually the static predictor will predict taken for short relative backward jumps (because they are most likely loop jumps), but will predict large forward jumps as not taken (they lead to even colder code, possibly error handling or end-of-loop handling). Big relative backward jumps will usually also be predicted not taken, as they are unlikely to be loops, or are likely to be very big loops anyway, so the CPU probably already has plenty to do. It might still ask the cache to prefetch the cache line for a big jump either way, but that can be wasteful, and on a multi-core CPU it can add a lot of wasted memory traffic that other cores could use better.

Sometimes these things can be tweaked in BIOS or microcode for a specific CPU, but they also change between CPU versions and between manufacturers, and either way it is hard to know which is better. Most CPU ISAs don't have assembly hints to indicate which jumps are more likely; it is done only by code locality (closer code is more likely to be part of a loop and should be preferred, but I don't know where the cutoff is; probably tuned on a lot of different code bases during design).

The issue in his code is actually mostly due to the random distribution of his data and how speculative execution works. He just flushes the entire pipeline on 50% of the loop iterations.
@zarakivishalkenchan 5 years ago
This talk's title should have been "cmov spoiling instruction pipelining".
@dangkhoi137 3 years ago
What are his vimrc settings?
@ginkner 6 years ago
Stupid question, but what command line is that? I've never seen git look like that.
@chris3kpl 6 years ago
Ben, it's the fish shell in tmux.
@chris3kpl 6 years ago
Ben, and the editor is probably nvim.
@MitchPleune 6 years ago
looks like it says fish and that looks like a powerline something or other.
@gehngis 6 years ago
Yes and the theme looks like bobthefish: github.com/oh-my-fish/oh-my-fish/blob/master/docs/Themes.md#bobthefish
@user-bf6in4cj3k 4 months ago
Where can I find the code that was used in this presentation?
@SatyajitGhana7 4 years ago
Code for the benchmark utility?
@arnabthakuria2243 3 months ago
What font is this? Looks very nice.
@pfeilspitze 4 years ago
"April 2019: Intel® Architecture Code Analyzer has reached its End Of Life" :(
@LeDabe 1 year ago
Intel IACA is dead, long live llvm-mca
@user-bf6in4cj3k 4 months ago
Where can I get the c++ code in his presentation?
@Cwize1 6 years ago
My theory on why using 255 is faster: 255 is the max value of an unsigned byte, so it tends to crop up a lot in programs. Maybe the CPU internally dedicates a register solely to storing that value. (Though I know nothing about CPU design, so I am probably completely wrong.)
@peterfireflylund 6 years ago
You are. Some CPUs hardwire a "register" to a specific value, but that value is 0, not 255 (or any other all-bits-set value). x86 CPUs don't do this, but ARM CPUs do in 64-bit mode. The x86 CPUs are in your laptop, the ARM CPUs are in your phone. Byte masks are common in performance-critical code, so modern compilers recognize them.
@ChandlerCarruth 6 years ago
Sadly, the entire demo with the clamp loop was broken, so no need to speculate about this one goof. =[ My apologies; I feel really bad. I tried to craft a tiny reproducer of an issue I see commonly in the real world and thought I had succeeded, but my data was poor. I have a much better example now. Suffice to say, a key ingredient is that the branch *has* to be well predicted or the branch can't possibly help (which you see in the subsequent discussion of how the processor works). A second key ingredient is that you can craft much more dramatic examples of cmov vs. a branch, which I should have done to make this clearer and less confusing. Again, sorry about that.
@Peter_Cordes 5 years ago
@@ChandlerCarruth - I wrote up an analysis a while ago of a case where `gcc -O3` is slower (on sorted data) because it chooses CMOV. stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 (And some versions really shoot themselves in the foot by insisting on putting CMOV into the loop-carried dep chain, even if you write it in C as conditional zeroing of an input to a loop-carried ADD). Fun fact: gcc -O3 -fprofile-use figures this out and uses a branch when you do profile-guided optimization on predictable data. I assume that's the kind of case you say would make a better example, because it would tie in with your 2nd bit about the dot product loop (which does have a loop-carried dep chain, so it would hurt the "unrolling" across iterations that out-of-order execution gives us). Using a control dependency which the CPU will speculate past can very much be a win vs. a CMOV data dependency in a loop-carried dependency chain. ---- If CMOV had been much slower, it makes me think of stackoverflow.com/questions/54190958/throughput-analysis-on-memory-copy-benchmark where a byte-at-a-time asm copy loop is only running at 1 iter per 4 clocks on IvyBridge, even though it's mostly getting cache hits. But I think that's some kind of HW prefetch failure, not totally due to stores depending on loads. Breaking the data dependency between load and store with a branch instead of CMOV was plausible in your test, but unlikely on a modern x86 with deep enough out-of-order execution to keep lots of those independent dependency chains in flight at once. Maybe on an in-order Atom. :P
@BenjaminDirgo 6 years ago
I wonder if Spectre makes this talk obsolete.
@ericcurtin2812 6 years ago
I opened a PR on his GitHub wondering the exact same thing: github.com/chandlerc/chandlerc.github.io/pull/1 If I were going to get a response, I probably would have gotten one by now.
@meneldal 5 years ago
Well, he made a talk about that this year: kzbin.info/www/bejne/lZeagGZ_m66FaM0
@mcneeleypeter 5 years ago
Nothing has changed from this talk due to spectre. The hardware still operates the exact same way.
@charlesreynolds8696 1 year ago
What are the implications of this post-spectre?
@alpers.2123 1 year ago
It turns out, it wasn't magic and unicorns
@cmilkau 5 years ago
"You can't speculate on cmov" - there seems to be no reason why you couldn't. Maybe today's processors can't, but I don't see any difference between branching and conditional operations that would prevent speculation. After all, a branch is just a conditional write to the instruction pointer, which is about the hardest thing you could speculate on. What am I missing?
@Peter_Cordes 5 years ago
If you want speculation that can mispredict and might need to roll back, use a branch! If you want a data dependency, use CMOV. Speculating CMOV would defeat the entire purpose. (Unless you dynamically evaluate and predict whether it would be a good idea to do that for any given CMOV). But that would be hard to implement. An instruction that could decode to *either* an ALU operation or a branch would need another whole predictor, and a whole new mechanism for pretending we branched in the internal uops when the x86 code didn't contain branch. Plus you'd need that predictor as an input to every decoder. Or if it's only to the complex decoder, then CMOV could cause delays in decoding by having to always decode in the complex decoder, even if it chooses to only decode to a single ALU uop (with a normal data dependency) instead of a branch, in cases where it wouldn't predict well as a branch, e.g. on a condition that was only true some of the time, not almost all the time. (Possibly related: one reason that `rep movs` startup is slow is that micro-code branches on Intel CPUs are special and can't use branch prediction. See stackoverflow.com/questions/33902068/what-setup-does-rep-do for a quote from Andy Glew, former Intel architect who implemented fast strings on P6. This may mean it would be impossible for a CMOV to decode to a branch-like uop over a mov uop in a way that could actually speculate.) See also stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 for another case where CMOV is not a win, but it is a win on unsorted data. With more links and details about when CMOV is/isn't a good idea.
@bloopbleep7082 4 years ago
Conditional moves don't change the instruction stream. Conditional branches potentially do. Thus you usually don't need to flush the instruction pipeline with conditional moves.
@marcpanther7924 6 years ago
What is he using? My bash doesn't look as cool as that
@dipi71 6 years ago
It’s bobthefish; but note the numerous redrawing bugs during the talk. It doesn’t look completely trustworthy to me. Oh well, maybe it’s nvim.
@marianaldenhoevel7240 7 months ago
My takeaway: if you really care about performance you have to measure it (fine) and try non-obvious things to see whether they change anything (not that great). Even if Chandler disagrees, you then just hope that the system you will be running on in anger agrees with the one you measured. The next CPU may think differently, and that CPU can be in the machine you use in production, or simply change under you next month. That feels like the software equivalent of "shut up and calculate" in physics. Not a very satisfying place to be. Luckily my code has all the time in the world and can just blunder through whatever my thick brain writes down.
@SanjeevkumarMSnj 3 years ago
Which terminal is he using?
@kosbarable 5 years ago
Was it in the Simpsons? Probably not. Thumbs up!
@slavanap 6 years ago
35:03 Man, you just jump over a memory write operation (for the majority of your data elements). There is a huge impact from that. I guess it's more noticeable than not using `cmov`. ADDED: 49:01 YES! Keep your hands off `cmov`! That instruction isn't the issue!
@SolomonUcko 4 years ago
Didn't he say that the majority of the elements are over 255, and need to be modified?
@joe-ti8rz 6 years ago
Go Chandler. Survive. Be better. Buy the Tibetan Book of the Dead and meditate.
@Thecommet 5 years ago
9:08 Your indices array ranges from 0 to count, when it should be 0 to count-1. v[count] is out of bounds
@movax20h 4 years ago
The RNG will not generate count. RNG(0, count, count) generates 'count' elements (the last parameter), each of them between 0 and count-1 inclusive. Not including the max element is very common in RNG design, and is similar to how iterators and loops work.
@noxabellus 6 years ago
Looks, talks, acts like a Chandler. Thinks like a damn Einstein O_O