Performance Tuning

Views: 13,738

Matt Godbolt

A day ago

Comments: 36

@andreaslegomovies 6 years ago

Very nice approach and well explained! I hope to see more like this :)

@valcron-1000 5 years ago

This is some top quality content

@evgeniygorbachov3970 4 years ago

My sincere thanks for sharing this. So interesting, watched it in one breath.

@ExecutionUnit 6 years ago

Bloody dynamic_cast! Great video.

@dholmes215 6 years ago

Note that dynamic_cast performance can actually vary wildly depending on implementation and circumstance. Both libstdc++ and libc++ have a configurable optimization that replaces the strcmp() with a simple pointer comparison, which apparently is turned off by libstdc++ by default for compatibility reasons, but on by default in libc++ last time I checked. This can improve its performance by an order of magnitude. Of course, it would still be way slower than what was demonstrated later in the video.

@MattGodbolt 6 years ago

Thanks for the comment! I was unaware it could be changed: do you have a link to this?

@dholmes215 6 years ago

I wrote a thing about it here once: dholmes215.github.io/c++/2017/06/30/fun-with-typeid.html The bit under "What's going on?" mentions -D__GXX_MERGED_TYPEINFO_NAMES. Note that I didn't really dig any deeper than that, so I honestly have no idea if fiddling with that is a good idea. The main lesson I learned was to avoid dynamic_cast when it wasn't exactly what I needed anyways.

@dholmes215 6 years ago

I looked it up again and found this change: github.com/gcc-mirror/gcc/commit/62164d4c3690257a01326783bed77c15831d52b9 ... which apparently is where the libstdc++ developers decided always defaulting to off was the safest choice. That's the limit of my understanding, though.
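A minimal sketch of the two comparison strategies mentioned in this thread, for readers who want to see the difference. It is not the libstdc++ or libc++ implementation, and a real dynamic_cast also walks the class hierarchy; the helper names `sameTypeByName` and `sameTypeByPointer` and the `Base`/`Derived` types are made up for illustration. The pointer version is only safe under the assumption that each type has exactly one name string in the whole process, which is what the "merged typeinfo names" setting is about.

```cpp
#include <cstring>
#include <typeinfo>

// Hypothetical polymorphic types, only here so typeid has something to inspect.
struct Base { virtual ~Base() = default; };
struct Derived : Base {};

// Conservative check: compare the implementation-defined type names
// character by character. Works even if the same type has several
// type_info copies (e.g. across shared libraries), but costs a strcmp.
bool sameTypeByName(const std::type_info& a, const std::type_info& b) {
    return std::strcmp(a.name(), b.name()) == 0;
}

// Fast check: compare the name pointers only.
// ASSUMPTION: every type has a single, merged name string in the process.
// If that does not hold, this can report "different" for identical types.
bool sameTypeByPointer(const std::type_info& a, const std::type_info& b) {
    return a.name() == b.name();
}
```

Called as `sameTypeByName(typeid(obj), typeid(Derived))`, the first version is the kind of string comparison the original comment refers to; the second replaces it with a single pointer compare.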

@AndreasStenmark 6 years ago

Nice video Matt! I spend my free time tuning things in a very similar manner, only in a different domain. Out of interest, have you compared your itoa() to Alexandrescu's by any chance at all?

@pranaykothapalli3980 5 years ago

Beautiful video. I'd love to see more like this, waiting for more videos from you :)

@Xtcent 5 years ago

Love your compiler explorer, nice!

@parnmatt 5 years ago

SOH (\x01 or ^A) is Start of Heading ... frankly they should have gone with US (\x1f or ^_) the unit separator

@hl2mukkel 6 years ago

Damn I missed the videos you uploaded since spectre because youtube didn't notify me! Gonna turn on the notifications now! I'm very passionate about performance and efficiency so I love these videos! Thank you very much for them. By the way, have you tried Rust and if so, what do you think of it?

@jubbernaut 5 years ago

Very interesting stuff Matt!

@DenMarket-uc1nq 6 months ago

Brilliant!

@kejith3853 5 years ago

Would it make sense to build a lookup table? You could just pass the quantity as the key and get the string back. It's a lot of wasted memory, but maybe it could be a little faster?

@MattGodbolt 5 years ago

My later examples use lookup tables to do two digits at a time. If you're proposing a lookup table for every number, you'll need to dedicate about 20+GB of RAM, which is quite a lot :). And at some point the cache misses are far more expensive than the small amount of processing needed to use a smaller table (that fits into L1).
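A rough sketch of the "two digits at a time" idea, assuming a 200-byte table of the pairs "00" to "99" (small enough to stay in L1). This is not the code from the video; the names `DigitPairs`, `kPairs` and `toDecimal` are invented here, and it needs C++14 or later for the constexpr constructor.

```cpp
#include <cstdint>
#include <string>

// Table of the 100 two-character strings "00".."99", built at compile time.
struct DigitPairs {
    char data[200];
    constexpr DigitPairs() : data{} {
        for (int i = 0; i < 100; ++i) {
            data[2 * i]     = static_cast<char>('0' + i / 10);
            data[2 * i + 1] = static_cast<char>('0' + i % 10);
        }
    }
};
constexpr DigitPairs kPairs{};

// Convert an unsigned 32-bit value to decimal, writing two digits per
// division instead of one. Digits are produced right to left.
std::string toDecimal(std::uint32_t value) {
    char buf[10];                      // UINT32_MAX has 10 decimal digits
    char* const end = buf + sizeof buf;
    char* p = end;
    while (value >= 100) {
        const std::uint32_t pair = value % 100;
        value /= 100;
        p -= 2;
        p[0] = kPairs.data[2 * pair];
        p[1] = kPairs.data[2 * pair + 1];
    }
    if (value >= 10) {                 // two leading digits left
        p -= 2;
        p[0] = kPairs.data[2 * value];
        p[1] = kPairs.data[2 * value + 1];
    } else {                           // a single leading digit (or zero)
        *--p = static_cast<char>('0' + value);
    }
    return std::string(p, end);
}
```

The table halves the number of divisions compared with a digit-at-a-time loop, while staying tiny compared with a table indexed by every possible number.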

@itssuperninja 6 years ago

This is an amazing video. Thank you.

@afigegoznaet 6 years ago

Special thanks for the link to the cool and exciting secret things

@MattGodbolt 6 years ago

Which link? :)

@afigegoznaet 6 years ago

godbolt.org obviously, I'm playing with the code right now, without having to retype everything

@MattGodbolt 6 years ago

Oh I see! :) Hardly a secret hehe! You had me worried for a moment that I'd accidentally leaked something I shouldn't have...

@afigegoznaet 6 years ago

Nah, it's a failed joke related to your DRW job description in some places. You've got a fan here ;)

@MattGodbolt 6 years ago

Haha, right, my blog description. I was just panicking, so your joke definitely worked!! Thanks!

@YouAreUnimportant 3 years ago

shift left and shift right might be simple operations but they are very slow on some intel chips.

@MattGodbolt 3 years ago

Can you say which? None of the chips I've looked at (from Conroe onward) take more than a cycle for regular left and right shifts (source: uops.info/table.html, e.g. uops.info/html-instr/SHL_M16_0.html and others). Some involving the carry have a longer latency, but I've never seen a compiler emit those (mainly as they are so slow).

@utromvecherom 6 years ago

How were the measurements performed? And in general, how do you measure a particular piece of code?

@MattGodbolt 6 years ago

I think I described it in the video: I ran the code a number of times and took the average. This is less than ideal, but gave a decent enough idea without having to get in the weeds of "how to measure performance", which is a whole other talk! I can't find the source now... I did have it around at one point and I'll post it here when I find it :)

@utromvecherom 6 years ago

Do you mean you run a piece of code in a loop? Let me elaborate a bit: I've played around with Agner Fog's test suite (www.agner.org/optimize/#testp) to measure pieces of code, and the results (measured in cycles; they can be converted to ns by dividing by 3.6 for my CPU) are quite random across different runs of a test program (each run measures a piece of code N times). For simple code like pow(sin(x),2)+pow(cos(2),2) I get stats like (avg=236, stddev=60), (380, 120), (450, 60), and that is after I throw out the 14 min and 14 max values out of 128 (=N) measurements in a single run. So I'd be very excited if there were an approach that gives a stable cost estimate for small pieces of code. Running code in a loop reveals the true cost to some degree, but it is a less realistic scenario than one where the piece of code is part of a bigger path, surrounded by other code that touches different locations in memory, uses registers and so on.
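For anyone wanting to reproduce the kind of numbers discussed in this thread, here is a rough sketch of "run it N times, drop the extremes, report mean and standard deviation". It is not the harness used in the video, and it is not Agner Fog's testp: it times with `std::chrono::steady_clock` in nanoseconds rather than reading cycle counters. The names `measureNanos`, `trimmedStats` and `Stats` are invented for this example.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct Stats {
    double mean;
    double stddev;  // population standard deviation of the kept samples
};

// Time `fn` once per run and return one wall-clock sample (ns) per run.
template <typename F>
std::vector<double> measureNanos(F&& fn, std::size_t runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(stop - start).count());
    }
    return samples;
}

// Drop the `trim` smallest and `trim` largest samples, then summarise the
// rest. Returns zeros if nothing would be left after trimming.
Stats trimmedStats(std::vector<double> samples, std::size_t trim) {
    if (samples.size() <= 2 * trim) return {0.0, 0.0};
    std::sort(samples.begin(), samples.end());
    const auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim);
    const auto last  = samples.end() - static_cast<std::ptrdiff_t>(trim);
    const double n = static_cast<double>(last - first);

    double sum = 0.0;
    for (auto it = first; it != last; ++it) sum += *it;
    const double mean = sum / n;

    double sq = 0.0;
    for (auto it = first; it != last; ++it) sq += (*it - mean) * (*it - mean);
    return {mean, std::sqrt(sq / n)};
}
```

With the figures from the comment above, that would be something like `trimmedStats(measureNanos(work, 128), 14)`, where `work` is whatever callable is being measured. Trimming reduces the effect of outliers but does not address the run-to-run variation described in the comment.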

@perryizgr8 6 years ago

I wrote the newOrder() function and I'm calling it with random data, but I don't want to profile the calling function and the random generation etc. So how do you instruct perf to record and report stats for only one function (newOrder)? Very informative presentation btw!

@MattGodbolt 6 years ago

I didn't -- I just literally did `perf record` and `perf report`. I did edit out functions from the bottom of the list that weren't interesting (until the end slides when I show both the function in question and main())

@perryizgr8 6 years ago

Looks like my compiler is inlining newOrder() and maybe that's why it simply doesn't show up in perf's report. But I'm able to annotate main() and follow it till I reach where the call to newOrder() is supposed to be, but it is literally just a direct callq __sprintf_chk.

@MattGodbolt 6 years ago

Awesome...that's pretty much what I saw. When I got to the sprintf() version slides (~11mins) you'll see there's no mention of newOrder in the profile, just the vfprintf/xsputn etc of the implementation of sprintf. So you're seeing the same :)