Fastware - Andrei Alexandrescu

Рет қаралды 22,178

Күн бұрын

Пікірлер: 29

@dedvzer 2 жыл бұрын

I like how he expands most of these topics into complete talks for the next 10 years (this is 2008). I've seen probably most of them, but they deserve some time. And here he is, ready to do it all in one hour.

@simonfarre4907 Жыл бұрын

The talk was not from 2008. It was from 2017.

@fritzschnitzmueller3768 4 жыл бұрын

I love how he lets the audience choose the topic of the talk on the fly :D Who else does that :D

@TheOnlyAndreySotnikov 3 жыл бұрын

I tried his version of improved atoi. It indeed is faster than the standard library atoi. However, manually implemented atoi with "bad" dependency "result = result * 10 + (*b - '0')" showed the same performance as the "improved" version . IMHO it makes sense, since "result += ..." presents the same dependency on the previous result as "result = result * 10 + (*b - '0')". In clang the results are even more stunning. His "bad" version of atoi is in fact *2 times faster* than the trick with precalculated powers of 10, because when clang sees "result = result * 10 + (*b - '0')", it uses two LEA instruction instead of multiply: leaq (%rcx,%rcx,4), %rcx, leaq (%rsi,%rcx,2), %rcx. The first one performs multiplication by 5, the second combines multiply by 2 and addition.

@tedrivera8051 2 жыл бұрын

He did mention that some of the advice might "not age very well with improvements in compiler and CPU technology" (3:23). Clang has done a lot in that space, but it's good to know which advice hasn't aged well. Thanks for that!

@MrAbrazildo 11 ай бұрын

54:34, I recently discovered that unrolling a loop of 'sscanf's to (I don't remember well) 50 numbers made it more than 3x faster, for both Clang and GCC. I thought they do that by default in -O2/3.

@Voltra_ 2 жыл бұрын

I honestly didn't know how to implement strtr. So as a challenge when listening to this, I wrote it as an infinite loop. It's much easier to think about and I think I got a pretty optimized linear version

@MrAbrazildo 11 ай бұрын

50:12, I would expect isdigit to be faster, because it would only need to check some bits.

@victornoagbodji 7 жыл бұрын

amazing talk!

@ace100hyper3 2 жыл бұрын

I like Andrei, but I have to disagree that optimal way to implement StrStr is by starting with an infinite loop. I just implemented the algorithm in C# and I used the loop condition regularly. I would really like to see Andrei's implementation on this, because it may be that he is doing extra comparisons in the loop body.

@xvoidee 4 жыл бұрын

Now I have big doubts regarding my optimization skills :-)

@yuehuajia3176 6 жыл бұрын

Great Talk. I learned a lot from you.

@eduardomadrid2380 7 жыл бұрын

Another good presentation by Alexandrescu, but watch out!, "digits10" is actually a pessimization, as benchmarked in my x86-64 computer, here: github.com/thecppzoo/inprogress/blob/master/src/main.cpp The reason is that it has too many unpredictable microbranches for reasonable inputs, this is "sloware". Eventually I will publish a more detailed analysis, for the time being, you may use my implementation of digits10 which is almost twice as fast for sequential inputs and about five times faster for random inputs

@FoxDr 3 жыл бұрын

Thanks for the bench. The whole hour, I've been thinking "why not just use `ceil(log10(abs(a)))`?". Sure, float ops up the wazoo, might not be worth it for small numbers, but no looping required at all. I'll add it to your benches when I find some time :)

@chriswatch582 7 жыл бұрын

On Visual Studio 2015 using Microsoft 7 Professional with 2 Cores, I ran a test version of the non-parallel, parallel computation and unrolling, but the non-parallel routine was always faster. I even tried putting the large array and string length computation in the routine but it was still slower than the non-parallel computation. Relying on parallelism was slightly slower than unrolling in the last 2 tests. I developed 2 programs. I included the first program below and the output from the 2nd program! Output from 1st program: TEST 1: Convert string to number: 1234. Not Use Parallelism Times: min = 0 max = 5307 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Parallelism Times: min = 0 max = 6310 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Unrolling Times: min = 0 max = 6621 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 TEST 2: Convert string to number: 12345678. Not Use Parallelism Times: min = 1 max = 9573 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Parallelism Times: min = 1 max = 10227 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Unrolling Times: min = 1 max = 9799 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 TEST 3: Convert string to number: 1234567890123456. Not Use Parallelism Times: min = 1 max = 12625 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Parallelism Times: min = 2 max = 17928 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Unrolling Times: min = 2 max = 17227 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 ------------------------------------------------------------------------------------------------------------------------------ I quickly modified the first version of the below program. In this 2nd program for the unrolling test, I replaced the "for" loop with an "if" statement. The "if" statement checked if there were "4", "8", or "16" digits in the test string. Then for each active option, I unrolled 4 times, 8 times, or 16 times. The resulting output from the program was that the routine based on unparllelism was always fastest. The second fastest was always unrolling. The routine based on parallelism was always the slowest routine. ** Basically, all the interesting things Andrei said about parallelism (and unrolling) does not work with the program that I developed on Visual Studio 2015. I believe all those fine and eloquent things should work elsewhere though! ------------------------------------------------------------------------------------------------------------------------------ Output from 2nd program: TEST 1: Convert string to number: 1234. Not Use Parallelism Times: min = 0 max = 3520 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Parallelism Times: min = 0 max = 5156 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Unrolling Times: min = 0 max = 4933 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 TEST 2: Convert string to number: 12345678. Not Use Parallelism Times: min = 0 max = 5161 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Parallelism Times: min = 1 max = 8520 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Unrolling Times: min = 1 max = 8295 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 TEST 3: Convert string to number: 1234567890123456. Not Use Parallelism Times: min = 1 max = 9065 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Parallelism Times: min = 1 max = 15273 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Unrolling Times: min = 2 max = 15128 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 ------------------------------------------------------------------------------------------------------------------------------ #include "stdafx.h" //USED BY VISUAL STUDIO 2015 #include #include using namespace std; // 2 ROUTINES THAT CREATE AN INTEGER FROM A STRING REPRESENTATION OF THAT NUMBER. //THE SLOWEST IS SUPPOSE TO RELY ON NON-PARALLELISM. //ROUTINE FROM TALK: slide 59 unsigned long long atoui_notuse_parallelism(const char* b, const char* e) { unsigned long long result = 0; for (; *b != *e; ++b) { result = result * 10 + (*b - '0'); } return result; } static const long long pow10[21] = { 1UL, 10UL, 100UL, 1000UL, 10000UL, 100000UL, 1000000UL, 10000000UL, 100000000UL, 1000000000UL, 10000000000UL, 100000000000UL, 1000000000000UL, 10000000000000UL, 100000000000000UL, 1000000000000000UL, 10000000000000000UL, 100000000000000000UL, 1000000000000000000UL, 10000000000000000000UL }; //IDEA FOR ROUTINE CAME FROM :slide 63 unsigned long long atoui_use_parallelism(const char* b, const char* e, short i) { unsigned long long result = 0; for (; *b != *e; ++b) { result += pow10[i--] * (*b - '0'); } return result; } //IDEA FOR ROUTINE CAME FROM UNROLLING SLIDE: slide unsigned long long atoui_use_unrolling (const char* b, const char* e, short i) { unsigned long long result = 0; for (; *b != *e;) { result += pow10[i--] * (*b - '0'); b++; result += pow10[i--] * (*b - '0'); b++; result += pow10[i--] * (*b - '0'); b++; result += pow10[i--] * (*b - '0'); b++; } return result; } int main(int argc, int** argv[]) { int elapsed_Time; int max, min; int max_num_iterations, min_num_iterations; long long result; std::clock_t start; //Holds starting time __int64 test_Number; char test[30]; //3 TEST VALUES char test2[9] = "1234."; char test3[11] = "12345678."; char test4[18] = "1234567890123456."; const char* b; const char* e = "."; int strlength; for (auto j = 1; j

@nivo6379 4 жыл бұрын

For benchmarking, you should probably use Clang or GCC.

@NoNameAtAll2 3 жыл бұрын

You use std::clock, but you need to know that it isn't representing real time, but cpu-usage time Use high_resolution_clock or steady_clock

@abc3631 4 жыл бұрын

What a deadpan audience

@leipzig3 3 жыл бұрын

Certain cicadas come out every 17 years. If you need to know if this year is a cicada year, return (year %17 + offset)

@bool2max Жыл бұрын

why does he call integers "integrals"

@GeorgeTsiros 6 жыл бұрын

37:15 wtf we have them on our WRISTS these days!

@TheDvrx 3 жыл бұрын

So pipelined cpu's are secretly vector processors.

@typon1 6 жыл бұрын

Did they find a bunch of zombies to attend this talk or are the English completely humor less?

@Vitorian 4 жыл бұрын

Aren't those the same?

@GeorgeTsiros 4 жыл бұрын

he didn't do _the hands_ on a more serious note, sometimes i watch andrei for serious... sometimes not so much (example, the allocators video, that one was a kneeslapper)