Is this New Quadratic Formula Actually Faster than the OG Quadratic Formula? (For a Computer)

Рет қаралды 94,718

Creel

Күн бұрын

Пікірлер

@gideonmaxmerling204 2 жыл бұрын

Would probably be faster to just return 1 for every entry. It would be accurate enough in astronomical scales.

@rayniac211 2 жыл бұрын

Unless your quadratics are astronomical in nature ;)

@WitherLele 2 жыл бұрын

@@rayniac211 then it would be accurate on a bigger scale

@RobertLoweIsAwesome 2 жыл бұрын

@@WitherLele kzbin.info/www/bejne/f3TZZ3qCrNlkna8

@airkami 7 ай бұрын

You remind me of the algorithm i designed to check an array of positive integers to return true for each value divisible by 3. My Algorithm: Return false It is fast, it doesn't contain unpredictable failure, and it is correct most of the time given the set of all positive integers.

@josemonge4604 2 жыл бұрын

Dark background and blue color aren't a good combination of colors for readable text. Try white or yellow next time or something like that. Anyway great comparison and great job on the optimizations!

@HansLemurson 2 жыл бұрын

If you want Blue to be seen, add some Green!

@firstname4337 2 жыл бұрын

LOL, he said blue code and i said what code ? I had to change the brightness of my monitor to see it

@sdjhgfkshfswdfhskljh3360 2 жыл бұрын

Not just it bad by itself, but it is also more damaged by KZbin compression algorithms.

@NoNameAtAll2 Жыл бұрын

yellow on white? ha ha

@professortrog7742 Жыл бұрын

@@NoNameAtAll2 no on the dark background

@47Mortuus 2 жыл бұрын

9:35 don't divide by 2, multiply by 0.5. The compiler is not allowed to do so, as it will sometimes produce different bitwise results. ... Unless you have fast unsafe math enabled, which will also allow for FMA reordering, which is a pretty significant optimization... Also, much of the runtime is actually float division. It's not pipelined and you can only start a division after another has finished, which is 14 clock cycles at best, which is _extremely_ relevant when you can replace a FP division by using "RCP_PS" and do some Newton-Raphson, which has _WAY_ higher throughput than FP divison (relevant for tight loops such as this one), and can be FMA fused when multiplying by the reciprocal instead. Here's the code for an accurate reciprocal (only _possibly_ worth it if your CPU supports FMA; total latency (fastest possible): 4 (rcp) + 2 * 4 (FMA) = 12 cycles vs 11 cycles (fastest possible division), but way higher throughput and FMA "fusable", meaning another - up to - 3 clock cycles saved): __m128 rcp23_ps(__m128 a) { const __m128 ONE = _mm_set1_ps(1f); __m128 r = _mm_rcp_ps(a); return _mm_fnmadd_ps(_mm_fmsub_ps(r, a, ONE), r, r); }

@DDlol01 2 жыл бұрын

I was looking for this one. I also want to add: even if fast unsafe math is enabled, on some compilers (vcc: /fp:fast) you should also enable optimizations for speed to rewrite constants as needed (done when using /O2 or /Ot but NOT /Ox). /Ot is default. /O2 does some more afaik. (no indepth knowledge use at your own risk. Please corret me if im wrong) P.S.: AFAIK in gcc constant rewriting is auto enabled when using fast unsafe math (-ffast-math) and optimization levels do not affect constants, -ffasm-math takes precedence?

@cH3rtzb3rg 2 жыл бұрын

Dividing by powers of two can and will actually be replaced by multiplying with the inverse by the compiler even without fast-math flags, since the result will be exactly the same. One could replace the divisions by a with a multiplications by (1/a) -- this is a thing the compiler won't do in exact-math mode. In fast-math mode, it may even use the fast rcpps (reciprocal packed single) instruction with a refinement step to do this faster.

@JamesEdwards780 2 жыл бұрын

bit shift >> may be faster still

@vytah 2 жыл бұрын

@@JamesEdwards780 You can't just bitshift floats.

@PremierSullivan 2 жыл бұрын

@@vytah Not with that attitude!

@pyromen321 2 жыл бұрын

I’m guessing the new version might actually use less power! They should be practically the same speed due to ILP, but the the new version requires less calculations

@-mwolf 2 жыл бұрын

whats ILP?

@leocomerford 2 жыл бұрын

@@-mwolf Instruction-level parallelism en.wikipedia.org/wiki/Instruction-level_parallelism .

@MenkoDany 2 жыл бұрын

Thank u for the wisdom, dolan

@ABaumstumpf Жыл бұрын

But it simply does not take less computational powe

@kingacrisius Жыл бұрын

@@ABaumstumpf I believe they mean literal power, like electricity. It might be less expensive in terms of electricity.

@pyromen321 2 жыл бұрын

26:24 it’s worth noting that because you’re dividing my a constant power of two for the third div in the new version, all modern compilers will optimize that to a single mul instruction. I’m almost certain that -x/y will always become an xor and mul when y is a power of two Edit: I don’t want to litter the comment section too much, so I’ll just put this here: awesome video, man! Some of my work involves optimization, and I absolutely love when other people make this knowledge super accessible!

@WhatsACreel 2 жыл бұрын

I think you're right about the division being swapped for a mul! I can't remember what I did in the ASM, but hopefully a compiler would figure that one out... Cheers for watching mate, have a good one :)

@PhilippeVerdy 2 жыл бұрын

An actual implementation would distinguish 4 main cases, depending on if 'a' is zero or not, and on if 'b' is zero on not: - 1. When (a=b=0), if (c=0) there's a single root (r1=r2=0), otherwise there's no solution (r1=r2=NaN). - 2. Otherwise when only (a=0), there's no solution if (b=0), otherwise a single solution (r1=r2=-c/b). - 3. Otherwise when only (b=0), there are two solutions of opposite signs (r2=-(r1=√(c/a)). - 4. Otherwise, we have the two quadratic roots, possibly complex depending on the sign of the determinant (b²-4ac). For the last 4th case, you need **numeric stability** (due to limited precision for "float", "double" or "long double" types used by FPU) and the two quadratic solutions should be given by (r{1,2}=[√(1-4ac/b²)±1]*b/2a), where the square root may be imaginary if the "normalized" determinant (1-4ac/b²) is negative, i.e. when the usual determinant (b²-4ac) is negative (both conditions are equivalent because b≠0 in that 4th case). This gives a total of 7 cases, where conditionals (if statements) are justified. But the test on if b=1 in this video is just useless and stupid (it tries to avoid a trivial division by 1 but at the cost of a branch, which disables the optimization or parallelisation by the compiler: divisions by 1 are trivial also in FPUs, may just cost 1 cycle, LESS than the cost of a branch).

@PhilippeVerdy 2 жыл бұрын

(-x/y) cannot be "optimized" with a xor and mul: we're working here with floatting point values, not integers! The negation of floatting points just flips & single bit, the sign bit, it does not perform any "xor" on the whole value (the exponent or the mantissa part are not modified), and this can be done before the division on any operand x or y, or on the result; but in fact the result sign is NOT[sign(x)AND sign(y)], and the absolute quotient is the separate division of abs(x)/abs(y). To compute that (positiive) dividend of these two positive floatting points, - the FPU separately computes the quotient of the two mantissa, which are in (0.5, 1.0], so that non-normalized quotient is in (0.5, 2], and the difference of exponent parts; - finally it renormalizes the quotient into (0.5, 1.0] and accordingly adjusts the difference of exponent parts by conditionally substracting 1. The final result may become a denormal (or zero) floatting point value, or an "infinite" floatting point value, depending if the final exponent part results in an underflow or overflow (for its allowed range), but it will be a "NaN" (or would raise a "'division-by-zero" exception) if both mantissa parts (including the implicit most significant bit if it's not stored in the floatting point format) are zero. In all cases, the result sign computed separately is blindly applied as is (possibly giving "positive or negative zero", "'positive or negative NaN", or "positive or negative infinite" values; some libraries may or may not discard the sign distinction for zeroes and NaNs, but with some extra cost using conditional branches)

@pyromen321 2 жыл бұрын

@@PhilippeVerdy, you probably should have checked godbolt before that whole explanation. I just confirmed that on some compilers it becomes a mul and xor, on others it just becomes a mul by -0.5. Remember, in this case we are dividing by a positive constant power of two and negating. This means the compiler can take shortcuts.

@PhilippeVerdy 2 жыл бұрын

@@pyromen321 if you are dividing à float by a negative power of two, you don't need a mul, just z decrementation of the exposant part, and flipping the sign bit, this tales two instructions, worse than a single FPU instruction performing a single multiplication by constant like -0.5 (consider data path on thd same word containing thd sign bit and the exponent, plus you need to test underflows when decrementing the exponent, or you'll wrong results !

@bronzecomeshome9517 2 жыл бұрын

My man really putting the work in on the Blender effects 😂 but damn I was so shocked to see deleting a useless branch cut the processing time in half

@protonmaster76 2 жыл бұрын

He's done an interesting video on branchless coding. In modern CPUs, a branch can be very expensive.

@bronzecomeshome9517 2 жыл бұрын

@@protonmaster76 I've seen it and enjoyed it, I just assumed that as a false branch it would have been optimized out by the compiler.

@chainingsolid 2 жыл бұрын

The combo of a branch + floating point, may have resulted in the compiler not realizing it was redundant. And having a branch that also probably didn't predict very well (its based soley on incoming data) it may have done major performance damage. Would be cool to see the pattern of which way the branch goes, given its based solely on the input. Wonder if "sorting" the input to make it predictable would allow removal of the divide (by the CPU's branch predictor), for some of the solve attempts and thus, maybe go even faster for those cases. @Creel looking at the branch predictability could be an interesting video.

@WolfgangBrehm 2 жыл бұрын

4:31 skip the if. Just always divide. If the coefficients are random, the probability of a being equal to 1 are very slim. Also it might be worthwhile to multiply by the precomputed inverse of a.

@rrshier Жыл бұрын

I was just thinking what you put down here. The if statement is actually probably his biggest adder of time here! AND usually the vectorization wants to have pure math, nothing with conditionals

@larswilms8275 2 жыл бұрын

just so everybody knows: the OG quadratic formula is the closed form of the Po-Shen Lo algorithm. the PSL algorithm is simpler to understand and perform by a mere human, but I would not expect optimized code for the two variants to differ al to much.

@alexwolski3344 2 жыл бұрын

What a great idea! Gamifying the benchmarks makes it so much fun. Love the animations and benchmarking song. Really interesting stuff.

@matthewkolar7560 2 жыл бұрын

CPU's & Compilers can vectorize Array of Structures as well, the difference between this and SOA is more about optimizing cache access patterns for functions that dont use every element in the structure. You should double check your compiler optimization settings if you are getting wildly differing results from a structure this small.

@sabriath 2 жыл бұрын

CPU latency of modern computers are roughly: fadd 4 (as well as fsub) fmul 4 fdiv 11-25 fsqrt 35 At the very least, this puts the original at between 97 and 125 latency, while the new formula takes between 88 and 130 latency. This means on natural programming it is both slower and faster depending on division requirements. The issue I have with your assessment at the end is you say that the new formula takes 3 divisions, it actually takes 2 to normalize the equation off the first coefficient. You are dividing by 2.0f, this is wasting 11-25 cpu latency on chip where you could simply multiply by 0.5f which is 4 latency. Even worse is the fact that floats are in powers of 2, so dividing by 2 is just decrementing the exponent, which is a 1 latency to perform an integer decrement prior to loading the float into register for use (even before normalizing division in the fpu). Some compilers would have noticed the 2.0f division and possibly switched to a fmul on 0.5f for speed, but it wouldn't have figured out that you could do a dec prior to any of it, which means that you didn't actually test the right code. If done correctly, the new method should have used 1xDEC (1), 2xFDIV (22-50), 1xFMUL (4), 1xFSQRT (35), and 3xFADD/FSUB (12), which comes to 74-102, faster than the original by 21%. Also, complex numbers are automatically formatted correctly with the new method, the old method has a division after sqrt, which takes longer to simplify (squaring it, dividing the sqrt answer by it to get the complex part formatted). You can make the code a little bit faster by taking the reciprocal of the leading coefficient (1xFDIV) and then multiplying it to the other coefficients to speed up the process further, but both can make use of that trick....despite that, the new method becomes almost 23% faster. The real bottleneck comes down to the square root function, it is the most expensive part in both scenarios. If we could get that down to 15 latency, the new method would be 28% faster than the old.

@procedupixel213 2 жыл бұрын

If the least significant bit of precision can be sacrificed in the new method, the two divisions by a can be replaced by one reciprocal (i.e. 1.0/a) and two multiplications with that reciprocal. Thus one division can be traded for two multiplications, which may be faster.

@SmallSpoonBrigade 2 жыл бұрын

Honestly, that's the real solution here. How much accuracy are you allowed to sacrifice in order to speed things up. I'm taking accounting courses right now and there's a bunch of shortcuts that are taken for similar reasons. The result is somewhat different from what it would be with proper math, but it's usually a small amount and you're rarely dealing with more than 4 decimal places when dealing with currency anways. Plus, since people are using one of two sets of rules regarding how these things are don, there usually isn't much trouble.

@attilatorok5767 2 жыл бұрын

Accuracy is also why std:complex is usually slow. The C++ standard library has to work in all cases and for complex numbers not sacraficing precidion means a lot slower code. You also cant use a fast-math optimised version because the standard library is precompiled (altough you can work around this by compiling a glibc with fast-math and using that with static linking it would probably still preform worse than hand-written speed-for-accuracy optimised code). I would personally not recommend std:complex in any case unless you are sure you need super-duper accurate answers but than you would probably go with an arbitrary precidion library wich has complex numbers.

@teambellavsteamalice 2 жыл бұрын

Isn't the step B = b/a and then mid = B/2 the real waste here? You only need mid. Also you can use a trick used for the quadratic, where r2 is used for the discriminant, r1 = (-b + r2)/2a and r2 = (-b - r2)/2a. For the Po Shen Lo it works even better: mid = b/2a c = c/a r2 = sqrt(mid*mid - c) r1 = mid - r2 r2 = mid + r2 the last step I imagine can be faster as it's adding mid to itself ( r2 += mid?) This should already always beat the quadratic. Would using reciprocals be even faster?

@shanehebert396 2 жыл бұрын

For the CPU/FPU code, if the bottleneck is reading from RAM, you might try putting in some prefetch loads of the next set of data (particularly for your ASM code). It may not make much difference, though, given the loads are cache line at a time. I remember way back in the day we were looking at the Fastest Fourier Transform in the West (FFTW) on the UltraSparc and telling the C compiler to heavily optimize it (including using static profile guided optimization) and the ASM it generated was pretty slick, including prefetch loading so the data for the next loop was in L1 when the next loop iteration needed it. If the compiler you used didn't use PGO to optimize but has the option, it might be interesting to see if it helps. Also, the dark blue text used for Po-Shen Loh code is pretty hard (for at least me) to read. The red was OK even if a little dark.

@booom4849 2 жыл бұрын

Manual prefetching will be slower because there are now additional instructions to be executed. He is just iterating over arrays, the hardware prefetcher likes that pattern.

@SpontaneousProcess 2 жыл бұрын

As a tip: at 3:45 the contrast with the blue text is so low that I can’t read it. So I’m following along with your commentary by imagining the route in my head rather than reading along with you! EDIT: ok so this happens a lot in this video.

@astrobullivant5908 2 жыл бұрын

When I first saw Po-Shen Loh's method, I thought it looked like he was solving it by substitution, just as Tartaglia and Cardano solved cubics with at first before generalizing the formula. Po-Shen Loh's way is cool and nifty even though it's slower than the regular quadratic formula. Plus, there are situations where we'd get solutions in human-friendly form faster using Po-Shen Loh's method than we would with the regular quadratic formula.

@duckymomo7935 2 жыл бұрын

It’s also just quadratic formula but removing the fluff

@HDX30 2 жыл бұрын

Super interesting video! At the end, you do mention how multiplications versus divisions may affect running time, and list that Po-Shen Loh has 3 divisions. However, one of the divisions from Po-Shen Loh is a simple division by 2, which is effectively a multiplication by 0.5. With that said, Po-Shen Loh would have the same addition, 1 less subtraction, 3 less multiplication, the same division, and the same square root! Making it strictly better in terms of operation counts! I'd be curious to understand what you think.

@WhatsACreel 2 жыл бұрын

I found it very hard to count the operations. The negatives can become subtractions, the divisions can be invterted (as you mentioned), and the FMADD instructions can perform add/sub along with a multiplication. My operation counts were a little hit and miss! I think you’re right :)

@FrostMonolith 2 жыл бұрын

I felt that you haven't watch Po-Shen Loh's video at its full, because he explicitly mentions this is NOT his own equation. It's his way to interpret the logic in solving quadratic equations. He even mentions who originally had this idea.

@Aras14 2 жыл бұрын

After a little bit of tinkering, I got the Po Shen method down to: 2 Float subtractions 2 Transfers (a := b) 1 Integers subtraction (from a float to reduce the exponent by 1 aka dividing by 2) 1 XOR (to multiply by -1) 1 Float addition 1 Square root It also only uses 3 Registers. It goes as follows (from x² + Ax + B; A, B, C are floats; p is the number of significand bits of the type of float, l is the size of the total number for example 32, 64): 1. Subtract from A 2^p 2. XOR A with 2^l 3. Set C to A 4. Multiply C by A 5. Subtract from C B 6. Take the square root of C 7. Set B to C 8. Add to B A 9. Subtract from A C

@derfogel3491 2 жыл бұрын

The Po-Shen Loh method is not new. Learned that method over ten years ago in school.

@omegaroguelp 2 жыл бұрын

yea like same

@Kebabrulle4869 2 жыл бұрын

Same. It’s called the pq formula in Swedish schools. It’s basically just dividing the polynomial by a and deriving the quadratic formula again.

@marknn3 2 жыл бұрын

Instead of: b[i] /= a[i]; c[i] /= a[i]; do: float a_inverse = 1.0 / a[i]; b[i] *= a_inverse; c[i] *= a_inverse; This will remove one division and replacing it with a much faster multiply.

@radeklew1 2 жыл бұрын

That replaces a division with TWO multiplication, which... might still be faster? You'd have to benchmark, and it may depend on the CPU.

@teambellavsteamalice 2 жыл бұрын

I like the *= steps. Someone else remarked instead of mid b /= 2 or b *=.5. will using r2 instead of u be faster still? so r2 = sqrt(mid*mid - c) r1 = mid - r2 r2 += mid

@thomasbrotherton4556 2 жыл бұрын

And even faster to not be doing I/O back to the arrays.

@ABaumstumpf Жыл бұрын

And that change will give slightly different results

@ProjectPhysX 2 жыл бұрын

For the red code at 3:30, wouldn't it be faster to compute the sqrt(...) part only once in a variable and then insert the variable twice for the +/- solutions? Or does the compiler do that automatically? Edit: Nevermind, just saw you test this later in the video and the compiler does indeed optimize it :) I personally never trust the compiler :D

@Wyoming307 2 жыл бұрын

is this actually a legal optimisation for the compiler though? if r1 points to the same address as a, b or c then the first computation would change the parameters for the second. So optimising this would lead to the wrong result. I guess if the function is only called once and it knows all the arguments at compile time, then it knows this can't happen, but I'm still surprised it would do this.

@AMan-xz7tx 2 жыл бұрын

love the video, but some of the text is a little hard to read, maybe add a centered shadow text effect to make it pop out a bit, and lighten the text colours just a small amount too?

@sotrh7974 2 жыл бұрын

Yeah the blue on dark gray in particular is really hard to see

@kasuha 2 жыл бұрын

So, you take (A ± B)/C expression, separate it into A/C ± B/C and call it a new revolutionary formula? Seriously? The whole comparison is about how smart is the optimizer in your compiler and how efficiently or inefficiently can you put the two into source code. But in the end of the day, it's exactly the same formula. Programmers go to far greater lengths in deviating from formulae optimized for viewing to formulae optimized for execution when it's really important.

@Daniel_VolumeDown 2 жыл бұрын

@@jeremiahbullfrog9288 idk because i did not wathed but formula was proposed by someone else whose video is in the discription

@JamesJones-zt2yx 2 жыл бұрын

Would really like to see a discussion of whether Po-Shen Loh's method is preferable to the one we all learned from a numerical analysis POV. Not sure error analysis leads to quite as spiffy graphics, though. :)

@corscheid 2 жыл бұрын

This video was absolutely amazing. Beautifully done. WOW.

@GiveAcademy 2 жыл бұрын

Love all the content!!! Reminds me of my random coding adventures! And very inspiring!!! Very happy that you are active again! I am struggling to get myself back up and going.., would love to see a continuation of your GraceAI neural network series!! ❤

@jerryplayz101 2 жыл бұрын

15:09 - my guessing that the compiler is using branch prediction, meaning that the program is already precomputing "a" and prepares for the branch as part of low level optimisation

@mikefochtman7164 Жыл бұрын

Worked on old 32-bit mainframes and we used to pour over our code for hours to hand-optimize the FORTRAN. Looking for division operations that could be replaced with multiplication by an inverse. And the bane of our work were pesky 'implicit type converions' where someone would do something like * instead of *. Yeah, those old compilers weren't very good at optimization so we had to do it ourselves. Love your channel and loved this dive into some interesting bit of math.

@DOROnoDORO 2 жыл бұрын

Love the format, very nice animation, and even nicer maths!

@FaZekiller-qe3uf 2 жыл бұрын

What a great way to pad the runtime

@Andrew90046zero 2 жыл бұрын

Really interesting. I was really hoping Po-Shen Loh was going to be faster across the board :P But the results and the overall lesson of the video was still very interesting. It would be cool to see a video where you only talk about the Op Times table you showed at 26:36 and maybe showing other examples of algorithms and how they stack up in terms of Op Times. On top of that, I also found myself wondering a little bit about "compiler vectorization". Idk if you have any video's on that subject, but I am interested in learning more about compiler vectorization. Especially including parts where you talked about "Structure of Arrays" and how the compiler (I guess) can more easily take advantage of simd instructions.

@mhcbon4606 2 жыл бұрын

at 4:30, friend, the code in blue is unreadable unless i use 1080p.

@Bluesourboy 2 жыл бұрын

I absolutely loved this, thanks for making such great content!

@effektor2000 2 жыл бұрын

I would try to calculate 1 / (2.0) * a[i]) and then do multiplications instead of the two fivisions. Also I aliasing rules prevent further compiler optimizations, making the pointers restrict should help in this case, in particular in Quad2 as it does more memory access.

@andersgoude 2 жыл бұрын

I am a bit curious about the aliasing rules here. If the compiler was reusing the value for the square root in the first case (at around 7:00), wouldn't this be against the aliasing rules, as if r1 would be the same as a, b or c, it would have changed the values for the square root for the second calculation. So the compiler may already assume that there is no aliased pointers

@josephvictory9536 2 жыл бұрын

An idea for future videos: Can you put the standard deviation with the speed result if its less than 2? I found myself really wanting to see it for some of those closer matches. Its only us nerds watching so there is no shame! Awesome video btw!

@kestans 2 жыл бұрын

brain neglects reading darkblue code :D

@algorithminc.8850 2 жыл бұрын

Great channel ... really enjoying it ... thank you. Cheers

@protonmaster76 2 жыл бұрын

It's interesting that the std::complex is so much slower than the scalar. It would have been interesting to see a complex implementation in C that didn't use the std::complex methods.

@zactron1997 2 жыл бұрын

It does make sense that for complex numbers you're looking at much more compute. In the complex domain you have to store 2 values per number for the same level of precision. You'd think that means you only have double the work, but you actually end up with substantially more in certain operations. Consider addition and subtraction: that maps quite nicely to ReC = ReA + ReB, and ImC = ImA + ImB. So double the operations. Now consider multiplication: ReC = ReA × ReB - ImA × ImB, and ImC = ReA × ImB + ImA × ReB. Now you have 4 multiplications instead of 1, and two additions. Division and square roots are even worse. Now granted, you could speed this up by working in polar form, but then you need trig functions to handle addition and subtraction.

@Mernom 2 жыл бұрын

@@zactron1997 How fast is conversion between polar and grid forms?

@protonmaster76 2 жыл бұрын

@@zactron1997 it is true that when operating in the complex domain that some maths operations require a lot more operations than its real counterpart. My point is that when he used std::complex implementation it took 10,000 ms to run as opposed to a few hundred ms. But his asm real vs complex implementations did not have such a wide time difference between the two.

@protonmaster76 2 жыл бұрын

@@Mernom cartesian/polar conversation requires about the same amount of trig as you're trying to avoid in the add and sub.

@monad_tcp 2 жыл бұрын

std::complex is pretty terrible

@mikefochtman7164 Жыл бұрын

The 'blue' version @ 9:26, since 'bt' is only used for 'mid', why not eliminate it completely? Just 'float mid = - b[i] / (a[i] + a[i]). You eliminate a division operation as well as eliminate the storage 'bt' usage (and you don't need the constant '2' anymore). Just seems this code could be 'tweaked' a bit further and maybe beat the original quadratic formula. Later at 26:20, you would then beat the original by having fewer multiplications and equal number of divisions.

@nyaromeneko4455 2 жыл бұрын

For completeness, I would suggest doing the benchmark calculations in both float and double. I would have liked to see complex. 🙃

@falklumo 2 жыл бұрын

Po-Shen Lo did NOT derive a different quadratic formula, his’ is the same! He just derives it a little bit differently. The way he writes the formula just reshuffled common factors.

@93hothead 2 жыл бұрын

Nobody said anything about deriving

@thomasolson7447 Жыл бұрын

Suppose every program is just a number. I'm wondering how that number would be calculated. Is it left to right, or right to left? Then there is the stack and heap. Is there a stage before it enters memory where it is one complete number? Suppose we added 1 to it, where would it be? Is there a halt at the end? Does the increment have to take place before the halt?

@christopherhume1631 2 жыл бұрын

The two algorithms are practically indistinguishable. One can regard the principle difference between them as an algebraically motivated optimization. The main value of this investigation is to underscore the fact that divisions are more expensive than multiplications. This difference would be substantially amplified by the use of double precision math. I haven't looked closely; but it may be possible to reduce the number of divisions in the Po Shen Loh approach. Regardless of approach, trading division for multiplication should improve performance - especially for double precision math. Note, however, I am not referring to constant division by powers of two. Though I am not certain, it is possible those will be converted to arithmetic shifts.

@PhilippeVerdy 2 жыл бұрын

That's no longer true with modern FPUs and GPUs, where divisions are now actually a bit FASTER than multiplications! But this is actually visible only with GPUs, not with existing x64 CPU/CPU where both perform the same in one cycle... except when using "vectorized" expressions: AVX2 processor extensions (or newer), if they are used by your C/C++ compiler or its libraries, may show that divisions CAN perform faster than multiplications! You may also get different figures depending on CPU brands and models (how their internal FPU units are internally built). Do not assume things from the time where divisions were still using our "naive" algorithms (with bit-by-bit compares, shifts, and substraction) when we had no FPU to compute them in a single hardware instruction. As well, there are optimizations in modern FPUs to efficiently compute square roots. Modern FPUS can use "FFT-based" algorithms implemented by hardware (using many more silicon gates but structured in a much larger "tree" or parallel branches, but with lower depth, so with faster total propagation time): this is true for the multiplication, the division, the square root, and other trigonometric/exponential/logarithmic functions natively supported by FPU or GPU instructions.

@PhilippeVerdy 2 жыл бұрын

As well there are new types of processing units, notably for IA and signal processing, using optimization of tensor flows. They start appearing on CPUs for smartphones and will be soon available on desktop PCs, and are already used in large computers performing massive computation (think about what the CERN needs for the LHC, or what Google needs for its IA with BigData, or what is needed in astronomy, fluid mechanic, or computer-aided design, notably in the aerospatial and defense sectors). These processing units have new specialized instructions for efficiently computing "convolution products" and "FFT's" by hardware. Here also, they can accelerate basic divisions faster than basic multiplications (and the difference is even more visible with massively parallel computing, with very large vectors or matrices rather than individual scalars).

@phizc Жыл бұрын

26:12 One of the divisions in Po-Shen Loh is x/2. It can be turned into a multiplication x*0.5. I'm kinda surprised if the compiler doesn't do that. So PSL vs "old" should have 1 less sub, 3 less mul and all the others the same. I.e. 7 clocks less than the "old" in the first table, and 2.5 less in the second table.

@stevegremory3849 2 жыл бұрын

bloody awesome! never knew this existed, extremely cool, hell it'll be useful for school work too ;)

@WhatsACreel 2 жыл бұрын

I reckon! Such a clever formula! Cheers for watching :)

@stevegremory3849 2 жыл бұрын

@@WhatsACreel Cheers :D

@furyzenblade3558 2 жыл бұрын

Your videos are always super interesting! Thank you

@wolfgangreichl3361 Жыл бұрын

The way I learnt it, the expression (double==constant) is a bug anyway. (9:12) Before going to assembly I would have replaced std::complex, which seems to me to be the culprit messing up memory access.

@syntaxerorr 2 жыл бұрын

I love your videos. I don't want to critique your style and this is the first I've seen with this style...but if you speed up all the animations (mainly the countdowns a lot, like 5 seconds less) I think it would flow better. I know animation is hard and a trial by error process sometimes.

@thrownchance 2 жыл бұрын

Which CPU and GPU was used? A speedup of only factor 2 to go to the GPU seems to be barely worth it imho.

@user-sl6gn1ss8p 2 жыл бұрын

specially considering the time it takes to get stuff into the gpu

@Chris-op7yt Жыл бұрын

that's the form for finding x axis intercepts. if just computing the y value for any particular a,b,c and x, then just plug it into y=axx+bx+c. no square roots and no division.

@turun_ambartanen Жыл бұрын

I was excited for a new formula, but it's just the regular one we are taught in school here in Germany :( It's called "pq-Formel", with p=b/a and q=c/a

@Zero-4793 2 жыл бұрын

with Po=Shen Loh's, you don't need to calculate with complex numbers. you just put an i out the front if c^2 < 0: c = |c| ans = ic

@Zooiest Жыл бұрын

How does the tie rule work? Is it something like abs(resultsA[i] - resultsB[i]) < stddev(resultsA)? I'm not really a statistician, but I like learning these things

@CjqNslXUcM 2 жыл бұрын

awesome video, well done!

@XfStef 2 жыл бұрын

Glad round 2 took a Data Oriented approach!

@MichaelJamesActually 2 жыл бұрын

Could you use something like the inverse root hack from quake 3? I'm sure there's other approximations of square roots that could also improve that calc.

@MichaelJamesActually 2 жыл бұрын

Scratch that. Both methods have the same number of square root operations

@PhilippeVerdy 2 жыл бұрын

Is summary, we cannot decide which approach ("original" from maths on real or complex numbers with infinite precision, or "PSL"-optimized) is faster or slower as it largely depends on how you program your code, the programming language you use (notably how it rounds the results per its semantics), the compiler you use (and its compilation options notably about rounding of intermediate results, and how it may cache multiple reads from external structures/variables into local registers), the library you use, and the hardware you use: finally you can only decide after really making some local benchmark. NEVER forget the language semantics (effects of "implicit" roundings, and of "if" branches) when creating ANY implementation with that language.

@mt0software 2 жыл бұрын

I wish this was the kind of game show they'd show on TV. great video

@plokki456 2 жыл бұрын

interesting video! Though I find it hard to believe that the whole gpu is only twice as fast as a single cpu core for such a massively parallel task. Is there some cuda optimization flag that you did not turn on? Also the gpu may not have had the time to increase its frequency to the max if the task was too short. Did you compute the effective bandwidth vs theoretical bandwidth as you did for the cpu test? I'm curious about that as well.

@aliengenie8896 2 жыл бұрын

When you're doing stuff with GPGPU, if you try to simply time the time from when you "submit" the work to the GPU to when it finishes, you end up also measuring the overhead that is involved with deploying kernels and eventually having all of them hit the synchronization point at the end. There's a good chance that the amount of work that was submitted was not even to keep the GPU busy for long enough, and so the vast majority of the time is spent on that overhead. Depending on how things are done, you may end up also measuring the transfer of the inputs/outputs to/from the GPU, which would make the results look even more similar. The "dumb" way to measure how much overhead there is, would be to give it a ridiculously small amount of work, like only one calculation, and see if it ends up being ~75ms again. The better way is that Nvidia supplies tools for measuring how long each part of the process takes. (Launching the kernel vs memory transfers vs the actual work).

@haoming3430 2 жыл бұрын

Where can I download the assembly version of the code? Many thanks!

@ExaltedDuck Жыл бұрын

fascinating. I haven't watched the other video yet to better understand this other method but from the 1-sentence description early on in your video, it doesn't sound like they're doing different things. The description of Po Shen Lo as finding the middle and then adding and subtracting the difference to find the roots... that's exactly what the quadratic formula does. -b/2a is the middle and then the difference is sqrt(b^2-4ac)/2a. You might be able to speed up the quadratic code by a tiny amount by calculating the -b/2a as well as the discriminant. Although in the scalar version, the compiler probably already has. I'm really curious to go see the more explanatory video soon, though. Maybe there's something I'm missing...

@prof.salomonibarra6402 2 жыл бұрын

¡ Excelente trabajo !. You deserve the gold medal in computer olympics. For real numbers OQ performs very well.I'm going to try this on ARM V8 and compare with similar Intel. Cheers

@mattewlefty991 2 жыл бұрын

From a numerical point of view, the quadratic formula might give you roots wich are very distant from the true ones, depending on the values of a,b,c. So we will need to use more different methods for computing the roots. Btw the "red an blue tower thing" was really anxiety-inducing 😂

@user-sl6gn1ss8p 2 жыл бұрын

"From a numerical point of view, the quadratic formula might give you roots wich are very distant from the true ones, depending on the values of a,b,c" Really? Like, in a way that po-shen loh wouldn't?

@mattewlefty991 2 жыл бұрын

@@user-sl6gn1ss8p That's a good question, po's formula divides all the coefficients by a, and then does the calculations (but when abs(a) is small there might be some problems in general). @Creel would you share how you generated the test cases?

@user-sl6gn1ss8p 2 жыл бұрын

@@mattewlefty991 now that you mention it, comparing the results might be fun as well (if fairly slow)

@monad_tcp 2 жыл бұрын

why would it do that ? the roots is basically when the curve crosses the 0. we are talking about a parabola, it must cross the origin

@mattewlefty991 2 жыл бұрын

@@monad_tcp Not all parabolas cross the x axis... But even then there are some operations (like subtraction, sum and division) that introduces errors because the computer works with a finite number of digits

@selvapk5567 2 жыл бұрын

Can anyone explain these instructions which I found in a shellcode document for getting a bind shell? function hashes (executable as nop-equivalent) _emit 0x59 ; LoadLibraryA ; pop ecx _emit 0x81 ; CreateProcessA ; or ecx, 0x203062d3 _emit 0xc9 ; ExitProcess _emit 0xd3 ; WSAStartup _emit 0x62 ; WSASocketA _emit 0x30 ; bind _emit 0x20 ; listen _emit 0x41 ; accept ; inc ecx

@Mernom 2 жыл бұрын

That branch really muddied the water. I think the first test should've been redone with the optimized code.

@asailijhijr 2 жыл бұрын

Is one of those divisions by four? Can you optimise it with a bit shift? Or is that not how floats and std::complex numbers work?

@stevenlynch3456 2 жыл бұрын

The way floats are structured means that bit shifts don't work for division nor multiplication by powers of 2. Floats are structured like this: SignBit PositiveAndBiasedExponent FractionalPart SignBit is 0 for positive float, 1 for negative float PositiveAndBiasedExponent is exponentYouWantToStore+bias = exponentYouWantToStore+127 FractionalPart is like the .345 in the number 12.345 SignBit is 1 bit long. PositiveAndBiasedExponent is 8 bits long. FractionalPart is 23 bits long. Why these kinda random #s? It's standardized and together they use exactly 32 bits, which is a nice power of 2. Example float in memory: 1 10001000 11000000000000000000000 SignBit = 1, so float is negative 10001000 = 136. However, to get the exponent we intended, subtract the bias to get 136-127=9 11000... = (1/2)+(1/4) = .75 So the example float in memory is -1.75*2^(9) = -1.75*512 = -896.0 1 10001000 11000000000000000000000 = -896.0 Why 1.75 instead of .75? The leading 1 is always present in binary scientific notation, and the value 1 is the same no matter which base you're in (except Base 0 and Base 1, which are extraordinarily difficult/annoying to deal with). Now imagine shifting it to the right because you think dividing a float by 2 is the same as dividing an integer by 2. What you're doing is moving the sign bit into the exponent, dividing the EXPONENT by 2, and then (what your intention originally was) dividing the FractionalPart by 2. 1 10001000 11000000000000000000000 shifted to the right by 1 is 0 11000100 01100000000000000000000 Putting this into www.h-schmidt.net/FloatConverter/IEEE754.html shows that it's equal to 8.11656739243e+20 -896 is pretty different when compared to +8 with 20 digits after it. HOWEVER, if you could find a way to decrease the exponent by 1 (i.e. subtract place value of 1 from the exponent) WITHOUT modifying any other part of the float, you would successfully divide by 2.

@DanikaOliver 2 жыл бұрын

I can't see the code, the contrast is not good. The letters are small.

@CanKarakuzulu 2 жыл бұрын

Just trying, but using branched code in cuda code severly slows things down. Maybe you could try branchless version of po-shen lo cuda?

@MrRyanroberson1 2 жыл бұрын

4:23 already i see a problem. The +/- part should be computed once and reused, not just computed once for each root

@Misteribel 2 жыл бұрын

I might have missed it in the video, but did you consider rsqrtss or vrsqrtss? You’d need to change the algorithm, but it may improve speed of your slowest instruction: square root. Maybe you used it in the simd variant asm version.

@TerjeMathisen 2 жыл бұрын

First, the two algorithms really are not separate: They have exactly the same numerical qualities and effectively do the exact same operations, only in a slightly modified order! On a modern CPU, sqrt() and FDIV are both far slower than fadd/fsub/fmul & fmadd, so I would start by calculating the a_reciprocal = (1/a) term: The FDIV in that operation will then be nearly or completely free because it can overlap with all the rest of the calculations. In the PSL version we immediately use the result of that division to scale b and c, which simplifies the logic but generates a latency barrier. If we instead calculate half_a_recip = (0.5/a) then we can multiply the final answers by that value. The sqrt term is sqrt(b*b-4*a*c), here the b2=b*b term and the ac=a*c term can run at the same time, then we combine them in tsqrt=sqrt(fmadd(ac,4.0,-b2)). By the time that has finished, the fdiv will be be all done, so we just have to do the final res0 = (-b + tsqrt) * half_a_recip and res1 = (-b-tqrt) * half_a_recip calculations which will also overlap. Since sqrt is slightly slower than fdiv, we can in fact delay the half_a_recip calculation a smidgen, starting it just after the b2=b*b; and ac=a*c; operations, it will still finish before the sqrt. On an x64 cpu with at least two fmadd pipelines, but in scalar mode, these calculations should take ~40 clock cycles (I'm assuming 5-cycle fmadd and 20-cycle fsqrt).

@cH3rtzb3rg 2 жыл бұрын

I doubt the latency of the instructions matters at all in case you calculate millions of independent roots -- only instruction throughput and port-usage should matter (and in fact memory throughput). Instruction latency would matter, if the output of the previous calculation influences the input of the next.

@TerjeMathisen 2 жыл бұрын

@@cH3rtzb3rg I mostly agree: When processing huge amounts of data, typically more than last level cache, almost any calculation will be totally dominated by memory bandwidth. OTOH, that is pretty much unimagninable for the case of solving quadratic roots. :-)

@cH3rtzb3rg 2 жыл бұрын

@@TerjeMathisen I also agree. The benchmark in the video is pretty much meaningless for most practical purposes. One could compare the latency of the algorithms by calculating new coefficients based on the previous roots (that way, there would also be no memory bottleneck).

@zaydabbas1609 2 жыл бұрын

Just to be clear here, these are algebraically identical. There was no 'new math' discovered here as some people claim. This is merely an optimization for computation

@armandosreis 2 жыл бұрын

I don't get how you can get rid of the if branch.... It says if a is not 1 then you adjust b and c. so for all values of a that are not 1 you need that adjustment. How can you just remove it? That's changing the formula. Am I missing something here?

@ABaumstumpf Жыл бұрын

The branch said to not do the division if it would divide by 1 " which is completely irrelevant cause dividing by 1 does not change the result but having a branch can seriously impact performance

@boelwerkr 2 жыл бұрын

That "if" breaks a lot of processor optimizations and it's not needed i think. if "quads[i].a" is 1.0f the equation inside the if is: b/=1.0f; c/=1.0f which didn't change the values. This would speed up the code a lot. With that it should have a fighting chance. The slowest part of both codes is the "sqrt" it is a magnitude slower than any other operation in that code. So using it as little as possible is always preferable. So this code: void Po_Shen_Loh_Simple_AoS(QuadStruct* quads, int count){ for(int i =0; i< count; i++){ float mid = ( quads[i].b / quads[i].a ) * -0.5f float u = sqrt(mid * mid - quads[i].c / quads[i].a ); quads[i].r1 = mid + u; quads[i].r2 = mid - u; } } should be faster if i haven't made a stupid logical error. 🙂

@cH3rtzb3rg 2 жыл бұрын

Why would the original Po-Shen Loh code check for a!=1, but no code actually checked for a!=0?

@KerbalLauncher 2 жыл бұрын

What were the compiler settings?

@rogo7330 2 жыл бұрын

Is storing computation result into previously defined variable (thinking that we won't need its content anymore) faster than defining new (local to function) variable? And does this even matter?

@cH3rtzb3rg 2 жыл бұрын

For the compiler it should not matter at all, at least for simple types. Do whatever is easier to read.

@monad_tcp 2 жыл бұрын

@@cH3rtzb3rg its usually easier to store duplicated terms in a variable to make the code simpler, not for performance, the compiler can do that

@cH3rtzb3rg 2 жыл бұрын

@@monad_tcp I think Rogo was asking about code like: int x1 = SOME_EXPRESSION; // use x1 int x2 = SOME_OTHER_EXPRESSION; // use x2 If one should reuse x1 instead of introducing x2 for that.

@monad_tcp 2 жыл бұрын

@@cH3rtzb3rg oh, in that case it really doesn't matter, variables are renamed either way, the compiler either do register allocation or store in the stack. After data-flow analysis, it will reuse stack variables if possible. Unless your compiler is really, really dumb. Like Microchip ones.

@JustATempest Жыл бұрын

I'd be interested and you using the fast square root algorithm for quake as a substitute for the square root for both of these

@guylaingreer4190 2 жыл бұрын

I didn't catch if you mentioned it, but when you switched to std::complex were the coefficients complex or were they still real?

@etopowertwon 2 жыл бұрын

At differences so low you more quickly run into what compiler can optimize better at given flags (O0 vs O1 vs O3, -ffast-math/-fno-fast-math) For example: final table is unoptimized. Nobody counts 2a as (2*a). It's (a+a). (Unless compiler would decide that 0.5 * X/a is faster than X/(a+a) -- it generally shouldn't, floats have no instruction for quickly halving a value)

@ABaumstumpf Жыл бұрын

Changing a float by a factor of 2 is a single addition/subtraction

@PplsChampion 2 жыл бұрын

"benchmarking as entertainment" is next to godliness

@camofelix 2 жыл бұрын

I didn’t notice what chip you’re running this on! If it’s AVX512 with FP16 (early alder lake with ecores off) enabled there’s some SIMD fast paths for complex calculations that can do it all much faster, and because you’re using a smaller data type (fp16 instead of fp32) you get much better bang per buck on memory bandwidth

@monad_tcp 2 жыл бұрын

its an x86 probably

@michaelwilkes0 Жыл бұрын

just getting close to such a tiny we ll established formula is amazing. epic.

@decky1990 2 жыл бұрын

Could you do a bit shift right (SHR?) to help reduce the clock cycles for dividing by 2 or would it need to be an integer to preserve accuracy? Thanks for putting up these videos - desperately trying to finish a new book on C++ for the 3rd time and you’re my only real motivation at the minute.

@andersgoude 2 жыл бұрын

This only works for integers. If you really want to achieve this through bit manipulation with a float, what you could do is something like float a = 5; *reinterpret_cast(&a) -= 0x800000; and a will now have the value 2.5 instead (assuming that int is 32 bit) What this code is doing is that it decreases the exponent of the floating point number by 1, which mathematically is the same as dividing by 2. (check the IEEE 754 standard for more details how floating point numbers are stored in memory)

@decky1990 2 жыл бұрын

@@andersgoude thanks for sharing, Anders! Will check this out further, thank you

@andersgoude 2 жыл бұрын

@@decky1990 I just want to remark that my example is mainly for educational purposes, and is generally not something that is recommended. The normal solution for divisions with a constant is to replace the expression a/2 with a*(1.0/2.0) (or just write a*0.5 which will generate the same code). This works for any constant and for most modern systems, multiplications are very fast operations, compared to divisions

@disasterarea9341 2 жыл бұрын

hard to see the blue code around 4 minutes in, dark text on dark background :x

@dananskidolf 2 жыл бұрын

I think processor architecture plays a big role in this. The operation times depend on FPU implementation - for example Intel improved division by a factor of two and sqrt by 4 in the Penryn architecture if I remember correctly. More cache or memory bandwidth will target the key bottleneck here - that's probably the main gain from using a GPU. Branch prediction is powerful, but difficult and I think still evolving, so this affects different CPUs differently, as well as depending on the data fed in, and the order it's in. (The branch was only there to demonstrate bad code but I thought I'd include it.)

@barakap2372 Жыл бұрын

Hey friend. Why didnt you upload for such a long time? You were a reallllllllly good introduction for me for assembly C++ and in general compsci. Please reply

@devinschlegel1763 2 жыл бұрын

are there any plans to continue the neural net series?

@beaconofwierd1883 2 жыл бұрын

At first I was angry with you for implementing it using a branch, the I was very happy that you did so because it made me feel good spotting the issue from the start :D So thank you for making me feel clever ;D

@Verrisin 2 жыл бұрын

9:20 - isn't that if wrong to begin with? It's supposed to check it's not 0, in which case maybe use an entirely different algorithm.

@felixstuber8046 Жыл бұрын

You can substitute one division in the po-shen lo formula by a multiplication if you instead of dividing b by a and then by 2 simply devide it by 2*a. On the other hand, multiplication and division by 2 shouldn't be a problem at all for a binary machine.

@lordadamson 2 жыл бұрын

man I love your videos haha

@dannygjk 2 жыл бұрын

Why are you using floating point?

@rafagd 2 жыл бұрын

4:10 - I was going to complain you were calculating sqrt(b - 4ac) twice for no reason on the original way, but I imagine the compiler should be smart enough to optimize that one away...

@emjizone Жыл бұрын

3:31 In your red code there is no reason I know to compute 2*a twice per loop cycle, nor to compute (b*b-4*a*c)^(0.5) twice per loop cycle ! Here you aren't simply implementing different mathematical approches. You are also comparing opposed computation implementation patterns. :/

@badhombre4942 Жыл бұрын

√(b²-4ac)/(2a) == √((b/2a * b/2a) - c/a) == u. Therefore both formulas are the same. If a is > 1 in the majority of your dataset, then the if block adds 2 division operations. Your red code can be optimized to make it even faster.