Would probably be faster to just return 1 for every entry. It would be accurate enough in astronomical scales.
@rayniac211 · 2 years ago
Unless your quadratics are astronomical in nature ;)
@WitherLele · 2 years ago
@@rayniac211 then it would be accurate on a bigger scale
@RobertLoweIsAwesome · 2 years ago
@@WitherLele kzbin.info/www/bejne/f3TZZ3qCrNlkna8
@airkami · 7 months ago
You remind me of the algorithm I designed to check an array of positive integers and return true for each value divisible by 3. My algorithm: return false. It is fast, it doesn't contain unpredictable failure, and it is correct most of the time, given the set of all positive integers.
@josemonge4604 · 2 years ago
Dark background and blue aren't a good color combination for readable text. Try white or yellow next time, or something like that. Anyway, great comparison and great job on the optimizations!
@HansLemurson · 2 years ago
If you want Blue to be seen, add some Green!
@firstname4337 · 2 years ago
LOL, he said "blue code" and I said "what code?" I had to change the brightness of my monitor to see it.
@sdjhgfkshfswdfhskljh3360 · 2 years ago
Not only is it bad by itself, it is also further damaged by YouTube's compression algorithms.
@NoNameAtAll2 · 1 year ago
yellow on white? ha ha
@professortrog7742 · 1 year ago
@@NoNameAtAll2 no, on the dark background
@47Mortuus · 2 years ago
9:35 don't divide by 2, multiply by 0.5. The compiler is not allowed to do so, as it will sometimes produce different bitwise results... unless you have fast unsafe math enabled, which will also allow FMA reordering, which is a pretty significant optimization...

Also, much of the runtime is actually float division. It's not pipelined, and you can only start a division after another has finished, which is 14 clock cycles at best. That is _extremely_ relevant when you can replace an FP division with "RCP_PS" plus some Newton-Raphson, which has _WAY_ higher throughput than FP division (relevant for tight loops such as this one), and can be FMA-fused when multiplying by the reciprocal instead.

Here's the code for an accurate reciprocal (only _possibly_ worth it if your CPU supports FMA; total latency (fastest possible): 4 (rcp) + 2 * 4 (FMA) = 12 cycles vs 11 cycles (fastest possible division), but way higher throughput and FMA "fusable", meaning up to another 3 clock cycles saved):

__m128 rcp23_ps(__m128 a)
{
    const __m128 ONE = _mm_set1_ps(1.0f);
    __m128 r = _mm_rcp_ps(a);
    return _mm_fnmadd_ps(_mm_fmsub_ps(r, a, ONE), r, r);
}
@DDlol01 · 2 years ago
I was looking for this one. I also want to add: even if fast unsafe math is enabled, on some compilers (MSVC: /fp:fast) you should also enable optimizations for speed so constants are rewritten as needed (done when using /O2 or /Ot but NOT /Ox). /Ot is the default; /O2 does some more, afaik. (No in-depth knowledge, use at your own risk. Please correct me if I'm wrong.)
P.S.: AFAIK in GCC constant rewriting is auto-enabled when using fast unsafe math (-ffast-math), and optimization levels do not affect constants; -ffast-math takes precedence?
@cH3rtzb3rg · 2 years ago
Dividing by powers of two can and will actually be replaced by multiplying with the inverse by the compiler, even without fast-math flags, since the result will be exactly the same. One could also replace the divisions by a with multiplications by (1/a) -- that's a transformation the compiler won't do in exact-math mode. In fast-math mode, it may even use the fast rcpps (reciprocal packed single) instruction with a refinement step to do this faster.
@JamesEdwards780 · 2 years ago
bit shift >> may be faster still
@vytah · 2 years ago
@@JamesEdwards780 You can't just bitshift floats.
@PremierSullivan · 2 years ago
@@vytah Not with that attitude!
@pyromen321 · 2 years ago
I'm guessing the new version might actually use less power! They should be practically the same speed due to ILP, but the new version requires fewer calculations.
@ABaumstumpf
But it simply does not take less computational power.
@kingacrisius · 1 year ago
@@ABaumstumpf I believe they mean literal power, like electricity. It might be less expensive in terms of electricity.
@pyromen321 · 2 years ago
26:24 it's worth noting that because you're dividing by a constant power of two for the third div in the new version, all modern compilers will optimize that to a single mul instruction. I'm almost certain that -x/y will always become an xor and a mul when y is a power of two.
Edit: I don't want to litter the comment section too much, so I'll just put this here: awesome video, man! Some of my work involves optimization, and I absolutely love when other people make this knowledge super accessible!
@WhatsACreel · 2 years ago
I think you're right about the division being swapped for a mul! I can't remember what I did in the ASM, but hopefully a compiler would figure that one out... Cheers for watching mate, have a good one :)
@PhilippeVerdy · 2 years ago
An actual implementation would distinguish 4 main cases, depending on whether 'a' is zero or not, and whether 'b' is zero or not:
- 1. When (a=b=0): if (c=0) every x is a solution, otherwise there's no solution (r1=r2=NaN).
- 2. Otherwise, when only (a=0): a single solution (r1=r2=-c/b).
- 3. Otherwise, when only (b=0): two solutions of opposite signs (r2=-(r1=√(-c/a))), real or imaginary depending on the sign of -c/a.
- 4. Otherwise, we have the two quadratic roots, possibly complex depending on the sign of the discriminant (b²-4ac).
For the last 4th case, you need **numeric stability** (due to the limited precision of the "float", "double" or "long double" types used by the FPU), and the two quadratic solutions should be given by (r{1,2}=[±√(1-4ac/b²)-1]*b/2a), where the square root may be imaginary if the "normalized" discriminant (1-4ac/b²) is negative, i.e. when the usual discriminant (b²-4ac) is negative (both conditions are equivalent because b≠0 in that 4th case). This gives a total of 7 cases, where conditionals (if statements) are justified.
But the test for b=1 in this video is just useless and stupid (it tries to avoid a trivial division by 1, but at the cost of a branch, which disables optimization or parallelization by the compiler: divisions by 1 are trivial in FPUs too, and may cost just 1 cycle, LESS than the cost of a branch).
@PhilippeVerdy · 2 years ago
(-x/y) cannot be "optimized" with an xor and a mul: we're working here with floating-point values, not integers! Negating a floating-point value just flips a single bit, the sign bit; it does not perform any "xor" on the whole value (the exponent and mantissa parts are not modified), and this can be done before the division on either operand x or y, or on the result. In fact the result sign is sign(x) XOR sign(y), and the absolute quotient is the separate division abs(x)/abs(y). To compute that (positive) quotient of two positive floating-point values:
- the FPU separately computes the quotient of the two mantissas, which are in (0.5, 1.0], so that the non-normalized quotient is in (0.5, 2], and the difference of the exponent parts;
- finally it renormalizes the quotient into (0.5, 1.0] and accordingly adjusts the difference of the exponent parts by conditionally subtracting 1.
The final result may become a denormal (or zero) floating-point value, or an "infinite" one, depending on whether the final exponent underflows or overflows its allowed range, but it will be a "NaN" (or would raise a "division-by-zero" exception) if both mantissa parts (including the implicit most significant bit if it's not stored in the floating-point format) are zero. In all cases, the result sign computed separately is blindly applied as-is (possibly giving "positive or negative zero", "positive or negative NaN", or "positive or negative infinite" values; some libraries may or may not discard the sign distinction for zeroes and NaNs, but at some extra cost using conditional branches).
@pyromen321 · 2 years ago
@@PhilippeVerdy, you probably should have checked godbolt before that whole explanation. I just confirmed that on some compilers it becomes a mul and xor, on others it just becomes a mul by -0.5. Remember, in this case we are dividing by a positive constant power of two and negating. This means the compiler can take shortcuts.
@PhilippeVerdy · 2 years ago
@@pyromen321 if you are dividing a float by a negative power of two, you don't need a mul, just a decrement of the exponent part and a flip of the sign bit. That takes two instructions, worse than a single FPU instruction performing one multiplication by a constant like -0.5 (consider the data path on the same word containing the sign bit and the exponent; plus you need to test for underflow when decrementing the exponent, or you'll get wrong results!)
@bronzecomeshome9517 · 2 years ago
My man really putting the work in on the Blender effects 😂 but damn I was so shocked to see deleting a useless branch cut the processing time in half
@protonmaster76 · 2 years ago
He's done an interesting video on branchless coding. In modern CPUs, a branch can be very expensive.
@bronzecomeshome9517 · 2 years ago
@@protonmaster76 I've seen it and enjoyed it, I just assumed that as a false branch it would have been optimized out by the compiler.
@chainingsolid · 2 years ago
The combo of a branch + floating point may have resulted in the compiler not realizing it was redundant. And a branch that probably didn't predict very well (it's based solely on incoming data) may have done major performance damage. It would be cool to see the pattern of which way the branch goes, given it's based solely on the input. I wonder if "sorting" the input to make the branch predictable would let the CPU's branch predictor skip the cost of the divide for some of the solve attempts and thus maybe go even faster in those cases. @Creel looking at the branch predictability could be an interesting video.
@WolfgangBrehm · 2 years ago
4:31 skip the if. Just always divide. If the coefficients are random, the probability of a being equal to 1 is very slim. Also, it might be worthwhile to multiply by the precomputed inverse of a.
@rrshier · 1 year ago
I was just thinking what you put down here. The if statement is probably his biggest adder of time here! AND vectorization usually wants pure math, nothing with conditionals.
@larswilms8275 · 2 years ago
Just so everybody knows: the OG quadratic formula is the closed form of the Po-Shen Loh algorithm. The PSL algorithm is simpler for a mere human to understand and perform, but I would not expect optimized code for the two variants to differ all that much.
@alexwolski3344 · 2 years ago
What a great idea! Gamifying the benchmarks makes it so much fun. Love the animations and benchmarking song. Really interesting stuff.
@matthewkolar7560 · 2 years ago
CPUs & compilers can vectorize Array-of-Structures layouts as well; the difference between this and SoA is more about optimizing cache access patterns for functions that don't use every element in the structure. You should double-check your compiler optimization settings if you are getting wildly differing results from a structure this small.
@sabriath · 2 years ago
CPU latencies on modern computers are roughly:
fadd 4 (as well as fsub)
fmul 4
fdiv 11-25
fsqrt 35
At the very least, this puts the original at between 97 and 125 cycles of latency, while the new formula takes between 88 and 130. This means that naively programmed it is both slower and faster, depending on division requirements. The issue I have with your assessment at the end is that you say the new formula takes 3 divisions; it actually takes 2, to normalize the equation by the first coefficient. You are dividing by 2.0f, which wastes 11-25 cycles of latency on-chip where you could simply multiply by 0.5f, which is 4. Even better, floats are stored in powers of 2, so dividing by 2 is just decrementing the exponent, which is 1 cycle for an integer decrement before the float is even loaded into an FPU register. Some compilers will notice the 2.0f division and switch to an fmul by 0.5f for speed, but they won't figure out that you could do a dec before any of it, which means you didn't actually test the fastest code.
If done correctly, the new method should use 1xDEC (1), 2xFDIV (22-50), 1xFMUL (4), 1xFSQRT (35), and 3xFADD/FSUB (12), which comes to 74-102, faster than the original by 21%. Also, complex numbers come out correctly formatted automatically with the new method; the old method has a division after the sqrt, which takes longer to simplify (squaring it, then dividing the sqrt answer by it to get the complex part formatted). You can make the code a little faster still by taking the reciprocal of the leading coefficient (1xFDIV) and multiplying it into the other coefficients, but both methods can use that trick... despite that, the new method becomes almost 23% faster. The real bottleneck is the square root; it is the most expensive part in both scenarios. If we could get that down to 15 cycles of latency, the new method would be 28% faster than the old.
@procedupixel213 · 2 years ago
If the least significant bit of precision can be sacrificed in the new method, the two divisions by a can be replaced by one reciprocal (i.e. 1.0/a) and two multiplications with that reciprocal. Thus one division can be traded for two multiplications, which may be faster.
@SmallSpoonBrigade · 2 years ago
Honestly, that's the real question here: how much accuracy are you allowed to sacrifice in order to speed things up? I'm taking accounting courses right now and there are a bunch of shortcuts taken for similar reasons. The result is somewhat different from what it would be with proper math, but it's usually by a small amount, and you're rarely dealing with more than 4 decimal places when dealing with currency anyway. Plus, since people are using one of two sets of rules for how these things are done, there usually isn't much trouble.
@attilatorok5767 · 2 years ago
Accuracy is also why std::complex is usually slow. The C++ standard library has to work in all cases, and for complex numbers not sacrificing precision means much slower code. You also can't use a fast-math-optimized version because the standard library is precompiled (although you can work around this by compiling a libstdc++ with fast-math and statically linking it, it would probably still perform worse than hand-written speed-for-accuracy-optimized code). I would personally not recommend std::complex in any case unless you are sure you need super-duper accurate answers, but then you would probably go with an arbitrary-precision library which has complex numbers.
@teambellavsteamalice · 2 years ago
Isn't the step B = b/a and then mid = B/2 the real waste here? You only need mid.
Also you can use the trick from the quadratic version, where r2 is reused for the discriminant: r1 = (-b + r2)/2a and r2 = (-b - r2)/2a.
For the Po-Shen Loh it works even better:
mid = b/2a
c = c/a
r2 = sqrt(mid*mid - c)
r1 = mid - r2
r2 = mid + r2
The last step, I imagine, can be faster as it's adding mid in place (r2 += mid?).
This should always beat the quadratic. Would using reciprocals be even faster?
@shanehebert396 · 2 years ago
For the CPU/FPU code, if the bottleneck is reading from RAM, you might try putting in some prefetch loads of the next set of data (particularly for your ASM code). It may not make much difference, though, given the loads are a cache line at a time. I remember way back in the day we were looking at the Fastest Fourier Transform in the West (FFTW) on the UltraSPARC and telling the C compiler to heavily optimize it (including static profile-guided optimization), and the ASM it generated was pretty slick, including prefetch loads so the data for the next loop iteration was in L1 when it was needed. If the compiler you used didn't use PGO but has the option, it might be interesting to see if it helps. Also, the dark blue text used for the Po-Shen Loh code is pretty hard (for me at least) to read. The red was OK, even if a little dark.
@booom4849 · 2 years ago
Manual prefetching will be slower because there are now additional instructions to be executed. He is just iterating over arrays, the hardware prefetcher likes that pattern.
@SpontaneousProcess · 2 years ago
As a tip: at 3:45 the contrast with the blue text is so low that I can’t read it. So I’m following along with your commentary by imagining the route in my head rather than reading along with you! EDIT: ok so this happens a lot in this video.
@astrobullivant5908 · 2 years ago
When I first saw Po-Shen Loh's method, I thought it looked like he was solving by substitution, just as Tartaglia and Cardano first solved cubics before generalizing the formula. Po-Shen Loh's way is cool and nifty even though it's slower than the regular quadratic formula. Plus, there are situations where we'd get solutions in human-friendly form faster using Po-Shen Loh's method than we would with the regular quadratic formula.
@duckymomo7935 · 2 years ago
It's also just the quadratic formula with the fluff removed
@HDX30 · 2 years ago
Super interesting video! At the end you mention how multiplications versus divisions may affect running time, and list Po-Shen Loh as having 3 divisions. However, one of those divisions is a simple division by 2, which is effectively a multiplication by 0.5. With that said, Po-Shen Loh would have the same additions, 1 fewer subtraction, 3 fewer multiplications, the same divisions, and the same square root, making it strictly better in terms of operation counts! I'd be curious to hear what you think.
@WhatsACreel · 2 years ago
I found it very hard to count the operations. The negatives can become subtractions, the divisions can be inverted (as you mentioned), and the FMADD instructions can perform an add/sub along with a multiplication. My operation counts were a little hit and miss! I think you're right :)
@FrostMonolith · 2 years ago
I feel you haven't watched Po-Shen Loh's video in full, because he explicitly mentions this is NOT his own equation. It's his way of interpreting the logic of solving quadratic equations. He even mentions who originally had the idea.
@Aras14 · 2 years ago
After a little bit of tinkering, I got the Po-Shen Loh method down to:
2 float subtractions
2 transfers (a := b)
1 float multiplication
1 integer subtraction (from a float, to reduce the exponent by 1, aka dividing by 2)
1 XOR (to multiply by -1)
1 float addition
1 square root
It also only uses 3 registers. It goes as follows (from x² + Ax + B; A and B are floats, C is a temporary; p is the number of significand bits of the float type, l is the total size, for example 32 or 64):
1. Subtract 2^p from A
2. XOR A with 2^(l-1)
3. Set C to A
4. Multiply C by A
5. Subtract B from C
6. Take the square root of C
7. Set B to C
8. Add A to B
9. Subtract C from A
@derfogel3491 · 2 years ago
The Po-Shen Loh method is not new. Learned that method over ten years ago in school.
@omegaroguelp · 2 years ago
yea like same
@Kebabrulle4869 · 2 years ago
Same. It’s called the pq formula in Swedish schools. It’s basically just dividing the polynomial by a and deriving the quadratic formula again.
@marknn3 · 2 years ago
Instead of:
b[i] /= a[i];
c[i] /= a[i];
do:
float a_inverse = 1.0f / a[i];
b[i] *= a_inverse;
c[i] *= a_inverse;
This removes one division, replacing it with a much faster multiply.
@radeklew1 · 2 years ago
That replaces a division with TWO multiplications, which... might still be faster? You'd have to benchmark, and it may depend on the CPU.
@teambellavsteamalice · 2 years ago
I like the *= steps. Someone else remarked that instead of mid you can do b /= 2, or b *= .5.
Would using r2 instead of u be faster still?
So:
r2 = sqrt(mid*mid - c)
r1 = mid - r2
r2 += mid
@thomasbrotherton4556 · 2 years ago
And even faster to not be doing I/O back to the arrays.
@ABaumstumpf · 1 year ago
And that change will give slightly different results
@ProjectPhysX · 2 years ago
For the red code at 3:30, wouldn't it be faster to compute the sqrt(...) part only once into a variable and then use the variable twice for the +/- solutions? Or does the compiler do that automatically? Edit: Never mind, I just saw you test this later in the video, and the compiler does indeed optimize it :) I personally never trust the compiler :D
@Wyoming307 · 2 years ago
Is this actually a legal optimisation for the compiler though? If r1 points to the same address as a, b or c, then the first computation would change the parameters for the second, so optimising this would give the wrong result.
I guess if the function is only called once and the compiler knows all the arguments at compile time, it knows this can't happen, but I'm still surprised it would do this.
@AMan-xz7tx · 2 years ago
love the video, but some of the text is a little hard to read, maybe add a centered shadow text effect to make it pop out a bit, and lighten the text colours just a small amount too?
@sotrh7974 · 2 years ago
Yeah the blue on dark gray in particular is really hard to see
@kasuha · 2 years ago
So, you take an (A ± B)/C expression, separate it into A/C ± B/C, and call it a new revolutionary formula? Seriously?
The whole comparison is about how smart the optimizer in your compiler is and how efficiently or inefficiently you put the two into source code. But at the end of the day, it's exactly the same formula. Programmers go to far greater lengths in deviating from formulae optimized for viewing toward formulae optimized for execution when it's really important.
@Daniel_VolumeDown · 2 years ago
@@jeremiahbullfrog9288 idk because I did not watch it, but the formula was proposed by someone else, whose video is in the description
@JamesJones-zt2yx · 2 years ago
Would really like to see a discussion of whether Po-Shen Loh's method is preferable to the one we all learned from a numerical analysis POV. Not sure error analysis leads to quite as spiffy graphics, though. :)
@corscheid · 2 years ago
This video was absolutely amazing. Beautifully done. WOW.
@GiveAcademy · 2 years ago
Love all the content!!! Reminds me of my random coding adventures! And very inspiring!!! Very happy that you are active again! I am struggling to get myself back up and going... would love to see a continuation of your GraceAI neural network series!! ❤
@jerryplayz101 · 2 years ago
15:09 - my guess is that the compiler is relying on branch prediction, meaning that the program already precomputes "a" and prepares for the branch as part of low-level optimisation.
@mikefochtman7164 · 1 year ago
Worked on old 32-bit mainframes, and we used to pore over our code for hours to hand-optimize the FORTRAN, looking for division operations that could be replaced with multiplication by an inverse. And the bane of our work was pesky 'implicit type conversions', where someone would do something like * instead of *. Yeah, those old compilers weren't very good at optimization, so we had to do it ourselves. Love your channel and loved this dive into an interesting bit of math.
@DOROnoDORO · 2 years ago
Love the format, very nice animation, and even nicer maths!
@FaZekiller-qe3uf · 2 years ago
What a great way to pad the runtime
@Andrew90046zero · 2 years ago
Really interesting. I was really hoping Po-Shen Loh would be faster across the board :P But the results and the overall lesson of the video were still very interesting.
It would be cool to see a video where you only talk about the op-times table you showed at 26:36, maybe showing other examples of algorithms and how they stack up in terms of op times.
On top of that, I also found myself wondering about "compiler vectorization". Idk if you have any videos on that subject, but I'm interested in learning more about it, especially the parts where you talked about "Structure of Arrays" and how the compiler can (I guess) more easily take advantage of SIMD instructions.
@mhcbon4606 · 2 years ago
At 4:30, friend, the code in blue is unreadable unless I use 1080p.
@Bluesourboy · 2 years ago
I absolutely loved this, thanks for making such great content!
@effektor2000 · 2 years ago
I would try to calculate 1 / (2.0 * a[i]) and then do multiplications instead of the two divisions. Also, aliasing rules may prevent further compiler optimizations; making the pointers restrict should help in this case, in particular in Quad2 as it does more memory accesses.
@andersgoude · 2 years ago
I am a bit curious about the aliasing rules here. If the compiler was reusing the value of the square root in the first case (at around 7:00), wouldn't that be against the aliasing rules? If r1 were the same as a, b or c, it would have changed the values for the square root in the second calculation. So the compiler may already assume that there are no aliased pointers.
@josephvictory9536 · 2 years ago
An idea for future videos: can you show the standard deviation with the speed result if it's less than 2? I found myself really wanting to see it for some of the closer matches. It's only us nerds watching, so there is no shame! Awesome video btw!
@kestans · 2 years ago
brain neglects reading darkblue code :D
@algorithminc.8850 · 2 years ago
Great channel ... really enjoying it ... thank you. Cheers
@protonmaster76 · 2 years ago
It's interesting that the std::complex is so much slower than the scalar. It would have been interesting to see a complex implementation in C that didn't use the std::complex methods.
@zactron1997 · 2 years ago
It does make sense that for complex numbers you're looking at much more compute. In the complex domain you have to store 2 values per number for the same level of precision. You'd think that means you only have double the work, but you actually end up with substantially more in certain operations. Consider addition and subtraction: that maps quite nicely to ReC = ReA + ReB, and ImC = ImA + ImB. So double the operations. Now consider multiplication: ReC = ReA × ReB - ImA × ImB, and ImC = ReA × ImB + ImA × ReB. Now you have 4 multiplications instead of 1, and two additions. Division and square roots are even worse. Now granted, you could speed this up by working in polar form, but then you need trig functions to handle addition and subtraction.
@Mernom · 2 years ago
@@zactron1997 How fast is conversion between polar and grid forms?
@protonmaster76 · 2 years ago
@@zactron1997 It is true that some operations in the complex domain require many more operations than their real counterparts. My point is that his std::complex implementation took 10,000 ms to run as opposed to a few hundred ms, yet his real vs complex ASM implementations did not have such a wide time difference between them.
@protonmaster76 · 2 years ago
@@Mernom Cartesian/polar conversion requires about the same amount of trig as you're trying to avoid in the add and sub.
@monad_tcp · 2 years ago
std::complex is pretty terrible
@mikefochtman7164 · 1 year ago
The 'blue' version @ 9:26: since 'bt' is only used for 'mid', why not eliminate it completely? Just 'float mid = -b[i] / (a[i] + a[i])'. You eliminate a division operation as well as the 'bt' storage (and you don't need the constant '2' anymore). It just seems this code could be tweaked a bit further and maybe beat the original quadratic formula. Later, at 26:20, you would then beat the original by having fewer multiplications and an equal number of divisions.
@nyaromeneko4455 · 2 years ago
For completeness, I would suggest doing the benchmark calculations in both float and double. I would also have liked to see complex. 🙃
@falklumo · 2 years ago
Po-Shen Loh did NOT derive a different quadratic formula; his is the same! He just derives it a little differently. The way he writes the formula just reshuffles common factors.
@93hothead · 2 years ago
Nobody said anything about deriving
@thomasolson7447 · 1 year ago
Suppose every program is just a number. I'm wondering how that number would be calculated. Is it left to right, or right to left? Then there is the stack and heap. Is there a stage before it enters memory where it is one complete number? Suppose we added 1 to it, where would it be? Is there a halt at the end? Does the increment have to take place before the halt?
@christopherhume1631 · 2 years ago
The two algorithms are practically indistinguishable. One can regard the principal difference between them as an algebraically motivated optimization. The main value of this investigation is to underscore the fact that divisions are more expensive than multiplications. This difference would be substantially amplified by the use of double-precision math. I haven't looked closely, but it may be possible to reduce the number of divisions in the Po-Shen Loh approach. Regardless of approach, trading division for multiplication should improve performance, especially for double-precision math. Note, however, that I am not referring to constant division by powers of two; though I am not certain, it is possible those are converted to arithmetic shifts.
@PhilippeVerdy · 2 years ago
That's no longer true with modern FPUs and GPUs, where divisions are now actually a bit FASTER than multiplications! But this is really only visible on GPUs, not on existing x64 CPUs where both perform the same... except when using "vectorized" expressions: AVX2 processor extensions (or newer), if used by your C/C++ compiler or its libraries, may show that divisions CAN perform faster than multiplications! You may also get different figures depending on CPU brands and models (how their internal FPU units are built). Do not assume things from the era when divisions were still computed with "naive" algorithms (bit-by-bit compares, shifts, and subtractions), when we had no FPU to compute them in a single hardware instruction. As well, there are optimizations in modern FPUs to efficiently compute square roots. Modern FPUs can implement these algorithms in hardware using many more silicon gates, structured as a much larger "tree" of parallel branches with lower depth, and so with faster total propagation time: this is true for multiplication, division, square root, and the other trigonometric/exponential/logarithmic functions natively supported by FPU or GPU instructions.
@PhilippeVerdy · 2 years ago
As well, there are new types of processing units, notably for AI and signal processing, built around optimized tensor flows. They are starting to appear in smartphone CPUs and will soon be available on desktop PCs, and they are already used in large computers performing massive computation (think about what CERN needs for the LHC, what Google needs for its AI on big data, or what is needed in astronomy, fluid mechanics, or computer-aided design, notably in the aerospace and defense sectors). These processing units have new specialized instructions for efficiently computing "convolution products" and FFTs in hardware. Here also, they can accelerate basic divisions more than basic multiplications (and the difference is even more visible with massively parallel computing, with very large vectors or matrices rather than individual scalars).
@phizc · 1 year ago
26:12 One of the divisions in Po-Shen Loh is x/2. It can be turned into a multiplication x*0.5. I'm kinda surprised if the compiler doesn't do that. So PSL vs "old" should have 1 less sub, 3 less mul and all the others the same. I.e. 7 clocks less than the "old" in the first table, and 2.5 less in the second table.
@stevegremory3849 · 2 years ago
bloody awesome! never knew this existed, extremely cool, hell it'll be useful for school work too ;)
@WhatsACreel · 2 years ago
I reckon! Such a clever formula! Cheers for watching :)
@stevegremory3849 · 2 years ago
@@WhatsACreel Cheers :D
@furyzenblade3558 · 2 years ago
Your videos are always super interesting! Thank you
@wolfgangreichl3361 · 1 year ago
The way I learnt it, the expression (double == constant) is a bug anyway (9:12).
Before going to assembly, I would have replaced std::complex, which seems to me to be the culprit messing up memory access.
@syntaxerorr2 жыл бұрын
I love your videos. I don't want to critique your style, and this is the first I've seen with this style... but if you sped up all the animations (mainly the countdowns, by about 5 seconds) I think it would flow better. I know animation is hard and sometimes a trial-and-error process.
@thrownchance2 жыл бұрын
Which CPU and GPU was used? A speedup of only factor 2 to go to the GPU seems to be barely worth it imho.
@user-sl6gn1ss8p2 жыл бұрын
especially considering the time it takes to get stuff onto the gpu
@Chris-op7yt Жыл бұрын
that's the form for finding x-axis intercepts. if you're just computing the y value for any particular a, b, c and x, then just plug it into y = a*x*x + b*x + c. no square roots and no division.
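A minimal sketch of that plug-in evaluation (hypothetical name `quad_eval`, not from the video; written in Horner form so an FMA-capable compiler can fuse each multiply-add):

```cpp
#include <cassert>
#include <cmath>

// Evaluate y = a*x*x + b*x + c in Horner form: y = (a*x + b)*x + c.
// Two fused multiply-adds, no square roots and no divisions.
float quad_eval(float a, float b, float c, float x) {
    return std::fmaf(std::fmaf(a, x, b), x, c);
}
```

For x² - 3x + 2 at x = 1 this returns 0, i.e. x = 1 is a root.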
@turun_ambartanen Жыл бұрын
I was excited for a new formula, but it's just the regular one we are taught in school here in Germany :( It's called "pq-Formel", with p=b/a and q=c/a
@Zero-47932 жыл бұрын
with Po-Shen Loh's, you don't need to calculate with complex numbers. You just put an i out the front: if the squared quantity under the root is negative, take c = |c| and ans = i·c
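One way to sketch that in C++ (the struct and function names here are made up for illustration; the point is that taking the absolute value before the square root avoids std::complex entirely):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: both roots are mid ± u, and when the
// quantity under the root is negative, u is the imaginary part instead.
struct Roots {
    float mid;       // -b / (2a), shared by both roots
    float u;         // sqrt(|mid*mid - c/a|)
    bool imaginary;  // true -> roots are mid ± u*i
};

Roots solve_psl(float a, float b, float c) {
    float mid = -b / (2.0f * a);
    float disc = mid * mid - c / a;   // (b/2a)^2 - c/a
    Roots r;
    r.mid = mid;
    r.imaginary = disc < 0.0f;
    // Factor the i out the front: sqrt always sees a non-negative value.
    r.u = std::sqrt(std::fabs(disc));
    return r;
}
```

x² + 1 then yields mid = 0, u = 1, imaginary = true, i.e. roots ±i.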
@Zooiest Жыл бұрын
How does the tie rule work? Is it something like abs(resultsA[i] - resultsB[i]) < stddev(resultsA)? I'm not really a statistician, but I like learning these things
@CjqNslXUcM2 жыл бұрын
awesome video, well done!
@XfStef2 жыл бұрын
Glad round 2 took a Data Oriented approach!
@MichaelJamesActually2 жыл бұрын
Could you use something like the inverse root hack from quake 3? I'm sure there's other approximations of square roots that could also improve that calc.
@MichaelJamesActually2 жыл бұрын
Scratch that. Both methods have the same number of square root operations
@PhilippeVerdy2 жыл бұрын
In summary, we cannot decide which approach ("original", from maths on real or complex numbers with infinite precision, or "PSL"-optimized) is faster or slower, as it largely depends on how you program your code, the programming language you use (notably how it rounds results per its semantics), the compiler you use (and its compilation options, notably regarding the rounding of intermediate results and how it may cache multiple reads from external structures/variables into local registers), the library you use, and the hardware you use: finally, you can only decide after running a local benchmark. NEVER forget the language semantics (effects of "implicit" roundings, and of "if" branches) when creating ANY implementation in that language.
@mt0software2 жыл бұрын
I wish this was the kind of game show they'd show on TV. great video
@plokki4562 жыл бұрын
interesting video! Though I find it hard to believe that the whole gpu is only twice as fast as a single cpu core for such a massively parallel task. Is there some cuda optimization flag that you did not turn on? Also the gpu may not have had the time to increase its frequency to the max if the task was too short. Did you compute the effective bandwidth vs theoretical bandwidth as you did for the cpu test? I'm curious about that as well.
@aliengenie88962 жыл бұрын
When you're doing stuff with GPGPU, if you simply time from when you "submit" the work to the GPU to when it finishes, you end up also measuring the overhead involved in deploying kernels and eventually having all of them hit the synchronization point at the end. There's a good chance that the amount of work submitted was not enough to keep the GPU busy for long, and so the vast majority of the time is spent on that overhead. Depending on how things are done, you may also end up measuring the transfer of the inputs/outputs to/from the GPU, which would make the results look even more similar. The "dumb" way to measure how much overhead there is would be to give it a ridiculously small amount of work, like only one calculation, and see if it still ends up being ~75ms. The better way is the tools Nvidia supplies for measuring how long each part of the process takes (launching the kernel vs memory transfers vs the actual work).
@haoming34302 жыл бұрын
Where can I download the assembly version of the code? Many thanks!
@ExaltedDuck Жыл бұрын
fascinating. I haven't watched the other video yet to better understand this other method, but from the 1-sentence description early in your video, it doesn't sound like they're doing different things. The description of Po-Shen Loh as finding the middle and then adding and subtracting the difference to find the roots... that's exactly what the quadratic formula does. -b/2a is the middle, and then the difference is sqrt(b^2-4ac)/2a. You might be able to speed up the quadratic code by a tiny amount by calculating the -b/2a as well as the discriminant. Although in the scalar version, the compiler probably already has. I'm really curious to go see the more explanatory video soon, though. Maybe there's something I'm missing...
@prof.salomonibarra64022 жыл бұрын
Excellent work! You deserve the gold medal in the computer olympics. For real numbers OQ performs very well. I'm going to try this on ARMv8 and compare with a similar Intel. Cheers
@mattewlefty9912 жыл бұрын
From a numerical point of view, the quadratic formula might give you roots which are very distant from the true ones, depending on the values of a, b, c. So we need to use different methods for computing the roots. Btw the "red and blue tower thing" was really anxiety-inducing 😂
@user-sl6gn1ss8p2 жыл бұрын
"From a numerical point of view, the quadratic formula might give you roots which are very distant from the true ones, depending on the values of a,b,c" Really? Like, in a way that Po-Shen Loh wouldn't?
@mattewlefty9912 жыл бұрын
@@user-sl6gn1ss8p That's a good question, po's formula divides all the coefficients by a, and then does the calculations (but when abs(a) is small there might be some problems in general). @Creel would you share how you generated the test cases?
@user-sl6gn1ss8p2 жыл бұрын
@@mattewlefty991 now that you mention it, comparing the results might be fun as well (if fairly slow)
@monad_tcp2 жыл бұрын
why would it do that? the roots are basically where the curve crosses zero. we are talking about a parabola, it must cross the origin
@mattewlefty9912 жыл бұрын
@@monad_tcp Not all parabolas cross the x axis... But even then, there are some operations (like subtraction, addition and division) that introduce errors, because the computer works with a finite number of digits
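For reference, the standard cancellation-avoiding rewrite from numerical-analysis texts (not something shown in the video) looks roughly like this: when b is large, -b + sqrt(b²-4ac) subtracts two nearly equal numbers, so compute the "safe" root first and recover the other from the product of roots, r1·r2 = c/a:

```cpp
#include <cassert>
#include <cmath>

// Sketch of the classic numerically stable variant (real roots assumed).
void solve_stable(float a, float b, float c, float* r1, float* r2) {
    float disc = std::sqrt(b * b - 4.0f * a * c);
    // copysign keeps b and the root the same sign: no cancellation here.
    float q = -0.5f * (b + std::copysign(disc, b));
    *r1 = q / a;
    *r2 = c / q;  // from the product of roots: r1 * r2 == c / a
}
```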
@selvapk55672 жыл бұрын
Can anyone explain these instructions, which I found in a shellcode document for getting a bind shell?

; function hashes (executable as nop-equivalent)
_emit 0x59 ; LoadLibraryA    ; pop ecx
_emit 0x81 ; CreateProcessA  ; or ecx, 0x203062d3
_emit 0xc9 ; ExitProcess
_emit 0xd3 ; WSAStartup
_emit 0x62 ; WSASocketA
_emit 0x30 ; bind
_emit 0x20 ; listen
_emit 0x41 ; accept          ; inc ecx
@Mernom2 жыл бұрын
That branch really muddied the water. I think the first test should've been redone with the optimized code.
@asailijhijr2 жыл бұрын
Is one of those divisions by four? Can you optimise it with a bit shift? Or is that not how floats and std::complex numbers work?
@stevenlynch34562 жыл бұрын
The way floats are structured means that bit shifts don't work for division or multiplication by powers of 2.

Floats are structured like this: SignBit PositiveAndBiasedExponent FractionalPart
SignBit is 0 for a positive float, 1 for a negative float.
PositiveAndBiasedExponent is exponentYouWantToStore + bias = exponentYouWantToStore + 127.
FractionalPart is like the .345 in the number 12.345.
SignBit is 1 bit long. PositiveAndBiasedExponent is 8 bits long. FractionalPart is 23 bits long. Why these kinda random #s? It's standardized, and together they use exactly 32 bits, which is a nice power of 2.

Example float in memory: 1 10001000 11000000000000000000000
SignBit = 1, so the float is negative.
10001000 = 136. However, to get the exponent we intended, subtract the bias: 136 - 127 = 9.
11000... = (1/2) + (1/4) = .75
So the example float in memory is -1.75*2^(9) = -1.75*512 = -896.0
1 10001000 11000000000000000000000 = -896.0
Why 1.75 instead of .75? The leading 1 is always present in binary scientific notation, and the value 1 is the same no matter which base you're in (except Base 0 and Base 1, which are extraordinarily difficult/annoying to deal with).

Now imagine shifting it to the right because you think dividing a float by 2 is the same as dividing an integer by 2. What you're doing is moving the sign bit into the exponent, dividing the EXPONENT by 2, and then (what your intention originally was) dividing the FractionalPart by 2.
1 10001000 11000000000000000000000 shifted to the right by 1 is
0 11000100 01100000000000000000000
Putting this into www.h-schmidt.net/FloatConverter/IEEE754.html shows that it's equal to 8.11656739243e+20.
-896 is pretty different compared to +8 with 20 digits after it.

HOWEVER, if you could find a way to decrease the exponent by 1 (i.e. subtract a place value of 1 from the exponent) WITHOUT modifying any other part of the float, you would successfully divide by 2.
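The bit layout described above can be sanity-checked with a short snippet (helper names are made up; memcpy is the aliasing-safe way to read a float's bits in C++):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Split an IEEE-754 binary32 into its three fields.
struct FloatParts {
    uint32_t sign;      // 1 bit
    uint32_t exponent;  // 8 bits, biased by 127
    uint32_t fraction;  // 23 bits; the leading 1 is implicit, not stored
};

FloatParts decompose(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret without UB
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

-896.0f decomposes into sign 1, biased exponent 136 (= 9 + 127) and fraction 0x600000 (= 0.75 · 2²³), matching the worked example.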
@DanikaOliver2 жыл бұрын
I can't see the code, the contrast is not good. The letters are small.
@CanKarakuzulu2 жыл бұрын
Just trying, but using branched code in CUDA severely slows things down. Maybe you could try a branchless version of Po-Shen Loh in CUDA?
@MrRyanroberson12 жыл бұрын
4:23 already I see a problem. The ± part (the square root) should be computed once and reused, not recomputed for each root
@Misteribel2 жыл бұрын
I might have missed it in the video, but did you consider rsqrtss or vrsqrtss? You’d need to change the algorithm, but it may improve speed of your slowest instruction: square root. Maybe you used it in the simd variant asm version.
@TerjeMathisen2 жыл бұрын
First, the two algorithms really are not separate: they have exactly the same numerical qualities and effectively do the exact same operations, only in a slightly modified order! On a modern CPU, sqrt() and FDIV are both far slower than fadd/fsub/fmul & fmadd, so I would start by calculating the a_reciprocal = (1/a) term: the FDIV in that operation will then be nearly or completely free, because it can overlap with all the rest of the calculations. In the PSL version we immediately use the result of that division to scale b and c, which simplifies the logic but creates a latency barrier. If we instead calculate half_a_recip = (0.5/a), then we can multiply the final answers by that value. The sqrt term is sqrt(b*b - 4*a*c); here the b2 = b*b term and the ac = a*c term can run at the same time, then we combine them in tsqrt = sqrt(fma(-4.0, ac, b2)). By the time that has finished, the FDIV will be all done, so we just have to do the final res0 = (-b + tsqrt) * half_a_recip and res1 = (-b - tsqrt) * half_a_recip calculations, which will also overlap. Since sqrt is slightly slower than FDIV, we can in fact delay the half_a_recip calculation a smidgen, starting it just after the b2 = b*b and ac = a*c operations; it will still finish before the sqrt. On an x64 CPU with at least two fmadd pipelines, but in scalar mode, these calculations should take ~40 clock cycles (I'm assuming 5-cycle fmadd and 20-cycle fsqrt).
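A scalar sketch of that dependency structure (variable names follow the comment; the source just keeps the operations independent — the compiler and out-of-order hardware do the actual overlapping):

```cpp
#include <cassert>
#include <cmath>

void solve_overlapped(float a, float b, float c, float* r0, float* r1) {
    float half_a_recip = 0.5f / a;            // the lone FDIV, issued early
    float b2 = b * b;                         // independent of the divide
    float ac = a * c;                         // also independent
    float tsqrt = std::sqrt(b2 - 4.0f * ac);  // sqrt overlaps the divide
    *r0 = (-b + tsqrt) * half_a_recip;        // multiply instead of divide
    *r1 = (-b - tsqrt) * half_a_recip;
}
```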
@cH3rtzb3rg2 жыл бұрын
I doubt the latency of the instructions matters at all in case you calculate millions of independent roots -- only instruction throughput and port-usage should matter (and in fact memory throughput). Instruction latency would matter, if the output of the previous calculation influences the input of the next.
@TerjeMathisen2 жыл бұрын
@@cH3rtzb3rg I mostly agree: when processing huge amounts of data, typically more than last-level cache, almost any calculation will be totally dominated by memory bandwidth. OTOH, that is pretty much unimaginable for the case of solving quadratic roots. :-)
@cH3rtzb3rg2 жыл бұрын
@@TerjeMathisen I also agree. The benchmark in the video is pretty much meaningless for most practical purposes. One could compare the latency of the algorithms by calculating new coefficients based on the previous roots (that way, there would also be no memory bottleneck).
@zaydabbas16092 жыл бұрын
Just to be clear here, these are algebraically identical. There was no 'new math' discovered here as some people claim. This is merely an optimization for computation
@armandosreis2 жыл бұрын
I don't get how you can get rid of the if branch.... It says if a is not 1 then you adjust b and c. so for all values of a that are not 1 you need that adjustment. How can you just remove it? That's changing the formula. Am I missing something here?
@ABaumstumpf Жыл бұрын
The branch said not to do the division if it would divide by 1, which is completely irrelevant because dividing by 1 does not change the result. But having a branch can seriously impact performance
@boelwerkr2 жыл бұрын
That "if" breaks a lot of processor optimizations, and it's not needed, I think. If "quads[i].a" is 1.0f, the statements inside the if are b /= 1.0f; c /= 1.0f; which don't change the values. Removing it would speed up the code a lot. With that it should have a fighting chance.

The slowest part of both codes is the "sqrt"; it is an order of magnitude slower than any other operation in that code, so using it as little as possible is always preferable. So this code:

void Po_Shen_Loh_Simple_AoS(QuadStruct* quads, int count){
    for(int i = 0; i < count; i++){
        float mid = (quads[i].b / quads[i].a) * -0.5f;
        float u = sqrtf(mid * mid - quads[i].c / quads[i].a);
        quads[i].r1 = mid + u;
        quads[i].r2 = mid - u;
    }
}

should be faster, if I haven't made a stupid logical error. 🙂
@cH3rtzb3rg2 жыл бұрын
Why would the original Po-Shen Loh code check for a!=1, but no code actually checked for a!=0?
@KerbalLauncher2 жыл бұрын
What were the compiler settings?
@rogo73302 жыл бұрын
Is storing a computation result into a previously defined variable (thinking that we won't need its content anymore) faster than defining a new variable (local to the function)? And does this even matter?
@cH3rtzb3rg2 жыл бұрын
For the compiler it should not matter at all, at least for simple types. Do whatever is easier to read.
@monad_tcp2 жыл бұрын
@@cH3rtzb3rg its usually easier to store duplicated terms in a variable to make the code simpler, not for performance, the compiler can do that
@cH3rtzb3rg2 жыл бұрын
@@monad_tcp I think Rogo was asking about code like:

int x1 = SOME_EXPRESSION;
// use x1
int x2 = SOME_OTHER_EXPRESSION;
// use x2

i.e., whether one should reuse x1 instead of introducing x2.
@monad_tcp2 жыл бұрын
@@cH3rtzb3rg oh, in that case it really doesn't matter; variables are renamed either way, and the compiler either does register allocation or stores them on the stack. After data-flow analysis it will reuse stack slots if possible. Unless your compiler is really, really dumb. Like Microchip's.
@JustATempest Жыл бұрын
I'd be interested in you using the fast inverse square root algorithm from Quake as a substitute for the square root in both of these
@guylaingreer41902 жыл бұрын
I didn't catch if you mentioned it, but when you switched to std::complex were the coefficients complex or were they still real?
@etopowertwon2 жыл бұрын
At differences so low you more quickly run into what compiler can optimize better at given flags (O0 vs O1 vs O3, -ffast-math/-fno-fast-math) For example: final table is unoptimized. Nobody counts 2a as (2*a). It's (a+a). (Unless compiler would decide that 0.5 * X/a is faster than X/(a+a) -- it generally shouldn't, floats have no instruction for quickly halving a value)
@ABaumstumpf Жыл бұрын
Changing a float by a factor of 2 is a single addition/subtraction
@PplsChampion2 жыл бұрын
"benchmarking as entertainment" is next to godliness
@camofelix2 жыл бұрын
I didn’t notice what chip you’re running this on! If it’s AVX-512 with FP16 (early Alder Lake with E-cores off) enabled, there are some SIMD fast paths for complex calculations that can do it all much faster, and because you’re using a smaller data type (fp16 instead of fp32) you get much better bang per buck on memory bandwidth
@monad_tcp2 жыл бұрын
its an x86 probably
@michaelwilkes0 Жыл бұрын
just getting close to such a tiny, well-established formula is amazing. epic.
@decky19902 жыл бұрын
Could you do a bit shift right (SHR?) to help reduce the clock cycles for dividing by 2 or would it need to be an integer to preserve accuracy? Thanks for putting up these videos - desperately trying to finish a new book on C++ for the 3rd time and you’re my only real motivation at the minute.
@andersgoude2 жыл бұрын
This only works for integers. If you really want to achieve this through bit manipulation with a float, what you could do is something like

float a = 5;
*reinterpret_cast<int*>(&a) -= 0x800000;

and a will now have the value 2.5 instead (assuming that int is 32 bits). What this code is doing is decreasing the exponent of the floating-point number by 1, which mathematically is the same as dividing by 2. (Check the IEEE 754 standard for more details on how floating-point numbers are stored in memory.)
@decky19902 жыл бұрын
@@andersgoude thanks for sharing, Anders! Will check this out further, thank you
@andersgoude2 жыл бұрын
@@decky1990 I just want to remark that my example is mainly for educational purposes, and is generally not something that is recommended. The normal solution for divisions with a constant is to replace the expression a/2 with a*(1.0/2.0) (or just write a*0.5 which will generate the same code). This works for any constant and for most modern systems, multiplications are very fast operations, compared to divisions
@disasterarea93412 жыл бұрын
hard to see the blue code around 4 minutes in, dark text on dark background :x
@dananskidolf2 жыл бұрын
I think processor architecture plays a big role in this. The operation times depend on FPU implementation - for example Intel improved division by a factor of two and sqrt by 4 in the Penryn architecture if I remember correctly. More cache or memory bandwidth will target the key bottleneck here - that's probably the main gain from using a GPU. Branch prediction is powerful, but difficult and I think still evolving, so this affects different CPUs differently, as well as depending on the data fed in, and the order it's in. (The branch was only there to demonstrate bad code but I thought I'd include it.)
@barakap2372 Жыл бұрын
Hey friend. Why didnt you upload for such a long time? You were a reallllllllly good introduction for me for assembly C++ and in general compsci. Please reply
@devinschlegel17632 жыл бұрын
are there any plans to continue the neural net series?
@beaconofwierd18832 жыл бұрын
At first I was angry with you for implementing it using a branch, then I was very happy that you did, because it made me feel good spotting the issue from the start :D So thank you for making me feel clever ;D
@Verrisin2 жыл бұрын
9:20 - isn't that if wrong to begin with? It's supposed to check it's not 0, in which case maybe use an entirely different algorithm.
@felixstuber8046 Жыл бұрын
You can substitute one division in the Po-Shen Loh formula with a multiplication: instead of dividing b by a and then by 2, simply divide it by 2*a. On the other hand, multiplication and division by 2 shouldn't be a problem at all for a binary machine.
@lordadamson2 жыл бұрын
man I love your videos haha
@dannygjk2 жыл бұрын
Why are you using floating point?
@rafagd2 жыл бұрын
4:10 - I was going to complain that you were calculating sqrt(b*b - 4ac) twice for no reason in the original version, but I imagine the compiler should be smart enough to optimize that one away...
@emjizone Жыл бұрын
3:31 In your red code there is no reason I know of to compute 2*a twice per loop iteration, nor to compute (b*b - 4*a*c)^(0.5) twice per loop iteration! Here you aren't simply implementing different mathematical approaches. You are also comparing opposed implementation patterns. :/
@badhombre4942 Жыл бұрын
√(b²-4ac)/(2a) == √((b/2a)·(b/2a) - c/a) == u. Therefore both formulas are the same. If a ≠ 1 in the majority of your dataset, then the if block adds 2 division operations. Your red code can be optimized to make it even faster.
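That identity is easy to check numerically (sketch; it holds as written for a > 0, where both square-root terms are non-negative):

```cpp
#include <cassert>
#include <cmath>

// Compare sqrt(b*b - 4ac)/(2a) with sqrt((b/2a)^2 - c/a) for real-root cases.
bool formulas_agree(float a, float b, float c) {
    float classic = std::sqrt(b * b - 4.0f * a * c) / (2.0f * a);
    float half = b / (2.0f * a);
    float psl = std::sqrt(half * half - c / a);
    return std::fabs(classic - psl) < 1e-5f;
}
```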
@ShotgunLlama2 жыл бұрын
My brother in Kernighan, did you really just put dark blue text on a gray background?
@monad_tcp2 жыл бұрын
4:42 one conditional branch and one division. I bet it loses, but it's a division by 2; let's see, the compiler might be smart.
@graealex2 жыл бұрын
The font used in the animation always reminds me of the animations from surreal entertainment. Just waiting for "Sheldon is kill" to appear.
@maxvilla50052 жыл бұрын
Nice vid, wish the code text in red and blue (especially blue) was easier to read though.