Great content, concise and clear. Like my grandfather used to say: If you cannot do it faster, divide the work. Perfectly sums up parallelization. God bless
@DanielBoctor 1 year ago
Thanks, I'm glad you liked it! Awesome quote too, thanks for sharing. 😊
@sesemuller4086 1 year ago
np.prod(rets+1)-1 is twice as fast and equal within floating-point error
@sesemuller4086 1 year ago
I think the main takeaway here is that math is always important in data science. Within floating point error, your expression is equal to np.prod(rets + 1) - 1, which is twice as fast as your fastest version. It's also shape-agnostic. No SIMD necessary!
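A quick sanity check of that equivalence (the returns below are made up for illustration):

```python
import numpy as np

# Hypothetical returns; a 2D shape shows why the product form is shape-agnostic.
rets = np.array([[0.01, -0.02, 0.03],
                 [0.005, 0.0, -0.01]])

via_logs = np.expm1(np.log1p(rets).sum())  # exp(sum(log(1 + r))) - 1
via_prod = np.prod(rets + 1) - 1           # (1 + r1)(1 + r2)... - 1

# Equal within floating-point error.
assert np.allclose(via_logs, via_prod)
```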
@EarthlingNews 7 months ago
maybe start posting some coding videos lol
@seneca983 6 months ago
However, log1p() and expm1() should have smaller errors for small arguments, which could potentially matter. I think it's usually better to favor smaller errors over speed if you're going to make the choice without much thought. Also, to nitpick a bit, I think the np.prod() function is still going to use SIMD, though I'm not 100% sure.
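A small illustration of the error point (the input value is arbitrary):

```python
import math

x = 1e-15  # a very small return

naive = math.log(1 + x)  # 1 + x already rounds away digits in double precision
better = math.log1p(x)   # computes log(1 + x) without forming 1 + x

# log(1 + x) ~ x for tiny x; log1p keeps nearly full precision,
# while the naive form loses a large share of the significant digits here.
assert abs(better - x) / x < 1e-9
assert abs(naive - x) / x > 0.01
```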
@tophat593 1 year ago
Ok, I've been coding for more decades than I care to admit and this was damn helpful. Subbed and keep it coming.
@DanielBoctor 1 year ago
Thank you for the support! I'm glad it was helpful
@krellin 10 months ago
Not only can you take advantage of SIMD, but you can also exploit ILP (instruction-level parallelism), which, when conditions are right, will give you even more performance, though often at the expense of code readability... We saw these techniques used in the recent One Billion Row Challenge. Great video, well made.
@lapissea1190 1 year ago
Every time I try to optimize Python, I just end up with no Python. I remember writing a 3D mesh processing algorithm in Python that tried to remove harsh artifacts in the geometry. It took 12 minutes to execute on 3 million points. I was so sick of debugging and developing it that I ended up rewriting the whole thing in Java (because I still wanted fairly high-level code), and it ran in about 50 seconds single-threaded; when I added some simple threading, it ran in 12 seconds. That's a 60x speedup, without even rewriting it in a truly performance-oriented language like C.
@well.8395 1 year ago
Lmao, the first line made me crack.
@charliebitmyfinger7124 1 year ago
Java's actually pretty performant. There was a test by a retired MS engineer who worked on the OS, and Java came out as the 5th fastest language, just behind C, C++, Zig, and Rust.
@lapissea1190 1 year ago
@charliebitmyfinger7124 Oh yeah, absolutely! After the JIT has done its thing, and if you can use lots of primitives, it's basically C. The main issue with Java is objects: you can't have a 3D vector laid out flat in memory, which can create a lot of work for the GC and hurt memory locality. (Although Project Valhalla, hopefully coming soon, fixes that.) Also, the start time is not the best: half a second or so to get a program actually running is not uncommon, versus a few ms in Rust.
@vercolit 1 year ago
I know which video you're talking about and would like to add a bit more nuance. I think that single test is in no way representative of actual real-world performance. The way Dave Plummer "measured" language performance was by implementing a single, simple algorithm and asking people to port it to other languages in exactly the same way. It's a fun video series, but it's absolutely not a good way to compare languages, because: 1) some languages had a ton of effort put into their entries because they're popular online, but not in real life; 2) it's just one program, and can't be used to extrapolate to all programs; 3) the bottleneck of that specific program (a prime sieve) is the CPU's ability to do a small computation many times, which is "unfairly" beneficial to JIT languages like Java (they can optimize computations better than memory layout, which is usually more of a bottleneck in real applications); 4) the results are immediately suspicious: in practice, Zig is definitely not 3 times faster than C, so why should we take the word of an irrelevant micro-benchmark? So while I do think Java is decently performant compared to other languages, it's still lacking compared to systems programming languages like C/C++/Rust/Zig (I guess), and those videos are really fun little experiments, not data that means anything in the real world.
@well.8395 1 year ago
@lapissea1190 No, it's not basically C.
@NoNameAtAll2 1 year ago
The sum of logs is the log of the product, and the exp and log cancel out, so the function is just multiplying the (val + 1) terms and then subtracting 1.
@tamimyousefi 10 months ago
Yes, but isn't addition faster than multiplication? The logs are done in parallel and the exp is done once.
@NoNameAtAll2 10 months ago
@tamimyousefi Calculating logs and exps takes several multiplications each. Other comments have timed the prod(val + 1) - 1-style code and found it twice as fast or more, as I understand it.
@tamimyousefi 10 months ago
@NoNameAtAll2 I find that surprising. Thanks!
@seneca983 6 months ago
@NoNameAtAll2 However, the functions log1p() and expm1() should have smaller errors if the arguments happen to be very small.
@u9vata 1 year ago
What counts even more is cache locality. This was more of a computational problem, but in most real-world uses, cache locality of data works wonders: drop the "everything is a reference" thinking and inline data in memory instead of leaving it scattered randomly on the heap. Also, you generally don't even need SIMD for this kind of thing: your CPU is heavily superscalar (not really "pipelined" in the old sense, but a frontend plus a backend, the backend resembling a RISC machine with extra CISC for special cases, and the frontend managing rescheduling, ILP (instruction-level parallelism), and so on). So with those changes, even if your compiler/JIT doesn't emit SIMD, the code will usually still be at least 2x faster.
@vytah 1 year ago
One extra thing worth noting is that expm1 and log1p are more accurate for small arguments when used like this.
@fabiangiesen306 1 year ago
For what it's worth, `log1p` and `expm1` have their own dedicated math library functions (not just in numpy) not just because they're common (although they are), but for numerical reasons. This is easiest to see for expm1 if you look at the power series expansion for exp: exp(x) = sum(k=0;inf) x^k / k! = 1 + x + x^2/2 + x^3/6 + x^4/24 + ... For |x| much smaller than 1, exp(x) is very close to 1, so rounding exp(x) to a float and then subtracting 1 cancels most of the significant digits. expm1 effectively evaluates the series without the leading 1, keeping nearly full precision; the same reasoning applies to log1p for log(1 + x).
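The cancellation is easy to see directly (the input value below is arbitrary):

```python
import math

x = 1e-12
# exp(x) rounds to a double extremely close to 1.0; subtracting 1
# then cancels most of the stored significant digits.
naive = math.exp(x) - 1
# expm1 effectively evaluates x + x^2/2 + ... without the leading 1.
accurate = math.expm1(x)

assert abs(accurate - x) / x < 1e-9  # essentially full precision
assert abs(naive - x) / x > 1e-6     # visible precision loss
```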
@kenzostaelens1688 1 year ago
Interesting video, but considering e^ln(x) = x (for x > 0) and log(x) = ln(x)/ln(10), you can change the formula to e^(sum(ln(x+1) for x in list)/ln(10)) - 1 = e^(ln(product(x+1))/ln(10)) - 1 = product(x+1)^(1/ln(10)) - 1, with 1/ln(10) being a constant and no further expensive logarithm calculations.
@LambdaTechnology 10 months ago
Great explanations. Just went through 5 of your vids and subscribed without hesitation. Keep it up! 👍
@DanielBoctor 10 months ago
Haha, thanks, that's pretty awesome. More is on the way 😊
@arenmaes4239 1 year ago
You kinda forgot the most important thing of them all... math. The product rule for exponents: a^m × a^n = a^(m+n). So we can rewrite the original sum as a product and then remove the exp and log operations. On my machine this gives a 5.5x improvement over your fast implementation, and a 93.5x improvement over the slow implementation. Full code:

import math
import numpy as np
import random

# Random returns in range [-0.0125, 0.0125)
random_rets = [0.025 * (0.5 - random.random()) for _ in range(10_000)]
random_rets_np = np.array(random_rets)

def compound_returns_v1(rets):
    return math.exp(sum([math.log(value + 1) for value in rets])) - 1

%timeit -n 100 -r 7 compound_returns_v1(random_rets)
# On my machine: 1.74 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

def compound_returns_v2(rets):
    return np.expm1(np.log1p(rets).sum())

%timeit -n 10000 -r 7 compound_returns_v2(random_rets)
# On my machine: 574 µs ± 9.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

def compound_returns_v2b(rets):
    return np.expm1(np.log1p(rets).sum())

%timeit -n 10000 -r 7 compound_returns_v2b(random_rets_np)
# On my machine: 103 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

def compound_returns_v3(rets):
    return np.prod(rets + 1) - 1

%timeit -n 10000 -r 7 compound_returns_v3(random_rets_np)
# On my machine: 18.6 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# Validations to make sure the different versions give the same result
res_v1 = compound_returns_v1(random_rets)
res_v2 = compound_returns_v2(random_rets)
res_v2b = compound_returns_v2b(random_rets_np)
res_v3 = compound_returns_v3(random_rets_np)
print(f"v1 == v2  -> {np.allclose(res_v1, res_v2)}")
print(f"v1 == v2b -> {np.allclose(res_v1, res_v2b)}")
print(f"v1 == v3  -> {np.allclose(res_v1, res_v3)}")
# On my machine:
# v1 == v2  -> True
# v1 == v2b -> True
# v1 == v3  -> True
@AlgorithmAlloy 1 year ago
Well done
@DanielBoctor 1 year ago
That's an excellent analysis, thanks a lot for sharing! That is certainly true, and that's actually how I implemented the function originally. I just wanted to pick an example with a few different operations to better illustrate the concepts of vectorization and parallelization, as the video was meant to be independent of the example. Great work!
@kakyoindonut3213 1 year ago
Holy shit🤯
@uima_ 1 year ago
Yeah, this was the first thing I did when I saw the slow implementation and wanted to optimize it myself first, thinking he would do it this way xd.
@carlosp.2898 1 year ago
I'm failing to detect any difference between the v2 and v2b implementations. How did you get a different time?
@ticua07 1 year ago
Only 72 subscribers for this quality? Keep it up man
@DanielBoctor 1 year ago
I appreciate it! I got more on the way. Thanks for watching. 😊
@seneca983 6 months ago
There seem to be a lot more now.
@Sweenus987 1 year ago
Took me too long to realise this was a video on Python
@MaxForman-n7z 1 year ago
This deserves more views, good work
@DanielBoctor 1 year ago
Thanks! I'm glad you enjoyed it! More to come.
@blazingblast 1 year ago
Okay, but exp(ln(a+1) + ln(b+1)) - 1 = exp(ln((a+1) * (b+1))) - 1 = (a+1)*(b+1) - 1. So return product([value + 1 for value in rets]) - 1 would work too, and if you let numpy loose on that it'd be much faster.
@dovos8572 1 year ago
numpy is so much faster that it's a required "must learn" if you're learning Python. Sure, it's "harder", but most of the good help and tutorials use it, and you'll be way happier with the speedup for computational work. It's also not just math but everything array-related too: numpy arrays are way better and faster than Python lists, and at large scales (100k+ entries) they can be hundreds to thousands of times faster with the same amount of code.
@virtual5754 1 year ago
So what you suggest is just using a faster language to get faster execution speed. It's the obvious solution, but it doesn't exactly make the program better through more optimized code, nor does it make the code's author a better programmer. The same way driving a faster car doesn't improve your driving skills on its own.
@SEOTADEO 1 year ago
Hahaha, that is exactly it. For fast Python, don't use Python.
@seneca983 6 months ago
You don't have to care that NumPy isn't written in Python. You can just call it from Python and leave the implementation details to its developers.
@arseniix 1 year ago
Ok, so the most important thing is switching to numpy
@art4eigen93 1 year ago
Great content. Waiting for more videos.
@David_Lloyd-Jones 6 months ago
Sweet. Good example for a generally sound point.
@yato3335 1 year ago
Also, you don't need to collect the math.log values into a list before summing them. You're iterating through the data twice and allocating unnecessary memory; just pass a generator to the sum function.
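A sketch of the suggested change (the returns are made up for illustration):

```python
import math

rets = [0.01, -0.02, 0.03, 0.005]

# Original style: materializes a temporary list, then sums it.
with_list = math.exp(sum([math.log(v + 1) for v in rets])) - 1

# Generator expression: sum() pulls values one at a time,
# so no intermediate list is allocated.
with_gen = math.exp(sum(math.log(v + 1) for v in rets)) - 1

# Same operations in the same order, so the results are identical.
assert with_list == with_gen
```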
@SEOTADEO 1 year ago
3:03 I don't think this is correct. You say "the logarithms are computed in parallel", but all that happens is that the logarithms are computed in the compiled C/C++ binary of the numpy library. To my knowledge numpy does not use any multi-threading when computing functions of arrays. So it does not happen in parallel (except for some SIMD maybe). It happens sequentially, just in C/C++ and not in the super slow language that Python is.
@dejangegic 1 year ago
So the main takeaway is: don't use Python because it's hella slow?
@fusionfuryx 1 year ago
Maybe you could consider adding the Numba JIT compiler to compile down to machine code at runtime, which can speed up the numpy calculation further.
@turun_ambartanen 1 year ago
Do you have any links for that? My understanding is that numpy is written in C and already ships as compiled machine code. But if I'm wrong, I'd like to read more about it.
@eddierandom6074 1 year ago
@turun_ambartanen In this case it shouldn't help, because of what you said. But say you have to run this function inside a for loop: Numba can compile the Python for loop and make it way faster. Numba doesn't usually speed up numpy itself.
@yash1152 1 year ago
2:07 Why are the voice audio levels dropping at the end of each sentence?
@andrianmovtchan9745 1 year ago
Very well made 👍
@mitchbet 1 year ago
great video
1 year ago
Ok... so numpy, implemented in C, is faster than Python. Should we be surprised? If you implemented the same algorithm with direct memory access in assembly, you could surely squeeze out some more speed. I was expecting something more... interesting. TL;DR: use numpy for numeric computations.
@aperson4051 1 year ago
SIMD doesn't change the time complexity... I don't understand the title
@DanielBoctor 1 year ago
That's right, it doesn't. The video was meant to showcase that although both functions have the same time complexity (O(n)), one can have a much faster runtime due to vectorization and parallelization. In other words, it's not just the number of operations that matter, it's the speed at which those operations execute.
@aperson4051 1 year ago
@DanielBoctor Oh, I see. Yes, now I get it... When I clicked the video I was expecting it to be about cache locality, and how sometimes instructing the CPU to do more work can be faster than having it wait for precomputed work to arrive. E.g., precomputing a lookup table and jumping around the LUT may actually be slower than just computing the value on each iteration.
@aperson4051 1 year ago
... however... if I precompute the LUT, I can now do n lookups and SIMD any remaining instructions :D
@Speak4Yourself2 1 year ago
Thanks a lot!
@sbtopzzzlg7098 1 year ago
Very nice explanation. However, I'd note that in such cases, optimizing the algorithm alone won't help once the other factors start to kick in: the language it's coded in, the CPU architecture, and so on. But from my point of view, time complexity isn't a way of ranking algorithms by "speed"; it describes how the input size influences an algorithm's running time. Algorithms with O(n) time complexity simply have execution times that increase linearly with input size, so when we call an algorithm O(n), we don't mean it's "fast", only that it scales linearly, which is relatively better than, say, quadratic or exponential. From that point of view, it's meaningless to call an O(n) algorithm "fast", and I believe the techniques mentioned in the video (using low-level implementations to execute operations hundreds or thousands of times faster) come before the aspect of time complexity.
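One way to see "same complexity, different constant" concretely; a rough sketch with made-up data (timings vary by machine, so only correctness is asserted):

```python
import time
import numpy as np

n = 1_000_000
rets = np.random.default_rng(0).uniform(-0.01, 0.01, n)

t0 = time.perf_counter()
slow = sum(v + 1 for v in rets)  # O(n), one interpreted iteration per element
t1 = time.perf_counter()
fast = (rets + 1).sum()          # also O(n), vectorized in compiled code
t2 = time.perf_counter()

# Both scale linearly with n; only the per-element cost differs.
assert np.isclose(slow, fast)
print(f"python loop: {t1 - t0:.4f}s, numpy: {t2 - t1:.4f}s")
```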
@seneca983 6 months ago
3:40 It should be noted that the log1p() and expm1() functions also have smaller errors when the argument is very small. Actually, I would guess that this, rather than speed, is the main reason those functions exist.
@batlin 1 year ago
The question asked in the title wasn't answered: "THIS is more important than time complexity?" The answer is: No. While there's more to efficiency than just notional time complexity, it's still the case that an O(2^n) algorithm could take billions of years to return a result. No amount of parallelisation and SIMD is going to fix that (although sometimes you can use techniques like dynamic programming to reduce from an infeasible complexity class to a more feasible one). Great video but the title is unnecessarily misleading.
Amazing!!! Keep it up Dan, finna learn the illest codes from you 🔥
@mcpecommander5327 1 year ago
Alternatively you could replace that with prod(list+1)-1. Unless that’s somehow slower? I don’t see how though.
@brunoais 1 year ago
This is good, but there's one thing you didn't explain: why did you build the intermediate list in the original code? Removing it doesn't improve things as much as numpy does, but it still makes the code much faster. In my test, just dropping the intermediate list gave a 2x improvement (versus 30x for numpy). Note I didn't use Jupyter; I used a regular .py file with CPython.
@idiota-w-swiecie 1 year ago
So you make a program faster by using C and not Python?
@nyverinorlyth9555 1 year ago
TL;DR Less Python More Fast, Better Look
@rk-zs5sy 10 months ago
np.exp is slower than math.exp on a single input.
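That matches the general rule: NumPy's per-call dispatch overhead dominates on a scalar, while it is amortized over a whole array. Both give the same value, so the choice is purely about overhead (a hedged sketch, with an arbitrary input):

```python
import math
import numpy as np

x = 0.05
# Identical result either way; math.exp skips NumPy's ufunc dispatch,
# so it wins for one number, while np.exp wins on arrays.
assert math.isclose(math.exp(x), float(np.exp(x)))
```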
@therealawesomequest 1 year ago
So... your advice is to write fast code? genius
@csabaczcsomps7655 1 year ago
It's about cache; some routines are in cache. My noob opinion.
@SEOTADEO 1 year ago
What? lmao
@hudumannaia 1 year ago
Haha phyton, haha
@tsukinoko_kun 19 days ago
This kind of optimization is almost never necessary. You get much bigger speed improvements from other methods, and almost no real application deals with numbers in a way that can utilize vector operations.
@nitsanbh 1 year ago
Bro forgot math exists. Add 1 to all values, then literally just multiply them all: value[0] * value[1] * value[2] * ... * value[-1]
@panjak323 1 year ago
Python 'programmers' worrying about performance... How cute 👍😅😅
@sephirothbahamut245 1 year ago
The key concept, that cache friendliness matters more than time complexity for anything but extremely large amounts of data, still applies to every language. C++ just makes it easy to put that into practice.
@meleody 1 year ago
python is a great language.
@xynyde0 1 year ago
this isn't a language to language comparision. Its about vectorization, something that is independent of language.
@SEOTADEO 1 year ago
@@xynyde0 if you write a normal for loop in Rust/C/C++ it is vectorized automatically by the compiler.