Whenever I see an upload with Prime and Casey, the next 2 hours is gone. Everything else can wait.
@frischifrisch6860 · 1 day ago
jup 😂
@CobaiaHardcore · 16 hours ago
Casey sounds better at 2x speed!
@blarghblargh · 10 hours ago
@@CobaiaHardcore all talking sounds better at 2x speed. maybe it's the coffee
@EightNineOne · 21 hours ago
I remember when the 1 billion loops one did the rounds on social media. I janked together a similar example locally between ruby and python. Both were slow, but ruby won. Then I switched from ruby 3.3.5 to truffle ruby and it went from ~30s to ~2s. What did I learn from this? Nothing. Absolutely nothing.
@mattymerr701 · 10 hours ago
I did an experiment between python and C++ where I switched from CPython to PyPy and it went faster than the C++ code. Also nothing to learn there.
@madpuppet666 · 9 hours ago
You learned to use TruffleRuby if you want performance. Sometimes it's not the language, it's the interpreter or compiler that matters.
@uberboat4512 · 7 hours ago
No. It's only the interpreter/compiler that matters. Languages themselves don't determine the speed.
@user-ni2od5lu6j · 5 hours ago
@@uberboat4512 Most limitations on compilers are set by the language standard, i.e. compilers are prohibited from doing some optimizations. That's why undefined behavior and strict aliasing appeared in the C/C++ standards: to give compilers some room to optimize code. Judging by Rust's speed, it seems that was the wrong road. Basically, the first example they discussed is a fail for all compilers and languages, because the final code shouldn't have any loops at all; they can all be optimized away. Just a simple sum without loops: "(int)(((u-1)*(i128)u*num_u + (numpercu + 1) * numpercu)>>1) + rand()%10000"
@OnFireByte · 4 hours ago
Language matters quite a bit, or at least its design and spec do. A static type system is needed to pack data efficiently (and vectorize it); even if the JIT is capable of monomorphization, it still has to do runtime checks for that. Inlining can't be used if polymorphism is based on dynamic dispatch; you need static dispatch for that (like impl vs dyn in Rust). Undefined behavior in the C spec, even if it's generally a bad thing, gives the compiler room to do more aggressive optimization. There are way more examples, but you get the idea.
@pikolopikolic5567 · 20 hours ago
I cannot get over how amazing these streams with Casey are. The way in which he explains things is so clear, and the chemistry you two have on stream is amazing.
@matt_milack · 1 day ago
Can someone please explain to me why people don't count Bash and PowerShell as programming languages? We Sysadmins are humans also. Edit: I beg people who don't know what Bash, PowerShell and Sysadmin are not to reply to this comment. Thank you!
@guywithknife · 1 day ago
Because… it doesn’t count when… reasons.
@AntranigVartanian · 1 day ago
Because people don’t know that POSIX Shell, PowerShell, Bash, AWK and the rest are proper programming languages.
@guywithknife · 1 day ago
Seriously though, they obviously are and anyone that says otherwise is ridiculous.
@tonyhart2744 · 1 day ago
"We Sysadmins are humans", because we replaced sysadmins with AWS years ago
@matt_milack · 1 day ago
@@guywithknife I've literally never stumbled upon any programming language tier list mentioning Bash and PowerShell.
@etgaming6063 · 22 hours ago
God I love these videos with Cacey, sitting down in the morning with coffee listening to these two for an hour just motivates me and relaxes me at the same time.
@belmintuzlic219 · 20 hours ago
First I thought this would be some simple explanation. I enjoyed, covered and absorbed the first half of the video, even the assembly part... And then when double was introduced I started to lose myself... After that came words I had never heard before, my knees felt weak, my neck started pulsating from blood flow. I felt stupid again, and I admire your knowledge! Great video 10/10
@VivekYadav-ds8oz · 1 day ago
28:40 YouTube won't let me post links, but it seems that in Rust this is caught, _if_ you write it a bit idiomatically. If the a[i] access happens via an .iter_mut() loop instead ( `for arr_i in a.iter_mut() { .. }` instead of `for i in 0..100_000 {..}` ), then the compiler catches that the inner loop is entirely pure and computes it beforehand!
@hard.nurtai4209 · 23 hours ago
wow. thanks for sharing that. compiler writers really are geniuses
@SimGunther · 23 hours ago
Interesting that we'd have to have a "loop go fast" option. I would hope that the compiler just "knows" to optimize it into a classic for loop even with the iterator pattern in the source, but what would I know about "zero cost abstractions" 😂
@arseniy2943 · 22 hours ago
@@SimGunther I'm not sure I understand what you're saying? Rust optimizes the iterator pattern better than it can optimize a classic for loop, so iterators are actually a net positive abstraction because they let the compiler better analyze how the pieces of code interact with each other
@remrevo3944 · 22 hours ago
Clippy even warns about this: "the loop variable `i` is only used to index `a`, consider using an iterator: ``, `&mut a`"
@SaHaRaSquad · 21 hours ago
@@SimGunther iter_mut() is that abstraction. The reason this can't be done with the classic for loop syntax is that it makes the lack of side effects much harder (if not impossible) to detect. I'm not sure you know what zero-cost abstraction even refers to.
@Sofia-rh7ji · 14 hours ago
I think the real question here is "why are C, Rust, and Zig different at all?" If you're compiling with clang, then every single one is just an LLVM frontend, and for for-loop iterations there's more than likely going to be exactly zero difference in the binary output, except for whatever timer they're using and the calling convention of whatever functions there are.
@paua7742 · 12 hours ago
Because the time measurement was also inaccurate lmao
@my_online_logs · 8 hours ago
Because they provide different information to the compiler, so the optimization is also different. For example, Rust provides richer semantic information to LLVM than C and C++ do, so LLVM can do optimizations for Rust that it can't do for C and C++. But for now LLVM can't fully utilize the rich semantic information provided (that's still in development), so the difference between Rust and C/C++ is not much: the same or slightly better. It may be some time before LLVM can take full advantage of it.
@MatthisDayer · 3 hours ago
Zig is ever so slightly faster because by default it targets your machine, making use of whatever features your cpu has. Also functions by default don't follow a strict calling convention, this saves some moving around of registers when calling a function.
@remrevo3944 · 23 hours ago
25:55 Looking at this, the outer loop could definitely be removed, because every element of `a` is calculated to be the same value. And the inner loop *could* be calculated in constant time using some math. (This is even the kind of optimization that could be added to a compiler backend like LLVM.) Edit: Because just *saying* that the inner loop could be calculated in O(1) wasn't satisfying enough to me, I just went ahead and implemented it. v1 is the test implementation from the program and v2 is the optimized version, without having to do a loop:

fn v1(u: u64) -> u64 {
    let mut o = 0;
    for j in 0..100_000 {
        o += j % u
    }
    o
}

fn v2(u: u64) -> u64 {
    let d = (100_000 - 1) / u;
    let m = (100_000 - 1) % u;
    d * gauss(u - 1) + gauss(m)
}

/// Calculate the sum of 1, 2, 3, .. n
fn gauss(n: u64) -> u64 {
    n * (n + 1) / 2
}
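For anyone who wants to sanity-check that closed form outside Rust, here is a minimal Python sketch of the same v1/v2 pair (the function names mirror the comment; the port itself is mine):

```python
def v1(u: int, n: int = 100_000) -> int:
    # naive version: sum j % u over the whole range
    return sum(j % u for j in range(n))

def gauss(n: int) -> int:
    # sum of 0, 1, 2, .. n
    return n * (n + 1) // 2

def v2(u: int, n: int = 100_000) -> int:
    # closed form: d complete cycles of 0..u-1, then a partial run 0..m
    d, m = divmod(n - 1, u)
    return d * gauss(u - 1) + gauss(m)

# spot-check that the two versions agree, including edge cases around n
for u in (1, 2, 3, 7, 99_999, 100_000, 123_456):
    assert v1(u) == v2(u), u
```

The invariant is that the range 0..n-1 splits into d complete residue cycles plus one partial run of m+1 terms, which is why the two Gauss sums cover everything.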
@nbjornestol · 20 hours ago
Yeah the whole program can actually be simplified down to `print(9999900000 % u + r)`
@remrevo3944 · 20 hours ago
@@nbjornestol Where did you get that number from? Because that's not how modulo works.
@dexterman6361 · 10 hours ago
Wait, what am I missing here? How come a[i] = a[i] + j % u is always the same number?
@remrevo3944 · 10 hours ago
@@dexterman6361 The outer loop sets every element of the array to the same number. (And then reads out *one* of them) I assume that's what you mean. The inner loop can be calculated with some *math* (which I put in my original comment).
@nbjornestol · 9 hours ago
@@remrevo3944 Edit: Nvm you're 100% right, I was thinking too much in moduli, the sum isn't done modulo u, only each number.
@filipg4 · 1 day ago
Profiling and measuring code execution in general is such a rare skill these days. You are more likely to find a sane Rust programmer, than a programmer who knows how to profile their code. Needle in a haystack.
@nicholasredmon9851 · 1 day ago
It's so weird too. Intel VTune is free and works quite well, has a nice GUI, etc...
@evaldssontom · 1 day ago
@@nicholasredmon9851 Never tried VTune, how does it compare to the gprof CLI?
@Vitis-n2v · 23 hours ago
Most people just want to write code that does what they need it to do and don't care about speed. If they do any measurement, it's usually a Date.now() before and after the code section to calculate the approximate amount of time the code takes, and that's where their debugging and profiling ends. The number of programmers that use only print statements to debug is way higher than it should be.
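The before/after pattern described in that comment looks like this in Python (time.perf_counter standing in for JS's Date.now; the workload is a made-up stand-in):

```python
import time

def busy_work(n: int = 100_000) -> int:
    # stand-in workload: the benchmark's inner loop shape
    return sum(j % 7 for j in range(n))

start = time.perf_counter()
result = busy_work()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"busy_work() -> {result} in {elapsed_ms:.2f} ms")
```

A single sample like this includes whatever else the machine happened to be doing at that moment, which is why repeated runs (e.g. timeit) or a real profiler give far more trustworthy numbers.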
@evaldssontom · 23 hours ago
@ Well, sometimes adding a few print statements to an app with HMR is a lot faster than starting a debugger. Sometimes a debugger also changes execution of the program if it is async.
@alekseyburrovets4747 · 21 hours ago
Are you kidding me? Dude, there is a program called 'gprof'. Try it some time.
@eleven5707 · 22 hours ago
For Julia to be that low, they probably didn't even put the code inside a function, showing that there wasn't even an attempt to write the loops in the most performant way in each language
@eleven5707 · 22 hours ago
just checked the code, they didn't use a function, right on the money lmao
@ItahangLimbu · 22 hours ago
@@eleven5707 I was also guessing and was correct
@BIGAPEGANGLEADER · 21 hours ago
Guarantee the R implementation is similarly fucking stupid, has unvectorised operations and ultimately executes functionality for which loops just wouldn't be used
@sjuns5159 · 20 hours ago
@@BIGAPEGANGLEADER I mean, I *would* argue that the fact that you have to vectorize everything in R to get anything performant is annoying, plus not everything can even be vectorized. But yeah, you also shouldn't pretend vectorization doesn't exist and isn't at least okayish. In Julia loops are fast (but you do have to put them in a function, and there are definitely more non-obvious things you have to know to make code performant)
@abtesk · 19 hours ago
Couldn't agree more.
@Kiyuja · 20 hours ago
seeing Casey on the channel always warms my heart. I love this guy's talks
@bigmiraclewhips · 22 hours ago
Casey is the goat. Always learn so much when listening to him.
@Ny_babs · 15 hours ago
We need to protect Casey at all costs.
@BelacDarkstorm · 21 hours ago
For those curious about why Go's performance is so bad: the Go code uses int, which defaults to 64-bit integers; basically every other language is using 32-bit integers. Weirdly, it seems that on Apple Silicon switching to 32-bit integers reduces performance. On x64, when you switch to 32-bit integers it runs about as fast as C.
@MyAmazingUsername · 20 hours ago
This is right but wrong. The int, uint, and uintptr types are usually 32 bits wide on 32-bit systems and 64 bits wide on 64-bit systems. Go has these integer types: int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64, uintptr, and byte (an alias for uint8).
@MyAmazingUsername · 20 hours ago
It is automatically based on the target architecture. It also has specific types like uint32 for when you care about size.
@travistarp7466 · 18 hours ago
I knew there was no way Go is almost the same speed as Node. Go is going to get you similar speeds to other compiled languages like C and Rust if your application is not memory heavy.
@retropaganda8442 · 17 hours ago
32 or 64 bits integers have the same speed until you have to transfer a huge amount of them to/from RAM
@retropaganda8442 · 17 hours ago
Go programs also tend to consume a second CPU to clean up all the garbage the language lets programmers create.
@theferaltaint5065 · 21 hours ago
A lot of people also don't write Go correctly. If you fill your program with interfaces containing numerous methods and shit when there's no clear benefit, you will slow your program down. Not understanding how to leverage the compiler and lower GC load further slows it down. Honestly, the more simply you write the code, the better it performs. Aside from annoying things like no generics on methods without explicit type casting, no optional function parameters without allocating slices via variadic arguments, and a couple of other things, Go is by far one of the best languages out there. People hate the verbose error handling. Just build your own error system. You don't have to return the std library error type. It's your code. Build what you want.
@dunebuggy1292 · 17 hours ago
You can literally abstract the error handling to be simple. This is such a junior level complaint.
@RomvnlyPlays · 15 hours ago
@@dunebuggy1292 Just because you can get around something doesn't mean it's not a pothole in the road. Nice try
@dunebuggy1292 · 15 hours ago
@@RomvnlyPlays No, it's a primitive that you would otherwise abstract upon replication/redundancy, like you would any primitive. That's the whole point of programming. Go makes errors as expressions for this very reason. Anyone who is a competent programmer should be able to put one and two together and abstract for the problem set.
@mattymattffs · 11 hours ago
@@dunebuggy1292 I think the argument is more that you shouldn't have to do that
@inertia_dagger · 7 hours ago
> no optionals could be sidestepped with pointers or builder pattern, perhaps
@PhilfreezeCH · 18 hours ago
25:30 Not only does the sum of j%u always produce the same number, I am pretty sure modulo as used by most languages (if not all of them?) is distributive so you can effectively compute the sum of j and then just use modulo u once, completely solving the inner loop in constant time.
@vinaydeshpande862 · 16 hours ago
Yes, but you would have to use 64-bit integer for accumulation.
@simivb · 16 hours ago
Maybe the j%u const replacement doesn't work because you add it to the array value, which is an int, for which overflow is undefined behaviour (at least in C++ in prior versions, maybe here too), and it isn't allowed to optimize because it doesn't know a priori the result of a possibly occurring overflow.
@simivb · 16 hours ago
Wait, no, j%u can't overflow the int I think.
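For what it's worth, the hoisting works because u never changes between iterations, not because % distributes over sums; a quick check with arbitrary small numbers:

```python
# modulo does not distribute over addition:
a, b, c = 2, 2, 3
assert (a % c) + (b % c) == 4      # 2 + 2
assert (a + b) % c == 1            # 4 % 3

# so summing j % u is generally not the same as taking sum(j) % u
u, n = 3, 5
assert sum(j % u for j in range(n)) == 4   # 0+1+2+0+1
assert sum(range(n)) % u == 1              # 10 % 3
```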
@mage3690 · 22 hours ago
If I had to guess, the `a[i] += r;` line in the C code is there to make sure the compiler doesn't vectorize the inner loop, as is the int32_t data type. Basically, the guy was trying for readable assembly output in Godbolt, which is a weird thing to optimize for. I bet the JS guy didn't do that. Anyways, gcc has a Fortran compiler. IDK if that's where the Fortran team went or what, but if a language has its own compiler, there's a pretty good chance it's in gcc.
@marco_foco · 14 hours ago
I am the fastest language: 1) all the numbers are the same, so just run once and print that value; 2) even with only the inner loop, you can compute it in closed form without a loop: if u < 100000, print (100000/u)*u*(u-1)/2 + (100000%u)*(100000%u - 1)/2 + r; // all calculated using integer arithmetic; if u >= 100000, print 4999950000 + r. I got this result around minute 24:00, where Prime mentioned the constness of the result of the inner loop, while 1) wasn't mentioned yet.
@amafi_poe · 19 hours ago
Regarding the stuff around 1:19 or so: doing the work of disassembling and looking at the underlying instructions to investigate loops clearly takes more expertise and willingness to do work than the guy who posted the original benchmark is capable of, if you read how he replies to people. I can't tell if he's actually trying to do a thing or if he's just farming engagement, but either way I wouldn't expect him to do anything useful
@cryptyyy_7667 · 14 hours ago
Take a cat, a dog, an ant, an elephant and a shark and grade them all at how well they swim and declare the best swimmer as "the best animal ever". Obviously this doesn't work...
@HamishArb · 18 hours ago
27:14 Since it only reads a[r], the compiler could theoretically also optimise out the computation for all the other values of i, making the whole thing O(1) in combination with optimising the sum over j values.
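Putting the two observations together (the inner sum is a constant, and only a[r] is ever read), the whole benchmark collapses to one expression. A Python sketch of the idea, with loop sizes shrunk so the naive version finishes quickly (the real benchmark uses 10,000 by 100,000):

```python
def naive(u: int, r: int, outer: int = 100, inner: int = 1_000) -> int:
    # the benchmark's shape as discussed: fill a[], then read one slot
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):
            a[i] += j % u
        a[i] += r
    return a[r]

def collapsed(u: int, r: int, inner: int = 1_000) -> int:
    # inner sum in closed form (d complete 0..u-1 cycles plus a partial run),
    # and only a[r] matters, so no arrays or loops at all
    d, m = divmod(inner - 1, u)
    inner_sum = d * (u - 1) * u // 2 + m * (m + 1) // 2
    return inner_sum + r

for u in (1, 2, 7, 999, 1_000, 5_000):
    assert naive(u, 3) == collapsed(u, 3)
```

Here `naive` only mirrors the benchmark's shape as described in the video; `collapsed` is what a sufficiently clever compiler could in principle reduce it to.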
@my_online_logs · 16 hours ago
create cache friendly code so that the cpu will not recompute if the computation is the same, but instead directly fetch the result from the cache
@sereysothe.a · 21 hours ago
the R code is atrociously written for anyone who is actually familiar with the language. two great things about R is that you can treat it like a functional language, and you can outsource expensive subroutines to C implementations
@BIGAPEGANGLEADER · 21 hours ago
Knew this would be the case when I looked at the speed. Not to mention if we take a step even further back, I wouldn't be surprised if a loop isn't required to execute whatever task their loop implementation is doing
@SaHaRaSquad · 20 hours ago
So the R code is bad, the PHP code is unnecessarily slow, the C code prevents compiler optimizations, the Julia code is much slower than it could be... Or in other words that benchmark is so bad the only way to make it worse is to literally just fill a table with random numbers.
@Houshalter · 20 hours ago
You can use R for ages without ever having to write a for loop.
@sereysothe.a · 19 hours ago
@@Houshalter I've written an entire R package without a single for-loop in it
@MrAshtordek · 18 hours ago
Wow, that is so bad... The most idiomatic way to write that in R would probably be preallocating the entire "j" array, then calculating the entire "a" array using a map, with the inner function doing a vectorized modulo of "j" by "u" followed by sum and adding "r"... (assuming that at this point it isn't painfully obvious that you can make the optimization they are talking about and just precompute "sum(j %% u)". 1) Theirs 2) "a
@marcocaspers3136 · 1 day ago
I think where it falls down is: what do you define as the performance of a language? Can you actually measure the performance of a language? Or do you measure the performance of the runtime, or the compiler, or more precisely the capability of the compiler to efficiently compile to machine code?

Also, are you actually measuring the process that is executing, or are you measuring the whole operating system and everything it is doing? That means: what if anti-virus kicks in at some point, or "disk clean-up", or any other "idle" process, because there isn't actually much happening on the other 23 cores of the CPU? Or the CPU is running hot at some point, so it's throttling because a previous test built up heat that has not yet dissipated?

That already assumes these tests were all performed on the exact same system under the exact same environmental conditions, because ambient temperature influences the temperature of the CPU, which can influence the performance of the CPU. In fact, I would go as far as to say that there are far too many "moving" parts for you to ever measure this accurately.
@harier64 · 23 hours ago
What's the difference between the compiler and the language, when the end result is what the compiler produces? The language is just your steering wheel for the compiler at all times anyway
@marcocaspers3136 · 23 hours ago
@@harier64 Take C: you have GCC, Clang, and a host of other compilers. The language is separate from the compiler, which is my point. You are testing the capabilities of the compiler, not the language. There are even differences between versions of compilers: even if the language doesn't change, the output of compiler version 1 and compiler version 2 can be different for the exact same code.
@JeSuisUnKikoolol · 22 hours ago
"The performance of a language" doesn't make any sense as you point out, the only things you can measure are its implementation(s) so compilers/interpreters. When it comes to external factors then yea it can be really hard to measure and be sure that your results are not because of some factors you didn't control for. As an example the position of your code in memory can have a big effect so if you add/remove unused code it could shift things around and make your program faster/slower. I saw a paper a while ago and the authors made a framework to do benchmarks while trying to take into account a lot of these external factors. Some of the factors I remember were restarting the computer before the benchmark, having the network card disabled, disabling some daemons/services, running the program multiple times but with different stack positions/alignments, shorter/longer environment variables, different working directories. I don't think it's impossible to properly measure because of too many moving parts, it's just hard but at the end of the day the more variable you take into account the better. Most of these factors are relatively small so you don't need to control for all of them to have a realistic picture of what's actually going on. There is also the fact that some of these factors are independent from each others and will "cancel out" their effects so in average you'll get a similar performance regardless of the environment (not talking about different hardware here obviously)
@SaHaRaSquad · 21 hours ago
Even a longer username can slow down a program because it makes the PATH environment variable longer which again can shift the address space of the program and make it less cache efficient. And some CPUs may not be (yet) supported well in the operating system or lack a microcode update and so on. There are simply too many variables for a single benchmark to ever be relevant by itself.
@my_online_logs · 16 hours ago
The language makes the compiler output different: if the language can provide rich semantic information to the compiler, the compiler can do more optimization (assuming the compiler is able to fully utilize the semantic information provided)
@owlcaps7876 · 15 hours ago
Isn't it dividing the loop into four parts because it's checking i < n, and the processor can do an addition every 0.25 cycles, so it can do four at a time and then check them all at once, running the results of those four after? IDK but entertained
@JJOULK · 18 hours ago
Very interesting discussion. Regarding "comparing languages on real workloads" with, for example, HTTP servers: there is a channel (Anton Putra) that does exactly that, for example comparing Go to Rust. He usually uses frameworks (understandably), but does compare decent server performance metrics. All in all it's insightful, but again, it should be taken with care due to the complexity of framing and measuring.
@my_online_logs · 16 hours ago
Agreed. Comparing languages with a lightweight load will not show the difference between the languages, because all they do is the lightweight load. It's just like comparing a math olympian against a common person on a lightweight question such as 1 + 1 = ...
@manyids2 · 19 hours ago
the modulus loop can be expressed in closed form [ since mod(a,c) + mod(b,c) = mod(a+b,c) ], so basically mod(100k * (100k+1)/2, u)?
@disquettepoppy · 12 hours ago
[mod(a,c) + mod(b,c) = mod(a+b,c)] - this is untrue (you probably conflated being congruent modulo c with the operator itself, or something), but some of the other comments show the correct closed form
@manyids2 · 8 hours ago
@@disquettepoppy ok, of course... untrue by inspection, as there exist a,b,c (a=b=c-1) such that mod(a,c) + mod(b,c) = 2c - 2, while mod(a+b,c) < c
@ferdynandkiepski5026 · 23 hours ago
-march=native will generate instructions for the native CPU. If the CPU that the code gets distributed to doesn't support the extensions the host CPU does, the executable will crash with SIGILL, which means illegal instruction. And even if it does support them, the result still might not be optimal. Ideally you want to compile for the target machine architecture, for example zen5. If you don't know it, you need different versions of the executable to support the major extensions that matter for perf, like AVX-512, but then you rely on the user to choose the correct one. The only seamless solution is runtime detection, which then dispatches the proper version of the code. This is done for high-performance libraries like simdjson, though it adds complexity.
@izd4 · 22 hours ago
march=native is SUCH a funny flag. It's so architecture-specific that I can't imagine anyone but a gentoo user trying it
@romanstingler435 · 22 hours ago
@@izd4 not a gentoo user but I guess that most Arch users if they compile something from the AUR are also using march=native
@alekseyburrovets4747 · 21 hours ago
Please drop a bash oneliner here in order to robustly detect march of the precompiled binary (produced by gcc) that will definitely run at the target machine. Please keep in mind that it should work with a virtual environments (with cut-off instruction set) too. Thank you so much.
@LtdJorge · 21 hours ago
@@izd4 yes, it's mostly for testing. I'm on Gentoo and I use -march=znver2
@romanstingler435 · 20 hours ago
@@alekseyburrovets4747
objdump -d /usr/bin/fish > fish_disassembly.txt
grep -i 'avx' fish_disassembly.txt
Swap in whatever instruction you are looking for.
@perguto · 21 hours ago
Prime & Casey Show let's go!!!!
@hunterap23 · 17 hours ago
If you look at all the code that the original author submitted, it is all written by someone who is very new to programming. They didn't write anything that made sense in any of the original languages' code
@bkr_418 · 23 hours ago
YES, I was hoping you’d do a video on that!!
@Windeycastle · 23 hours ago
Guess I'll still have to know how my computer works, and how to write efficient algorithms. Which is exciting!
@madpuppet666 · 9 hours ago
I think it's interesting just to know what overhead comes with a language: what memory tracking, like ref counting or garbage collection, is enforced and can't be circumvented, and what kind of memory access is mediated. I wouldn't ever bother comparing interpreted languages to compiled languages, but these smaller performance tests are pretty useful for comparing interpreted languages, since virtual machines are not equal, whereas languages sitting on LLVM are fairly equal.
@Kae____ · 22 hours ago
Question: would calculating an arbitrary number of digits of pi and measuring the amount of time each program takes across multiple runs be a better micro benchmark? I still don't think it would be productive or useful for any real life application (apart from calculating pi…), but would that be better for whatever purpose the author of that benchmark wanted?
@LtdJorge · 21 hours ago
You get into allocation territory. The bigger the number, the more allocations you need. Unless you want to limit it to u64, which will be done in seconds.
@Windeycastle · 16 hours ago
@@Kae____ Might be a good one to try, but are you benchmarking the language or the proficiency of the programmer?
@muhammedkadirtan3469 · 19 hours ago
my man beating the compilers, let's goo!
@itsnumpty · 1 day ago
It's so weird. The speed of a loop determines a language's capabilities?
@User948Z7Z-w7n · 1 day ago
You are right. Where are the if-else performance metrics?
@Jack-b4s3g · 23 hours ago
If you need fast brute force loops then yes.
@ForeverZer0 · 16 hours ago
Typically no, but for Python, yes.
@AlexMax2742 · 17 hours ago
To my recollection, Zig doesn't actually have a decent native benchmarking framework that's cross-platform. I think the most popular one is Linux specific and leans on performance counters. I actually have been working on porting nanobench from C++ to Zig just so I could have some numbers I could trust for some code I was testing, and in my experience it's about even with or slightly slower than C. However, in some cases Zig is actually significantly slower because the language doesn't give you subclassing or interfaces, which means that you have to roll those sorts of things yourself. When Zig developers roll their own, I find they tend to gravitate towards v-table driven design instead of duck typing or interfaces, which is naturally going to be slower due to the extra indirections inherent to pointers. Plus there's also the fact that Zig is still a work in progress so naturally it will implement some things suboptimally. Don't get me wrong, Zig is still a great language, and the up-and-comer I'm most excited about by a long shot. Just take performance numbers with an enormous grain of salt.
@bzboii · 11 hours ago
One thing that's interesting is that they talk about the CPU like it's some god-given inherent law of nature. But remember, they created the CPU, and these adder units etc. exist BECAUSE they lead to fast execution of the code they run. Like at 50:00: CPUs have spare adders and jump capacity BECAUSE it makes for-loops fast, BECAUSE CPUs run for-loops a lot. CPUs co-developed with code, and they're designed to make it fast and vice versa
@MarkAlterBridge · 1 day ago
@24:22 Could someone explain a bit more in-depth why it would always be the same number?
@sergey1519 · 23 hours ago
Actually that's not the optimisation they are talking about. They are saying you can replace the nested loop with:

temp = 0;
for (int j = 0; j < 100000; j++) {
    temp = temp + j%u;
}
for (int i = 0; i < 10000; i++) {
    a[i] = temp;
    a[i] += r;
}
@smx75 · 23 hours ago
you can rewrite it as:

int tmp = 0;
for (j = 0; j < 100000; ++j) {
    tmp = tmp + j%u;
}
for (i = 0; i < 100000; ++i) {
    a[i] = a[i] + tmp; // "a[i] = tmp" is also valid
}
@TheArrowedKnee · 23 hours ago
@@sergey1519 I was a little bit confused when it was streamed, but what I think they mean is that the inner loop will always end up being the same number, because u is effectively a constant. So there's no need to compute that inner loop over and over again; you can just factor it out and use the result in what was previously the outer loop.
@equinox4467 · 22 hours ago
After the first iteration of the inner loop, you've basically calculated the values that will be used for every iteration of the outer loop. You'd want the result of the modulo to change on every iteration of the outer loop.
@_me_steven · 22 hours ago
They're saying that the value created by the inner loop, after summing every value, is going to be the same because the variables affecting it are never changed. That means the value created by the inner loop is always the same. So when the outer loop is asked to run the sum 10_000 times, they may as well calculate the value once, before the loop is called, and then just reuse it 10_000 times
@nekocat34 · 20 hours ago
I was wondering what the code they wrote for Lua looked like, and it's worse than I expected. Instead of initializing each entry of the table normally by doing a[i] = 0 before the inner loop, they decided to write this: a[i] = (a[i] or 0) + j % u. Writing a[i] = 0, or just using r as the value to initialize before the loop, is about 10% to 15% faster on my machine. Or you could just register a C function for such a task...
@newrind · 1 hour ago
Too much speculation. Just ask the guy who did it.
@wewillrockyou1986 · 1 day ago
Hell yeah another long video with Casey
@michaellatta · 22 hours ago
We just did a Rust vs Elixir comparison. We ended up finding Elixir slightly faster. But it took a lot of work to optimize for the Erlang VM architecture. Given the huge value of OTP, we would prefer Elixir even if it were a bit slower. The Rust version took four months and the Elixir one, one month.
@JariusJenkins · 20 hours ago
My question for this 'benchmark' initially was "why force all languages to follow a model that might not fit the language in question?" Elixir's going to be recursing along, and is known (at least the BEAM is) for easy parallelism and concurrency. Why not try that? A couple of minutes later:

Name          ips    average     deviation    median       99th %
HandWritten   4.64   215.44 ms   ±2.40%       215.21 ms    234.31 ms
FlowBased     4.56   219.53 ms   ±2.48%       218.48 ms    238.11 ms

Just tweaking the outer loop to chunk based on the number of cores, and we get the fastest time on that chart. It just shows exactly what they were saying in the video: you can compare, but you need to be comparing strength vs strength, not some mediocre middle ground. The Elixir parallel code is like 6 lines longer and adds almost no complexity, so I feel it's a pretty fair comparison. In fact, if you use Flow, it only adds an extra line of actual code, and simplifies the "outer_loop" code. I'm sure you made the right choice for your use case, especially with the wins on dev time. And, if done right, you'll really not be losing too much on performance (though you knew that already).
@my_online_logs16 сағат бұрын
Without showing the code, your comparison is meaningless, and without showing how high the load in the test was, the real difference under a production workload won't show. A common mistake Rust newbies make is benchmarking the debug build rather than the optimized release, and doing lots of clones because they don't know how to borrow or share ownership. Rust can also be optimized further with settings in Cargo.toml, and I bet you didn't do that.
@michaellatta16 сағат бұрын
@ in fact I did use a release build. The rust code is about 20k lines solving a real problem, not a benchmark. I have no need to try to convince anyone since every problem stresses languages and runtimes differently. My only point would be that for all people deciding between languages to do a real test that addresses the main risk areas of your problem. It is worth spending a month or two to get that right for a specific problem.
@my_online_logs16 сағат бұрын
@michaellatta Yeah, it could be that you use clone all over the place, which reallocates everywhere, because it's the easiest way to get rid of the value-moved error rather than using borrowing or shared ownership. Idiomatic Rust is far more performant than idiomatic Elixir: no garbage collector, and rich semantic information for the compiler to optimize with.
@michaellatta15 сағат бұрын
@ I certainly expected Rust to be faster. I do clone some small values when needed, for things like dictionary keys, but no clones of large structs. Given that my use case is managing a large number of maps built from a much larger set of JSON data, it is possible the Elixir map implementation (presumably in C) is dominating. I have multiple smaller benchmarks built to ease into the comparison, to see if we could eliminate Elixir right away. So I have a pretty good idea about what is going on. While I would not claim to be a Rust expert, I have worked commercially in 17 programming languages including C and assembly, and have a good understanding of architecture's impact on execution. As in all performance discussions, the only way to know anything is to measure it. Intuition is too often wrong. We are satisfied (for now) that Elixir will meet our needs performance-wise, and is FAR better from a programmer-productivity and distributed-computing point of view. But if we have issues down the road, we will revisit the decision.
@skilz809821 сағат бұрын
These two should make this a regular routine.
@SaHaRaSquad20 сағат бұрын
Benchmark Busters
@skilz809819 сағат бұрын
@@SaHaRaSquad && -> !exception
@Vinaykumar-vy2sn22 сағат бұрын
i bet casey is very protective about his computer
@MrSomethingdark21 сағат бұрын
My fav programming guys, in the same thumbnail! YO!
@lukaleko720818 сағат бұрын
can someone explain to me, how the compiler can calculate the inner loop ahead of time, at 43:06? u is unknown at compile time...
@retropaganda844217 сағат бұрын
The compiler could generate code that calculates it at runtime. But just once, before the loops.
@ForeverZer016 сағат бұрын
It isn't that it can be calculated at compile time; it's that, where it is used, it would only need to be calculated once, not on every iteration of the inner loop. The relationship between the numbers in the operation is constant. Here is a flattened and simplified example that conceptually illustrates the same problem:

x = ...
y = ...
z = x - y

Now, if with each iteration you add 1 to both x and y, do you really need to recalculate z on each iteration? The code in the video has such an issue; it is just far less obvious because it uses nested loops and a modulo.
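That invariant, as a tiny runnable sketch (values made up purely for illustration):

```python
# z = x - y never changes while x and y move in lockstep,
# so a compiler (or a human) can compute it once, outside the loop.
x, y = 10, 3
z = x - y              # hoisted: calculated a single time
for _ in range(5):
    x += 1
    y += 1
    assert x - y == z  # recomputing it each iteration would be wasted work
```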
@__-dy9gv20 сағат бұрын
Another optimization compilers don't do that would likely be worth doing in this code: computing the constants needed for a fast modulo by u at runtime, and using those instead of an idiv.
@user-pe7gf9rv4mКүн бұрын
no haskell?
@boatunsoldКүн бұрын
haskell is way too fast to be shown on the comparison
@timedebtor20 сағат бұрын
Haskell is lazily evaluated so only executes when the results are useful. Benchmarks are not useful. Haskell will never execute. 0ms runtime
@roelhemerik57157 сағат бұрын
Haskell would definitely have the fastest runtime for this benchmark, as GHC will just optimise it to the final result… That said, it will probably also have the longest compile time, because GHC must optimise it to the final result.
@benitoe.487813 сағат бұрын
That Julia should be that slow is wild. It is basically made for stuff like this. But this is what I say about every benchmark result I do not like.
@The1RandomFool6 сағат бұрын
That Python result can vary so much depending on the library used to run it that I think it's misleading. I used Numba, a just-in-time function compiler, and got an average of 1.747 seconds with Hyperfine on my system. Using NumPy and eliminating the loops altogether gives an average of 47.4 ms; it optimized the vast majority of the benchmark away and simply gave the answer, as designed. Python just beat everything on the chart.
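For what it's worth, here is a scaled-down sketch of how I'd guess the NumPy version collapses the work (assuming the benchmark shape is: each a[i] accumulates j % u over the inner loop, then adds r, and the program prints a[r]):

```python
import numpy as np

def bench(u, r, outer=10000, inner=100000):
    # every a[i] receives the identical inner-loop total, so compute it once
    inner_sum = int(np.sum(np.arange(inner, dtype=np.int64) % u))
    a = np.full(outer, inner_sum, dtype=np.int64) + r
    return int(a[r])  # the benchmark prints a[r]
```

That's one pass over the inner range instead of outer × inner iterations, which is why most of the benchmark simply disappears.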
@keyboard_g19 сағат бұрын
I would expect most languages compiled with LLVM for simple cases to perform nearly identical.
@madpuppet6669 сағат бұрын
if two languages hit the same ballpark then you're usually fine. as someone who works on 60fps games, garbage collection is my main enemy. That fucker makes spikes that you either can't get around or have to spend a lot of time trying to isolate or smooth out. Any 2 LLVM languages that don't use garbage collection can probably just be considered equal, because it will come down to programmer skill, but I wouldn't choose C# over C++ for performance, simply because of the garbage collection.
@lupf56893 сағат бұрын
I would have assumed the stuff they introduced in newer .NET versions, like Span, Memory, object and array pools and such, should help avoid a lot of allocations and take a lot of pressure off the GC. Am I wrong?
@sarojregmi20023 сағат бұрын
Guys, imagine Casey reviewing your PR.
@blarghblargh9 сағат бұрын
"Looks good 👍"
@markminch19069 сағат бұрын
my PRs are good so that's not a problem
@jn-iy3pz12 сағат бұрын
I think the meaning of the visualization is that the rate of the bouncing == the speed of the algo. That would be better visualized by a bar chart (one that doesn't bounce around), or even more simply, an ordered list of numbers (without bouncing rectangles in the background).
@ShaneFagan22 сағат бұрын
My first reaction to this was just "zero chance this is a valid comparison", and the more I think about it the worse I think it is. Here are a few from my list:

1. There is a big difference between compiled and interpreted languages. C is fast because the compiler will look at your code and understand "hey, you are doing this thing, let me try to chop off some unneeded calls or use the fastest method", whereas Python, if you toss it something on the fly, has to work it out and then spit out the answer. So if you are saying compiled languages are by and large faster than interpreted languages, then the answer is "duh".

2. Even with an interpreted language there are ways to make things run fast. I went in and looked at the code that was run for this comparison and got Python to run faster than most languages, and with less code, by using NumPy instead of the original for-loop approach. NumPy plus standard Python got within 2ms of C on my machine.

3. The test itself is technically invalid too, because it starts off with a random number generator. I'd assume this is to stop compiled languages from cheating and pre-computing the result with compiler tricks, but it causes serious issues with how valid the comparison is. Like if you generate a number with 5 zeros in one language and 15 in another, you aren't comparing apples with apples.
@TurtleKwitty18 сағат бұрын
At minimum they really should have taken two args from the CLI, made a list of, say, 1000 pairs, and run each language against each pair. It would be fairer and would balance out any strange edge cases a language might have on specific pairs, while staying reproducible, yeah.
@tomorrow614 сағат бұрын
Ok, that was scary, as it took me back to a very simple performance/memory sanity test loop I wrote a couple of decades ago to exercise memory in sets of 10000-item arrays in Java. This benchmark was so similar.
@88Nieznany886 сағат бұрын
Amazing. I knew division is super slow and compilers despise it, but it's eye-opening that code which looks like it would take longer can actually execute 3-4x faster. Just wow!
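As I understood the video, that's the flavor of trick involved: replace the divide hidden inside % with a wrap-around counter. A Python sketch of the transformation (it only demonstrates the equivalence; the speedup shows up in compiled code, not in Python):

```python
def sum_mod_naive(n, u):
    # what the benchmark's inner loop computes (j % u costs a divide each time)
    return sum(j % u for j in range(n))

def sum_mod_no_div(n, u):
    # same result with no division at all: since j goes up by 1 each
    # step, the remainder just increments and wraps back to 0 at u
    total, m = 0, 0
    for _ in range(n):
        total += m
        m += 1
        if m == u:
            m = 0
    return total

assert sum_mod_no_div(100000, 7) == sum_mod_naive(100000, 7)
```

The branchy version looks like more work on paper, but a predictable compare-and-reset is far cheaper than an idiv on modern CPUs.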
@homomachina580816 сағат бұрын
Question though... a lot of what's mentioned here as compiler optimizations, while I very much agree they help performance in theory, do we really know how much they affect real-world applications? To some significant degree, I'm sure, but having a % operator in a for loop is incredibly common in many languages. If I were writing code to compute large arithmetic that didn't use things like %, I would likely not be using a for loop and instead just writing the math operations. While I appreciate the compiler saving my rear end from the sloppy for loops I'm sure I write, I'm wondering if the author of this was getting at some sort of point after all? Even if the benchmark results are crude, writing something less crude would take a lot more effort. Is there any way to judge how realistic these benchmarks are?
@bibekshah37017 сағат бұрын
I think the for loop alone is also a good indicator of speed, because looping over a huge amount of data is a fairly common task.
@swordfeng7 сағат бұрын
Is no one noticing that, for all i, a[i] is just the same value? The result is just r + sum(j % u for j in range(100000)) = r + sum(range(u)) * (100000 // u) + sum(range(100000 % u)), not considering negatives and overflows. And you can optimize sum(range(x)) down to a constant-time operation.
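A quick runnable check of that closed form (q full cycles of 0..u-1, plus a partial cycle of length d):

```python
def inner_sum_closed(n, u):
    # sum of j % u for j in range(n), in O(1)
    q, d = divmod(n, u)   # q full cycles, then a partial cycle 0..d-1
    return q * (u * (u - 1) // 2) + d * (d - 1) // 2

for u in (1, 2, 7, 100, 99999, 100000, 123456):
    assert inner_sum_closed(100000, u) == sum(j % u for j in range(100000))
```

So in principle the whole billion-iteration benchmark reduces to one divmod and a couple of multiplies, which is why "can the compiler see through the loop?" dominates the results.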
@skaruts11 сағат бұрын
I don't usually compare languages, but I've done it a couple of times with a Game of Life. I made it with CLI graphics just to make sure it's working properly, then turned the graphics off and just computed some number of generations. It's important that all tests start from the same board configuration, since GoL is deterministic, so it's a level playing field. I don't know if this is a good benchmark, but at least it's an actual program doing actual stuff, and the implementations across languages turn out to be very similar.
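Not the commenter's code, just a minimal sketch of what such a harness can look like: a deterministic starting board, graphics off, and a fixed number of generations whose result every implementation must match:

```python
from collections import Counter

def gol_step(board):
    # board is a set of live (x, y) cells; standard Conway rules
    neighbours = Counter((x + dx, y + dy)
                         for (x, y) in board
                         for dx in (-1, 0, 1)
                         for dy in (-1, 0, 1)
                         if (dx, dy) != (0, 0))
    return {cell for cell, n in neighbours.items()
            if n == 3 or (n == 2 and cell in board)}

def run(board, generations):
    for _ in range(generations):
        board = gol_step(board)
    return board

# deterministic start: a glider, so every language begins identically
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
# after 4 generations a glider has translated itself by (1, 1)
assert run(glider, 4) == {(x + 1, y + 1) for (x, y) in glider}
```

The final board doubles as a correctness check across languages: if an implementation's board diverges, its timing is meaningless.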
@MikeGaruccio16 сағат бұрын
1:17:50 I’d think the best “real world” benchmarks actually would do things like pulling in OpenSSL, if that’s how someone would typically do it in a real project. The whole point for most people is figuring out how well a language may work for their project, more than how fast the language internals really are.
@ForeverZer016 сағат бұрын
Exactly. "Real-world" benchmarks should compare entire projects, as they are, against others that do the same thing but are written in a different language. For the sake of a benchmark, go ahead and throw out the cost and time of development, but I hate that so many people pretend a micro-benchmark of Fibonacci or a Mandelbrot fractal is in any way informative, unless that is the only thing your application does.
@VFPn96kQT5 сағат бұрын
I would be very surprised if, looking at the compiled languages in Compiler Explorer, we'd see any difference in the generated assembly for these loops.
@ShreyasGaneshs23 сағат бұрын
My goats back at it
@z-a359419 сағат бұрын
13:45 “what you said is kind of literally true”
@p_d3r421 сағат бұрын
I don't know why people measure only time. They should also measure the resources used to achieve that time; the relation between the two would tell you more about performance.
@mattymerr70110 сағат бұрын
The biggest problem I find with benchmarks is that they write the same code for each language rather than idiomatic or optimized code. All the code should be reasonably optimized for each language individually first.
@АлексейСтах-з3н17 сағат бұрын
Why would one ever use integer modulo then? Are there cases where doing it manually is slower?
@UnidimensionalPropheticCatgirl16 сағат бұрын
To mimic the correct behavior of 32-bit integer modulo, you actually need 64-bit floats in a lot of cases, due to float representation; so in environments where 64-bit floats are problematic, you might opt for integer modulo. Similarly, you need bigger floats for 64-bit ints, etc.
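A quick stdlib demonstration of the representation issue (my own example, not from the video): a float32 has a 24-bit significand, so it cannot hold every 32-bit integer exactly, while a 64-bit double can:

```python
import struct

def to_float32(x):
    # round-trip through a 4-byte IEEE-754 float
    return struct.unpack('f', struct.pack('f', x))[0]

n = 2**31 - 1                # INT32_MAX
assert to_float32(n) != n    # rounds up to 2**31.0: precision lost
assert float(n) == n         # Python's float is a 64-bit double: exact

# so a float-based modulo of 32-bit ints is only safe in the wider type
assert float(n) % 97.0 == n % 97
```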
@bhabbott6 сағат бұрын
Good benchmark, told me exactly what I wanted to know. Presentation's not so good though. What's with the bouncy bars?
@skaruts9 сағат бұрын
Casey is right about Lua. I don't know about vanilla Lua, but LuaJIT is indeed slower at the outset. I always run my Lua benchmarks in waves because of that: the first wave or two are almost always slower. It takes a bit for LuaJIT to gather information, do its magic, and actually start accelerating the language.
@dipi712 сағат бұрын
The things compiler devs have to wrestle with nowadays are unreal. Zen 4 and AVX-512 have gotten so complex that it's become impossible to predict what the machine code will look like, and how modern CPUs might sequence it.
@chrisgregory344214 сағат бұрын
The title should be "Imposter Syndrome On-demand"
@halneufmille15 сағат бұрын
I didn't know how slow % was. I just managed to make a function I wrote a few days ago twice as fast by avoiding it.
@SimGuntherКүн бұрын
Why not FPS on a typical game written in these languages as a performance measurement? Too many different factors to just say the language features affect performance. Libraries and bindings will muddy the waters on how much of the performance is owed to the FFI or library linked to the program. What I'd be curious about is dev hours per FPS to see how quickly they'd get a fast program along with a count of days per phase improving performance and by how much in each language.
@IndellableHatesHandles23 сағат бұрын
That would be measuring the performance of the graphics library too.
@SimGunther23 сағат бұрын
@@IndellableHatesHandles I just said that the libraries would muddy the waters in performance benchmarks for these languages. What I'd also be curious about is how much those libraries use lower-level OS-specific APIs, and how much of an impact that has on library performance.
@rohitaug23 сағат бұрын
I think using the standard libraries of these languages to build the same todo app would be a good benchmark, since it's non-trivial but wouldn't take a lot of effort to write (so many languages could be included), and it would show whether the language has the tools necessary to be productive.
@IndellableHatesHandles22 сағат бұрын
@@SimGunther I guess I should wake up before I read KZbin comments. Didn't even bother to read ahead for some reason. My bad.
@ZedDevStuff22 сағат бұрын
Because it's pretty useless. Games are rendered by the GPU and most games use bindings for whatever graphics API is needed. Unless you're running the game logic in the same thread as everything else
@ravenecho241011 сағат бұрын
always fun to see prime with casey
@romanstingler43523 сағат бұрын
@Prime please submit a PR to LLVM :P
@sebastianmocanu6399Күн бұрын
I watched this live; can't wait to post a link to this every time I see a garbage comparison on LinkedIn
@thedude731919 сағат бұрын
As a non-programmer: aren't coding languages geared toward specific tasks they excel at? Does a simplistic test take this all into account, or is it a basic averaged-result test without any controls?
16 сағат бұрын
We need to make the term: "the performance bro" a thing
@UnknownUser-mj8rg21 сағат бұрын
Can someone explain the j%u modulo thing? I don't get how it's a constant
@chainingsolid15 сағат бұрын
It's not that "j%u" is a constant; it's that the value generated by the inner loop (the sum of all of its iterations) is the same for every single iteration of the outer loop. So hoist it out, run it once, and cache the value in a register.
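A scaled-down sketch of that hoist (my own illustration, with smaller loop counts than the benchmark's 10000 × 100000):

```python
def naive(u, r, outer=100, inner=1000):
    # shape of the benchmark: every a[i] redoes identical inner-loop work
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):
            a[i] += j % u
        a[i] += r
    return a

def hoisted(u, r, outer=100, inner=1000):
    # the inner loop depends only on u, never on i, so run it once
    inner_sum = sum(j % u for j in range(inner))
    return [inner_sum + r] * outer

assert naive(7, 3) == hoisted(7, 3)
```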
@hukasu20 сағат бұрын
Lua 5.3 or 5.4 introduced integers, so this program as-is would probably run with integers
@majam1n10 сағат бұрын
Nested for loops are not a proxy for how fast a language is, because computational speed is not the only metric you need to take into account when selecting a tool for a job. I would love to see a comparison across multiple domains: live servers, I/O, concurrency, etc. These languages truly specialize in specific domains. But also, why not take programming ergonomics into account??! i.e. Rust vs. JS? It matters!
@mateuszormianek605422 сағат бұрын
Hi, kinda new here. I tried to run the faster version of the code on my machine (MacBook M1 Pro) and I got 7.5 seconds (benchmarking via time or hyperfine). What's wrong with my machine, or what could I be doing wrong when compiling? (compiled via clang path -o program)
couldn't the program just use volatile int for the loops to prevent compiler optimization?
@perguto21 сағат бұрын
You can just write down the result of the inner loop in closed form: a[i] = (100000/u) * u*(u-1)/2 + d*(d-1)/2, where d = 100000 % u and 100000/u is integer division
@onurdemirhan23 сағат бұрын
27:20 Using notepad++ ???
@dj_jiffy_pop23 сағат бұрын
I don't know how old most of you are... but back in MY day, a billion iterations of ANYTHING was UNHEARD OF...
@izd422 сағат бұрын
in 1959, a Francois Genuys calculated 16000 digits of pi over 4.3 hours on a computer that could handle 12000 additions per second. A very naïve guess says his program executed (4.3h)(60m/h)(60s/m)(12000c/s) ≈ 186 million instructions
@minnesotasteve22 сағат бұрын
We benchmarked counting the days
@dj_jiffy_pop22 сағат бұрын
@@izd4 He was probably using Forth... all bets are off, if that's the case...
@nisonatic17 сағат бұрын
You should be old enough to have met the ancient historian Gooney the Elder who saw, in a dream, Chookus Norrus do a thousand thousand thousand iterations on an abacus to calculate his wine tab.
@Sean-fh9nj18 минут бұрын
I want to learn a new programming language, can someone please tell me which language can find prime factors the fastest?
@blarghblargh10 сағат бұрын
yes, more Prime and Muratori. this shit is my jam
@hamidoyempemi2720 сағат бұрын
The benchmark becomes trash when your favorite language is at the bottom of the list 😂😂😂
@slyracoon236 сағат бұрын
It's called loop unrolling. The compiler unfortunately will not unroll 100000 iterations, but if you did fully unroll it, it would compile down to a constant
@platin214821 сағат бұрын
Was this odin with bounds checks or without?
@blarghblargh9 сағат бұрын
1:23:19
@shanahjrsuping734420 сағат бұрын
24:52 I don't understand why they say j modulo u is a constant. If my input number is 5 and j passes 5, doesn't it stop being a constant at that point?
@JJOULK18 сағат бұрын
From what I understood, the whole inner loop becomes a constant after its first completion. Since every a[i] begins at zero, u never changes and neither do the loop parameters, so the inner loop always computes the same result (and so does the outer one). They're not compile-time constants, but one can deduce that, during the runtime of the program, the value effectively becomes a constant. (Though I would say it isn't as "trivial" for a compiler to spot.)
@bearwolffish7 сағат бұрын
lol @ that twitch chat comment "ocaml won by being off the charts"
@chrisspellman595223 сағат бұрын
Just starting the video, but... real question: why is PowerShell 7 never listed in any of these things? Not just this chart but so many others. You'll always see Python, sometimes JavaScript shows up, but never PS7. What kind of bias is this? I guess it really does confirm that the comparison charts are junk.
@ZedDevStuff22 сағат бұрын
No idea about the performance but isn't PowerShell just .NET? If there isn't much performance difference then C# being there counts as enough effort
@hamzakhiar3636Сағат бұрын
the man is using Notepad++. What an OG
@herrquh4 сағат бұрын
for loop as a service is the final form of microservices
@VirtualShaft22 сағат бұрын
Well, that's why you look at backend or frontend framework benchmarks instead of just pure language benchmarks.
@kafran21 сағат бұрын
Whoever did this is a genius. They got the entire tech community around the globe to discuss it. 😹