Whenever I see an upload with Prime and Casey, the next 2 hours is gone. Everything else can wait.
@frischifrisch6860 · 1 day ago
jup 😂
@CobaiaHardcore · 16 hours ago
Casey sounds better at 2x speed!
@blarghblargh · 10 hours ago
@@CobaiaHardcore all talking sounds better at 2x speed. maybe it's the coffee
@EightNineOne · 21 hours ago
I remember when the 1 billion loops one did the rounds on social media. I janked together a similar example locally between ruby and python. Both were slow, but ruby won. Then I switched from ruby 3.3.5 to truffle ruby and it went from ~30s to ~2s. What did I learn from this? Nothing. Absolutely nothing.
@mattymerr701 · 10 hours ago
I did an experiment between python and C++ where I switched from CPython to PyPy and it went faster than the C++ code. Also nothing to learn there.
@madpuppet666 · 9 hours ago
You learned to use TruffleRuby if you want performance. Sometimes it's not the language, it's the interpreter or compiler that matters.
@uberboat4512 · 7 hours ago
No. It's only the interpreter/compiler that matters. Languages themselves don't determine the speed.
@user-ni2od5lu6j · 5 hours ago
@@uberboat4512 Most limitations on compilers are set by the language standard, i.e. compilers are prohibited from doing some optimizations. That's why undefined behavior and strict aliasing appeared in the C/C++ standards: to give compilers some room to optimize code. Judging by Rust's speed, it seems that was the wrong road. Basically, the first example they discussed is a fail for all compilers and languages, because the final code shouldn't have any loops at all; they can all be optimized away. Just a simple sum without loops: "(int)(((u-1)*(i128)u*num_u + (numpercu + 1) * numpercu)>>1) + rand()%10000"
@OnFireByte · 4 hours ago
Language matters quite a bit, or at least its design and spec do. A static type system is needed to pack data efficiently (and vectorize it); even if the JIT is capable of monomorphization, it still has to do runtime checks for that. Inlining can't be used if polymorphism is based on dynamic dispatch; you need static dispatch for that (like impl vs dyn in Rust). Undefined behavior in the C spec, even if it's generally a bad thing, gives the compiler room to do more aggressive optimization. There are way more examples, but you get the idea.
@pikolopikolic5567 · 20 hours ago
I cannot get over how amazing these streams with Casey are. The way in which he explains things is so clear, and the chemistry you two have on stream is amazing.
@matt_milack · 1 day ago
Can someone please explain to me why people don't count Bash and PowerShell as programming languages? We Sysadmins are humans also. Edit: I beg people who don't know what Bash, PowerShell and Sysadmin are not to reply to this comment. Thank you!
@guywithknife · 1 day ago
Because… it doesn’t count when… reasons.
@AntranigVartanian · 1 day ago
Because people don’t know that POSIX Shell, PowerShell, Bash, AWK and the rest are proper programming languages.
@guywithknife · 1 day ago
Seriously though, they obviously are and anyone that says otherwise is ridiculous.
@tonyhart2744 · 1 day ago
"We Sysadmins are humans", because we replaced sysadmins with AWS years ago
@matt_milack · 1 day ago
@@guywithknife I've literally never stumbled upon any programming language tier list mentioning Bash and PowerShell.
@etgaming6063 · 22 hours ago
God I love these videos with Cacey, sitting down in the morning with coffee listening to these two for an hour just motivates me and relaxes me at the same time.
@belmintuzlic219 · 20 hours ago
First I thought this would be some simple explanation. I enjoyed, covered and absorbed the first half of the video, even the assembly part... And then when double was introduced I started to lose myself... After that came words I had never heard before, my knees felt weak, my neck started pulsating from blood flow. I felt stupid again, and I admire your knowledge! Great video 10/10
@VivekYadav-ds8oz · 1 day ago
28:40 YouTube won't let me post links, but it seems that in Rust this is caught, _if_ you write it a bit idiomatically. If the a[i] access happens via an .iter_mut() loop instead ( `for arr_i in a.iter_mut() { .. }` instead of `for i in 0..100_000 {..}` ), then the compiler catches that the inner loop is entirely pure and computes it beforehand!
@hard.nurtai4209 · 23 hours ago
wow. thanks for sharing that. compiler writers really are geniuses
@SimGunther · 23 hours ago
Interesting that we'd have to have a "loop go fast" option. I would hope that the compiler just "knows" to optimize it into a classic for loop even with the iterator pattern in the source, but what would I know about "zero cost abstractions" 😂
@arseniy2943 · 22 hours ago
@@SimGunther I'm not sure I understand what you're saying? Rust optimizes the iterator pattern better than it can optimize a classic for loop, so iterators are actually a net positive abstraction because they let the compiler better analyze how the pieces of code interact with each other
@remrevo3944 · 22 hours ago
Clippy even warns about this: "the loop variable `i` is only used to index `a`, consider using an iterator: ``, `&mut a`"
@SaHaRaSquad · 21 hours ago
@@SimGunther iter_mut() is that abstraction. The reason this can't be done with the classic for loop syntax is that it makes the lack of side effects much harder (if not impossible) to detect. I'm not sure you know what zero-cost abstraction even refers to.
@Sofia-rh7ji · 14 hours ago
I think the real question here is "why are C, Rust, and Zig different at all?" If you're compiling with clang, then every single one is just an LLVM frontend, and for for-loop iterations there's more than likely going to be exactly zero difference in the binary output, except for whatever timer they're using and the calling convention of whatever functions there are.
@paua7742 · 12 hours ago
Because the time measurement was also inaccurate lmao
@my_online_logs · 8 hours ago
Because they provide different information to the compiler, so the optimization is also different. For example, Rust provides richer semantic information to LLVM than C and C++ do, so LLVM can do optimizations for Rust that it can't do for C and C++. But for now LLVM can't fully utilize the rich semantic information provided (that's still in development), so the difference between Rust and C/C++ is not much: the same or slightly better. It may be some time before LLVM can take full advantage of it.
@MatthisDayer · 3 hours ago
Zig is ever so slightly faster because by default it targets your machine, making use of whatever features your cpu has. Also functions by default don't follow a strict calling convention, this saves some moving around of registers when calling a function.
@remrevo3944 · 23 hours ago
25:55 Looking at this, the outer loop could definitely be removed, because every element of `a` is calculated to be the same value. And the inner loop *could* be calculated in constant time using some math. (This is even the kind of optimization that could be added to a compiler backend like LLVM.) Edit: Because just *saying* that the inner loop could be calculated in O(1) wasn't satisfying enough to me, I just went ahead and implemented it. v1 is the test implementation from the program and v2 is the optimized version, without having to do a loop:

fn v1(u: u64) -> u64 {
    let mut o = 0;
    for j in 0..100_000 {
        o += j % u
    }
    o
}

fn v2(u: u64) -> u64 {
    let d = (100_000 - 1) / u;
    let m = (100_000 - 1) % u;
    d * gauss(u - 1) + gauss(m)
}

/// Calculate the sum of 1, 2, 3, .. n
fn gauss(n: u64) -> u64 {
    n * (n + 1) / 2
}
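For anyone who wants to sanity-check that closed form outside Rust, here is a minimal Python sketch of the same v1/v2 pair (the function names mirror the comment; the port itself is mine):

```python
def v1(u: int, n: int = 100_000) -> int:
    # naive version: sum j % u over the whole range
    return sum(j % u for j in range(n))

def gauss(n: int) -> int:
    # sum of 0, 1, 2, .. n
    return n * (n + 1) // 2

def v2(u: int, n: int = 100_000) -> int:
    # closed form: d complete cycles of 0..u-1, then a partial run 0..m
    d, m = divmod(n - 1, u)
    return d * gauss(u - 1) + gauss(m)

# spot-check that the two versions agree, including edge cases around n
for u in (1, 2, 3, 7, 99_999, 100_000, 123_456):
    assert v1(u) == v2(u), u
```

The invariant is that the range 0..n-1 splits into d complete residue cycles plus one partial run of m+1 terms, which is why the two Gauss sums cover everything.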
@nbjornestol · 20 hours ago
Yeah the whole program can actually be simplified down to `print(9999900000 % u + r)`
@remrevo3944 · 20 hours ago
@@nbjornestol Where did you get that number from? Because that's not how modulo works.
@dexterman6361 · 10 hours ago
Wait, what am I missing here? How come a[i] = a[i] + j % u is always the same number?
@remrevo3944 · 10 hours ago
@@dexterman6361 The outer loop sets every element of the array to the same number. (And then reads out *one* of them) I assume that's what you mean. The inner loop can be calculated with some *math* (which I put in my original comment).
@nbjornestol · 9 hours ago
@@remrevo3944 Edit: Nvm you're 100% right, I was thinking too much in moduli, the sum isn't done modulo u, only each number.
@filipg4 · 1 day ago
Profiling and measuring code execution in general is such a rare skill these days. You are more likely to find a sane Rust programmer, than a programmer who knows how to profile their code. Needle in a haystack.
@nicholasredmon9851 · 1 day ago
It's so weird too. Intel VTune is free and works quite well, has a nice GUI, etc...
@evaldssontom · 1 day ago
@@nicholasredmon9851 Never tried VTune, how does it compare to the gprof CLI?
@Vitis-n2v · 23 hours ago
Most people just want to write code that does what they need it to do and don't care about speed. If they do any measurement, it's usually a Date.now() before and after the code section to calculate the approximate amount of time the code takes, and that's where their debugging and profiling ends. The number of programmers that use only print statements to debug is way higher than it should be.
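The before/after pattern described in that comment looks like this in Python (time.perf_counter standing in for JS's Date.now; the workload is a made-up stand-in):

```python
import time

def busy_work(n: int = 100_000) -> int:
    # stand-in workload: the benchmark's inner loop shape
    return sum(j % 7 for j in range(n))

start = time.perf_counter()
result = busy_work()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"busy_work() -> {result} in {elapsed_ms:.2f} ms")
```

A single sample like this includes whatever else the machine happened to be doing at that moment, which is why repeated runs (e.g. timeit) or a real profiler give far more trustworthy numbers.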
@evaldssontom · 23 hours ago
@ Well, sometimes adding a few print statements to an app with HMR is a lot faster than starting a debugger. Sometimes a debugger also changes execution of the program if it is async.
@alekseyburrovets4747 · 21 hours ago
Are you kidding me? Dude, there is a program called 'gprof'. Try it some time.
@eleven5707 · 22 hours ago
For Julia to be that low, they probably didn't even put the code inside a function, showing that there wasn't even an attempt to write the loops in the most performant way in each language
@eleven5707 · 22 hours ago
just checked the code, they didn't use a function, right on the money lmao
@ItahangLimbu · 22 hours ago
@@eleven5707 I was also guessing and was correct
@BIGAPEGANGLEADER · 21 hours ago
Guarantee the R implementation is similarly fucking stupid, has unvectorised operations and ultimately executes functionality for which loops just wouldn't be used
@sjuns5159 · 20 hours ago
@@BIGAPEGANGLEADER I mean, I *would* argue that the fact that you have to vectorize everything in R to get anything performant is annoying, plus not everything can even be vectorized. But yeah, you also shouldn't pretend vectorization doesn't exist and isn't at least okayish. In Julia loops are fast (but you do have to put them in a function, and there are definitely more non-obvious things you have to know to make code performant)
@abtesk · 19 hours ago
Couldn't agree more.
@Kiyuja · 20 hours ago
seeing Casey on the channel always warms my heart. I love this guy's talks
@bigmiraclewhips · 22 hours ago
Casey is the goat. Always learn so much when listening to him.
@Ny_babs · 15 hours ago
We need to protect Casey at all costs.
@BelacDarkstorm · 21 hours ago
For those curious about why Go's performance is so bad: the Go code uses int, which defaults to 64-bit integers; basically every other language is using 32-bit integers. Weirdly, it seems that on Apple Silicon switching to 32-bit integers reduces performance. On x64, when you switch to 32-bit integers it runs about as fast as C.
@MyAmazingUsername · 20 hours ago
This is right but wrong. The int, uint, and uintptr types are usually 32 bits wide on 32-bit systems and 64 bits wide on 64-bit systems. Go has these integer types: int, int8, int16, int32, int64, uint, uint8, uint16, uint32, uint64, uintptr, and byte (an alias for uint8).
@MyAmazingUsername · 20 hours ago
It is automatically based on the target architecture. It also has specific types like uint32 for when you care about size.
@travistarp7466 · 18 hours ago
I knew there was no way Go is almost the same speed as Node. Go is going to get you similar speeds to other compiled languages like C and Rust if your application is not memory heavy.
@retropaganda8442 · 17 hours ago
32 or 64 bits integers have the same speed until you have to transfer a huge amount of them to/from RAM
@retropaganda8442 · 17 hours ago
Go programs also tend to consume a second CPU to clean up all the garbage the language lets programmers create.
@theferaltaint5065 · 21 hours ago
A lot of people also don't write Go correctly. If you fill your program with interfaces containing numerous methods and shit when there's no clear benefit, you will slow your program down. Not understanding how to leverage the compiler and lower GC load further slows it down. Honestly, the more simply you write the code, the better it performs. Aside from annoying things like no generics on methods without explicit type casting, no optional function parameters without allocating slices via variadic arguments, and a couple of other things, Go is by far one of the best languages out there. People hate the verbose error handling. Just build your own error system. You don't have to return the std library error type. It's your code. Build what you want.
@dunebuggy1292 · 17 hours ago
You can literally abstract the error handling to be simple. This is such a junior level complaint.
@RomvnlyPlays · 15 hours ago
@@dunebuggy1292 Just because you can get around something doesn't mean it's not a pothole in the road. Nice try
@dunebuggy1292 · 15 hours ago
@@RomvnlyPlays No, it's a primitive that you would otherwise abstract upon replication/redundancy, like you would any primitive. That's the whole point of programming. Go makes errors as expressions for this very reason. Anyone who is a competent programmer should be able to put one and two together and abstract for the problem set.
@mattymattffs · 11 hours ago
@@dunebuggy1292 I think the argument is more that you shouldn't have to do that
@inertia_dagger · 7 hours ago
> no optionals could be sidestepped with pointers or builder pattern, perhaps
@PhilfreezeCH · 18 hours ago
25:30 Not only does the sum of j%u always produce the same number, I am pretty sure modulo as used by most languages (if not all of them?) is distributive so you can effectively compute the sum of j and then just use modulo u once, completely solving the inner loop in constant time.
@vinaydeshpande862 · 16 hours ago
Yes, but you would have to use 64-bit integer for accumulation.
@simivb · 16 hours ago
Maybe the j%u const replacement doesn't work because you add it to the array value, which is an int, for which overflow is undefined behaviour (at least in C++ in prior versions, maybe here too), and it isn't allowed to optimize because it doesn't know a priori the result of a possibly occurring overflow.
@simivb · 16 hours ago
Wait, no, j%u can't overflow the int I think.
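For what it's worth, the hoisting works because u never changes between iterations, not because % distributes over sums; a quick check with arbitrary small numbers:

```python
# modulo does not distribute over addition:
a, b, c = 2, 2, 3
assert (a % c) + (b % c) == 4      # 2 + 2
assert (a + b) % c == 1            # 4 % 3

# so summing j % u is generally not the same as taking sum(j) % u
u, n = 3, 5
assert sum(j % u for j in range(n)) == 4   # 0+1+2+0+1
assert sum(range(n)) % u == 1              # 10 % 3
```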
@mage3690 · 22 hours ago
If I had to guess, the `a[i] += r;` line in the C code is there to make sure the compiler doesn't vectorize the inner loop, as is the int32_t data type. Basically, the guy was trying for readable assembly output in Godbolt, which is a weird thing to optimize for. I bet the JS guy didn't do that. Anyways, gcc has a Fortran compiler. IDK if that's where the Fortran team went or what, but if a language has its own compiler, there's a pretty good chance it's in gcc.
@marco_foco · 14 hours ago
I am the fastest language: 1) all the numbers are the same, so just run once and print that value; 2) even with only the inner loop, you can compute it in closed form without a loop: if u < 100000, print (100000/u)*u*(u-1)/2 + (100000%u)*(100000%u - 1)/2 + r; // all calculated using integer arithmetic; if u >= 100000, print 4999950000 + r. I got this result around minute 24:00, where Prime mentioned the constness of the result of the inner loop, while 1) wasn't mentioned yet.
@amafi_poe · 19 hours ago
Regarding the stuff around 1:19 or so: doing the work of disassembling and looking at the underlying instructions to investigate loops clearly takes more expertise and willingness to do work than the guy who posted the original benchmark is capable of, if you read how he replies to people. I can't tell if he's actually trying to do a thing or if he's just farming engagement, but either way I wouldn't expect him to do anything useful
@cryptyyy_7667 · 14 hours ago
Take a cat, a dog, an ant, an elephant and a shark and grade them all at how well they swim and declare the best swimmer as "the best animal ever". Obviously this doesn't work...
@HamishArb · 18 hours ago
27:14 Since it only reads a[r], the compiler could theoretically also optimise out the computation for all the other values of i, making the whole thing O(1) in combination with optimising the sum over j values.
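Putting the two observations together (the inner sum is a constant, and only a[r] is ever read), the whole benchmark collapses to one expression. A Python sketch of the idea, with loop sizes shrunk so the naive version finishes quickly (the real benchmark uses 10,000 by 100,000):

```python
def naive(u: int, r: int, outer: int = 100, inner: int = 1_000) -> int:
    # the benchmark's shape as discussed: fill a[], then read one slot
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):
            a[i] += j % u
        a[i] += r
    return a[r]

def collapsed(u: int, r: int, inner: int = 1_000) -> int:
    # inner sum in closed form (d complete 0..u-1 cycles plus a partial run),
    # and only a[r] matters, so no arrays or loops at all
    d, m = divmod(inner - 1, u)
    inner_sum = d * (u - 1) * u // 2 + m * (m + 1) // 2
    return inner_sum + r

for u in (1, 2, 7, 999, 1_000, 5_000):
    assert naive(u, 3) == collapsed(u, 3)
```

Here `naive` only mirrors the benchmark's shape as described in the video; `collapsed` is what a sufficiently clever compiler could in principle reduce it to.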
@my_online_logs · 16 hours ago
create cache friendly code so that the cpu will not recompute if the computation is the same, but instead directly fetch the result from the cache
@sereysothe.a · 21 hours ago
the R code is atrociously written for anyone who is actually familiar with the language. two great things about R is that you can treat it like a functional language, and you can outsource expensive subroutines to C implementations
@BIGAPEGANGLEADER · 21 hours ago
Knew this would be the case when I looked at the speed. Not to mention if we take a step even further back, I wouldn't be surprised if a loop isn't required to execute whatever task their loop implementation is doing
@SaHaRaSquad · 20 hours ago
So the R code is bad, the PHP code is unnecessarily slow, the C code prevents compiler optimizations, the Julia code is much slower than it could be... Or in other words that benchmark is so bad the only way to make it worse is to literally just fill a table with random numbers.
@Houshalter · 20 hours ago
You can use R for ages without ever having to write a for loop.
@sereysothe.a · 19 hours ago
@@Houshalter I've written an entire R package without a single for-loop in it
@MrAshtordek · 18 hours ago
Wow, that is so bad... The most idiomatic way to write that in R would probably be preallocating the entire "j" array, then calculating the entire "a" array using a map, with the inner function doing a vectorized modulo of "j" by "u" followed by sum and adding "r"... (assuming that at this point it isn't painfully obvious that you can make the optimization they are talking about and just precompute "sum(j %% u)". 1) Theirs 2) "a
@marcocaspers3136 · 1 day ago
I think where it falls down is: what do you define as the performance of a language? Can you actually measure the performance of a language? Or do you measure the performance of the runtime, or the compiler, or more precisely the capability of the compiler to efficiently compile to machine code?

Also, are you actually measuring the process that is executing, or are you measuring the whole operating system and everything it is doing? That means: what if anti-virus kicks in at some point, or "disk clean-up", or any other "idle" process, because there isn't actually much happening on the other 23 cores of the CPU? Or the CPU is running hot at some point, so it's throttling because a previous test built up heat that has not yet dissipated?

That already assumes these tests were all performed on the exact same system under the exact same environmental conditions, because ambient temperature influences the temperature of the CPU, which can influence the performance of the CPU. In fact, I would go as far as to say that there are far too many "moving" parts for you to ever measure this accurately.
@harier64 · 23 hours ago
What's the difference between the compiler and the language, when the end result is what the compiler produces? The language is just your steering wheel for the compiler at all times anyway
@marcocaspers3136 · 23 hours ago
@@harier64 Take C: you have GCC, Clang, and a host of other compilers. The language is separate from the compiler, which is my point. You are testing the capabilities of the compiler, not the language. There are even differences between versions of compilers: even if the language doesn't change, the output of compiler version 1 and compiler version 2 can be different for the exact same code.
@JeSuisUnKikoolol · 22 hours ago
"The performance of a language" doesn't make any sense as you point out, the only things you can measure are its implementation(s) so compilers/interpreters. When it comes to external factors then yea it can be really hard to measure and be sure that your results are not because of some factors you didn't control for. As an example the position of your code in memory can have a big effect so if you add/remove unused code it could shift things around and make your program faster/slower. I saw a paper a while ago and the authors made a framework to do benchmarks while trying to take into account a lot of these external factors. Some of the factors I remember were restarting the computer before the benchmark, having the network card disabled, disabling some daemons/services, running the program multiple times but with different stack positions/alignments, shorter/longer environment variables, different working directories. I don't think it's impossible to properly measure because of too many moving parts, it's just hard but at the end of the day the more variable you take into account the better. Most of these factors are relatively small so you don't need to control for all of them to have a realistic picture of what's actually going on. There is also the fact that some of these factors are independent from each others and will "cancel out" their effects so in average you'll get a similar performance regardless of the environment (not talking about different hardware here obviously)
@SaHaRaSquad · 21 hours ago
Even a longer username can slow down a program because it makes the PATH environment variable longer which again can shift the address space of the program and make it less cache efficient. And some CPUs may not be (yet) supported well in the operating system or lack a microcode update and so on. There are simply too many variables for a single benchmark to ever be relevant by itself.
@my_online_logs · 16 hours ago
The language makes the compiler output different: if the language can provide rich semantic information to the compiler, the compiler can do more optimization (assuming the compiler is able to fully utilize the semantic information provided)
@owlcaps7876 · 15 hours ago
Isn't it dividing the loop into four parts because it's checking i < n, and the processor can do an addition every 0.25 cycles, so it can do four at a time and then check them all at once, running the results of those four after? IDK but entertained
@JJOULK · 18 hours ago
Very interesting discussion. Regarding "comparing languages on real workloads" with, for example, HTTP servers: there is a channel (Anton Putra) that does exactly that, for example comparing Go to Rust. He usually uses frameworks (understandably), but does compare decent server performance metrics. All in all it's insightful, but again, it should be taken with care due to the complexity of framing and measuring.
@my_online_logs · 16 hours ago
Agreed. Comparing languages with a lightweight load will not show the difference between the languages, because all they do is the lightweight load. It's just like comparing a math olympian against a common person on a lightweight question such as 1 + 1 = ...
@manyids2 · 19 hours ago
the modulus loop can be expressed in closed form [ since mod(a,c) + mod(b,c) = mod(a+b,c) ], so basically mod(100k * (100k+1)/2, u)?
@disquettepoppy · 12 hours ago
[mod(a,c) + mod(b,c) = mod(a+b,c)] - this is untrue (you probably conflated being congruent modulo c with the operator itself, or something), but some of the other comments show the correct closed form
@manyids2 · 8 hours ago
@@disquettepoppy ok, of course... untrue by inspection, as there exist a,b,c (a=b=c-1) such that mod(a,c) + mod(b,c) = 2c - 2, while mod(a+b,c) < c
@ferdynandkiepski5026 · 23 hours ago
-march=native will generate instructions for the native CPU. If the CPU that the code gets distributed to doesn't support the extensions the host CPU does, the executable will crash with SIGILL, which means illegal instruction. And even if it does support them, the result still might not be optimal. Ideally you want to compile for the target machine architecture, for example zen5. If you don't know it, you need different versions of the executable to support the major extensions that matter for perf, like AVX-512, but then you rely on the user to choose the correct one. The only seamless solution is runtime detection, which then dispatches the proper version of the code. This is done for high-performance libraries like simdjson, though it adds complexity.
@izd4 · 22 hours ago
march=native is SUCH a funny flag. It's so architecture-specific that I can't imagine anyone but a gentoo user trying it
@romanstingler435 · 22 hours ago
@@izd4 not a gentoo user but I guess that most Arch users if they compile something from the AUR are also using march=native
@alekseyburrovets4747 · 21 hours ago
Please drop a bash oneliner here in order to robustly detect march of the precompiled binary (produced by gcc) that will definitely run at the target machine. Please keep in mind that it should work with a virtual environments (with cut-off instruction set) too. Thank you so much.
@LtdJorge · 21 hours ago
@@izd4 yes, it's mostly for testing. I'm on Gentoo and I use -march=znver2
@romanstingler435 · 20 hours ago
@@alekseyburrovets4747
objdump -d /usr/bin/fish > fish_disassembly.txt
grep -i 'avx' fish_disassembly.txt
Swap in whatever instruction you are looking for.
@perguto · 21 hours ago
Prime & Casey Show let's go!!!!
@hunterap23 · 17 hours ago
If you look at all the code that the original author submitted, it is all written by someone who is very new to programming. They didn't write anything that made sense in any of the original languages' code
@bkr_418 · 23 hours ago
YES, I was hoping you’d do a video on that!!
@Windeycastle · 23 hours ago
Guess I'll still have to know how my computer works, and how to write efficient algorithms. Which is exciting!
@madpuppet666 · 9 hours ago
I think it's interesting just to know what overhead comes with a language: what memory tracking, like ref counting or garbage collection, is enforced and can't be circumvented, and what kind of memory access is mediated. I wouldn't ever bother comparing interpreted languages to compiled languages, but these smaller performance tests are pretty useful for comparing interpreted languages, since virtual machines are not equal, whereas languages sitting on LLVM are fairly equal.
@Kae____ · 22 hours ago
Question: would calculating an arbitrary number of digits of pi and measuring the amount of time each program takes across multiple runs be a better micro benchmark? I still don't think it would be productive or useful for any real life application (apart from calculating pi…), but would that be better for whatever purpose the author of that benchmark wanted?
@LtdJorge · 21 hours ago
You get into allocation territory. The bigger the number, the more allocations you need. Unless you want to limit it to u64, which will be done in seconds.
@Windeycastle · 16 hours ago
@@Kae____ Might be a good one to try, but are you benchmarking the language or the proficiency of the programmer?
@muhammedkadirtan3469 · 19 hours ago
my man beating the compilers, let's goo!
@itsnumpty · 1 day ago
It's so weird. The speed of a loop determines a language's capabilities?
@User948Z7Z-w7n · 1 day ago
You are right. Where are the if-else performance metrics?
@Jack-b4s3g · 23 hours ago
If you need fast brute force loops then yes.
@ForeverZer0 · 16 hours ago
Typically no, but for Python, yes.
@AlexMax2742 · 17 hours ago
To my recollection, Zig doesn't actually have a decent native benchmarking framework that's cross-platform. I think the most popular one is Linux specific and leans on performance counters. I actually have been working on porting nanobench from C++ to Zig just so I could have some numbers I could trust for some code I was testing, and in my experience it's about even with or slightly slower than C. However, in some cases Zig is actually significantly slower because the language doesn't give you subclassing or interfaces, which means that you have to roll those sorts of things yourself. When Zig developers roll their own, I find they tend to gravitate towards v-table driven design instead of duck typing or interfaces, which is naturally going to be slower due to the extra indirections inherent to pointers. Plus there's also the fact that Zig is still a work in progress so naturally it will implement some things suboptimally. Don't get me wrong, Zig is still a great language, and the up-and-comer I'm most excited about by a long shot. Just take performance numbers with an enormous grain of salt.
@bzboii · 11 hours ago
One thing that's interesting is that they talk about the CPU like it's some god-given inherent law of nature. But remember, they created the CPU, and these adder units etc. exist BECAUSE they lead to fast execution of the code they run. Like at 50:00: CPUs have spare adders and jump capacity BECAUSE it makes for-loops fast, BECAUSE CPUs run for-loops a lot. CPUs co-developed with code, and they're designed to make it fast and vice versa
@MarkAlterBridge · 1 day ago
@24:22 Could someone explain a bit more in-depth why it would always be the same number?
@sergey1519 · 23 hours ago
Actually that's not the optimisation they are talking about. They are saying you can replace the nested loop with:

temp = 0;
for (int j = 0; j < 100000; j++) {
    temp = temp + j%u;
}
for (int i = 0; i < 10000; i++) {
    a[i] = temp;
    a[i] += r;
}
@smx75 · 23 hours ago
you can rewrite it as:

int tmp = 0;
for (j = 0; j < 100000; ++j) {
    tmp = tmp + j%u;
}
for (i = 0; i < 100000; ++i) {
    a[i] = a[i] + tmp; // "a[i] = tmp" is also valid
}
@TheArrowedKnee · 23 hours ago
@@sergey1519 I was a little bit confused when it was streamed, but what I think they mean is that the inner loop will always end up being the same number, because u is effectively a constant. So there's no need to compute that inner loop over and over again; you can just factor it out and use the result in what was previously the outer loop.
@equinox4467 · 22 hours ago
After the first iteration of the inner loop, you've basically calculated the values that will be used for every iteration of the outer loop. You'd want the result of the modulo to change on every iteration of the outer loop.
@_me_steven · 22 hours ago
They're saying that the value created by the inner loop, after summing every value, is going to be the same because the variables affecting it are never changed. That means the value created by the inner loop is always the same. So when the outer loop is asked to run the sum 10_000 times, they may as well calculate the value once, before the loop is called, and then just reuse it 10_000 times
@nekocat34 · 20 hours ago
I was wondering what the code they wrote for Lua looked like, and it's worse than I expected. Instead of initializing each entry of the table normally by doing a[i] = 0 before the inner loop, they decided to write this: a[i] = (a[i] or 0) + j % u. Writing a[i] = 0, or just using r as the value to initialize before the loop, is about 10% to 15% faster on my machine. Or you could just register a C function for such a task...
@newrind · 1 hour ago
Too much speculation. Just ask the guy who did it.
@wewillrockyou1986 · 1 day ago
Hell yeah another long video with Casey
@michaellatta · 22 hours ago
We just did a Rust vs Elixir comparison. We ended up finding Elixir slightly faster. But it took a lot of work to optimize for the Erlang VM architecture. Given the huge value of OTP, we would prefer Elixir even if it were a bit slower. The Rust version took four months and the Elixir one, one month.
@JariusJenkins · 20 hours ago
My question for this 'benchmark' initially was "why force all languages to follow a model that might not fit the language in question?" Elixir's going to be recursing along, and is known (at least the BEAM is) for easy parallelism and concurrency. Why not try that? A couple of minutes later:

Name          ips    average     deviation    median       99th %
HandWritten   4.64   215.44 ms   ±2.40%       215.21 ms    234.31 ms
FlowBased     4.56   219.53 ms   ±2.48%       218.48 ms    238.11 ms

Just tweaking the outer loop to chunk based on the number of cores, and we get the fastest time on that chart. It just shows exactly what they were saying in the video: you can compare, but you need to be comparing strength vs strength, not some mediocre middle ground. The Elixir parallel code is like 6 lines longer and adds almost no complexity, so I feel it's a pretty fair comparison. In fact, if you use Flow, it only adds an extra line of actual code, and simplifies the "outer_loop" code. I'm sure you made the right choice for your use case, especially with the wins on dev time. And, if done right, you'll really not be losing too much on performance (though you knew that already).
@my_online_logs16 сағат бұрын
Without showing the code, your comparison is meaningless, and without showing how high the load in the test was, the real difference under a production workload won't show. A common mistake Rust newbies make is benchmarking the debug build rather than the optimized release, and doing lots of clones because they don't know how to borrow or share ownership. Rust can also be optimized further with settings in Cargo.toml, and I bet you didn't do that.
@michaellatta16 сағат бұрын
@ in fact I did use a release build. The rust code is about 20k lines solving a real problem, not a benchmark. I have no need to try to convince anyone since every problem stresses languages and runtimes differently. My only point would be that for all people deciding between languages to do a real test that addresses the main risk areas of your problem. It is worth spending a month or two to get that right for a specific problem.
@my_online_logs16 сағат бұрын
@michaellatta Yeah, it could be that you use clone all over the place, which reallocates everywhere, because it's the easiest way to get rid of the value-moved error rather than using borrowing or shared ownership. Idiomatic Rust is far more performant than idiomatic Elixir: no garbage collector, and rich semantic information for the compiler to optimize with.
@michaellatta15 сағат бұрын
@ I certainly expected Rust to be faster. I do clone some small values when needed, for things like dictionary keys, but no clones of large structs. Given that my use case is managing a large number of maps built from a much larger set of JSON data, it is possible the Elixir map implementation (presumably in C) is dominating. I have multiple smaller benchmarks built to ease into the comparison, to see if we could eliminate Elixir right away. So I have a pretty good idea about what is going on. While I would not claim to be a Rust expert, I have worked commercially in 17 programming languages including C and assembly, and have a good understanding of architecture's impact on execution. As in all performance discussions, the only way to know anything is to measure it. Intuition is too often wrong. We are satisfied (for now) that Elixir will meet our needs performance-wise, and is FAR better from a programmer-productivity and distributed-computing point of view. But if we have issues down the road, we will revisit the decision.
@skilz809821 сағат бұрын
These two should make this a regular routine.
@SaHaRaSquad20 сағат бұрын
Benchmark Busters
@skilz809819 сағат бұрын
@@SaHaRaSquad && -> !exception
@Vinaykumar-vy2sn22 сағат бұрын
i bet casey is very protective about his computer
@MrSomethingdark21 сағат бұрын
My fav programming guys, in the same thumbnail! YO!
@lukaleko720818 сағат бұрын
can someone explain to me, how the compiler can calculate the inner loop ahead of time, at 43:06? u is unknown at compile time...
@retropaganda844217 сағат бұрын
The compiler could generate code that calculates it at runtime. But just once, before the loops.
@ForeverZer016 сағат бұрын
It isn't that it can be calculated at compile time; it's that, where it is used, it would only need to be calculated once, not on every iteration of the inner loop. The relationship between the numbers in the operation is constant. Here is a flattened and simplified example that conceptually illustrates the same problem:

x = ...
y = ...
z = x - y

Now, if with each iteration you add 1 to both x and y, do you really need to recalculate z on each iteration? The code in the video has such an issue; it is just far less obvious because it uses nested loops and a modulo.
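That invariant, as a tiny runnable sketch (values made up purely for illustration):

```python
# z = x - y never changes while x and y move in lockstep,
# so a compiler (or a human) can compute it once, outside the loop.
x, y = 10, 3
z = x - y              # hoisted: calculated a single time
for _ in range(5):
    x += 1
    y += 1
    assert x - y == z  # recomputing it each iteration would be wasted work
```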
@__-dy9gv20 сағат бұрын
Another optimization compilers don't do that would likely be worth doing in this code: computing the constants needed for a fast modulo by u at runtime, and using those instead of an idiv.
@user-pe7gf9rv4mКүн бұрын
no haskell?
@boatunsoldКүн бұрын
haskell is way too fast to be shown on the comparison
@timedebtor20 сағат бұрын
Haskell is lazily evaluated so only executes when the results are useful. Benchmarks are not useful. Haskell will never execute. 0ms runtime
@roelhemerik57157 сағат бұрын
Haskell would definitely have the fastest runtime for this benchmark, as GHC will just optimise it to the final result… That said, it will probably also have the longest compile time, because GHC must optimise it to the final result.
@benitoe.487813 сағат бұрын
That Julia should be that slow is wild. It is basically made for stuff like this. But this is what I say about every benchmark result I do not like.
@The1RandomFool6 сағат бұрын
That Python result can vary so much depending on the library used to run it that I think it's misleading. I used Numba, a just-in-time function compiler, and got an average of 1.747 seconds with Hyperfine on my system. Using NumPy and eliminating the loops altogether gives an average of 47.4 ms; it optimized the vast majority of the benchmark away and simply gave the answer, as designed. Python just beat everything on the chart.
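For what it's worth, here is a scaled-down sketch of how I'd guess the NumPy version collapses the work (assuming the benchmark shape is: each a[i] accumulates j % u over the inner loop, then adds r, and the program prints a[r]):

```python
import numpy as np

def bench(u, r, outer=10000, inner=100000):
    # every a[i] receives the identical inner-loop total, so compute it once
    inner_sum = int(np.sum(np.arange(inner, dtype=np.int64) % u))
    a = np.full(outer, inner_sum, dtype=np.int64) + r
    return int(a[r])  # the benchmark prints a[r]
```

That's one pass over the inner range instead of outer × inner iterations, which is why most of the benchmark simply disappears.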
@keyboard_g19 сағат бұрын
I would expect most languages compiled with LLVM for simple cases to perform nearly identical.
@madpuppet6669 сағат бұрын
if two languages hit the same ballpark then you're usually fine. as someone who works on 60fps games, garbage collection is my main enemy. That fucker makes spikes that you either can't get around or have to spend a lot of time trying to isolate or smooth out. Any 2 LLVM languages that don't use garbage collection can probably just be considered equal, because it will come down to programmer skill, but I wouldn't choose C# over C++ for performance, simply because of the garbage collection.
@lupf56893 сағат бұрын
I would have assumed the stuff they introduced in newer .NET versions, like Span, Memory, object and array pools and such, should help avoid a lot of allocations and take a lot of pressure off the GC. Am I wrong?
@sarojregmi20023 сағат бұрын
Guys, imagine Casey reviewing your PR.
@blarghblargh9 сағат бұрын
"Looks good 👍"
@markminch19069 сағат бұрын
my PRs are good so that's not a problem
@jn-iy3pz12 сағат бұрын
I think the meaning of the visualization is that the rate of the bouncing == the speed of the algo. That would be better visualized by a bar chart (one that doesn't bounce around), or even more simply, an ordered list of numbers (without bouncing rectangles in the background).
@ShaneFagan22 сағат бұрын
My first reaction to this was just "zero chance this is a valid comparison", and the more I think about it the worse I think it is. Here are a few from my list:

1. There is a big difference between compiled and interpreted languages. C is fast because the compiler will look at your code and understand "hey, you are doing this thing, let me try to chop off some unneeded calls or use the fastest method", whereas Python, if you toss it something on the fly, has to work it out and then spit out the answer. So if you are saying compiled languages are by and large faster than interpreted languages, then the answer is "duh".

2. Even with an interpreted language there are ways to make things run fast. I went in and looked at the code that was run for this comparison and got Python to run faster than most languages, and with less code, by using NumPy instead of the original for-loop approach. NumPy plus standard Python got within 2ms of C on my machine.

3. The test itself is technically invalid too, because it starts off with a random number generator. I'd assume this is to stop compiled languages from cheating and pre-computing the result with compiler tricks, but it causes serious issues with how valid the comparison is. Like if you generate a number with 5 zeros in one language and 15 in another, you aren't comparing apples with apples.
@TurtleKwitty18 сағат бұрын
At minimum they really should have taken two args from the CLI, made a list of, say, 1000 pairs, and run each language against each pair. It would be fairer and would balance out any strange edge cases a language might have on specific pairs, while staying reproducible, yeah.
@tomorrow614 сағат бұрын
Ok, that was scary, as it took me back to a very simple performance/memory sanity test loop I wrote a couple of decades ago to exercise memory in sets of 10000-item arrays in Java. This benchmark was so similar.
@88Nieznany886 сағат бұрын
Amazing. I knew division is super slow and compilers despise it, but it's eye-opening that code which looks like it would take longer can actually execute 3-4x faster. Just wow!
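As I understood the video, that's the flavor of trick involved: replace the divide hidden inside % with a wrap-around counter. A Python sketch of the transformation (it only demonstrates the equivalence; the speedup shows up in compiled code, not in Python):

```python
def sum_mod_naive(n, u):
    # what the benchmark's inner loop computes (j % u costs a divide each time)
    return sum(j % u for j in range(n))

def sum_mod_no_div(n, u):
    # same result with no division at all: since j goes up by 1 each
    # step, the remainder just increments and wraps back to 0 at u
    total, m = 0, 0
    for _ in range(n):
        total += m
        m += 1
        if m == u:
            m = 0
    return total

assert sum_mod_no_div(100000, 7) == sum_mod_naive(100000, 7)
```

The branchy version looks like more work on paper, but a predictable compare-and-reset is far cheaper than an idiv on modern CPUs.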
@homomachina580816 сағат бұрын
Question though... a lot of what's mentioned here as compiler optimizations, while I very much agree they help performance in theory, do we really know how much they affect real-world applications? To some significant degree, I'm sure, but having a % operator in a for loop is incredibly common in many languages. If I were writing code to compute large arithmetic that didn't use things like %, I would likely not be using a for loop and instead just writing the math operations. While I appreciate the compiler saving my rear end from the sloppy for loops I'm sure I write, I'm wondering if the author of this was getting at some sort of point after all? Even if the benchmark results are crude, writing something less crude would take a lot more effort. Is there any way to judge how realistic these benchmarks are?
@bibekshah37017 сағат бұрын
I think the for loop alone is also a good indicator of speed, because looping over a huge amount of data is a fairly common task.
@swordfeng7 сағат бұрын
Is no one noticing that, for all i, a[i] is just the same value? The result is just r + sum(j % u for j in range(100000)) = r + sum(range(u)) * (100000 // u) + sum(range(100000 % u)), not considering negatives and overflows. And you can optimize sum(range(x)) down to a constant-time operation.
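A quick runnable check of that closed form (q full cycles of 0..u-1, plus a partial cycle of length d):

```python
def inner_sum_closed(n, u):
    # sum of j % u for j in range(n), in O(1)
    q, d = divmod(n, u)   # q full cycles, then a partial cycle 0..d-1
    return q * (u * (u - 1) // 2) + d * (d - 1) // 2

for u in (1, 2, 7, 100, 99999, 100000, 123456):
    assert inner_sum_closed(100000, u) == sum(j % u for j in range(100000))
```

So in principle the whole billion-iteration benchmark reduces to one divmod and a couple of multiplies, which is why "can the compiler see through the loop?" dominates the results.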
@skaruts11 сағат бұрын
I don't usually compare languages, but I've done it a couple of times with a Game of Life. I made it with CLI graphics just to make sure it's working properly, then turned the graphics off and just computed some number of generations. It's important that all tests start from the same board configuration, since GoL is deterministic, so it's a level playing field. I don't know if this is a good benchmark, but at least it's an actual program doing actual stuff, and the implementations across languages turn out to be very similar.
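Not the commenter's code, just a minimal sketch of what such a harness can look like: a deterministic starting board, graphics off, and a fixed number of generations whose result every implementation must match:

```python
from collections import Counter

def gol_step(board):
    # board is a set of live (x, y) cells; standard Conway rules
    neighbours = Counter((x + dx, y + dy)
                         for (x, y) in board
                         for dx in (-1, 0, 1)
                         for dy in (-1, 0, 1)
                         if (dx, dy) != (0, 0))
    return {cell for cell, n in neighbours.items()
            if n == 3 or (n == 2 and cell in board)}

def run(board, generations):
    for _ in range(generations):
        board = gol_step(board)
    return board

# deterministic start: a glider, so every language begins identically
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
# after 4 generations a glider has translated itself by (1, 1)
assert run(glider, 4) == {(x + 1, y + 1) for (x, y) in glider}
```

The final board doubles as a correctness check across languages: if an implementation's board diverges, its timing is meaningless.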
@MikeGaruccio16 сағат бұрын
1:17:50 I’d think the best “real world” benchmarks actually would do things like pulling in OpenSSL, if that’s how someone would typically do it in a real project. The whole point for most people is figuring out how well a language may work for their project, more than how fast the language internals really are.
@ForeverZer016 сағат бұрын
Exactly. "Real-world" benchmarks should compare entire projects, as they are, against others that do the same thing but are written in a different language. For the sake of a benchmark, go ahead and throw out the cost and time of development, but I hate that so many people pretend a micro-benchmark of Fibonacci or a Mandelbrot fractal is in any way informative, unless that is the only thing your application does.
@VFPn96kQT5 сағат бұрын
I would be very surprised if, looking at the compiled languages in Compiler Explorer, we'd see any difference in the generated assembly for these loops.
@ShreyasGaneshs23 сағат бұрын
My goats back at it
@z-a359419 сағат бұрын
13:45 “what you said is kind of literally true”
@p_d3r421 сағат бұрын
I don't know why people measure only time. They should also measure the resources used to achieve that time; the relation between the two would tell you more about performance.
@mattymerr70110 сағат бұрын
The biggest problem I find with benchmarks is that they write the same code for each language rather than idiomatic or optimized code. All the code should be reasonably optimized for each language individually first.
@АлексейСтах-з3н17 сағат бұрын
Why would one ever use integer modulo then? Are there cases where doing it manually is slower?
@UnidimensionalPropheticCatgirl16 сағат бұрын
To mimic the correct behavior of 32-bit integer modulo, you actually need 64-bit floats in a lot of cases, due to float representation; so in environments where 64-bit floats are problematic, you might opt for integer modulo. Similarly, you need bigger floats for 64-bit ints, etc.
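A quick stdlib demonstration of the representation issue (my own example, not from the video): a float32 has a 24-bit significand, so it cannot hold every 32-bit integer exactly, while a 64-bit double can:

```python
import struct

def to_float32(x):
    # round-trip through a 4-byte IEEE-754 float
    return struct.unpack('f', struct.pack('f', x))[0]

n = 2**31 - 1                # INT32_MAX
assert to_float32(n) != n    # rounds up to 2**31.0: precision lost
assert float(n) == n         # Python's float is a 64-bit double: exact

# so a float-based modulo of 32-bit ints is only safe in the wider type
assert float(n) % 97.0 == n % 97
```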
@bhabbott6 сағат бұрын
Good benchmark, told me exactly what I wanted to know. Presentation's not so good though. What's with the bouncy bars?
@skaruts9 сағат бұрын
Casey is right about Lua. I don't know about vanilla Lua, but LuaJIT is indeed slower at the outset. I always run my Lua benchmarks in waves because of that: the first wave or two are almost always slower. It takes a bit for LuaJIT to gather information, do its magic, and actually start accelerating the language.
@dipi712 сағат бұрын
The things compiler devs have to wrestle with nowadays are unreal. Zen 4 and AVX-512 have gotten so complex that it's become impossible to predict what the machine code will look like, and how modern CPUs might sequence it.
@chrisgregory344214 сағат бұрын
The title should be "Imposter Syndrome On-demand"
@halneufmille15 сағат бұрын
I didn't know how slow % was. I just managed to make a function I wrote a few days ago twice as fast by avoiding it.
@SimGuntherКүн бұрын
Why not FPS on a typical game written in these languages as a performance measurement? Too many different factors to just say the language features affect performance. Libraries and bindings will muddy the waters on how much of the performance is owed to the FFI or library linked to the program. What I'd be curious about is dev hours per FPS to see how quickly they'd get a fast program along with a count of days per phase improving performance and by how much in each language.
@IndellableHatesHandles23 сағат бұрын
That would be measuring the performance of the graphics library too.
@SimGunther23 сағат бұрын
@@IndellableHatesHandles I just said that the libraries would muddy the waters in performance benchmarks for these languages. What I'd also be curious about is how much those libraries use lower-level OS-specific APIs, and how much of an impact that has on library performance.
@rohitaug23 сағат бұрын
I think using the standard libraries of these languages to build the same todo app would be a good benchmark, since it's non-trivial but wouldn't take a lot of effort to write (so many languages could be included), and it would show whether the language has the tools necessary to be productive.
@IndellableHatesHandles22 сағат бұрын
@@SimGunther I guess I should wake up before I read KZbin comments. Didn't even bother to read ahead for some reason. My bad.
@ZedDevStuff22 сағат бұрын
Because it's pretty useless. Games are rendered by the GPU and most games use bindings for whatever graphics API is needed. Unless you're running the game logic in the same thread as everything else
@ravenecho241011 сағат бұрын
always fun to see prime with casey
@romanstingler43523 сағат бұрын
@Prime please submit a PR to LLVM :P
@sebastianmocanu6399Күн бұрын
I watched this live; can't wait to post a link to this every time I see a garbage comparison on LinkedIn
@thedude731919 сағат бұрын
As a non-programmer: aren't coding languages geared toward specific tasks they excel at? Does a simplistic test take this all into account, or is it a basic averaged-result test without any controls?
16 сағат бұрын
We need to make the term: "the performance bro" a thing
@UnknownUser-mj8rg21 сағат бұрын
Can someone explain the j%u modulo thing? I don't get how it's a constant
@chainingsolid15 сағат бұрын
It's not that "j%u" is a constant; it's that the value generated by the inner loop (the sum of all of its iterations) is the same for every single iteration of the outer loop. So hoist it out, run it once, and cache the value in a register.
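A scaled-down sketch of that hoist (my own illustration, with smaller loop counts than the benchmark's 10000 × 100000):

```python
def naive(u, r, outer=100, inner=1000):
    # shape of the benchmark: every a[i] redoes identical inner-loop work
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):
            a[i] += j % u
        a[i] += r
    return a

def hoisted(u, r, outer=100, inner=1000):
    # the inner loop depends only on u, never on i, so run it once
    inner_sum = sum(j % u for j in range(inner))
    return [inner_sum + r] * outer

assert naive(7, 3) == hoisted(7, 3)
```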
@hukasu20 сағат бұрын
Lua 5.3 or 5.4 introduced integers, so this program as-is would probably run with integers
@majam1n10 сағат бұрын
Nested for loops are not a proxy for how fast a language is, because computational speed is not the only metric you need to take into account when selecting a tool for a job. I would love to see a comparison across multiple domains: live servers, I/O, concurrency, etc. These languages truly specialize in specific domains. But also, why not take programming ergonomics into account??! i.e. Rust vs. JS? It matters!
@mateuszormianek605422 сағат бұрын
Hi, kinda new here. I tried to run the faster version of the code on my machine (MacBook M1 Pro) and I got 7.5 seconds (benchmarking via time or hyperfine). What's wrong with my machine, or what could I be doing wrong when compiling? (compiled via clang path -o program)
couldn't the program just use volatile int for the loops to prevent compiler optimization?
@perguto21 сағат бұрын
You can just write down the result of the inner loop in closed form: a[i] = (100000/u) * u*(u-1)/2 + d*(d-1)/2, where d = 100000 % u and 100000/u is integer division
@onurdemirhan23 сағат бұрын
27:20 Using notepad++ ???
@dj_jiffy_pop23 сағат бұрын
I don't know how old most of you are... but back in MY day, a billion iterations of ANYTHING was UNHEARD OF...
@izd422 сағат бұрын
in 1959, a Francois Genuys calculated 16000 digits of pi over 4.3 hours on a computer that could handle 12000 additions per second. A very naïve guess says his program executed (4.3h)(60m/h)(60s/m)(12000c/s) ≈ 186 million instructions
@minnesotasteve22 сағат бұрын
We benchmarked counting the days
@dj_jiffy_pop22 сағат бұрын
@@izd4 He was probably using Forth... all bets are off, if that's the case...
@nisonatic17 сағат бұрын
You should be old enough to have met the ancient historian Gooney the Elder who saw, in a dream, Chookus Norrus do a thousand thousand thousand iterations on an abacus to calculate his wine tab.
@Sean-fh9nj18 минут бұрын
I want to learn a new programming language, can someone please tell me which language can find prime factors the fastest?
@blarghblargh10 сағат бұрын
yes, more Prime and Muratori. this shit is my jam
@hamidoyempemi2720 сағат бұрын
The benchmark becomes trash when your favorite language is at the bottom of the list 😂😂😂
@slyracoon236 сағат бұрын
It's called loop unrolling. The compiler unfortunately will not unroll 100000 iterations, but if you did fully unroll it, it would compile down to a constant
@platin214821 сағат бұрын
Was this odin with bounds checks or without?
@blarghblargh9 сағат бұрын
1:23:19
@shanahjrsuping734420 сағат бұрын
24:52 I don't understand why they say j modulo u is a constant. If my input number is 5 and j passes 5, doesn't it stop being a constant at that point?
@JJOULK18 сағат бұрын
From what I understood, the whole inner loop becomes a constant after its first completion. Since every a[i] begins at zero, u never changes and neither do the loop parameters, so the inner loop always computes the same result (and so does the outer one). They're not compile-time constants, but one can deduce that, during the runtime of the program, the value effectively becomes a constant. (Though I would say it isn't as "trivial" for a compiler to spot.)
@bearwolffish7 сағат бұрын
lol @ that twitch chat comment "ocaml won by being off the charts"
@chrisspellman595223 сағат бұрын
Just starting the video, but... real question: why is PowerShell 7 never listed in any of these things? Not just this chart but so many others. You'll always see Python, sometimes JavaScript shows up, but never PS7. What kind of bias is this? I guess it really does confirm that the comparison charts are junk.
@ZedDevStuff22 сағат бұрын
No idea about the performance but isn't PowerShell just .NET? If there isn't much performance difference then C# being there counts as enough effort
@hamzakhiar3636Сағат бұрын
the man is using Notepad++. What an OG
@herrquh4 сағат бұрын
for loop as a service is the final form of microservices
@VirtualShaft22 сағат бұрын
Well, that's why you look at backend or frontend framework benchmarks instead of just pure language benchmarks.
@kafran21 сағат бұрын
Whoever did this is a genius. They got the entire tech community around the globe to discuss it. 😹