To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/DougMercer. You’ll also get 20% off an annual premium subscription.
@eddie_dane7 ай бұрын
Are mustaches the new hoodies for programmers now?
@dougmercer7 ай бұрын
I grew mine at the start of COVID ironically and never got rid of it ¯\_(ツ)_/¯
@raniwishahy19047 ай бұрын
Prime mentioned
@dougmercer7 ай бұрын
@@raniwishahy1904 blazingly fast!
@ApexFunplayer6 ай бұрын
Thighhighs bruh.
@knut-olaihelgesen36086 ай бұрын
Why not annotate the return type of your functions? Those are the most important ones. Because it returns a sequence, annotating the elements of that sequence in the return type will be golden for the rest of the program
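A minimal sketch of the suggestion above. The function names `parse_line` and `parse_lines` are hypothetical and not taken from the video's code; the point is only that the return annotation spells out what each element of the returned sequence holds.

```python
def parse_line(line: bytes) -> tuple[bytes, float]:
    # Split one "station;temperature" record into its two fields.
    station, _, temp = line.rstrip(b"\n").partition(b";")
    return station, float(temp)


def parse_lines(lines: list[bytes]) -> list[tuple[bytes, float]]:
    # The return annotation tells every caller (and the type checker)
    # that each element is a (station name, temperature) pair.
    return [parse_line(line) for line in lines]
```

With this in place, a type checker can flag code elsewhere in the program that treats the station name as `str` or the temperature as `int`.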
@danieljakob13077 ай бұрын
The Summoning Salt homage at 8:26 is brilliant. Fantastic video!
@dougmercer7 ай бұрын
Thanks =] I had way too much fun with that, haha
@ric82487 ай бұрын
It would have been great to play the musical theme right there!
@dougmercer7 ай бұрын
Summoning Salt does use the track I played there sometimes ("4" by HOME). Love his music choices =]
@mortsaidmort3 ай бұрын
I absolutely loved that, too. Just 11/10 usage of the homage as well
@affrokilla5 ай бұрын
A 50x speedup between a for loop and a polars dataframe is really significant, great video!
@dougmercer5 ай бұрын
Polars/Duckdb are crazy fast. Thanks for watching!
@skanderghamgui50397 ай бұрын
I had a project last year where I had to automate a manual process using Python to extract data from an Excel file and auto-fill an XML file. After I finished the project, I reduced the process from 3 months of human work to a 20-minute code run, which made me and my boss very happy. I wish I had seen this video last year; we could have been even happier. Nevertheless, it's great to know that I can achieve such high levels of Python performance. I will ensure better time management for my future projects. Thanks.
@dougmercer7 ай бұрын
3 months to 20 minutes is a great speed up! How frequently do you need to extract the data? One of my favorite XKCD comics is "Is it worth the time?" xkcd.com/1205/ Odds are, 20 minutes is good enough =]
@skanderghamgui50397 ай бұрын
@@dougmercer the company I worked for needed that data often almost on every project they accept so yeah I saved them a ton of time! That was my end of study project during a 6 month internship which I used to succeed with high honors from the university.
@joker3451727 ай бұрын
8:24 Amazing trick! It reminds me of computer graphics class where we had to find a way to improve the DDA Line algorithm... No one could do it. Then, the professor showed us the Bresenham algorithm. It's such a simple concept - instead of working with floats, work with integers! - but it saves soooo much time. It goes to show that sometimes the data type you're working with can have a huge effect on how fast your code is. Drawing a parallel to Machine Learning, this is also why new GPUs have FP8 and FP16 as big selling points. Training with FP32, which is still the standard for a lot of applications, is just dog slow compared to using FP16 or even FP8.
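For readers who have not seen it, here is a sketch of the integer-only idea behind Bresenham's line algorithm mentioned above. It is the common textbook formulation with an integer error term, written in Python for illustration; the function name `bresenham` is just a label chosen here, and nothing in it comes from the video.

```python
def bresenham(x0: int, y0: int, x1: int, y1: int) -> list[tuple[int, int]]:
    # All arithmetic is on integers: the error term tracks how far the
    # rasterized line has drifted from the ideal one.
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * err
        if doubled >= dy:  # step along x
            err += dy
            x0 += step_x
        if doubled <= dx:  # step along y
            err += dx
            y0 += step_y
    return points
```

The temperature trick in the video follows the same spirit: keep values as integers (tenths of a degree) for as long as possible and convert once at the end.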
@dougmercer7 ай бұрын
Very true! (Also, super cool algorithm -- I never worked with computer graphics so I just read up on bresenham's algorithm)
@Deltax646 ай бұрын
Half true - the main benefit of FP8/FP16 is reduced memory footprint, not so much the fact that individual operations are faster.
@mnxs6 ай бұрын
@@Deltax64 should individual instructions not be slightly faster on smaller data? I don't actually know how floating point ops are implemented in hardware, I've only learnt a bit about integer arithmetic hardware, and in those cases I'd think bigger data sizes would mean slightly slower performance since certain ops need to have partial results cascade, or indeed need multiple micro-ops. However, mostly it depends on the complexity of the arithmetic circuitry. But yes, the smaller data size is likely the biggest winner. So much of GPU processing is memory bound, plus you can fit larger data sets into memory with better cache performance when you have smaller data units. (I'm wondering if the practical implementation of ops with these smaller data types really just do a conversion to f32, compute, then convert back down. Would simplify things, if nothing else.)
@Deltax646 ай бұрын
@@mnxs > @Deltax64 should individual instructions not be slightly faster on smaller data? I don't actually know how floating point ops are implemented in hardware, Yes, they *can* be faster, just depends on the chip. But like you say, sometimes they're implemented internally by widening and converting back -- so may be the exact same!
@AbeDillon5 ай бұрын
@@Deltax64 OP never claimed that individual operations were faster (though that's almost generally true. It's true for every piece of silicon I know of that can handle smaller FPs). They just claimed that using different types can make a big difference and gave the use of small FP formats in ML as an example. That's not "half true" that's fully true. No correction needed. It's also not generally true that memory footprint and/or bandwidth is the bottleneck. Cerebras builds systems where the compute resources and memory bandwidth are almost perfectly balanced for sparse matrix multiplication in FP16. That means if they add support for Block FP16, they'll be compute-bound. Eventually, though, I don't know how they're going to maintain that balance, since logic density typically scales way better than memory bandwidth with each new process node.
@guinea_horn7 ай бұрын
C *can't* be slower than Java, can it? The slowest C implementation would be to implement the entire JVM and then write bad Java code
@dougmercer7 ай бұрын
From some comments on Reddit, they speculate the Java implementation performed better than C because Java has a JIT. www.reddit.com/r/Python/comments/1c4ln3x/comment/kzshq27/ Alternatively, since the challenge started in the Java community, more people worked together to find more optimizations.
@alfiegordon90137 ай бұрын
You would be SHOCKED how much linked libraries slow down a lot of code. It's why LuaJIT FFI C is as much as 25% faster than native C, because it doesn't have to do linking
@stefanalecu95327 ай бұрын
@@dougmercer to my knowledge, Java hasn't broken the 1s barrier, while the fastest C solution is 0.5s, so C isn't losing its job any time soon
@dougmercer7 ай бұрын
Who got down to 0.5 seconds in C?
@mayur98767 ай бұрын
JIT has a runtime cost. No way Java beats C in terms of code execution. To me this sounds like a C skill issue😅
@BosonCollider7 ай бұрын
The actual lessons from this are: 1: use duckdb 2: otherwise, use polars 3: use pypy more, and push back against libraries that are incompatible with it
@dougmercer7 ай бұрын
Yup, absolutely
@seantparsons7 ай бұрын
The lesson I took from this is that you should probably just write it in Java in the first place.
@MorkusSalasevicius5 ай бұрын
Try using polars with parquet format instead… or even then, use it with a memory mapped arrow file
@otty40007 ай бұрын
wow this was a really great video. It's impressive to explain code/libraries differences that quickly and clearly.
@dougmercer7 ай бұрын
Thanks =]
@artlenski81157 ай бұрын
Highly optimised C with proper compiler specifiers taking almost double the time of Java implementation, even if GC is turned off.. hard to believe.
@garette86727 ай бұрын
how could it possibly be hard to believe that more people happened to try to optimize the java implementation.. not a crazy concept and surely plausible.
@JavedAlam244 ай бұрын
There's a good article on substack that got it down to 0.77 seconds in C++. Actually, it shows the baseline speed (without any optimisations) as 2 and a half minutes, which disproves lots of what people are saying here about compiled Java being as fast as C. Clearly the value referred to in the video was the unoptimised code.
@konstantin22963 ай бұрын
Because it is not true. The fastest C approach is somewhere around 0.45s (the last time I checked)
@mathmaniac436 ай бұрын
What did you not like about the index variables in booty's original code? I find named variable indexes more readable than "magic numbers". I would have probably used an enum with incrementing values instead.
@dougmercer6 ай бұрын
You're right. I've since changed my mind. When refactoring, I got a bit fast and loose with timing and making multiple changes at once. I thought that removing them helped performance, but I was mistaken. They definitely help maintainability and should have been kept
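A sketch of the enum idea from this thread, assuming each station's running statistics live in a four-element list laid out as [min, max, sum, count]. That layout, the `Stat` enum, and the `update` helper are all hypothetical names chosen for illustration, not the code from the video.

```python
from enum import IntEnum


class Stat(IntEnum):
    # Named positions inside the per-station statistics list.
    MIN = 0
    MAX = 1
    SUM = 2
    COUNT = 3


def update(stats: list[int], temp: int) -> None:
    # IntEnum members are ints, so they can index a list directly.
    if temp < stats[Stat.MIN]:
        stats[Stat.MIN] = temp
    if temp > stats[Stat.MAX]:
        stats[Stat.MAX] = temp
    stats[Stat.SUM] += temp
    stats[Stat.COUNT] += 1
```

Whether the attribute lookups cost anything measurable in a hot loop would have to be timed; plain module-level constants such as `MIN = 0` are an alternative that keeps the names without the enum.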
@FirroLP7 ай бұрын
Dude, your production quality is so good it's criminal. Had to tell you
@dougmercer7 ай бұрын
Thanks man, that's such a nice compliment. I really appreciate it =]
@nullzeon7 ай бұрын
how am I just finding out about this channel, editing, knowledge, this video was fantastic!
@dougmercer7 ай бұрын
Thanks! Glad you enjoyed it =]
@50shmekles6 ай бұрын
This is one of the most well-done, detailed and thorough yet clear, concise and to the point videos ever. Thank you for introducing me to new concepts and libraries!
@dougmercer6 ай бұрын
Thanks! Glad it was helpful!
@south8285 ай бұрын
14:09 "I'm perfectly happy to run six times slower code if it means I never have to read or write Java." My thoughts on Java summed up in one sentence📍
@slr1503 ай бұрын
Your cloud service provider will also be happy to charge you 10x extra for the compute
@IntEagleOvefovex3 ай бұрын
Idk i find java syntax sexy, so specific and organized. Also makes python way easier to learn.
@robergroso3 ай бұрын
why? Java is a lot better and easier than python, you don't have to guess nothing and is 8x faster to write and to run
@xiaolin8673 ай бұрын
@@IntEagleOvefovex >i find java syntax sexy ☹️
@cinderwolf323 ай бұрын
I will always prefer to work in a shared codebase with a strongly typed language than a shared codebase with a loosely typed dynamic language
@tzacks_7 ай бұрын
in other words, getting performance out of python means rewriting the code in C or using a library written in C :)
@dougmercer7 ай бұрын
PyPy is written in RPython, which targets C. A lot of compilers target C ¯\_(ツ)_/¯.
@gawwad40735 ай бұрын
Did you watch the video?
@tzacks_5 ай бұрын
@@gawwad4073 i did, did you not?
@xanderplayz34464 ай бұрын
@@gawwad4073 Well, mmap is written in C.
@aquacruisedb7 ай бұрын
I have no idea what any of this means, and I thought a python was a snake and rust a problem.. BUT, strangely it was entertaining to watch, and very satisfying to see the run times come down!
@dougmercer7 ай бұрын
Hah! Great comment. Thanks for watching =]
@aquacruisedb7 ай бұрын
@@dougmercer It's a testament to your presentation skills that a non-programmer made it to the end tbh. I'm just scratching my head as to why youtube put it in my feed, but I'm not complaining!
@dougmercer7 ай бұрын
@@aquacruisedb the universe is sending you signs to learn to program! Or buy a snake... Or check your car's undercarriage for rust...
@anon_y_mousse7 ай бұрын
Personally, I consider writing fast code to be a matter of experience. If you know the correct methodologies for doing things, then writing a fast solution should be second nature. Take for instance Danny's naive implementation in C, which in the linked article, he states that it took 8 minutes. His justification for writing it that way is that C doesn't have a native hash table implementation, but if you use C and aren't implementing it yourself or have previously implemented it yourself, then you should at the very least know where to get an adequate third-party library. This is also why anyone who's newly getting into programming should only use C if they want to be a good programmer because you'll have to learn how to do so much on your own until you learn what libraries you should use or have your own. Since my computer has lower specs than Danny's, I'm going to test my own library and see how it compares.
@gharren7 ай бұрын
As we all know, Python is the fastest programming language there is. By the time your program has done its job, the C++ developer is still busy fixing segfaults.
@Danielm1037 ай бұрын
Yeah, no not really. I write with both languages, “How fast is python at..”, is not really a question, because I drop down into C/C++ and write an optimized module.
@michal45617 ай бұрын
@@Danielm103 so you can write the performance-critical parts in C++ while keeping development time short for the general case?
@Danielm1037 ай бұрын
@@michal4561 This is what makes Python amazing. If you follow the paradigm “premature optimization is the root of all evil.” You can happily code along in Python until something becomes a problem performance wise, then look for an optimized module, I.e. similar to how numpy does all the heavy number crunching in C. I do a lot of heavy computations in my work, so I write the stuff that needs to be fast in C++ and call it from python
@phitc42427 ай бұрын
@michal4561 he's built different. a gigachad so to say. he does the shit you guys using regular python don't want to do - writing real optimized code. check what language your favorite python modules are written in - most of the time in C/C++. and python is just a wrapper for those two. Without those things written in C/C++ (or even assembly) python would never in a billion lifespans of the universe be as fast as it is today. we have to be honest here and accept that fact. and be thankful for a moment. Also, I have nothing against a fast python, I just want to make sure we all have a reality check here. And I love C.
@hevad6 ай бұрын
Yes, but once you put it in production, your Python implementation continues to drag on every job it runs. Also, writing the optimized Python implementation seems to take just as long as a reasonable C++ implementation; if not longer.
@rgndn_bhat6 ай бұрын
Nice one, Doug. My CPython implementation finished in 64 seconds on an M2 MacBook Air, almost the same approach - memory mapped, multiprocessing and chunks
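A rough sketch of the "memory mapped, chunks" approach described above, under stated assumptions: the helper names `chunk_boundaries`, `process_chunk`, and `merge` are hypothetical, temperatures are parsed with plain `float()`, and the per-station statistics are kept as [min, max, sum, count]. It is not the commenter's code or the code from the video.

```python
import mmap
import os


def chunk_boundaries(path: str, n_chunks: int) -> list[tuple[int, int]]:
    # Split the file into byte ranges that each end right after a newline,
    # so no record is cut in half. May return fewer ranges than requested.
    size = os.path.getsize(path)
    if size == 0:
        return []  # an empty file cannot be memory mapped
    target = max(1, size // n_chunks)
    bounds = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        while start < size:
            end = min(start + target, size)
            if end < size:
                newline = mm.find(b"\n", end)
                end = size if newline == -1 else newline + 1
            bounds.append((start, end))
            start = end
    return bounds


def process_chunk(path: str, start: int, end: int) -> dict[bytes, list[float]]:
    # Aggregate [min, max, sum, count] per station for one byte range.
    stats: dict[bytes, list[float]] = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mm.seek(start)
        while mm.tell() < end:
            station, _, temp = mm.readline().rstrip(b"\n").partition(b";")
            value = float(temp)
            entry = stats.get(station)
            if entry is None:
                stats[station] = [value, value, value, 1]
            else:
                entry[0] = min(entry[0], value)
                entry[1] = max(entry[1], value)
                entry[2] += value
                entry[3] += 1
    return stats


def merge(total: dict[bytes, list[float]], partial: dict[bytes, list[float]]) -> None:
    # Fold one chunk's statistics into the running totals.
    for station, (low, high, acc, count) in partial.items():
        entry = total.get(station)
        if entry is None:
            total[station] = [low, high, acc, count]
        else:
            entry[0] = min(entry[0], low)
            entry[1] = max(entry[1], high)
            entry[2] += acc
            entry[3] += count
```

In a real run, the ranges from `chunk_boundaries` would be handed to worker processes (for example with `multiprocessing.Pool.starmap`) and the partial dictionaries merged in the parent; here the pieces are kept single-process so they stay easy to test.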
@dougmercer6 ай бұрын
That's pretty good! So close to the sub-1-minute mark. Is it possible to release the GIL and do multithreading? That would probably save time.
@masterhacker70655 ай бұрын
@@dougmercer Could u just brute force with say 32 cores from a threadripper as it seems to benefit massively from more cores?
@dougmercer5 ай бұрын
@@masterhacker7065 for the purpose of the original challenge, the evaluation platform was 8 cores and 64 GB RAM of a particularly sized Hetzner system. For my video, I allowed using all cores on my laptop (didn't feel like paying for cloud compute for a silly YouTube video ¯\_(ツ)_/¯)
@jamesborden71056 ай бұрын
Practically speaking, I prefer the polars implementation over the duckdb because I'd rather chain function calls instead of manipulating text when doing data analysis in Python. But maybe a library like pypika would solve this?
@shadamethyst12587 ай бұрын
I'm impressed you did not do any profiling, nor any statistical test to rule out measurement fluctuations
@dougmercer7 ай бұрын
There definitely are a good deal of fluctuations --- it's largely why I used language like "10ish seconds" and waited to see reasonably large deltas in performance before declaring an improvement. Things definitely get tricky to measure at this speed!
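One lightweight way to get a feel for run-to-run fluctuation without leaving Python is to repeat the measurement and look at the spread. The sketch below uses only the standard library; the `benchmark` helper is a hypothetical name, and it is not how the timings in the video were taken.

```python
import statistics
import time


def benchmark(fn, repeats: int = 5) -> dict[str, float]:
    # Run fn several times and summarize the wall-clock timings.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        # stdev needs at least two samples
        "stdev": statistics.stdev(timings) if repeats > 1 else 0.0,
    }
```

If two implementations differ by less than the spread reported here, the difference is probably not worth declaring. The first run is often slower when the input file is not yet in the OS page cache, so it can make sense to report it separately.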
@f1shyv1shy357 ай бұрын
Why not use something like hyperfine?
@micmaxian6 ай бұрын
I think this would have blown up the scope of the video and also made it harder for non stats people to understand. I liked the fun ish measurements! Really good video, definitely subscribing and looking forward to more fun and informative content in the future Doug!
@Those_Weirdos3 ай бұрын
I’m also curious about the impacts of caches - Were caches expired/invalidated, or pre warmed to make sure the runs were consistent and not bound by disk IO?
@MakeDataUseful7 ай бұрын
Great video, thanks for taking the time to create 🤙
@dougmercer7 ай бұрын
Thanks! =]
@TiagoBonetti7 ай бұрын
The SummoningSalt reference was fire!
@dougmercer7 ай бұрын
Thanks =]
@claaaaams6 ай бұрын
lol I thought I was gonna be the only one to spot that.
@east4ming2 ай бұрын
I really like this place. Everyone in the comment area is talented and speaks nicely.
@DarkZeros7 ай бұрын
Nobody has actually tried on the C side. Because I am sure it can beat java or at least, get same results
@dougmercer7 ай бұрын
Give it a shot! Info for the C implementation is here www.dannyvankooten.com/blog/2024/1brc/
@DarkZeros7 ай бұрын
@@dougmercer Thanks! I might do! (Looks like a small enough problem to give a try)
@tsgraphics17 ай бұрын
Can you make a video comparing the performance of Mojo?
@dougmercer7 ай бұрын
I plan to some day, but am waiting on a v1 release.
@Wurstfinger-rl1zi6 ай бұрын
this shit is actual python wizardry
@randypittman2797 ай бұрын
Does file I/O chunking not really matter for the pure python implementations? That is, is there no gain in reading large chunks of the file into RAM rather than reading line-by-line? Rightly or wrongly (premature optimization) I always have a voice at the back of my head telling me to minimize I/O operations. Especially if the data is cold and on spinning platters! Super cool video. Switching to bytes and doing your own int parsing were new ideas to me!
@dougmercer7 ай бұрын
It might be possible to speed it up more with chunking! I didn't try because I couldn't really wrap my head around a good way of doing it. If you want to give it a shot, try forking this repo! github.com/dougmercer-yt/1brc (if you don't feel like generating 13GB of data, you're welcome to send me a gist or link to a fork and I'll try running it).
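One possible way to do the chunked reading discussed here, sketched under assumptions: read large binary blocks, emit only the complete lines, and carry the trailing partial line over to the next block. The generator name `iter_lines_chunked` and the 1 MiB default block size are arbitrary choices for illustration; whether this beats line-by-line reading would need to be measured.

```python
def iter_lines_chunked(path: str, chunk_size: int = 1 << 20):
    # Yield complete lines (without the newline) while reading the file
    # in large blocks instead of one line at a time.
    leftover = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            block = leftover + block
            cut = block.rfind(b"\n") + 1  # 0 if the block has no newline
            leftover = block[cut:]        # partial last line, kept for later
            yield from block[:cut].splitlines()
    if leftover:
        yield leftover  # final line when the file has no trailing newline
```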
@olfredos63 ай бұрын
I enjoyed every single bit of this video and seeing how it uses techniques used by others in different languages. Although Java is still way ahead, this makes me super happy. Thanks Doug! Subscribe!!!
@dougmercer3 ай бұрын
Thanks =]
@seboll13Ай бұрын
Despite the use of Polars which of course I think is a good idea, I must say that the factor 10 tweak of the temperature leading to the use of integers in the for loop is super smart.
@thahrimdon6 ай бұрын
This is amazing! I was in it with you for the long haul. Had me smiling and frowning the whole way! Great video!
@dougmercer6 ай бұрын
Hahaha awesome =] thanks!
@andersondantas20107 ай бұрын
[14.5s using rust] Hi, I did the challenge myself and that was my best time on a M1 with 8GB of RAM. To be honest I used some external dependencies but still enjoyed the challenge haha (first time coding rust). If you don't mind I'd like to discuss some items from your solutions: 1. Have you tested parsing the numbers byte-per-byte? 2. How can your code account for numbers under the 10 degree mark, as they have fewer digits than your parser expects? 3. Have you tried tweaking the chunk size to closer to the cache size? I had my best results reading chunks of 188kb. As I have less memory than the whole file size, mmap didn't give me the great performance other people had so I stayed with the manual file handling
@dougmercer7 ай бұрын
That seems like a pretty great time! Both my laptop and the official challenge workstation had 64GB of RAM, so I expect that your approach would be even faster on those systems. 1. I did not try parsing byte by byte. Do you have a gist that I could look at to get a sense of how you did it in rust? 2. Numbers in the file can either be -##.#, -#.#, #.#, or ##.#. Even if the temperature is ~0 degrees, it'll be 0.2 instead of just, say, .2, so these four cases are exhaustive. We first check if there is a minus sign. If there is, we effectively shift forward one character. Then, we check where the period is. If the period is the character after the current character, then we know that the number after the potential minus sign is of the form #.#. Otherwise, we know it is of the form ##.#. 3. I did not try to mess with chunk size. Another community member submitted a solution to the GitHub that was interesting. It's almost as fast as the doug_booty4 approach and does not use mmap. It had a chunk size parameter and that did affect performance. (Whereas doug_booty4 gets down to like 9.7s on my system, his got to about 10.1). I'm not sure if using a different chunk size for the doug_booty approach would help. It may!
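The four-case logic in point 2 can be sketched as follows. This is an illustration written from the description above, not the exact code in the repo; the function name `parse_tenths` is hypothetical, and it assumes the input is exactly one of the four forms, with the result returned as an integer number of tenths of a degree.

```python
def parse_tenths(raw: bytes) -> int:
    # Accepts b"-12.3", b"-1.2", b"1.2" or b"12.3"; returns tenths as an int.
    # Indexing bytes yields ints: 45 is "-", 46 is ".", 48 is "0".
    i = 0
    sign = 1
    if raw[0] == 45:  # leading minus: shift forward one character
        sign = -1
        i = 1
    if raw[i + 1] == 46:  # period right after the first digit: form #.#
        value = (raw[i] - 48) * 10 + (raw[i + 2] - 48)
    else:  # otherwise the form is ##.#
        value = (raw[i] - 48) * 100 + (raw[i + 1] - 48) * 10 + (raw[i + 3] - 48)
    return sign * value
```

Dividing by 10 once at the end, when printing the min, mean, and max, recovers the usual decimal temperatures.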
@andersondantas20107 ай бұрын
@@dougmercer although the chained ifs/else ifs might look unoptimized, the compiler ends up converting those to jump tables so the processing time is constant
@dougmercer7 ай бұрын
@@andersondantas2010 ah, did you reply with a second comment containing a link? YouTube might have caught it in a filter, but I don't see anything in my "held for review" comments. if so, maybe just comment back your GitHub username and I'll try to find the gist/GitHub on there ¯\_(ツ)_/¯
@dougmercer7 ай бұрын
@@andersondantas2010 I did try this approach. It was almost as fast, but the approach I listed in the video tends to be slightly faster. github.com/dougmercer-yt/1brc/blob/main/src%2Fdoug_booty4_alternate.py#L8-L18
@bryanbischof43516 ай бұрын
The cultural impact of summoningsalt on nerds is unmatched
@gawwad40735 ай бұрын
Nice video. VERY good writing and editing. Smooth as hell, keep it up!
@dougmercer5 ай бұрын
Thanks =]
@vncstudio2 ай бұрын
Some great techniques demonstrated for Python.
@kinuthiamatata60403 ай бұрын
Love the production... johnny Harris typa themes
@dougmercer3 ай бұрын
Thanks so much! ❤️ Johnny Harris, so that compliment means a lot =]
@Sugar3Glider6 ай бұрын
Convert to Lat/Long, z becomes temperature, translate locations into chosen format and youre gooden. Just need to set the display parameters.
@dearheart27 ай бұрын
Was interesting. It reminds me of back at the university. I was engineering all kind of algorithms. At that time there was no python.
@richardrubin21927 ай бұрын
This is great - thanks, Doug!
@dougmercer7 ай бұрын
Thanks for watching! =]
@ai_outline6 ай бұрын
Damn, this was an instant follow! Hope to see more computer science content 🙏🏼 Great video :)
@dougmercer6 ай бұрын
Thanks so much =]
@V1ewSh0t6 ай бұрын
@dougmercer I have an idea, what if you use the GPU instead of just the CPU? the GPU is historically faster when running repeating computations (As far as I know) I could be completely wrong about this and if I am, please tell me. But I feel as this could be worth a try! (Great video btw!)
@dougmercer6 ай бұрын
It's a good idea! I saw this submission that uses cuDF + Dask to get 4.5 seconds on their machine github.com/gunnarmorling/1brc/discussions/487
@TheFwip6 ай бұрын
Pedantic nit: at 8:00, you say "casting it as an integer instead of a float." This should be "parsing," as casting is (usually) used to refer to things that have no runtime cost - e.g. telling the compiler "now pretend these four bytes are an int32." Otherwise, very good video. Curious also which Java runtime you used.
@dougmercer6 ай бұрын
I used openjdk 21.0.2 because I wanted to brew install it, but the actual challenge winner used 21.0.2 graal
@TheFwip6 ай бұрын
@@dougmercer thanks!
@weaselontheclock66957 ай бұрын
Nice video, keep it up. Would love to have seen more language comparisons
@dougmercer7 ай бұрын
Good point. A few people have asked about Rust and Go... Will try to do next time!
@weaselontheclock66957 ай бұрын
Looking forward to it! Was my first time watching I'm already subscribed :), fantastic quality man
@ericcartmansh7 ай бұрын
Shocked to see the final java result
@dougmercer7 ай бұрын
Me too! Apparently someone's Golang solution got down to 1.1 seconds github.com/dhartunian/1brcgo
@darrenzou22257 ай бұрын
The fastest is of course multi-universe read, which can read all 1 billion rows simultaneously and do it in constant time
@dougmercer7 ай бұрын
At least until causality is deprecated. Then we can get the answer before running the code!
@АртемКулик-я3д6 ай бұрын
i would never use python, but i like watching how people optimize the hell out of something.
@dougmercer6 ай бұрын
There's something Zen about it 🧘
@gurupartapkhalsa65653 ай бұрын
A lot of python libraries, especially GPU libraries, are actually executing linked C/C++ code. A good example most people should know is "torch," which you access via Python but which is actually calling C++/CUDA code. Obviously, a SIMD operation is going to beat an interpreter performing a bunch of string conversions, even if it's Java with JIT. The benchmark has to be clear about language features and the constraints of which operations or features are being tested, as well as the testing methodology and how to achieve the same (stated) performance of the bytecode assembler (JVM) using Assembler. Rather than making it a pissing contest, it would be more laudable to demonstrate circumstances where normal people can unleash experienced performance.
@dougmercer3 ай бұрын
Yup, understood. I've done another video on Cython, Numba, mypyc, and Taichi. Feel free to try implementing this in Torch... would love to see it. Also, this is just for fun... not a "pissing contest"
@6IGNITION97 ай бұрын
Great video. How do you animate the code?
@dougmercer7 ай бұрын
My current code animation process is a bit of a pain. I made a custom Pygments formatter to create a file that I can copy/paste into my video editor (Davinci Resolve) that makes all the text+ objects be colored appropriately, and then I manually move things around or fade in/fade out. In the past I've used manim. That also was kind of a pain. I just started working on a new approach, but it's gonna be awhile before I even know if it's a good idea or not
@wanfuse7 ай бұрын
try Cython and serializing the code perhaps? seen this sort of thing make a big difference, also profiling the code, also 13GB, if you don't want to bother with chunking then read into memory ahead of time. If nothing else it tells you whether you're I/O bound or not
@dougmercer7 ай бұрын
I would def be interested in seeing a Cython version! I do think it's possible to beat this implementation if you can do multithreading instead of multiprocessing... I don't have time to implement it but you're welcome to try!
@kierangainer6 ай бұрын
Hey man, this is great content and I’m surprised it hasn’t been pushed to my feed earlier. Keep it up Also 8k subs and a Brilliant sponsorship? Cool shit lolol
@dougmercer6 ай бұрын
Thanks! =] And yeah, I was thankful -- I got two different sponsors around 4k subscribers and turned down a few others. I'll take it as a sign that I'm doing something right ¯\_(ツ)_/¯
@Bernarditete7 ай бұрын
I would love to see a Mojo implementation
@dougmercer7 ай бұрын
I do plan to try Mojo in some future videos. I have two requirements before covering them: language is open sourced (recently done) and they have a stable v1 release (hopefully sometime soon)
@GLOCKSURU7 ай бұрын
What is the font/theme you use in the images of code? It is so nice.
@dougmercer7 ай бұрын
I use Anonymous Pro font (fonts.google.com/specimen/Anonymous+Pro) and nord-base16 colors when syntax highlighting with pygments (github.com/idleberg/base16-pygments). Nord style is pretty close to nord-base16 though and is more common. (One minor caveat about the colors: the mapping between tokens and colors is out of date for that repo, so I fixed the colors for nord-base16 on a personal fork).
@mutatedllama7 ай бұрын
Amazing video, thanks for posting. Learning about polars and duckdb gave me a real-world takeaway that I could bring to my job. Liked, subscribed and saved!
@dougmercer7 ай бұрын
Awesome! Glad to hear =]
@sharjeel_mazhar7 ай бұрын
Great video sir! 🔥 I've a video request for you. Can you please make a video about coding time critical parts in let's say c++ and then call it from python to save time. There could be many use cases, where we want to do something and python takes forever and the same task can fly through using c++. I hope you understand what I'm tryna say? Putting simply: Extending Python with C++ or any other language for that matter let's say Java
@dougmercer7 ай бұрын
I don't have a video entirely dedicated to that, but I do have one titled "Compiled Python is FAST" which includes discussions of Cython, which can let you include plain C or C++ very easily. There are other options for making C extension libraries though. Hope that helps!
@ali.moumneh5 ай бұрын
your video quality is top notch, im sure it will soon equate to video views if you keep this up, good luck.
@dougmercer5 ай бұрын
Thanks! I hope so too 🤞
@fatcats77276 ай бұрын
Just wanted to say, all of your videos are incredibly clean and well edited, and although the algorithm isn’t picking it up rn, your efforts will not go unnoticed!
@dougmercer6 ай бұрын
Thanks so much =]
@alexnolfi37306 ай бұрын
did you test out pandas to see how much slower it was than polars?
@dougmercer6 ай бұрын
It's way slower. Using the pandas implementation in here github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc%2Fmain.py takes about 150s, whereas the polars implementation takes 11-12s
@oliviarojas70237 ай бұрын
Yo Doug... the repo is only showing in your recent commits... not sure if that was intentional, but it took me an extra click to get there haha... about .05 extra seconds, and I think we can do better.
@dougmercer7 ай бұрын
Oh hmm, I think I put it under my dougmercer-yt organization instead of my dougmercer user. Sorry for the confusion, but glad you found it =] Oh, and good luck! I'd love for someone to get this down to like 5 seconds.
@oliviarojas70237 ай бұрын
Oh I can't beat that haha.... I was being stupid about the extra time it took to get to your repo.... I was just goofin though ;].... love your channel btw..... just found you and you are my new go to... low level is a great name for what I was looking for! Cheers
@oliviarojas70237 ай бұрын
Oh shit I was thinking of another channel I recently ran into @lowlevellearning ... yall both got the chops though.... Doug mercer is a good name too hahaha sorry
@dougmercer7 ай бұрын
LLL is great too =]
@anneallison64026 ай бұрын
How do you do the code animations?
@dougmercer6 ай бұрын
I've used two different approaches for animating code. 1. In my early videos I used the `manim` library. The community edition has a Code object. 2. In recent videos, I created a custom Pygments formatter that outputs the syntax highlighted code as a Davinci Resolve Fusion composition. Both approaches have a lot of problems. I'm currently writing my own animation library. I may make a video about it soon (but I would probably not be open sourcing the code). Another option you may find useful is reveal.js. That lets you write code animations in JavaScript, and even has an "autoanimate" feature that works OK. However, since that's more for live presentations, you would need to screen record if you wanted to make a video
@testkj5 ай бұрын
What If you just read every x+n(core) line according to available cores and calc min/max and avg? On a 8 Core CPU Core 1 calcs line 1, 9, 17... Core 2 calcs line 2, 10, 18... And so on. Skipping lines won't cost that much Processing Power. No need to chunk it and calculate where the chunk ends and stuff. Also: is it allowed to create a DB out of the Data?
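The interleaving idea in this question can be sketched with `itertools.islice`. The helper name `lines_for_worker` is hypothetical and workers are numbered from 0 here rather than from 1. One caveat worth stating plainly: line boundaries are only known after scanning for newlines, so each worker in this sketch still reads the whole file and merely skips the parsing of lines it does not own.

```python
from itertools import islice


def lines_for_worker(path: str, worker: int, n_workers: int):
    # Worker k gets lines k, k + n_workers, k + 2 * n_workers, ...
    # The file is still read in full to find where each line starts.
    with open(path, "rb") as f:
        yield from islice(f, worker, None, n_workers)
```

Byte-range chunking avoids that repeated scan, since each worker can seek straight to its own region of the file; whether the simpler striding is fast enough in practice would have to be measured.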
@dougmercer5 ай бұрын
Give it a shot! Repo is in the description. As for a DB, I'd say you need to include creating the database and ingesting the data in your timing to be fair
@alexanderpotts84255 ай бұрын
I'm actually surprised the simple cpython script you started with was under 10 minutes
@dougmercer5 ай бұрын
Honestly me too
@mightygreen33644 ай бұрын
You know when PyPy is slower? When you are in a coding competition and a bug in PyPy causes you to get a Runtime error that you can't possibly find, because it doesn't exist on your machine... Definitely no personal experiences here and Pypy is still great.
@dougmercer4 ай бұрын
Hahahaha oh no 💀
@frd857 ай бұрын
informative video with nice summoning salt vibes. good job.
@dougmercer7 ай бұрын
Thanks =] (and sorry if summoning salt music is stuck in your head now)
@irfanfauzi87047 ай бұрын
This is great. But did I miss numpy in your vids ?
@dougmercer7 ай бұрын
It wouldn't help with this problem, because so much of the work is IO + dealing with scalars
@irfanfauzi87047 ай бұрын
Interesting. I should learn more. Thanks for replying
@Finnnicus7 ай бұрын
great production value doug! you'll get many more views if you keep it up
@dougmercer7 ай бұрын
Thanks! I hope so 🤞
@mrjson30397 ай бұрын
First time channel watcher here. Amazing video, thanks for this superb piece of content Mister *checks notes* "Python Jack Black"
@dougmercer7 ай бұрын
HAHAHAHA oh man. I guess I'll take it
@wlockuz44677 ай бұрын
When I see videos like this, I feel like I know nothing about programming. I have been a software engineer for over 3 years now.
@dougmercer7 ай бұрын
It's never too late to learn new stuff! Play with a new library or start a project that's way different than your usual work. I used to only know Excel, Visual Basic, and MATLAB. Over time, I found excuses to experiment with Python, Linux, git, and Docker, and I became a much better developer because of them. Three years is still super early in your career. Continuous learning and intellectual curiosity are the most important skills a dev can have.
@MartialBoniou6 ай бұрын
I filter 7 GB of Amazon books TSV data in 5 or 6 seconds with AWK (mawk or GNU awk; on an outdated MacBook Air M1). Otherwise, +1 for DuckDB (not sponsored)
@dougmercer6 ай бұрын
I do think filtering is an easier task than aggregating. These folks seemed to have a hard time getting a particularly fast awk implementation github.com/gunnarmorling/1brc/discussions/171 . I am not an awk wizard though so I can't really assess how good their code is
@danklynn6 ай бұрын
Excellent editing and presentation. Thanks!
@dougmercer6 ай бұрын
Thanks =]
@harzer992 ай бұрын
The whole trick about performant python code is calling as little native python code as possible.
@aze43084 ай бұрын
1 trillion row challenge when
@dougmercer4 ай бұрын
I'm cheap and hate paying for AWS 😬
@cinderwolf323 ай бұрын
Wondering about the efficiency of using the sum(), min(), and max() functions over chunks of the array/file rather than with only two operands.
@dougmercer3 ай бұрын
Give it a shot! You can clone the repo in the description
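A rough way to set up that comparison, as a sketch only: `aggregate_two_operand`, `aggregate_chunked`, and `VALUES` are made-up names, and the data is synthetic rather than the challenge file. Both functions should return the same (min, max, sum), so the only open question is which one runs faster.

```python
def aggregate_two_operand(values):
    """Update min, max, and sum one value at a time."""
    lo = hi = values[0]
    total = 0.0
    for v in values:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
        total += v
    return lo, hi, total


def aggregate_chunked(values, chunk_size):
    """Call the built-in min(), max(), and sum() once per chunk."""
    lo = hi = values[0]
    total = 0.0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        lo = min(lo, min(chunk))
        hi = max(hi, max(chunk))
        total += sum(chunk)
    return lo, hi, total


# Synthetic readings in steps of 0.5, so the float sums are exact.
VALUES = [((i * 37) % 200 - 100) / 2 for i in range(1000)]

two_operand = aggregate_two_operand(VALUES)
chunked = aggregate_chunked(VALUES, 64)
```

Timing each function with `timeit` on a larger list would show whether the chunked built-ins pay for the extra slicing. Note that in the real challenge the values also have to be grouped by station first, which this sketch leaves out.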
@Macatho5 ай бұрын
I'm curious how fast this would run with a GPU implementation. I loved this video, hope you'll extend it with a GPU implementation :)
@dougmercer5 ай бұрын
Someone did a Dask + cuDF implementation. Seems super fast github.com/gunnarmorling/1brc/discussions/487
@marlan__6 ай бұрын
Is polars multi processed? Is that something it does automatically or could we see the same improvements by running that multiprocessed too?
@dougmercer6 ай бұрын
I believe it is multithreaded in Rust, which saturates all the cores. So, I wouldn't expect multiprocessing it in Python to help
@cottawalla6 ай бұрын
I couldn't get your opening "performance critical python" out of my head and so missed the entire rest of the video.
@dougmercer6 ай бұрын
¯\_(ツ)_/¯
@Almondz_6 ай бұрын
What application do you use for the code block display?
@dougmercer6 ай бұрын
Hah, so... It's a bit complicated. My current approach for animating code is to use a custom Pygments formatter to create a DaVinci Resolve Fusion setting file that I can copy/paste into my video editor, then edit it in DaVinci Resolve. This approach has a lot of flaws (very hard to find which text+ node has the token I want, very slow to render). In my old videos, I animated code using the Python library `manim`. This also had a lot of flaws (inconsistent behavior, difficult to preview what I'm doing, difficult to deal with things at token level). I'm currently working on making my own text animation library similar to manim, but more tailored to what I need for my videos. I've made good progress, but it's still a WIP. There are other off-the-shelf options that might work for you depending on what you're trying to accomplish (e.g., reveal.js)
@Almondz_6 ай бұрын
@@dougmercer Oh, that's really cool! Do you have a way I can contact you?
@dougmercer6 ай бұрын
@Almondz_ sure, check my channel's "about" section for my email
@SimpaTheImba5 ай бұрын
14:10 my man, can't agree more
@dougmercer5 ай бұрын
Hahaha absolutely =]
@vitorsilva-or1dj7 ай бұрын
thanks for the video!
@dougmercer7 ай бұрын
Thanks for watching and commenting =]
@leosh90263 ай бұрын
Thank you Doug, very cool :)
@dougmercer3 ай бұрын
No prob! Thanks for watching =]
@joseduarte98236 ай бұрын
Depending on how large the total sum actually is, using an incremental mean may yield better performance since python won’t need to upgrade the number to a big int
@dougmercer6 ай бұрын
Neat idea... It's worth a shot! Feel free to fork the repo and give it a try
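The incremental mean being suggested can be sketched like this. `running_mean` is a hypothetical helper, not something from the repo. It keeps only the current mean and a count instead of a growing total, at the cost of one division per value, so whether it is actually faster than a running sum would need to be measured.

```python
def running_mean(values):
    """Update the mean one value at a time without storing a total."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        # Move the mean toward x by 1/n of the difference.
        mean += (x - mean) / n
    return mean


readings = [12.5, -3.0, 7.5, 0.0]
incremental = running_mean(readings)
direct = sum(readings) / len(readings)
```

The two results agree up to floating-point rounding; they are not guaranteed to be bit-for-bit identical.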
@Dan_Diaconescu6 ай бұрын
Amazing video my dude, keep it up!
@dougmercer6 ай бұрын
Thanks! Will do =]
@ardenthebibliophile7 ай бұрын
Would've loved to see a pandas attempt just as a benchmark
@dougmercer7 ай бұрын
It's bad... haha. You can try running the pandas version here, github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc%2Fmain.py
@ardenthebibliophile7 ай бұрын
@@dougmercer not particularly interested in downloading and running it myself. Is the result posted somewhere? Hard to find from the git repo alone
@dougmercer7 ай бұрын
@@ardenthebibliophile I will run it later after I settle in
@ardenthebibliophile7 ай бұрын
@@dougmercer I really appreciate it. Also, I am a recent viewer of the channel, saw you discuss on Reddit. I very much appreciate your editing style. Well done
@dougmercer7 ай бұрын
@@ardenthebibliophile Thanks so much! I appreciate it =] On the topic of pandas: I just ran three trials that each took around 2:30 (150s). A few caveats: * I have a bunch of Chrome windows open and am doing some other tasks (whereas for the video I did 5 trials and took the average of the middle three, with nothing else running besides the terminal and background processes). * I didn't bother to format the output correctly (but that doesn't take more than a fraction of a second anyway). So, quite a big jump from 150s for pandas to ~11-12s for polars. Hope that helps! (and thanks again for the nice comments!)
@AntonioZL7 ай бұрын
My main takeaway from this video is that Python is much faster than I thought, and I say this as a Python back-end developer. 9 minutes with the most trivial implementation against 3-ish from Java? I'll take that. I definitely expected 20+ minutes lol
@dougmercer7 ай бұрын
I was shocked when the PyPy + pure python approach broke the 10 second mark...
@ABaumstumpf5 ай бұрын
"9 minutes with the most trivial implementation against 3-ish from Java?" But the java-code also was intentionally slow.
@ze_chrome4 ай бұрын
I'd love to see that in Bend, it should be quick af
@dougmercer4 ай бұрын
I'm looking forward to playing around with Bend
@incremental_failure7 ай бұрын
Numba comparison would've been interesting, probably combined with numpy in the compiled function.
@dougmercer7 ай бұрын
Hmmm, I'm not sure off the top of my head how I'd do it. I worry that file I/O would make it hard to only use valid Numba. That said, I am a big fan of Numba! I did another video (Compiled Python is FAST) and it showed how awesome Numba can be
@wussboi5 ай бұрын
Numba only works with numerical compute, whereas this task is primarily parsing text.
@incremental_failure5 ай бұрын
@@wussboi Numba very much can work with strings, maybe you're confusing it with numpy.
@wussboi5 ай бұрын
@@incremental_failure thank you. I looked it up and you are indeed correct! 👍 The Numba docs do come with a caveat, though: "The performance of some operations is known to be slower than the CPython implementation. These include substring search (in, .contains() and find()) and string creation (like .split()). Improving the string performance is an ongoing task, but the speed of CPython is unlikely to be surpassed for basic string operation in isolation. Numba is most successfully used for larger algorithms that happen to involve strings, where basic string operations are not the bottleneck."
@willymcnamara14296 ай бұрын
interesting video! thank you Doug 🤝 🐍
@dougmercer6 ай бұрын
Thanks!
@DareDevilPhil6 ай бұрын
I feel like this should be a single core challenge for purity. I'm still watching though, see if I change my mind by the end.
@dougmercer6 ай бұрын
So, did you change your mind by the end?
@DareDevilPhil6 ай бұрын
@@dougmercer can't say I did :)
@dougmercer6 ай бұрын
@@DareDevilPhil hahaha fair enough =]
@elver50415 ай бұрын
The song at 9:16 goes hard, does anyone know what it's called?
@dougmercer5 ай бұрын
Ooyy - Top Funnel
@vlc-cosplayer6 ай бұрын
Me, approaching this as an engineer: - Read a random subset of the data - Do the computation on that - Yeah that's close enough lmao, interpolation will take care of missing values
@dougmercer6 ай бұрын
Hah! Working smarter not harder 🚀
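Here is a small sketch of the "random subset" idea, using synthetic data rather than the challenge file. `sample_stats` and `readings` are hypothetical names, and the 10% fraction is only an example. The sampled mean usually lands close to the true mean, while the sampled min and max can only be as extreme as the true ones or less, so they tend to be off.

```python
import random


def sample_stats(values, fraction, seed=0):
    """Estimate (min, max, mean) from a random subset of the values."""
    rng = random.Random(seed)
    k = max(1, int(len(values) * fraction))
    sample = rng.sample(values, k)
    return min(sample), max(sample), sum(sample) / k


# Synthetic temperatures, uniformly spread between -50 and 50.
_rng = random.Random(42)
readings = [_rng.uniform(-50.0, 50.0) for _ in range(100_000)]

est_min, est_max, est_mean = sample_stats(readings, 0.10)
true_min, true_max = min(readings), max(readings)
true_mean = sum(readings) / len(readings)
```

For the actual challenge the exact min and max per station are required, so this would count as an approximation rather than a valid solution.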
@vlc-cosplayer6 ай бұрын
@@dougmercer I remembered a video I watched about HyperLogLog. When working with extremely large datasets, a fast approximation may be more desirable than getting the correct answer, but only after a long time. 👀 It'd be interesting to measure how good an approximation you can get using only a fraction of the data. E.g., would using 10% of the data get 90% of the way to the correct answer? You probably don't need 100% accuracy all the time. In fact, your data may not even be 100% accurate to begin with! To put the cost of precision in perspective, getting 99% uptime is relatively easy (that's roughly 88 hours of downtime/year), but every additional 9 after that becomes exponentially more expensive. 99.9% is under 9 hours. 99.99% is under 1 hour, 99.999% is only about 5 minutes, go to the bathroom and you'll miss that. 💀
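The downtime figures are easy to check with a few lines of arithmetic. This assumes a 365-day year, and `downtime_hours` is just an illustrative helper name.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(uptime_percent):
    """Hours per year a service may be down at the given uptime level."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for level in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours(level)
    print(f"{level}% uptime: {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Each extra nine divides the allowed downtime by ten: about 87.6 hours, 8.76 hours, 53 minutes, and 5.3 minutes per year.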
@themanwhobeateinstein7 ай бұрын
Which Java exactly was it, I need to know so I can use it
@dougmercer7 ай бұрын
github.com/gunnarmorling/1brc?tab=readme-ov-file#results check out the top result. JDK 21.0.2-graal
@themanwhobeateinstein7 ай бұрын
@@dougmercer Thanks 👍
@__python__7 ай бұрын
Thanks @dougmercer for this video, but in the polars variation, the speed cannot be solely ascribed to the Python language, as you are likely aware of the underlying programming language employed by polars.
@dougmercer7 ай бұрын
I do say that Polars is implemented in Rust, and put it in the "Python-ish" section for that reason
@KaranBulani7 ай бұрын
Multithreading and multiprocessing are not supported in Python, correct? Due to the global interpreter lock. How did he do it at 3:56?
@dougmercer7 ай бұрын
Python fully supports multiprocessing. You basically just have to pay the overhead of serializing/deserializing data between the parent and child processes. Multi*threading* does not work well for CPU-bound work like this because of the GIL
@this-one6 ай бұрын
Would it count as Python if we write it as a module in C?
@dougmercer6 ай бұрын
I'm no philosopher, but this gives me Ship of Theseus vibes. So... maybe technically, but I don't feel good about it
@seansingh44217 ай бұрын
ChemE here not programmer, so would an llm inference server be faster and use comparatively lower resources if it was implemented in C++ than Python ?
@dougmercer7 ай бұрын
Hmm. There are a lot of moving parts to the question. Generally, a server-side ML workflow would be accelerated by GPUs (Nvidia graphics cards) or some other purpose-built chips (e.g. tensor processing units, TPUs). Code is structured so that it does as much processing on these purpose-built chips as possible, as they are faster or more energy efficient. In the case of Nvidia GPUs, machine learning frameworks like PyTorch effectively marshal the data to the GPU and then execute CUDA code, Nvidia's framework for doing computation on the GPU. Once there, Python or C is somewhat out of the loop, or at the very least not a significant bottleneck.
@BrianStDenis-pj1tq6 ай бұрын
Great video. I'd like to know how Java is faster than C.
@dougmercer6 ай бұрын
Thanks! I'm not 100% sure (I'm admittedly not qualified to speculate on Java/C performance optimization). The thoughts I've seen are: 1. Many people focused on Java (since it was the original challenge language), and fewer focused on C. So the Java implementation is super well optimized, and the C one may have left some potential improvements on the table. 2. The JVM JIT helped out ¯\_(ツ)_/¯
@emilsteen91826 ай бұрын
"I don't like these global variables." [Replaces them with magic numbers.]
@dougmercer6 ай бұрын
Yeah, I regret that part. (It didn't help performance anyway)
@kiffeeify7 ай бұрын
The root cause is the CSV file. Try doing this without parsing strings to floats, e.g. with Parquet or even uncompressed Arrow arrays :D
@dougmercer7 ай бұрын
I did at the end with duckdb and got about 5ish seconds. Definitely helped compared to 9, but still some work to do to achieve Java speeds
@robosergTV7 ай бұрын
why polars and not pandas? Pandas is a standard in Data Science, not polars
@dougmercer7 ай бұрын
Pandas would be terribly slow, because it cannot easily stream or parallelize the data ingest process. You are welcome to give it a go using this pandas implementation to see for yourself github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc%2Fmain.py
@dougmercer7 ай бұрын
Because another person asked for it, I just ran the pandas implementation linked below and it took about 2:30 (150s) compared to polars 11-12s.
@shahaed2 ай бұрын
Bro said “importing a lib is better than writing java code” then proceeded to write most of his logic in a SQL string 🤦♂️
@dougmercer2 ай бұрын
... did you look at the Java solution? I would absolutely prefer to write the couple line SQL statement than the 400 line Java solution, even if it's 6x slower