In 1975, at the University of Delft, my professor and I collaboratively developed an assembler and interpreter for the computer practicum. We had to run this on an IBM mainframe, using a higher-level language. To make it functional, we had to employ extensive masking and shifting operations. I vividly remember the complex logical intricacies we had to navigate to get it all working correctly. From this hands-on experience, I truly admire your work and the effort it takes to even get it working.
@TimBradleyFromOz · 3 months ago
From 4 minutes 49 seconds 679 milliseconds >>> to >>> 1 second 535 milliseconds? Wow! Great talk, thanks!
@afterthesmash · 1 month ago
You forgot to mention "without using C++". That adds immensely to the impressive factor.
@serrrsch · 3 months ago
I've followed the competition on Twitter and GitHub, but this talk is just a gem in how it is told. Big up for the presentation/slide skills!
@LtdJorge · 3 months ago
If you optimize for specific arches, you can do the SIMD lookup with fewer instructions and much wider. For example, I'm using the memchr Rust crate by the genius BurntSushi, specifically the AVX2 implementation. It does loops of 4 sequential comparisons with 256-bit registers. The SIMD part is just 2 instructions.
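For readers without AVX2 intrinsics at hand, the same separator lookup is often done in plain Java with the scalar SWAR ("SIMD within a register") trick from Hacker's Delight, which many 1BRC entries used to find the ';' byte. A minimal sketch (class and method names are mine):

```java
public class SwarLookup {

    private static final long ONES  = 0x0101010101010101L; // 0x01 in every byte lane
    private static final long HIGHS = 0x8080808080808080L; // 0x80 in every byte lane

    // Sets bit 7 of the lowest byte lane of 'word' that equals 'target'.
    // (Lanes above the first match can contain false positives; that is fine
    // when, as in 1BRC, only the first separator in the word is needed.)
    static long matchMask(long word, byte target) {
        long broadcast = (target & 0xFFL) * ONES; // repeat target in all 8 lanes
        long x = word ^ broadcast;                // matching lanes become 0x00
        return (x - ONES) & ~x & HIGHS;           // classic zero-byte detector
    }

    // Index of the first matching byte (0 = least significant lane), or 8 if none.
    static int firstMatch(long word, byte target) {
        return Long.numberOfTrailingZeros(matchMask(word, target)) >>> 3;
    }

    public static void main(String[] args) {
        // "Abidjan;" packed little-endian: ';' sits in lane 7.
        long word = 0x3B6E616A64696241L;
        System.out.println(firstMatch(word, (byte) ';')); // prints 7
    }
}
```

Eight bytes per iteration with a handful of ALU ops, no platform-specific intrinsics needed.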
@TechTalksWeekly · 3 months ago
This is a brilliant talk and it's been featured in the last issue of Tech Talks Weekly newsletter 🎉 Congrats Roy!
@Syntax753 · 3 months ago
Kudos for mentioning Advent of Code! And yeah, most people can parse 1 Billion rows of weather data between every blink (and more if using strong Java)
@juveraey · 3 months ago
Even my slow brain can understand what you said. Great talk, thank you Roy van Rijn
@GeorgeTsiros · 2 months ago
29:16 when you've spent a lot of time coding on a device with no pipelining, no SIMD, no cache, no OOE, no hardware multiplication, no hardware float operations, not even indirect memory access, 4 MHz and 10 clock cycles to read _half_ a byte, you are basically forced to learn a lot of bit-twiddling
@afterthesmash · 1 month ago
I go all the way back to the SC/MP in the 1970s (the most basic controller I've ever used), but even that had an 8-bit memory bus and wasn't grinding away on half bytes. What pinprick of a silicon state machine were you using with those parameters?
@GeorgeTsiros · 1 month ago
@@afterthesmash Hewlett Packard 28/48 :D The whole deal started in 1987, believe it or not, and thank you for responding :D :D
@djchrisi · 3 months ago
The view count is testimony to what a fun challenge that was.
@aronhegedus · 2 months ago
The --worker trick is interesting! Learned some cool things in this one
@olivergooth3134 · 2 days ago
Can anybody tell me where I can get the 1 billion rows of weather data file? I know the challenge is over but I want to try it anyway.
@jpphoton · 2 months ago
very well done. one of the best imho. that's how we do it.
@robchr · 3 months ago
Wizard level optimizations
@RumberoEuropeo · 3 months ago
These tricks are great for graph analytics too!
@krystianlaskowski · 2 months ago
Challenge and Setup: The speaker participated in the "1 billion row challenge" using vanilla Java with no libraries. The task was to parse a file with 1 billion rows, each containing a city name and temperature, and compute the minimum, maximum, and average temperature for each city.
Initial Optimization: The initial baseline implementation took about 5 minutes to parse the file. By optimizing to use multiple threads and a concurrent map, the runtime was reduced to 2 minutes.
Advanced Techniques: Further improvements included using memory-mapped files and unsafe operations to bypass boundary checks for faster access, and leveraging Single Instruction, Multiple Data (SIMD) techniques to process multiple bytes simultaneously, significantly speeding up parsing.
Branchless Programming: Implemented branchless programming to avoid CPU pipeline stalls, using bitwise operations for conditions and calculations. Avoided string allocations by working directly with raw memory and integer values.
Final Optimizations: Used forward probing for efficient hash map operations, eliminating collisions. Adopted a worker thread setup to handle file unmapping separately, further reducing processing time. Achieved a final runtime of 23 seconds using these combined optimizations and Java's advanced features.
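The "no string allocations, work on raw integer values" point can be illustrated with the common 1BRC parsing idiom (this is not the speaker's exact code, and fully branchless variants fold even the sign and length checks into bit tricks): measurements always have the fixed shape -?d?d.d, so they parse straight from bytes into a scaled integer:

```java
public class TempParser {

    // Parses an ASCII temperature of the fixed shape "-?d?d.d" into tenths of
    // a degree, e.g. "-12.3" -> -123, "5.0" -> 50. No String is allocated and
    // no floating point is used; min/max/sum can then stay in plain ints/longs.
    static int parseTenths(byte[] b, int off, int len) {
        int sign = 1;
        if (b[off] == '-') {        // optional sign
            sign = -1;
            off++;
            len--;
        }
        int value;
        if (len == 3) {             // "d.d"
            value = (b[off] - '0') * 10 + (b[off + 2] - '0');
        } else {                    // "dd.d"
            value = (b[off] - '0') * 100 + (b[off + 1] - '0') * 10 + (b[off + 3] - '0');
        }
        return sign * value;
    }

    public static void main(String[] args) {
        byte[] line = "-12.3".getBytes();
        System.out.println(parseTenths(line, 0, line.length)); // prints -123
    }
}
```

Keeping everything in tenths means the averages can be computed from a long sum and an int count, with a single division at output time.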
@palharez · 3 months ago
Really great talk, thanks for sharing this knowledge
@thesilentnerd4618 · 2 months ago
Loved every second. ❤
@Nashadelicable · 3 months ago
This was so much fun watching
@aronhegedus · 2 months ago
I'm still quite confused as to how you can read in 15 GB of text in
@afterthesmash · 1 month ago
From Gemini Advanced: The maximum theoretical bandwidth for NVMe depends on the PCIe generation and the number of lanes used. PCIe 5.0 x4: This is currently the fastest configuration, offering a maximum theoretical bandwidth of up to 16 GB/s. You can have more than one NVMe bus, as well. On a single NVMe drive, bandwidth scales linearly with the number of chips used.
Gemini again: For consumer-grade SSDs, the Samsung 990 PRO with PCIe Gen4 interface offers up to 7.45 GB/s read speeds, making it one of the fastest consumer SSDs currently available. You probably issue 128 concurrent reads of 128 MB each to achieve max performance. The SSD will strive to keep each individual chip as busy as possible.
Gemini: If you utilize PCIe bifurcation and add-in cards (like the ASUS Hyper M.2 x16 card), you can potentially fit up to 4 NVMe x4 devices on a single Threadripper mainboard. That's in addition to any M.2 slots, and there might be several of these, too.
You are clearly behind the times. So was I, but I've discovered chatbots.
@aronhegedus · 1 month ago
@@afterthesmash haha thanks!
@ericm97 · 3 months ago
What a journey. Lovely talk 🙂
@saschazapf5232 · 2 months ago
Hi Roy, remember the good old Redcode days? Hope you're doing well. Nice topic btw
@royvanrijn · 2 months ago
@@saschazapf5232 of course; we should revisit that 🙌🏼
@saschazapf5232 · 2 months ago
@@royvanrijn Revisit would not be the right word for it. I really do think about cw quite often. Two years ago I did some research into a better number of rounds for the hill, because of the score deviation on different seeds at 200 rounds. After that I had an idea about another approach to optimizing paper, but I didn't have enough time back then. Now I'm trying to polish up my Java skills and have been consuming some videos. Imagine my face when I saw your face and name on my YT landing page. I saw a picture of you once, literally 15 years ago, but I instantly knew it was you.
@geoffxander7970 · 3 months ago
When a software engineer stumbles upon the dark arts of real computer science... The only thing that comes to mind not talked about was AVX2 or SSE4.x (I don't know if they're supported natively in Java).
@NostraDavid2 · 2 months ago
I didn't see CompSci though? No math theorems, etc. Just pure engineering.
@wsollers1 · 3 months ago
This was an amazing talk
@chauchau0825 · 3 months ago
This is gold
@phpngpl · 2 months ago
💪💪💪GraalVM
@-_James_- · 2 months ago
I'm glad that constant panning left to right of the camera isn't distracting. Otherwise I might have got irritated and not bothered watching the video. Oh wait...
@Abhigyan103 · 3 months ago
How can I learn Java at this advanced level? Every course just teaches object-oriented programming
@khuntasaurus88 · 2 months ago
@@Abhigyan103 by writing code. Just. Write. Code. Pick an idea and implement it. Then benchmark it and try to optimize it
@GeorgeTsiros · 2 months ago
From what I can tell, it is not _Java_ that you want to learn, but low-level programming. Or, just, you know, _programming_
@acommoncommenter9364 · 1 month ago
This is not Java, this is C written in Java. Don't try to learn this (at least not in the name of learning Java). If you want to be able to do this kind of stuff with bytes and bits, learn C
@AlLiberali · 2 months ago
The baseline impl takes just 5 minutes to process _1 billion_ lines. I'd say the hardware is optimised enough; No need for such hacks in production code
@androth1502 · 3 months ago
native compilation... is it really java anymore?
@sjzara · 3 months ago
@@androth1502 of course.
@zombi1034 · 2 months ago
Java does native compilation anyway using its JIT (just-in-time compiler). The difference with GraalVM is that you do the compilation ahead of time. That's why for long-running Java programs there is not really a big performance difference between a native image and your regular Java program: all hot code (code that gets executed regularly) gets compiled to native code anyway. The Java VM just takes a bit of time to warm up.
@sjzara · 2 months ago
@@zombi1034 That's not really how it works. Runtime compilation can produce far better performance for long-running programs than ahead-of-time compilation. Hot code is not just compiled to native code: the compilation is speculative and can be redone if circumstances change. There can be things like inlining of method calls that are possible speculatively but not ahead of time.
@androth1502 · 2 months ago
@@zombi1034 Ah, ok. I thought one did native compilation and the other targeted the JVM. If they are both JVM then my point was irrelevant.
@Diego-Garcia · 1 month ago
If someone created an AOT compiler for Java that outputs native machine code instead of the JVM's intermediary binary code, why wouldn't it still be Java? .NET 8 (released in 2023) literally did this with C#.
@akshaysom · 2 months ago
For my little brain this is sorcery 😂
@jamesgoodrich6927 · 1 month ago
The 'mechanical sympathy' quote... has 2 'be's in it.
@royvanrijn · 1 month ago
Yeah, Jackie Stewart was a notorious stutterer (j/k I'll update!)
@meryplays8952 · 3 months ago
ok, how about Golang?
@HalfMonty11 · 25 days ago
Lol, what I'm picking up is that to write really fast Java you have to use every trick in the book to basically write something that barely resembles Java
@1Eagler · 2 months ago
If you need Java to go really fast, change it to C
@InXLsisDeo · 1 month ago
You won't get anywhere below 60 seconds in C if all you do is write C and you don't apply the exact same optimizations as those described in the talk. There is a blog post by a guy doing it in Rust, and his first naive attempt took 105 s. With reasonable optimizations and parallelizing, he got down to about 20 s (on 6 cores). Anything below that involved crazy optimizations for parsing, SIMD, a specific hashmap implementation, multiple analyses of the assembly code, etc.
@nightking4615 · 3 months ago
Using memory maps is cheating!
@LtdJorge · 3 months ago
The contest results were taken by putting the file in a tmpfs, so mmapping is basically free.
@TehGettinq · 3 months ago
@@LtdJorge How is it free? You still need to pay for a copy and for page faults, no?
@Knirin · 2 months ago
@@TehGettinq Heavy internal optimizations in the Linux tmpfs kernel driver. Simplifying things, the tmpfs driver stores files directly in the page cache memory area and uses the virtual memory subsystem for access protection. Yes, page faults happen, but they aren't much slower than jumping to another function outside of the currently loaded page, compared to a full seek to disk.
@TehGettinq · 2 months ago
@@Knirin Would you have a source? I want to read about it
@Knirin · 2 months ago
@@TehGettinq A single easy-to-read one, no. I suggest searching for the differences between "initrd" and "initramfs". An LWN article about those is where I became aware of the original differences. Tmpfs has had more optimizations since then.
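For context, the memory mapping being debated in this thread is a one-liner in Java NIO. A minimal sketch (demo names are mine) that maps a file read-only and scans it, the same access pattern the contest entries use per thread; with the file already in tmpfs/page cache, the scan never touches the disk:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {

    // Map the whole file read-only and count ';' separators as a trivial scan.
    // Reads go straight through the page cache: no read() syscall per chunk.
    static long countSemicolons(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long count = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == ';') count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: write a temp file and return its path.
    static Path writeTemp(String content) {
        try {
            Path p = Files.createTempFile("mmap-demo", ".txt");
            Files.writeString(p, content);
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path p = writeTemp("Hamburg;12.0\nBulawayo;8.9\n");
        System.out.println(countSemicolons(p)); // prints 2
    }
}
```

Note that `FileChannel.map` caps a single mapping at 2 GB (buffer positions are ints), which is why entries either split the file into multiple mappings per thread or use the newer MemorySegment API.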
@Tony-dp1rl · 3 months ago
Any language that cannot make this entirely I/O-limited, bounded by reading the billion rows from disk in parallel, should be shamed. This would run at I/O speed in JavaScript, Java, C#, Python, Turbo Pascal, even Lua could do this. :)
@LtdJorge · 3 months ago
The results are gathered with the file on a tmpfs, so no, most languages wouldn’t run this at I/O (memory) speed.