Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale

7,823 views

Coiled

a day ago

Comments
@randywilliams7696 · 11 months ago
Great video! Recently switched from Dask to DuckDB on my ~1TB workloads, so it was interesting to see some of the same issues I found brought up here. One gotcha I've found is that it is REALLY easy to blunder your way into non-performant queries in Dask (things that end up shuffling, repartitioning, etc. a lot behind the scenes). It was more straightforward for my use case to write performant SQL queries for DuckDB, since that is much more of a common, solved problem. The scale-out feature of Dask and Spark is interesting too, as we are considering the merits of a natively clustered solution vs. just breaking our queries into chunks that fit on multiple single instances for DuckDB.
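A minimal sketch of the kind of single-machine DuckDB query over Parquet described above; the file path and column names are hypothetical:

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB database
top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('data/orders/*.parquet')   -- hypothetical path
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").df()  # materialize the result as a pandas DataFrame
print(top_customers)
```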
@MatthewRocklin · 11 months ago
Yup, totally agreed. The query optimization in Dask DataFrame should now handle the issues you ran into historically. The problem wasn't unique to you :)
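A minimal sketch of a Dask DataFrame query that the newer query optimizer can rewrite; the dataset path and column names are hypothetical:

```python
import dask.dataframe as dd

# Hypothetical dataset path and column names.
df = dd.read_parquet("s3://my-bucket/orders/")

# With recent Dask versions the DataFrame query optimizer prunes unused
# columns and pushes this filter toward the Parquet read, avoiding the
# accidental shuffles described above.
out = (
    df[df.order_status == "F"]
    .groupby("customer_id")["amount"]
    .sum()
    .nlargest(10)
)
print(out.compute())
```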
@ravishmahajan9314 · 10 months ago
But what about distributed databases? Is DuckDB able to query distributed databases? Is this technology replacing the Spark framework?
@andrewm4894 · 1 year ago
Great talk, thanks
@rjv · 11 months ago
Such a good video! So many good insights clearly communicated with proper data. Also love the interfaces you've built: very meaningful, clean, and minimalistic. Have you got comparison benchmarks where cloud cost is the only constraint and the number of machines or their size and type (GPU machines with cuDF) is not restricted?
@richerite · 5 months ago
Great talk! What would you recommend for ingesting about 100-200 GB of geospatial data on-premises?
@mooncop · 1 year ago
you are most welcome (suffered well) worth it for the duck
@taylorpaskett3703 · 11 months ago
What software did you use for generating / displaying your plots? It looked really nice.
@taylorpaskett3703 · 11 months ago
Never mind, I just kept watching and you showed the GitHub repo, which says Ibis and Altair. Thanks!
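For reference, a minimal sketch of combining Ibis and Altair to produce a chart like the ones in the talk; the benchmark numbers are made up purely for illustration:

```python
import altair as alt
import ibis
import pandas as pd

# Made-up timings purely for illustration.
timings = pd.DataFrame({
    "library": ["Spark", "Dask", "DuckDB", "Polars"],
    "seconds": [42.0, 35.0, 18.0, 21.0],
})

t = ibis.memtable(timings)   # wrap the data in an Ibis table expression
df = t.execute()             # evaluate back to pandas for plotting

chart = alt.Chart(df).mark_bar().encode(x="library:N", y="seconds:Q")
chart.save("benchmark.html")  # write an interactive chart to disk
```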
@FabioRBelotto · 5 months ago
My main issue with Dask is the lack of community support (very different from pandas!).
@ravishmahajan9314 · 10 months ago
DuckDB is good if your data fits on a single machine, but the benchmarks show a different story when the data is distributed. What about that?
@o0o0oo00oo00 · 1 year ago
I don't see DuckDB and Polars beating Spark and Dask at the 10 GB level in my practical usage. 😅 We can't always trust TPC-H benchmarks.
@bbbbbbao · 1 year ago
It's not clear to me if you can use autoscaling with Coiled.
@Coiled · 1 year ago
You can use autoscaling with Coiled. See the `coiled.Cluster.adapt` method.
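A minimal sketch of adaptive scaling with Coiled via the `adapt` method mentioned above; the worker counts here are hypothetical:

```python
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=4)   # start with a small cluster
cluster.adapt(minimum=4, maximum=20)    # scale between 4 and 20 workers with load
client = Client(cluster)

# Run Dask work as usual; the cluster grows and shrinks automatically.
```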
@kokizzu · 9 months ago
Clickhouse ftw
@maksimhajiyev7857 · 8 months ago
The problem is that Rust-based tooling actually wins, and the paid promotions just drown that out. The reason Rust-based tooling gets somewhat suppressed is simple: hyperscalers (big cloud vendors) earn a lot of money, and if things ran faster there would be no huge bills for your Spark clusters 😊. I've been playing with Rust and huge datasets myself, without external benchmarks, because I don't trust all this marketing. Rust-based EDA may look like witchcraft, but it runs like a beast. Try it yourselves with huge datasets.