Spark + Parquet In Depth: Spark Summit East talk by: Emily Curtin and Robbie Strickland

69,785 views

Spark Summit

Comments: 24
@stephaniedatabricksrivera 3 years ago
Emily's Parkay butter pics made me laugh. Really enjoyed this. Great job Emily!!
@HasanAmmori 2 years ago
Fantastic talk! I wish there was a little more info on the format spec itself.
@gmetrofun 5 years ago
AWS S3 supports random-access reads (i.e., the Range header), so predicate pushdown also works against AWS S3.
@bnsagar90 4 years ago
Can you please share some text or a link where I can read more about this? Thanks.
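A minimal sketch of what this means in practice (the bucket, path, and column names are hypothetical): because the s3a connector fetches byte ranges on demand, Spark reads only the Parquet footers and the column chunks it actually needs, and the pushed-down predicates show up in the physical plan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-pushdown-on-s3")
  .getOrCreate()
import spark.implicits._

// Hypothetical Parquet dataset on S3, read through the s3a connector.
val events = spark.read.parquet("s3a://my-bucket/events/")

// Only the 'userId' column chunks, and only the row groups whose
// min/max statistics can contain eventType = 'click', are fetched.
events
  .filter($"eventType" === "click")
  .select($"userId")
  .explain()   // look for PushedFilters in the physical plan
```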
@Tomracc 2 years ago
this is wonderful, enjoyed start to end :)
@flwi 7 years ago
Wow, great presentation!
@manjunath15 6 years ago
Very informative and nicely articulated.
@amitbhattacharyya5925 2 years ago
Good explanations. It would be great if they could point to some example code in a Git repo.
@maa1dz1333q2eqER 6 years ago
Great presentation, touched a lot of important areas, thanks
@HughMcBrideDonegalFlyer 7 years ago
Great talk on a very important (and too often overlooked) topic
@tianzhang3120 3 years ago
Awesome presentation!
@clray123 6 years ago
Eh, so basically any sort of growing data can be partitioned in only one way (along the dimension of growth, which for many use cases will be some meaningless "autoincrement" id). That defeats all the push-down filtering for any other dimension. Not to mention that if your data keeps growing in small increments and you need access to the latest of it, you have to jump through hoops to integrate all those small increments into bigger files, because scanning 20,000 tiny files isn't going to be efficient. And that means lots of constant rewriting, which is why write speed DOES matter and it's not "write-once" but write-many...
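One common workaround for the small-file problem described above is a periodic compaction job that reads the accumulated small Parquet files and rewrites them as fewer, larger ones. A minimal sketch, assuming hypothetical paths, a date-partitioned layout, and a target file count tuned to the data volume:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("compact-small-parquet-files")
  .getOrCreate()

// Read the many small files written by incremental ingestion for one partition.
val small = spark.read.parquet("s3a://my-bucket/events/date=2017-02-08/")

// coalesce() reduces the number of output files without a full shuffle;
// 8 output files is an assumption, not a recommendation.
small.coalesce(8)
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/events_compacted/date=2017-02-08/")
```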
@betterwithrum 5 years ago
Where are the slides?
@TheAjit1111 5 years ago
Great talk, thank you
@bogdandubas3978 4 years ago
Amazing speaker!
@djibb.7876 7 years ago
Great talk!!! I set up a Spark cluster with 2 workers. I save a DataFrame using partitionBy("column x") in Parquet format to some path on each worker. The problem is that I am able to save it, but when I want to read it back I get these errors: - Could not read footer for file FileStatus ... - unable to specify Schema ... Any suggestions?
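The usual cause of those errors is writing to each worker's local disk, so no single machine ever sees the complete set of part files when reading back. A minimal sketch of the fix (a toy DataFrame and a hypothetical shared path stand in for the commenter's setup): write to storage that the driver and every executor can reach, then read from that same path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitionBy-roundtrip")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the commenter's DataFrame.
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("column_x", "value")

// Write to shared storage (HDFS, S3, NFS, ...), not file:/// on each worker;
// local paths leave every worker with only a slice of the part files, which
// is what produces the "Could not read footer" / missing-schema errors.
df.write
  .partitionBy("column_x")
  .mode("overwrite")
  .parquet("hdfs:///shared/output/events")   // hypothetical shared path

// Reading back from the same shared path recovers 'column_x'
// from the partition directory names.
val restored = spark.read.parquet("hdfs:///shared/output/events")
restored.printSchema()
```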
@pradeep422 6 years ago
The only thing I liked is the way Emily executed it.
@ardenjar7942 7 years ago
Awesome thanks!
@thomasgong5538 4 years ago
This has some value as a guide for learning.
@deenadayalmuli2756 6 years ago
In my experience, ORC supports nesting...
@mikecmw8492 6 years ago
Why is everyone a "Spark expert"?? Get real and just show us how to do it...
@betterwithrum 5 years ago
There are Spark experts, just few and far between. I've hired a few, but they were unicorns