Skew Mitigation for Facebook Petabyte-Scale Joins

Databricks


Uneven distribution of input (or intermediate) data often causes skew in joins. In Spark, this leads to very slow join stages in which a few straggling tasks can take far longer than the rest to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, data skew pushes runtime latencies to the order of multiple hours and even days. Over the course of the last year, we introduced several state-of-the-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40% and expanded Spark adoption for numerous latency-sensitive pipelines. In this talk, we’ll take a deep dive into Spark’s execution engine and share how we’re gradually solving the data skew problem at scale. To this end, we’ll discuss several Catalyst optimizations around implementing a hybrid skew join in Spark (which broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach to extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco: Facebook’s petabyte-scale shuffle service (maxmind-databr....
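The hybrid skew join idea described above can be illustrated with a minimal sketch in plain Python. This is not Spark code and not the implementation presented in the talk: the function names (`find_skewed_keys`, `hash_join`, `hybrid_skew_join`), the frequency `threshold`, and the sample rows are all hypothetical, chosen only to show the split between a broadcast-style path for skewed keys and a shuffle-style path for everything else.

```python
from collections import Counter, defaultdict


def find_skewed_keys(rows, threshold=3):
    """Return keys that occur more than `threshold` times (hypothetical cutoff)."""
    counts = Counter(key for key, _ in rows)
    return {key for key, count in counts.items() if count > threshold}


def hash_join(left, right):
    """Plain inner hash join on the first field of each (key, value) row."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def hybrid_skew_join(left, right, num_partitions=4, threshold=3):
    """Join skewed keys via a broadcast-style lookup and the rest via partitioning."""
    skewed = find_skewed_keys(left, threshold)

    # Skewed keys: the matching right-side rows are "broadcast" as one small
    # table, so the heavy left-side rows never funnel into a single partition.
    left_skewed = [row for row in left if row[0] in skewed]
    right_skewed = [row for row in right if row[0] in skewed]
    result = hash_join(left_skewed, right_skewed)

    # Non-skewed keys: both sides are "shuffled" by hash(key) and joined
    # partition by partition, as in an ordinary shuffle join.
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for row in left:
        if row[0] not in skewed:
            left_parts[hash(row[0]) % num_partitions].append(row)
    for row in right:
        if row[0] not in skewed:
            right_parts[hash(row[0]) % num_partitions].append(row)
    for p in range(num_partitions):
        result.extend(hash_join(left_parts[p], right_parts[p]))
    return result


# Hypothetical sample data: the key "hot" dominates the left side.
left = [("hot", i) for i in range(10)] + [("a", 100), ("b", 200)]
right = [("hot", "H"), ("a", "A"), ("b", "B"), ("c", "C")]

hybrid = hybrid_skew_join(left, right)
baseline = hash_join(left, right)
print(sorted(hybrid) == sorted(baseline))
```

The hybrid result contains the same rows as a plain hash join; only the routing differs. In this sketch the skewed keys are found by counting key frequencies on the left side up front, which merely stands in for the adaptive runtime detection the description mentions.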

Comments: 3
@karthiksitharam 3 years ago
Good session 👍 Good to know a real-world scenario.
@rishigc 3 years ago
Which version of Spark is this for? They talk about a skew hint, but it is only available in Spark 3.0.
@Mike.e 2 years ago
From what I understood, they were talking about skew hints in terms of manually dealing with skew, not a specific skew hint API.