Рет қаралды 2,048
Uneven distribution of input (or intermediate) data can often cause skew in joins. In Spark, this leads to very slow join stages where a few straggling tasks may take forever to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, skew in data exacerbates runtime latencies further to the order of multiple hours and even days. Over the course of last year, we introduced several state-of-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40%, and expanded Spark adoption for numerous latency sensitive pipelines. In this talk, we’ll take a deep dive into Spark’s execution engine and share how we’re gradually solving the data skew problem at scale. To this end, we’ll discuss several catalyst optimizations around implementing a hybrid skew join in Spark (that broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach of extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco: Facebook’s petabyte-scale shuffle service (maxmind-databr....
About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: databricks.com...
See all the previous Summit sessions:
Connect with us:
Website: databricks.com
Facebook: / databricksinc
Twitter: / databricks
LinkedIn: / databricks
Instagram: / databricksinc Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. databricks.com...