Understanding Databricks & Apache Spark Performance Tuning: Lesson 01 - Spark Architecture

Views: 3,773

Bryan Cafferky

Days ago

Comments: 17
@homeitems8113 5 months ago
Waiting for next video
@biouman4 6 months ago
Thanks, very nice video 😉
@RM-xm5gl 6 months ago
Thank you so much. Excellent video.
@BryanCafferky 6 months ago
Thanks
@sarthakmane2977 3 months ago
5:54, better comedian than half the comedians in the world
@mfdba 6 months ago
I don't know if it's always true, but I've recently discovered that Python can be significantly faster than some Spark SQL operations such as joins. I'll check, but do you have a video about monitoring cluster performance? I kind of miss the Ganglia UI. Thanks Bryan. As always, you're a great teacher and explainer of things. ❤
@BryanCafferky 4 months ago
This may help: learn.microsoft.com/en-us/azure/databricks/compute/cluster-metrics
@Prmmani 6 months ago
Thanks Sir ❤
@Andy-rw4hn 5 months ago
11:50 I actually thought that the data for the query in the black box does not have to be distributed/indexed by City, and that the select/group-by can easily be made concurrent by itself.
@BryanCafferky 5 months ago
I am oversimplifying, but when you request joins or aggregations, you trigger a shuffle, which the documentation explains as reordering the data over the cluster nodes to, for example, co-locate data keys from the joined tables. See www.talend.com/resources/intro-apache-spark-partitioning/
@BryanCafferky 5 months ago
I'm trying to find more detailed info on this. Thanks.
@Andy-rw4hn 5 months ago
@BryanCafferky Thank you
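Editor's note on the exchange above: both points can hold at once. A group-by can be partially computed in parallel inside each partition, but the partial results for the same key still have to meet on one node, and that movement is the shuffle. The following is a minimal plain-Python sketch of that idea, not Spark code. The function names, the sample data, and the toy hash are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Sum amounts per city inside one partition (no data movement needed)."""
    totals = defaultdict(int)
    for city, amount in partition:
        totals[city] += amount
    return dict(totals)


def shuffle(partials, num_partitions):
    """Route every (city, subtotal) pair to the partition chosen by its key."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for partial in partials:
        for city, subtotal in partial.items():
            # Toy deterministic hash (assumption); real engines use their own.
            target = sum(city.encode("utf-8")) % num_partitions
            buckets[target][city].append(subtotal)
    return buckets


def final_aggregate(bucket):
    """Merge the subtotals that were co-located by the shuffle."""
    return {city: sum(subtotals) for city, subtotals in bucket.items()}


def group_by_sum(partitions, num_partitions=2):
    """Simulate SELECT city, SUM(amount) ... GROUP BY city over partitions."""
    partials = [partial_aggregate(p) for p in partitions]  # parallelizable step
    buckets = shuffle(partials, num_partitions)            # data movement step
    result = {}
    for bucket in buckets:
        result.update(final_aggregate(bucket))
    return result


# Hypothetical input: rows are spread over partitions with no regard to city.
partitions = [
    [("Boston", 10), ("Austin", 5), ("Boston", 7)],
    [("Austin", 3), ("Denver", 8)],
    [("Boston", 1), ("Denver", 2)],
]
totals = group_by_sum(partitions)
print(totals)
```

The per-partition step needs no coordination, which matches the observation that the group-by can run concurrently. The shuffle step only moves the already-reduced subtotals, which is why it is cheaper than moving every row but still cannot be skipped when keys are scattered across partitions.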
@Andy-rw4hn 5 months ago
Is it possible to run Spark nodes on already concurrent HDFS?
@BryanCafferky 5 months ago
Spark can read from HDFS. Is that your question?
@Andy-rw4hn 5 months ago
@BryanCafferky Yes. I was not aware of the PXF HDFS connector hdfs:parquet. Thank you.
@TJ-hs1qm 4 months ago
There's also Spark RAPIDS (GPU)
@BryanCafferky 4 months ago
Cool. Did not know that. Here's how to set that up on Databricks. Thanks! docs.nvidia.com/spark-rapids/user-guide/23.12.2/getting-started/databricks.html