Understanding Databricks & Apache Spark Performance Tuning: Lesson 01 - Spark Architecture

Views: 4,361

Bryan Cafferky

days ago

Comments: 17

@biouman4 9 months ago

Thanks, very nice video 😉

@homeitems8113 7 months ago

Waiting for next video

@Andy-rw4hn 7 months ago

11:50 I actually thought that the data for the query in the black box does not have to be distributed/indexed by City and the select/group-by can be easily made concurrent by itself

@BryanCafferky 7 months ago

I am oversimplifying, but when you request joins or aggregations, you trigger a shuffle, which the documentation explains as reordering the data over the cluster nodes to, for example, co-locate matching keys from the joined tables. See www.talend.com/resources/intro-apache-spark-partitioning/
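
Editor's note: the shuffle described above can be sketched in plain Python. This is only a toy model of the idea, not Spark's implementation. The city names, the amounts, the helper names (`partition_for`, `shuffle_by_key`, `aggregate_partition`), and the CRC32-based partitioner are all assumptions made for illustration. Rows start scattered across input partitions, the shuffle routes every row with the same key to the same output partition, and each output partition can then aggregate on its own.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Pick an output partition for a key (hypothetical hash partitioner)."""
    # crc32 is used instead of hash() so the result is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(input_partitions, num_partitions):
    """Move every (key, value) row to the partition chosen by its key."""
    shuffled = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            shuffled[partition_for(key, num_partitions)].append((key, value))
    return shuffled


def aggregate_partition(rows):
    """Sum values per key inside one partition, with no cross-partition traffic."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


# Made-up sales rows, scattered across three input partitions ("nodes").
input_partitions = [
    [("Boston", 10), ("Austin", 5)],
    [("Austin", 7), ("Denver", 3)],
    [("Boston", 1), ("Denver", 4)],
]

shuffled = shuffle_by_key(input_partitions, num_partitions=2)
per_partition_totals = [aggregate_partition(rows) for rows in shuffled]

# After the shuffle each city lives in exactly one partition,
# so the per-partition results can simply be combined.
city_totals = {}
for totals in per_partition_totals:
    city_totals.update(totals)

print(city_totals)
```

In this toy version all rows are moved before any aggregation happens. A real engine may pre-aggregate within each input partition first and shuffle only the partial results, which is close to the point raised in the question above.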

@BryanCafferky 7 months ago

I'm trying to find more detailed info on this. Thanks.

@Andy-rw4hn 7 months ago

@BryanCafferky Thank you

@RM-xm5gl 9 months ago

Thank you so much. Excellent video.

@BryanCafferky 9 months ago

Thanks

@sarthakmane2977 6 months ago

5:54, better comedian than half the comedians in the world

@mfdba 8 months ago

I don't know if it's always true, but I've recently discovered that Python can be significantly faster than some Spark SQL operations such as joins. I'll check, but do you have a video about monitoring cluster performance? I kind of miss the Ganglia UI. Thanks Bryan. As always, you're a great teacher and explainer of things. ❤

@BryanCafferky 7 months ago

This may help: learn.microsoft.com/en-us/azure/databricks/compute/cluster-metrics

@Andy-rw4hn 7 months ago

Is it possible to run Spark nodes on already concurrent HDFS?

@BryanCafferky 7 months ago

Spark can read from HDFS. Is that your question?

@Andy-rw4hn 7 months ago

@BryanCafferky Yes. I was not aware of the PXF HDFS connector hdfs:parquet. Thank you.

@Prmmani 9 months ago

Thanks Sir ❤

@TJ-hs1qm 6 months ago

There's also Spark Rapids (GPU)

@BryanCafferky 6 months ago

Cool. Did not know that. Here's how to set that up on Databricks. Thanks! docs.nvidia.com/spark-rapids/user-guide/23.12.2/getting-started/databricks.html