Incredibly useful. I appreciate the way it is explained. I suggest you pick a use case and resolve a long-running problem by changing the cluster configuration.
@joerokcz 1 year ago
The way you explain complex stuff is incredible. I am a data engineer with more than 8 years of experience, and I totally loved your content.
@WKhan-fh2pp 2 years ago
Extremely helpful, and great that it touches on different aspects of Databricks.
@akhilannan 4 years ago
Very useful one! Thanks for making this. What would also be interesting to see is, once you find the cause of a performance issue via the Spark UI, how you go about fixing it. Like the skew issue you mentioned, how do we fix it? Maybe a video on Spark performance tuning? :)
@phy2sll 4 years ago
You can often eliminate skew by repartitioning. If, however, your operation is based on grouped data and the groups are skewed, then you might need to rethink your approach or resize your nodes to fit the largest partition without spill.
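A minimal PySpark sketch of the repartitioning idea above; the source path, partition count, and "customer_id" column are hypothetical:

```python
# A minimal sketch of tackling skew via repartitioning (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-repartition-sketch").getOrCreate()
df = spark.read.parquet("/mnt/landing/orders")  # hypothetical source

# Round-robin repartition: spreads rows evenly, breaking up hot partitions.
evened = df.repartition(200)

# Hash repartition on a higher-cardinality key, if downstream grouping needs
# co-located data. Note this still skews if the key itself is skewed.
by_key = df.repartition(200, "customer_id")
```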
@AdvancingAnalytics 4 years ago
I skipped over the troubleshooting for that one as I go through the same example in the AQE demo. Adaptive Query Execution targets that exact problem, and you can use the Spark UI to see where it might be happening - kzbin.info/www/bejne/oJ3VaZKIpaZ6q7c. But yes, in general there's a Spark performance tuning video/session I should probably write at some point!
@AdvancingAnalytics 4 years ago
@phy2sll Yep, absolutely - though in 7.0 we've got AQE as an alternative path to fix some of that. It doesn't catch everything, which is when you'd drop back to the fixes you mentioned!
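For context, AQE in Spark 3.0+ (Databricks Runtime 7.0+) is switched on through Spark conf; a minimal sketch, assuming the usual `spark` session object:

```python
# Enable Adaptive Query Execution (on by default in later runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Split skewed partitions during sort-merge joins - the skew case above.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Coalesce many small shuffle partitions into fewer, right-sized ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```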
@LuciaCasucci 3 years ago
Excellent video! I am in the middle of optimizing a script for a client, and while I have seen a lot of videos showing the UI as the first thing, nobody talks about exactly how to take advantage of this resource. Thanks for sharing; subscribing!
@jaimetirado6130 1 year ago
Thank you - very useful as I prep for the ADV DBX DE cert!
@matthow91 4 years ago
Great video Simon - thanks!
@raviv5109 4 years ago
You are too good! Lots of important info, tons of it. Thanks for sharing!
@sergeypryvala7750 2 years ago
Thank you for such a simple and powerful explanation.
@auroraw6357 11 months ago
This video is super helpful, thank you very much! :) I would be very interested in the topic you mentioned briefly at the beginning about the JVM. Do you explain this somewhere in more detail? Also, how does e.g. PySpark interact with the JVM, and how does Scala come into play here?
@samirdesai6438 4 years ago
I love you, Sir... please keep adding such videos.
@alacrty9290 1 year ago
Fantastic explanation. Thanks a lot!
@PakIslam2012 4 years ago
Thanks for introducing Ganglia. Can you also make a video on how to understand it and make more sense of the graphs and data it's showing, please? That would be super useful.
@the.activist.nightingale 4 years ago
I was waiting for this! Finally! Thanks 😊
@tqw1423 2 years ago
Super helpful!! Thank you so much!!!
@eduardopalmiero6701 2 years ago
Hi! Do you know why executors on the Executors tab sometimes turn blue?
@carltonpatterson5539 4 years ago
Really, really appreciate this. I was hoping you were going to end by showing how we might use Ganglia to assess the appropriate cluster size for a particular job.
@AdvancingAnalytics 4 years ago
Yeah, definitely a lot more to dive into around Ganglia & specific performance tuning. I need to find some time to make some specific troubleshooting examples for that video - when I got close to 30 mins I thought I should probably stop for this one! Simon
@sid0000009 3 years ago
How should we decide, using the UI, whether increasing the number of nodes (cores) or increasing the SKU (memory) of each node would give more performance benefit? Thank you! :)
@iParsaa 4 years ago
Thank you, it is very clear. Would you be willing to share the code used during the explanation?
@AdvancingAnalytics 4 years ago
I generally don't have time to tidy the code up and make it separately runnable - maybe in the future :) For this one, I grabbed the AQE demo from the Databricks blog; it's good at forcing skew, small partitions etc. to use as diagnosis practice: docs.databricks.com/_static/notebooks/aqe-demo.html
@rashmimalhotra123 4 years ago
How can we choose the right number of worker nodes for a job? My job was using the max number of worker nodes when I changed from 75 to 90, but both times the job ran fine... I did not see any change in performance.
@AdvancingAnalytics 4 years ago
While the job is running you can check the number of tasks against the number of slots - if your job has fewer tasks than there are slots, then adding more workers won't change the processing time. If you want to utilise more of the cluster, you can repartition the dataframe up to spread it across more RDD blocks. That's a careful balance, as it introduces a new shuffle which may cause more performance problems than the increased parallelism! Hopefully that gives you something to look at, even if it's not a full answer! Simon
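As a rough sketch of that tasks-vs-slots check (the `df` dataframe is hypothetical; `defaultParallelism` approximates total slots on a one-core-per-task setup):

```python
# Compare available task slots with a dataframe's partition (task) count.
sc = spark.sparkContext

slots = sc.defaultParallelism           # ~ executors x cores per executor
partitions = df.rdd.getNumPartitions()  # tasks per stage for this dataframe
print(f"slots={slots}, partitions={partitions}")

# Fewer partitions than slots means idle cores: adding workers won't help,
# but repartitioning might - at the cost of an extra shuffle.
if partitions < slots:
    df = df.repartition(slots)
```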
@rashmimalhotra123 4 years ago
@AdvancingAnalytics Are you suggesting using repartitioning or not? My notebook takes over an hour.
@nastasiasaby8086 3 years ago
Thank you very much for this uplifting video :). I'm used to working with the Cloudera interface, so I'm wondering where the application name is. Have we lost it?
@GhernieM 4 years ago
Thanks for this intro. Get ready, Spark jobs, you're gonna be examined!
@skms31 4 years ago
Really wonderful stuff, Simon! I was wondering how Spark/Databricks handles keys. Does Databricks get data that already has keys from upstream, or do you know how a dimension is created with generated keys, like a typical merge-dimension procedure would do in SQL Server?
@AdvancingAnalytics 4 years ago
Hey, thanks for watching! Like any analytics tool, you get business keys from upstream/source systems, then usually need to build new/composite keys. It's fairly common to create a hash over business keys. If you need to generate new "identity" columns there are a few patterns to follow - the monotonically_increasing_id() function is great for adding new unique values on top of a dataframe, and you can simply add the current max value of the dim if you are trying to create "surrogate key" style functionality. Quite often it's easier to stick with hash values and deal with the SCD/latest-version problems downstream. Simon
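A minimal sketch of both patterns mentioned (hashed business keys, and an identity column offset by the dim's current max); the dataframes and column names are hypothetical:

```python
from pyspark.sql import functions as F

# Pattern 1: hash over the business key columns.
dim = dim.withColumn(
    "customer_key",
    F.sha2(F.concat_ws("||", "source_system", "customer_id"), 256),
)

# Pattern 2: "identity"-style surrogate keys - new unique ids offset by the
# dimension's current max. Ids are unique but not contiguous.
max_key = existing_dim.agg(F.max("customer_sk")).first()[0] or 0
new_rows = new_rows.withColumn(
    "customer_sk",
    F.monotonically_increasing_id() + F.lit(max_key + 1),
)
```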
@skms31 4 years ago
@AdvancingAnalytics Ah, I see - thanks for the insights!
@Debarghyo1 4 years ago
Thanks a lot for this.
@MrDeedeeck 3 years ago
Thanks for the great video! Please do make a Ganglia-focused one when you have time :)