Incredibly useful. I appreciate the way it is explained. I suggest you pick a use case and resolve a long-running problem by changing the cluster configuration.
@joerokcz 1 year ago
The way you explain complex stuff is incredible. I am a data engineer with more than 8 years of experience, and I totally loved your content.
@WKhan-fh2pp 2 years ago
Extremely helpful, and great that it touches on different aspects of Databricks.
@akhilannan 4 years ago
Very useful one! Thanks for making this. What would also be interesting to see is, once you find the cause of a performance issue via the Spark UI, how you go about fixing it. Like the skew issue you mentioned, how do we fix it? Maybe a video on Spark performance tuning? :)
@phy2sll 4 years ago
You can often eliminate skew by repartitioning. If, however, your operation is based on grouped data and the groups are skewed, then you might need to rethink your approach or resize your nodes to fit the largest partition without spill.
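A minimal PySpark sketch of the repartitioning idea above; the source path, partition count, and "customer_id" column are hypothetical:

```python
# A minimal sketch of tackling skew via repartitioning (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-repartition-sketch").getOrCreate()
df = spark.read.parquet("/mnt/landing/orders")  # hypothetical source

# Round-robin repartition: spreads rows evenly, breaking up hot partitions.
evened = df.repartition(200)

# Hash repartition on a higher-cardinality key, if downstream grouping needs
# co-located data. Note this still skews if the key itself is skewed.
by_key = df.repartition(200, "customer_id")
```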
@AdvancingAnalytics 4 years ago
I skipped over the troubleshooting for that one as I go through the same example in the AQE demo. Adaptive Query Execution targets that exact problem, and you can use the Spark UI to see where it might be happening - kzbin.info/www/bejne/oJ3VaZKIpaZ6q7c. But yes, in general there's a Spark performance tuning video/session I should probably write at some point!
@AdvancingAnalytics 4 years ago
@phy2sll Yep, absolutely - though in 7.0 we've got AQE as an alternative path to fix some of that. It doesn't catch everything, which is when you'd drop back to the fixes you mentioned!
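For context, AQE in Spark 3.0+ (Databricks Runtime 7.0+) is switched on through Spark conf; a minimal sketch, assuming the usual `spark` session object:

```python
# Enable Adaptive Query Execution (on by default in later runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Split skewed partitions during sort-merge joins - the skew case above.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Coalesce many small shuffle partitions into fewer, right-sized ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```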
@LuciaCasucci 3 years ago
Excellent video! I am in the middle of optimizing a script for a client, and while I have seen a lot of videos showing the UI as the first thing, nobody talks about exactly how to take advantage of this resource. Thanks for sharing; subscribing!
@jaimetirado6130 1 year ago
Thank you - very useful as I prep for the ADV DBX DE cert!
@matthow91 4 years ago
Great video Simon - thanks!
@raviv5109 4 years ago
You are too good! Lots of important info, tons of it. Thanks for sharing!
@sergeypryvala7750 2 years ago
Thank you for such a simple and powerful explanation.
@auroraw6357 11 months ago
This video is super helpful, thank you very much! :) I would be very interested in the topic you mentioned briefly at the beginning about the JVM. Do you explain this somewhere in more detail? Also, how does e.g. PySpark interact with the JVM, and how does Scala come into play here?
@samirdesai6438 4 years ago
I love you, Sir... please keep adding such videos.
@alacrty9290 1 year ago
Fantastic explanation. Thanks a lot!
@PakIslam2012 4 years ago
Thanks for introducing Ganglia. Can you also make a video on how to understand it and make more sense of the graphs and data it's showing, please? That would be super useful.
@the.activist.nightingale 4 years ago
I was waiting for this! Finally! Thanks 😊
@tqw1423 2 years ago
Super helpful!! Thank you so much!!!
@eduardopalmiero6701 2 years ago
Hi! Do you know why executors on the Executors tab sometimes turn blue?
@carltonpatterson5539 4 years ago
Really, really appreciate this. I was hoping you were going to end by showing how we might use Ganglia to assess the appropriate cluster size for a particular job.
@AdvancingAnalytics 4 years ago
Yeah, definitely a lot more to dive into around Ganglia & specific performance tuning. I need to find some time to make some specific troubleshooting examples for that video - when I got close to 30 mins I thought I should probably stop for this one! Simon
@sid0000009 3 years ago
How should we decide, using the UI, whether increasing the number of nodes (cores) or increasing the SKU (memory) of each node would give more performance benefit? Thank you! :)
@iParsaa 4 years ago
Thank you, it is very clear. Would you be willing to share the code used during the explanation?
@AdvancingAnalytics 4 years ago
I generally don't have time to tidy the code up and make it separately runnable - maybe in the future :) For this one, I grabbed the AQE demo from the Databricks blog; it's good at forcing skew, small partitions etc. to use as diagnosis practice: docs.databricks.com/_static/notebooks/aqe-demo.html
@rashmimalhotra123 4 years ago
How can we choose the right number of worker nodes for a job? My job was using the max number of worker nodes when I changed from 75 to 90, but both times the job ran fine... I did not see any change in performance.
@AdvancingAnalytics 4 years ago
While the job is running you can check the number of tasks against the number of slots - if your job has fewer tasks than there are slots, then adding more workers won't change the processing time. If you want to utilise more of the cluster, you can repartition the dataframe up to spread it across more RDD blocks. That's a careful balance, as it introduces a new shuffle which may cause more performance problems than the increased parallelism! Hopefully that gives you something to look at, even if it's not a full answer! Simon
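As a rough sketch of that tasks-vs-slots check (the `df` dataframe is hypothetical; `defaultParallelism` approximates total slots on a one-core-per-task setup):

```python
# Compare available task slots with a dataframe's partition (task) count.
sc = spark.sparkContext

slots = sc.defaultParallelism           # ~ executors x cores per executor
partitions = df.rdd.getNumPartitions()  # tasks per stage for this dataframe
print(f"slots={slots}, partitions={partitions}")

# Fewer partitions than slots means idle cores: adding workers won't help,
# but repartitioning might - at the cost of an extra shuffle.
if partitions < slots:
    df = df.repartition(slots)
```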
@rashmimalhotra123 4 years ago
@AdvancingAnalytics Are you suggesting using repartitioning or not? My notebook takes over an hour.
@nastasiasaby8086 3 years ago
Thank you very much for this uplifting video :). I'm used to working with the Cloudera interface, so I'm wondering where the application name is. Have we lost it?
@GhernieM 4 years ago
Thanks for this intro. Get ready, Spark jobs, you're gonna be examined!
@skms31 4 years ago
Really wonderful stuff, Simon! I was wondering how Spark/Databricks handles keys. Does Databricks get data that already has keys from upstream, or do you know how a dimension is created with generated keys, like a typical merge-dimension procedure would do in SQL Server?
@AdvancingAnalytics 4 years ago
Hey, thanks for watching! Like any analytics tool, you get business keys from upstream/source systems, then usually need to build new/composite keys. It's fairly common to create a hash over business keys. If you need to generate new "identity" columns there are a few patterns to follow - the monotonically_increasing_id() function is great for adding new unique values on top of a dataframe, and you can simply add the current max value of the dim if you are trying to create "surrogate key" style functionality. Quite often it's easier to stick with hash values and deal with the SCD/latest-version problems downstream. Simon
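A minimal sketch of both patterns mentioned (hashed business keys, and an identity column offset by the dim's current max); the dataframes and column names are hypothetical:

```python
from pyspark.sql import functions as F

# Pattern 1: hash over the business key columns.
dim = dim.withColumn(
    "customer_key",
    F.sha2(F.concat_ws("||", "source_system", "customer_id"), 256),
)

# Pattern 2: "identity"-style surrogate keys - new unique ids offset by the
# dimension's current max. Ids are unique but not contiguous.
max_key = existing_dim.agg(F.max("customer_sk")).first()[0] or 0
new_rows = new_rows.withColumn(
    "customer_sk",
    F.monotonically_increasing_id() + F.lit(max_key + 1),
)
```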
@skms31 4 years ago
@AdvancingAnalytics Ah, I see - thanks for the insights!
@Debarghyo1 4 years ago
Thanks a lot for this.
@MrDeedeeck 3 years ago
Thanks for the great video! Please do make a Ganglia-focused one when you have time :)