21 Broadcast Variable and Accumulators in Spark | How to use Spark Broadcast Variables

3,175 views

Ease With Data

A day ago

Comments: 18
@sureshraina321 11 months ago
At 8:50 I have one small doubt: we have already filtered on department_id == 6, so there is no department other than 6 left. Do we really need to groupBy(department_id) after filtering?
@easewithdata 11 months ago
Yes, since the data is already filtered you can apply the sum directly. The groupBy is not mandatory.
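A minimal PySpark sketch of that point, assuming an employees DataFrame with `department_id` and `salary` columns as in the video (the file path is hypothetical): once the filter keeps only one department, a plain aggregation without groupBy returns the single total.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("filter-then-sum").getOrCreate()

# Hypothetical path; the real dataset is linked from the video description
emp = spark.read.csv("data/employees.csv", header=True, inferSchema=True)

# The filter already restricts the data to department_id == 6,
# so groupBy("department_id") adds nothing -- aggregate directly.
total = emp.where("department_id = 6").agg(sum_("salary").alias("total_salary"))
total.show()
```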
@sureshraina321 11 months ago
@easewithdata Thank you 👍
@NiteeshKumarPinjala 1 month ago
Hi Subham, I have a few questions on cache and broadcast:
1. Can we un-broadcast DataFrames or variables, the way we unpersist a cache?
2. When our cluster is terminated and restarted, do the broadcast variables or cached data still exist, or do they vanish every time the cluster is terminated?
@easewithdata 27 days ago
1. You can suppress the broadcast using a Spark config. 2. Yes, the cluster is cleaned up. If you like my content, please make sure to share it with your network over LinkedIn 💓
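For reference, a short sketch of both answers in PySpark (the lookup dict is made up): a broadcast variable created with `sparkContext.broadcast()` exposes `unpersist()` and `destroy()`, automatic broadcast joins can be turned off via `spark.sql.autoBroadcastJoinThreshold`, and nothing broadcast or cached outlives the Spark application.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-cleanup").getOrCreate()

dept_names = {1: "HR", 2: "Finance", 6: "Engineering"}   # made-up lookup data
bc = spark.sparkContext.broadcast(dept_names)

# 1. Releasing a broadcast variable, analogous to unpersisting a cached DataFrame
bc.unpersist()   # drop the copies on the executors (re-sent if the variable is used again)
bc.destroy()     # release it completely; the variable is unusable afterwards

# Automatic broadcast *joins* can be suppressed through configuration
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# 2. Broadcast variables and cached data exist only for the lifetime of the
#    Spark application, so a terminated/restarted cluster starts with none of them.
```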
@ayyappahemanth7134 24 days ago
One doubt, sir: when I did a direct where + sum it took 0.8s across both stages, whereas the accumulator took 3s. Is that because the use case was forced for demonstration? Can you give an example where an accumulator would actually help? Even computation-wise the accumulator went row by row, whereas the filter and exchange seem to use less compute.
@easewithdata 23 days ago
Yes, this was just for demonstration. If you like my content, please make sure to share it with your network over LinkedIn 👍
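One hedged example of where an accumulator pays off: collecting a side metric (here, a count of malformed salary values) during a pass over the data that is running anyway, instead of launching a second job just to count them. The column names, parsing rule, and file path below are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-side-metric").getOrCreate()
bad_salaries = spark.sparkContext.accumulator(0)

def to_pair(row):
    """Return (department_id, salary) and count unparsable salaries on the side."""
    try:
        return (row["department_id"], float(row["salary"]))
    except (TypeError, ValueError):
        bad_salaries.add(1)                  # side metric, no extra pass over the data
        return (row["department_id"], 0.0)

emp = spark.read.csv("data/employees.csv", header=True)   # hypothetical path
totals = emp.rdd.map(to_pair).reduceByKey(lambda a, b: a + b)
totals.collect()                                           # action runs the job

print("Malformed salary values:", bad_salaries.value)
```

The accumulator's value is only meaningful after an action has run, and task retries can double-count inside transformations, so it is best kept to side metrics like this rather than business results.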
@TechnoSparkBigData 11 months ago
In the last video you mentioned that we should avoid UDFs, but here you used one to read the broadcast value. Will it impact performance?
@easewithdata 11 months ago
Yes, we should avoid Python UDFs as much as possible. This example was just to demonstrate a use case for a broadcast variable. You can always use a UDF written in Scala and register it for use from Python.
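A sketch of the trade-off, assuming the video's pattern of looking up a broadcast dict inside a Python UDF (the column names, lookup dict, and path are illustrative): the UDF version works but pays the per-row JVM-to-Python cost, while a broadcast join keeps the lookup entirely inside the JVM.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, broadcast, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-broadcast-join").getOrCreate()
emp = spark.read.csv("data/employees.csv", header=True, inferSchema=True)  # hypothetical path

dept_names = {1: "HR", 2: "Finance", 6: "Engineering"}    # illustrative lookup
bc = spark.sparkContext.broadcast(dept_names)

# Demo pattern: a Python UDF reading the broadcast value -- every row crosses
# the JVM <-> Python boundary, which is the performance concern.
lookup = udf(lambda dept_id: bc.value.get(dept_id), StringType())
with_udf = emp.withColumn("dept_name", lookup(col("department_id")))

# UDF-free alternative: turn the small lookup into a DataFrame and broadcast-join it.
dept_df = spark.createDataFrame(list(dept_names.items()), ["department_id", "dept_name"])
with_join = emp.join(broadcast(dept_df), "department_id", "left")
```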
@TechnoSparkBigData 11 months ago
@easewithdata Thanks
@DEwithDhairy 10 months ago
AWESOME
@devarajusankruth7115 6 months ago
Hi sir, what is the difference between a broadcast join and a broadcast variable? In a broadcast join a copy of the smaller DataFrame is also stored on each executor, so no shuffling happens across the executors.
@easewithdata 6 months ago
Broadcast joins implement the same concept as broadcast variables; they simply make it easier to use with DataFrames.
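A minimal side-by-side sketch of the two APIs (the tables and columns are made up): `sparkContext.broadcast()` ships an arbitrary Python object that you read in your own code, while wrapping a small DataFrame in `broadcast()` asks Spark to ship it to every executor so the join avoids shuffling the large side.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-vs-variable").getOrCreate()

# Broadcast variable: you ship a plain Python object and read it yourself,
# e.g. country_names.value["IN"] inside an RDD function or UDF.
country_names = spark.sparkContext.broadcast({"IN": "India", "US": "United States"})

# Broadcast join: the same shipping mechanism, managed by Spark for a DataFrame.
small_df = spark.createDataFrame([("IN", "India"), ("US", "United States")],
                                 ["code", "country"])
big_df = spark.read.parquet("data/transactions")            # hypothetical large table with a `code` column
joined = big_df.join(broadcast(small_df), "code", "left")   # no shuffle of big_df
```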
@sushantashow000 5 months ago
Can accumulator variables be used to calculate an average as well? When calculating a sum, each executor can contribute its part, but an average won't combine the same way.
@easewithdata 5 months ago
Hello Sushant, to calculate an average the simplest approach is to use two variables, one for the sum and another for the count. Afterwards you can divide the sum by the count to get the average. If you like the content, please make sure to share it with your network 🛜
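A sketch of that suggestion (the column name and path are assumptions): two accumulators track the running sum and the row count, and the division happens on the driver after an action has run.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-average").getOrCreate()
sc = spark.sparkContext

salary_sum = sc.accumulator(0.0)
salary_cnt = sc.accumulator(0)

def track(row):
    salary_sum.add(float(row["salary"]))
    salary_cnt.add(1)

emp = spark.read.csv("data/employees.csv", header=True)   # hypothetical path
emp.foreach(track)                                         # action, so the accumulators are populated

avg_salary = salary_sum.value / salary_cnt.value if salary_cnt.value else 0.0
print("Average salary:", avg_salary)
```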
@at-cv9ky 10 months ago
Please can you provide the link to download the sample data?
@easewithdata 9 months ago
All datasets are available on GitHub. Check out the URL in the video description.