Top 15 Spark Interview Questions in less than 15 minutes Part-2

  39,872 views

Sumit Mittal


1 day ago

Comments: 14
@AnkitChaudhary-f2m 2 months ago
Could you please remove the background music? Your videos are very helpful, but the background music 🎶 is sometimes irritating.
@killbiilpandey 4 months ago
Sumit Mittal sir, your videos have given me huge knowledge. Thank you 🤝🤝
@vaibhavj12 9 months ago
Helpful❤
@piyushjain5852 8 months ago
How is it that number of stages = number of wide transformations + 1?
@sugunanindia 8 months ago
In Apache Spark, the number of stages in a job is determined by the wide transformations in the execution plan. Here's why the number of stages equals the number of wide transformations plus one.

### Transformations in Spark

#### Narrow Transformations
Narrow transformations are operations where each input partition contributes to exactly one output partition. Examples include:
- `map`
- `filter`
- `flatMap`

These transformations do not require data shuffling and can be executed within a single stage.

#### Wide Transformations
Wide transformations are operations where each input partition can contribute to multiple output partitions, so they require shuffling data across the network. Examples include:
- `reduceByKey`
- `groupByKey`
- `join`

A wide transformation creates a stage boundary because data must be redistributed across the cluster.

### Understanding Stages
A stage in Spark is a set of tasks that can run in parallel on different partitions of a dataset without any shuffling of data. A new stage is created each time a wide transformation is encountered, because the data must be shuffled across the cluster.

### Calculation of Stages
The rule "number of stages = number of wide transformations + 1" follows directly:
1. **Initial stage**: The first stage runs the initial narrow transformations until the first wide transformation is encountered.
2. **Subsequent stages**: Each wide transformation requires a shuffle, ending the current stage and starting a new one.

Thus, for `n` wide transformations there are `n + 1` stages: the initial stage, plus one additional stage per wide transformation.

### Example
Consider the following Spark job:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Sample RDD of (key, value) pairs
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

# Narrow transformation: map (no shuffle)
rdd1 = rdd.map(lambda x: (x[0], x[1] * 2))

# Wide transformation: reduceByKey (requires a shuffle)
rdd2 = rdd1.reduceByKey(lambda x, y: x + y)

# Narrow transformation: filter (no shuffle, stays in the same stage)
rdd3 = rdd2.filter(lambda x: x[1] > 4)

# Wide transformation: groupByKey (requires another shuffle)
rdd4 = rdd3.groupByKey()

# Action: collect triggers the job
result = rdd4.collect()
print(result)
```

**Analysis of stages**:
1. **Stage 1**: `parallelize` and `map`, all narrow transformations, up to the shuffle triggered by `reduceByKey`.
2. **Stage 2**: Begins after the first shuffle; it performs the `reduceByKey` aggregation and the narrow `filter`, up to the shuffle triggered by `groupByKey`.
3. **Stage 3**: Begins after the second shuffle and performs the `groupByKey` grouping, followed by `collect`.

So there are two wide transformations (`reduceByKey` and `groupByKey`) and three stages (number of wide transformations + 1).

### Conclusion
The number of stages in a Spark job is driven by the need to shuffle data between transformations. Each wide transformation introduces a new stage boundary, giving the formula: `number of stages = number of wide transformations + 1`. This understanding is crucial for optimizing and debugging Spark applications.
@basedsonu 7 months ago
Bro explained it exactly as it is 😄
@SabyasachiManna-sw2cg 6 months ago
To answer your question: let's assume you have one wide transformation, reduceByKey(). It will create two stages, stage 0 and stage 1, with shuffling in between. I hope that helps.
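A quick way to check this is a minimal sketch, assuming a local PySpark shell (the sample data here is made up): `toDebugString()` prints the RDD lineage, and the indentation break marks the shuffle boundary between the two stages.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# One wide transformation (reduceByKey) forces one shuffle,
# splitting the job into stage 0 (map side) and stage 1 (reduce side)
rdd = (
    sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
      .map(lambda kv: (kv[0], kv[1] * 10))   # narrow: stays in stage 0
      .reduceByKey(lambda x, y: x + y)       # wide: starts stage 1
)

# The lineage shows a ShuffledRDD; the indented block below it
# belongs to the earlier stage, on the other side of the shuffle
print(rdd.toDebugString().decode("utf-8"))
```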
@AmitBenake 5 months ago
For a wide transformation there is shuffling of data; the extra stage handles the post-shuffle aggregation.
@cajaykiran 2 months ago
What is the logic behind the answer that 2 wide transformations result in 3 stages?
@luckykr94 23 days ago
Two wide transformations will split the processing into three stages: the first stage processes the initial data, the second stage handles the shuffle from the first wide transformation, and the third stage processes the shuffle from the second wide transformation.
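A minimal sketch of this, assuming a local SparkSession (the data and column names are illustrative): each wide transformation shows up as an `Exchange` (shuffle) node in the physical plan, and two `Exchange` boundaries split the job into three stages.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Wide transformation 1: groupBy triggers the first shuffle
agg = df.groupBy("key").agg(F.sum("value").alias("total"))

# Wide transformation 2: orderBy triggers the second shuffle
result = agg.orderBy("total")

# The physical plan contains two Exchange (shuffle) nodes,
# i.e. two stage boundaries, giving three stages in total
result.explain()
```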
@maninderbhambra3796 2 months ago
Sumit sir, please add a real-time use case. For data skewness, I feel the answer is not correct; salting is the right method.
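For context, salting spreads a hot key across several sub-keys so the shuffle load is balanced. Here is a minimal sketch (the data, column names, and salt range are illustrative assumptions, not from the video):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical skewed data: almost every row shares the key "hot"
df = spark.createDataFrame(
    [("hot", 1)] * 1000 + [("cold", 1)] * 10, ["key", "value"]
)

SALT_BUCKETS = 8  # illustrative; tune to the observed skew

# Step 1: append a random salt so the hot key is spread
# across SALT_BUCKETS sub-keys (and hence across partitions)
salted = df.withColumn(
    "salted_key",
    F.concat_ws(
        "_",
        F.col("key"),
        (F.rand() * SALT_BUCKETS).cast("int").cast("string"),
    ),
)

# Step 2: partial aggregation on the salted key; this shuffle is balanced
partial = salted.groupBy("salted_key", "key").agg(
    F.sum("value").alias("partial_sum")
)

# Step 3: final aggregation on the original key over a tiny intermediate
final = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))

final.show()
```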
@HimanshuSingh-yj2wh 17 days ago
2:37 not true
@harish7548 5 days ago
True, she is more focused on how the code is processed and on the physical & logical plans, but the question is about the architecture.
@ReadingForLearning 1 month ago
Some questions are too easy and should not be asked in interviews.