109. Databricks | Pyspark| Coding Interview Question: Pyspark and Spark SQL

20,689 views

Raja's Data Engineering

1 day ago

Comments: 38
@prasadtelu9873 • 1 year ago
First time I saw scenario-based and interview-based solutions in KZbin videos. Thanks for your commitment and for sharing the knowledge.
@rajasdataengineering7585 • 1 year ago
Thanks, Prasad, for your comment! Hope it helps people in the big data community.
@adiityagupta-wu1tz • 1 year ago
Please continue this series; it will be very helpful for cracking interviews. Thanks for starting it.
@rajasdataengineering7585 • 1 year ago
Sure, I will create many videos in this series.
@DataEngineering-ni2ot • 1 year ago
Shouldn't the change event be 1 in the first row at 08:10, since the previous value is not the same as the first row's event status? Why is it coming out as 0?
@johnsonrajendran6194 • 10 months ago
For the first row, the previous value is null, and null cannot be compared with anything, so by default the logic falls through to the else condition, which is 0 in this case.
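A minimal sketch of that behavior, assuming hypothetical event_date/event_status column names and sample data (the video's exact code is not reproduced here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lag, when
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, not the dataset from the video.
df = spark.createDataFrame(
    [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost")],
    ["event_date", "event_status"],
)

w = Window.orderBy("event_date")

# lag() returns null for the first row; comparing null with anything yields
# null (neither true nor false), so when() falls through to otherwise(0).
flagged = df.withColumn(
    "change_event",
    when(lag("event_status").over(w) != col("event_status"), 1).otherwise(0),
)
flagged.show()
```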
@Vk-gw6ii • 9 months ago
Excellent 👌
@rajasdataengineering7585 • 9 months ago
Thank you!
@harithad1757 • 8 months ago
Can I get the code copy-pasted in the description, or maybe a link to the notebook?
@prabhatgupta6415 • 1 year ago
Thanks, sir. Please create a playlist of coding questions that are frequently asked.
@rajasdataengineering7585 • 1 year ago
Sure, I will create a playlist and add more coding scenarios using Pyspark and SQL.
@rawat2608 • 1 year ago
Thank you, sir.
@rajasdataengineering7585 • 1 year ago
You are welcome!
@jinsonfernandez • 5 months ago
Thanks for this video. But I am curious why you didn't directly use min/max with a group by, which would have fetched the same result:

```python
from pyspark.sql import functions as F

result = df.withColumn("event_date", F.to_date("event_date")) \
    .groupBy("event_status") \
    .agg(
        F.min("event_date").alias("event_start_date"),
        F.max("event_date").alias("event_end_date")
    ) \
    .orderBy("event_start_date")

result.show()
```
@rajasdataengineering7585 • 5 months ago
Thanks for sharing your approach. Yes, there are various approaches.
@boyasuresh5998 • 5 months ago
This won't work; please check your code.
@rajasdataengineering7585 • 5 months ago
It worked
@funworld8659 • 5 months ago
@rajasdataengineering7585 It did not show the last "Won" streak's start date and end date.
@beingnagur • 3 months ago
How will this approach work when we are grouping by event status? In that case the 7th will also be grouped into the same bucket, and the start date will come out as the 1st and the end date as the 7th.
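For reference, a sketch of the usual fix for that case (the row_number difference trick, which several comments below also use), assuming the same hypothetical df with event_date and event_status columns:

```python
from pyspark.sql.functions import min as min_, max as max_, row_number
from pyspark.sql.window import Window

# Grouping by status alone merges the June 1st-3rd "Won" streak with a later
# "Won" row such as June 7th. Subtracting a per-status row number from a
# global row number produces a value that is constant within each consecutive
# streak, so grouping by (status, grp) keeps the streaks separate.
w_all = Window.orderBy("event_date")
w_status = Window.partitionBy("event_status").orderBy("event_date")

streaks = (
    df.withColumn("grp", row_number().over(w_all) - row_number().over(w_status))
      .groupBy("event_status", "grp")
      .agg(min_("event_date").alias("event_start_date"),
           max_("event_date").alias("event_end_date"))
      .orderBy("event_start_date")
)
streaks.show()
```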
@arrooow9019 • 1 year ago
Hi sir, could you please share the notebook and dataset in the description? It will be helpful for our practice. Thanks in advance.
@prabhatgupta6415 • 1 year ago
One more suggestion: please put the dataset in the description.
@rajasdataengineering7585 • 1 year ago
Sure, will add the dataset in the description.
@rohitsharma-mg7hd • 1 month ago
@rajasdataengineering7585 You haven't added the dataset.
@namratachavan6317 • 1 year ago
Hi sir, could you please share the notebook and the GitHub repository link to access the code?
@saurabh011192 • 1 year ago
This solution will work only when the dates are in order with respect to the events. I tried jumbling them; it didn't work.
@rohitsharma-mg7hd • 1 month ago
Lol brother, you can apply a sort first in the window function; it will work.
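In case it helps: the ordering lives in the window spec itself, so the physical row order of the DataFrame does not have to match. A minimal sketch, assuming the same hypothetical columns as above:

```python
from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# orderBy inside the window spec makes lag() traverse rows in date order,
# regardless of how the underlying DataFrame rows happen to be arranged.
w = Window.orderBy("event_date")
df_flagged = df.withColumn("prev_status", lag("event_status").over(w))
```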
@bonysrivastava7779 • 8 months ago
🙌
@rajasdataengineering7585 • 8 months ago
🙌
@roshniagrawal4777 • 5 months ago
This solution will not work if you have data like this; maybe some tweak will be needed:

```python
data = [
    ("2020-06-01", "Won"),
    ("2020-06-02", "Won"),
    ("2020-06-03", "Won"),
    ("2020-06-03", "Lost"),
    ("2020-06-04", "Lost"),
    ("2020-06-05", "Lost"),
    ("2020-06-06", "Lost"),
    ("2020-06-07", "Won")
]
```
@landchennai8549 • 9 months ago
Here is my SQL query for the same:

```sql
declare @Event_Table table (
    Event_date   date,
    Event_status varchar(8)
);

insert into @Event_Table
select getdate() + value,
       case when value > 3 and value < 7 then 'Lost' else 'Won' end
from generate_series(1, 10, 1);

with cte as (
    select *,
           row_number() over (order by Event_date)
         - row_number() over (order by Event_status, Event_date) as GroupId
    from @Event_Table
)
select GroupId,
       min(Event_status) as Event_status,
       min(Event_date)   as Start_date,
       max(Event_date)   as End_Date,
       count(1)          as Consecutive_Events
from cte
group by GroupId;
```
@rajasdataengineering7585 • 9 months ago
Thanks for sharing your approach
@landchennai8549 • 9 months ago
Keep posting more like this, @rajasdataengineering7585.
@starmscloud • 1 year ago
I did it something like this, by using a default date, a running number, and datediff:

```python
from pyspark.sql.functions import to_date, row_number, asc, date_add, lit, datediff, min, max
from pyspark.sql.window import Window

eventDF.withColumn("event_date", to_date(col="event_date", format="dd-MM-yyyy")) \
    .withColumn("rank", row_number().over(Window.partitionBy("event_status").orderBy(asc("event_date")))) \
    .withColumn("startDate", date_add(lit("1900-01-01"), "rank")) \
    .withColumn("datediff", datediff("event_date", "startDate")) \
    .groupBy("datediff", "event_status") \
    .agg(min("event_date").alias("start_date"), max("event_date").alias("end_date")) \
    .drop("rangeDate") \
    .sort("start_date").show()
```
@rajasdataengineering7585 • 1 year ago
Great!
@ShivamGupta-wn9mo • 1 month ago
My solution:

DataFrame API:

```python
from pyspark.sql.functions import dense_rank, min, max
from pyspark.sql.window import Window

window_base = Window.orderBy('date')

df_t = df.withColumn(
        "diff",
        dense_rank().over(window_base) - dense_rank().over(window_base.partitionBy("status"))) \
    .groupBy("status", "diff") \
    .agg(min("date").alias("start_date"),
         max("date").alias("end_date")) \
    .orderBy("start_date") \
    .show()
```

Spark SQL 1:

```python
df.createOrReplaceTempView("match")
spark.sql('''
    with cte as (
        select *,
               dense_rank() over (order by date)
             - dense_rank() over (partition by status order by date) as diff
        from match
    )
    select status, diff, min(date) as start, max(date) as end
    from cte
    group by status, diff
    order by start
''').show()
```

Spark SQL 2:

```python
df.createOrReplaceTempView("match")
spark.sql('''
    with cte as (
        select *,
               dense_rank() over (order by date) as rn1,
               dense_rank() over (partition by status order by date) as rn2,
               dense_rank() over (order by date)
             - dense_rank() over (partition by status order by date) as diff
        from match
    )
    select a.status,
           max(a.start_date) as start_date,
           max(a.end_date)   as end_date
    from (
        select date, status, diff,
               min(date) over (partition by status, diff) as start_date,
               max(date) over (partition by status, diff) as end_date
        from cte
        order by date
    ) a
    group by a.status, a.diff
    order by start_date asc
''').show()
```

Enjoy!
@rajasdataengineering7585 • 1 month ago
Thank you for sharing your approach
Related videos:
- 111. Databricks | Pyspark| SQL Coding Interview: Exchange Seats of Students (22:50, Raja's Data Engineering, 8K views)
- 4 Recently asked Pyspark Coding Questions | Apache Spark Interview (28:39)
- 10 PySpark Product Based Interview Questions (39:46, The Data Tech, 25K views)
- 01. Databricks: Spark Architecture & Internal Working Mechanism (41:34, Raja's Data Engineering, 287K views)
- 10 recently asked Pyspark Interview Questions | Big Data Interview (28:36)