Comments
@MuekeMwangangi 3 hours ago
This helped me
@Arnob_111 10 days ago
def flatten_list(lst):
    flat_list = []
    for item in lst:
        if isinstance(item, list):
            flat_list.extend(flatten_list(item))
        else:
            flat_list.append(item)
    return flat_list
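A quick sanity check of the recursion above (the nested input is just an illustrative example, not from the video):

print(flatten_list([1, [2, [3, 4]], 5]))  # -> [1, 2, 3, 4, 5]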
@Billy_Butcher.... 10 days ago
WITH cte1 AS (
    SELECT product_id,
           total_sales_revenue,
           LAG(total_sales_revenue, 1) OVER (PARTITION BY product_id ORDER BY [year]) AS prev_year_revenue
    FROM sales
),
cte2 AS (
    SELECT product_id
    FROM cte1
    GROUP BY product_id
    HAVING MIN(total_sales_revenue - prev_year_revenue) > 0
)
SELECT P.*
FROM cte2 C
JOIN products P ON C.product_id = P.product_id
@punpompur 12 days ago
At 14:12, orderBy was used to avoid shuffling the data, but my question is: since partitionBy is already applied on the group_ column, is it really mandatory to use orderBy? I tried without orderBy and got the same result.
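A hedged aside on this question (the column names group_ and row_id are taken from the thread; the sketch below is not the video's exact code): with partitionBy alone, the window frame defaults to the entire partition, while adding orderBy shrinks it to unboundedPreceding..currentRow and also pins down a deterministic row order. So "same result without orderBy" is not guaranteed across runs, and order-sensitive functions can genuinely differ.

from pyspark.sql import Window
from pyspark.sql import functions as F

w_unordered = Window.partitionBy("group_")                    # frame = entire partition
w_ordered = Window.partitionBy("group_").orderBy("row_id")    # frame = partition start .. current row

# first(..., ignorenulls=True) may happen to coincide for both, but
# order-sensitive functions (last, lag, lead, row_number) need the ordered window:
# df.withColumn("role", F.first("job_role", ignorenulls=True).over(w_ordered))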
@punpompur 12 days ago
Can there be a scenario where the job_role in each group is not in the first position but placed randomly? If yes, how can we approach such a problem?
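One possible approach, assuming each group has exactly one non-null job_role (a sketch with names borrowed from the thread): an order-insensitive aggregate like max() ignores nulls, so the role's position within the group stops mattering.

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("group_")
# max() skips nulls, so it picks up the lone non-null role wherever it sits
df_filled = df.withColumn("job_role", F.max("job_role").over(w))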
@prajju8114 12 days ago
this was a tough question damn!
@punpompur 16 days ago
Does it matter if we use row_number or dense_rank? I tried the code with row_number and the output matched the expected result
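They diverge only when the orderBy key has ties: row_number always emits 1, 2, 3, ..., while dense_rank repeats ranks for tied rows. A tiny sketch (assumes a live `spark` session; the data is made up):

from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 10), ("b", 10), ("c", 20)], ["id", "score"])
w = Window.orderBy("score")
df.select("id",
          F.row_number().over(w).alias("rn"),       # 1, 2, 3
          F.dense_rank().over(w).alias("dr")).show()  # 1, 1, 2

So if the ordering key in the exercise was unique, a matching output is expected.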
@punpompur 17 days ago
One good thing about this series is that you share the dataframes directly, which makes it quite easy to start coding right away. Quite helpful.
@punpompur 17 days ago
What if I want to replace the null value with "-" using a method from the DataFrameNaFunctions class? I tried to replace NULL with 0 using na.fill(), since the percentage column is numeric and won't take "-" in the na.fill() method.
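na.fill() only applies a value to columns of a matching type, so a string like "-" silently skips a numeric column. One workaround (a sketch, not the video's solution): cast the column to string first, then fill.

from pyspark.sql import functions as F

df = df.withColumn("percentage", F.col("percentage").cast("string"))
df = df.na.fill("-", subset=["percentage"])   # now the string fill applies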
@sajjadkhan8932 20 days ago
With basic knowledge, I did it like this:

from pyspark.sql.functions import col, max, concat, lit

df = user_flags.join(flag_review, on=(user_flags.flag_id == flag_review.flag_id), how='inner')
df = df.withColumn('user_name', concat(col('user_firstname'), lit(" "), col('user_lastname')))
# df = df.filter(col('reviewed_outcome') == 'APPROVED')
df = df.select('user_name', 'video_id') \
    .distinct() \
    .filter(col('reviewed_outcome') == 'APPROVED') \
    .groupBy('user_name') \
    .count()
dfs = df.agg(max('count')).collect()[0][0]
df = df.select('user_name').filter(col('count') == dfs)
df.show()
@rawat7203 20 days ago
Thank you sir
@punpompur 23 days ago
Hi. I want to practice the SQL exercises on the website you shared. Although it does provide the table structure and sample data, it does not directly provide access to already created tables and data. Instead of creating tables and inserting data every time before attempting a question, is there a resource where we can just practice SQL queries with the tables and data already created?
@surbhipaliwal5896 23 days ago
REMEMBER SURBHI M YOUR COWORKER
@surbhipaliwal5896 23 days ago
HI DHEERAJ
@KapilKumar-hk9xk 24 days ago
One query: attacker_king 1 (House Stark) won in The Riverlands, so how come it's not in the output? Similarly, defender_king 4 (House Greyjoy) also won in The North, and that's not in the output either. The correct output should be:

region,house,no_of_wins
"The North","House Stark",2
"The Reach","House Stark",1
"The Riverlands","House Stark",1
"The Riverlands","House Baratheon",2
"The North","House Greyjoy",1
"The Reach","House Martell",1
@shankarkr1603 25 days ago
I am trying to sort, but it's not working:

df3 = df2.groupby(df2.candidate).agg(round(sum(df2.vote_percent), 3).alias('s'))
df4 = df3.sort(df3.s, asecnding=False)
df4.show()

+----------+-----+
| candidate|    s|
+----------+-----+
|      null| 0.25|
|     Paul |1.516|
|    Kathy |1.949|
|  Anthony |  2.4|
|   Nicole |  2.7|
|     Ryan |5.149|
|Christine |5.283|
+----------+-----+
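A guess at the culprit: `asecnding` is a misspelling of `ascending`, and PySpark's sort() has historically ignored unknown keyword arguments silently, leaving the default ascending order. Either the corrected keyword or a descending column expression should work:

df4 = df3.sort(df3.s.desc())   # or: df3.sort('s', ascending=False)
df4.show()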
@shankarkr1603 26 days ago
Can we get videos for all the questions in StrataScratch?
@ChandanDeveloper 1 month ago
When I try to install Spark on Windows Home, I get an error.
@prajju8114 1 month ago
student_final = student_df.join(student_perc, 'student_id')
student_final.show()

student_final = student_final.withColumn(
    'Result',
    when(col('percentage') >= 70, 'Distinction')
    .when((col('percentage') >= 60) & (col('percentage') <= 69), 'First Class')
    .when((col('percentage') >= 50) & (col('percentage') <= 59), 'Second Class')
    .when((col('percentage') >= 40) & (col('percentage') <= 49), 'Third Class')
    .otherwise('Fail'))
student_final.show()
@prajju8114 1 month ago
from pyspark.sql.functions import *

student_perc = marks_df.groupBy('student_id').agg(
    (sum('marks') / count('student_id')).alias('percentage').cast('int'))
student_perc.show()
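A small caveat on this snippet, if I'm reading Spark's aliasing right: calling .cast('int') after .alias('percentage') wraps the named column in a cast, so the output column ends up auto-named something like CAST(percentage AS INT) rather than percentage. Casting before aliasing keeps the name:

student_perc = marks_df.groupBy('student_id').agg(
    (sum('marks') / count('student_id')).cast('int').alias('percentage'))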
@rawat7203 1 month ago
winner_df = battle_df.withColumn(
    'winner_king',
    when(col('attacker_outcome') == 1, col('attacker_king')).otherwise(col('defender_king')))
joinDf = winner_df.join(king_df, winner_df.winner_king == king_df.k_no) \
    .select('region', 'house', 'battle_number')
resultDf = joinDf.groupBy('region', 'house') \
    .agg(count('*').alias('battles_won_count')) \
    .withColumn('rn', rank().over(Window.partitionBy('region').orderBy(col('battles_won_count').desc()))) \
    .filter(col('rn') == 1) \
    .drop('rn')
resultDf.show()
@narravularushitha6806 1 month ago
Hi, when I'm trying to run the sample code, I'm getting the error below. Please help.

C:\Users\rushi\Pyspark_Demo\venv\Scripts\python.exe C:\Users\rushi\Pyspark_Demo\Test.py
Traceback (most recent call last):
  File "C:\Users\rushi\Pyspark_Demo\Test.py", line 1, in <module>
    from pyspark.sql import SparkSession
  File "C:\spark\spark-3.4.2-bin-hadoop3\python\pyspark\__init__.py", line 58, in <module>
    from pyspark.conf import SparkConf
  File "C:\spark\spark-3.4.2-bin-hadoop3\python\pyspark\conf.py", line 23, in <module>
    from py4j.java_gateway import JVMView, JavaObject
ModuleNotFoundError: No module named 'py4j'

Process finished with exit code 1
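A likely cause, judging from the paths in the traceback: the script imports pyspark from SPARK_HOME, but the py4j bundled under %SPARK_HOME%\python\lib isn't on PYTHONPATH. Either `pip install pyspark` inside the venv, or let the findspark package (an extra dependency, assumed acceptable here) wire the paths before importing pyspark:

import findspark
findspark.init()   # adds SPARK_HOME's python/ dir and the bundled py4j zip to sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()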
@talamanchisudheernadh698 1 month ago
from pyspark.sql import Window
from pyspark.sql.functions import *

df1 = df.withColumn('rn', lead('flag').over(Window.partitionBy('emp_id').orderBy('log_date'))) \
    .withColumn('enddate', lead('log_date').over(Window.partitionBy('emp_id').orderBy('log_date'))) \
    .filter((col('rn') == 'Y') & (col('flag') == 'Y')) \
    .select('emp_id', 'log_date', 'enddate')
df2 = df1.union(df1).distinct()
df3 = df2.withColumn('daysk', dayofmonth(to_date(col('log_date'), 'dd-MM-yyyy'))) \
    .withColumn('rn', row_number().over(Window.partitionBy('emp_id').orderBy('daysk'))) \
    .withColumn('diff', col('daysk') - col('rn')) \
    .groupBy('emp_id', 'diff').agg(min('log_date'), max('log_date')).show()
@talamanchisudheernadh698 1 month ago
from pyspark.sql import Window
from pyspark.sql.functions import *

df.withColumn('rn', lead('flag').over(Window.partitionBy('emp_id').orderBy('log_date'))) \
    .withColumn('rn1', lead('log_date').over(Window.partitionBy('emp_id').orderBy('log_date'))) \
    .filter((col('rn') == 'Y') & (col('flag') == 'Y')).show()

+------+----------+----+---+----------+
|emp_id|  log_date|flag| rn|       rn1|
+------+----------+----+---+----------+
|   102|02-01-2024|   Y|  Y|03-01-2024|
|   102|05-01-2024|   Y|  Y|06-01-2024|
|   102|06-01-2024|   Y|  Y|07-01-2024|
|   103|05-01-2024|   Y|  Y|06-01-2024|
+------+----------+----+---+----------+
@nsreeabburi2292 1 month ago
Thanks Diraj. I'm trying to run it via notebook, and when I execute the code I get a Py4JJavaError. Also, how can I get a PySpark kernel in the notebook? Do you have any idea about it?
@rawat7203 1 month ago
Sir, my way:

joinedDf = friends_df.join(likes_df, [friends_df.friend_id == likes_df.user_id]) \
    .select(friends_df.user_id, 'friend_id', col('page_id').alias('fpage_id'))
friendsLikeDf = joinedDf.groupBy('user_id').agg(collect_set('fpage_id').alias('friends_likes'))
selfLikeDf = likes_df.groupBy('user_id').agg(collect_set('page_id').alias('self_likes'))
allDf = friendsLikeDf.join(selfLikeDf, 'user_id').select(friendsLikeDf.user_id, 'friends_likes', 'self_likes')
exceptDf = allDf.withColumn('test', array_except('friends_likes', 'self_likes'))
resultDf = exceptDf.select('user_id', explode('test').alias('page_id')).orderBy('user_id')
@rajas6248 1 month ago
Would lead and lag help to solve this easily?
@rachitahuja1382 1 month ago
likes_df_renamed = likes_df.select(col("user_id").alias("l_user_id"), col("page_id"))
df_joined = friends_df.join(likes_df_renamed, on=friends_df.friend_id == likes_df_renamed.l_user_id, how="inner")
df_joined = df_joined.select(col("user_id"), col("page_id")).distinct()
df_result = df_joined.join(likes_df, on=["user_id", "page_id"], how="left_anti")
@rachitahuja1382 1 month ago
from pyspark.sql import Window
from pyspark.sql import functions as F

# Create a window specification covering all rows up to the current one, ordered by row_id
window_spec = Window.orderBy("row_id").rowsBetween(Window.unboundedPreceding, 0)

# Fill null values with the last non-null value in the window
job_skills_df = job_skills_df.withColumn("job_role", F.last("job_role", ignorenulls=True).over(window_spec))
@AnishNandyrevealed 1 month ago
Why can't we use self join instead of window functions to solve this?
@user-gq6cg3ls7f 1 month ago
with cte as (
    select year(created_at) year,
           month(created_at) month,
           sum(valuess) total_sales
    from amazon_monthly_rev
    group by year(created_at), month(created_at)
)
select concat(year, '-', '0', month) year_month,
       format(100 * cast(total_sales - lag(total_sales) over (order by month) as decimal)
              / lag(total_sales) over (order by month), '##.##') pct
from cte
@natarajbeelagi569 2 months ago
Can combine all steps into a single step:

# importing functions
from pyspark.sql.functions import *
from pyspark.sql.window import Window as W
from pyspark.sql.types import *

window_spec = W.partitionBy(col("emp_id")).orderBy("log_date")

df2 = (
    df.filter(col("flag") == 'Y')
    .withColumn("log_date", to_date(col("log_date"), 'dd-mm-yyyy'))
    .withColumn("rnk", row_number().over(window_spec))
    .withColumn("Day", day("log_date"))
    .withColumn("day_diff", lit(col("Day") - col("rnk")))
    .groupBy("emp_id", "day_diff").agg(
        count(lit(1)).alias("Consecutive_days"),
        min(col("log_date")).alias("Start_Date"),
        max(col("log_date")).alias("End_Date")
    )
    .filter("Consecutive_days >= 2")
    .drop("day_diff")
)
# df2.show()
@Aishwarya-w8d 2 months ago
finaldf = (sdf.withColumn('fn', coalesce(sdf.split_names.getItem(0), lit(None)))
    .withColumn('mn', coalesce(sdf.split_names.getItem(1), lit(None)))
    .withColumn('ln', coalesce(sdf.split_names.getItem(2), lit(None)))
).show()
@punpompur 12 days ago
This answer will treat Virat Kohli's last name as his middle name.
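One way to handle both two- and three-part names (a sketch reusing the thread's sdf/split_names names, which are assumptions): branch on the array length, so the last element is always the last name and a middle name only exists for three-part names.

from pyspark.sql import functions as F

n = F.size("split_names")
finaldf = (sdf
    .withColumn("fn", F.col("split_names").getItem(0))
    .withColumn("mn", F.when(n >= 3, F.col("split_names").getItem(1)))   # null for two-part names
    .withColumn("ln", F.when(n >= 2, F.element_at("split_names", -1))))  # last element as last name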
@seansupp2315 2 months ago
Hi Dhairy! I know you are incredibly busy and you probably get a lot of questions, so this is going to take you 5 mins max. Could you please help me figure out why my code is wrong? Your answer is going to make my day, if not my week. My code:

WITH maintable AS (
    SELECT user_id, action, CAST(timestamp AS date) AS date, timestamp
    FROM facebook_web_log
    WHERE action IN ('page_load', 'page_exit')
),
page_load AS (
    SELECT user_id, date, timestamp,
           ROW_NUMBER() OVER (PARTITION BY user_id, date ORDER BY timestamp DESC) AS rnk_load
    FROM maintable
    WHERE action = 'page_load'
),
page_exit AS (
    SELECT user_id, date, timestamp,
           ROW_NUMBER() OVER (PARTITION BY user_id, date ORDER BY timestamp) AS rnk_exit
    FROM maintable
    WHERE action = 'page_exit'
)
SELECT *, e.timestamp - l.timestamp
FROM page_load l
INNER JOIN page_exit e
    ON l.user_id = e.user_id
    AND l.date = e.date
    AND l.rnk_load = 1
    AND e.rnk_exit = 1

When calculating the differences in time, it seems like StrataScratch miscalculated 'only' for user 0. Could you please tell me why that is the case? I don't see flaws in my logic.
@VikashKumar-jx4qx 2 months ago
Hey, I want to know: is there any platform like HackerRank or LeetCode for PySpark, where I can practice PySpark?
@vivekdutta7131 2 months ago
with credit AS (
    select user_id, user_name,
           isnull((credit + amount), credit) as tot_credit
    from users u
    left join Transactions t on u.user_id = t.paid_to
)
select user_id, user_name,
       isnull((tot_credit - amount), tot_credit) as final_credit,
       case when isnull((tot_credit - amount), tot_credit) > 0 then 'NO' else 'Yes' end as credit_limit_breached
from credit u
left join Transactions t on u.user_id = t.paid_by
@satirthahalder9528 2 months ago
with cte as (
    select *, lag(log_date, 1, log_date) over (partition by emp_id order by log_date) prev_date
    from t
    where flag = 'Y'
),
cte2 as (
    select *, case when log_date - prev_date < 2 then 0 else 1 end gp
    from cte
),
cte3 as (
    select *, sum(gp) over (partition by emp_id order by log_date) as gps
    from cte2
)
select emp_id,
       min(log_date) start_date,
       max(log_date) end_date,
       count(*) consecutive_days
from cte3
group by emp_id, gps
having count(*) >= 2
@user-gq6cg3ls7f 2 months ago
another approach:

with success as (
    select *,
           DATEADD(day, -1 * ROW_NUMBER() over (order by success_date), success_date) s_bucket,
           ROW_NUMBER() over (order by success_date) S_RN
    from succeeded
    where year(success_date) != '2018'
),
s_final as (
    select min(success_date) start_date,
           max(success_date) end_date,
           ROW_NUMBER() over (order by (select null)) RN
    from success
    group by s_bucket
),
fail as (
    select *,
           DATEADD(day, -1 * ROW_NUMBER() over (order by fail_date), fail_date) f_bucket,
           ROW_NUMBER() over (order by fail_date) F_RN
    from failed
    where year(fail_date) != '2018'
),
f_final as (
    select min(fail_date) start_date,
           max(fail_date) end_date,
           ROW_NUMBER() over (order by (select null)) RN
    from fail
    group by f_bucket
)
select 'Succeeded' as period_date,
       format(start_date, 'dd-MM-yyyy') as start_date,
       format(end_date, 'dd-MM-yyyy') as end_date
from s_final
union all
select 'failed' as period_date,
       format(start_date, 'dd-MM-yyyy') as start_date,
       format(end_date, 'dd-MM-yyyy') as end_date
from f_final
order by start_date
@user-gq6cg3ls7f 2 months ago
another approach using SQL Server:

with U1 as (
    select User1, count(*) cnt
    from facebook
    group by User1
),
U2 as (
    select User2, count(*) cnt
    from facebook
    group by User2
),
final as (
    select coalesce(User1, User2) as Users,
           (isnull(U1.cnt, 0) + isnull(U2.cnt, 0)) as Popularity_Percentage
    from U1
    full join U2 on U1.User1 = U2.User2
)
select Users,
       round(cast(round(Popularity_Percentage, 2) as decimal) * 100
             / count(user) over (order by user rows between unbounded preceding and unbounded following), 3) total_users
from final
@nagumallasravansai249 2 months ago
you are awesome buddy!
@Tech_world-bq3mw 2 months ago
My solution:

# df - original, df1 - with range numbers
df_missing = df1.join(df, df1.id == df.id, how='left_anti')
df_missing.show()
@vivekdutta7131 2 months ago
with cte AS (
    select question_id,
           isnull(sum(case when action = 'show' then 1 end), 0) as show_count,
           isnull(sum(case when action = 'answer' then 1 end), 0) as answer_count
    from survey_log
    group by question_id
),
ctf as (
    select question_id,
           (answer_count / show_count) as answer_rate,
           row_number() over (order by (answer_count / show_count) desc, question_id asc) as rn
    from cte
)
select question_id
from ctf
where rn = 1
@torrentdownloada 2 months ago
Hello, I think you might have missed this scenario: what if the IDs are some random integers or strings? Please review my solution and let me know if my understanding is correct. Thanks.

with cte as (
    select *, row_number() over () rn
    from swap_id
)
select name, id,
       case when rn % 2 = 0 then lag(id) over ()
            when rn % 2 <> 0 then coalesce(lead(id) over (), id)
            else id
       end swapped
from cte
@Hope-xb5jv 2 months ago
with cte as (
    select question_id,
           sum(case when action = 'answer' then 1 else 0 end) / sum(case when action = 'show' then 1 else 0 end) as answer_rate
    from survey_log
    group by question_id
),
cte2 as (
    select question_id,
           row_number() over (order by question_id) as id,
           dense_rank() over (order by answer_rate desc) as rate
    from cte
)
select question_id
from cte2
where id = 1 and rate = 1
@Aman-lv2ee 2 months ago
select name,
       case when id % 2 = 0 then lag(id, 1, id) over (order by id)
            when id % 2 != 0 then lead(id, 1, id) over (order by id)
       end as swap_ids
from swap_id;
@AishwaryaAishwarya-e4l 2 months ago
Thankyouuuuuuuuuu
@DEwithDhairy 2 months ago
Glad you found it useful
@rahulmittal116 2 months ago
Nice🤙
@DEwithDhairy 2 months ago
Thanks Rahul
@Ak12345-g 2 months ago
with cte as (
    select t2name,
           STRING_AGG(topping_name, ',') as pizza,
           sum(cost) as total_cost
    from (
        select t1.*, t2.topping_name as t2name, t2.cost as t2cost
        from toppings t1
        cross join toppings t2
        where t1.topping_name <> t2.topping_name
    ) a
    group by t2name
)
select pizza, total_cost
from cte
order by total_cost;
@DEwithDhairy 2 months ago
Great approach.
@Ak12345-g 2 months ago
@@DEwithDhairy Thank you so much 🥰
@Aman-lv2ee 2 months ago
Nice video, share more Python questions for DE, thanks
@DEwithDhairy 2 months ago
Thanks, it's in the pipeline
@vivekdutta7131 2 months ago
with cte AS (
    select *, row_number() over (order by topping_name) as rn
    from toppings
),
ctf AS (
    select t.topping_name as topping_name1,
           g.topping_name as topping_name2,
           t.cost as cost
    from cte t, cte g
    where 1 = 1 and t.rn <> g.rn
)
select string_agg(topping_name1, ',') as pizza,
       sum(cost) as total_cost
from ctf
group by topping_name2