pyspark scenario based interview questions and answers

2,462 views

DEwithDhairy

A day ago

pyspark scenario based interview questions and answers
pyspark interview questions and answers
spark interview questions and answers
Create DataFrame Code :
====================
data = [ ('c1', 'New York', 'Lima'),
('c1', 'London', 'New York'),
('c1', 'Lima', 'Sao Paulo'),
('c1', 'Sao Paulo', 'New Delhi'),
('c2', 'Mumbai', 'Hyderabad'),
('c2', 'Surat', 'Pune'),
('c2', 'Hyderabad', 'Surat'),
('c3', 'Kochi', 'Kurnool'),
('c3', 'Lucknow', 'Agra'),
('c3', 'Agra', 'Jaipur'),
('c3', 'Jaipur', 'Kochi')]
schema = "customer string , start_location string , end_location string"
df = spark.createDataFrame(data = data , schema = schema)
df.show()
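For reference, df.show() on this data should print:

+--------+--------------+------------+
|customer|start_location|end_location|
+--------+--------------+------------+
|      c1|      New York|        Lima|
|      c1|        London|    New York|
|      c1|          Lima|   Sao Paulo|
|      c1|     Sao Paulo|   New Delhi|
|      c2|        Mumbai|   Hyderabad|
|      c2|         Surat|        Pune|
|      c2|     Hyderabad|       Surat|
|      c3|         Kochi|     Kurnool|
|      c3|       Lucknow|        Agra|
|      c3|          Agra|      Jaipur|
|      c3|        Jaipur|       Kochi|
+--------+--------------+------------+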
Need Help? Connect With me 1:1 - topmate.io/dew...
Let's connect on LinkedIn: / dhirajgupta141
top interview question and answer in pyspark:
• top interview question...
PySpark Installation and Setup: • Spark Installation | P...
DSA In Python Interview Series: • dsa for data engineer ...
PySpark Interview Series: • pyspark interview ques...
Pandas Interview Series: • pandas interview quest...
SQL Interview Series: • sql interview question...
#freshworks #deloitte #zs #fang #pyspark #sql #interview #dataengineers #dataanalytics #datascience #StrataScratch #Facebook #data #dataengineeringinterview #codechallenge #datascientist #pyspark #CodingInterview
#dsafordataguy #dewithdhairy #DEwithDhairy #dhiarjgupta #leetcode #topinterviewquestion

Comments: 11
@Tech.S7 2 months ago
Thanks for the informative stuff. Instead of specifying all the conditions in the join, we can specify just one condition (I mean, the AND/OR conditions are not required). It works and fetches the expected output. Cheers!!
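A minimal sketch of that simplification (the origin/final_stop names are illustrative; it works for this dataset because no location name repeats across customers — otherwise the customer column must stay in the join condition):

starts = df.select("customer", "start_location")
ends = df.select("customer", "end_location")

# a start location that never shows up as an end location is the trip origin
origin = starts.join(ends, starts.start_location == ends.end_location, "left_anti")
# an end location that never shows up as a start location is the final stop
final_stop = ends.join(starts, ends.end_location == starts.start_location, "left_anti")

origin.join(final_stop, "customer").show()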
@PoojaM22 5 months ago
Awesome bro! Please keep up the good work!
@user-ju4ih5xr8e 6 months ago
Here is my solution:

from pyspark.sql.functions import concat, col

# creating two dataframes for start and end
df1 = df.select('customer', 'start_location').alias('a')
df2 = df.select('customer', 'end_location').alias('b')

# checking for locations that never appear on the other side
df3 = df1.join(df2, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
df4 = df2.join(df1, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')

# final output
df5 = df3.join(df4, ["customer"], 'inner')
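One thing to watch with this approach: concatenating customer and location into a single join key can produce false matches when different column pairs concatenate to the same string (e.g. 'c1' + 'Agra' vs. 'c1A' + 'gra'). Joining on the two columns with an AND condition, or putting a separator inside the concat, avoids that.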
@siddharthchoudhary103 6 months ago
At the end, after finding the unique records, can we use collect_list with a group by on customer and then use indexes as the start and end location in withColumn?
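That can work, but the order inside collect_list isn't guaranteed after a shuffle, so plain indexing can swap the two endpoints. A minimal sketch of a safer variant (the kind tag and the pivot step are illustrative, not from the video):

from pyspark.sql import functions as F

starts = df.select("customer", F.col("start_location").alias("loc"), F.lit("start").alias("kind"))
ends = df.select("customer", F.col("end_location").alias("loc"), F.lit("end").alias("kind"))

# a location appearing exactly once across all starts + ends is a trip endpoint
uniq = (starts.unionByName(ends)
        .groupBy("customer", "loc")
        .agg(F.count("*").alias("cnt"), F.first("kind").alias("kind"))
        .filter("cnt = 1"))

# pivot on the tag instead of indexing into a collected list
uniq.groupBy("customer").pivot("kind", ["start", "end"]).agg(F.first("loc")).show()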
@user-dv1ry5cs7e 5 months ago
WITH t1 AS (SELECT customer, start_loc FROM travel_data
            WHERE start_loc NOT IN (SELECT end_loc FROM travel_data)),
     t2 AS (SELECT customer, end_loc FROM travel_data
            WHERE end_loc NOT IN (SELECT start_loc FROM travel_data))
SELECT t1.customer, t1.start_loc, t2.end_loc
FROM t2 JOIN t1 ON t2.customer = t1.customer
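To run that query from PySpark, the DataFrame from the description has to be registered under the table and column names the query assumes (a travel_data table with start_loc / end_loc columns):

df.withColumnRenamed("start_location", "start_loc") \
  .withColumnRenamed("end_location", "end_loc") \
  .createOrReplaceTempView("travel_data")

spark.sql("""
WITH t1 AS (SELECT customer, start_loc FROM travel_data
            WHERE start_loc NOT IN (SELECT end_loc FROM travel_data)),
     t2 AS (SELECT customer, end_loc FROM travel_data
            WHERE end_loc NOT IN (SELECT start_loc FROM travel_data))
SELECT t1.customer, t1.start_loc, t2.end_loc
FROM t2 JOIN t1 ON t2.customer = t1.customer
""").show()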
@prabhatgupta6415 6 months ago
from pyspark.sql.functions import count

df1 = df.select("customer", "start_location")
df2 = df.select("customer", "end_location")
df3 = df1.union(df2) \
         .groupBy("customer", "start_location") \
         .agg(count("start_location").alias("count")) \
         .filter("count == 1")
df3.alias("a").join(df3.alias("b"), ["customer"], "inner").filter("a.start_location
@tradingwith10k10 3 months ago
No udf, no join, no subquery:

from pyspark.sql.functions import collect_set, array_except

display(df.groupBy("customer")
          .agg(collect_set("start_location").alias("start_list"),
               collect_set("end_location").alias("end_list"))
          .withColumn("start_location", array_except("start_list", "end_list").getItem(0))
          .withColumn("end_location", array_except("end_list", "start_list").getItem(0))
          .drop("start_list", "end_list"))
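One caveat: collect_set de-duplicates, and array_except(...).getItem(0) assumes exactly one leftover location on each side, so this holds only while every customer's trips form a single unbroken chain with distinct stops.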
@user-tm4zj2zz8x 5 months ago
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import StringType

# return the first element of x that never occurs in y
def loc(x, y):
    a = [i for i in x if i not in y]
    return a[0]

loc_udf = udf(loc, StringType())

df1 = df.groupBy('customer').agg(collect_list('start_location').alias('start_list'),
                                 collect_list('end_location').alias('end_list'))
display(df1)

df2 = df1.withColumn('start', loc_udf(df1.start_list, df1.end_list)) \
         .withColumn('end', loc_udf(df1.end_list, df1.start_list)) \
         .drop(*('start_list', 'end_list'))
display(df2)
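The a[0] in the UDF makes the same single-leftover assumption as the array_except solution in the previous comment, which gets the identical result without the Python UDF serialization cost.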