Spark Interview Question | Scenario Based | Merge DataFrame in Spark | LearntoSpark

Views: 43,586

Day ago

In this video, we will learn how to merge two DataFrames in Spark using PySpark. We will discuss all the available approaches. I hope this video helps with your Spark interview preparation.
Blog link to learn more on Spark:
www.learntospark.com
Linkedin profile:
/ azarudeen-s-83652474
FB page:
/ learntospark-104523781...

Comments: 87

@user-co8oc1rm5w 3 years ago

Being a newbie to Spark, I find this very helpful, boss. Keep it up, brother. Looking forward to seeing more like this from you.

@shubne 1 year ago

Now you can use the unionByName() function as well: df3 = df.unionByName(df2, allowMissingColumns=True), followed by df3.show()
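
A minimal sketch of the unionByName() suggestion above, assuming Spark 3.1 or later (the release where the allowMissingColumns parameter was added) and a local SparkSession. The sample rows, the column names, and the helper merged_column_order are made up for illustration; the column order it returns is an assumption based on the Spark documentation, not something shown in the video.

```python
def merged_column_order(left_columns, right_columns):
    """Column order to expect from left.unionByName(right, allowMissingColumns=True).

    Assumption based on the Spark documentation: the left DataFrame's columns
    come first, followed by the columns that exist only on the right.
    """
    return list(left_columns) + [c for c in right_columns if c not in left_columns]


def union_by_name_demo():
    """Run the merge on two tiny, made-up DataFrames (requires PySpark 3.1+)."""
    # Imported lazily so this file still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("merge-demo").getOrCreate()
    df = spark.createDataFrame([("Azar", 25, "M")], ["name", "age", "gender"])
    df2 = spark.createDataFrame([("Hari", 30)], ["name", "age"])  # no "gender"
    # Columns are matched by name; "gender" is filled with null for df2's rows.
    df3 = df.unionByName(df2, allowMissingColumns=True)
    df3.show()
    return df3
```

Calling union_by_name_demo() needs a working PySpark installation; merged_column_order is plain Python and can be used on its own to reason about the resulting schema.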

@4brogames 3 years ago

Real and true. Looking forward to seeing more videos.

@ajaykiranchundi9979 2 years ago

Last approach was incredible. Did not know it was possible to subtract the columns to get the delta!!

@davimonteiropaulelli9649 3 years ago

Excellent video, Azarudeen, you helped me a lot! Thanks!

@arvindyadav1504 3 years ago

Thanks Azar for making such a nice scenario based question series with demo.

@Rajgupta-fh3yt 3 years ago

You are doing a great job and it is helping beginners a lot. Thanks.

@nareshvemula2204 1 year ago

Good videos, thank you. One small note: in the "Automated Approach", if the two DataFrames differ by more than one column and the columns are not in alphabetical order, it won't work. We need to sort the columns when performing the union, like this: df_final = df_file1.select(sorted(df_file1.columns)).union(df_file2.select(sorted(df_file2.columns)))
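
The sorting fix in this comment can be sketched as a small helper. This is only an illustration under stated assumptions: the function name union_sorted is hypothetical, and it assumes both DataFrames already contain the same set of column names (for example after the missing ones have been added), because union() pairs columns by position rather than by name.

```python
def union_sorted(df_a, df_b):
    """Union two DataFrames after putting both into alphabetical column order.

    Works with any object exposing PySpark's DataFrame API (columns, select, union).
    """
    if set(df_a.columns) != set(df_b.columns):
        # Sorting alone cannot help if one side is still missing a column.
        raise ValueError("both DataFrames must have the same column names")
    # union() pairs columns by position, so give both sides the same order.
    left = df_a.select(sorted(df_a.columns))
    right = df_b.select(sorted(df_b.columns))
    return left.union(right)
```

With the names from the comment, the call would be df_final = union_sorted(df_file1, df_file2).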

@nagamohanreddy1602 4 years ago

It's really a nice help, friend.

@SurendraKapkoti 2 years ago

Very clear and useful. Thank you very much

@ankbala 3 years ago

very nice approach and clear explanation! Thank you very much.

@dattaningole8063 3 years ago

Very good explanation of each scenario .... Thanks a lot @Azarudeen Shahul... Keep it up

@AzarudeenShahul 3 years ago

Thanks for your support.. 😊

@sumitkumarsahoo 4 years ago

The tutorial is very lucid and clear

@smileplease6151 2 years ago

Thank you so much for the videos. They definitely increased my hope towards practical learning!!!

@AzarudeenShahul 2 years ago

Thanks for your support 🙂

@aneksingh4496 4 years ago

Good video. Please keep posting new scenario-based questions.

@AzarudeenShahul 4 years ago

Sure, more videos to come.

@abhinavsingh9333 1 year ago

Nice video.. informative.. ❤❤

@AzarudeenShahul 1 year ago

Thanks for all your support

@sasmigration1920 2 years ago

Awesome Azharuddin, your videos are very helpful...Do you take any online coaching?

@krishnakishorenamburi9761 4 years ago

Great work, Azar. I used the automated technique for a data warehousing project.

@AzarudeenShahul 4 years ago

Thanks for your support. Please share with your big data friends.

@heenagirdher6443 2 years ago

Hi Azarudeen. Thank you so much for this video. I have implemented the same question in Spark Scala, but I am having trouble implementing the automated approach there. Could you please help me with this and provide a solution?

@madhavkondapalli785 4 years ago

Thank you so much for these real-time scenario videos, brother. Eagerly waiting for more. All the best.

@AzarudeenShahul 4 years ago

Thanks for your support, please share with your friends as well :)

@souravsardar 3 years ago

Excellent. Thanks for sharing. Can you make a video on reading data from multiple Parquet files with different schemas using schema evolution?

@AzarudeenShahul 3 years ago

Sure, you can expect that soon 👍

@4brogames 3 years ago

Awesome work man. Appreciated

@rohitrathod8150 2 years ago

How did the outer join work? We have the same columns in both DataFrames, so which columns will it take?

@awanishkumar6308 3 years ago

Hi Azarudeen, it's Awanish. Your video is really helpful. I have installed Spark, but when I check from the command prompt by entering pyspark, it says the path is not specified, even though I have made many corrections and checked the environment variables many times.

@DiverseDestinationsDiaries 3 years ago

Hi Shaul, superb content. I have never seen such a clear walkthrough of all the possible approaches on KZbin. Thanks a lot. Your videos are helpful not only for interviews but also for getting our daily jobs done.

@sarjfud 4 years ago

Great example and nice explanation.

@AzarudeenShahul 4 years ago

Thanks for your support :-)

@sravankumar1767 2 years ago

Superb bro 👌 👏

@DiverseDestinationsDiaries 3 years ago

For the same scenario, I added a monotonically increasing ID column to both DataFrames and then did a left join. Is that approach correct?

@ashwinc9867 3 years ago

Can you also make some videos on Spark using Scala? All your videos are brilliant.

@adshakin 3 years ago

Great PySpark tutorial, thanks.

@puggyk4220 3 years ago

I'm trying string (JSON style) -> Parquet for merging DataFrames with different columns.

@pavithrasri1890 3 years ago

Hi, your videos are really helpful. Could you please post a video on Spark incremental data load and merging that data with SCD type 2 (using Scala)?

@monicakannan9731 2 years ago

When merging files in 5 different data formats, how will it work? Your answer would be helpful.

@srinugoriparthi4608 2 years ago

Can you help with merging two DataFrames with a date column and a bigint column? I am getting an error like "failed to merge".

@0305ram 4 years ago

@Azarudeen Shah - In the example, the missing column is the last one in one of the DataFrames, so withColumn automatically adds it at the end. What if the column is missing in the middle of the table structure? Thank you!!

@AzarudeenShahul 4 years ago

Thanks for the question. Before merging, we can select the columns in the same order as the other DataFrame, like df1.select(df2.columns). Hope this helps you :)

@0305ram 4 years ago

@AzarudeenShahul wow.. cool, thanks Azar..

@awanishkumar6308 3 years ago

So can you help me fix it? Can you check? I am ready to share my screen. Please help: I have learned the theory of Hadoop and Spark, but I don't feel confident because I have had no good hands-on practice without an environment.

@AzarudeenShahul 3 years ago

Please mail me a screenshot of the error message and the steps you followed. If needed, we can check via screen sharing.

@sriharipinapaka1030 3 years ago

Awesome, bro! If you can, please do a video on the same scenario using Scala.

@AzarudeenShahul 3 years ago

Sure 👍

@muddy8107 3 years ago

Boss, you are a beauty!!

@priyankas6354 4 years ago

Very nice explanation of the concepts. How can we achieve this in Scala? It would also be great if you explained some scenarios using Scala. Thank you.

@realMujeeb 2 years ago

Hi Sir, in the for loop we see df2 = df2.withColumn(i, lit("null")). Here we are able to update the DataFrame, but how is that possible if DataFrames are immutable?

@murari5921 1 year ago

DataFrames are immutable, which is why we assign the result back to a variable: withColumn returns a new DataFrame and leaves the original untouched.
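
A rough sketch of the loop discussed in this exchange, written to make the rebinding explicit. The helper name add_missing_columns and its null_column parameter are hypothetical and exist only so the logic can be tried without a cluster. One detail worth knowing: lit("null") stores the four-character string "null", whereas lit(None) produces a real SQL NULL, so the sketch uses the latter.

```python
def add_missing_columns(df, wanted_columns, null_column=None):
    """Return a DataFrame that has every column in wanted_columns.

    The DataFrame passed in is never modified. Each withColumn call returns a
    new DataFrame, and the local name `df` is simply rebound to that new object.
    """
    if null_column is None:
        # Imported lazily so the helper can be loaded without PySpark installed.
        from pyspark.sql.functions import lit
        null_column = lit(None)  # a real NULL, unlike lit("null")
    for name in wanted_columns:
        if name not in df.columns:
            df = df.withColumn(name, null_column)  # rebinding, not mutation
    return df
```

After df2_full = add_missing_columns(df2, df1.columns), the original df2 still has its old schema and only df2_full carries the extra columns.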

@DataIsBusiness 1 year ago

Thanks a lot, bro.

@AzarudeenShahul 1 year ago

Thanks for all your support 😊

@DanishAnsari-hw7so 1 year ago

How can we get the code for all the scenarios in this playlist?

@AzarudeenShahul 9 months ago

We provide a GitHub link in the description of all recent videos. You can find notebooks for some scenario-based questions there.

@srinuch9531 4 years ago

Thanks, Azar, for making real-time scenario-based videos. How does the automated process work when the two DataFrames have different column names?

@AzarudeenShahul 4 years ago

Thanks for your support. Are you referring to the same data with different column names? If so, the automated approach is not suitable; try the schema method.

@himanshujain2047 2 years ago

@AzarudeenShahul If the column order differs between the two DataFrames, this will fail. In that case, we can use unionByName, or first do df2 = df2.select(df1.columns) and then apply union.

@localmartian9047 2 years ago

@himanshujain2047 There is also the allowMissingColumns parameter in unionByName, which does the same thing as this video.

@ritikgupta8478 4 months ago

We can use unionByName in Scala.

@anuvindkorivi5262 2 years ago

Hi bro, how can we achieve the same using Scala?

@viswasp3388 3 years ago

nice !

@vineethkyatham536 3 years ago

How do we compare two DataFrames to find the matched and unmatched record values?

@swaroopsuki1322 2 years ago

Can we do this using unionByName?

@ashwinc9867 3 years ago

Can you please share the Scala code for the automated approach?

@Real_Nature_shorts222 2 years ago

Bro, please help me install Spark. Please share a doc with the steps; I have Windows 10.

@SpiritOfIndiaaa 4 years ago

Thank you, but in the automated approach, updating df2 in the for loop won't work in Java.

@SpiritOfIndiaaa 4 years ago

Whatever is changed inside the loop is not accessible outside it. Can you help me handle this?

@ashwinc9867 3 years ago

How can I achieve the same in Scala? I tried the following code, but it is not working. Consider a and b as two DataFrames: val diffcol = a.columns.diff(b.columns) for(i

@fortheknowledge145 3 years ago

Just add a scenario: what if the columns are not in the same order in both DataFrames after the loop? New columns may arrive or some columns may disappear over time, but the merge/union should keep happening daily. We need to select the columns in the right order before doing the union, and we use foldLeft instead of a loop (a more functional programming way).
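
One way to picture the fold-based variant described here, translated to Python, where functools.reduce plays the role of Scala's foldLeft. This is a sketch under assumptions, not the commenter's code: align_and_union and its null_column parameter are hypothetical names, and it relies on Spark widening an untyped NULL column to the other side's type during union (cast the column explicitly if your version complains). It also handles the case where each DataFrame has a column the other lacks.

```python
from functools import reduce


def align_and_union(df_a, df_b, null_column=None):
    """Union two DataFrames after giving both the same columns in the same order."""
    if null_column is None:
        # Imported lazily so the helper can be loaded without PySpark installed.
        from pyspark.sql.functions import lit
        null_column = lit(None)

    def with_nulls(df, names):
        # Fold over the names: each step returns a new DataFrame with one more column.
        return reduce(lambda acc, name: acc.withColumn(name, null_column), names, df)

    # Work out both differences before either DataFrame is extended.
    missing_in_a = [c for c in df_b.columns if c not in df_a.columns]
    missing_in_b = [c for c in df_a.columns if c not in df_b.columns]
    df_a = with_nulls(df_a, missing_in_a)
    df_b = with_nulls(df_b, missing_in_b)
    # union() matches columns by position, so put df_b into df_a's column order.
    return df_a.union(df_b.select(df_a.columns))
```

The fold avoids reassigning a variable inside a loop body, which is the part that tends to cause trouble in Java and Scala where loop-local changes to a val or final reference are not allowed.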

@awanishkumar6308 3 years ago

How can I get your email ID?

@pranayshukla9980 4 years ago

Where is input1.csv fetched from? Have you uploaded a CSV file there?

@sangamrathore7850 3 years ago

Yes Pranay, I created and uploaded the CSV file in my Databricks account.

@sudippandit9855 3 years ago

Awesome content!! Please help me: if we save the output of df1.union(df2).show() to a new DataFrame df and then call df.show(), it doesn't work. Why?

@pshar2931 3 years ago

Your methods will not work if both tables each have an extra column. For example, TableA: name, age, salary; TableB: name, age, gender.

@MyVaibhavraj 1 year ago

We can achieve this by using unionByName: union_df = df1.unionByName(df2, allowMissingColumns=True)

@AzarudeenShahul 1 year ago

Here we discuss Spark versions below 3.1. unionByName works when both DataFrames have the same columns but in a different order. An optional parameter was also added in Spark 3.1 to allow unioning slightly different schemas.