RDD vs DataFrame vs Datasets | Spark Tutorial Interview Questions



Views: 85,102

Day ago

As part of our Spark interview question series, we want to help you prepare for your Spark interviews. We will discuss various Spark topics such as lineage, reduceBy vs groupBy, YARN client mode vs YARN cluster mode, etc. In this video we cover the difference between RDDs, DataFrames and Datasets.
Please subscribe to our channel.
Here is the link to other Spark interview questions
• Spark Interview Questions
Here is the link to other Hadoop interview questions
• 1.1 Why Spark is Faste...

Comments: 74

@ravinderkarra3187 6 years ago

DataFrame also serializes the data into off-heap storage in a binary format and then performs transformations directly on off-heap memory, since Spark understands the schema. It also provides the Tungsten physical execution back end, which explicitly manages memory and dynamically generates bytecode for expression evaluation. So memory management is better here.
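Editor's note: the idea in the comment above (a known schema allows a compact binary layout that can be read in place) can be sketched with the Python standard library. This is only an analogy, not Spark's or Tungsten's actual row format; the two-column schema, `ROW_FORMAT`, `encode_row` and `read_score` are hypothetical names chosen for illustration.

```python
import pickle
import struct

# Hypothetical two-column schema: id (32-bit int) and score (64-bit float).
# "<" means little-endian with no padding, so a row is always 4 + 8 = 12 bytes.
ROW_FORMAT = "<id"


def encode_row(row_id: int, score: float) -> bytes:
    """Pack one row into a fixed-width binary layout defined by the schema."""
    return struct.pack(ROW_FORMAT, row_id, score)


def read_score(buf: bytes) -> float:
    """Read only the score column at its known byte offset (4),
    without rebuilding the whole row as a Python object."""
    return struct.unpack_from("<d", buf, 4)[0]


row = {"id": 7, "score": 2.5}

# Schema-aware encoding: only the values are stored.
binary_row = encode_row(row["id"], row["score"])

# Generic object serialization: field names and type markers travel with the data.
generic_row = pickle.dumps(row)

print(len(binary_row), len(generic_row))
print(read_score(binary_row))
```

Because the layout is fixed by the schema, a single column can be read straight from the bytes, whereas the generically serialized form has to be deserialized back into an object first.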

@akp7-7 2 years ago

Yes, so is DataFrame faster than Dataset?

@souravsinha5330 1 year ago

Nice and clear explanation. To the point thanks.

@DataSavvy 1 year ago

Glad it was helpful!

@apekshatrivedi8689 3 years ago

Very nice explanation. Your videos really help me while preparing for interviews. Highly recommend. Thank you!

@someshmungikar4466 3 years ago

cooollll great answer sir... thanks !!!

@ganeshdhareshwar6053 4 years ago

Nicely explained. Thank you for your effort in gathering this information and publishing it. These videos are much needed.

@rameshgangabathula6221 4 years ago

Nice explanation. Can you please explain, in another video, how to do checkpointing and resume a failed Spark job (failed due to an action/transformation failure or executor memory being exceeded)?

@bhargavhr1891 6 years ago

Again a very nice video, thanks. It would be great if you provided pseudocode or simple code syntax for each abstraction, so that the understanding is very clear.

@TusharKakaiya 3 years ago

Really helpful content. Much appreciated.

@rahulshandilya880 4 years ago

When to use DataFrame, when to use Dataset, and when to use RDD, Spark SQL and SparkSession?

@arundhingra4536 5 years ago

@Data Savvy - A small correction, at 8:10 you mentioned that we cannot do map, join and other operations on a DataFrame

@sharathchandra5314 5 years ago

Data Savvy should have said that if we use map, join and other operations that take higher-order functions, then we can forget about optimization by the Spark framework.
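Editor's note: the point in the comment above, that an arbitrary function is a black box to an optimizer while a declarative expression is not, can be sketched with a toy example. This is not Spark's Catalyst optimizer; `GreaterThan` and `pushable_column` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class GreaterThan:
    """A declarative predicate (column > value). Its parts are plain data."""
    column: str
    value: Any


def pushable_column(predicate: object) -> Optional[str]:
    """Return the column a planner could filter on early,
    or None when the predicate cannot be inspected."""
    if isinstance(predicate, GreaterThan):
        return predicate.column
    # An arbitrary function exposes nothing about which columns it touches.
    return None


declarative = GreaterThan("age", 30)
opaque: Callable[[dict], bool] = lambda row: row["age"] > 30

print(pushable_column(declarative))
print(pushable_column(opaque))
```

Both predicates express the same filter, but only the declarative one tells the planner which column is involved, which is the information needed for optimizations such as filtering early.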

@naresh5273 5 years ago

Thank you. In my last interview, the interviewer asked me the same question...

@DataSavvy 5 years ago

Thanks Kartik... I am happy this content was useful to you... can you share other questions asked by your interviewer?

@max6447 3 years ago

Thanks, your videos are very useful!

@DataSavvy 3 years ago

Thanks Mayank :)

@nehabansal677 5 years ago

Great content... Very helpful for interviews

@DataSavvy 5 years ago

Thanks... Please watch the full Spark interview series

@raviyadav-dt1tb 8 months ago

Please provide AWS questions and answers

@DataSavvy 7 months ago

Planning that

@raviyadav-dt1tb 7 months ago

@DataSavvy can you please provide interview questions on Scala programming? I have been rejected several times because of Scala programming.

@DataSavvy 7 months ago

I will add that to my list... I need to work out when I can start on that

@raviyadav-dt1tb 7 months ago

@DataSavvy please do, thank you 🙏

@DataSavvy 7 months ago

Thank you

@shubhamkumar-uz7ux 5 years ago

Very informative. Just one thing: the voice is too low in the video.

@RajKumar-zw7vt 5 years ago

Nice video bro...

@DataSavvy 5 years ago

Thanks Raj...

@chiranjeevikatta8116 3 years ago

I am new to Spark and the big data world. I chose to learn PySpark because I am familiar with Python. I learned that Python is not type-safe and does not support Datasets. Can someone say whether PySpark is used in building real-world applications, or do I need to learn Scala/Java? Thanks. Great video.

@DataSavvy 3 years ago

PySpark is used for a lot of real-world projects... I have generally seen people doing ML or data analysis projects using PySpark... Data ingestion teams use Scala... However, this is not always true... It usually boils down to the comfort level of the developer and the team composition

@chiranjeevikatta8116 3 years ago

@DataSavvy thank you. Good work. I took online courses, but I got more clarity after watching your videos.

@DataSavvy 3 years ago

Thanks Chiranjeevi... Very happy to hear that :)

@Pratik0917 5 months ago

Then people aren't using Dataset everywhere?

@RahulRawat-wu1vv 5 years ago

Will it serialize the data or deserialize it? As far as I know, deserialization is the conversion of a byte stream to a Java object. Please correct me if I am wrong.
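Editor's note: on the terminology in the question above, serialization turns an in-memory object into bytes and deserialization turns bytes back into an object. The commenter describes the Java case; the sketch below shows the same two directions with Python's standard `pickle` module, using an illustrative record that is not from the video.

```python
import pickle

# An in-memory object (hypothetical record used only for illustration).
original = {"name": "Asha", "age": 31}

# Serialize: object -> byte stream (what is written to disk or sent over the network).
payload = pickle.dumps(original)

# Deserialize: byte stream -> a new, equal object.
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == original)
```

A shuffle involves both directions: data is serialized on the sending side and deserialized on the receiving side.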

@ajaypratap4025 5 years ago

When to use dataframe and when to use dataset?

@owaisshaikh3983 4 years ago

When you have strict data types, use DataFrame (more convenient), or else Dataset

@alexanderkorchagin67 4 years ago

ERROR! Actually DataFrame, Dataset, RDD is the correct order of performance, from most efficient to least efficient. DataFrame performs better than Dataset because it does not use serialization and deserialization when working with data

@DataSavvy 4 years ago

Hi Alexander, excuse me if the explanation was not clear. The message was that Dataset uses encoders, which are a more efficient way of serializing and deserializing data than Kryo or default serialization... so shuffle operations will be more efficient, as they involve serialization and deserialization. In general there is not much difference in performance between DataFrame and Dataset these days... Could you elaborate on DataFrame not using serialization and deserialization? I did not get what you meant there

@akp7-7 2 years ago

@DataSavvy I recently learnt that in DataFrame serialization is managed by the Tungsten binary format (encoders), whereas in Dataset serialization is managed by Java serialization, so DataFrame performance is a little faster than Dataset

@TheBjjninja 5 years ago

Can you fix your volume please

@anilcvs1 6 years ago

Please show some real-time scenarios in your videos.

@DataSavvy 6 years ago

Hi Anil, thanks for the comment... Give me some examples and I will create a video for them... Please subscribe to the channel

@akshathab.s6751 5 years ago

Hi, a real-time scenario like industry-level data processing. I mean performance tuning when there is a large amount of data to be processed, and likewise which component is to be preferred, like DataFrame or Dataset or RDD... In which situation is which methodology suitable?

@ambikaiyer29 5 years ago

Hi - Can you please share details on why dataset api is not available in Python?

@DataSavvy 5 years ago

Because Python is not type-safe... and Datasets are type-safe
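Editor's note: a small sketch of what "not type-safe" means in the reply above. Python is dynamically typed, so a type mismatch is only detected when the offending line actually runs, whereas the typed Dataset API in Scala or Java relies on the compiler rejecting such code before it runs. The function names below are hypothetical and no Spark API is used.

```python
from typing import List


def total_age(ages: List[int]) -> int:
    """Type hints document the intent, but the interpreter does not enforce them."""
    return sum(ages)


def run_unchecked() -> str:
    """Call total_age with a wrongly typed element and report when it fails."""
    try:
        total_age([31, "forty"])  # accepted when defined; fails only when executed
    except TypeError as exc:
        return f"failed at run time: {type(exc).__name__}"
    return "ok"


print(run_unchecked())
```

In a statically typed language the equivalent call would not compile, which is the guarantee that Datasets build on.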

@vinodmani3900 5 years ago

Thanks @DataSavvy. I was stuck on this for some time, wondering why all the APIs in PySpark are for DataFrames. So it means that in PySpark we need to code with DataFrames, right?

@yeoreumkwon 5 years ago

If I understood correctly, PySpark does not support Datasets because Python is not a type-safe language, right?

@DataSavvy 5 years ago

You are right, my friend... The Dataset philosophy is different from the philosophy of the Python language

@yeoreumkwon 5 years ago

@DataSavvy So I will have to learn Scala to use Spark Datasets. Thank you very much for your effort. I really enjoy your series.

@DataSavvy 5 years ago

If you are particular about using Datasets then yes... use a JVM language, Scala or Java... I am happy that you like the series... Please suggest more topics that you are interested in

@SuperSazzad2010 5 years ago

Hi, please throw some light on the fact that DataFrame makes use of Java serialization. But what is the use of off-heap?

@DataSavvy 5 years ago

DataFrame can use Java serialization or Kryo... Off-heap is used for shuffle

@akp7-7 2 years ago

@DataSavvy So does DataFrame use Java serialization, or is it used by Dataset?

@iftekharkhan3254 11 months ago

Sound is low

@rakeshkumarsharma3920 6 years ago

How can we automate incremental import in Sqoop?

@lokeshmvs 5 years ago

Use "sqoop job --create" to create a job for the incremental import, so that the Sqoop job will store the metadata of the incremental load.

@rakeshkumarsharma3920 5 years ago

@lokeshmvs I am asking: if I have to do an incremental import for 50 tables, and that job will be executed at midnight, then how do I achieve it? Please let me know with examples

@akashchaudhary6953 2 years ago

Sir, take my interview and make me famous.. 💌

@dharmendrabhojwani 5 years ago

Very low voice.

@luckyomprakash8437 5 years ago

What questions can one ask to check Spark RDD experience?

@mikecmw8492 6 years ago

Please remake this video with real examples. For example, open a Spark 2 REPL and load a file full of data. Show how to create an RDD, DF and DS. Then show some operations with each. Having just text on the screen will not help in an interview. Most interviews are now hands-on, especially with big data. Thank you

@DataSavvy 5 years ago

Sure Mike, will do this... adding this to my next steps... I appreciate these suggestions.