Trending Big Data Interview Question - Number of Partitions in your Spark Dataframe

23,229 views

Sumit Mittal

1 day ago

Hi, I am so glad that we are starting this big data interview series.
The first question is a trending one in PySpark interviews: "Let's say you create a Spark DataFrame by loading a file. How many partitions will your DataFrame have?"
Several factors determine this, and in this video I have covered all the possible scenarios.
I am sure you will truly enjoy it.
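
To see this in practice, here is a minimal PySpark sketch; the file name is hypothetical, and the defaults shown can vary by cluster and Spark version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Two settings that drive the initial partition count when reading files.
# 128 MB and total-cores are the usual defaults, but clusters can differ.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # e.g. "134217728b" (128 MB)
print(spark.sparkContext.defaultParallelism)                # typically the total core count

# "sales.csv" is a hypothetical file; the partition count you see depends on
# its size relative to maxPartitionBytes and on the available parallelism.
df = spark.read.option("header", True).csv("sales.csv")
print(df.rdd.getNumPartitions())
```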
big data interview
apache spark interview
pyspark interview
frequently asked big data questions
data engineering interview questions and answers
data engineering interview
big data interview questions & answers
top 10 big data interview questions with answers
number of partitions in spark dataframe

Comments: 37
@Anonymous-fe2ep 9 months ago
Hello Sir, I was asked the following questions for an AWS Developer role. Please make a video on these. Thanks.
Q1. We have *sensitive data* coming in from a source and an API. Help me design a pipeline to bring in the data, clean and transform it, and park it.
Q2. So where does PySpark come into play in this?
Q3. Which libraries will you need to import to run the above Glue job?
Q4. What are shared variables in PySpark? (a sketch follows below)
Q5. How do you optimize Glue jobs?
Q6. How do you protect sensitive data in your data?
Q7. How do you identify sensitive information in your data?
Q8. How do you provision an S3 bucket?
Q9. How do I check if a file has been changed or deleted?
Q10. How do I protect my file containing sensitive data stored in S3?
Q11. How does KMS work?
Q12. Do you know S3 Glacier?
Q13. Have you worked on S3 Glacier?
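
On Q4 specifically, a minimal sketch of PySpark's two shared-variable types, broadcast variables and accumulators; the lookup data is made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Broadcast variable: read-only data shipped once to every executor
# instead of being serialized with each task.
countries = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: executors can only add to it; the driver reads the total.
bad_rows = sc.accumulator(0)

def resolve(record):
    code, amount = record
    if code not in countries.value:
        bad_rows.add(1)
        return None
    return (countries.value[code], amount)

rdd = sc.parallelize([("IN", 10), ("US", 20), ("XX", 5)])
print(rdd.map(resolve).filter(lambda r: r is not None).collect())
print("rows with unknown country codes:", bad_rows.value)  # 1
```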
@arunsundar3739 3 months ago
I was curious why Spark handles smaller files differently, and I also had a fixed view that the partition size is 128 MB at all times. That view of mine is debunked now. Beautifully explained, thank you very much sir :)
@tarunpothala6856 11 months ago
Sir, great to see such scenarios explained clearly. We would love to watch some interview questions on Databricks. Kindly post them.
@AbhishekVerma-hx8bq 11 months ago
Excellent explanation, highly informative!
@Rajesherukulla 11 months ago
Was literally waiting for your video series... Congrats on a great start, Sumit sir.
@kirtisingh7698 11 months ago
Thank you Sir for explaining the answers with a scenario. It's really helpful.
@eyecaptur 11 months ago
Great explanation sir as always
@sufiyaanbhura6343 11 months ago
Thank you sir!
@25683687 11 months ago
Really very well explained!
@Ronak-Data-Engineer 11 months ago
Very well explained
@siddheshkankal7567 11 months ago
Thank you so much for the great, detailed explanation. Can you discuss more scenarios like this? Interviewers often ask how much big data you have worked with, what cluster configuration you used for it and how you decided on it, what the optimized solution would be, and what kind of data and what size. Also hoping for Spark optimization techniques in an upcoming video.
@kavyasri6654 11 months ago
Thank you Sir. Also, please continue the advanced SQL playlist; I have completed both the basic and advanced playlists and they are very helpful.
@RohanKumar-mh3pt 11 months ago
Very insightful. Please cover more scenario-based questions on Spark internals.
@himanshupatidar9413 11 months ago
Thanks for the simplified explanation. Please make the next video on deciding the configuration for our jobs, e.g. which config is better: (i) 10 executors with 4 cores and 4 GB RAM each, or (ii) 5 executors with 8 cores and 8 GB RAM each? There is no proper explanation of this concept anywhere.
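
Both options in this comment add up to the same totals (40 cores, 40 GB), so the trade-off is per-executor behaviour rather than aggregate capacity. A hedged sketch of how either shape would be requested, using only the numbers from the comment:

```python
from pyspark.sql import SparkSession

# Option (i):  10 executors x 4 cores x 4 GB -> 40 cores, 40 GB total
# Option (ii):  5 executors x 8 cores x 8 GB -> 40 cores, 40 GB total
# Aggregate capacity is identical; the difference is per-executor: fewer,
# fatter executors mean fewer JVMs but more GC pressure and more tasks
# sharing one heap. A common rule of thumb is roughly 4-5 cores per executor.
spark = (
    SparkSession.builder
    .appName("executor-sizing")                 # hypothetical app name
    .config("spark.executor.instances", "10")   # option (i) from the comment
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```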
@arpittapwal4651 11 months ago
Great explanation as always. Thank you Sumit sir, waiting for many such videos in the future 😊
@sumitmittal07 11 months ago
thank you Arpit
@sonurohini6764 4 months ago
Good explanation sir. Please make a video on more possible scenario-based questions like this.
@DEwithDhairy 6 months ago
PySpark Scenario Based Interview Question And Answers: kzbin.info/aero/PLqGLh1jt697zXpQy8WyyDr194qoCLNg_0&si=Ddhve6jjcy0ZvaLV
@virajjadhav6579 11 months ago
Thank you Sir, the start of the series is great. Do we have to explain each answer with scenarios?
@soumikdutta77 11 months ago
Insightful and informative concept, thank you Sir for clearing it out with ease ✅
@sumitmittal07 11 months ago
thank you Soumik
@user-pp4pu8kp7v 3 months ago
Please continue the series
@Momlifeindia 11 months ago
Well explained as always. I was asked the same question in one of the interviews.
@sumitmittal07 11 months ago
that's great to know.
@deepakpatil4419 2 months ago
Hi Sir, thank you for the explanation. I have a situation: I am executing a Databricks pipeline through Airflow. In one of the tasks, I write data from a DataFrame to a path (as Parquet files). The write operation is supposed to create the path daily and write the data into it. The path is being created, but when I check the count after writing, it shows zero. It does not raise any error either, so it is really difficult to identify the issue. However, when I reprocess the same task, it writes the data.
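
One defensive pattern for this situation (a sketch under assumed names, not a root-cause fix): re-read the written path and fail the task when it comes back empty, so Airflow retries instead of silently succeeding:

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)  # stand-in for the real pipeline's DataFrame

# Hypothetical dated output path, created fresh each day.
output_path = f"/mnt/output/daily/{date.today().isoformat()}"

df.write.mode("overwrite").parquet(output_path)

# Verify against what actually landed on storage, not the in-memory DataFrame.
written = spark.read.parquet(output_path).count()
if written == 0:
    raise ValueError(f"No rows written to {output_path}; failing so Airflow retries")
```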
@ritumdutta2438 11 months ago
A very interesting start to an exciting series :) ... appreciate all your effort. Just wanted to confirm one thing: in the case of RDDs, is the partition size always 128 MB (with what you explained applying to DataFrames/higher-level APIs)?
@sumitmittal07 11 months ago
That's correct. In the case of an RDD it depends on the block size of the underlying filesystem; in the case of HDFS it will be 128 MB.
@localmartian9047 4 months ago
@@sumitmittal07 And in the case of an object store like S3, will it be the default parallelism or the number of splits in the source file from S3?
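
To make the RDD-level behaviour concrete, a small sketch; the HDFS path is hypothetical, and for object stores the split size comes from the block size the connector reports (for s3a, fs.s3a.block.size) rather than a physical filesystem block:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# For RDDs, sc.textFile delegates to Hadoop's InputFormat: roughly one split
# per filesystem block (128 MB on stock HDFS), never fewer than minPartitions.
rdd = sc.textFile("hdfs:///data/logs.txt", minPartitions=4)  # hypothetical path
print(rdd.getNumPartitions())
```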
@anandattagasam7037 11 months ago
Hi sir, I wanted to confirm: are you saying the number of partitions is based on CPU cores? Like you said, for 1 GB of data there would be 8 partitions due to parallelism, and then it would be 4 partitions, correct? Please correct me if I am wrong.
@bharanidharanm2653 18 days ago
The 3rd scenario is not clear. Are we updating any configuration setting to avoid the small files problem?
@vusalbabashov8242 10 months ago
In the example I have, df.rdd.getNumPartitions() returns 200, which seems to be the default. I have 160 cores available in the cluster. How should we understand this in light of what you say in the video? I feel this part is missing. Also, when should we use spark.conf.set("spark.sql.shuffle.partitions", "auto")?
@rohitshingare5352 6 months ago
In the context of the video he explained only the initial partitioning when data is read. In your case the data has been shuffled, which is why Spark created 200 partitions, the default value of spark.sql.shuffle.partitions.
@dipeshchaudhary2188 2 months ago
As per my understanding, 160 tasks will be performed in parallel and the remaining 40 tasks will wait in the queue. Those 40 tasks will be performed when any 40 of the 160 cores become available again.
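
A small sketch showing both numbers side by side. Note that "auto" for spark.sql.shuffle.partitions is, as far as I know, a Databricks-specific value that requires Adaptive Query Execution; plain Spark expects an integer:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())      # initial partitions: cores / input splits

# Any wide transformation (groupBy, join, ...) repartitions the data to
# spark.sql.shuffle.partitions, which defaults to 200 regardless of core count.
counts = df.groupBy((F.col("id") % 10).alias("bucket")).count()
print(counts.rdd.getNumPartitions())  # 200 unless overridden (AQE may coalesce it)

# Tune it explicitly; with spark.sql.adaptive.enabled=true, Spark can also
# coalesce small shuffle partitions on its own.
spark.conf.set("spark.sql.shuffle.partitions", "64")
```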
@sameersam4476 11 months ago
Sir, I have watched your complete SQL playlist. Can I face the SQL interview now??
@sumitmittal07 11 months ago
Yes definitely
@suvabratagiri9978 3 months ago
Where is the next part?