Big Data Engineer Live Mock Interview | Topics: PySpark, Delta Lake, Data Profiling, Data Governance

Views: 30,825

1 day ago

Comments: 23

@AshishStudyDE 7 months ago

The interviewer (Chandrali) definitely has a deep understanding of the topics. It would be even more helpful to get some interview videos for senior positions in which she is the one being interviewed on scenario-based questions.

@sampadapatil5225 2 months ago

Salting is the correct approach for data skew: it adds a salt to the key, which helps spread the data out evenly. Repartition, however, is not the fix in a data skew scenario. It just shuffles the data again, so when the partitioning follows the skewed key, the oversized chunk is still there and the data is still not partitioned properly.
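The salting idea in this comment can be sketched without a cluster. The following is a minimal pure-Python simulation, not Spark code: `partition_of` is a hypothetical stand-in for a hash partitioner, and the partition count, salt range, and row counts are made-up values. It only illustrates why appending a random salt to a hot key spreads its rows over several partitions.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count
NUM_SALTS = 8       # hypothetical salt range

def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner (assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def partition_sizes(keys):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]

# Skewed input: one hot key owns 9,000 of the 10,000 rows.
rows = ["hot"] * 9000 + [f"key{i}" for i in range(1000)]

# Salting: append a random suffix so the hot key becomes several distinct keys.
rng = random.Random(0)
salted_rows = [f"{key}_{rng.randrange(NUM_SALTS)}" for key in rows]

plain_sizes = partition_sizes(rows)
salted_sizes = partition_sizes(salted_rows)
print("unsalted:", plain_sizes)
print("salted:  ", salted_sizes)
```

Without the salt, every "hot" row hashes to the same partition. In a real job the salted key is typically used for a first aggregation or join, and the salt is stripped for a second, much smaller aggregation.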

@adityanjsg99 1 month ago

A high-quality discussion. The interviewer asked very conceptual questions.

@Jimmy-jc6pb 1 month ago

One of the most knowledgeable interviews.

@rutvikkokane2997 3 months ago

Really insightful video. As a fresher, I got to learn a lot.

@MeowUniX 8 months ago

Data governance includes access permissions for a user or a group: we can restrict a user from reading a specific location, and we can also apply data masking. Second, Unity Catalog also stores metadata.

@ravulapallivenkatagurnadha9605 8 months ago

Please continue the Python and SQL series.

@SusheelGajbinkar 6 months ago

Insightful! Thank you, Chandrali and Abhirup.

@avicool08 1 month ago

Professional interview 👍

@jithindev9185 7 months ago

If repartition happened at the driver, do you think gigabytes of data would travel from all the executors to the driver, and the driver would then divide it and hand it back to the executors? If so, your understanding of Spark is wrong.

@gawlianilnrayan 14 days ago

In repartition, the data is reshuffled across all available partitions.

@jithindev9185 14 days ago

Yeah, it won't go to the driver.
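The point made in this thread, that a repartition moves rows between partitions without collecting them in one place, can be illustrated with a small pure-Python sketch. It is a simplified model, not Spark's implementation: `round_robin_repartition` is a hypothetical helper, and using the source partition's index as the starting offset is an assumption made for determinism.

```python
def round_robin_repartition(partitions, num_targets):
    """Simplified model of a round-robin repartition (illustrative only).

    Map side: each source partition splits its own rows into one bucket
    per target. Reduce side: each target collects its bucket from every
    source. No step gathers all rows in a single central place.
    """
    shuffle_blocks = []
    for start, rows in enumerate(partitions):
        buckets = [[] for _ in range(num_targets)]
        for i, row in enumerate(rows):
            # Assumption: the starting offset is the source partition index.
            buckets[(start + i) % num_targets].append(row)
        shuffle_blocks.append(buckets)

    return [
        [row for buckets in shuffle_blocks for row in buckets[t]]
        for t in range(num_targets)
    ]

# Three uneven source partitions holding 12 rows in total.
uneven = [list(range(0, 10)), list(range(10, 12)), []]
balanced = round_robin_repartition(uneven, 4)
sizes = [len(p) for p in balanced]
print(sizes)  # [3, 4, 3, 2]
```

Each source partition only needs its own rows to decide where they go, which is why the driver has to coordinate the work but does not have to hold the data.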

@abhishek_kumar0709 1 month ago

Which company's interview is this?

@gauravgaikwad2939 3 months ago

I think the answers were not on point. For example, take the difference between DataFrame and Dataset: they are basically the same, with slight differences. One thing I have observed is that a Dataset provides type safety and compile-time checks, meaning errors (e.g., wrong column names or data types) are caught during compilation, whereas DataFrame errors occur at runtime because of its untyped nature. If you ask me which is preferred, in my opinion the DataFrame is, because the Dataset comes with overhead: DataFrame rows stay in Tungsten's binary format, while typed Dataset operations have to convert rows to and from JVM objects, which is slower. So using a Dataset helps cut down on developer mistakes, but it comes at the extra cost of casting and expensive serialization.

@skybluelearner4198 6 months ago

Her question on rank and dense rank was incomplete: the candidate was never told what the ranking should be based on.

@AkashRusiya 2 months ago

Right. Also, the PARTITION BY clause was not even required if she actually wanted to understand the difference among the three functions based on the given data.
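To make the difference concrete, here is a minimal pure-Python sketch of the three ranking functions over a single ordering column and no partitioning. The thread does not name the third function, so `row_number` is an assumption, and the helper `window_ranks` and the sample scores are hypothetical.

```python
def window_ranks(values):
    """Return (value, row_number, rank, dense_rank) tuples, ordered descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank skips ahead to the row position after ties
            dense += 1         # dense_rank always increases by exactly one
            previous = value
        result.append((value, row_number, rank, dense))
    return result

scores = [90, 85, 90, 70]
ranked = window_ranks(scores)
for row in ranked:
    print(row)
# (90, 1, 1, 1)
# (90, 2, 1, 1)
# (85, 3, 3, 2)
# (70, 4, 4, 3)
```

The tie at 90 is where the three diverge: `row_number` keeps counting, `rank` leaves a gap after the tie, and `dense_rank` does not. All three depend on the ordering column, which is why the question is incomplete without it.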

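@sandhu01 2 months ago

import re

def comp_str(x: str):
    # Expect exactly two whitespace-separated words.
    if len(x.split()) != 2:
        return print("wrong input")
    a = re.search(r"(^[a-zA-Z].*?)(\s[a-zA-Z].*)", x)
    if a is None:  # one of the words does not start with a letter
        return print("wrong input")
    # group(2) begins with the whitespace separator, so index 1 is the
    # first letter of the second word. Remove lower() to match case as well.
    if a.group(1)[0].lower() == a.group(2)[1].lower():
        print("True")
    else:
        print("False")

lis = "Crazy Chocolate"
comp_str(lis)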

@sandhu01 2 months ago

If you don't want to use 're', here is a simpler version:

def comp_str(x: str):
    # Expect exactly two whitespace-separated words.
    a = x.split()
    if len(a) != 2:
        return print("wrong input")
    # Compare the first letters of both words, ignoring case.
    if a[0][0].lower() == a[1][0].lower():
        print("True")
    else:
        print("False")

lis = "Crazy Chocolate"
comp_str(lis)