Salting is the right fix for data skew: it appends a random component to the key, so the skewed values get spread evenly across partitions. Repartition is not used in a skew scenario, because it just shuffles the data again; the oversized group of rows for the hot key still ends up together, so the data is still not partitioned evenly.
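The effect the comment describes can be sketched in plain Python (a toy simulation of hash partitioning, not actual Spark; the key names, row counts, and partition/salt sizes are invented for illustration):

```python
import random
from collections import Counter

def partition_counts(keys, num_partitions, salt_buckets=1):
    """Count how many rows land in each partition when keys are
    hash-partitioned; salt_buckets > 1 simulates salting the key."""
    counts = Counter()
    for k in keys:
        salt = random.randrange(salt_buckets)       # salting adds a random component
        counts[hash((k, salt)) % num_partitions] += 1
    return counts

# Skewed input: one hot key dominates the dataset.
keys = ["hot"] * 9000 + ["a", "b", "c", "d"] * 250

random.seed(0)
no_salt = partition_counts(keys, num_partitions=8)                  # all "hot" rows collide
salted = partition_counts(keys, num_partitions=8, salt_buckets=8)   # "hot" rows spread out

print(max(no_salt.values()))  # one partition holds at least the 9000 "hot" rows
print(max(salted.values()))   # the largest partition is far smaller after salting
```

In a real Spark job the same idea means concatenating a random suffix onto the join/group key (and replicating the other side of a join accordingly), whereas `repartition(n)` by that key would still hash every "hot" row to the same partition.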
@AshishStudyDE · 5 months ago
The interviewer (Chandrali) definitely has a deep understanding of the topics. It would be even more helpful to get some interview videos for a senior post where she is the one answering the scenario-based questions.
@rutvikkokane2997 · 1 month ago
Really insightful video; as a fresher I got to learn a lot.
@MeowUniX · 6 months ago
Data governance includes access permissions for a user or a group: we can restrict a user from reading a specific location, or hide sensitive values from them (data masking). Second thing: Unity Catalog also stores metadata.
@ravulapallivenkatagurnadha9605 · 6 months ago
Please continue the Python and SQL series.
@SusheelGajbinkar · 4 months ago
Insightful! Thank you, Chandrali and Abhirup.
@jithindev9185 · 5 months ago
If repartition happened at the driver, do you think the entire GBs of data would travel from all the executors to the driver, and the driver would then divide it and hand it back to the executors? Then your understanding of Spark is wrong: repartition is a shuffle between executors, not a round trip through the driver.
@gauravgaikwad2939 · 1 month ago
I think the answers are not on point. For example, the difference between DataFrame and Dataset: they are basically the same, with a slight difference. One thing I have observed is that a Dataset provides type safety and compile-time checks, meaning errors (e.g., wrong column names or data types) are caught when the code compiles; with a DataFrame, those errors occur at runtime due to its untyped nature. If you ask me which is more preferred, in my opinion the DataFrame, because the Dataset comes with overhead: DataFrame serialization is managed by Tungsten's binary format, while the Dataset's serialization is managed by the Java serializer (slower). So using a Dataset will help us cut down on developer mistakes, but it comes at the extra cost of casting and expensive serialization.
@skybluelearner4198 · 4 months ago
Her question on rank and dense rank was not complete: rank and dense rank based on what ordering was never told to the candidate.
@AkashRusiya · 1 month ago
Right. Also, the PARTITION BY clause was not even required if she actually wanted to test the difference among the three functions on the given data.
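To make the distinction the thread is debating concrete, here is a small pure-Python sketch of how rank() and dense_rank() number ties, SQL-style, over a single descending ordering (the sample scores are invented):

```python
def rank_and_dense_rank(values):
    """Return (rank, dense_rank) lists for values ordered descending."""
    ordered = sorted(values, reverse=True)
    ranks, dense = [], []
    for i, v in enumerate(ordered):
        if i and v == ordered[i - 1]:
            ranks.append(ranks[-1])                      # ties share the same rank...
            dense.append(dense[-1])
        else:
            ranks.append(i + 1)                          # ...but rank() skips positions after a tie
            dense.append(dense[-1] + 1 if dense else 1)  # dense_rank() never leaves gaps
    return ranks, dense

r, d = rank_and_dense_rank([100, 90, 90, 80])
print(r)  # [1, 2, 2, 4] — rank jumps from 2 to 4
print(d)  # [1, 2, 2, 3] — dense_rank stays consecutive
```

This is exactly why the ordering matters for the interview question: without knowing which column the window sorts on, the candidate cannot say where the ties, and therefore the gaps, occur.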
@sandhu012 · 3 days ago
import re

def comp_str(x: str):  # takes a two-word string, not a list
    if len(x.split()) != 2:
        print("wrong input")
        return
    a = re.search(r"(^[a-zA-Z].*?)(\s[a-zA-Z].*)", x)
    # group(2) starts with the space, so index 1 is the second word's first letter
    if a.group(1)[0].lower() == a.group(2)[1].lower():  # remove lower() if you need to match case as well
        print("True")
    else:
        print("False")

lis = "Crazy Chocolate"
comp_str(lis)
@sandhu012 · 3 days ago
If you don't want to use 're', here is another, simpler version:

def comp_str(x: str):
    if len(x.split()) != 2:
        print("wrong input")
        return
    a = x.split()
    if a[0][0].lower() == a[1][0].lower():
        print("True")
    else:
        print("False")

lis = "Crazy Chocolate"
comp_str(lis)
@AMM2012 · 2 months ago
interesting
@hdr-tech4350 · 4 months ago
Data lake vs Delta Lake
Unity Catalog
Data profiling
Data governance