Salting is the right fix for data skew: it appends a random component to the key, so the skewed values get spread evenly across partitions. Repartition is not used in a skew scenario, because it just shuffles the data again; the oversized group of rows for the hot key still ends up together, so the data is still not partitioned evenly.
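The effect the comment describes can be sketched in plain Python (a toy simulation of hash partitioning, not actual Spark; the key names, row counts, and partition/salt sizes are invented for illustration):

```python
import random
from collections import Counter

def partition_counts(keys, num_partitions, salt_buckets=1):
    """Count how many rows land in each partition when keys are
    hash-partitioned; salt_buckets > 1 simulates salting the key."""
    counts = Counter()
    for k in keys:
        salt = random.randrange(salt_buckets)       # salting adds a random component
        counts[hash((k, salt)) % num_partitions] += 1
    return counts

# Skewed input: one hot key dominates the dataset.
keys = ["hot"] * 9000 + ["a", "b", "c", "d"] * 250

random.seed(0)
no_salt = partition_counts(keys, num_partitions=8)                  # all "hot" rows collide
salted = partition_counts(keys, num_partitions=8, salt_buckets=8)   # "hot" rows spread out

print(max(no_salt.values()))  # one partition holds at least the 9000 "hot" rows
print(max(salted.values()))   # the largest partition is far smaller after salting
```

In a real Spark job the same idea means concatenating a random suffix onto the join/group key (and replicating the other side of a join accordingly), whereas `repartition(n)` by that key would still hash every "hot" row to the same partition.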
@AshishStudyDE · 5 months ago
The interviewer (Chandrali) definitely has a deep understanding of the topics. It would be even more helpful to get some interview videos for a senior post where she is the one answering the scenario-based questions.
@rutvikkokane2997 · 1 month ago
Really insightful video; as a fresher I got to learn a lot.
@MeowUniX · 6 months ago
Data governance includes access permissions for a user or a group: we can restrict a user from reading a specific location, or hide sensitive values from them (data masking). Second thing: Unity Catalog also stores metadata.
@ravulapallivenkatagurnadha9605 · 6 months ago
Please continue the Python and SQL series.
@SusheelGajbinkar · 4 months ago
Insightful! Thank you, Chandrali and Abhirup.
@jithindev9185 · 5 months ago
If repartition happened at the driver, do you think the entire GBs of data would travel from all the executors to the driver, and the driver would then divide it and hand it back to the executors? Then your understanding of Spark is wrong: repartition is a shuffle between executors, not a round trip through the driver.
@gauravgaikwad2939 · 1 month ago
I think the answers are not on point. For example, the difference between DataFrame and Dataset: they are basically the same, with a slight difference. One thing I have observed is that a Dataset provides type safety and compile-time checks, meaning errors (e.g., wrong column names or data types) are caught when the code compiles; with a DataFrame, those errors occur at runtime due to its untyped nature. If you ask me which is more preferred, in my opinion the DataFrame, because the Dataset comes with overhead: DataFrame serialization is managed by Tungsten's binary format, while the Dataset's serialization is managed by the Java serializer (slower). So using a Dataset will help us cut down on developer mistakes, but it comes at the extra cost of casting and expensive serialization.
@skybluelearner4198 · 4 months ago
Her question on rank and dense rank was not complete: rank and dense rank based on what ordering was never told to the candidate.
@AkashRusiya · 1 month ago
Right. Also, the PARTITION BY clause was not even required if she actually wanted to test the difference among the three functions on the given data.
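To make the distinction the thread is debating concrete, here is a small pure-Python sketch of how rank() and dense_rank() number ties, SQL-style, over a single descending ordering (the sample scores are invented):

```python
def rank_and_dense_rank(values):
    """Return (rank, dense_rank) lists for values ordered descending."""
    ordered = sorted(values, reverse=True)
    ranks, dense = [], []
    for i, v in enumerate(ordered):
        if i and v == ordered[i - 1]:
            ranks.append(ranks[-1])                      # ties share the same rank...
            dense.append(dense[-1])
        else:
            ranks.append(i + 1)                          # ...but rank() skips positions after a tie
            dense.append(dense[-1] + 1 if dense else 1)  # dense_rank() never leaves gaps
    return ranks, dense

r, d = rank_and_dense_rank([100, 90, 90, 80])
print(r)  # [1, 2, 2, 4] — rank jumps from 2 to 4
print(d)  # [1, 2, 2, 3] — dense_rank stays consecutive
```

This is exactly why the ordering matters for the interview question: without knowing which column the window sorts on, the candidate cannot say where the ties, and therefore the gaps, occur.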
@sandhu012 · 3 days ago
import re

def comp_str(x: str):  # takes a two-word string, not a list
    if len(x.split()) != 2:
        print("wrong input")
        return
    a = re.search(r"(^[a-zA-Z].*?)(\s[a-zA-Z].*)", x)
    # group(2) starts with the space, so index 1 is the second word's first letter
    if a.group(1)[0].lower() == a.group(2)[1].lower():  # remove lower() if you need to match case as well
        print("True")
    else:
        print("False")

lis = "Crazy Chocolate"
comp_str(lis)
@sandhu012 · 3 days ago
If you don't want to use 're', here is another, simpler version:

def comp_str(x: str):
    if len(x.split()) != 2:
        print("wrong input")
        return
    a = x.split()
    if a[0][0].lower() == a[1][0].lower():
        print("True")
    else:
        print("False")

lis = "Crazy Chocolate"
comp_str(lis)
@AMM2012 · 2 months ago
interesting
@hdr-tech4350 · 4 months ago
Data lake vs Delta Lake
Unity Catalog
Data profiling
Data governance