@joeyk2346 · 3 years ago
Hi Rachit - why not use get_dummies with the entire data (before splitting into train/test)? Wouldn't that solve the potential problem? Thanks!
@rachittoshniwal · 3 years ago
No, if you encode before splitting, you're essentially looking at the entire dataset, which would lead to data leakage. If there's a category that appears only in the test set and you run get_dummies on the full data, you'd create a column for it beforehand, which is technically incorrect.
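For context, the column-mismatch problem the thread keeps circling around can be reproduced with a tiny sketch (the data is made up for illustration, not from the video): running pd.get_dummies on each split separately gives the splits different columns whenever a category occurs in only one of them.

```python
# Toy data made up for illustration; "green" appears only in the test split.
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# The two splits end up with different columns, which breaks a model
# trained on train_enc when it is fed test_enc.
print(sorted(train_enc.columns))  # ['color_blue', 'color_red']
print(sorted(test_enc.columns))   # ['color_blue', 'color_green']
```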
@akashkunwar · 2 years ago
@@rachittoshniwal But I was taught that cleaning and encoding should be done before splitting, and that scaling should be performed after splitting. Is that wrong?
@utkar1 · 2 years ago
@@akashkunwar yes, look up data leakage
@BiologyIsHot · a year ago
@@utkar1 Standard one-hot encoding before splitting should not lead to data leakage. In fact, you generally *should* one-hot encode before splitting if you want to be efficient about it. That said, I don't use pd.get_dummies(), so I'm not sure if it does something weird.
@BiologyIsHot · a year ago
@@rachittoshniwal Maybe it's a language thing, but I'm not sure what you mean by "unknown categories". If you mean missing values, that is not data leakage. Data leakage is when some information encoded in a variable incorporates information about the distribution of the test split. For instance, if you have some encoding/processing step that involves min/max calculations or taking the mean of the entire dataset, it should be done using only values from the training split. One-hot encoding doesn't incorporate info about the distribution of categorical predictors in the test split. It sounds like you're theorizing some scenario where uncommon categorical labels are included in only one split. That is an entirely separate issue from data leakage, and if you really have so few observations relative to the abundance of categorical levels, you will never find a satisfactory solution. Encoding after splitting to somehow "avoid" that is *not* a good solution, nor is it an example of data leakage.
@devpython8956 · 2 years ago
Hi Rachit Toshniwal, what a good point. See, to overcome that, after applying pd.get_dummies you have to align the dataframes. If you do that, then you can run any machine learning model and it will be ok. Please see an example below:

X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
@wtfashokjr · 4 months ago
Why is pd.get_dummies not working for me?
@KA00_7 · 5 months ago
Learned something new today. Thank you so much!
@eleonoraocello610 · 2 years ago
Hi Rachit, so what should I do to encode categorical variables while avoiding the mismatch? I'm working on a large dataset (before splitting) and I've already missed some categories.
@rachittoshniwal · 2 years ago
You can use a one-hot encoder
@eleonoraocello610 · 2 years ago
@@rachittoshniwal thanks!
@venkyvenky4715 · 2 months ago
But you can do get_dummies before train_test_split
@atiaspire · 4 years ago
I was doing it wrong the whole time. You saved me!
@rachittoshniwal · 4 years ago
I'm glad I could help! :)
@Chiefempress · 3 years ago
Me 2...🥺🥺
@jeweltilak767 · 2 years ago
Namaste and thank you. Your videos are very helpful.
@sandeshkharat2273 · 3 years ago
What should I use when a dataset column has many categorical values? (like 200)
@rachittoshniwal · 3 years ago
Hi, I believe you could try grouping the "rare" categories into a new "other" category, by saying that any category with fewer than 1% of records will be put into "other".
@sumanthpichika5295 · 3 years ago
To add to the answer for Sandesh: it means we replace every category with fewer than 1% of records with one new category named "other", so that we avoid having too many categories.
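A minimal pandas sketch of this "rare into other" idea (the helper name, threshold value, and toy data are assumptions for illustration):

```python
# Threshold, data values, and the helper name are assumptions for illustration.
import pandas as pd

def group_rare(series: pd.Series, threshold: float = 0.01) -> pd.Series:
    """Replace categories rarer than `threshold` (as a share of rows) with 'other'."""
    freq = series.value_counts(normalize=True)   # share of rows per category
    rare = freq[freq < threshold].index
    return series.where(~series.isin(rare), "other")

s = pd.Series(["a"] * 98 + ["b", "c"])                 # "b" and "c" are 1% each
print(sorted(group_rare(s, threshold=0.02).unique()))  # ['a', 'other']
```

If you're on a recent scikit-learn (1.1+), OneHotEncoder's `min_frequency` parameter does something similar out of the box.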
@ArpitRawat · 2 years ago
@Sandesh check the hashing method.
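Presumably this refers to the hashing trick; a minimal sketch with scikit-learn's FeatureHasher (the `n_features` value and toy strings are assumptions):

```python
# n_features and the toy strings are assumptions; FeatureHasher maps arbitrary
# category strings into a fixed number of columns, so 200+ levels (or unseen
# ones) never change the output width.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["city=london"], ["city=paris"], ["city=tokyo"]])
print(X.shape)  # (3, 8)
```

The trade-off is that hashing is one-way: distinct categories can collide into the same column, and you lose the readable column names that one-hot encoding gives you.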
@Brain_quench · 3 years ago
Thank you.
@BiologyIsHot · a year ago
This video is wrong and I advise you all to ignore it.