@joeyk2346 · 3 years ago
Hi Rachit - why not use get_dummies with the entire data (before splitting into train/test)? Wouldn't that solve the potential problem? Thanks!
@rachittoshniwal · 3 years ago
No, if you encode before splitting, you're essentially looking at the entire dataset, which would lead to data leakage. If there's a category that appears only in the test set and you run get_dummies on the full data, you'd create a column for it beforehand, which is technically incorrect.
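For context, the column-mismatch problem the thread keeps circling around can be reproduced with a tiny sketch (the data is made up for illustration, not from the video): running pd.get_dummies on each split separately gives the splits different columns whenever a category occurs in only one of them.

```python
# Toy data made up for illustration; "green" appears only in the test split.
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# The two splits end up with different columns, which breaks a model
# trained on train_enc when it is fed test_enc.
print(sorted(train_enc.columns))  # ['color_blue', 'color_red']
print(sorted(test_enc.columns))   # ['color_blue', 'color_green']
```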
@akashkunwar · 2 years ago
@@rachittoshniwal But I was taught that cleaning and encoding should be done before splitting, and that scaling should be performed after splitting. Is that wrong?
@utkar1 · 2 years ago
@@akashkunwar yes, look up data leakage
@BiologyIsHot · a year ago
@@utkar1 Standard one-hot encoding before splitting should not lead to data leakage. In fact, you generally *should* one-hot encode before splitting if you want to be efficient about it. That said, I don't use pd.get_dummies(), so I'm not sure if it does something weird.
@BiologyIsHot · a year ago
@@rachittoshniwal Maybe it's a language thing, but I'm not sure what you mean by "unknown categories". If you mean missing values, that is not data leakage. Data leakage is when some information encoded in a variable incorporates information about the distribution of the test split. For instance, if you have some encoding/processing step that involves min/max calculations or taking the mean of the entire dataset, it should be done using only values from the training split. One-hot encoding doesn't incorporate info about the distribution of categorical predictors in the test split. It sounds like you're theorizing some scenario where uncommon categorical labels are included in only one split. That is an entirely separate issue from data leakage, and if you really have so few observations relative to the abundance of categorical levels, you will never find a satisfactory solution. Encoding after splitting to somehow "avoid" that is *not* a good solution, nor is it an example of data leakage.
@devpython8956 · 2 years ago
Hi Rachit Toshniwal, what a good point. See, to overcome that, after applying pd.get_dummies you have to align the dataframes. If you do that, then you can run any machine learning model and it will be ok. Please see an example below:

X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
@wtfashokjr · 4 months ago
Why is pd.get_dummies not working for me?
@KA00_7 · 5 months ago
Learned something new today. Thank you so much!
@eleonoraocello610 · 2 years ago
Hi Rachit, so what should I do to encode categorical variables while avoiding the mismatch? I'm working on a large dataset (before splitting) and I've already missed some categories.
@rachittoshniwal · 2 years ago
You can use a one-hot encoder
@eleonoraocello610 · 2 years ago
@@rachittoshniwal thanks!
@venkyvenky4715 · 2 months ago
But you can do get_dummies before train_test_split
@atiaspire · 4 years ago
I was doing it wrong the whole time. You saved me!
@rachittoshniwal · 4 years ago
I'm glad I could help! :)
@Chiefempress · 3 years ago
Me 2...🥺🥺
@jeweltilak767 · 2 years ago
Namaste and thank you. Your videos are very helpful.
@sandeshkharat2273 · 3 years ago
What should I use when a dataset column has many categorical values? (like 200)
@rachittoshniwal · 3 years ago
Hi, I believe you could try grouping the "rare" categories into a new "other" category, by saying that any category with fewer than 1% of records will be put into "other".
@sumanthpichika5295 · 3 years ago
To add to the answer for Sandesh: it means we replace every category with fewer than 1% of records with one new category named "other", so that we avoid having too many categories.
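A minimal pandas sketch of this "rare into other" idea (the helper name, threshold value, and toy data are assumptions for illustration):

```python
# Threshold, data values, and the helper name are assumptions for illustration.
import pandas as pd

def group_rare(series: pd.Series, threshold: float = 0.01) -> pd.Series:
    """Replace categories rarer than `threshold` (as a share of rows) with 'other'."""
    freq = series.value_counts(normalize=True)   # share of rows per category
    rare = freq[freq < threshold].index
    return series.where(~series.isin(rare), "other")

s = pd.Series(["a"] * 98 + ["b", "c"])                 # "b" and "c" are 1% each
print(sorted(group_rare(s, threshold=0.02).unique()))  # ['a', 'other']
```

If you're on a recent scikit-learn (1.1+), OneHotEncoder's `min_frequency` parameter does something similar out of the box.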
@ArpitRawat · 2 years ago
@Sandesh check the hashing method.
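Presumably this refers to the hashing trick; a minimal sketch with scikit-learn's FeatureHasher (the `n_features` value and toy strings are assumptions):

```python
# n_features and the toy strings are assumptions; FeatureHasher maps arbitrary
# category strings into a fixed number of columns, so 200+ levels (or unseen
# ones) never change the output width.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["city=london"], ["city=paris"], ["city=tokyo"]])
print(X.shape)  # (3, 8)
```

The trade-off is that hashing is one-way: distinct categories can collide into the same column, and you lose the readable column names that one-hot encoding gives you.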
@Brain_quench · 3 years ago
Thank you.
@BiologyIsHot · a year ago
This video is wrong and I advise you all to ignore it.