Impute missing values using KNNImputer or IterativeImputer

44,248 views

Data School


Need something better than SimpleImputer for missing value imputation?
Try KNNImputer or IterativeImputer (inspired by R's MICE package). Both are multivariate approaches (they take other features into account!)
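Here's a minimal sketch of both imputers on a toy dataset (the values below are made up for illustration, not from the video):

```python
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer is experimental, so it must be explicitly enabled first:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

X = np.array([[20.0, 2.0, 7.25],
              [38.0, 1.0, 71.28],
              [28.0, np.nan, 7.92],
              [35.0, 1.0, np.nan]])

# KNNImputer: fill each missing value with the mean of that feature
# across the n_neighbors most similar rows
print(KNNImputer(n_neighbors=2).fit_transform(X))

# IterativeImputer: model each feature with missing values as a
# function of the other features, and iterate
print(IterativeImputer(random_state=0).fit_transform(X))
```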
👉 New tips every TUESDAY and THURSDAY! 👈
🎥 Watch all tips: • scikit-learn tips
🗒️ Code for all tips: github.com/jus...
💌 Get tips via email: scikit-learn.tips
=== WANT TO GET BETTER AT MACHINE LEARNING? ===
1) WATCH my video series: • Machine learning in Py...
2) ENROLL in my courses: www.dataschool...
3) LET'S CONNECT!
Newsletter: www.dataschool...
Twitter: / justmarkham
Facebook: / datascienceschool
LinkedIn: / justmarkham

Comments: 109
@dataschool • 4 years ago
Thanks for watching! 🙌 Let me know if you have any questions about imputation and I'm happy to answer them! 👇
@Liftsandeats • 4 years ago
Is imputation a method to replace missing/null values in the dataset?
@hakunamatata-qu7ft • 4 years ago
Awesome bro, very useful technique!
@dataschool • 4 years ago
That's correct: "missing value imputation" means that you are replacing missing values (also known as "null values") in your dataset with your best approximation of what values they might be if they weren't missing. Hope that helps!
@Steegwolf • 3 years ago
Have you worked with fancyimpute? It offers even more variety and works great.
@yunusemreylmaz3642 • 3 years ago
Thanks for all the videos, they helped me a lot. However, I searched on Google for a long time but could not find a solution to my problem. I am trying to fill missing values using other columns: there are some missing values in a car body type column, but the body type information is present in another column.
@levon9 • 3 years ago
I really love your videos, they are just right, concise and informative, no unnecessary fluff. Thank you so much for these.
@dataschool • 3 years ago
Thank you so much for your kind words!
@fobaogunkeye3551 • 10 months ago
Awesome video! I was wondering if you can share how the process works behind the scenes for cases where we have rows with multiple columns that are null, with respect to the iterative imputer that builds a model behind the scenes. I understand the logic when we only have a single column with null values but can't wrap my head around what will be assigned as training and test data if we have multiple columns with null values. Looking forward to your response. Thanks
@sachin-b8c4m • 1 year ago
Thank you. Love the clarity in your explanation!
@dataschool • 1 year ago
Glad it was helpful!
@lovejazzbass • 4 years ago
Kevin, you just expanded my column transformation vocabulary. Thank you.
@dataschool • 4 years ago
Great to hear!
@Matt-me2yh • 3 months ago
Thank you! I really needed this to understand the concepts, you are an outstanding teacher.
@dataschool • 2 months ago
Glad it was helpful!
@atiqrehman8435 • 11 months ago
God bless you man such valuable content you are producing!
@dataschool • 11 months ago
Thank you so much! 🙏
@rore3801 • 19 days ago
Great explanation, thank you!
@dogs4ever1000 • 1 year ago
Thank you, this is exactly what I need. Plus you've explained it very well!
@dataschool • 1 year ago
Glad it was helpful!
@ilducedimas • 2 years ago
Awesome video, couldn't be clearer. Thanks
@dataschool • 2 years ago
Thank you! 🙏
@-o-6100 • 2 years ago
Question: If we impute values of a feature based on other features, wouldn't that increase the likelihood of multicollinearity?
@mooncake4511 • 4 years ago
Hi, I tried encoding my categorical variables (a boolean value column) and then running the data through a KNNImputer, but instead of getting 1s and 0s I got values in between, for example 0.4, 0.9, etc. Is there anything I am missing, or is there any way to improve the predictions of this imputer?
@matrix4776 • 3 years ago
That's also my question.
@dataschool • 3 years ago
Great question! I don't recommend using KNNImputer in that case. Here's what I recommend instead: (1) If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. (2) If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
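Here's a minimal sketch of approach (2), using made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red']})

# impute a constant placeholder category, then one-hot encode the result
pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder()
)
print(pipe.fit_transform(X).toarray())
```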
@koklu01 • 3 years ago
@@dataschool In that case, can we interpret the results (0.4, 0.9) as the probabilities of those values being 0 or 1? Does it make sense to assign a threshold like 0.5, and transform values below it to 0 and above it to 1?
@vishalnaik5453 • 9 months ago
If that column or feature has discrete values like 0 or 1, better check the semantics of that column; most probably it should be treated as categorical.
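For reference, a quick numpy sketch of the 0.5-threshold idea from the question above (this only makes sense if the column is genuinely binary; the values shown are hypothetical imputer output):

```python
import numpy as np

imputed = np.array([0.4, 0.9, 1.0, 0.0])  # hypothetical KNNImputer output

# snap each imputed value back to the nearest valid class
binary = (imputed >= 0.5).astype(int)
print(binary)  # [0 1 1 0]
```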
@dizetoot • 3 years ago
Thanks for posting this. For features where there are missing values, should I be passing in the whole df to impute the missing values, or should I only include features that are correlated with the dependent variable I'm trying to impute?
@GabeNicholson • 2 years ago
This goes back to the bias-variance tradeoff. If you are adding hundreds of other columns that are likely to be uncorrelated, I would suggest not doing that, since it will likely overfit the data. You could use the parameter "n_nearest_features", which makes the IterativeImputer only use the top n features to predict the missing values. This could be a way to include all the columns from your entire dataframe while still minimizing the increase in variance.
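A minimal sketch of that parameter (the data here is randomly generated for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X[rng.random(X.shape) < 0.1] = np.nan  # punch random holes in the data

# use only the 5 most correlated features to predict each column's missing values
imputer = IterativeImputer(n_nearest_features=5, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0
```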
@rajnishadhikari9280 • 7 months ago
We can do this for numerical data, but what about categorical data? Can you mention any method for that?
@dariomelconian9502 • 1 year ago
Do you have a recommended tool/package for doing imputation with categorical variables?
@dataschool • 11 months ago
The simplest way is to use scikit-learn's SimpleImputer.
@seansantiagox • 3 years ago
You are awesome man!! Saved me a lot of time yet again!!!!
@dataschool • 3 years ago
That's awesome to hear! 🙌
@mapa5000 • 1 year ago
Fantastic video !! 👏🏼👏🏼👏🏼 … thank you for spreading the knowledge
@dataschool • 1 year ago
You're welcome! Glad it was helpful to you!
@ericsims3368 • 3 years ago
Super helpful, as always. Is IterativeImputer the sklearn version of MICE?
@dataschool • 3 years ago
Great question! IterativeImputer was inspired by MICE. More details are here: scikit-learn.org/stable/modules/impute.html#multiple-vs-single-imputation
@SixSigmaData • 2 years ago
Hey Eric! 😀
@saravanansenguttuvan319 • 4 years ago
What about the best imputer for categorical variables?
@dataschool • 4 years ago
Great question! For categorical features, you can use SimpleImputer with strategy='most_frequent' or strategy='constant'. Which approach is better depends on the particular situation. More details are here: scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
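Here's a minimal sketch of both strategies on a made-up categorical column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'embarked': ['S', 'C', np.nan, 'S']})

# fill missing values with the most common category...
print(SimpleImputer(strategy='most_frequent').fit_transform(X))

# ...or flag them with an explicit placeholder category
print(SimpleImputer(strategy='constant', fill_value='missing').fit_transform(X))
```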
@saravanansenguttuvan319 • 4 years ago
@@dataschool Ok...Thanks :)
@lejuge7426 • 3 years ago
@@dataschool Thanks a lot mate, you're a LEGEND!
@joxa6119 • 2 years ago
This imputation returns an array, but the OHE wants a DataFrame. How can we solve this if we want to put both inside a pipeline?
@AnumGulzar-iy7tl • 1 year ago
Respected Sir, can we do multiple imputation in EViews 9 for panel data?
@dataschool • 1 year ago
I'm not sure I understand your question, I'm sorry!
@primaezy5834 • 3 years ago
Very nice video! However, I want to ask: can the KNNImputer be used for object (string) data?
@dataschool • 3 years ago
Great question! KNNImputer can't be used for strings (categorical features), but you can use SimpleImputer in that case with strategy='constant' or strategy='most_frequent'. Hope that helps!
@akshatrailaddha5900 • 1 year ago
Does this work for categorical features too?
@dataschool • 11 months ago
SimpleImputer works for categorical features, but KNNImputer and IterativeImputer do not.
@ling6701 • 1 year ago
Thanks, that was very interesting.
@dataschool • 1 year ago
Glad you enjoyed it!
@gisleberge4363 • 1 year ago
No need to standardise the SibSp and Age columns (e.g. between 0 and 1) before the imputation process? Or is that not relevant here?
@dataschool • 11 months ago
Great question! That's not relevant here because imputation values are learned separately for each column.
@RA-sv3bv • 3 years ago
In the example we have only one missing value, so the imputer has an "easy" mission. What if we had more than a few missing values in this column/feature, and were facing "randomly" missing values across different columns/features? How does the imputer decide what to fill: is one column imputed first, and then, based on that filling, does it advance to the "next best" column and fill in its missing values, and so on?
@dataschool • 3 years ago
Great question! I don't know the specific logic it uses in terms of order, but I don't believe it tries to use imputed values to impute other values. For example, IterativeImputer is just doing a regression problem, and it works the same way regardless of whether it is predicting the values for one row or multiple rows. If there are missing values in the input features to that regression problem, I assume it just ignores those rows entirely. I'm not sure if that entirely answers your question... it's not easy for me to say with certainty how it handles all of the possible edge cases because I haven't studied the imputer's code. Hope that helps!
@soumyabanerjee1424 • 3 years ago
Can IterativeImputer and KNNImputer work only with numerical values, or can they also impute string/alphanumeric values?
@dataschool • 3 years ago
Great question! Only numerical.
@rishisingh6111 • 2 years ago
Thanks for sharing this! Why can't KNNImputer be used for categorical variables? The KNN algorithm works with classification problems.
@dataschool • 2 years ago
With KNNImputer, the features have to be numeric in order for it to determine the "nearest" rows. That is separate from using KNN with a classification problem, because in a classification problem, the target is categorical. Hope that helps!
@dariomelconian9502 • 1 year ago
Are you generally performing your imputation prior to any feature selection, or after? I always see mixed reviews about performing it before and after.
@dataschool • 11 months ago
Great question! Imputation prior to feature selection.
@evarondeau6595 • 1 year ago
Hello! Thank you very much for your interesting video! Do you know where I can find a video like this one about how to choose the number of neighbors? Thank you very much!
@dataschool • 1 year ago
Sure! kzbin.info/www/bejne/bJXFo4VjjN6goKs
@matrix4776 • 3 years ago
How to handle missing categorical variables?
@dataschool • 3 years ago
If you're using scikit-learn version 0.24 or later, and you have categorical data with missing values, OneHotEncoder will automatically encode the missing values as a separate category, which is a good approach. If you're using version 0.23 or earlier, I recommend instead creating a pipeline of SimpleImputer (with strategy='constant') and OneHotEncoder, which will impute the missing values and then one-hot encode the results. Hope that helps!
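Here's a minimal sketch of the first approach (requires scikit-learn 0.24 or later; the data is made up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red']})

# in scikit-learn 0.24+, NaN is automatically encoded as its own category
ohe = OneHotEncoder()
print(ohe.fit_transform(X).toarray())
print(ohe.categories_)  # the learned categories include nan
```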
@hardikvegad3508 • 2 years ago
Hey Kevin, quick question... should k in KNN always be odd? If yes, why, and if no, why not? They asked me this in an interview... Thanks for all your content.
@dataschool • 2 years ago
Great question! For KNNImputer, the answer is no, because it's just looking at other numeric samples and averaging them (there is never a "tie"). For KNN with binary classification, then yes an odd K is a good idea in order to avoid ties. Hope that helps!
@isfantauhid • 1 year ago
Can this be applied to categorical data, or only numerical?
@Tazy50 • 1 year ago
No, only numerical. He mentions it at the end of the video.
@rongshiu2 • 4 years ago
Kevin, how does it work if let's say B and C are both missing?
@dataschool • 4 years ago
I haven't read the source code, and I don't think the documentation explains it in detail, so I can't say... sorry!
@susmitvengurlekar • 4 years ago
My idea: make line plots of the columns which have null values against the other continuous columns (and box plots for discrete ones), and then impute a constant value according to the results. Say Pclass is 2: then you impute the median fare of Pclass 2 wherever Fare is missing and Pclass is 2. It's basically similar to IterativeImputer, only manual work: slow, but maybe with better results because of human knowledge about the problem statement. What are your thoughts on this idea?
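A pandas sketch of the groupby-median idea described above (Titanic-style column names; the values are toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 2, 2, 3],
                   'Fare': [80.0, 20.0, np.nan, 7.9]})

# fill each missing Fare with the median Fare of that row's Pclass
df['Fare'] = df.groupby('Pclass')['Fare'].transform(lambda s: s.fillna(s.median()))
print(df)
```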
@dataschool • 4 years ago
It's an interesting idea, but a manual process is probably not practical for any large dataset, and it's definitely impractical for cross-validation (since you would have to do the imputation after each split). In general, any "manual" process (in which your intervention is required) does not fit well into the scikit-learn workflow. Hope that helps!
@susmitvengurlekar • 4 years ago
@@dataschool I meant finding the values in an exploratory way, and then using the values found as constants in a SimpleImputer in a pipeline during cross-validation and evaluation. A custom transformer can also be created which does the imputation according to fitted values, e.g. during transformation, find similar records and then use their median. But then that's pretty much similar to KNNImputer and IterativeImputer.
@dataschool • 4 years ago
Sure, you could probably do that using a custom transformer. Or if you think you could make a strong case for this functionality being available in scikit-learn, then you could propose it as a feature request!
@susmitvengurlekar • 4 years ago
@@dataschool I am not sure whether this is the correct platform, but I have written a library named custom_transformers which contains transformers for handling date, time, null, and outlier values, plus some commonly needed custom transformers. If you have time, I would greatly appreciate your valuable feedback on Kaggle. This is the notebook demonstrating the use of the library: www.kaggle.com/susmitpy03/demonstrating-common-transformers. I intend to package it and publish it on PyPI.
@riyaz8072 • 2 years ago
Why don't you have 2M subscribers, man?
@dataschool • 2 years ago
You are so kind! 🙏
@ashwinkrishnan4435 • 3 years ago
What do I use if the values are categorical?
@dataschool • 3 years ago
You can use SimpleImputer instead, with strategy='most_frequent' or strategy='constant'. Here's an example: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/27_impute_categorical_features.ipynb Hope that helps!
@alazkakumu • 3 years ago
How to use KNN to interpolate time series data?
@dataschool • 3 years ago
I'm not sure the best way to do this, I'm sorry!
@joxa6119 • 2 years ago
What is the effect on the dataset after imputation? Any bias or something? I understand it's a mathematical way to insert a value into a NaN, but I feel there must be some effect from this action. So when do we need to remove NaNs, and when do we need to use imputation?
@whaleg3219 • 2 years ago
If the percentage of NaN values in a column is more than 50%, we should eliminate the column; otherwise we should impute it, using univariate methods like SimpleImputer or the multivariate methods mentioned by the author.
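A pandas sketch of that rule of thumb (the 50% cutoff is the commenter's heuristic, not a hard rule; the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mostly_missing': [1.0, np.nan, np.nan, np.nan],
                   'mostly_present': [1.0, 2.0, np.nan, 4.0]})

# keep only the columns where at most 50% of the values are missing
df = df.loc[:, df.isna().mean() <= 0.5]
print(df.columns.tolist())  # ['mostly_present']
```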
@joxa6119 • 2 years ago
@@whaleg3219 @DataSchool I see. What if there are NaNs in the target feature? Can we use imputation, or is removing the NaNs better?
@Kenneth_Kwon • 3 years ago
What if the first column has a missing value? It is a categorical feature, and it would be better if we could use multivariate regression. It has 0 or 1, but if we use KNNImputer or IterativeImputer, it imputes a float value. I think there's the same question as mine in the comments.
@dataschool • 3 years ago
In scikit-learn, multivariate imputation isn't currently an option for categorical data. I recommend using SimpleImputer instead. Hope that helps!
@aronpollner • 2 years ago
@@dataschool Is there any library that has this option?
@shreyasb.s3819 • 3 years ago
I have one doubt: which comes first, missing value imputation or outlier removal?
@dataschool • 3 years ago
Off-hand, I don't have clear advice on that topic. I'm sorry!
@hemantdhoundiyal1327 • 3 years ago
In my opinion, if you are imputing with a method like the median, you can impute missing values first; but if you are imputing with a method like the mean (outliers will affect it), it is good to remove outliers first.
@sergiucasian3085 • 3 years ago
Thank you!
@dataschool • 3 years ago
You're welcome!
@SUGATORAY • 4 years ago
Could you please consider making another video on MissForest imputation? (#missingpy)
@dataschool • 4 years ago
Thanks for your suggestion!
@jongcheulkim7284 • 2 years ago
Thank you^^
@dataschool • 2 years ago
You're welcome 😊
@whaleg3219 • 2 years ago
It seems that we should definitely not try it on a large dataset. It takes forever.
@WheatleyOS • 4 years ago
I can't think of a realistic example where KNNImputer is better than IterativeImputer; IterativeImputer seems much more robust. Am I the only one thinking this?
@dataschool • 3 years ago
The "no free lunch" theorem says that no one method will be better than other in all cases. In other words, IterativeImputer might work better in most cases, but KNNImputer will surely be better in at least some cases, and the only way to know for sure is to try both!
@aniket1152 • 3 years ago
Thank you for such an amazing video! I encoded my categorical data as numeric and then ran the KNNImputer, but it's giving me an error - TypeError: invalid type promotion. Any insights into what might be going wrong?
@dataschool • 3 years ago
I'm not sure, though I strongly recommend using OneHotEncoder for encoding your categorical features. I explain why in this video: kzbin.info/www/bejne/r6eXkpd6fMh5e5o Hope that helps!
@intelligencejunction • 1 year ago
Thank you!
@dataschool • 1 year ago
You're welcome!