Step-by-Step procedure of KNN Imputer for imputing missing values

Views: 18,838

Rachit Toshniwal

A day ago

Comments: 74

@kumarnikhil8197 4 years ago

Very well explained! Teachers like you must be appreciated!!

@rachittoshniwal 4 years ago

Wow, thanks Nikhil! Appreciate your kind words! :)

@kumarnikhil8197 4 years ago

@rachittoshniwal I know it's a really tough job to make educational videos, which hardly get many views compared to the filth that is piling up on YT. Please don't lose motivation. Just remember there is always that one weak person who, with your help, can sleep peacefully that night.

@rachittoshniwal 4 years ago

@kumarnikhil8197 You're making me nervous now, Nikhil :p Thanks btw!

@r.s.572 5 months ago

Thank you for explaining this! :) Poor PhDs are thankful for people like you who use their free time to make such videos!

@ritvikpalvankar1903 2 years ago

Hello, thank you so much for the clear explanation. I was asked this question in an interview, and I think I did a good job because I watched this video the day before. :)

@rachittoshniwal 2 years ago

Wow, I'm so glad it helped Ritvik! I hope you get the job! :)

@pushpakkothekar9271 2 years ago

Learned KNN imputation, buddy, thank you. Liked and subscribed, brother.

@DrizzyJ77 8 months ago

Thanks! Needed a clear explanation for my missed class 😅

@tridibpal857 2 years ago

Sir, you are awesome. Please take a bow.

@rachittoshniwal 2 years ago

Haha, thanks!

@ivanrazu 4 years ago

Nice example. I do have a question: when you do the imputation for the other missing values, do you use the imputed value you just found when computing distances with respect to that row? Or do you do all imputations simultaneously?

@rachittoshniwal 4 years ago

Hi Ivan! No, we do not use any newly imputed values when imputing other values. All NaNs get imputed independently of each other. So basically yes, in a sense they all get imputed simultaneously.
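
Editor's note: the behaviour described in this reply can be checked with a minimal sketch using scikit-learn's `KNNImputer`. The toy matrix below is hypothetical and is not the data from the video.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two missing entries in different rows.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)

# Distances are computed on the original matrix, NaNs included, so the value
# filled into row 0 plays no part in choosing the neighbours for row 2.
```

Here row 0 is filled with the mean of its two nearest donors (3 and 5), and row 2 with the mean of its own two nearest donors (3 and 8).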

@ivanrazu 4 years ago

@rachittoshniwal OK, got it. Thank you, Rachit!

@rachittoshniwal 4 years ago

@ivanrazu :)

@kennethbassett6330 2 years ago

Thanks for the great video! I have a question: let's say I am finding the 5 nearest neighbors and trying to fill in a missing value in column A for a certain data point. If one of its nearest neighbors is also missing a value in column A, should I take the average of the remaining 4 neighbors, or should I include the next closest neighbor (the 6th nearest) in the average?
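
Editor's note: for scikit-learn's `KNNImputer` specifically, only rows that have a value in the column being imputed can act as donors, so the next closest row with a value is used instead. The sketch below illustrates this on a hypothetical matrix. Other implementations may behave differently.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: rows 0 and 1 are each other's closest neighbour,
# but both are missing column 1.
X = np.array([
    [1.0, np.nan],
    [1.1, np.nan],
    [2.0, 10.0],
    [3.0, 20.0],
    [50.0, 1000.0],
])

filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(filled[:2])

# Row 0 is filled with the mean of rows 2 and 3 (10 and 20), not with the
# single value 10 that would result if row 1 had used up a neighbour slot.
```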

@prithvisingh4173 3 years ago

Nice, bro, you got my concepts and doubts cleared. Thanks!

@rachittoshniwal 3 years ago

I'm glad it helped, Prithvi!

@pumpitup1993 4 years ago

Very nicely explained, can you do the same for MICE imputation?

@rachittoshniwal 4 years ago

I'm glad you liked it! I'll look into MICE!

@rachittoshniwal 4 years ago

Hi Sourav, I've just published one on MICE here: kzbin.info/www/bejne/jYHMioKJaNZ-bZI Do check it out and let me know if you do (or if you do not!) find it useful :)

@pumpitup1993 4 years ago

@rachittoshniwal Yes, I just saw it. Really helpful, thanks a lot!

@akshatjain1746 2 months ago

Short, simple, informative!

@bhavnatanwar8591 3 years ago

Your video is really helpful :)) I have a question: does the KNN imputer use the imputed values in further calculations? What I mean is, suppose we have imputed the missing values in the first column and now have to compute the values in the second column. Does it use the values imputed in the first column, or does it still consider them missing and treat them as missing entries when computing the weight?

@rachittoshniwal 3 years ago

Hi Bhavna, thanks! Well, no. It doesn't take into account the imputed values in one column while imputing other columns. It considers the "original" missing values as missing. Basically, all columns are imputed independently of each other.

@bhavnatanwar8591 3 years ago

@rachittoshniwal Thanks, this was really helpful :))

@Mflegend426 3 years ago

Very nice explanation, bro. Is there any way/method to select the number of neighbours while imputing values?

@rachittoshniwal 3 years ago

Thanks Ajeeth. Glad it helped! Well, you could try a grid search or even give the elbow method a shot.
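
Editor's note: one way to run the grid search suggested here is to put the imputer and a downstream model in a pipeline and tune `n_neighbors` by cross-validated score. This is a sketch under assumptions: the synthetic data, the linear model, and the candidate values are all hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with roughly 10% of the entries knocked out.
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("model", LinearRegression()),
])

# Candidate neighbour counts (an arbitrary grid for illustration).
param_grid = {"imputer__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the imputer sits inside the pipeline, it is refitted on each training fold, which avoids leaking information from the validation fold.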

@heteromodal 3 years ago

Thank you again for a great tutorial! Can you give an outline of when this method would be preferable to MICE, for example?

@rachittoshniwal 3 years ago

First of all, thanks! I'm glad you liked it! Well, MICE is helpful when the features are "correlated", and if you know that's the case, go ahead with it. Otherwise, look at other methods (like KNN, for example). But it's more trial and error, really. The imputer that gives the best results is the best one!
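
Editor's note: the trial-and-error comparison could look like the sketch below. It assumes scikit-learn's `IterativeImputer` (described in its documentation as inspired by MICE) as a stand-in for MICE, and uses hypothetical synthetic data with a linear model as the judge.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% missing entries (illustrative only).
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

candidates = {
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

# Score each imputer by the cross-validated R^2 of the same downstream model.
scores = {}
for name, imputer in candidates.items():
    pipe = make_pipeline(imputer, LinearRegression())
    scores[name] = cross_val_score(pipe, X, y, cv=3).mean()

print(scores)
```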

@heteromodal 3 years ago

@rachittoshniwal Thanks again! Really appreciate your videos and responses! :)

@rachittoshniwal 3 years ago

@heteromodal Thanks! My pleasure!

@ethiopiansickness 3 years ago

I'm surprised you don't have more subscriptions to your channel. A lot of your videos are at the top of search queries on YouTube, so I am sure eventually you will get the subscriptions and views that you deserve. Keep up the great work!

@rachittoshniwal 3 years ago

Haha! Thank you Shiffraw! Appreciate that!

@ismafoot11 3 years ago

Excellent video. However, what is the impact of doing this when the features have extreme variability? For example, what if one column ranged between 0 and 1 million and the other columns hovered around 10-20? Should you normalize/standardize your data beforehand? If so, should you normalize or standardize, and how would you do it if you have missing values in that column?

@rachittoshniwal 3 years ago

Thanks! Yes, we should ideally normalize the data if the features are on different scales. Scikit-learn's scalers ignore missing values and scale the columns based on the non-missing values. The NaNs remain NaNs. You can then impute those values. There is no one correct answer as to whether to normalize or standardize. It's trial and error: use whichever works best on your data.
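
Editor's note: a minimal sketch of scaling before imputing, assuming `MinMaxScaler` as the scaler and a hypothetical two-column matrix with very different ranges.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: column 0 spans 0 to 1,000,000, column 1 spans 10 to 20.
X = np.array([
    [0.0, 10.0],
    [np.nan, 15.0],
    [1_000_000.0, np.nan],
    [500_000.0, 20.0],
])

# The scaler is fitted on the non-missing values; NaNs pass through unchanged.
scaled = MinMaxScaler().fit_transform(X)
print(scaled)

# Scale first, then impute on the common 0-to-1 scale.
pipe = make_pipeline(MinMaxScaler(), KNNImputer(n_neighbors=2))
filled = pipe.fit_transform(X)

# Map the imputed matrix back to the original units.
original_scale = pipe.named_steps["minmaxscaler"].inverse_transform(filled)
print(original_scale)
```

Without the scaling step, distances would be dominated by the column with the larger range.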

@tugce2326 2 years ago

Hi Rachit, very nicely explained, which is why I want to ask you something. I have 440 observations from 9 precipitation observation stations (data matrix: 440×9). There are missing values at each station. None of the 9 precipitation series follows a normal distribution. However, the missing values in the 9 precipitation series are completely random. My questions are: 1) Can I use the k-NN, random forest, and MICE methods even though the 9 precipitation series are not normally distributed? 2) Are there any prerequisites/conditions for using these methods? 3) Could I use these methods if my data were not MCAR?

@rachittoshniwal 2 years ago

Hi! Following a particular distribution is not a prerequisite for imputation. However, if the data is skewed, it is better to go for median imputation than mean, because the median is a better approximation than the mean in that case. MICE works better when the data is MAR; if not, you might get suboptimal results. At the end of the day, though, it is mostly trial and error while finding the best method. Hope it helps!
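
Editor's note: the point about skewed data can be seen on a tiny hypothetical column with one extreme value, using scikit-learn's `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical skewed column: one outlier and one missing entry.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit_transform(X)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(X)[-1, 0]

# The mean is dragged towards the outlier; the median stays with the bulk.
print(mean_fill, median_fill)
```

The mean fill comes out at 202.0, while the median fill is 3.0, which is much closer to the typical values.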

@SumitKumar-sj5xw 2 years ago

very good explanation

@TheReluctantCoder 3 years ago

Very good explanation! Thank you!

@TheElementFive 2 years ago

Suppose you want to apply this technique to a dataset where the outcome variable is discrete. Would it be logical to limit your set of neighbors to those belonging to the class associated with the row you are imputing (i.e., calculate the Euclidean distance between the row to impute and only those other rows for which y_current_row == y_neighbor_row)?

@ThePablo505 2 years ago

Thank you so much

@md.faisalsohail9108 4 years ago

Simply awesome. Thanks, brother.

@rachittoshniwal 4 years ago

I'm glad you liked it! :)

@md.faisalsohail9108 4 years ago

@rachittoshniwal Hope you and your channel grow exponentially.

@rachittoshniwal 4 years ago

@md.faisalsohail9108 Whoa! Thank you for the kind words! :)

@rizkiekaputri2122 2 years ago

Please add subtitles to this video. My final task is about this topic, and I really hope you put subtitles here so I can understand what you are explaining.

@shoaibahmed5848 1 year ago

What about the missing values in the 1st row and the 4th row? Do those values also need to be filled?

@noorbariahmohamad8759 2 years ago

Prof, what if the NaNs occur at the same time? I mean Friends, GOT, Suits, Breaking Bad, and HIMYM are all missing in row 2. Can we still impute using the kNN method?

@barathwajas6702 2 years ago

Hi Rachit, quick question: how do you evaluate and tune the models to check whether the imputer predicted the correct (or a nearby) value? Thanks in advance.

@rachittoshniwal 2 years ago

We can only judge the goodness of the imputation by the model performance. If we get a good final model, it means the imputer was able to get close to the real values.

@barathwajas6702 2 years ago

@rachittoshniwal Correct, but in your example, was there any tuning done? If so, can you share that insight? TIA.

@NitinMukeshIITB 3 years ago

Awesome explanations

@rachittoshniwal 3 years ago

Thanks Nitin! I'm glad it helped!

@mohitgoyal229 3 years ago

Rachit, can you recommend some books where we can find techniques like these in detail?

@rachittoshniwal 3 years ago

I don't really have any good recommendations, but you can check the scikit-learn documentation for the algorithm you want. It usually refers to a research paper or a good reference on which the implementation is based.

@KartikRai-YrIDDCompSciEngg 2 years ago

What if (Row 3, Col 0) and (Row 4, Col 0) also had missing values? Then the mean ((50+29)/2) would not be possible, so how does the algorithm proceed?

@RS-fe1hk 4 years ago

2 doubts: 1) If we set k-neighbors = 2 and the NaN value is present in the 1st row instead of the 2nd row, which rows will be selected for calculating the Euclidean distance? 2) For the 'weight' value, 'total' / 'present coord': what are the total and present coord values if both values are NaN? For example, in your example, say there is a NaN instead of 85. Then what are the total and present coord values when comparing with the 1st row?

@rachittoshniwal 4 years ago

If I understand your second question: Total will always be 4 in this case, because there are 4 other columns. For a combination to be counted as a "present coord", both values must be present. So if 85 were a NaN, then while comparing person 2 with person 1, both would be NaN for the HIMYM column, and hence it wouldn't be counted in "present coord". I didn't quite get your first question. By 1st row, do you mean the 1st person or the 0th person?
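
Editor's note: the weighted distance discussed here can be sketched by hand and compared against scikit-learn's `nan_euclidean_distances`. The helper `nan_euclidean` and the two vectors are hypothetical. In scikit-learn's function, "total" is the number of columns in the arrays passed in, and a coordinate counts as present only when both rows have a value for it.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def nan_euclidean(a, b):
    """Hypothetical helper: sqrt(total / present * sum of squared differences
    over the coordinates where both vectors have a value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~np.isnan(a) & ~np.isnan(b)
    if not present.any():
        return np.nan  # no shared coordinates, distance undefined
    weight = a.size / present.sum()
    return np.sqrt(weight * np.sum((a[present] - b[present]) ** 2))


a = [3.0, np.nan, np.nan, 6.0]
b = [1.0, np.nan, 4.0, 5.0]

manual = nan_euclidean(a, b)  # sqrt(4 / 2 * ((3 - 1)**2 + (6 - 5)**2))
library = nan_euclidean_distances([a], [b])[0, 0]
print(manual, library)
```

Only coordinates 0 and 3 are present in both vectors, so the squared distance over those two is scaled up by 4/2.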

@RS-fe1hk 4 years ago

@rachittoshniwal That answers the 2nd question, thanks. And for the 1st question, 1st row means index 1, not index 0 (i.e. Friends = 44).

@rachittoshniwal 4 years ago

@RS-fe1hk So you mean to say that for person 1, both Friends and HIMYM are NaNs?

@RS-fe1hk 4 years ago

@rachittoshniwal Yeah. If we try to fill the NaN for person 1 and set k-neighbors to 2, how will it select rows? Above person 1 there is only 1 row (person 0), right? So in that case, which rows will get selected for imputation?

@rachittoshniwal 4 years ago

@RS-fe1hk It doesn't matter how many rows are above or below the row in which there is a missing value. It will scan through all rows in the dataset and find the top 2 neighbors. I hope that solves your query. Let me know if it doesn't!

@yv4000 3 years ago

Do we need to scale the features before imputation?

@rachittoshniwal 3 years ago

If the features are on different scales, then yes.

@yv4000 3 years ago

@rachittoshniwal How should we scale a feature with null values present? The min-max scaler won't work with null values present.

@rachittoshniwal 3 years ago

@yv4000 It does work with missing data. It just ignores those missing values.

@yv4000 3 years ago

@rachittoshniwal Thanks!

@venkateshwarlusonnathi4137 3 years ago

Should you not normalize the values before doing KNN? Or is it that, because all of them are supposed to be in the same range of 0 to 100, we don't need to do it here?

@rachittoshniwal 3 years ago

Yes, precisely. Since all are on the same scale, it isn't mandatory to perform normalization. But if the features are on different ranges, then you should.