Dealing With Missing Data - Multiple Imputation

Views: 48,707

ritvikmath

Comments: 76

@MichaelPham-t7u 9 months ago

Wow, you have a natural ability to make complicated concepts become simple.

@amiriqbal1871 3 years ago

I was struggling with the concept, but your video made it crystal clear to me, thanks

@robertcsalodi1207 3 years ago

This explanation is awesome! Congratulations!

@ritvikmath 3 years ago

Glad you think so!

@ysc2652 4 years ago

This was so clear and easy to understand! Thank you!

@zhaoqian58 2 years ago

Thank you for producing this high-quality video.

@bhushantayade7984 4 years ago

Amazing sir. It's really helpful.

@ThuHuongHaThi 3 years ago

A thousand thanks, your explanation is very easy to understand, it's really helpful.

@ritvikmath 3 years ago

You are welcome!

@alexslayerking 2 years ago

This is an outstanding explanation. Thank you so much for making this.

@sean_gruber 2 years ago

VERY clear explanation. Thank you!

@mehmetkaya4330 5 years ago

Very very clear. Very helpful. Thank you!

@newbie8051 a year ago

Great explanation! Thanks a lot.

@emicat7045 5 years ago

Thank you very much! Love your videos; they are always clearly explained.

@ritvikmath 5 years ago

Thanks!

@elkalaiibrahim365 4 years ago

Thanks for the clear explanation. One thing I'm struggling to understand is when you are running multiple iterations, say 5, how are the different sets of data points generated? In your example, you fit lines among 50 data points. Do you randomly select 50 data points among those that have non-missing values in the raw dataset?

@vishnumohank1299 3 years ago

Data point selection when sampling is almost always done randomly, in order to avoid bias. A similar sample-and-test approach is taken when you perform cross-validation. The same logic applies when choosing the right sample size.

@gamespauls7765 2 years ago

Very informative! Thank you, good sir :)

@alfinpradana 4 years ago

Great explanation, and excellent at describing how multiple imputation works! But I have a question: how should I choose the final value for the imputation if there are 5 values? Should I average the 5 values, or is there a better approach? Thank you.

@qwertyuiop-qy6hb 4 years ago

Yes. As explained in the video, you calculate the mean of the 5 values. You also calculate the standard deviation.

@tinghuachen7844 4 years ago

@@qwertyuiop-qy6hb I am confused: is the final chosen value the mean of the predicted values, or the mean of the sample means? Using the mean of the predicted values makes more sense to me, but why do we want to see the standard deviation of the sample means? How will it actually affect our decision?

@qwertyuiop-qy6hb 4 years ago

@@tinghuachen7844 I am not a statistician, so this is just how I understand it. Multiple imputation gives you an estimate of the missing data points of a specific variable in a data set. The estimate is based on the values of the same variable (one column) for the other "individuals" (rows) in the data set. Every time you perform the imputation, the resulting value depends on the selected individuals, which should be selected randomly. So if you run the imputation, say, 5 times, you end up with five estimated values. The mean of these values gives the estimate of the missing data point. Calculating the standard deviation (from here on this is my own understanding) gives you an idea of how variable these estimates are. The same goes for the standard error, if you calculate it. If the estimated values are spread out (not close in value), the SD will be high, and I'd be careful in my assumptions and in interpreting the final analysis. I have not done multiple imputation in my domain (medicine). I would be very careful using it, but it is certainly a great method for avoiding missing data and using the full sample size.
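The procedure discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code behind the video: the data are synthetic, and the names `subsample_imputations`, `distance`, and `fine`, the subset size of 50, and the query distance of 1.7 are assumptions chosen to mirror the library-fine example.

```python
import numpy as np

def subsample_imputations(x, y, x_missing, n_models=5, subset_size=50, seed=0):
    """Fit one least-squares line per random subset and predict y at x_missing."""
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_models):
        # Draw a random subset of the complete rows, without replacement.
        idx = rng.choice(len(x), size=subset_size, replace=False)
        # np.polyfit with deg=1 returns (slope, intercept).
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        predictions.append(slope * x_missing + intercept)
    return np.array(predictions)

# Synthetic, made-up data: the fine grows roughly linearly with distance.
rng = np.random.default_rng(42)
distance = rng.uniform(0.0, 5.0, size=2000)
fine = 2.0 + 3.0 * distance + rng.normal(0.0, 1.0, size=2000)

preds = subsample_imputations(distance, fine, x_missing=1.7)
pooled_estimate = preds.mean()       # the value used to fill the gap
between_spread = preds.std(ddof=1)   # how much the five estimates disagree
print(pooled_estimate, between_spread)
```

A large `between_spread` relative to the scale of the variable would be the warning sign mentioned above: the individual estimates disagree, so the filled-in value deserves less trust.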

@jessicalambert4019 a year ago

Thanks very clear and useful!

@ayselceferzade8587 2 years ago

clearly explained! thanks a lot!

@StarFlex21 6 years ago

Thank you for the interesting and helpful series about missing data. Also, great video quality.

@滚去写论文 2 years ago

This is so clearly explained. Thank you very much for this concise and informative video! I have a question. I believe the purpose of step 2 - calculating the standard deviation - is to confirm that the mean is a reliable one. What if the standard deviation is too large? Does it imply that the imputation method is not a reliable one and should not be adopted? Thank you!

@tracykakyoalexis2155 4 years ago

This was my aaahaaa moment. Thank you!

@brendali5803 4 years ago

Great job!

@ΔημητρηςΠαπαγεωργιου-γ2υ 2 years ago

This is an amazing video. Thank you so much. Do we have to check the assumptions for linear regression for each model for each imputed variable?

@lucavisconti1872 2 years ago

Thanks for the practical example. It is not clear to me which value we should ultimately use to fill in the missing value with the multiple imputation method. Could you please clarify?

@davidrussell3433 3 years ago

Very helpful, thank you!

@ericazombie793 3 years ago

Very clear!

@chancesofrain6480 4 years ago

What do we do with the 5 imputations that have been calculated? Which of them can be considered the final imputed value if we just want to show it in a graph?

@bonflaneur3194 4 years ago

Yeah, I am missing the final information too. We have the values for the overall mean and the standard deviation of the means around it, but what are we supposed to do with these values? How do we decide which values to impute for the missing data?

@MyMy-tv7fd 2 years ago

Very clear and easy to follow, thanks. But would we not get as good a result by fitting one regression on a sample of 250 data items, as opposed to fitting five sets of fifty and then taking the mean of the means?

@raterake 3 years ago

Thanks for the great video! Question: suppose I have 5 different random samples with which I can get 5 regressions, and then \mu_1, ..., \mu_5, to find an aggregate mean \mu_A. Why not just pool those 5 data sets into one large data set and compute the grand mean \mu_B that way? Wouldn't my answer \mu_B be more precise (less variable) than just taking the average of the 5 means to get \mu_A?

@vishnumohank1299 3 years ago

A few things to note:
> The 5 random samples that you have taken from the existing data may have elements in common, so simply combining them might increase bias towards the repeated data points.
> Suppose you avoid repeated data points; then combining the 5 samples only creates a subset of the original dataset, and you would be better off just running a single regression imputation.
> In my opinion, the whole point here is to have multiple versions of the estimated values, so that you can better understand how well the estimation fits the data. Usually, if the variance or spread of the final values is quite high, we might not want to go ahead with imputation, or we might want to use something other than regression to estimate the missing values.
> Therefore, running multiple regressions gives you a clearer and broader picture of what compromises you are making, and how large they are, when you use imputation to replace missing values.

@raterake 3 years ago

@@vishnumohank1299 thanks for the thoughtful reply. I think that I was missing the point of multiple imputation, but your last two points clear that up for me. Thanks again!

@ajanasoufiane3903 6 years ago

Thanks a lot for this very clear video. Do you know if we can combine multiple imputation with variable selection (with lasso for example) for prediction purposes?

@judewells1 2 years ago

It wasn’t apparent to me why this estimator would be less biased than a single imputation. You mentioned that doing multiple regressions and then aggregating ‘washes away the noise’, but each of your individual regressions would also be noisier than a single regression that uses the whole dataset. So how do I know that, in aggregate, they are less noisy than a single regression?

@cynical_dd 6 years ago

Good job! Great video!

@phumlanimbabela-thesocialc3285 3 years ago

Thank you very much.

@andreibarbulescu7812 4 years ago

Isn't it actually even more complicated than that? Isn't it that for each regression, instead of imputing the missing fine value with the value predicted by the regression we actually randomly sample from the distribution of fine values around that predicted value (the distribution of fine conditional on distance)? This adds even more of the uncertainty involved in the guess we are making to our imputation process.
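The idea in this comment, drawing each imputed value from the distribution around the regression prediction instead of using the prediction itself, might look roughly like the sketch below. It assumes normally distributed residuals with constant variance, uses synthetic data, and the name `stochastic_regression_draws` is hypothetical. It also keeps the fitted slope and intercept fixed, whereas fuller multiple-imputation procedures draw those with uncertainty as well.

```python
import numpy as np

def stochastic_regression_draws(x, y, x_missing, n_draws=5, seed=0):
    """Return the fitted prediction and noisy draws centred on it."""
    rng = np.random.default_rng(seed)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    # Residual standard deviation; ddof=2 accounts for the two fitted parameters.
    sigma = residuals.std(ddof=2)
    predicted = slope * x_missing + intercept
    # Assumption: fine | distance is normal around the fitted line.
    draws = predicted + rng.normal(0.0, sigma, size=n_draws)
    return predicted, draws

# Synthetic, made-up data for illustration only.
rng = np.random.default_rng(7)
distance = rng.uniform(0.0, 5.0, size=500)
fine = 2.0 + 3.0 * distance + rng.normal(0.0, 1.0, size=500)

predicted, draws = stochastic_regression_draws(distance, fine, x_missing=2.1)
print(predicted)  # deterministic regression prediction
print(draws)      # five noisy imputations scattered around it
```

Each draw would fill the gap in one of the five completed datasets, so the spread between datasets reflects the residual noise as well as the fit.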

@mtcloris 6 years ago

Thanks for the clear explanation. One thing I'm struggling to understand is when you are running multiple iterations, say 5, how are the different sets of data points generated? In your example, you fit lines among 50 data points. Do you randomly select 50 data points among those that have non-missing values in the raw dataset?

@cynical_dd 6 years ago

I guess he randomly picks 50 data points. You might want to listen at 3:27.

@mtcloris 6 years ago

Thanks!

@diptikalyan 5 years ago

It would be great if you can share links to some of the papers or books that you refer here.

@rorysamuels2829 3 years ago

Thanks for the video! If the subsets are random, all the estimators are unbiased right? The aggregated estimator would just have lower variability.

@bevansmith3210 5 years ago

Thank you very much. Quick question: which imputed values do you end up leaving in the dataset for further analysis? Say I want to impute values to be used later for a variety of machine learning applications. Surely I can't use multiple imputation every time I want to implement a new machine learning model and measure a metric?

@SolomonHatcher 4 years ago

You would use the grand mean that you calculated at the end after investigating whether it is a valid metric.

@qwertyuiop-qy6hb 4 years ago

Great explanation, thanks. I have done many retrospective clinical research projects and I have never dealt with missing data. I always left these blank knowing that they will automatically be excluded from analysis. I believe leaving these missing data unfilled is better to avoid any chance of bias influenced by data of other patients in the study cohort. What do you think? Now looking at your clear video I am thinking about this approach as well for future projects. I am not a statistician and I've done all these while in training.

@Ampelman123 2 years ago

You have to think about what happens when you simply omit the rows with missing values: you increase the influence of the remaining entries. Why are the values missing? If the data are missing for a reason, you introduce bias into your data set and therefore reduce variance. Unless you can prove that the data are missing completely at random, it's best practice to act as if they are missing not at random. With methods like this one you try your best to preserve the variance of the data set. It's important to use a method that tries to model the most plausible value; otherwise you would reduce variance. I recommend looking into the types of missing data: "Missing Completely at Random", "Missing at Random", and "Missing Not at Random".

@BelleandBoos a year ago

oh my gosh thank you so much!

@PedroRibeiro-zs5go 5 years ago

Thanks! That was a really nice explanation!!

@claymarzobestgoofy 4 years ago

Can you actually compute a standard deviation? Won't that just reduce the SD for each regression by adding a bunch of points that perfectly fit the regression?

@alecvan7143 4 years ago

Awesome!

@PS3HARDCOREZOCKER 6 months ago

Is this the PMM approach?

@SirDerRosen a year ago

Thank you very much :)

@kyliestaraway2492 4 years ago

Can you do regression imputation next? I really loved this vid

@سحرالثقافة 3 years ago

Sir, you said we need 5 to 10 models. How do we calculate the exact number needed? Thank you.

@hamishthecat666 2 years ago

How does PMM identify nearby candidates when there is a mixture of numeric and categorical variables? Thanks :)

@joefishq11 4 years ago

Great explanation! But one that also seems at odds with what I'm reading from other sources, which make it sound like parameters in the model estimating the outcome are what get randomly selected for each iteration, not the observations used to make the prediction. Is what I'm describing an alternative approach to the same thing, or am I misunderstanding the approach?

@lbryan250 4 years ago

What do you mean by randomly selecting parameters? His choice of single imputation method is least squares regression and the parameters (the a and b in your "ax + b" regression line) have a closed-form solution. If you use the same dataset, the parameters of least squares don't have any variability in and of themselves. Maybe you can elaborate more on what you mean?
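To make the closed-form point concrete: for a line y = ax + b, least squares gives a = cov(x, y) / var(x) and b = mean(y) - a * mean(x). The sketch below uses synthetic data and a hypothetical helper `ols_line`. It only illustrates that refitting the same data reproduces the same parameters, while a random subset of 50 rows gives slightly different ones.

```python
import numpy as np

def ols_line(x, y):
    """Closed-form least-squares slope a and intercept b for y = a*x + b."""
    # np.cov normalises by N-1 by default, matching np.var with ddof=1.
    a = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    b = y.mean() - a * x.mean()
    return a, b

# Synthetic, made-up data for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 5.0, size=2000)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=2000)

full_fit = ols_line(x, y)   # parameters from the whole dataset
same_fit = ols_line(x, y)   # refitting the same data changes nothing

# A random subset of 50 rows yields different parameters.
idx = rng.choice(len(x), size=50, replace=False)
subset_fit = ols_line(x[idx], y[idx])
print(full_fit, same_fit, subset_fit)
```

In this sketch the only source of variation between fits is which rows are used, which matches the random-subset description given earlier in the thread.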

@psychoriginal1670 6 years ago

That was very helpful, thank you!

@Discogolf97 9 months ago

I don't understand why it is more unbiased to run 5 OLS regressions with only 50 out of the 2000 rows. Why not just run a single OLS regression with all rows and use that as my predictor for the missing values?

@leoncioblp 3 years ago

Wouldn't this be problematic if your objective with the dataset is precisely to demonstrate whether there is any relationship (like a linear relationship) between those 2 variables? Filling a missing value through a method that assumes the very same linear relationship you are trying to demonstrate would actually be begging the question, wouldn't it?

@deepakarumugam5866 5 years ago

Here you know that the fine amount is a dependent variable that depends on the distance from the library... but what if you have a data set with missing values in a particular column, and that column is actually an independent variable? How would you use multiple imputation in that case? Could you do something like using the distribution to find the values?

@TheOraware 5 years ago

Imputation is a treatment for missing data, no matter whether the variable is dependent or independent.

@AnkushSharma-zv5hv 4 years ago

thank you so much

@keeszethof6272 4 years ago

Thank you!

@deojeetsarkar2006 11 months ago

Noiccccccceeeeeeeeeeeeeeeeeee

@me3jab1 5 years ago

thank you

@ayakhaled5316 4 years ago

TTTTHhHHHHhhaaaannnnnkkkkk yooooooooooooooooooooooooou very very very much

@f2gms647 5 years ago

I found it confusing!! Especially since you move the paper up and down when you talk!

@sultanmehmood5022 3 years ago

The data for 1.7 & 2.1 mi is not, prima facie, true.

@Hari-888 4 years ago

I just wish you were neater. You write everything on that one sheet of paper and keep moving it, and it isn't clear what you are referring to when you point at the paper, since you've written in every nook and corner of it.