Statistical Thinking - Imputing Missing Values

Views: 13,781

AIEngineering

Comments

@aravindcr4998 4 years ago

One of the very few members in the Data Science community who provides quality content.

@akhileshlekurwale364 4 years ago

And that too selflessly

@divyamarora4903 4 years ago

I can't believe that all this is free. This level of practical content is awesome and rare to find

@arpitakar3384 2 months ago

Multivariate Imputation by Chained Equations (MICE), or iterative imputation, from scratch just got its back scratched here... Great video, sir. Love from your own country. Note the difference: scikit-learn's IterativeImputer.

@arpitakar3384 2 months ago

IterativeImputer: This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import enable_iterative_imputer:
    # explicitly require this experimental feature
    from sklearn.experimental import enable_iterative_imputer  # noqa
    # now you can import normally from sklearn.impute
    from sklearn.impute import IterativeImputer
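
For readers who want to try the import above, here is a minimal sketch of how IterativeImputer might be used. The toy array and its column meanings (tenure, monthly charge, total charge) are made up for illustration and are not the dataset from the video.

```python
import numpy as np
# IterativeImputer is experimental, so this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: columns are tenure, monthly charge, total charge.
X = np.array([
    [1.0, 30.0, 30.0],
    [12.0, 50.0, 600.0],
    [24.0, 70.0, np.nan],    # total charge missing
    [36.0, np.nan, 2880.0],  # monthly charge missing
    [48.0, 90.0, 4320.0],
])

# Each column with gaps is modelled from the other columns, round after round.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Observed entries are passed through unchanged; only the NaN cells are estimated.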

@ravi_krishna_reddy 3 years ago

Very good content and awesome explanation. Thank you so much.

@jayaraghavendra9025 4 years ago

Awesome and Expecting more info like this

@raviirla459 4 years ago

Wow-moment videos... you have nailed it. Your videos are fun to watch, with great content. Looking forward to more videos on visualization, feature engineering and data interpretation.

@abhisheksolet8494 4 years ago

Amazing tutorial, sir. Thank you so much for providing such great learning material. Looking forward to many more.

@madhukerbillapati3944 4 years ago

Good one. Worth reading, wish to see more videos.

@blue_sapphire8650 3 years ago

Simple and neat. I goddamn love the way you covered the concepts in this video. In fact, I have been searching for content like this for a while, and now I have found it here 😊. Thanks a lot, sir.

@datatales1063 3 years ago

@19:36 - If we look at the graph, the fitted line crosses the y-axis above 0. But in the equation the intercept value is negative, -924.8180. Why is that?

@MrSmarthunky 4 years ago

Very good video, Srivatsan sir. Happy if you can make more videos on such foundational knowledge.

@AIEngineeringLife 4 years ago

🙏

@muralikrishna9499 4 years ago

Your videos are making me more and more inspired!

@chidiedim3166 4 years ago

great one sir

@rakeshkedar4096 4 years ago

Thanks for this video. I have a question that was even asked in one of the interviews. How can we evaluate our imputation strategy without applying any machine learning model? For example, suppose I had replaced Total Charges with the mean/median and did not have actual values to compare against, as you had in this case. In that case, what are the various statistical approaches to check how good our imputation strategy is?

@AIEngineeringLife 4 years ago

You can still evaluate the regression model's output by creating a split within the data points that do have values. Mean and median can be a good strategy when your data points are close together and not spread out. Another option, I would say: rather than imputing, use models that can handle missing values.
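
The split idea in this reply can be sketched as follows: hide some values that are actually known, impute them, and score each strategy against the hidden truth. The synthetic data, the 20% hold-out rate and the use of tenure times monthly charge as the regression feature are all assumptions made for illustration; they are not taken from the video's notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete data standing in for rows where TotalCharges is known.
tenure = rng.integers(1, 73, size=500).astype(float)
monthly = rng.uniform(20.0, 120.0, size=500)
total = tenure * monthly + rng.normal(0.0, 50.0, size=500)

# Hide 20% of the known values so each strategy can be scored against the truth.
held = rng.random(500) < 0.2
train = ~held

def rmse(pred, actual):
    """Root mean squared error between predictions and actual values."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Strategies 1 and 2: constant fill with the mean or median of the visible values.
mean_rmse = rmse(np.full(held.sum(), total[train].mean()), total[held])
median_rmse = rmse(np.full(held.sum(), np.median(total[train])), total[held])

# Strategy 3: least-squares line on tenure * monthly, predicted on held-out rows.
x = tenure * monthly
slope, intercept = np.polyfit(x[train], total[train], deg=1)
reg_rmse = rmse(slope * x[held] + intercept, total[held])

print(f"mean: {mean_rmse:.1f}  median: {median_rmse:.1f}  regression: {reg_rmse:.1f}")
```

The lower the error on the hidden rows, the more trust that strategy deserves on the rows that are truly missing, provided the missing rows resemble the hidden ones.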

@rakeshkedar4096 4 years ago

@AIEngineeringLife Yes, I agree about the intuitive part of using the mean/median strategy and using models that can handle missing values, but I am curious to know whether there is any statistical test to evaluate whether mean/median imputation works for our case.

@TravelWithIndoCanadian 4 years ago

Very well explained.

@hardikraja 4 years ago

Awesome...

@midhileshelavazhagan2541 4 years ago

Why does imputing very high values work with the gradient boosting method, as mentioned at 8:15?

@AIEngineeringLife 4 years ago

Sorry for the confusion; I think I did not articulate it well. In models like xgboost I can just makes null values as larger numbers or high negative numbers (in this case since 0 can be valid values, default is 0). Since GBMs work on splits, they might create a separate split for these values. You can read about sparsity-aware splitting in the paper below: arxiv.org/pdf/1603.02754v3.pdf

@arianaquek6036 4 years ago

@AIEngineering Hello sir, thank you for the insightful video! Just to clarify a few points: what does 'makes null values as larger numbers or high negative numbers (in this case since 0 can be valid values, default is 0)', which you wrote in a comment, mean? Does the '0' you mention in 'default is 0' represent a missing value, or a value that you impute as a 'value' in the dataset? As I understood from the paper, xgboost is capable of taking a dataset with missing values, sending them in different directions at each split and then choosing the best route. There is a short part where it says 'The same algorithm can also be applied when the non-presence corresponds to a user specified value by limiting the enumeration only to consistent solutions.' I assume this is what you meant by 'makes null values as larger numbers or high negative numbers', but what do larger/high negative numbers mean?

@AIEngineeringLife 4 years ago

Ariana, there are two things in XGBoost. First, you can set your own missing value in the params, as in the example here: xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values What I meant is that if a continuous value being zero is normal for your business and does not really mean missing, then keeping the default value is not right. Say a retailer has an offer to give a product free if someone buys another product. Then the value can be zero for the free product and cannot be treated as a missing value. So in these cases we typically impute with a high negative number. Second, in many cases, if the number is substantially an outlier, XGBoost will sometimes create a separate split for it, so it can be covered by an alternate tree path even without setting the missing value. This can be seen by visualizing the XGBoost trees. I hope that makes sense. I will try to cover it in a future video where I visualize and interpret trees.
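
To make the "high negative number" point concrete, here is a small sketch with a hand-written one-split regression stump. It is not XGBoost and not its sparsity-aware algorithm; it only illustrates why a sentinel far outside the valid range can end up on its own side of a tree split. The data, the sentinel value -9999 and the helper best_split are all hypothetical.

```python
import numpy as np

SENTINEL = -9999.0  # assumed to lie far outside the valid range [0, 98]

# Hypothetical feature where 0 is a legitimate value (e.g. a free product).
x_valid = np.arange(0.0, 100.0, 2.0)   # 50 observed values, including 0
y_valid = 2.0 * x_valid                # target for the observed rows
x_missing = np.full(10, np.nan)        # 10 rows with no recorded value
y_missing = np.full(10, 500.0)         # these rows behave differently

x = np.concatenate([x_valid, x_missing])
y = np.concatenate([y_valid, y_missing])

# Encode missing as the sentinel instead of 0, which would collide with real zeros.
x_enc = np.where(np.isnan(x), SENTINEL, x)

def best_split(feature, target):
    """Return the threshold minimising the summed squared error of a one-split stump."""
    values = np.unique(feature)
    candidates = (values[:-1] + values[1:]) / 2.0

    def sse(t):
        left, right = target[feature <= t], target[feature > t]
        return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

    return min(candidates, key=sse)

threshold = best_split(x_enc, y)
isolated = x_enc <= threshold  # rows that fall on the low side of the split
print(threshold, int(isolated.sum()))
```

In this toy setup the chosen threshold falls between the sentinel and the smallest real value, so the missing rows get their own branch while the genuine zeros stay with the observed data.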

@arianaquek6036 4 years ago

@AIEngineeringLife Thank you for your reply, sir! Am looking forward to more of your videos!

@manassharma869 4 years ago

awesome explanation i hope more parts are coming, thanks

@AIEngineeringLife 4 years ago

Yes, framing the problem is difficult :). I need to find good problems to provide solutions for. Suggestions are welcome as well.

@mukeshkund4465 4 years ago

Worth watching.

@udaysai2647 4 years ago

Srivatsan, I am very excited for this series of videos. It is a great elucidation of how we can impute missing values. To generalize your point, we need to check how the column containing NaNs varies with the target variable and how the other independent variables influence the column with NaNs, and then figure out the best way to impute. This is awesome, but I am curious about the case where other independent variables might also contain NaNs when applying such a technique. For example, if the 'MonthlyCharges' column or 'tenure' contains NaNs, how will we implement the 'lmmodel' technique in this example?

@AIEngineeringLife 4 years ago

Uday, nice questions. The first thing in the DS process is to understand the source of the data and why nulls are populated. Is a null an unavailability scenario or an exception scenario? Say monthly charges is null due to a capture error: I will first try imputing it, if that field cannot be dropped. For example, can I impute it by contract type and services? Below is the original dataset: github.com/srivatsan88/KZbinLI/blob/master/dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv I have the individual services each customer has, so can I run KNN based on similarity? Then, if I am able to approximate it, can I use it to get to TotalCharges? Again, as I said, these approaches are probabilistic and there is no one solution :)
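
A rough sketch of the KNN-by-similarity idea from this reply, written with plain NumPy. The service flags, the charges and the helper knn_impute are hypothetical; the real dataset has more service columns and would need encoding first.

```python
import numpy as np

# Hypothetical rows: binary service flags (phone, internet, streaming, security).
services = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
], dtype=float)
monthly = np.array([20.0, 55.0, 80.0, 100.0, np.nan, np.nan])

def knn_impute(features, values, k=2):
    """Fill each NaN in `values` with the mean of its k nearest observed rows."""
    filled = values.copy()
    observed = ~np.isnan(values)
    for i in np.flatnonzero(~observed):
        # Euclidean distance on the service flags to every observed row.
        dist = np.linalg.norm(features[observed] - features[i], axis=1)
        nearest = np.argsort(dist, kind="stable")[:k]
        filled[i] = values[observed][nearest].mean()
    return filled

# k=1 copies the charge of the single most similar customer.
monthly_filled = knn_impute(services, monthly, k=1)
print(monthly_filled)
```

Once MonthlyCharges is approximated this way, it could feed whatever model is used for TotalCharges, with the caveat from the reply that every step adds uncertainty.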

@udaysai2647 4 years ago

Srivatsan, thank you for the explanation. I think the very first line answers my question. I will try to apply this perspective when dealing with a dataset and figure out the reason that is causing the NaNs. As always, you add value to your videos with these suggestions :) Thank you once again.

@mohdhammadkhan5570 3 years ago

This content is so rare.

@kachrooabhishek 2 years ago

How did we switch from "Monthly Charges" to "Tenure" at 19:34? Sir, was that a random guess to check what the R-squared and standard error would be?

@rushikeshbulbule8120 4 years ago

Comprehensive ✌ How to transform data to a normal distribution... expecting that ahead.

@vaibhavbhatia4641 4 years ago

Great video sir. Can you please also share a link to the notebook through description, thank you.

@AIEngineeringLife 4 years ago

Notebook is in my git repo here - github.com/srivatsan88/KZbinLI/tree/master/statistics

@vaibhavbhatia4641 4 years ago

@AIEngineeringLife thanks a lot.

@anishnama2091 4 years ago

Thanks for this informative video. How do we impute categorical missing values using statistical thinking?

@AIEngineeringLife 4 years ago

Anish, it depends on the distribution of categories. In most cases you can tag missing values as "Others" and train the model, or impute with the most frequent category. A similar approach can also be followed to see whether we can infer the category from another variable, but this is applicable in very few cases.
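
The two simple options in this reply could look like the sketch below. The column values and both helper functions are hypothetical, and missing entries are assumed to be stored as None.

```python
from collections import Counter

# Hypothetical categorical column with missing entries marked as None.
contract = ["Month-to-month", "One year", None, "Month-to-month", "Two year", None]

def impute_with_label(values, label="Others"):
    """Tag every missing entry with an explicit placeholder category."""
    return [label if v is None else v for v in values]

def impute_with_mode(values):
    """Replace every missing entry with the most frequent observed category."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(impute_with_label(contract))
print(impute_with_mode(contract))
```

The placeholder keeps "was missing" visible to the model as its own category, while the mode fill hides it; which one suits the data depends on the category distribution, as the reply notes.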

@anishnama2091 4 years ago

@AIEngineeringLife Thanks

@sachingalugade8092 4 years ago

Thanks for the video, sir. Can you please make a video showing various ways to impute values for categorical variables?

@AIEngineeringLife 4 years ago

Categorical is typically simple. We can go with the mode of the data or use logistic regression to impute it, depending on the data and the business need.

@bharathjc4700 4 years ago

We can use MICE to impute missing continuous data. Is this technique better than MICE? What are the gaps? Please share your insights, sir.

@AIEngineeringLife 4 years ago

Bharath, the video highlights how to analyze data and use statistical techniques for imputation. MICE internally uses the same technique, but if you already have knowledge of the data, it is better to use that knowledge than to have MICE do the wrong thing.

@antoniushka 4 years ago

Hi guys! Great job! For some reason I got the same values for both columns "TotChargeNew" and "TotChargesAct". Where could the mistake be?

@AIEngineeringLife 4 years ago

Antonio, that's interesting. Given the data, it is very rare to have no standard error at all, although you may get values different from mine due to some randomness. Is your data the same before and after the pandas concat for the TotChargeNew column? You can compare against my notebook here: colab.research.google.com/drive/1fzf5bm_HvbtAQS_2jxR8UoQsCliDr5fa

@antoniushka 4 years ago

@AIEngineeringLife Thank you! I'll check it out!

@devpratap 4 years ago

First of all, thanks for sharing all this. Sir, I executed the notebook code after typing it myself to get a better understanding. TotalCharges has only 11 NA values, but yours had 28. Also, when I load the merged data, the values in the actual Total Charges column are empty. Did I do something wrong, or have you made changes to the dataset?

@AIEngineeringLife 4 years ago

Devpratap, my bad. I think I overwrote the file by mistake. Check now; I created a new one. It has 27 though, but it should work as expected. Let me know if you still have a problem.

@devpratap 4 years ago

@AIEngineeringLife I checked using my notebook. It works fine now. Thanks.

@ashirbaddas2573 4 years ago

Hello sir. Couldn't we simply check all the correlation values and go with the best-fitting pair? Then we could easily find a line function for predicting the missing values. Please correct me if I am wrong. Thank you for all this.

@AIEngineeringLife 4 years ago

You can, but this is a simple dataset for a demo. Think of sparse data: a more correlated value can introduce bias as well. It is like how we do feature selection for models; even in the case of imputing, the analysis cycle helps.
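
The correlation-first shortcut discussed in this exchange might be sketched as below. The data is synthetic and the column names are only stand-ins for the churn dataset; as the reply cautions, picking the most correlated column blindly can introduce bias on real, sparse data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical complete rows; the real columns would come from the churn dataset.
tenure = rng.integers(1, 73, size=300).astype(float)
monthly = rng.uniform(20.0, 120.0, size=300)
total = tenure * monthly + rng.normal(0.0, 50.0, size=300)

candidates = {"tenure": tenure, "MonthlyCharges": monthly}

# Pearson correlation of each candidate predictor with TotalCharges.
corr = {name: float(np.corrcoef(col, total)[0, 1]) for name, col in candidates.items()}
best = max(corr, key=lambda name: abs(corr[name]))

# Fit a straight line on the best-correlated predictor.
slope, intercept = np.polyfit(candidates[best], total, deg=1)
print(corr, best, slope, intercept)
```

A high correlation only says a straight line fits reasonably well overall; checking residuals and held-out error is still part of the analysis cycle the reply refers to.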

@arulsebastian6338 4 years ago

Thanks for the post. What is the github url for this code?

@AIEngineeringLife 4 years ago

Here you go - github.com/srivatsan88/KZbinLI/tree/master/statistics

@ragulshan6490 4 years ago

Sir, please make more videos on the different kinds of t-test using Python. Please also elaborate on the different types of normality test.

@AIEngineeringLife 4 years ago

Ragul, will do as and when I get time. I have too many in the backlog and little free time, so please bear with me.

@ragulshan6490 4 years ago

@AIEngineeringLife Take your time, sir. I'll be waiting for that!

@rajeshk1739 4 years ago

Thanks a lot for your efforts. Request you to please share the ipynb file.

@AIEngineeringLife 4 years ago

It is in my git repo - github.com/srivatsan88/KZbinLI/tree/master/statistics

@rajeshvenaganti6797 3 years ago

Where can I find this code?

@username42 4 years ago

any github links for jupyter notebooks?

@AIEngineeringLife 4 years ago

Here it is - github.com/srivatsan88/KZbinLI/blob/master/statistics/Statistical_Thinking_Imputing_Missing_Value.ipynb

@username42 4 years ago

@AIEngineeringLife thanks :)

@sumanthreddy1542 4 years ago

Why are we imputing TotalCharges with MonthlyCharges where tenure is zero? Why can't we set it to zero?

@AIEngineeringLife 4 years ago

Sumanth, we can. I am just assuming they might have to pay for the first month anyway. If they are on a contract, they get penalized for breaking it. But you can put zero as well. I was just showing the thinking needed to differentiate user personas.
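
Both choices from this reply, charging the first month or filling zero, fit in one small sketch. The customer rows and the helper fill_new_customers are hypothetical, and missing totals are assumed to be stored as None.

```python
# Hypothetical customer rows; None marks a missing TotalCharges value.
customers = [
    {"tenure": 0, "MonthlyCharges": 52.55, "TotalCharges": None},
    {"tenure": 0, "MonthlyCharges": 20.25, "TotalCharges": None},
    {"tenure": 12, "MonthlyCharges": 70.00, "TotalCharges": 840.00},
]

def fill_new_customers(rows, bill_first_month=True):
    """Fill missing TotalCharges for tenure-0 rows with one month's charge or zero."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input list is left untouched
        if row["TotalCharges"] is None and row["tenure"] == 0:
            row["TotalCharges"] = row["MonthlyCharges"] if bill_first_month else 0.0
        filled.append(row)
    return filled

print(fill_new_customers(customers))
print(fill_new_customers(customers, bill_first_month=False))
```

Which flag to use is a business assumption about how brand-new customers are billed, not something the data alone can settle.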

@sumanthreddy1542 4 years ago

@AIEngineeringLife Thank you for your response. I very much appreciate your kind effort to share your knowledge.