Outlier detection and removal: z score, standard deviation | Feature engineering tutorial python # 3

Views: 121,777

1 day ago

Comments: 119

@codebasics 2 years ago

Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

@dipto624 3 years ago

man!! I was struggling with how to use statistics in EDA. I knew std, mean n all but couldn't use them in the EDA flow. u just cleared my confusion!!!! u won't believe how long I have been struggling with this.. thank god I found this video.. u r a great teacher.. I had the tools but couldn't use them. u just taught me how to use it..

@codebasics 3 years ago

☺️👍

@Mohammed-rx6ok 4 months ago

@sultanhusnoo8552 3 years ago

Can't thank you enough for the amazing work you do. It is explained in such a simple, honest way. Many YouTubers explain things in an incomplete way and then keep referencing their paid courses. You are probably the only one who has a complete course, with complete explanations and exercises all available for free, and you even provide some level of feedback to those who interact with you. This is so rare and precious. I have been learning programming and data science with a view to improving my career. As soon as I get a salary from any coding-related work, I promise to join your Patreon. Can't thank you enough for what you do. All the best to you and your family.

@codebasics 3 years ago

Sultan, you are a very kind person and thanks for all your appreciation :) This kind of feedback motivates me to continue my work on YouTube!

@sumitkumarsah8782 4 years ago

Sir, I just wanna say that my respect for you is increasing a lot. Keep making such videos. Thank you for your efforts.🙏

@subuqerpsmja 4 years ago

You are such an inspiration for people like me who are looking for a transition towards data science. Day and night I'm spending my time in this quarantine with data science, and your YouTube videos play a huge role in increasing my caliber. I am a systems engineer at CTS and now I wish to move my career towards data science. I'm tirelessly preparing my portfolio and my resume to forward for the evaluation, as per your latest video.

@ndosh1man 1 year ago

you made it?

@krishnanarwade1467 4 years ago

I am totally inspired by Dhaval sir and Krish Naik sir. Thank you very much for sharing your valuable knowledge with us.

@kirandeepmarala5541 4 years ago

I have no words to say thank you. You are always providing such knowledge for free. I pray to God to keep you and your family safe all the time, with health, wealth and prosperity. Thank you once again.

@jaganinfo 4 years ago

We will not stop the video :) We will watch the entire video. Each piece of info is very valuable to us (learners).

@shaiksuleman3191 4 years ago

Simply superb. You and Krish are the two eyes of data science.

@cesarkastoun5752 4 years ago

Hello, 1st of all, I love your videos. You have a great talent for teaching and are putting it to good use. Just a small nit: the heights file you're using is not really a normal distribution, but a bi-modal one, as it has 2 modes. And the reason is very simple, it's because you're lumping together males & females. If you use separate data sets for each gender, you get much "cleaner" normal distributions. Cheers -CJK
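
A minimal sketch of the point made in this comment, assuming the data has a gender label next to each height: compute the z-score inside each group rather than over the pooled column. The sample rows and the helper name `zscores_by_group` are made up for illustration, and only the standard library is used so the snippet runs without pandas.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical sample rows: (gender, height in inches). Values are invented.
rows = [
    ("Male", 69.0), ("Male", 70.5), ("Male", 68.2), ("Male", 71.1), ("Male", 69.8),
    ("Female", 63.5), ("Female", 64.2), ("Female", 62.9), ("Female", 65.0), ("Female", 63.8),
]

def zscores_by_group(rows):
    """Return (group, value, z) tuples, with z computed inside each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    # One (mean, sample std) pair per group instead of one for the whole column.
    stats = {g: (mean(v), stdev(v)) for g, v in groups.items()}
    return [(g, x, (x - stats[g][0]) / stats[g][1]) for g, x in rows]

scored = zscores_by_group(rows)
for group, height, z in scored:
    print(f"{group:6s} {height:5.1f}  z={z:+.2f}")
```

Each height is then judged against people of the same gender, so a tall woman is not hidden by the male half of a two-peaked distribution.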

@anandshimpi8011 3 years ago

Really amazing lecture, sir. My interest in data science is increasing.

@hardikatri7803 4 years ago

One of the finest tutorials. Great teaching style.

@codebasics 4 years ago

Thanks Hardik, Keep learning.

@hardikatri7803 4 years ago

Thank you for the support and guidance. The exercise part in your tutorials is just awesome. I really loved your way of teaching.

@kakmca 2 years ago

Wah... extra-ordinary explanation sir. Thank you...

@saifansari6459 2 years ago

Excellent explanation of every topic. It really helps me a lot in my data science career. Thanks.

@sa89879 4 years ago

Very good and neat explanation, but there is one drawback to the z-score: it relies on the mean, which can be distorted by an extreme outlier or a human data-entry error. If we use the median for outlier detection instead, it will be robust: whatever the extreme values are, it depends only on the middle values. Thanks for teaching z-score.
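
One common way to act on this suggestion is the "modified z-score", which swaps the mean for the median and the standard deviation for the median absolute deviation (MAD). The sketch below is an illustration and not the method from the video; the constant 0.6745 and the 3.5 cutoff are the usual textbook convention, and the heights are invented.

```python
from statistics import median

def modified_zscores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

# Invented heights; 120.0 plays the role of a data-entry error.
heights = [64.0, 65.5, 66.0, 66.5, 67.0, 68.0, 69.0, 120.0]
scores = modified_zscores(heights)
outliers = [x for x, z in zip(heights, scores) if abs(z) > 3.5]
print(outliers)
```

Because the median and MAD barely move when one value is extreme, the bad entry cannot hide itself by inflating the yardstick it is measured against.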

@Hale-xn6ec 3 years ago

It is a really beneficial and useful video on this topic, thank you!

@likhithsasank8017 3 years ago

Thank you so much, sir. Your way of teaching is so clear and easily understandable.

@abdeali004 4 years ago

Great Greaaaaat and a fulll too Greaaattttt explanation man. Loved it.

@hasanbutt8622 4 years ago

Best tutorial, thanks a lot sir. You are great. I have learnt a lot of concepts from your videos. God bless you, and keep making more videos.

@siddharthmodi2740 3 years ago

woww! what a simple and easy to understand tutorial. Love it. Thank you sir.

@bhavindedhia9968 4 years ago

TOP content seriously thanks sir waiting for more videos specially EDA

@Deepsim 3 years ago

Your tutorial is so clear. Well done!

@codebasics 3 years ago

Glad it was helpful!

@akshaypatil8155 2 years ago

16:38 This is just the trimming technique. If we want to do capping, which means replacing outliers with either the lowest defined value or the highest defined value, how do we do it?
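
A minimal sketch of capping, assuming the limits are mean ± 3 standard deviations (the same limits used for trimming). The helper `cap_outliers` and the sample values are hypothetical. With a pandas Series the same clamp can be written with `Series.clip(lower=..., upper=...)`.

```python
from statistics import mean, stdev

def cap_outliers(values, n_std=3.0):
    """Clamp every value into [mean - n_std*std, mean + n_std*std]."""
    mu, sigma = mean(values), stdev(values)
    lower, upper = mu - n_std * sigma, mu + n_std * sigma
    return [min(max(x, lower), upper) for x in values]

# 30 ordinary heights plus one invented extreme value.
heights = [66 + 0.5 * (i % 5) for i in range(30)] + [150.0]
capped = cap_outliers(heights)
print(heights[-1], "->", round(capped[-1], 2))
```

No row is dropped: the extreme value is pulled back to the upper limit and every other value is left untouched.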

@subuqerpsmja 4 years ago

My sincere thanks for your valuable efforts. I'm keenly following your guidelines.

@Medjdiptiranjan 2 years ago

You are simply amazing. Your simple explanations are helping a lot. Thanks a trillion.

@learnerlearner4090 2 years ago

Your videos are easy to understand. Thanks so much!

@chivalrousforlan238 4 years ago

Nice one Sir, thank you. One thing sir, I would like you to please make a tutorial on SQL. Thank you sir

@whimsicalkins5585 1 year ago

Thanks very much for your simple and clear code.

@fahadreda3060 4 years ago

Great video, Thanks man , keep up the good work

@srishtikumari6664 4 years ago

Very well explained sir!! Worth watching

@codebasics 4 years ago

👍😊

@python360 2 years ago

Great tutorial, thanks for using readily available sample CSV as well. ☑☑

@hrushik10 2 years ago

You can also use seaborn to plot the bell curve. It's much easier than the matplotlib method: seaborn.histplot(data=df.height, kde=True). kde is the kernel density estimate line.

@jp-hm 1 month ago

Great video - well explained!

@dhananjaykansal8097 4 years ago

Long time, sir. I wish you had taken a dataset with at least 5-6 features. Nonetheless, it's fantastic.

@yogeshbharadwaj6200 4 years ago

Thanks for the very detailed explanation, sir.

@AryanFelix 3 years ago

How do we determine the Z-Score range for Skewed data? Do I use the same range on either side (like -3 to 3) or can I use different values like -1 to 3 (for left skewed data) after looking at the histogram plot? Thanks in advance!

@haythemb4214 2 years ago

Same question. I don't know what the right range is for my data, because (-3, 3) doesn't work in my case.
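
For skewed data, one option that avoids picking a z-score range by eye is the interquartile-range (IQR) rule, whose limits follow the quartiles and are therefore not forced to be symmetric around the mean. This is a sketch under assumptions: the multiplier 1.5 is the usual convention, and the helper name and prices are invented.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented right-skewed prices with one very large value.
prices = [4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 12, 15, 60]
low, high = iqr_bounds(prices)
kept = [x for x in prices if low <= x <= high]
print(low, high, kept)
```

Plain percentile cutoffs (for example, the 1st and 99th percentiles) are another way to get different limits on each side of a skewed distribution.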

@pranjalgupta9427 4 years ago

Sir, if the data is not normally distributed, which technique should we prefer for removing outliers?

@stuttzzzi 3 years ago

There are ways to convert data into a normal distribution. Learn about transformations (log, power); plain scaling alone does not change the shape.

@haintuvn 4 years ago

Thank you for your lectures! I have learnt a lot from them. Can we apply the std and z-score methods to remove outliers only if the data set is normally distributed, or can we apply these two methods to all types of data set (normal or not)? Thank you again!

@codebasics 4 years ago

You would do that if you have a normal distribution.

@haintuvn 4 years ago

@codebasics Thank you very much! Does that mean we need to test whether the data set is normally distributed before we apply the z-score or standard deviation method to remove outliers?
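
A rough, informal check is to compare the data against the 68-95-99.7 rule: in a normal distribution about 68.3%, 95.4% and 99.7% of values fall within 1, 2 and 3 standard deviations of the mean. The sketch below uses simulated heights with made-up parameters; it is a sanity check and not a formal test (for one of those, something like Shapiro-Wilk, available as `scipy.stats.shapiro`, is the usual choice).

```python
import random
from statistics import mean, stdev

def fraction_within(values, n_std):
    """Share of values lying within n_std standard deviations of the mean."""
    mu, sigma = mean(values), stdev(values)
    inside = sum(1 for x in values if abs(x - mu) <= n_std * sigma)
    return inside / len(values)

random.seed(0)  # fixed seed so the simulation is repeatable
# Simulated heights; the mean and std here are illustrative only.
sample = [random.gauss(66.4, 3.8) for _ in range(10_000)]
coverage = {k: fraction_within(sample, k) for k in (1, 2, 3)}
print(coverage)
```

If the shares for a real column are far from these reference values, the z-score limits will not mean what they mean for a bell curve, and a histogram is worth a look before removing anything.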

@prdfrnd 4 years ago

Hi sir, your explanation is really amazing. I recently started learning data science and I have a doubt about this video; kindly explain. We have a mean of 66.36755, and if we add 3.8475 it becomes 69. How is that one standard deviation?

@hustleto-n6d 4 years ago

One standard deviation = 3.8475.
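
The arithmetic behind this exchange, worked through with the two numbers quoted above. "One standard deviation above the mean" is the mean plus 3.8475, which comes to roughly 70.2, and its z-score is 1 by definition.

```python
mean_height = 66.36755   # mean quoted in the question above
std_height = 3.8475      # standard deviation quoted in the reply

one_std_above = mean_height + std_height
one_std_below = mean_height - std_height

# z-score = (value - mean) / std, so the point one std above the mean has z = 1.
z_of_upper = (one_std_above - mean_height) / std_height

print(round(one_std_below, 5), round(one_std_above, 5), round(z_of_upper, 5))
```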

@nareshchinnam8349 4 years ago

Thanks so much for explaining in such an easy way. Could you please clarify what we should do if other columns contain important values in the same row where an outlier exists? Can we still go ahead and remove the entire row?

@naveenkalhan95 4 years ago

Thank you very much again. I am following all your videos; really knowledgeable. At 5:50 of this video you created the bell curve. I am aware of one function, .kde(), which does the same thing. Is it wise to use that, or is there some difference between it and the function you created for drawing the bell curve? Thank you very much again. Really appreciate it.

@codebasics 4 years ago

Naveen, actually I don't know about the kde() function. What does the API specification say about that function? Can you try plotting it and see if the result is the same as mine?

@naveenkalhan95 4 years ago

@codebasics Thank you for your reply. I followed your advice and plotted the height using the .kde() method. It produced the same bell curve, with a slight difference. I just had to write this line to draw it: df.Height.plot.kde(). But thank you again for your precious work, because it's opening up my brain to think in a more agile way about drawing it and understanding it mathematically.

@hardikatri7803 4 years ago

We can also plot it with seaborn using the parameter kde=True.

@vishalvig01 4 years ago

Concise Explanation !

@pranjalgupta9427 4 years ago

Do we remove outliers before feature scaling or after feature scaling?

@codebasics 4 years ago

We don't need to remove them all the time. We need to treat them, which means we might end up changing the value to some reasonable value.

@codebasics 4 years ago

Yes, we remove them before feature scaling.

@obigvee 4 years ago

I have a question. Let's assume a DataFrame has some missing values as well as outliers, and I don't want to just remove the outliers; I want to winsorize them. Is it right to treat the missing values first before winsorization, or the other way round?

@sahanjayawarna4894 4 years ago

Very good session as always. I came across this situation but couldn't figure out why: unless we pass the argument density=True to matplotlib.pyplot.hist(), it is not possible to see the normal curve and histogram together in the graph. What is the reason for that?
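
The reason is a mismatch of vertical scales. Without density=True the bars show raw counts, while the bell curve is a probability density whose peak is well below 1, so the curve is squashed flat against the x-axis. With density=True the bars are rescaled so their total area is 1, which puts them on the same scale as the curve. The numbers below are illustrative assumptions (row count, bin width, mean and std are not taken from the video's data).

```python
from statistics import NormalDist

n_samples = 10_000                      # hypothetical number of rows
bin_width = 0.5                         # hypothetical bin width, in inches
dist = NormalDist(mu=66.4, sigma=3.8)   # illustrative parameters

# Height of the bell curve at its peak (a density, not a count).
peak_density = dist.pdf(66.4)

# Approximate height of the tallest bar when the histogram shows raw counts:
# density * number of rows * bin width.
expected_peak_count = peak_density * n_samples * bin_width

print(round(peak_density, 4), round(expected_peak_count))
```

A curve that tops out near 0.1 is invisible next to bars that are hundreds of units tall, which is why the two only line up on the density scale.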

@flaviobrienza7697 2 years ago

A little suggestion to make it simpler: in the z-score method I can calculate the absolute value with np.abs and write only < 3 in my condition for the new DataFrame. In addition, to visualize the curve it is better to use sns.histplot with kde=True.
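
The first suggestion, sketched with the standard library so it runs anywhere: taking the absolute value of the z-score turns the two conditions (z > -3 and z < 3) into one. The helper name and data are invented. On a DataFrame the same idea is a single boolean mask built from `np.abs(z) < 3`.

```python
from statistics import mean, stdev

def drop_by_abs_zscore(values, threshold=3.0):
    """Keep only the values whose |z-score| is below the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs((x - mu) / sigma) < threshold]

# 30 ordinary heights plus one invented extreme value.
heights = [66 + 0.5 * (i % 5) for i in range(30)] + [150.0]
kept = drop_by_abs_zscore(heights)
print(len(heights), "->", len(kept))
```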

@sarfrazhussain9851 2 years ago

Nice effort

@tucomax 4 years ago

Question: say you have a df of drink consumption, and you don't want to eliminate the outliers but instead replace them with NaN while keeping the zero values of the DataFrame. What would you do? Thanks.
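
One way to approach this, as a sketch with invented numbers: test each value against the limits and swap only the failing ones for NaN, so zeros (which are legitimate values here) pass through unchanged. The helper `mask_outliers` is hypothetical. In pandas, `Series.where(condition)` does the same replacement, putting NaN wherever the condition is False.

```python
import math
from statistics import mean, stdev

def mask_outliers(values, threshold=3.0):
    """Replace values beyond threshold standard deviations with NaN."""
    mu, sigma = mean(values), stdev(values)
    return [x if abs(x - mu) <= threshold * sigma else math.nan for x in values]

# Invented drink counts; the zeros are real observations, 400 is the outlier.
drinks = [0, 1, 2, 0, 3, 1, 2, 0, 1, 2, 3, 1, 0, 2, 1, 2, 1, 0, 3, 2, 400]
masked = mask_outliers(drinks)
print(masked)
```

The row count stays the same, so the other columns of each row are kept and the NaN can be imputed or ignored later.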

@satyavardhan8204 4 years ago

Also, please make videos about seaborn.

@ajaykushwaha-je6mw 3 years ago

Is removing outliers a good option, or is replacing outliers with another value a good option?

@estherugwueke5409 2 years ago

How can you apply this rule when you have about 10 features? Do you do them one by one?

@modhua4497 3 years ago

Does this work only if the feature is normally distributed? Most of the features in real world data are not normally distributed.

@priyantangupta5176 3 years ago

Hello! Your lesson is very helpful for me. Can you say how I can find outliers using multiple parameters? I want to find the outliers using all the columns of my data together. What should I do? Thank you in advance.

@trinayanbharadwaj146 3 years ago

How can we apply this to multiple columns? Is there any short way or we have to do it manually for every column?
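
A loop over the column names avoids repeating the code by hand. The sketch below assumes every listed column is numeric and keeps a row only if it passes the check in all of them. The table, column names and helper `rows_within_zscore` are made up for illustration.

```python
from statistics import mean, stdev

def rows_within_zscore(table, columns, threshold=3.0):
    """Keep rows whose value in every listed column has |z| < threshold."""
    stats = {}
    for col in columns:
        values = [row[col] for row in table]
        stats[col] = (mean(values), stdev(values))
    return [
        row for row in table
        if all(abs((row[col] - stats[col][0]) / stats[col][1]) < threshold
               for col in columns)
    ]

# 30 ordinary rows plus two rows with an invented outlier in one column each.
table = [{"height": 66 + 0.5 * (i % 5), "weight": 150 + 2 * (i % 7)} for i in range(30)]
table.append({"height": 67.0, "weight": 900})
table.append({"height": 150.0, "weight": 155})

clean = rows_within_zscore(table, ["height", "weight"])
print(len(table), "->", len(clean))
```

Note the trade-off: the more columns are checked, the more rows are dropped overall, so capping or masking individual values can be gentler than removing whole rows.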

@Aaron_duckroast 3 years ago

Hey, why can't we use StandardScaler and delete all outliers?

@pythonenthusiast9292 4 years ago

awesome.

@sadikaljarif9635 1 year ago

Why did we choose the height column? Why didn't we choose the weight column?

@Artech.Ranjit 3 years ago

How do we decide on 3 as the threshold for z-score values? For example, you used zscore > 3.

@bikashpokharel478 4 years ago

It really helped me. Thank You

@codebasics 4 years ago

Glad it helped!

@GusMD84 4 years ago

What happens when the std deviation is way bigger than the mean? I'm currently exploring a dataset where the mean price is ~220 and the std dev is ~395. Evidently there are some big outliers that can be seen straightaway (i.e. a min price of 4 and a max price of 36000). Should I remove those 'clear' outliers manually and then apply the remove-outliers function? (I presume that if I don't do this, the function will remove a lot of 'non-outliers'?)

@shounaksushantadasgupta8440 3 years ago

How do I remove outliers from a DataFrame that has categorical as well as continuous data? With the percentile technique I am getting NaN values in the categorical columns.

@reshaknarayan3944 4 years ago

Clear and succinct

@ajaykushwaha-je6mw 3 years ago

I have a question, kindly answer. Suppose we have 20 columns and we remove outliers from all 20 columns. We are then excluding a small amount of data for each column, i.e. altogether we are losing a huge amount of data. Is this a correct way to handle outliers?

@beautyisinmind2163 2 years ago

Hello sir, can we learn personally from you? And how can we contact you?

@piyush_sh98 1 year ago

How is 3 selected as the number of standard deviations, and 3 for the z-score too? Please, someone explain.

@rsinh3792 3 years ago

Sir, a reviewer has asked me this question and I don't know how to address it. Can you please guide me? "Use some statistical significant test such as T-test or ANOVA to prove you validate the proposed diagnostic model on patients and quality improvements of your method". I have two datasets. Dataset 1 was used to train the model and dataset 2 was used to validate the trained model. I trained the ML model, deployed it, validated it on new data and presented the results. Actually, I have understood the question. Shall I apply the statistical test between the performance metrics of the training results and the validation results? Please help me, sir.

@harleyquinn5245 4 years ago

Sir, can I become a data analyst after 12th?

@HabeshaTV1 1 year ago

Can you provide a mock interview?

@pukyalligator 2 years ago

Great Video. Thx!!

@zehraup4722 4 years ago

Here is a great explanation: www.kaggle.com/c0derr/outlier-detection?scriptVersionId=39511980

@saurabhbarasiya4721 4 years ago

Great sir

@boubacaramaiga4408 4 years ago

Fantastic, many thanks.

@harshal_ajetrao 4 years ago

Thanks for the video, sir. I am new to machine learning. I used the percentile, standard deviation and z-score methods, but the problem I get with the standard deviation and z-score methods is that removing the outliers doesn't change the values in our data, i.e. df; rather, the result gets stored in a new frame, df_no_outlier_std_dev. So how do I update df itself after removing the outliers? Please help.

@viveksingh881 3 years ago

That is because we are storing it in a new DataFrame, not the original one. In case you want the changes to be reflected in the original DataFrame, assign the filtered result back to the original variable: df = df[......code.....]. (Boolean filtering has no inplace argument.) Happy learning.

@harshal_ajetrao 3 years ago

@viveksingh881 Thanks. That was a story from 6 months back. Now I'm at an intermediate level in machine learning 👍

@viveksingh881 3 years ago

@harshal_ajetrao That's great, bro. Just clearing some doubts on random YouTube videos. Happy learning :)

@harshal_ajetrao 3 years ago

@viveksingh881 Thanks for helping, man. Keep it up 🤘🤘🤘

@anirbaniitgn8407 2 years ago

Everything is good when you apply the z-score to search for outliers that are all positive or all negative. If both positive and negative outliers are present together, then it does not work! Try this simple dataset: data = [1, 2, 2, 2, 3, 1, 1, -19, 2, 2, 2, 3, 1, 1, 2, 19, 25]. With the IQR method you can detect -19, 19 and 25, all three, but with the z-score it does not work. I don't know the reason. If you know, sir, then let us know.
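
The likely explanation is an effect usually called masking, and it is about sample size and spread more than about sign. In a set of only 17 values, three extreme points inflate the standard deviation so much that none of them ends up more than 3 standard deviations from the mean. Quartiles are barely affected by the extremes, so the IQR rule still flags them. The snippet reproduces this with the data from the comment; the 1.5 multiplier is the usual convention.

```python
from statistics import mean, stdev, quantiles

# Data copied from the comment above.
data = [1, 2, 2, 2, 3, 1, 1, -19, 2, 2, 2, 3, 1, 1, 2, 19, 25]

# z-score rule: the three extremes inflate sigma to roughly 8.8,
# so even 25 sits only about 2.5 standard deviations from the mean.
mu, sigma = mean(data), stdev(data)
z_outliers = [x for x in data if abs((x - mu) / sigma) > 3]

# IQR rule: quartiles ignore how far away the extremes are.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(round(sigma, 2), z_outliers, iqr_outliers)
```

On small samples, or when outliers make up a noticeable share of the data, a median-based or quartile-based rule is the safer choice.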

@ssrriinniivvaass 4 years ago

Hi sir, how do I decide the z-score values? Does it depend on my data, or is it always -3 to +3?

@codebasics 4 years ago

Usually it is between -3 and 3, but yes, it depends on the data. Sometimes people use more than 3, based on the data distribution.

@pythongui5199 3 years ago

Very nice

@nikhilgaikwad9954 4 years ago

How do I select the number of standard deviations in the z-score technique to remove outliers?

@codebasics 4 years ago

The general guideline is 3 or more. If the data set is small, people use 2 std dev too, but be careful that you don't remove data points that can add value to the data analysis process.
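
To get a feel for what each threshold costs, the sketch below computes the share of a perfectly normal population that would be flagged at a given cutoff. It assumes an ideal bell curve, so real data will differ; the helper name is made up.

```python
from statistics import NormalDist

def fraction_flagged(n_std):
    """Share of an ideal normal population lying beyond +/- n_std."""
    return 2 * (1 - NormalDist().cdf(n_std))

for k in (2, 3):
    print(f"|z| > {k}: {fraction_flagged(k):.4%}")
```

A cutoff of 3 flags roughly 0.27% of normal data, while a cutoff of 2 flags about 4.6%, which is why 2 should be used with care: much of what it removes is ordinary variation.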