Being in the teaching profession, I can assure you this is the best explanation of Pearson correlation. Please give it more likes.
@nurnasuhamohddaud7282 жыл бұрын
Very comprehensive explanation for someone from non AI background. Thanks Sir keep up the good work!
@prakash5644 жыл бұрын
Sir your channel is a perfect combination of sentdex and statquest. You are doing a great work 🙌more power to you!!
@shubhambhardwaj36434 жыл бұрын
Any word is not sufficient to thank you for your work sir ....🙏🙏
@andyn60532 жыл бұрын
In which order should you do the feature selection steps?
0. Clean the dataset: get rid of NaN and junk values, check the format of the datatypes in the test set, etc.
1. Use the z-score method to eliminate outliers.
2. Normalize the train_X data.
3. Check the correlation between the x_train variables and y_train. Drop variables that have a low correlation with the target variable.
4. Use Pearson's correlation test to drop highly correlated variables from x_test.
5. Use the variance threshold method to drop x_train variables with low variance. All variables that have been removed from the x_train data should be removed from x_test as well.
6. Fit x_train and y_train to a classification model.
7. Predict y(x_test).
8. Compare the predicted y(x_test) output with y_test to calculate accuracy.
9. Try different classification models and see which one performs best (has the highest accuracy).
Is this the right order? Have I missed something?
@waytolegacy3 жыл бұрын
I think that instead of dropping "either of" 2 highly correlated features, we should check how each of them correlates with the target as well, and then drop the one that is less correlated with the target variable. That might improve accuracy somewhat compared with dropping whichever comes first. Again, that is just what I think.
@djlivestreem40392 жыл бұрын
good point
@beautyisinmind21632 жыл бұрын
You can check the importance value of each feature using a random forest (RF), and the one with the lower importance value can be dropped.
@niveditawagh81712 жыл бұрын
Good point
@niveditawagh81712 жыл бұрын
Can you please tell me how to drop the less correlated variable with the target variable?
@beautyisinmind21632 жыл бұрын
@@niveditawagh8171 You only drop a feature when two feature variables are highly correlated with each other. You don't have to drop a feature just because it is less correlated with the target variable, because such a feature could still be a good predictor in combination with other features.
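A minimal sketch of the idea raised in this thread, assuming pandas: for each pair of features whose absolute correlation exceeds a threshold, mark the member that is less correlated with the target. The function name `weaker_of_correlated_pairs`, the threshold, and the toy data are hypothetical and are not taken from the video.

```python
import pandas as pd

def weaker_of_correlated_pairs(X, y, threshold=0.9):
    """For each feature pair with |corr| > threshold, mark the member
    that has the smaller absolute correlation with the target y."""
    feature_corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    cols = list(feature_corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i):  # lower triangle only, so each pair is seen once
            a, b = cols[i], cols[j]
            if a in to_drop or b in to_drop:
                continue  # one member of this pair is already marked
            if feature_corr.iloc[i, j] > threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return to_drop

# Hypothetical toy data: f2 is almost 2 * f1, f3 is unrelated noise.
X = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [2, 4, 6, 8, 10, 13],
    "f3": [5, 1, 4, 2, 6, 3],
})
y = pd.Series([10, 20, 30, 40, 50, 60])  # exactly linear in f1

dropped = weaker_of_correlated_pairs(X, y, threshold=0.9)
print(dropped)
```

Here f1 and f2 form the only highly correlated pair, and f2 is marked because it tracks the target slightly less closely than f1. As the reply above notes, a low correlation with the target on its own is not a reason to drop a feature; this sketch only breaks ties within a redundant pair.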
@ashishkulkarni81404 жыл бұрын
Sir, could you please upload more videos on feature selection to this playlist? It is very amazing. I followed all the videos from feature engineering playlist. You are doing a great work. Thank you.🙏🏻
@rukmanisaptharishi66384 жыл бұрын
If you are transporting ice-cream in a vehicle, the number of ice-cream sticks that reach the destination is inversely related to temperature: the higher the temperature, the fewer the sticks. If you want to effectively model the temperature of the vehicle's cooler and make it optimal, you need to consider these negatively correlated features: outside air temperature and number of ice-cream sticks at the destination.
@KnowledgeAmplifier14 жыл бұрын
I want to point out a very important concept that is missing from this video's discussion: if 2 input features are highly correlated, it does not mean I can drop either one of them. I have to check which of the 2 features has the weaker correlation with the output variable, and that is the one to drop.
@siddharthdedhia114 жыл бұрын
what do you mean by weaker? do you mean the most negative?
@KnowledgeAmplifier14 жыл бұрын
@@siddharthdedhia11, here , weaker means lesser correlation with output feature .
@siddharthdedhia114 жыл бұрын
@@KnowledgeAmplifier1 so for example between -0.005 and -0.5 , -0.005 is the one with lesser correlation right?
@KnowledgeAmplifier14 жыл бұрын
@@siddharthdedhia11 Yes, correct: a correlation value close to 0 is considered weak, and a value close to 1 or -1 means a strong relationship :-)
@amankothari55083 жыл бұрын
@jayesh naidu
@sukanyabag61344 жыл бұрын
Sir, the videos you uploaded on feature selection helped a lot ! , Please upload the rest tutorials and methods too! Eagerly waiting for it !
@alphoncemutabuzi69493 жыл бұрын
I think the abs is important since it's like having two rows one being the opposite of the other
@MrKaviraj753 жыл бұрын
Yes, I think so too. If changes to one feature affects another feature, they are dependent, in other words, they are correlated.
@suhailsnmsnm539710 ай бұрын
amazing teaching skills you have bhaai ... THNX
@elvykamunyokomanunebo1441 Жыл бұрын
Thanks krish, You've earned a rocket point from me :) Would have been nice, if the function also printed which feature it was strongly correlated with: because from the code you dropped all the features that met the threshold, not one was kept.
@yashkhant58744 жыл бұрын
GREAT CONTRIBUTION SIR.... THIS CHANNEL SHOULD HAVE 20M SUBSCRIBERS🤘🤘
@parms11914 жыл бұрын
I write the threshold code simply like [df.corr()>0.7 OR df.corr()
@codertypist3 жыл бұрын
Let's say variables x, y and z are all strongly correlated to each other. You would only need to use one of them as a feature. By saying [df.corr()>0.7 or df.corr()
@suhailabessa99012 жыл бұрын
thank you sOOo much , perfect explaining :) good luck with your channel that is recomended
@neelammishra56222 жыл бұрын
Your knowledge is really invaluable. Thanks
@gurdeepsinghbhatia28754 жыл бұрын
I think it all depends on domain that whether to involve the neg corr or not , or we can train two diff models and compare their scores , Thanks Sir
@ActionBackers3 жыл бұрын
This was incredibly helpful; thank you for the great content!
@dinushachathuranga765710 ай бұрын
Thanks a lot for very clear explanation.❤
@RandevMars43 жыл бұрын
Well explained. Really great work sir. Thank you very much
@naysharm3 жыл бұрын
watching this video from Boston (BU Student
@ireneashamoses42094 жыл бұрын
Great video!! Thank you!👍👍💖
@JithendraKumarumadisingu3 жыл бұрын
Great tutorial it helps a lot thanks @Krish Sir
@abinsharaf83053 жыл бұрын
since we are giving only one positive value for threshold, the code abs allows check for both negative and positve values with threshold, so i feel its better if it stays
@Moiz_tennis2 жыл бұрын
I have a doubt. Suppose A and B have a correlation greater than the threshold, and the loop includes column A from the pair. Further, B and C are highly correlated (although C is not highly correlated with A), and the loop includes B in the list. Now if we drop A and B, wouldn't that affect the model, since both A and B will be dropped?
@СалаватФайзуллин-щ3д Жыл бұрын
Should small values of correlation such as -0.95 be deleted or they are good to train our model and they should stay in data frame?
@JenryLuis Жыл бұрын
Hi friend, I think the correlation function is removing more than expected, because while the for loops iterate they do not check, for a value > threshold, whether the column or the index was already removed before. I corrected the function, and in this case the features removed are these: {'DIS', 'NOX', 'TAX'}. I also tested it by creating the correlation matrix again and verifying that there are no values > threshold. Could you please check it?

def correlation(dataset, threshold):
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                if (corr_matrix.columns[i] not in col_corr) and (corr_matrix.index.tolist()[j] not in col_corr):
                    colname = corr_matrix.columns[i]
                    col_corr.add(colname)
    return col_corr
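To see what the extra check changes, here is a small sketch on a chain of correlations (A with B, B with C, but not A with C), the situation described a few comments earlier. `correlation_all` is an assumption about the video's function, reconstructed as the corrected function minus its inner check; `correlation_checked` follows the comment above. The data and the 0.6 threshold are hypothetical.

```python
import pandas as pd

def correlation_all(dataset, threshold):
    """Assumed original behavior: flag column i for every pair above the threshold."""
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

def correlation_checked(dataset, threshold):
    """Variant from the comment: skip pairs in which a member is already flagged."""
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                if (corr_matrix.columns[i] not in col_corr
                        and corr_matrix.columns[j] not in col_corr):
                    col_corr.add(corr_matrix.columns[i])
    return col_corr

# B = A + C with A and C uncorrelated, so corr(A, B) = corr(B, C) ~ 0.707
# while corr(A, C) = 0.
df = pd.DataFrame({
    "A": [1, -1, 1, -1],
    "B": [2, 0, 0, -2],
    "C": [1, 1, -1, -1],
})

flag_all = correlation_all(df, 0.6)          # flags B and C
flag_checked = correlation_checked(df, 0.6)  # flags only B
print(flag_all, flag_checked)
```

In this toy case the unchecked version also flags C, even though C is only redundant with B, which is already being removed. Note that in both versions the first column of a pair in column order (here A) is never flagged.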
@gabrielegbenya74792 жыл бұрын
great video. very informative and educative. Thank you
@tigjuli4 жыл бұрын
Nice! please upload more on this topic!! thank you!
@abhishekd10123 жыл бұрын
In this video it's said negatively correlated features are both imp. lets take an example, when we have both percentage and ranks in a dataset, for 100% we have 1 in rank and 60% lets say 45(last) in rank. both resemble the same importance in the dataset. So what I think is we can remove one feature among those 2 features, otherwise we will be giving double weightage for that particular feature. Hope someone can correct this if I was wrong.
@TejusVignesh2 жыл бұрын
You are a legend!!🤘🤘
@hibaabdalghafgar Жыл бұрын
Again, I wish you had explained how to handle the test set... but the explanation is excellent, I am really grateful.
@nahidzeinali199110 ай бұрын
Thanks so much! very useful. you are so good
@josephmart75283 жыл бұрын
The abs takes care of both positive and negative numbers. If it is not specified, the function will only take care of positively correlated features.
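A tiny illustration of that point in plain Python. The pair names and correlation values are made up for the example; only the comparison logic matters.

```python
threshold = 0.9

# Hypothetical pairwise correlations, keyed by a made-up pair label.
correlations = {"f1_f2": 0.95, "f1_f3": -0.99, "f2_f3": 0.10}

# Without abs(): a strong negative correlation never exceeds a positive threshold.
flagged_without_abs = [k for k, r in correlations.items() if r > threshold]

# With abs(): strong positive and strong negative correlations are both caught.
flagged_with_abs = [k for k, r in correlations.items() if abs(r) > threshold]

print(flagged_without_abs)
print(flagged_with_abs)
```

With a single positive threshold, the comparison without `abs()` flags only the 0.95 pair, while the comparison with `abs()` flags the -0.99 pair as well.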
@niveditawagh81712 жыл бұрын
Nice explanation.
@salihsarii Жыл бұрын
Thanks Krish 😊
@pankajkumarbarman7652 жыл бұрын
Very helpful . Thank you sir.
@amitmodi78824 жыл бұрын
Wonderful explanantion. Krish as mentioned in video you said you upload 5-6 videos for feature selection. Can you please share the link for rest of them.
@kalvinwei193 жыл бұрын
Thank you man, good for my assignment
@antoniodefalco61793 жыл бұрын
thank you, so usefull, good teacher
@hirakaimkhani33382 жыл бұрын
wonderful tutorial sir!!
@raghavkhandelwal10944 жыл бұрын
waiting for more videos in the playlist
@pratikjadhav12423 жыл бұрын
We check the correlation between the inputs and the output, so why do you drop the output column and then check the correlation? We use VIF (variance inflation factor) to check the relationship between the inputs, and the preferred threshold value is 4.
@youcefyahiaoui14657 ай бұрын
Great tutorial, but I think you're mistaken about the abs(). You're actually considering both with abs(). If you remove abs() and you keep the > inequality then a 0.95 would be > Thresh=0.9, but -0.99 would not satisfy this condition! If you want to remove abs(), then you need to test 2 conditions, like if corr_matrix.iloc[i,j] > +1*thesh (assuming thres is always +ve) and corr_matrix.iloc[i,j]
@nkechiesomonu87642 жыл бұрын
Thanks sir for the good job you have been doing . God bless you. Please sir my question is can we use correlation on image data. Thanks
@shivarajnavalba50423 жыл бұрын
Thank you Krish,
@perumalelancgoan98393 жыл бұрын
Please clarify the following: if any independent variables are highly correlated, we shouldn't remove them, right? Because they give a very positive outcome.
@siddhantpathak62893 жыл бұрын
Hi Krish, I checked it somewhere and I think if the dataset has perfectly positive or negative attributes then in either case there is a high chance that the performance of the model will be impacted by Multicollinearity.
@nmuralikrishna45992 жыл бұрын
General question: what if we drop a few of the important features from the data and train again? Will the accuracy drop? Or the precision?
@drshahidqamar2 жыл бұрын
LOL, you are just amazing, Boss
@thecitizen9747 Жыл бұрын
You are doing a great job but can u please do similar series on categorical features in a regression problem?
@yasharthsingh8054 жыл бұрын
Sir , can you please tell which website should I refer if I want to start reading white papers.... Please please do reply....I follow all ur videos!!
@deepanknautiyal57254 жыл бұрын
Hi krish please a make a video on complete logistic regression for Interview preparation
@suneel84804 жыл бұрын
Sir make video on how to select features for clustering?
@killerdrama55212 жыл бұрын
What if we have some features numerical and some features are categorical against categorical output .. which feature section method will be helpful
@jannatunferdous103 Жыл бұрын
Sir, what you've shown in the last of this video, in that big data project, after deleting those 193 features, how I can deploy the model? Please share a video (or link if you have in your playlist) the deployment phase after deleting features. Thanks. ❤
@mariatachi83986 ай бұрын
Amazing content!~
@sanketargade3685 Жыл бұрын
Why are we dropping highly correlated features after splitting into train and test? Wouldn't it be easier to drop the features from the original dataset and then simply split it?❓😕🤔
@omi_naik4 жыл бұрын
Great explanation :)
@kjrimer2 жыл бұрын
Hello nice video, how to do feature selection if we have more than one target variable? i.e. In case of MultiOutput Regression problem how we can do feature selection. do we have to perform the pearson correlation individually on each of target variable or is there another convenient way that can solve the problem?
@chineduezeofor24814 жыл бұрын
Another great video!!!
@waatchit3 жыл бұрын
Thank you for such a nice explanation. Does having 'abs' preserve the negative correlation ??
@TelugodiPrapanchaYathra2 жыл бұрын
Can we drop features while comparing correlation of dependent variable with independent variables by taking some threshold....!
@conceptsamplified4 жыл бұрын
Of the highly correlated columns, Should we not keep one of the columns in our X_train dataset?
@ajaykushwaha-je6mw2 жыл бұрын
Hi everyone i need one help. this technique to select numerical features only. Suppose we have done one hot encoding on categorigal data and converted into numerical then can we apply this technique on that features as well(entire data set with numerical column and categorical column converted into numerical with some encoding technique.) Kindly help me to understand.
@laxmanbisht26383 жыл бұрын
Hi, thanks for the lecture. What if we have a dataset in which categorical and numeric features are present. Will pearson's correlation be applicable?
@Jnalytics2 жыл бұрын
Pearson's correlation only works with numeric features. However, if you want to explore the categorical features, you can use Pearson's chi-square test. You can use SelectKBest from scikit-learn with chi2. Hope it helps!
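A minimal sketch of that suggestion, assuming scikit-learn and pandas are installed. `SelectKBest` and `chi2` are the scikit-learn names; the toy columns, the one-hot encoding step, and `k=2` are assumptions made for illustration. `chi2` expects non-negative features, which one-hot indicators satisfy.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical categorical data: "color" separates the classes, "size" does not.
df = pd.DataFrame({
    "color": ["red", "red", "red", "red", "blue", "blue", "blue", "blue"],
    "size": ["big", "small", "big", "small", "big", "small", "big", "small"],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

# One-hot encode so every feature is a non-negative 0/1 indicator.
X_encoded = pd.get_dummies(df, dtype=int)

# Keep the k indicator columns with the highest chi-square score against y.
selector = SelectKBest(score_func=chi2, k=2).fit(X_encoded, y)
selected = list(X_encoded.columns[selector.get_support()])
print(selected)
```

On this toy data the two `color_*` indicators score highest, because color lines up with the class label while size is split evenly across both classes.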
@marcastro80522 жыл бұрын
Thanks, Sir.
@bishwa24june4 жыл бұрын
Hello Krishna thanks for your video but along with please explain real life use as well. Where can we use in real life.
@amarkumar-ox7gj4 жыл бұрын
If idea is to remove highly correlated features, then both highly positive and negative correlation should be considered!!
@Learn-Islam-in-Telugu3 жыл бұрын
The function used in the example will not deliver high correlation with the dependent variable. Because at the end you dropped the columns without being checking the correlation with dependent variable.
@Eric-bq1jo2 жыл бұрын
Is there any way to apply this approach to a classification problem where the target variable is 1 or 0?
@Egor-sm4bl3 жыл бұрын
Perfect defence on 3rd place!
@levon93 жыл бұрын
Two quick questions: (1) Why not remove redundant features, ie highly correlated variables, from X before splitting it into training and test? What would be wrong with this approach? (2) If one features variable is correlated with a value of 1 and another variable with a value of -1 with regard to a given feature, are these also considered redundant?
@ankitmahajan36743 жыл бұрын
Hi Krish while removing the correlated features we haven't checked that the independent variable is corelated to dependent variable. As you said in staring we should not remove the features that are highly correlated to dependent variables so while generating the heatmap should we include the dependent variable also ? let me know if my understanding is correct?
@prateekkhanna45903 жыл бұрын
Hi Ankit, If we include the dependent variable in this feature selection process, the accuracy of our model might get compromised. Also if you can see in video if 2 features are highly correlated we are only removing 1 feature. So if that feature has good correlation with dependent variable which we don't know yet it is still in the dataset. (As we have dropped only one feature out of those 2)
@erneelgupta2 жыл бұрын
what is the importance of random_state in train_test split ? How the values of random_state (0,42,100 etc.) affect the estiamation???
@sivadevil48458 ай бұрын
Hi @krish naik, i want to know how much data cleaning and models selection and models performance and how we can do that. I hope u will explain if u find this comment.
@meshmeso8 ай бұрын
These are on numeric features, what of correlation between categorical features ?
@phyuphyuthwe6704 жыл бұрын
Dear teacher, May I ask a question? In my case, I want to predict sale of 4 products with weather forecast information, season and public holiday one week ahead. So, do I need to organize weekly based data? When we use SPSS, we need to organize weekly data, how about Machine Learning? I feel confused for that. In my understanding, ML will train the data with respect to weather information. So, we don't need to organize weekly data because we don't use time series data. Is it correct? Please kindly give me a comment.
@doggydoggy5782 жыл бұрын
Hello can I ask a question ? Is Pearson Correlation the same as Correlation-based Feature Selection ?
@TelugodiPrapanchaYathra2 жыл бұрын
Multi collinearity has checked but what about the Correlation of dependent vs independent variables
@aritratalapatra84523 жыл бұрын
If I have 3 correlated columns, I should drop 2 out of 3 right ? why do you drop all correlated features from training and testing set ?
@asha45452 жыл бұрын
Hello Sir my dataset contains 17000 features, when I execute corr() its taking more than 5 minutes to execute and also for generating heatmap memory related error generating. Can you help to solve the issue?
@MominSaadAltafnab2 жыл бұрын
I didn't understand why we are considering only X_train for finding the correlation. You said we are doing that to avoid overfitting, but I still don't get how it would overfit if we used all the data. Can someone please tell me why we are doing that?
@venkatk15914 жыл бұрын
Do we need use the entire datasets for correlation testing. Are we not missing something by considering the train set only?
@piyushdandagawhal88434 жыл бұрын
Instead of doing the X_train, X_test split, what if we find the correlation on the whole data, then compare each correlated column's correlation with the dependent feature, and drop only those features among the correlated columns which are less correlated with it? Does my question make sense? If it does, would it affect the model?
@PraveenKumar-pd9sx4 жыл бұрын
Same doubt
@YS-nc4xu4 жыл бұрын
I believe those should be two separate questions. Regarding the split, it is necessary to split before getting correlation to understand its effect on the test data. If you do not split, then when testing, you're already assuming the correlation to be present in the test data and thus overfitting. Remember, the actual "test" data will always be unknown to us, and the split helps us validate the model and generalize it for the future unknown data. For the second question: Yes, that makes sense to me. After getting the "multi-correlated" columns, we can calc the correlation of each with the target, and drop the ones with low absolute correlation.
@PraveenKumar-pd9sx4 жыл бұрын
@@YS-nc4xu Why should we split before the correlation check
@piyushdandagawhal88434 жыл бұрын
@@YS-nc4xu YES!! i get it now, Thank you for sorting the issue!
@piyushdandagawhal88434 жыл бұрын
@@PraveenKumar-pd9sx if we check correlation of whole data rather than splitting(X_train, X_test). there is a chance that the correlation of whole data will be slightly different than the correlation if we had split. this might give us a better result on the validation (X_test) but would not perform on the actual test data when we deploy it in real world. this is my understanding from @Y S's comment.
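The workflow described in this thread can be sketched as follows: decide which columns to drop from the training rows only, then drop the same columns from the test rows. This assumes pandas; the `correlation` function is modeled on the one quoted elsewhere in these comments, the toy data is hypothetical, and a plain positional split stands in for scikit-learn's `train_test_split`.

```python
import pandas as pd

def correlation(dataset, threshold):
    """Flag column i of every pair (i, j), j < i, with |corr| above the threshold."""
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

# Hypothetical data: f2 is roughly 2 * f1, f3 is unrelated.
X = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6, 7, 8],
    "f2": [2, 4, 6, 8, 10, 13, 14, 16],
    "f3": [5, 1, 4, 2, 6, 3, 2, 5],
})

# Stand-in for train_test_split: first 6 rows train, last 2 rows test.
X_train, X_test = X.iloc[:6], X.iloc[6:]

# The decision uses the training rows only, so nothing leaks from the test rows.
to_drop = correlation(X_train, 0.9)

# The same columns are then removed from both sets.
X_train_sel = X_train.drop(columns=sorted(to_drop))
X_test_sel = X_test.drop(columns=sorted(to_drop))
print(to_drop, list(X_test_sel.columns))
```

The test rows never influence which columns are chosen, which is the point made in the replies above: the selection step is treated as part of the model, so it is fitted on training data and merely applied to the test data.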
@StanleySI3 жыл бұрын
Hi sir, there's an obvious flaw in this approach. You can't drop all correlated features, but only some of them. e.g. perimeter_mean & area_se are highly correlated (0.986507), and they both appear in your corr_features. However, you can't drop all of them because from pairplot, you could see perimeter_mean has a clear impact on the test result.
@HumaidAhmadKidwai176 ай бұрын
How to check correlation between numerical column (input) and categorical output(in the form of 0s and 1s)
@megalaramu4 жыл бұрын
Hi kris, in multicollinearity conceps we have both corrlation matrix as well as VIF to remove the collinearity. Which method is best or does that depend upon data
@krishnaik064 жыл бұрын
Both are good...u can use any of them
@megalaramu4 жыл бұрын
@@krishnaik06 i worked on a dataset which was highly correlated features and both these methods gave me different results. Hence was confused which method to use. Thats why this question. Thanks
@krishnaik064 жыл бұрын
But I have found VIF to be much better
@PraveenKumar-pd9sx4 жыл бұрын
Hi. megala.. What is VIF. Can you pls tell me
@arjundev49084 жыл бұрын
@@PraveenKumar-pd9sx In short, VIF is the variance inflation factor, which also helps in finding multicollinearity between independent variables.
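For readers who have not met VIF, a small sketch assuming numpy and pandas. It relies on the identity that the VIF of feature i, defined as 1 / (1 - R_i^2) with R_i^2 from regressing feature i on the other features, equals the i-th diagonal entry of the inverse of the feature correlation matrix. The function name `vif_table` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """VIF per feature, read off the diagonal of the inverse correlation matrix."""
    corr = X.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns, name="VIF")

# Hypothetical data: f1 and f2 are nearly collinear, f3 is unrelated.
X = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [2, 4, 6, 8, 10, 13],
    "f3": [5, 1, 4, 2, 6, 3],
})

vif = vif_table(X)
print(vif)
```

On this data f1 and f2 get very large VIF values while f3 stays close to 1. Unlike a pairwise correlation check, VIF also reacts when a feature is explained by a combination of several others, which may be one reason the two methods gave different results in the thread above.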
@abebebelew2056 Жыл бұрын
Best!
@rafibasha41452 жыл бұрын
Hi Krish,how to check in case of categorical variables
@prabhusantoshpanda52594 жыл бұрын
While dropping the columns using the list of all correlated columns, aren't we deleting all of them and not even retaining the ones we actually want? For example, suppose we get 3 correlated columns in the list (corr > 0.8), e.g. corelated_columns=[f1,f2,f3], and then apply x_train=x_train.drop(corelated_columns,axis=1). Then all 3 are dropped, whereas we want to drop only 2 and retain one. Please clarify.
@YS-nc4xu4 жыл бұрын
That's a great question! I believe, we would want to retain one and drop the rest. Dropping all will be a loss of information imo. I would also suggest adding i and j column to the 'set' as well. This would help get pairs of correlated columns rather than just a list. For example, replacing col_corr.add(colname) by col_corr.add((corr_matrix.columns[i], corr_matrix.columns[j])) will give us the pairs, and then we can decide which one to keep. Again this is just my opinion, I might be wrong. Happy learning!
@prabhusantoshpanda52594 жыл бұрын
@@YS-nc4xu Actually, this approach of getting correlated pairs is correct, but there is one flaw. I have faced this flaw myself, and it is quite problematic when tackling a dataset with more than 500 feature columns. What happens is that we get too many combinations of correlated pairs, and they are doubled in number because while iterating we get both orderings, e.g. the correlated list below: [f4,f9],[f9,f4],[f5,f9],[f8,f7],[f4,f8],[f8,f9],[f7,f8],[f9,f5]. Check the kaggle.com/MoA prediction competition and run Pearson's corr on the dataset. You will be shocked. Going through the whole list again and finding the correlated columns for each feature while handling the duplicate pairs is going to be very difficult if done manually. I am in the process of trying to figure out a solution to this, and hopefully I will. Peace.
@YS-nc4xu4 жыл бұрын
@@prabhusantoshpanda5259 Sure, my response was just for your point of dropping all correlated cols in the given problem. Additionally, the for loops shown in the video, takes care of the repetition mentioned by you. The 'for j in range(i): ' considers only the lower triangular matrix, thus eliminating the repetitions. Furthermore, for data with more than 500 cols, obviously one wouldn't want to go with Pearson's corr. I believe, this video was to give a basic use case of corr on simple data and not on a high dimension data. In my opinion, PCA / SVD might help for your problem . Peace out!
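A short sketch of the pair-collecting variant discussed in this thread, assuming pandas. Because the inner loop runs over `range(i)`, only the lower triangle of the matrix is visited, so each pair appears once. The function name `correlated_pairs` and the toy data are hypothetical.

```python
import pandas as pd

def correlated_pairs(dataset, threshold):
    """List (column_i, column_j, corr) for every pair with |corr| > threshold."""
    corr_matrix = dataset.corr()
    cols = corr_matrix.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i):  # j < i: lower triangle, no mirrored duplicates
            r = corr_matrix.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Hypothetical data: f2 is roughly 2 * f1, f3 is unrelated.
X = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [2, 4, 6, 8, 10, 13],
    "f3": [5, 1, 4, 2, 6, 3],
})

pairs = correlated_pairs(X, 0.9)
print(pairs)
```

Returning the pairs instead of a flat set of names lets you see which feature each flagged column is correlated with and then decide which member of each pair to keep.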
@arunkrishna10362 жыл бұрын
Hi Krish.. how about using an VIF to find the correlated features?
@rmrz2225 Жыл бұрын
Hi, sorry for my question, but why is he dropping the most correlated features? Shouldn't he keep those features and drop the less correlated ones?
@__SHRUTHISRINIVASAN6 ай бұрын
Same doubt here
@mgfg223 жыл бұрын
Why you don't use corr_features = correlation( X , 0.7 ) instead of X_train. (Please look at 08:22)
@teenamadhu78833 жыл бұрын
How to get the name of the column which is highly correlated to the given column. Please help
@aayushdadhich48404 жыл бұрын
Should i practice by writing my own full code including the hypothesis functions, cost functions, gradient descent or fully use sklearn?
@YS-nc4xu4 жыл бұрын
If you're a student and have time to explore, please go ahead and implement it from scratch. It'll really help you to not only understand the basic working but also the software development aspect of creating any model (refer sklearn documentation and source code) and get to know more about industry level coding practices.
@oladosuoladimeji3703 жыл бұрын
How can correlated features be selected for a multi label learning task especially in images
@rahuldevnath147923 жыл бұрын
Krish, can we not use VIF for collinearity?
@nishadseeraj70343 жыл бұрын
Can someone explain how the 2nd for loop is working? I am not getting it. For instance "for j in range(i)", wouldn't that give an error when i=0 for the first iteration of the first for loop when i=0, unless I am missing something?
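A quick way to see what the nested loop does, in plain Python with a hypothetical matrix size of 4: `range(0)` is simply empty, so the inner loop body does not run at all when `i` is 0 and no error is raised.

```python
n = 4  # stand-in for len(corr_matrix.columns)

visited = []
for i in range(n):
    for j in range(i):  # for i = 0 this is range(0), which yields nothing
        visited.append((i, j))

print(list(range(0)))
print(visited)
```

The visited index pairs are exactly the cells below the diagonal of an n x n matrix, so each pair of columns is compared once and no column is compared with itself.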