Principal Component Analysis (PCA) using sklearn and Python

218,169 views

Krish Naik

A day ago

Here is a detailed explanation of the PCA technique, which is used for dimensionality reduction, using sklearn and Python.
Reference: Special thanks to Jose Portilla
Github Link: github.com/kri...
You can buy my book, where I have provided a detailed explanation of how we can use Machine Learning and Deep Learning in finance using Python.
Packt url : prod.packtpub....
Amazon url: www.amazon.com...

Comments: 157
@assiaben1220 1 year ago
Wow, fascinating; I've never seen a YouTube video that was this clear and easy to understand. Excellent explanation. Deepest appreciation. 🧡👌
@AmirAli-id9rq 3 years ago
A lot of people in the comments asked about the intuition of PCA, so here it is. We plot samples using the given features. For example, imagine plotting different students (samples) on a 3D graph (features: English Literature marks, Math marks, and English Language marks; x-axis English Literature, y-axis English Language, z-axis Math). Intuitively, someone who is good in English Literature is probably good in English Language, so if I ask you to consider only two dimensions (features) for any classification model, you will consider Math and either of the English subjects, because we know by experience that the variation between the two English subjects would be small. Thus in PCA we actually project the samples (the students in our example) onto n PCA axes and choose the PCs that explain the maximum variation in the data. If we add up the variation of all PCs it will be 1, or 100%. So, instead of using all three subject marks, I would rather use PC1 and PC2 as my features. For PCA, follow these steps: 1. Once you have the 3D plot ready, calculate PC1, the best-fitting line that passes through the origin. 2. Calculate the slope of PC1. 3. Calculate the eigenvector for the best-fitting line. 4. Find PC2, i.e., a line perpendicular to PC1 that passes through the origin. 5. Now rotate the graph so that PC1 is the x-axis and PC2 is the y-axis, and project your samples. It's kind of tough to imagine; do read more. Hope this helps.
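A minimal NumPy sketch of those steps, using made-up marks for eight hypothetical students (the numbers are invented purely for illustration):

import numpy as np

# Columns: [English Literature, English Language, Math]; Literature and
# Language are deliberately correlated, as in the example above.
X = np.array([[85, 82, 40], [78, 75, 90], [92, 88, 55], [60, 58, 70],
              [70, 72, 65], [88, 85, 30], [55, 50, 95], [65, 68, 60]], dtype=float)

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors of a symmetric matrix

order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())          # explained variance ratios; they sum to 1

scores = Xc @ eigvecs[:, :2]            # project the students onto PC1 and PC2
print(scores.round(2))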
@Mathschampion510 1 year ago
Why are we calculating eigenvectors?
@ashleypursell9702 4 years ago
Really good video that covers how each piece of the code works and how to implement it in other programs.
@Amr_ZR 4 months ago
Easy to understand, brief and clear. Thank you, and may Allah reward you with good ✨❤ Alhamdulillah
@nithishh2384 2 years ago
Literally, I have searched and seen many videos, but this one has the best explanation.
@RohitGupta-ox6tn 3 years ago
It is known that PCA causes loss of interpretability of the features. What is the alternative to PCA if we don't want to lose interpretability, @Krish Naik? Say we have 40K features and we want to reduce the dimensionality of the dataset without losing interpretability.
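One interpretability-preserving alternative is plain feature selection, which keeps a subset of the original columns instead of blending them into components. A hedged sketch on the video's breast-cancer data (k=10 is an arbitrary choice for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

cancer = load_breast_cancer()
selector = SelectKBest(f_classif, k=10)              # keep the 10 strongest features
X_sel = selector.fit_transform(cancer.data, cancer.target)

# Unlike PCA, the surviving columns keep their original names:
kept = cancer.feature_names[selector.get_support()]
print(kept)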
@kunaljha6204 5 years ago
Well explained, sir... if possible, kindly make a video on the mathematics behind PCA... awaiting more videos on ML.
@krishnaik06 5 years ago
Sure Kunal, I will upload the video on the mathematics behind PCA. Please make sure you press the bell icon; as soon as I upload the video, the notification will come. I will upload it within a week.
@kunaljha6204 5 years ago
@@krishnaik06 Bell already pressed, sir... thanks for your response... excited to watch your upcoming videos.
@Taranggpt6 4 years ago
I can't find any video explaining the maths... Can anyone give me the link if it's here??
@nareshjadhav4962 4 years ago
@tarang gupta... same here, but the math-behind-PCA video is not there.
@saliherenyuceturk2398 1 year ago
Amazing; simple and straightforward.
@sharifdmd 5 years ago
Hello Krish, first of all thanks for the very neat and detailed explanation. Requesting you to explain underfitting and overfitting models: how to identify whether a model is underfitting or overfitting, how to avoid this scenario, etc. If a video is already available, share it with me... Thanks
@sikansamantara9504 4 years ago
Hi Krish, I have a doubt: how can we know how many components are required, and which is the feasible choice? Here 30 features were present and you reduced them to 2; my question is, if I reduce them to 5, will it affect my model's performance? How can we know the right number of features for a model?
@lillyban9393 1 year ago
Yes, it can reduce the model score.
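A common way to settle the number: fit PCA with all components once and read off the cumulative explained variance. A sketch on the video's dataset (the 95% threshold is an illustrative choice):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaled = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA().fit(scaled)                              # all 30 components
cumvar = np.cumsum(pca.explained_variance_ratio_)

n = int(np.searchsorted(cumvar, 0.95)) + 1           # smallest n reaching 95%
print(n, cumvar[n - 1].round(3))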
@AnitaDevkar 3 years ago
Apply basic PCA on the iris dataset. • Describe the dataset. Should the dataset be standardized? • Describe the structure of correlations among variables. • Compute a PCA with the maximum number of components. • Compute the cumulative explained variance ratio. Determine the number of components k from your computed values. • Print the k principal component directions and the correlations of the k principal components with the original variables. Interpret the contribution of the original variables to the PCs. • Plot the samples projected onto the first k PCs. • Color samples by their species.
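A minimal sketch of those assignment steps (variable names are mine; treat it as a starting point, not a full solution):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)        # standardize: scales differ

pca = PCA()                                          # maximum number of components
scores = pca.fit_transform(X)
print(np.cumsum(pca.explained_variance_ratio_))      # pick k from this
print(pca.components_)                               # principal directions

# Samples projected onto the first two PCs, colored by species:
for i, species in enumerate(iris.target_names):
    mask = iris.target == i
    plt.scatter(scores[mask, 0], scores[mask, 1], label=species)
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.legend(); plt.show()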
@hokapokas 5 years ago
Nice video, Krish... one suggestion: if you can, explain the pros and cons of each technique/algo in future videos. Keep up the good work. Cheers!!!
@badalsingh3733 5 years ago
The same video is available on Udemy too, but this one is much better and easier to understand. Now would you please apply an algorithm to this data?
@ramyasrigorle2609 3 years ago
Sir, how do we know which features (column names) are selected with PCA?
@caynaferraz8080 2 years ago
Did you find the answer? I have the same question and couldn't find it :(
@rajus8943 2 years ago
@@caynaferraz8080 Any luck!?
@caynaferraz8080 2 years ago
@@rajus8943 Yeah, actually PCA generates new components, so none of the original columns are maintained after PCA, because PCA combines them.
@janetagboola8563 1 year ago
@@caynaferraz8080 OK, thank you.
@MegaFanFan21 5 years ago
Best video clarifying PCA in sklearn.
@brilliantknock7438 4 years ago
Thanks, I got the complete details. I am new to data science. We are converting the features/columns into two principal components, PC1 and PC2, for example. How can we use the PC1 and PC2 columns in our model? I am lost here. Earlier we knew the column names and had a rough idea about them; now we don't have column names, since we converted them into PC1 and PC2. How can we get from here to the model prediction? Help me understand this part.
@alpha-beginner 4 years ago
Firstly, it is up to you whether you need two or more dimensions. If you want to understand it mathematically, you can work through the matrices. We are not using PC1 and PC2 in our model as such; we just standardized our data to understand it better. It is also known as a general factor analysis, where regression determines a line of best fit.
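In practice the PCs can feed a downstream model directly. A hedged sketch using a pipeline, so the classifier consumes the components without anyone needing column names (the logistic regression here is just an illustrative choice):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)              # PCs are computed and consumed internally
print(model.score(X_test, y_test))       # accuracy on unseen data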
@kamilc9286 3 years ago
Shouldn't we validate how many PCs are needed?
@FKP-7 9 months ago
It is stated that 2 components were used for graphing purposes.
@praveenpandey4804 11 months ago
Really, sir, thanks for the knowledge. It helped me solve your assignment in the machine learning segment from PW Skills.
@Taranggpt6 4 years ago
After training the model on these principal components, we can evaluate it on the same 2-dimensional PCs... but for unseen datasets, do we need to convert that data to PCs first? Or how will it handle that?
@alpha-beginner 4 years ago
We just need to convert our data to between 0 and 1.
@gahmusdlatfi4205 3 years ago
Hi Naik, do we apply PCA only on the training dataset, or on the whole dataset (training + test)? Some literature advises applying PCA on training only, but in that case how do we predict on the test set with the transformed data? Waiting for your reply; thank you in advance.
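The usual answer, sketched under the assumption of a standard train/test split: fit the scaler and the PCA on the training data only, then reuse the fitted transforms on the test set (this also answers the unseen-data question above):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)               # learn mean/std from train only
pca = PCA(n_components=2).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))  # same axes, no re-fitting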
@ajaychhillar1033 6 years ago
Best video for PCA
@LAChinthaka 3 years ago
Very clear and taught to the point. Thanks a lot.
@ruturajjadhav8905 3 years ago
Why did he choose 2?
@LAChinthaka 3 years ago
@@ruturajjadhav8905 You mean the 2 principal components? What you should do is identify the best PCs by considering the cumulative proportion-of-variance values and take the minimum number of PCs. Here, he just chose 2 PCs.
@ruturajjadhav8905 3 years ago
@@LAChinthaka hmm!! Thank you.
@sivaramramkrishna5627 2 years ago
I feel so good seeing this... thanks, bro... you helped me out a little bit... make more videos of this type.
@sandipansarkar9211 4 years ago
I think this has been a repetition of the previous video. No issues. Thanks
@nukestrom5719 10 months ago
Well explained, with an easy-to-understand example. Thanks
@kasoziabudusalaamu50 1 year ago
So insightful. The concept is well understood.
@sherin7444 2 years ago
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()                    # keep all components
pc = pca.fit_transform(df)     # df: the scaled feature DataFrame
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Column')
plt.ylabel('EVR')
plt.show()
@sherin7444 2 years ago
This code gives a plot from which we can choose the number of components that retain 90% of the information. Those will be the PCs.
@ruchisaboo29 4 years ago
Very well explained... thanks. Also, if possible, please make a video on other dimensionality reduction techniques like SVD.
@joehansie6014 3 years ago
Great work... four thumbs up for you. Greetings from a master's student.
@sandipansarkar9211 4 years ago
Finished practicing the code in a Jupyter notebook. Cheers
@ytg6663 3 years ago
Cute
@tehreemqasim2204 2 years ago
Your video is very helpful. God bless you, brother.
@Data_In_real_world 1 year ago
Hello, can you please make a video on using PCA along with clustering, and then explain the PCA values obtained in the clusters?
@justinrobinm437 5 years ago
In general this video is fine, but PCA has a drawback: reducing the dimensions loses some data and variance.
@teacherHub6723 4 years ago
I have data of size (29400, 784) and I want to reduce the dimensionality. How do I decide the number of components? Please help.
@alpha-beginner 4 years ago
It is up to you how simple you want your model to be.
@RohitGupta-yl6gl 6 years ago
Thanks a lot for an awesome video. Can you please explain how to conclude that n_components = 2 is better than 4? Is there a way to find a suitable value?
@krishnaik06 6 years ago
Hi, the higher the number of n_components, the more variance is captured.
@veerasrikanth3556 5 years ago
If you give n_components=2, the top two components by variance are captured; the more components, the more variance. For example, the first principal component may have an explained variance ratio of about 0.7 and the second about 0.2. If you set n_components to the number of features itself, then summing all those ratios gives exactly 1. Note: for this same problem, if you give n_components = 3, the first two remain the same and a new component is added whose ratio is less than 0.2 (less than the 2nd one). Moreover, if you choose n_components = 2 you can plot a nice graph; if you choose more, you can't.
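A related trick: sklearn's PCA also accepts a float between 0 and 1, keeping however many components are needed to reach that fraction of variance. A sketch, assuming scaled_data is the standardized array from the video:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)             # keep enough PCs for 95% of the variance
x_pca = pca.fit_transform(scaled_data)
print(pca.n_components_, pca.explained_variance_ratio_.sum())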
@sevicore 2 years ago
@@veerasrikanth3556 Also, I have another question! Here we scale the whole dataframe, but if we decide to feed our prediction algorithm with the PCA-transformed data, wouldn't that cause data leakage? It is well known that doing the train/test split after scaling is a source of error.
@md.faysal2318 11 months ago
I have my own data with some questionnaire columns, so what should the column names be in the code? For instance, you put columns = cancer['feature_names']; what do I put there for my own data? All the column names one by one? df = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])
@p15rajan 2 years ago
Excellent... appreciate it... liked your video.
@nerdymath6 2 years ago
Can we get to know which dimensions have been reduced and which 2 are left? How do we draw inferences from the graph after applying PCA?
@b_113_debashissaha9 3 years ago
Excellent work for beginners.
@rahulgarg6363 3 years ago
Hi Krish, how do eigenvalues and eigenvectors play a role in capturing principal components?
@piyushsharma417 5 years ago
Nice, thanks for explaining.
@rickymacharm9867 4 years ago
Well explained. Thanks a million.
@ashdbms 5 years ago
Wonderful explanation!!
@krishnaik06 5 years ago
Thanks. Please subscribe and support the channel.
@himansu1182 3 years ago
I think there is a small mistake: MinMaxScaler was mentioned, but only StandardScaler is used here.
@ruchiraravirala1008 5 years ago
So, the two dimensions it reduces to, what are they? Which variables do they correspond to?
@krishnaik06 5 years ago
These are the newly created variables from dimensionality reduction.
@ruchiraravirala1008 5 years ago
Actually, I am pretty confused. So when we build a logistic regression or anything else later on with these two dimensions, what are we going to call them when we use them to predict something? Haha, I am pretty confused and appreciate your help!
@borntobefree8298 5 years ago
@@krishnaik06 How do we know which independent variable has the most effect on the target variable... what is PCA's output? It's very confusing... please reply!!
@krishnaik06 5 years ago
@@borntobefree8298 There is a term called the explained variance ratio. When we perform dimensionality reduction there is some loss of information; the explained variance ratio specifies how much is retained (and hence how much is lost). To see the values you can print model.explained_variance_ratio_
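To relate the PCs back to the original variables, the loadings can be inspected. A sketch, assuming the fitted two-component pca and the cancer dictionary from the video:

import pandas as pd

loadings = pd.DataFrame(pca.components_,
                        index=['PC1', 'PC2'],
                        columns=cancer['feature_names'])
# Features with the largest absolute weight on PC1:
print(loadings.T['PC1'].abs().sort_values(ascending=False).head())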
@yogitajain8003 3 years ago
ValueError: Found array with 0 sample(s) (shape=(0, 372)) while a minimum of 1 is required by StandardScaler. But there is no missing value
@nayanparnami8554 3 years ago
Sir, how do we figure out the number of principal components to which we want to reduce the original dimensions?
@techsavy5669 3 years ago
At 10:28, when we do plt.scatter(x_pca[:,0], ...), shouldn't the second parameter be the target output column? Why are we plotting it against x_pca[:,1]?
@pavankumarjammala9262 11 months ago
It's the second column (index 1) of the PCA output: PC1 goes on the x-axis and PC2 on the y-axis, while the target is used only for coloring.
@joeljoseph26 9 months ago
More dimensions lead to overfitting, right?
@RagHava_world 3 years ago
Firstly, thanks for explaining the PCA technique very clearly. Suppose we do not know the features of some higher-dimensional data: is there any way to find the features and the target within the data? Is that possible by any chance? I am working with hyperspectral raw data.
@asishswain1259 3 years ago
It's just a guess; from the correlation or covariance matrix you can know! Anyway, please elaborate the question a little bit!
@gowthami712 3 years ago
Has your doubt been cleared? Were you able to apply PCA on the hyperspectral image!?
@RagHava_world 3 years ago
@@gowthami712 Hi Goutham, I spoke with a person who is working on the same topic for a PhD, and he helped me with some documents. It is clear now and I got results as well.
@gowthami712 3 years ago
@@RagHava_world I am also working on the same HSI dataset; can you help me out!?
@RagHava_world 3 years ago
@@gowthami712 Sure, why not. You can reach me on FB or Insta under the same name, or you can also write me an email: raghavendrapoloju@gmail.com
@pavankumarjammala9262 11 months ago
Actually!! I applied PCA on digit-recognition data and took n_components = 2, but the visualization shows multiple colors after executing. Can anyone say what the solution for that would be?
@bommubhavana8794 2 years ago
Hello, I have newly started working on a PCR project. I am stuck at a point and could really use some help... asap. Thanks a lot in advance. I am working in Python.

We created a PCA instance using PCA(0.85) and transformed the input data, then ran a regression on the principal components explaining 85 percent of the variance (say N components). Now we have a regression equation in terms of the N PCs, and we have tried to express it in terms of the original variables.

To QC the coefficients in terms of the original variables, we took the N components (85% variance), derived the data back from them, and applied regression to that data, hoping this would give the same coefficients and intercept as the equation derived above. The issue is that the coefficients do not match when we take N components, but when we take all the components the coefficients and intercept match exactly. Also, the R-squared value and the predictions from the two equations are exactly the same even though the coefficients don't match.

I am so confused right now as to why this is happening. I might be missing something about the concept of PCA. Any help is greatly appreciated. Thank you!
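A sketch of why that happens, on a stand-in dataset (load_diabetes is only a placeholder for the commenter's data). Mapping the PC-space coefficients back through pca.components_ reproduces the predictions exactly for any number of components, while re-fitting a regression on data reconstructed from N < p components leaves the coefficients non-unique because the reconstruction is perfectly multicollinear:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(0.85)                          # components explaining 85% of variance
Z = pca.fit_transform(X_std)             # Z = X_std @ components_.T (mean is ~0)
reg = LinearRegression().fit(Z, y)

# y_hat = Z @ b + c = X_std @ (components_.T @ b) + c, for any number of PCs:
beta_orig = pca.components_.T @ reg.coef_
pred_orig = X_std @ beta_orig + reg.intercept_
print(np.allclose(reg.predict(Z), pred_orig))   # True

# Regressing on the reconstruction Z @ components_ instead gives the same
# predictions and R-squared, but the coefficients are not unique: the
# reconstructed columns are perfectly collinear when N < number of features.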
@bhavsarswapnil90 5 years ago
One question... after converting to one-hot encoding, my columns become around 2000. So how can I decide how many dimensions I should keep using PCA?
@alpha-beginner 4 years ago
It is up to you how simple you want your model to be; more dimensions become difficult to understand.
@harshagarwal7711 4 years ago
You can see how much variance they explain; you can read up on pca.explained_variance_ratio_. It gives you cumulative variance vs. the number of features selected.
@yogeshrunthla9350 4 years ago
Loved your explanation, sir.
@phiriavulundiah9249 2 years ago
A very insightful video.
@parijatchatterjee5086 5 years ago
Can you also do a Python treatment of PCA explaining the math behind it and then coding it using NumPy? Also the problem with having data in higher dimensions, and how it may lead to faulty analysis due to the abnormalities that higher dimensions bring.
@AmirAli-id9rq 3 years ago
Might be late to suggest, but do check out the channel "StatQuest" for understanding the workings and logic behind PCA.
@divyanshvishwakarma3180 6 months ago
Thanks for the concise explanation.
@megharaghu372 5 years ago
@Krish Naik For housing price data, if we have to predict house prices, then after using PCA how do I get to know which independent variable affects the target variable (price) the most? I am unable to interpret the PCA output. Kindly explain with an example (kindly reply ASAP). Thank you.
@SanjaySingh-qf4tk 3 years ago
Did you get the answer?
@anupambiswas2588 4 years ago
Can we find which features got mapped to which PC when we reduced the features to 2 PCs in this example?
@rachitsingh4913 3 years ago
How do we know how much information we lost by decreasing the dimensionality, and how many components are best?
@GregorianWater 6 months ago
You don't lose any data points; what you lose is the variance not captured by the components you keep.
@negusuworkugebrmichael3856 7 months ago
Excellent, very helpful.
@ML_Engineerr 5 months ago
There is another higher-quality video on the same channel.
@dhy9361 3 years ago
Thank you for solving my question!
@sohamkurtadikar9578 4 years ago
Can we use PCA when there is no target variable in the data???
@brownwolf05 3 years ago
Yes, PCA can be used anywhere we want to reduce the dimensionality of the given data.
@katienefoasoro1132 2 years ago
Line 2: is it a library that helps you import the dataset?
@dheerajkumar9857 3 years ago
Very neat explanation.
@bora_yazilim 2 years ago
Hi, in this scenario we had 2 outputs; what happens when the number of outcomes increases? In my case, I have 4 outputs.
@pavankumarjammala9262 11 months ago
Then set a value of 4 for n_components.
@andriruslam5089 3 years ago
Nice, keep doing good things, my brother.
@devashishrathod3462 2 years ago
How can we find the variance explained by the 2 components that the data is reduced to?
@AmirAli-id9rq 3 years ago
Great video. At the end of the video you said 'loss of data'; I guess it's not 100 percent correct to phrase it that way. The essence, or rather the information, of the data is not lost; rather, it is condensed into two dimensions.
@someshkumar1578 3 years ago
Brother, if the features are already independent, why apply PCA at all?
@madannikalje760 4 years ago
Does PCA work on data that contains both categorical variables and numerical data?
@jottilohano2232 3 years ago
When I use pca.fit(scaled_data), it gives this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
@haji3787 3 years ago
Check if your data has NaN values with df.isna().sum(); if it returns non-zero counts, you have NaN values.
@jottilohano2232 3 years ago
@@haji3787 Thanks
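A small sketch of that check plus a cleanup step before scaling (df stands in for your feature DataFrame; imputing with the column mean is just one option):

import numpy as np

print(df.isna().sum())                          # count NaNs per column
df = df.replace([np.inf, -np.inf], np.nan)      # the error also mentions infinity
df = df.fillna(df.mean(numeric_only=True))      # or df.dropna() if rows can be spared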
@devinshah234 1 year ago
Can you please upload the dataset?
@almasrsg 4 years ago
Very well explained!
@teetanrobotics5363 4 years ago
Can you make a video on LSTM, RBM, VAE, and GAN code?
@bhaskersaiteja9531 3 years ago
How did you come to know that 'data' and 'feature_names' need to be considered for creating a dataframe from the file? Could you please explain?
@Renan2792 2 years ago
When he imported the dataset, it came as a dictionary-like object. The 'data' key contains the values without the index and column names; 'feature_names' contains the column names.
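Concretely, for the loader used in the video:

import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.keys())     # includes 'data', 'target', 'feature_names', 'DESCR', ...

df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
print(df.head())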
@MartinHroch 3 years ago
Exactly half of the video was intro to data loading and explanation.... Where is the PCA???
@Manoj-Kumar-R 3 years ago
Insightful video... can we have a PCA vs. LDA comparison video? Much appreciated work!
@surajshah5630 3 years ago
Great effort. Thank you!
@chaitanyatuckley4666 3 years ago
Thanks a lot, Krish.
@deepcontractor6968 4 years ago
What is PCR (principal component regression), and how do we apply it here?
@dipadityadas8551 5 years ago
Well explained, thank you.
@dhanashripatil3784 3 years ago
Sir, please explain MLR with PCA.
@preeethan 4 years ago
Can PCA be used for linear regression?
@09Aditya 5 years ago
On what basis have you selected 2 PCs? You have not explained how much variance they capture. Take time to make your videos; don't just make them with half effort.
@piyushsharma417 5 years ago
Thanks for mentioning it; can you elaborate on what he missed?
@adityasharma2667 5 years ago
My question is the same.
@kmahim82 5 years ago
Factors are extracted based on the eigenvalues... and that also explains the variance of the factors... but he is directly restricting the factors to 2... which is not ideal.
@thatliftingdude 4 years ago
How do you choose n_components in PCA?
@illumiseo123 4 years ago
Great tutorial, thank you!
@historyofislam7509 1 year ago
Nice video.
@fazilansari5471 4 years ago
If the number of observations < the number of variables/features, can we apply PCA in that case?
@shrikantdeshmukh7951 3 years ago
Actually, that's why the PCA technique was developed.
@sandipansarkar9211 4 years ago
Great. Need to get my hands dirty with a Jupyter notebook. Thanks
@asishswain1259 3 years ago
You can do it in MATLAB and R also!
@jadhavashatai8845 1 year ago
So nice
@MuradAlQurishee 6 years ago
Nice explanation.
@brahmaiahchowdarym1913 5 years ago
Nice explanation.
@ragurajan7567 5 years ago
Should we do PCA on the test data as well?
@krishnaik06 5 years ago
Yes
@ragurajan7567 5 years ago
@@krishnaik06 Thanks
@RohithPrints 9 months ago
Can you share a dataset with me?
@debatradas1597 2 years ago
Thanks
@rajeshchatterjee3272 5 years ago
How can I put a legend label in this code?
@asishswain1259 3 years ago
plt.legend()
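For the legend to show class names, each class needs its own scatter call with a label. A sketch assuming x_pca and cancer['target'] from the video (0 = malignant, 1 = benign in this dataset):

import matplotlib.pyplot as plt

for value, label in [(0, 'malignant'), (1, 'benign')]:
    mask = cancer['target'] == value
    plt.scatter(x_pca[mask, 0], x_pca[mask, 1], label=label)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()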
@joshuamcguire4832 4 years ago
Thanks, man.
@mandeepkaur5919 5 years ago
Where have you taken the dataset from?
@piyushsharma417 5 years ago
It's built into sklearn; you can directly import it.