StatQuest: PCA - Practical Tips

176,404 views

StatQuest with Josh Starmer

Days ago

Comments: 178
@statquest 2 years ago
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@buihung3704 1 year ago
This is a gold mine for data scientists, data engineers, and ML/DL engineers. I can hardly think of anyone else who could teach the same concepts more clearly.
@statquest 1 year ago
Thank you very much! :)
@geethanjalikannan5527 5 years ago
Dear Josh, I had so many issues with stats, as I am from a totally different background. Watching your videos helped me overcome my insecurities. Thank you so much.
@statquest 5 years ago
Hooray! I'm glad the videos are helpful.
@Jason-xe4tt 6 years ago
All professors in the world need to learn how to teach from you! Thanks!
@statquest 6 years ago
You're welcome!!! :)
@caperucito5 5 years ago
Josh's videos are so cool that I usually like them before watching.
@statquest 5 years ago
That's awesome! :)
@alecvan7143 4 years ago
I concur :P
@statquest 4 years ago
@@alecvan7143 :)
@stevenmugishamizero8471 1 year ago
The best on this platform, hands down 🙌
@statquest 1 year ago
Thank you!
@shwetankagrawal4253 5 years ago
Your intro music always makes me smile 😂😂
@statquest 5 years ago
Thanks! :)
@iloveno3 6 years ago
The intro with you singing is so cute, made me smile...
@bendiknyheim6936 3 years ago
Thank you for all the amazing videos. I would be having a really hard time without them.
@statquest 3 years ago
Glad you like them!
@jesusfranciscoquevedoosegu4933 6 years ago
Thank you so much for, basically, all your videos on PCA
@statquest 6 years ago
You're welcome!!! I'm glad that you like them. :)
@yurobert3007 1 year ago
This PCA series (step-by-step, practical tips, then R) is brilliant! I found it very helpful. Thank you for these great videos! Would you consider doing a series on factor analysis?
@statquest 1 year ago
Thanks! One day I hope to do a series on factor analysis.
@AakashOnKeys 1 year ago
Thanks for the heads-up! Very helpful!
@statquest 1 year ago
Happy to help!
@ylazerson 6 years ago
Fantastic video once again!
@statquest 6 years ago
Hooray!!!! I'm glad you're enjoying them. :)
@boultifnidhal2600 2 years ago
Thank you so much for switching to math and reading scores, because the genes-and-cells examples were giving me headaches. Nevertheless, thank you so much for your efforts ♥♥
@statquest 2 years ago
Noted!
@johnfinn9495 1 year ago
Very nice videos. Have you considered a segment on kernel PCA?
@statquest 1 year ago
I'll keep that in mind.
@samirsivan8134 6 months ago
I love StatQuest ❤
@statquest 6 months ago
Thank you! :)
@etornamtsyawo6407 1 year ago
Let me like the video before I even start watching.
@statquest 1 year ago
BAM! :)
@urd4651 3 years ago
Well explained!!!!! Thank you very much!
@statquest 3 years ago
Glad you liked it!
@arifahafdila5531 3 years ago
Thank you so much for the videos 👍
@statquest 3 years ago
Glad you like them!
@mjifri2000 5 years ago
Man, you are the best.
@statquest 5 years ago
Thanks! :)
@joyousmomentscollection 5 years ago
Thanks Josh... If your data contains one-hot encoded data (transformed from categorical data) and discrete data along with continuous data types, what kind of scaling would be preferred before applying PCA?
@statquest 5 years ago
It may be better to use lasso or elastic-net regularization to select the most important variables than to use PCA. Regularization can remove variables that are not useful for making predictions. If you're interested in this subject, I have several videos on it. Just look for "regularization" on my video index page: statquest.org/video-index/
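A minimal R sketch of the lasso/elastic-net idea with the glmnet package, on hypothetical data (alpha = 1 gives the lasso; values between 0 and 1 give the elastic-net):

library(glmnet)

set.seed(42)
x <- matrix(rnorm(100 * 10), nrow = 100)  # 100 samples, 10 predictors
y <- x[, 1] - 2 * x[, 3] + rnorm(100)     # only predictors 1 and 3 matter

cv_fit <- cv.glmnet(x, y, alpha = 1)      # cross-validated lasso
coef(cv_fit, s = "lambda.min")            # coefficients of useless predictors shrink to exactly 0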
@marianaferreiracruz5398 1 year ago
love the music
@statquest 1 year ago
Thanks!
@samggfr 1 year ago
Hi Josh. Thanks for your videos, especially when you dive into details and tips. In tip #2, concerning centering, you show 2 sets of 3 points and you present centering on the mean. Let's imagine an experiment with 3 patients on drug A and 3 patients on drugs A and B. Let's say the lower/left set is the reference, drug A, and the upper/right set is the test, drug A+B. What about centering on A (so set A sits at the origin)? This centering should show the total effect of adding drug B to drug A, whereas mean centering shows half the effect. In the same vein, the variables plot should show the variables that change from the drug A set to the drug A+B set instead of showing variables that change from the mean experiment, i.e., ((drugA + drugAB)/2). What's your view?
@statquest 1 year ago
Centering using all of the data does not change the relationship between the two groups of points - they are still the same distance apart from each other, and the eigenvalue will reflect this and give you a sense of how different A is from A+B.
@samggfr 1 year ago
@@statquest Thanks for your reply concerning the distance, which I might interpret as the effect size. Could you tell me your view concerning the plot of variables?
@statquest 1 year ago
@@samggfr I'm not 100% certain I understand your question about the variables plot, but the loadings for the variables on PC1 will tell you which variables have the largest influence in causing variation in that direction.
@shubhamgupta6567 4 years ago
Can you make a video on partial least squares regression, please?
@paulotarso4483 3 years ago
Hey Josh, thanks so much for your videos... 3 quick questions: 1. 7:54 says "if there are fewer samples than variables, the number of samples puts an upper bound on the number of PCs with eigenvalues greater than 0", but in the example there, the number of samples is equal to the number of variables, not less. Should the statement be "if # of samples <= # of variables"?
@statquest 3 years ago
1) What matters is that there is an upper bound and it depends on the number of variables and the number of samples, and that means we can actually write it both ways: "if # of samples <= # of variables" or "if # of variables >= # of samples".
@ptflecha 3 years ago
Thanks so much!!
@paulohmarco 4 years ago
Hi professor Josh Starmer, thanks a lot for your videos. This is a very joyful way to teach these methods! Please, let me ask you: I am giving an online lecture about PCA in Brazil, in Portuguese, and I would like to ask your permission to use some of your examples to teach PCA. Of course, I will credit your StatQuest channel. Thanks in advance!
@statquest 4 years ago
Feel free to use the examples and cite the video.
@vigneshvicky6720 3 years ago
You're using data points to get the PCs, but in general we use the covariance matrix to get the PCs - why?
@statquest 3 years ago
The old way to do PCA was to use a covariance matrix. However, no one does that anymore. Instead, we apply Singular Value Decomposition directly to the data.
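A minimal R sketch (hypothetical data) of what that means in practice: applying svd() to the centered data reproduces prcomp()'s output.

set.seed(1)
X <- matrix(rnorm(10 * 3), nrow = 10)         # 10 samples, 3 variables
Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column

s <- svd(Xc)
s$v                      # columns = PC directions (loadings)
s$d / sqrt(nrow(X) - 1)  # standard deviations of the PCs

pca <- prcomp(X)         # prcomp() centers by default
pca$rotation             # matches s$v (up to sign flips)
pca$sdev                 # matches the rescaled singular values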
@vigneshvicky6720 3 years ago
@@statquest Thank you, love from India 💖
@urjaswitayadav3188 6 years ago
Thanks for the video, Joshua! Would you please consider doing a video on the hypergeometric distribution and the hypergeometric test? I have seen that it is often used to check the significance of overlaps between lists generated by high-throughput analyses, but I am always confused about how to set it up when I have to do one myself. Thanks a lot!
@urjaswitayadav3188 6 years ago
Yes! That's exactly what I wanted. Thanks a lot :)
@jo91218 6 years ago
Great video! Quick question: would converting raw scores into z-scores both center and scale my data? Thanks!
@sane7263 1 year ago
Great video, Josh! I am wondering about 7:32, "Find the line perpendicular to PC1 that fits best" - what does this mean? I mean, you can have either a perpendicular line or a best-fit line.
@statquest 1 year ago
When you have more than 2 dimensions, the first perpendicular line can rotate around PC1 and still be perpendicular. Thus, any line in that plane will be perpendicular. For more details, see: kzbin.info/www/bejne/fJjEnI2ta7Bkh7M
@sane7263 1 year ago
@@statquest Thanks for the lightning-fast reply, Josh! I have already seen that video, and after watching this one I had the same question. If PC2 (a line) passes through PC1 (another line) perpendicularly, i.e., at 90 degrees, how can it rotate and still maintain that angle?
@statquest 1 year ago
@@sane7263 If we have 3 dimensions, PC1 can go anywhere. PC2, however, can go anywhere in the plane that is perpendicular to PC1, and PC3 has no choice but to be perpendicular to both PC1 and PC2. I try to illustrate this here: kzbin.info/www/bejne/fJjEnI2ta7Bkh7M
@sane7263 1 year ago
@@statquest Ahh! I see! So if we have a 2D plane, PC1 can go anywhere, but in that case PC2 will have no choice but to be perpendicular. Right? I think I got it now.
@statquest 1 year ago
@@sane7263 That's right. When we only have 2 dimensions, the first line can go anywhere, but once that is determined, the second line has no choice. When we have 3 dimensions, things are a little more interesting for the second line.
@namithacherian1743 2 years ago
Doubt: when there are only 2 points, you mentioned that you can fit only one line through them (correct). However, there is no guarantee that it will pass through the origin. In other words, when there are only 2 points, you can always draw a line that goes through the origin and one of the data points, but having it pass through both data points is a matter of chance. Right?
@statquest 2 years ago
What time point, minutes and seconds, are you asking about?
@Asia25Asia 3 years ago
Hi Josh! Thank you for your videos. Could you please give some hints on what to do with NA (not obtained) values in PCA? How should we deal with them? Additionally, what is better to use as input to PCA: raw data (abundance) or relative abundance (percentage)?
@statquest 3 years ago
You can try to impute the missing values. And depending on what you want to show, it can be better to use raw data or some sort of transformed version.
@Asia25Asia 3 years ago
@@statquest Thanks a lot for the quick response. Can you recommend an easy and friendly function for imputing biological data?
@statquest 3 years ago
@@Asia25Asia Not off the top of my head.
@Asia25Asia 3 years ago
@@statquest OK, no problem :)
@mojojojo890 3 years ago
If you could explain why the first PC is an eigenvector, that would be nice. I know that eigenvectors are the vectors whose span doesn't change after a transformation, even if they are scaled... so what exactly is the transformation applied here?
@statquest 3 years ago
I explain eigenvectors in this video: kzbin.info/www/bejne/fJjEnI2ta7Bkh7M
@mojojojo890 3 years ago
@@statquest I watched that video, but it does not explain why the first PC is an eigenvector.
@statquest 3 years ago
@@mojojojo890 Ah, I see. First, that video focuses on Singular Value Decomposition, which is the modern way to do PCA and doesn't actually involve calculating eigenvectors. However, the old method does, by applying eigen decomposition to the variance-covariance matrix of the raw data. And in the old method, PC1 was an eigenvector of the variance-covariance matrix. In other words, if the variance-covariance matrix is V, then V x PC1 = eigenvalue * PC1, which makes PC1 an eigenvector.
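A minimal R sketch (hypothetical data) of that identity:

set.seed(1)
X <- matrix(rnorm(10 * 3), nrow = 10)
V <- cov(X)                # variance-covariance matrix of the data
e <- eigen(V)
pc1 <- e$vectors[, 1]

V %*% pc1                  # the same as...
e$values[1] * pc1          # ...the eigenvalue times PC1

prcomp(X)$rotation[, 1]    # PC1 from the SVD route: same direction (sign may flip)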
@mojojojo890 3 years ago
@@statquest That sends me somewhere to find my answer... Thanks a lot!!
@AyatUllah-zr6ij 7 months ago
Good ❤
@statquest 7 months ago
Bam! :)
@k_a_shah 1 year ago
Which application or software is used to plot these graphs?
@statquest 1 year ago
I give all my secrets away in this video: kzbin.info/www/bejne/maOviX19Yqp0ns0
@MyKornflake 4 years ago
Great explanation. I made a PCA plot with 96 genes from 6 different samples using SPSS, but I am having a hard time interpreting what PC1 and PC2 represent. Could you please give me some ideas on this? Thanks in advance.
@statquest 4 years ago
Look at the magnitude of the loading scores for PC1 and PC2.
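A minimal sketch of that check with prcomp() on hypothetical data (the plot above was made in SPSS, so the function names here are assumptions about an equivalent R workflow):

set.seed(2)
genes <- matrix(rnorm(6 * 96), nrow = 6)               # 6 samples, 96 genes
colnames(genes) <- paste0("gene", 1:96)
pca <- prcomp(genes, scale. = TRUE)
head(sort(abs(pca$rotation[, 1]), decreasing = TRUE))  # genes with the biggest pull on PC1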
@Patrick881199 4 years ago
Hi Josh, I am a little confused about 2:37, where you mentioned using the standard deviation. If we have math scores (0-100) with a standard deviation of 5 and, at the same time, reading scores (0-10) that also have an sd of 5, then after dividing by the sd, math and reading are still NOT on the same scale.
@statquest 4 years ago
Regardless of the original scale, if you divide each value in a set of measurements by the standard deviation of that set, the standard deviation of the new values will be 1. And that puts all variables on the same scale.
@statquest 4 years ago
For more details, see: stats.idre.ucla.edu/stata/faq/how-do-i-standardize-variables-in-stata/
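A minimal numeric sketch of that point (hypothetical scores):

set.seed(3)
math    <- rnorm(20, mean = 60, sd = 5)    # measured on a 0-100 scale
reading <- rnorm(20, mean = 6,  sd = 0.5)  # measured on a 0-10 scale

sd(math / sd(math))        # 1
sd(reading / sd(reading))  # 1 - both variables now have sd = 1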
@Patrick881199 4 years ago
@@statquest Thanks, Josh
@rezkyilhamsaputra8472 1 year ago
Do these tips also apply to principal component regression (PCR)?
@statquest 1 year ago
They apply to any time you want to use PCA, so yes, they would also apply to PCR.
@rezkyilhamsaputra8472 1 year ago
@@statquest And if the software gives us the option of whether or not to center and/or scale the data, is there a condition where we shouldn't center/scale the data, or must we always do it?
@statquest 1 year ago
@@rezkyilhamsaputra8472 I can't think of a reason you wouldn't want to center your data. Scaling depends on the data itself. If it's already on the same scale, you might not want to do it.
@rezkyilhamsaputra8472 1 year ago
@@statquest Alright, thank you so much for the crystal-clear explanation!
@kartikmalladi1918 1 year ago
What is the need for PCA if you can use average scores as the contribution?
@statquest 1 year ago
My main PCA video gives a reason to use it: kzbin.info/www/bejne/fJjEnI2ta7Bkh7M
@kartikmalladi1918 1 year ago
@@statquest I mean, I've gone through your videos - great work, by the way. The main goal of PCA is to understand the contribution of each variable to a sample. However, finding the average of each variable and its percentage contribution still gives some idea. So how is this average contribution different from PCA loading scores?
@statquest 1 year ago
@@kartikmalladi1918 What time point in the video, minutes and seconds are you asking about?
@kartikmalladi1918 1 year ago
@@statquest It's from the main PCA video, discussing loading score contributions.
@statquest 1 year ago
@@kartikmalladi1918 What time point?
@Amf313 4 years ago
How should we scale variables that don't have clear upper or lower bounds? For example, if our 2 variables are human height and weight... is it rational to scale them based on the maximum height and weight existing in the whole sample? What if we have just one person weighing above 100 kg, and his weight is 160 kg? If we drop only this sample from the data, the scale and the whole set of PCs will differ significantly. So is it rational to base a variable's scale on the max and min of the values existing in the samples (for such variables without intrinsic upper and lower bounds)? 🤔
@statquest 4 years ago
You scale the data based on the data itself, not theoretical bounds.
@lucaliberato 3 years ago
Hello Josh, I have a question. You say that, to find a 3rd PC, we should find a line perpendicular to PC1 and PC2, and that it's not possible. But in the first video you say we can find a PC3 that goes through the origin and is perpendicular to PC1 and PC2. I missed something in the video for sure; can you help me please?
@statquest 3 years ago
In the first video, we have enough data points on the graph that we can meaningfully create 3 axes. However, in this example, we don't have enough data to do that. The point being made in this video is that the maximum number of PCs can be limited by the number of data points. So, even if you have 3-D data, if you only have 2 points, then you will only have 1 PC, because 2 points only define a specific line. We need 3 points to define a specific plane (for 2 PCs), and we'd need 4 points to define 3 PCs, etc.
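A minimal R sketch (hypothetical points) of how the number of points caps the number of PCs:

two_points <- rbind(c(1, 2, 3),
                    c(4, 6, 5))    # 2 points in 3-D define a line
prcomp(two_points)$sdev            # only the first value is > 0: 1 PC

three_points <- rbind(two_points,
                      c(2, 9, 7))  # 3 points define a plane
prcomp(three_points)$sdev          # only the first two values are > 0: 2 PCs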
@lucaliberato 3 years ago
@@statquest Thank you so much Josh, you're super 😎
@misseghe3239 4 years ago
Can PCA be used for regression problems, or only classification problems? Thanks.
@statquest 4 years ago
There are actually several types of regression that use PCA. PCA reduces the number of variables in your model.
@raisaoliveira7 1 year ago
@statquest 1 year ago
You're welcome again! :)
@basharabdulrazeq4349 5 months ago
Hello Josh. At 7:57, you explained that if there are fewer samples than variables, then the number of samples puts an upper bound on the number of PCs. In the last example, there are 3 samples and 3 variables (therefore the number of samples isn't fewer than the number of variables), and the number of PCs should be 3 (not 2). Could you explain why you decided that the number of PCs should be 2?! (BTW, I watched all of your videos about PCA, but I don't understand this specific example.)
@statquest 5 months ago
The answer to your question starts at 5:09. The key is that we don't include PCs that have an eigenvalue = 0. If there's no variation in a direction, then there is no need for an axis in that direction. Thus, 3 data points can only define a 2-dimension plane, and thus PC3 will have an eigenvalue = 0 and thus, we can exclude PC3.
@basharabdulrazeq4349 5 months ago
@@statquest I agree, but I can still see variation in a third direction. I just can't comprehend the idea that there isn't any, because there are three variables for three students and all of them change with each other. I'd be so grateful if you could prove it, or give me some source with a proof that there shouldn't be a PC3, because I really need to comprehend the idea.
@statquest 5 months ago
@@basharabdulrazeq4349 For each student, we have 3 values for the 3 variables that represent a single point in the 3-dimensional space. Thus, we have 3 points in the 3-dimensional space, one per student. 3 points define a plane, which is only a 2-dimensional space. Thus, only 2 PCs can possibly have eigenvalues > 0.
@basharabdulrazeq4349 5 months ago
@@statquest Thanks a lot. I understand now that no matter how you arrange any three points, they will always lie on the same plane.
@lingxinhe4627 4 years ago
Hi Josh, thank you for the amazing videos; the content on this channel on stats is so much better than everything else I've found online. I have a quest(ion): once you get PC1 and PC2 as the main components that explain variation, how can we get back to the variables that compose them? Thank you!
@statquest 4 years ago
I show how to do this exact thing in my PCA in R kzbin.info/www/bejne/ZnvTZZqpm7R_g9U and PCA in Python kzbin.info/www/bejne/gqTYlmWderJsepI videos.
@alexisvivoli8963 4 years ago
Hi Josh! Thanks a lot for your videos. I can visualize very well how it works for 3 variables thanks to your animation, but I'm struggling to understand what happens if you add more variables: how do you project/center and calculate the PCs, since you can't have more than 3 dimensions? So how does it work if you have, let's say, 4 or even more, like 100 variables? Thanks!
@statquest 4 years ago
If I have one variable, called var1, then I can center it by calculating the mean for var1 and subtracting that from each value. If I have two variables, var1 and var2, I can center the data by calculating the mean for var1 and subtracting that from all of the var1 values, and calculating the mean for var2 and subtracting that from all of the var2 values. If I have 3 variables, var1, var2 and var3, I do the same thing for each of the three variables, and if I have 4 variables, I do the same thing for each of the four. In general, if I have N variables, var1, var2, var3 ... varN, then I can center the data by calculating the mean for var_i and subtracting that from all of the var_i values, where i is a value from 1 to N. Does that make sense?
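In R, the same recipe works for any number of variables as one line of column-wise centering (hypothetical data):

set.seed(7)
X  <- matrix(rnorm(5 * 100, mean = 10), nrow = 5)  # 5 samples, 100 variables
Xc <- sweep(X, 2, colMeans(X))                     # subtract each column's mean
range(colMeans(Xc))                                # every column mean is now ~0
# scale(X, center = TRUE, scale = FALSE) does the same thing in one call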
@jasperli7794 2 years ago
@@statquest Thanks, I understand this idea of centering the data for all variables. But then how do you draw the principal components for all the variables beyond 3? After you draw principal component 1 through the origin (so that it best fits the data, using SVD etc.), place principal component 2 through the origin perpendicular to it, and place principal component 3 perpendicular to both 1 and 2, how do you continue placing principal components perpendicular to the first 3? Is there an explanation for further principal components that does not rely on the restrictions of the physical 3D world? Thank you very much!
@statquest 2 years ago
@@jasperli7794 It's just relatively abstract math, which isn't limited to 3 dimensions. However, the concepts are the same, regardless of the number of dimensions.
@jasperli7794 2 years ago
@@statquest Okay, so if I understand correctly, the principal components capture various axes which are related to each other by position, and which explain (decreasing amounts of) variance within the data, along with the relative contributions of each feature/variable to each principal component. Thanks!
@statquest 2 years ago
@@jasperli7794 Yep!
@reytns1 6 years ago
I have a question: could I enter percentage values in order to obtain a PCA?
@danielasanabria3242 4 years ago
Did you already make a video about Partial Least Squares?
@statquest 4 years ago
Not yet.
@addisonmcghee9190 4 years ago
So Josh, would the upper bound for principal components be minimum{# of variables, (# of samples - 1)}?
@statquest 4 years ago
I answer this question at 3:30
@addisonmcghee9190 4 years ago
@@statquest OK, so if we had 2 students and 5 variables, wouldn't we only have 1 principal component? These are two points in a 5-dimensional space, so it would be a line, right? So, (# of samples - 1)? I'm just trying to find a pattern...
@statquest 4 years ago
@@addisonmcghee9190 That is correct.
@addisonmcghee9190 2 years ago
@@statquest Revisiting this comment because I'm learning about PCA in grad school... old StatQuest for the win!
@mrweisu 4 years ago
At 6:19, even though the two points are on a line, does the line necessarily go through (0,0)? If not, there can still be two PCs. Can you help clarify? Thanks.
@statquest 4 years ago
PC1 always goes through the origin. That's why we center the data to begin with.
@mrweisu 4 years ago
@@statquest Yes, but the line connecting the two points might not.
@statquest 4 years ago
If the data are centered, then the line connecting the two points will go through the origin. If they are not centered, then, technically, you are correct, we will have 2 PCs - but neither PC will reflect the relationships in the data as well as the PC derived from the centered data.
@mrweisu 4 years ago
@@statquest Does centering the data make the connecting line go through (0,0)?
@statquest 4 years ago
@@mrweisu Yes
@reytns1 6 years ago
Another question regarding PLS: as I understand it, PLS is a regression on top of PCA - is that right? And can PLS regress on only one variable (I mean the Y variable)? Hmm, another question: if you have a lot of traits in a PCA, is there a statistic that shows me the most important and second most important traits for that PCA? I mean, not the eigenvalue? Thanks
@statquest 6 years ago
Partial Least Squares (PLS) and Principal Component Regression (PCR) are both ways to combine regression with PCA, as a way to avoid overfitting the model (if there are more variables than samples, you'll overfit your model and your future predictions will be bad). PCR does PCA on just the variables (the measurements used to predict something). As a result, it focuses on the variables responsible for most of the variation in the data. In contrast, PLS does PCA on both the variables and the thing you want to predict. This makes PLS focus on variation in the variables as well as on variables that correlate with the thing you want to predict. As for statistics on which variable is the most important for PCA (other than just looking at the loading scores), you could probably use bootstrapping, but, at least at this time, I don't have a lot of experience with this.
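A minimal sketch of both approaches using R's pls package on hypothetical data (pcr() builds components from the predictors alone, while plsr() also uses the response):

library(pls)

set.seed(11)
X <- matrix(rnorm(20 * 50), nrow = 20)  # more variables (50) than samples (20)
y <- X[, 1] + rnorm(20)
d <- data.frame(y = y, X = I(X))

pcr_fit <- pcr(y ~ X, ncomp = 5, data = d, validation = "CV")   # PCA on X only
pls_fit <- plsr(y ~ X, ncomp = 5, data = d, validation = "CV")  # components that also track y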
@haydergfg6702 1 year ago
Which program are you using?
@statquest 1 year ago
I'm not sure I understand your question. Are you asking how I created the video? If so, see: kzbin.info/www/bejne/maOviX19Yqp0ns0
@DabeerRoy 2 months ago
Hi, I need somewhere to practice this. Please, can anyone help me?
@statquest 2 months ago
In R: kzbin.info/www/bejne/ZnvTZZqpm7R_g9U and in Python: kzbin.info/www/bejne/gqTYlmWderJsepI
@doubletoned5772 5 years ago
I have a trivial question about 1:39. If the recipe to make PC1 uses approximately 10 parts Math and only 1 part Reading, why does that mean that Math is '10' times more important than Reading for explaining the variation in the data? I mean, I understand that it will be more important, but is that specific number (10) correct?
@statquest 5 years ago
I think my wording may have been sloppy here.
@wolfisraging 6 years ago
Thank you sooooooooo much, that's damn awesome.
@giosang1111 4 years ago
Hi, is it always true that to scale the values of the variables we divide the values by their SDs?
@statquest 4 years ago
For PCA, yes.
@giosang1111 4 years ago
Thanks! Can you make a video that summarizes which statistical methods are used in which cases? There are so many methods out there, and I am really confused about which to use and when. Thanks a lot.
@statquest 4 years ago
@@giosang1111 Since there are so many methods, this would probably be a series of videos, rather than a single video, but either way, it's on the to-do list. However, it will probably be a while before I can get to it.
@giosang1111 4 years ago
@@statquest Hi! I am looking forward to it. All the best!
@kushaltm6325 6 years ago
Josh, thank you very much for helping us out with stats. When I get a job, I will surely contribute towards your efforts. I am struggling to understand things at 3:10 - why should it be a problem if we do NOT center the data? Can you please explain with respect to your "PCA - Clearly Explained" video? My prof wouldn't answer it, so I'm asking a cool stat guru about it :) If it requires too much elaboration, please point me to other resources... Thanks again. Best wishes from India... :)
@statquest 6 years ago
Thanks!! Do you mean in terms of "PCA - Clearly Explained" or in terms of "PCA Step-by-Step"? The former shows the "old" or "original" method of PCA, which was to find the eigenvectors of the covariance matrix. The latter, "Step-by-Step", shows how PCA is done using the more modern technique of Singular Value Decomposition. I think it is easier to understand centering in terms of SVD.
@survivio8937 4 years ago
Thank you so much for these amazing videos. With my newfound free time, I am trying to learn about PCA in preparation for upcoming RNA-seq experiments. I have yet to do this and will probably understand more once I have practical experience, but one thing struck me as odd in your video. When scaling data, you state that the typical method is to divide by the standard deviation, assuming large values will have a larger SD. But intuitively it would make sense to scale data based on the mean rather than the SD. For example, if I had one gene which is highly expressed but not variable, it would not be scaled down appropriately and would have an oversized contribution to PC1. Am I thinking about this wrong, or is there some reason why the mean is a bad choice? Next, it seems that with scaling, small changes in rare transcripts (that might just be error and not true transcripts) would contribute a lot to the variability and thus to PC1; does this not present a problem? Also, another comment: I find this and the prior video on PCA from 2018 much more intuitive than the one you produced previously, in which you discuss generating a PC axis by looking at the spread of data for 2 cells with multiple transcripts and coming up with weights or "loading scores" for each transcript based on high and low expression. Thank you
@statquest 4 years ago
In the PCA step-by-step video ( kzbin.info/www/bejne/fJjEnI2ta7Bkh7M ) one of the first things we do is center the data. This is the equivalent of subtracting the mean value from each dimension in the data. So, for your example, if you have a gene with high expression, but no variation, we will subtract the mean of that gene from each replicate. So that part of the data standardization is already taken care of. The original PCA video is based on the old way of doing PCA, which is still taught as if the new way does not exist. The old way is based on creating a variance/covariance matrix of all the observations. I agree, that it is not as intuitive to understand as the new way, which is to use Singular Value Decomposition.
@gspb4 4 years ago
@@statquest Hi Josh. You mention the "old way" of performing PCA using the variance/covariance matrix versus the new way of using SVD. Do both techniques produce identical results? Further, have you considered producing videos on non-linear PCA? Anyway, thanks so much for what you do. I'm currently taking a computational biology course in grad school and wouldn't be able to get through it without your videos!!
@mostafael-tager8908 4 years ago
Thanks for the video, but I think there is a simple mistake at 2:08, when you said to mix 0.77 Math with 0.77 Reading. I thought that both must add up to 1 - or did I get something wrong?
@statquest 4 years ago
0.77 for math and 0.77 for reading represent 2 sides of a triangle that has been normalized so that the hypotenuse = 1. In other words, using the Pythagorean theorem, sqrt(0.77^2 + 0.77^2) = 1. For more details about this, see minute 11 and second 16 in this video: kzbin.info/www/bejne/fJjEnI2ta7Bkh7M
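A quick numeric sketch of the normalization being described (the quoted 0.77 values look rounded, since sqrt(0.77^2 + 0.77^2) is about 1.09 rather than exactly 1; dividing the loading vector by its own length is what makes the hypotenuse exactly 1):

v      <- c(math = 0.77, reading = 0.77)  # a raw, hypothetical "recipe"
v_unit <- v / sqrt(sum(v^2))              # divide by the hypotenuse
sqrt(sum(v_unit^2))                       # exactly 1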
@thuyduongnguyen1231 4 years ago
Dear Josh, I have watched PCA (Step-by-Step) and this video of yours. They really helped me get over my fear of the math and try to understand the terminology. However, I wonder: what if our data has 20 attributes (not 2 or 3 attributes like in the video)? Does that mean we will have 20 PCs, or is there another approach to determine the maximum number of PCs? Thank you very much
@statquest 4 years ago
I answer this question at 3:30
@majidkh2695 5 years ago
At 7:01, for the case where we have 2 points and 3 features, shouldn't the number of PCs be 2?! With 3 features we don't have a line anymore, but a hyperplane!
@sarahjamal86 5 years ago
OK... since PCA uses the SVD and the covariance matrix, not centering the data at the origin means that the data is not mean-free, and removing the mean is part of constructing the covariance matrix. So not having zero-mean data means that our eigenvector will not be 100% correctly derived.
@statquest 5 years ago
PCA uses SVD or the covariance matrix; it doesn't use both. Older PCA methods use a covariance matrix, and a covariance matrix is automatically centered, so you don't need to worry about this. However, newer PCA methods use SVD because it is more likely to give you the correct result (SVD is more "numerically stable"), and when using SVD, you need to center your data (or make sure that the program you are using will center it for you). Otherwise you get the errors illustrated at 2:55.
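A minimal R sketch (hypothetical data) of the kind of error centering prevents: without it, the first direction found by SVD points from the origin toward the cloud's mean instead of along the cloud's spread.

set.seed(5)
X <- cbind(rnorm(50, mean = 10, sd = 2),
           rnorm(50, mean = 10, sd = 0.2))
svd(X)$v[, 1]                        # uncentered: roughly (0.7, 0.7), toward the mean
svd(scale(X, scale = FALSE))$v[, 1]  # centered: roughly (1, 0), along the real variation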
@糜家睿 6 years ago
Two quick questions, Joshua. When we deal with RNA-seq data, we should log-transform the data before running PCA, right? Can I say it is a way to minimize the effect of outliers when determining the PCs? My second question: in R, there is a built-in function called prcomp; also, in many other packages there are functions like runPCA and plotPCA. How do I know whether these functions will center the data before calculating variation and doing projections? Thanks!
@statquest 6 years ago
Log transforming RNA-seq data before PCA is a good idea and I generally do it. For prcomp(), there is a parameter "scale" that you can set to TRUE. When you do this, prcomp() will center and scale your data for you. In general, you can always look up the documentation for the PCA function you are using. In R you can get the documentation for prcomp() with the call "?prcomp()".
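A minimal usage sketch with hypothetical counts; note that prcomp()'s centering argument, center, defaults to TRUE, and the scaling argument is formally spelled scale. (with a trailing dot), although scale = TRUE also works through R's partial argument matching:

set.seed(13)
counts <- matrix(rpois(6 * 100, lambda = 20), nrow = 6)  # 6 samples, 100 genes
log_counts <- log2(counts + 1)                           # log-transform first
pca <- prcomp(log_counts, center = TRUE, scale. = TRUE)
summary(pca)                                             # variance explained per PC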
@bz6445 1 month ago
@@statquest Or how about not using log transformation or scaling? I'm trying to decide if I should pursue using just the normalized read counts, since I typically get meaningful results relevant to our interests in the organism's biology. Is that a good reason to skip scaling or log transformations? The explained variance of the principal components is high, and the contributing variables (the genes) are highly expressed but relevant to the development of the organism. What do you think - is this a valid reason?
@statquest 1 month ago
@@bz6445 Maybe you can create a histogram of the reads to see how things are distributed. Then do the same thing with the log scaling. You may notice a few outlier genes - look into those - see if they are relevant to your experiment. (You could also look at the PCA loading scores).
@mojo9Y7OsXKT 4 years ago
How come this video has gone "Private"? The screen says: "Video unavailable. This video is private"!!
@statquest 4 years ago
This specific video? Or are you asking about another video?
@mojo9Y7OsXKT 4 years ago
@@statquest This video was showing as private yesterday. It's come back up today! Could've been a glitch. Thanks for all your vids.
@statquest 4 years ago
@@mojo9Y7OsXKT Yeah, something strange must have happened. I'm glad it's back. :)
@JaspreetSingh-eh1vy 4 years ago
So, technically speaking, the number of PCs = the number of features, but if the number of samples < the number of features, then the number of PCs = the number of samples - 1. Am I right?
@statquest 4 years ago
yep
@pattiknuth4822 3 years ago
Drop the song.
@statquest 3 years ago
Noted.
@seazink5357 5 years ago
love you
@shwetankagrawal4253 5 years ago
Hey Josh, I am not able to understand kernel PCA. Can you explain it, or tell me the name of a book that can give me a clear understanding of it?
@1989ENM 5 years ago
...for me?
@statquest 5 years ago
Hooray!