Principal Component Analysis (PCA) [Matlab]

87,856 views

Comments: 69

@starriet 2 years ago

For future reference: in the first part of the video, X's _columns_ (not rows) are the individual points. (Correction to 5:02: it is the columns, not the rows, that each hold the X average.) Also note that the code uses the 'svd' function, not the 'pca' function. This can be confusing because Prof. Brunton says in the previous lecture (the first video on PCA) that PCA assumes each _row_ represents an individual (e.g. a person), in contrast to SVD, which assumes each _column_ does.
*_BUT,_* in the second part (ovarian cancer), even though the code uses the 'svd' function, the 'obs' matrix is 216x4000 (216 patients), so each _row_ represents an individual patient. Thus, here, U and V play the roles of V and U, respectively, from the first part of the lecture. The for loop in the code plots each patient (each dot) on the first 3 "principal" axes (in Matlab, A' means the conjugate transpose of A).
*_However,_* the loop calculates dot products of two long vectors (4000 elements each, and this can be even larger in other examples). We _don't_ need this calculation, because U already contains the same values up to a scale factor: since obs*V = U*S, the dot product for patient i and axis j equals U(i,j)*S(j,j). (This U would have been V if each patient were represented by a column, not a row.) So we can just use U(i,1)*S(1,1), U(i,2)*S(2,2), U(i,3)*S(3,3) for x, y, z in the for loop instead of calculating dot products; dropping the S factors only rescales the axes. (I don't use MATLAB, but it should work. In Python, the only differences would be that indices start from 0 and use square brackets instead of parentheses.) Still, knowing why those dot products (a "projection" onto orthonormal vectors, in this case) work is important for understanding SVD and PCA. Anyway, thanks a lot for this great series of lectures. Awesome.
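The relation this comment relies on can be checked numerically. The sketch below is not the lecture's code: it uses a small random matrix as a stand-in for 'obs' (the sizes and the names proj, fromU and maxDiff are made up for illustration) and compares the dot-product projections with the entries of U. Because obs = U*S*V' and the columns of V are orthonormal, obs*V = U*S, so the two should agree once U is scaled by the singular values.

```matlab
% Synthetic stand-in for the lecture's 'obs' matrix: rows are samples.
nSamples  = 20;   % hypothetical sizes, far smaller than 216 x 4000
nFeatures = 50;
obs = randn(nSamples, nFeatures);

[U, S, V] = svd(obs, 'econ');

% Projection of every sample onto the first three right singular vectors,
% i.e. the dot products computed one sample at a time in a plotting loop.
proj = obs * V(:, 1:3);

% The same coordinates read from U, rescaled by the singular values.
fromU = U(:, 1:3) * S(1:3, 1:3);

maxDiff = max(abs(proj(:) - fromU(:)));
fprintf('largest difference between the two: %g\n', maxDiff);
```

Using U(:,1:3) without the S factors gives the same scatter with each axis rescaled, which is usually fine for a plot but changes the relative spread of the axes.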

@armbusta 1 year ago

THANK YOU. Your comment has saved my sanity, it was the final puzzle piece that made it all click. This should be included in the description of the video word-for-word.

@ratnaa6326 4 years ago

thank you so much! my understanding increased exponentially when you explained with the ovarian cancer example.

@MageshJohn 3 years ago

Excellent! Your 15-minute video really captures the majority of the 100 years of information on PCA. SVD works!

@shankyxyz 6 days ago

Is there an implied fallacy in the ovarian data model when you determine the cutoff for the eigenmode? The cancer dependence could be on an obscure, weak eigencoordinate.

@Eigensteve 6 days ago

Great point. Yes indeed, we may well be truncating important information in low variance correlations. These may be especially important for rare events like cancer.

@nwxxzchen3105 4 years ago

Code is more understandable for me, thanks for the great job. This example shows what PCA looks like geometrically. There is also an implicit relationship between the shape of the data points and the transformation defined by the centered matrix, which is not mentioned in linear algebra courses.

@lesh2956 3 years ago

Thank you so much for this video. The explanation with the example is gold.

@drdlecture6137 3 years ago

Thanks for this excellent lecture. I have a question: why didn't you subtract the mean of the rows before computing the SVD, as in the previous example and as explained in the PCA video?

@rakulansivanesapillai6215 4 years ago

Lovely setup and great presentation. Thanks

@DewanggaPrabowo 4 years ago

Such a nice presentation. This is what I was looking for and what I came for. Thanks...

@sapertuz 4 years ago

I just don't understand why, for the ovarian cancer example, you don't do the preprocessing steps (mean subtraction and division by sqrt(Nmeas)).

@jhonportella5618 3 years ago

Well, I will try to answer your question. What you are reconstructing in the Gaussian data example with SVD is a scaled version of the standard deviation: STD/SQRT(n). This is done because he is trying to plot different confidence intervals. As you may remember, the equation for the confidence interval is Mean + Z*STD/SQRT(n), which is what he is plotting with the red circles. So with a Z value of 3 you capture almost 99% of the data, which is what we see in the plots. That normalization term is therefore specific to that code's application. For PCA you don't always have to normalize or standardize the data; it is only needed when you are working with correlations or when the application demands it. In fact, if you are working with SVD, the data doesn't even have to be mean-centered, which is one of the advantages of using truncated SVD instead of PCA.

@HD141937 4 years ago

At 8:27, isn't it the columns of V (not U) that point into the directions of maximum variance?

@mataFot 2 years ago

Mr. Steve, first of all I would like to thank you for this video. Secondly, I would like to ask a question, because this is my first time studying PCA. At 5:47 in your video you explain that you divide B by the square root of nPoints. Where does this come from? Did you do it because you wanted to make the values smaller?
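One common way to read the division by sqrt(nPoints): it makes the squared singular values equal to the eigenvalues of the sample covariance matrix B*B'/nPoints, so the singular values themselves are standard deviations along the principal directions. A minimal sketch under that reading, using made-up Gaussian data with standard deviations 2 and 0.5 (the names sigma, C and lambda are illustrative, not from the video):

```matlab
nPoints = 10000;

% Columns are data points; the spread is 2 along x and 0.5 along y.
X = [2 0; 0 0.5] * randn(2, nPoints);

% Subtract the mean of each row, as in the first part of the lecture.
Xavg = mean(X, 2);
B = X - Xavg * ones(1, nPoints);

% SVD of the scaled, mean-subtracted data.
[U, S, V] = svd(B / sqrt(nPoints), 'econ');
sigma = diag(S);                 % expected to be close to [2; 0.5]

% Eigenvalues of the sample covariance matrix, largest first.
C = (B * B') / nPoints;
C = (C + C') / 2;                % enforce exact symmetry before eig
lambda = sort(eig(C), 'descend');

fprintf('singular values:       %g  %g\n', sigma(1), sigma(2));
fprintf('sqrt(cov eigenvalues): %g  %g\n', sqrt(lambda(1)), sqrt(lambda(2)));
```

Without the scaling, the singular values would grow like sqrt(nPoints) as more points are added, even though the shape of the distribution stays the same.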

@abolfazlabbasi4854 4 years ago

High quality presentation, Thanks for sharing.

@Eigensteve 4 years ago

Glad you liked it!

@haideralishuvo4781 3 years ago

Great video, but one point of confusion: aren't we supposed to subtract the mean before computing the SVD in the ovarian cancer case?

@jhonportella5618 3 years ago

He is using SVD as a way to compute PCA, and one of the advantages of SVD over classical PCA with the eigendecomposition formulation is that SVD (or truncated SVD) does not require the data to be mean-centered.
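How much centering matters can be checked on a toy case. The SVD can certainly be computed on uncentered data, but the leading direction it returns may then differ from the direction of largest variance. A hedged sketch with made-up data (a point cloud that varies mostly along y but sits far from the origin along x; all names and numbers are illustrative):

```matlab
nPoints = 2000;

% Columns are points: std 0.1 along x, std 1 along y, mean at (10, 0).
X = [0.1 0; 0 1] * randn(2, nPoints) + [10; 0] * ones(1, nPoints);

% Leading left singular vector of the raw (uncentered) data.
[Uraw, ~, ~] = svd(X, 'econ');
rawDir = Uraw(:, 1);

% Leading left singular vector after subtracting the mean.
B = X - mean(X, 2) * ones(1, nPoints);
[Ucen, ~, ~] = svd(B, 'econ');
cenDir = Ucen(:, 1);

fprintf('uncentered leading direction: [%g, %g]\n', rawDir(1), rawDir(2));
fprintf('centered leading direction:   [%g, %g]\n', cenDir(1), cenDir(2));
```

In this construction the uncentered direction points roughly along x, toward the mean, while the centered one points roughly along y, where the variance actually is. When the data mean is already near zero the two tend to agree.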

@fermijman 4 years ago

Excellent lecture. Question: once you have determined the magnitude of the principal components, is there a way of determining which features they represent in your original data? For instance, determining which features from the cancer data correlate most strongly with a cancer diagnosis?

@ElPrestigo 2 years ago

BTW, to add a legend for the ovarian data you can use plot handles as follows:
h = zeros(2,1);
...
if strcmp(grp{i}, 'Cancer')
    h(1) = plot3(...);
else
    h(2) = plot3(...);
end
...
legend(h, 'Cancer', 'Normal')

@yaraali4493 4 years ago

Thanks. How can I apply varimax rotation to PCA results in Matlab?

@DeepakKumar-tc9iy 4 years ago

@Steve I work with spatial time series data (3D: x, y, t; e.g. temperature). I have seen code that reshapes the spatial dimensions into 1D, giving a 2D series, and then applies PCA. But I need to work with vector data (e.g. wind), which comes as components (u, v) in 3D, making it 4D. Would it make sense to reshape the 3D part (2 spatial dimensions + 1 component dimension) into 1D, giving 2D data, and then apply PCA?

@zhengyangkrisweng3338 3 years ago

Just to clarify, when you mention the energy of the statistical data, you're referring to the extent to which it captures the trend in the data, right?

@mataFot 2 years ago

Also, is there any way to get this code for practice? Thank you in advance!

@sohummisra8969 4 years ago

Wonderful series of lectures. I have a question regarding using the top 3 PCs. Why are you not scaling the top 3 eigenvectors by their associated eigenvalues from S in order to find x, y and z?

@gzitterspiller 1 year ago

Because S only tells you how much variance there is in those directions; he just projects the data points onto those directions and plots them.

@yourswimpal 3 years ago

Great explanation, thanks! May I know how to tell which genes have the highest "impact" with regard to PC 1 (in the ovarian cancer example)? Is there a way I can tell from matrix U or matrix V? I just learnt PCA 3 days ago, sorry if this is a noob question :)

@Eta_Carinae__ 2 years ago

You tell by the sigma matrix, afaik. Look for the largest singular value in sigma and find its corresponding vector in V; the features with the largest entries in that vector are your most significant factors.
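As a rough illustration of this idea: when rows are samples and columns are features, each column of V assigns one weight (a loading) to every feature, so the features with the largest absolute entries in V(:,1) contribute most to PC 1. The sketch below uses synthetic data in which feature 4 is deliberately given most of the variance; the sizes and names are hypothetical and this is not the ovarian cancer data.

```matlab
nSamples  = 100;
nFeatures = 6;

% Small noise in every feature, plus a large spread in feature 4 only.
obs = 0.1 * randn(nSamples, nFeatures);
obs(:, 4) = obs(:, 4) + 5 * randn(nSamples, 1);

% Center each feature (column) before the SVD.
B = obs - ones(nSamples, 1) * mean(obs, 1);
[U, S, V] = svd(B, 'econ');

% Rank features by the size of their weight in the first principal direction.
[~, order] = sort(abs(V(:, 1)), 'descend');
topFeature = order(1);

fprintf('feature with the largest PC 1 loading: %d\n', topFeature);
```

A large loading says the feature drives the variance captured by that component. It does not by itself say the feature is related to the diagnosis, since the labels are never used in the SVD.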

@Sheepyyyyyyyy 4 years ago

Thank you so much! I wrote the same code as you, but the 2 lines are not perpendicular. In the code, the circles should be red ('r-') and the lines should be cyan ('c-').

@roger_island90 4 years ago

Hello sir, I'm using your code to visualize the classification of ECG signals with 3 labels. The diagram it generates is not correct. I think the problem comes from the "for" loop. Please help me fix this, because I have tried several times to no avail.

@DrAndyShick 1 year ago

2:10 Actually, that would be 16 times as much variance

@alex.ander.bmblbn 4 years ago

dear Steve, I see that in my data set 2 states contribute to 90% of the data, how do I know, which ones?

@ifan9390 4 years ago

Is singular value decomposition also used in 3-dimensional data plots?

@chichungchan6766 3 years ago

Really great video! However, can I relate PC 1, 2 and 3 back to the actual variables?

@Assault137 4 years ago

Off-topic, but how do you get the IDE to be dark for your presentations?

@abdjahdoiahdoai 3 years ago

He inverted the colors at the OS level.

@ifan9390 4 years ago

what're the differences between the 2 Dimensional and 3 Dimensional data set plots?

@Daniel88santos 4 years ago

Hello Steve, first I would like to thank you for your effort in sharing and teaching these amazing techniques. I would also like to ask whether you could make a video on how to find the best r value using the Gavish-Donoho method in Python. This would be very useful for me. Thanks a lot and keep going.

@Eigensteve 4 years ago

Thanks for the comment. Yes, that video is coming up (in Matlab and Python). Just need a few days to process and upload.

@mybean1096 4 years ago

Python? What snakes gotta do with it?

@AdityaDiwakarVex 4 years ago

I got a little bit confused, what's the intuition behind calculating x, y, and z by doing V times b (observations)? What is x, y, and z showing? Sorry for the silly question, thanks in advance.

@Eigensteve 4 years ago

Here, x y and z are just the first three principal components of the data set. So it allows us to visualize how the data scatters in these new V coordinates. There are interesting patterns in V(:,4) and V(:,5) too, but I can't plot in x y z u v coordinates and make sense of it as a puny human stuck in 3D.

@AdityaDiwakarVex 4 years ago

@@Eigensteve Oh, so it's the data reconstruction but just using 3 of the components rather than all 4000 of them... why do we do V * d or V * obs?

@Eigensteve 4 years ago

@@AdityaDiwakarVex V * obs essentially takes the "observations" (i.e. the data) and transforms it into the V coordinate system. (it is also V-transpose * obs, which is an important subtlety when computing these things)

@AdityaDiwakarVex 4 years ago

@@Eigensteve Oh right, I do see that it is V-transpose. Thank you so much, that cleared it up. You are easily one of the best professors/teachers I've come across, thank you!

@mybean1096 4 years ago

Been trying to write a formula to combine both Honey Mustard [ detaSet ] and Ranch BBQ Sauce [ dataSet(2×2) ] as one component while randomly scaling calories and sugar. Don't see what The Matrix movie has to do with anything though.

@Eigensteve 4 years ago

Nice. Actually, people do think about food, flavors, and chemistry in PCA coordinates. Some neat and unexpected food pairings have been discovered this way.

@jalilkhan321 4 years ago

How is Proper Orthogonal Decomposition, as used in fluids, different from PCA or SVD?

@justinli19901027 4 years ago

love the series, thank you

@burakyesilyurt9544 4 years ago

Is there a convention about signs? I was trying to convince myself, and what confused me is that the T1, T2, T3 (score) matrices in the code below have the same values with different signs. I found some articles and code about flipping the signs of SVD and PCA results, but I couldn't be sure. I'd be very happy if you could clear this up for me, thanks!
%% CODE
clear; close all; clc;
load fisheriris
X = meas;
% X = 5*randn(300, 10);
[W, D] = eig(X'*X);
W = W(:, end:-1:1);
D = D(end:-1:1, end:-1:1);
T1 = X*W;
[U, S, V] = svd(X, 'econ');
T2 = U*S;
[coeff, score, latent] = pca(X, 'Algorithm', 'svd', 'centered', false);
T3 = score;
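Eigenvectors and singular vectors are only defined up to a sign, so different routines can legitimately return columns that differ by a factor of -1. The sketch below uses random data instead of fisheriris so that no dataset or toolbox is needed, and the alignment step is just one possible convention chosen for illustration, not something built into eig or svd. It flips each column of the eigendecomposition scores to match the SVD scores and checks that nothing else differs.

```matlab
X = randn(30, 4);

% Scores from the eigendecomposition of X'*X, largest eigenvalue first.
G = X' * X;
G = (G + G') / 2;                      % enforce exact symmetry before eig
[W, D] = eig(G);
[~, idx] = sort(diag(D), 'descend');
W = W(:, idx);
T1 = X * W;

% Scores from the SVD.
[U, S, V] = svd(X, 'econ');
T2 = U * S;

% Each column of T1 equals the matching column of T2 up to a sign.
% Flip T1's columns so that they correlate positively with T2's.
signs = sign(sum(T1 .* T2, 1));
T1aligned = bsxfun(@times, T1, signs);

maxDiff = max(abs(T1aligned(:) - T2(:)));
fprintf('largest difference after sign alignment: %g\n', maxDiff);
```

Another frequently used convention is to flip each direction so that its largest-magnitude entry is positive; any fixed rule works as long as it is applied consistently to both the directions and the scores.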

@erikschiferle3385 3 years ago

One other question, in the U/S/V, which index corresponds to PC2? is it V(2,1) or V(2,2). Thank you!

@erikschiferle3385 3 years ago

NVM, I think I answered my own question. Variable "V" is 4000x216, so I believe it would be the row "label", if there were one for the ovarian cancer data?

@ayaahmed-hc6wu 4 years ago

Great, Steve. I would like to thank you for your effort. Could you please help me with Matlab code for feature extraction using PCA on galaxy images? I searched a lot and did not find anything.

@Eigensteve 4 years ago

I think any old PCA code in Matlab will work if your data is structured as a matrix.

@namiramomokhan6414 4 years ago

Which Matlab version should I use? I have 2014.

@td1738 3 years ago

Where can I find the code? Cheers.

@erikschiferle3385 3 years ago

Can someone explain why for the log and cumulative singular value graphs, we have 216 along the x axis? Why is it not 4000 for the number of genetic markers?

@zhankunxi1058 3 years ago

Hi Erik. I had the same confusion at the beginning. But after digging a bit into Steve's previous video: the singular value plot shows how much variance is captured by each principal component, and the cumulative sum is Sum(lambda_k) divided by the sum of all the lambdas. Dimension-wise, the B matrix (with means subtracted) is 216*4000. Through the SVD, U is 216*216, Sigma is 216*216 and V transpose is 216*4000. I think both plots are drawn against the number of sigmas (216 here).
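The dimension argument can be reproduced on a smaller matrix. In this sketch a 12x40 random matrix stands in for the 216x4000 data (the sizes and the names sv and cumEnergy are illustrative): the economy SVD returns only as many singular values as the smaller dimension, which is why the x axis of both plots stops at the number of samples.

```matlab
% Stand-in for the 216 x 4000 'obs' matrix: 12 samples, 40 features.
obs = randn(12, 40);

[U, S, V] = svd(obs, 'econ');

fprintf('U is %d x %d\n', size(U, 1), size(U, 2));   % 12 x 12
fprintf('S is %d x %d\n', size(S, 1), size(S, 2));   % 12 x 12
fprintf('V is %d x %d\n', size(V, 1), size(V, 2));   % 40 x 12

% Quantities behind the two plots: singular values and their running share.
sv = diag(S);
cumEnergy = cumsum(sv) / sum(sv);

fprintf('number of singular values: %d\n', numel(sv));
```

A matrix with 12 rows has rank at most 12, so there cannot be more than 12 nonzero singular values no matter how many columns it has.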

@yangyangliu9226 4 years ago

Thanks Steve for the amazing explanation! One thing I don't quite understand: why does U*S*[cos(theta); sin(theta)] capture 1 std of the data?

@jhonportella5618 3 years ago

What I think he is actually doing is capturing, in the SVD procedure, the diffeomorphism (fancy terminology, but in a few words it is, in this case, a linear transformation whose inverse exists) to reconstruct STD/SQRT(N), which is part of the equation for the confidence interval. Then he is plotting those confidence intervals up to a z value of 3, which corresponds to almost 99% of the data.

@Nikh__ 2 years ago

(I know this comment is 2 years old, but) if anyone else is wondering, scroll back to how X is created initially. Note that U only rotates vectors and sigma stretches them, and that U and sigma come from the SVD of B.
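One way to make the "rotate and stretch" picture concrete: U*S maps the unit circle to an ellipse, and every point on that ellipse lies at squared Mahalanobis distance 1 from the mean with respect to the sample covariance C = B*B'/nPoints, which is the sense in which it marks one standard deviation. The sketch below uses made-up data generated in the same spirit as the lecture's first example (a stretch by 2 and 0.5, a rotation, and a shift); the specific numbers and the names C and d2 are assumptions for illustration.

```matlab
nPoints = 5000;

% Stretch, rotate and shift standard normal points (columns are points).
R = [cos(pi/3) -sin(pi/3); sin(pi/3) cos(pi/3)];
X = R * diag([2; 0.5]) * randn(2, nPoints) + [2; 1] * ones(1, nPoints);

% Mean-subtract and take the scaled SVD.
B = X - mean(X, 2) * ones(1, nPoints);
[U, S, V] = svd(B / sqrt(nPoints), 'econ');

% Image of the unit circle under "stretch by S, then rotate by U".
theta = (0:0.01:1) * 2 * pi;
Xstd = U * S * [cos(theta); sin(theta)];

% Squared Mahalanobis distance of each ellipse point under the sample covariance.
C = (B * B') / nPoints;
d2 = sum(Xstd .* (C \ Xstd), 1);

fprintf('d2 ranges from %g to %g\n', min(d2), max(d2));
```

Multiplying Xstd by 2 or 3 gives the 2- and 3-standard-deviation ellipses in the same sense.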

@Alejo10messi 4 years ago

16 times more variance in one direction than the other 2:10

@prabhath8618 4 years ago

can i get the code?

@Eigensteve 4 years ago

All code on databookuw.com

@realsemig 4 years ago

Me being a simple pleb: That looks like a galaxy!

@leif1075 4 years ago

No, you're right, it does.

@Старкрафт2комедия 2 years ago

This example is way too complicated. You should stick to something like 10-20 data points for the initial demonstration. Otherwise it's too hard to understand exactly, only in hand-wavy terms.