Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@falaksingla62422 жыл бұрын
Hi Josh, Love your content. It has helped me learn a lot & grow. You are doing awesome work. Please continue to do so. I wanted to support you, but unfortunately your PayPal link seems to be broken. Please update it.
@elnurazhalieva12625 жыл бұрын
Rarely do I recommend a YouTube channel to someone, but this channel is a must-watch!
@statquest5 жыл бұрын
Thank you! :)
@redaaitouahmed82506 жыл бұрын
You're making the life of a student so much easier and happier ... Thankkkkk youuuuuu !!!
@statquest6 жыл бұрын
You're welcome!!! :)
@luig2121 Жыл бұрын
I literally watch your videos as if I'm watching TV. I don't know how you've pulled this off but you are incredible
@statquest Жыл бұрын
Wow, thank you!
@hadihadiyar11854 жыл бұрын
Hi, I got my master in Epidemiology, trying to review statistics and found your channel, you are awesome, you really make statistics easy to understand, TRIPLE BAM for you
@statquest4 жыл бұрын
Thank you very much! :)
@PunmasterSTP8 ай бұрын
I remember learning about t-tests well before linear regression, but it's cool seeing things applied in a different way, especially while going into the deeper concepts. This whole playlist is a stats and machine-learning goldmine!
@statquest8 ай бұрын
BAM! Yes, usually t-tests are taught before linear regression, but I like teaching them in this order (regression first) since the extension of a t-test into ANOVA is way more obvious.
@PunmasterSTP8 ай бұрын
@@statquest That sounds like a good plan.
@Russet_Mantle4 жыл бұрын
This is a really smooth transition from linear models to ANOVA, which is sadly not covered in many stats textbooks.
@statquest4 жыл бұрын
Thanks!
@justalittleguy7332 жыл бұрын
i am seriously failing my beginner stats course because try as i might the lectures are quite literally incomprehensible. i owe you my life!!! thank you for these amazing videos -- i feel like this is the first time ALL semester I am understanding something!
@statquest2 жыл бұрын
HOORAY! I'm glad the videos are helpful.
@justind69316 жыл бұрын
It actually takes me a while to realize the F-statistic shown in this video is the same as standard T-statistics. Great vid!
@statquest6 жыл бұрын
Thanks!!! I know, it's a little weird to look at a t-test from this perspective, but it shows how the F-statistic is a generalization of T-statistics. (Here's a cool hint - just like the F-statistic is a generalization of T-statistics, Chi-square statistics are a generalization of normal statistics....)
@amorrismusic3 жыл бұрын
Never in my life has learning math been easier. Excellent work Josh!
@statquest3 жыл бұрын
Thank you very much!!! :)
@DamianEQuijanoA6 жыл бұрын
I speak very little English, but your teaching methodology is magnificent (it is very professional). Even though it is in English, I manage to understand it better than all the statistics classes in Spanish. You make an enormous effort to make your classes intuitive and easy to understand for people who are not experts in statistics. Congratulations.
@statquest6 жыл бұрын
Thank you very much!!!!
@rwei20497 жыл бұрын
this is the clearest explanation of design matrices I've ever seen!! Thank you soooo much Joshua!
@Sn-nw6zb6 жыл бұрын
Wow, this is a smart way to explain the ANOVA test. It looked so complicated at first, but now it looks straightforward after seeing how it resembles linear regression. Great video!!!
@statquest6 жыл бұрын
Hooray!!! I'm so glad you like this video - it's one of my all time favorites. :)
@statquest6 жыл бұрын
Hooray! :)
@_Chafia6 жыл бұрын
I hope you will have time to answer in just a few words, please! R-squared tells us how useful x is for predicting y, so how do we use it in the case of a t-test or ANOVA? We only talk about F & p. Can we say it explains some % of the variance between treatments, or is it useless? Thank you so much, Mr. Starmer
@statquest6 жыл бұрын
This is a great question. The traditional way to teach and perform t-tests (and ANOVA) only results in 't' or 'F' statistics and a p-value - no R-squared. However, as you see in this video, it's easy to also report R-squared - you just have to want to do it. t-tests and ANOVA are just like regression, and R-squared tells you the same thing - it gives you an estimate of the magnitude of the difference. The p-value just tells you whether it is significant. If you did a t-test and got a small p-value, but also a small R-squared, then you could easily deduce that there's not a huge difference between the two groups (even if it is statistically significant). In contrast, if you did a t-test and got a small p-value and a large R-squared, then you would know that there's a big difference between the two groups. So we can see that R-squared is useful even for the t-test. I suspect that one reason presenting R-squared with t-test results is rare is that, with t-tests, it is easy and very common to plot the data - so people will show you their data and give you the p-value. Seeing the data is sort of like a "visual R-squared" - you can see if the data are very close to each other or far apart.
@_Chafia6 жыл бұрын
THANK YOU SO MUCH.... YOU ARE VERY KIND SIR. If you allow me to summarize: "significant p-value + R-squared" = how big the difference is. Really GREAT! Thanks again & Good luck!
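The point about reporting R-squared for a two-group comparison can be sketched in a few lines of Python. This is only an illustration, not code from the video: the function names and the numbers are made up, and R-squared is assumed to be (SS(mean) - SS(fit)) / SS(mean), where SS(fit) uses each group's own mean.

```python
from statistics import mean


def sum_of_squares(values, center):
    """Sum of squared distances between each value and a center."""
    return sum((v - center) ** 2 for v in values)


def r_squared_two_groups(control, mutant):
    """R-squared for a two-group fit: (SS(mean) - SS(fit)) / SS(mean)."""
    everything = control + mutant
    # SS(mean): residuals around the overall mean of all the data.
    ss_mean = sum_of_squares(everything, mean(everything))
    # SS(fit): residuals around each group's own mean.
    ss_fit = (sum_of_squares(control, mean(control))
              + sum_of_squares(mutant, mean(mutant)))
    return (ss_mean - ss_fit) / ss_mean


# Hypothetical measurements: same spread inside each group,
# but very different distances between the group means.
far_apart = r_squared_two_groups([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
close = r_squared_two_groups([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])

print(f"far apart: R^2 = {far_apart:.3f}")
print(f"close:     R^2 = {close:.3f}")
```

With the same within-group spread, the groups whose means sit far apart give a much larger R-squared, which is the "magnitude of the difference" described above.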
@ashmeetsingh7973Ай бұрын
Love the channel, one request to put some more statquests practising different stats model on different data sets! Thanks so much, binged the whole playlist
@statquestАй бұрын
That's a great idea.
@howardip79655 жыл бұрын
Your videos are very well-prepared and informative. Great teaching materials. You are so generous. Thanks a million.
@statquest5 жыл бұрын
Thank you very much! :)
@zahrahadavand22902 жыл бұрын
Awesome, there's nothing that can't be understood when you explain it, thanks a millionnnn
@statquest2 жыл бұрын
Thank you very much! :)
@user-bz7fj1fk2m4 жыл бұрын
You are blessed and STAY BLESSED. You significantly changed my life with STAT!!!
@statquest4 жыл бұрын
Thank you very much! :)
@SreenikethanI6 ай бұрын
StatBlessed :D
@clarasavary62656 жыл бұрын
Thank you very much for all your clear explanations. It's a real pleasure to listen to you and learn more about Statistics !
@statquest6 жыл бұрын
You're welcome! I'm glad to hear you think the videos are helpful. :)
@baharehbehrooziasl95177 ай бұрын
The interesting thing about this video is that it taught me something that I haven't noticed I didn't know!
@statquest7 ай бұрын
bam! :)
@xxMissCaprIce2 жыл бұрын
I think you might have just saved my life. This is so clearly explained, thank you!
@statquest2 жыл бұрын
Glad it helped!
@charlotteiosson62356 жыл бұрын
These videos are brilliant! I'm completing my PhD and there really isn't enough statistics support available which is as accessible as these videos (and considering we're meant to be doing research, that's not really good enough!) - thanks!
@markaitkin6 жыл бұрын
Love your videos. I have 3 requests... 1. Degrees of freedom 2. Linear regression with regularisation 3. Log linear regression and why coefficient indicates % change Thanks so much!
@statquest6 жыл бұрын
Thanks so much! The degrees of freedom StatQuest is high, high on the to-do list. It is never far from my mind. I have it about 1/2 done in my head, but the second half is tricky - some situations are easier to illustrate than others - but it's just a matter of setting aside time just for it and nothing else and it will get done. The good news is that I'm maybe 1 or 2 months away from doing StatQuests on ridge, lasso and elastic-net regression - all examples of linear regression (or, more generally, generalized linear regression since these ideas can be applied to logistic regression) with regularization. So that's sure to happen soon (just as soon as I can!) The last one, log-linear regression, is the logical follow-up to logistic regression. I may do a "big picture/main ideas" StatQuest on that as soon as I can. It's on the list!
@markaitkin6 жыл бұрын
StatQuest with Josh Starmer thanks for your reply. Can't wait for the next videos
@lilmoesk8997 жыл бұрын
Thanks for the video. I'll have to watch this one a couple more times to fully digest it. It's the first time I've heard of a design matrix, so I'll have to spend some time looking into that.
@autumnp40775 жыл бұрын
Really appreciate the refresher of the regression on the side of the t-test! REINFORCEMENT FOR THE WIN!
@statquest5 жыл бұрын
Yes! :)
@Hajar1992ful7 ай бұрын
Thank you for your amazing videos Josh. You make us smarter!
@statquest7 ай бұрын
Glad you like them!
@alvarorodriguez35524 жыл бұрын
Best statistics teacher on internet!!!!
@statquest4 жыл бұрын
Thank you very much!!!! :)
@aickoyvesschumann34004 жыл бұрын
Great video! I think you should put parentheses around your SS differences in the F-statistics to have correct equations; (SS(mean)-SS(fit))/(p_fit-p_mean). Divisions have generally a higher priority than differences, but you want to first subtract and then divide.
@statquest4 жыл бұрын
Great suggestion! I've added your correction to a pinned comment that will be easy for other people to find.
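Written out with the parentheses made explicit, the F-statistic reads as follows. The comment above only quotes the numerator; the denominator SS(fit) / (n - p_fit) is assumed here to be the usual one for this F-statistic.

```latex
F = \frac{\bigl(SS(\text{mean}) - SS(\text{fit})\bigr) / (p_{\text{fit}} - p_{\text{mean}})}
         {SS(\text{fit}) / (n - p_{\text{fit}})}
```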
@emmafoley89876 жыл бұрын
I've really had trouble understanding what a t test *is* and this was super helpful.
@statquest6 жыл бұрын
Hooray!!!! :)
@USER_GBME2 ай бұрын
Hi, love your videos. Just a quick check to see if I'm still on track. In the previous videos, I thought you described the 'degrees of freedom' as (n-Pfit)/(Pfit-Pmean). If so, in the ANOVA example, since Pfit = 5 and Pmean = 1, do the 'degrees of freedom' equal (n-5)/4? If not, I think I need a solid explanation of this.
@statquest2 ай бұрын
Linear models have 2 different degrees of freedom - one for the numerator of the equation (n-pfit) and one for the denominator (pfit-pmean).
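A small sketch of how the two degrees of freedom enter the F-statistic, assuming the form F = ((SS(mean) - SS(fit)) / (p_fit - p_mean)) / (SS(fit) / (n - p_fit)). Rearranged, this is the ratio (SS(mean) - SS(fit)) / SS(fit) multiplied by (n - p_fit) / (p_fit - p_mean), which is the fraction quoted in the question. The function name and all of the numbers below are made up for illustration.

```python
def f_statistic(ss_mean, ss_fit, n, p_fit, p_mean=1):
    """Return F together with the two degrees of freedom it is built from."""
    df_extra_parameters = p_fit - p_mean   # divides SS(mean) - SS(fit)
    df_residual = n - p_fit                # divides SS(fit)
    f = ((ss_mean - ss_fit) / df_extra_parameters) / (ss_fit / df_residual)
    return f, df_extra_parameters, df_residual


# Hypothetical ANOVA with 5 groups (p_fit = 5) and 20 measurements.
f, df1, df2 = f_statistic(ss_mean=100.0, ss_fit=40.0, n=20, p_fit=5)
print(f"F = {f:.3f} with {df1} and {df2} degrees of freedom")
```

So with p_fit = 5 and p_mean = 1 there is not one combined value like (n-5)/4; the two numbers, 4 and n-5, are kept separate and both are needed to look up the p-value.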
@junmingzheng74565 жыл бұрын
OMG, so that's how ANOVA and linear regression are connected.
@seanpitcher8957 Жыл бұрын
Bought the book. Nicely done and useful!
@statquest Жыл бұрын
Awesome, thank you!
@markobe085 жыл бұрын
I will just go on a liking spree on all of your videos
@statquest5 жыл бұрын
Hooray! :)
@redcat74673 жыл бұрын
I just watched a video on Confidence Intervals back from 2015 and the song was pretty much the same, yet what a difference!
@statquest3 жыл бұрын
:)
@karannchew25343 жыл бұрын
1:36 The goal of t-test is to compare means (eg two groups or categories of data) and see if they are significantly different. 9:23 ANOVA. Compare three or more groups of data.
@statquest3 жыл бұрын
bam!
@usfbge6 ай бұрын
Hi Josh, Your videos are amazing, easy to follow and understand. Just wondering if you could upload a video on GLMM and LMM models and when to use which model? This would help to clarify.
@statquest6 ай бұрын
I hope to do that one day, however, it will probably be a while since I'm writing a book on neural networks right now.
@alexanderkononov40685 жыл бұрын
Maaan! I found you, I found glm, finally! Thanks!
@statquest5 жыл бұрын
Hooray! :)
@shichengguo80644 жыл бұрын
Hi Josh, It's time to bring linear mixed models. Thankkkk Youuuuu!!!
@statquest4 жыл бұрын
I'll keep that topic in mind.
@ronykroy5 жыл бұрын
I keeep coming here to hear the Baaaaam !! :)
@statquest5 жыл бұрын
Hooray! :)
@mohammadalidastgheib26882 жыл бұрын
Thank you for your clear explanations.
@statquest2 жыл бұрын
Bam! :)
@woodypham64744 жыл бұрын
What else i can say about this clip? You're the best
@statquest4 жыл бұрын
Hooray!!! :)
@rahuldey63693 жыл бұрын
1) So are we basically comparing the variability of each data point of that categorical feature around the overall sample mean to the variability of the individual data points around their group means? Or how can I explain in a simple sentence what these tests are and what we can infer? 2) This is a univariate analysis, right? 3) In the Gene Expression figures you've taken 4 data points as an example. What are those? I mean, how can I interpret them? Are the control and mutant categories encoded? This was the only video that dared to visualize what the t-test & ANOVA are
@statquest3 жыл бұрын
1) Yes, that's the main idea 2) The t-test is univariate. However, this series of videos also gives many multivariate examples. 3) Those 4 data points reflect how many mRNA transcripts are measured. If that doesn't mean anything to you, just imagine we counted something, like green apples, at 4 different grocery stores.
@rahuldey63693 жыл бұрын
@@statquest In that sense, those green apples are the dependent variable in our dataset, and we are grouping them by 4 different grocery stores?
@statquest3 жыл бұрын
@@rahuldey6369 yes
@rahuldey63693 жыл бұрын
@@statquest Thank you so much for the clarification. Best wishes. Looking forward to learn more from you
@Tyokok5 жыл бұрын
Hi Josh, quick Q. Isn't the test you explained here an F-test? Doesn't a t-test use t-score = (slope beta - 0)/standard error, and then get the p-value from a t-table? Or are they the same thing? I'm a little confused here. Thank you!
@statquest5 жыл бұрын
This is a great question. The t-test is just a specific type of F-test. If you have statistics software, you can compare the results and see that the p-values are the same (however, the F-statistic itself will be the square of the t-statistic. Why the square? Because, as you saw in the first video in this series, the F-statistic can never be negative, but the t-statistic can.) There are multiple ways to calculate a t-test, but using an F-test is my favorite because it is much more flexible. Does that make sense?
@Tyokok5 жыл бұрын
@@statquest I knew you would take it to a further level. So basically the two tests are both significance tests of hypotheses about the model parameters, just using different methods, so the p-value should refer to the same thing. BAM! Thank you so much!
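The claim that the F-statistic is the square of the t-statistic can be checked by hand. The sketch below assumes the classic pooled-variance (Student's) t-statistic and the SS(mean)/SS(fit) form of F with p_fit = 2 and p_mean = 1. The data are invented, chosen so the group means come out to 2.2 and 3.6.

```python
from math import sqrt
from statistics import mean


def student_t(a, b):
    """Pooled-variance (Student's) t-statistic for two independent groups."""
    na, nb = len(a), len(b)
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_variance = (ss_a + ss_b) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_variance * (1 / na + 1 / nb))


def f_from_fit(a, b):
    """F-statistic comparing the two-mean fit to the overall-mean fit."""
    everything = a + b
    n = len(everything)
    ss_mean = sum((x - mean(everything)) ** 2 for x in everything)
    ss_fit = (sum((x - mean(a)) ** 2 for x in a)
              + sum((x - mean(b)) ** 2 for x in b))
    p_fit, p_mean = 2, 1  # two group means vs. one overall mean
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))


# Hypothetical gene expression values.
control = [1.8, 2.1, 2.4, 2.5]
mutant = [3.2, 3.5, 3.7, 4.0]

t = student_t(control, mutant)
f = f_from_fit(control, mutant)
print(f"t = {t:.4f}, t^2 = {t ** 2:.4f}, F = {f:.4f}")
```

The sign of t depends on which group is listed first, while F is never negative, which matches the explanation for the square given above.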
@rookiedrummer68383 жыл бұрын
Thanks @Josh, I have some questions: 1] Suppose we have 5 independent variables and a label. How does ANOVA calculate a p-value for each feature in this case? 2] Does it fit a regression for each independentVariable~Label separately and then calculate the p-value?
@statquest3 жыл бұрын
I describe how p-values are calculated for individual features in these videos: kzbin.info/www/bejne/sHq3enmKqM6phJo kzbin.info/www/bejne/nqDOcn-aftinbs0 The concepts apply to ANOVA in the exact same way.
@ai18886 жыл бұрын
Will the F-statistic calculated from this method be equal to the t-statistic? I understand that you are trying to standardize the way to calculate the t-test by using methods from linear regression, but does it produce the same values that a regular t-test does?
@benedettodiciaccio30246 жыл бұрын
According to this website [ onlinecourses.science.psu.edu/stat501/node/297/ ], the t-statistic and F-statistic produce equivalent p-values when the F-statistic's degrees of freedom in the numerator is 1. The relationship is t^2(n-p) = F(1,n-p), which apparently means the p-values for each will be identical. I don't know why that is, but videos on the relationship between those two distributions may help. Anyway, I assume the relationship applies here, since the df = 1 for the F-statistic numerator when comparing two groups. As a side note, most p-values for slopes in multiple linear regression are calculated with t-tests. However, F-tests comparing the variance between models with and without the slope produce an identical p-value due to the above mentioned relationship. Thinking of slope significance in terms of how much more variance the model explains with vs without the slope seems much more intuitive to me, and I'm glad I found these videos.
@ashokmulchandani28416 жыл бұрын
I love your voice both while singing and explaining statistical concepts. Thanks a ton for these videos. Would you mind if I requested videos on the following topics? 1) 2 or more factor ANOVA (to be used for reducing the number of independent variables) 2) Linear Multiple regression (to be used for reducing the number of independent variables) 3) DOE and Taguchi
@statquest6 жыл бұрын
Glad you like the videos! I've added Taguchi, DOE and 2 or more factor ANOVA to my to-do list. I believe that my video on Multiple Regression in R may already satisfy your second request: kzbin.info/www/bejne/nqDOcn-aftinbs0
@ashokmulchandani28416 жыл бұрын
StatQuest with Josh Starmer Thanks 😀
@urdeathisnear8855 жыл бұрын
Hi Josh, great work on these videos, very helpful! One question: is it safe to say that ANOVA is just a generalized t-test for >2 groups?
@statquest5 жыл бұрын
Sure, I think that is a safe thing to say.
@nr75072 жыл бұрын
Thank you, I had a few questions. At 6:37, is there a reason we did not include the residuals in the overall equation of y? Also, why do we need the y equation at 6:13 to create a design matrix? Is it not just a matrix where the number of ones corresponds to the number of data points for control, with zeros for mutant, and vice versa for the next column? Also, does the sample size have to be the same per category to create a design matrix? Great Tutorial!
@statquest2 жыл бұрын
1) This equation simply represents what goes into to the design matrix. The residual is the difference between this equation and what is observed. 2) The equation just illustrates how we create the design matrix and what it represents. 3) You don't need to have equal numbers of samples for each category (they can be different).
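As a rough sketch of the third point, a design matrix for unequal group sizes simply has more rows for one category than the other. The helper name, the labels, and the group sizes below are invented for illustration.

```python
def design_matrix(labels, categories):
    """One row per measurement, one 0/1 column per category."""
    return [[1 if label == category else 0 for category in categories]
            for label in labels]


# Hypothetical unbalanced experiment: 3 control mice and 5 mutant mice.
labels = ["control"] * 3 + ["mutant"] * 5
X = design_matrix(labels, ["control", "mutant"])

for label, row in zip(labels, X):
    print(f"{label:8s} {row}")

# The column sums recover how many measurements each category has.
column_sums = [sum(column) for column in zip(*X)]
print(column_sums)
```

Each row still turns on exactly one of the two means, so nothing about the equation changes when the groups differ in size.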
@zzzluke8906 Жыл бұрын
Your videos are extremely helpful! Can you go through things like kruskal-wallis test and why it is not sensitive to normal distribution? If you can share some insights on chi-squared test etc, it would be really helpful too!
@statquest Жыл бұрын
I'll keep those topics in mind.
@pg42343 жыл бұрын
At 10:42 if we get a small p value from the F-statistic, how do we know which of the categories is significant?
@statquest3 жыл бұрын
We then have to test each one separately to identify which one is significantly different.
@Dekike25 жыл бұрын
First of all, Thank you so much, Josh, for the time you spend sharing your knowledge about statistics. Students need more people like you... I wanted to ask something likely silly, can you make an ANOVA with an unbalanced sample? What can I do if some categories have more data than others? Thanks again, Josh!! I am looking forward to hearing from you!!!
@statquest5 жыл бұрын
ANOVA works fine with unbalanced samples. You just have more rows in your design matrix for one category than another.
@AkshayTumula9 ай бұрын
At 3:54, why is the mean the best fit? Couldn't we also draw a line parallel to the y-axis?
@statquest9 ай бұрын
The goal is to use something that will be useful for predicting the y-axis values for each data point, and a vertical line would make it impossible to predict a y-axis value for each data point.
@AkshayTumula9 ай бұрын
@@statquest Thank you, Sir, understood. Can I ask some more questions? Can we connect on LinkedIn??
@statquest9 ай бұрын
@@AkshayTumula I'm maxed out on linkedin. And I generally prefer to answer questions about my videos in the comment sections of my videos.
@AkshayTumula9 ай бұрын
@@statquest Thank You Sir Btww
@shamshersingh96809 ай бұрын
Hi Josh, at timestamp 6:48, when you write the equation y = mean of control + mean of mutant, where have the residuals gone? How will we get the value of y using this equation without residuals? In linear regression, y = mx + c helps us get y values from a given x, and the same concept is being applied here. So why are we dropping the residuals?
@statquest9 ай бұрын
We drop the residuals because it doesn't make any sense to include them in the predictions we make with this equation. The residuals only make sense when we are evaluating how well the model fits the data. But with predictions based on new data, we don't know the actual values, so we don't know the residuals.
@mariaaureliano84114 жыл бұрын
Thank you! Really great and helpful videos!
@statquest4 жыл бұрын
Glad you like them!
@tudorpricop5434 Жыл бұрын
I don't understand why at 3:10, when we are fitting the line for the control data (the same question for the mutant data), the fitting line is horizontal ? Shouldn't it be vertical, as in when it passes through all points ? It's obvious that the vertical line would fit the points the best since it passes through all the points. If someone can explain this to me I would really appreciate! Thanks:)
@statquest Жыл бұрын
Because we're using the control and mutant mice to predict gene expression. If they both make the same prediction, then there's no difference between control and mutant mice. If they make different predictions, then there's a difference. Thus, the horizontal lines tell us what the best prediction would be for a control or mutant mouse. If the lines were vertical, they would, as you say, go through all of the points, but they would not give us a predicted gene expression value.
@vanya.antonov5 жыл бұрын
Hello, Joshua! I am a bit confused at 7:42. If I understand correctly, you estimate the t-test p-value by computing the F-value (and using the F-distribution?). Although, according to Wikipedia, the test statistics in t-test follows the Student's t-distribution (and not the F-distribution). So, I was wondering if the t-test you describe here is the same as the standard t-test from the Wikipedia?
@juliar57414 жыл бұрын
I have the same question here. @StatQuest
@369standrealfine5 жыл бұрын
Thank you so much for your videos.
@statquest5 жыл бұрын
Thanks!
@albertrodrigo24322 жыл бұрын
It would be a triple BAM if you could do a quick Stat Quest about residual diagnosis in linear models!
@statquest2 жыл бұрын
I'll keep that in mind.
@thomasamet58533 жыл бұрын
Thank you so much Josh for all your amazing content and great silly songs. I don't manage to wrap my head around the reason you say the fit equation is written out like: y = mean_control + mean_mutant at 6:48 and 9:05. I would have written something like y = mean_control * x + mean_mutant (1-x), x taking 1 or 0. Any explanation on that from you or someone else is appreciated.
@statquest3 жыл бұрын
Because my equation is being multiplied by the design matrix, it is essentially the exact same thing that you have.
@thomasamet58533 жыл бұрын
@@statquest Bam!! Thank you for the explanation
@akyanus70423 жыл бұрын
Hi, how do you do a two-sample t-test with bootstrapping for RNA-seq data? There are hardly any examples in the literature. It is considered an alternative method to edgeR, but is it possible to get a bootstrapped t-test for each gene in a group comparison (like the model matrix in edgeR)? How is the bootstrap t-test used for gene expression analysis (e.g. the boot package in R)? I don't understand how differentially expressed genes are identified with bootstrapping. Can you share information on the subject?
@statquest3 жыл бұрын
I have a video that shows how bootstrapping can be used for a t-test here: kzbin.info/www/bejne/n6SolJqleNKfhZI
@akyanus70423 жыл бұрын
@@statquest Thank you very much, I checked it. I understood the hypothesis test for the mean between two groups, but I still do not understand how it is used for genes. This is complicated, I think. I wanted to see a table of t and p values for genes. Am I thinking about this wrong?
@statquest3 жыл бұрын
@@akyanus7042 Replace the responses people had to the drugs (feeling better or worse) with the read counts for a gene in different samples. For example, you might have 3 samples that took drug A and 3 samples that took drug b. For Gene "X", bootstrap the read counts for the genes and calculate p-values as described.
@akyanus70423 жыл бұрын
@@statquest Thank you.
@TheAugustinePark4 жыл бұрын
At 4:20 of the video, you mention the reason we combine the two lines of best fit into a single equation is to make the steps for computing "F" identical for regression and the t-test meaning a computer can do it automatically. In terms of what this actually looks like, I think this means having a single equation means one value for SS(fit) (instead of 2) which means we can use the "F" equation for regression. Is my reasoning correct? Also, why does a single equation mean a computer can do it automatically? Why could a computer not do it automatically if we had 2 equations? Thanks I love your videos!
@statquest4 жыл бұрын
Sure, a modern computer can handle more than one equation. But back in the day memory was limited and that limited the number of tests a computer could perform. So the original idea was to unify as much of linear models as possible into a single framework called "General Linear Models", with the idea that one equation could be used in a general setting on a computer without having to check a bunch of different conditions. In the early days, different conditions meant different look-up tables for figuring out the p-values and since computers had very little memory, this limited what they could do.
@casualcasual12342 жыл бұрын
Thanks a lot and at 8:40, after obtaining F value, to obtain p value, is it the same as in the linear regression video? another sample of data (n=9) --> obtain SS(mean) & SS(fit) --> obtain F --> plug into F value histogram and repeat... --> obtain distribution and obtain F value of original data --> p value? Thanks again in advance :)
@statquest2 жыл бұрын
The histogram that I used in the linear regression was intended to illustrate what an F-distribution represents, and it is the same here as well.
@visheshsharma21153 жыл бұрын
At 8:22, is the graph above for the t-test the fitted one or the mean one?
@statquest3 жыл бұрын
I'm not sure I understand your question, however, the graph in the top right corner at 8:22 shows a horizontal solid black line at the average of the y-axis coordinates.
@yeonseonjeon61185 жыл бұрын
Anova starts at 09:24
@urdeathisnear8854 жыл бұрын
Hi Josh, upon reviewing this, I'm wondering why you say you're using a t-test, but you actually calculate an F-statistic? In this case, isn't the two group case you show an F-test (i.e. a two group ANOVA) ?
@statquest4 жыл бұрын
t-test = two group ANOVA. In other words, a t-test is just a specific example of ANOVA, and an ANOVA is just a specific example of general linear models. In this case, the F-statistic is just the square of the t-statistic that we would have gotten for a t-test and the p-values are the exact same. There are two ways to do a t-test, the way most people teach it and by using a general linear model, and both give you the exact same results.
@urdeathisnear8854 жыл бұрын
@@statquest Great, thanks for explaining the relationship between them, very helpful! But technically, because in this video you are comparing the ratios of variances and not the difference between means across groups, this is an f-test, not a t-test, right? Or does t-test not necessarily imply comparing the difference between means (though I've seen this in multiple other resources) ?
@statquest4 жыл бұрын
@@urdeathisnear885 In both the t-test and in ANOVA, we are testing to see if the difference between (or among) the means is statistically significant. The concepts are the exact same. The differences in the equations are just technical details. In other words, if someone asked me to give them directions from my house to the grocery store, I could give them multiple routes to get there - all of them, however, would qualify as "directions from my house to the grocery store".
@urdeathisnear8854 жыл бұрын
@@statquest Sure, but in your analogy, there is likely one, optimal route to the grocery store, right? So to take the reverse approach and go from real-world to stats analogy, I guess a related question I have is: there are two types (F, T) of tests that yield two different statistics that share the same concepts, but surely there may be times where it's preferable to use one over the other, else why would there be two separate tests? If so, could you maybe give a simple example of when you'd prefer one over the other? Thanks, this feedback is really helpful!
@statquest4 жыл бұрын
@@urdeathisnear885 Ah, I have to be careful with my analogies. The F-test and Student's t-test yield different, but mathematically related, statistics. The F-distribution generalizes the t-distribution, just like the F-test generalizes Student's t-test, and it can be shown, mathematically, that a 2 sample ANOVA is equivalent to Student's t-test. So there is no difference and no reason to choose one over the other. That said, the Student's t-test was later modified (updated) by Welch to allow for unequal variances in the two groups. So there is a difference between Welch's t-test and a 2 sample ANOVA - and this is important. If you think you have different variances, then you need to use Welch's t-test (not Student's t-test or an F-test).
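To make the Student vs. Welch distinction concrete, here is a minimal sketch that computes both t-statistics by hand. It assumes the standard pooled-variance formula for Student's t-test and the Welch-Satterthwaite approximation for Welch's degrees of freedom. The function names and data are invented, with one tight group and one widely spread group of different sizes.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Student's t: pools the two sample variances (assumes they are equal)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch's t: keeps the variances separate and adjusts the df."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical data: very different variances and group sizes.
tight = [5.0, 5.1, 4.9, 5.0, 5.1, 4.9, 5.0, 5.0]
spread = [2.0, 9.0, 6.0]

t_student, df_student = student_t(tight, spread)
t_welch, df_welch = welch_t(tight, spread)
print(f"Student: t = {t_student:.3f}, df = {df_student}")
print(f"Welch:   t = {t_welch:.3f}, df = {df_welch:.2f}")
```

When the variances and group sizes differ this much, the two statistics and their degrees of freedom no longer agree, which is why the choice matters in that situation.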
@beautyisinmind21632 жыл бұрын
Hi Professor Josh, ANOVA (F-test) is often used in filter methods for feature selection. Theory says ANOVA should be used for feature selection when the target is binary, but in practice I have seen people also use ANOVA when the target is multi-class. So can ANOVA (F-test) also be applied if our target is not binary and has multiple classes? Another question: ANOVA assumes features are normally distributed, but in practice most of the time we encounter data that are not fully normal. In such cases, does it matter much if we apply it, or is a transformation compulsory?
@statquest2 жыл бұрын
ANOVA is really only intended to be used when the dependent variable is continuous.
@VCC13162 жыл бұрын
... the mutant mice are just normal mice that have a specific gene that has been knocked-out, and live in the sewers with 4 turtles. Also, this is really a fantastic intro to ANOVA, hats off.
@statquest2 жыл бұрын
Thank you!
@silentsuicide45442 жыл бұрын
I don't think I get it completely. When we have two features like in the first example, "t-test" is written over the graph, but we are calculating an F-score, which, using the F-distribution, gives us the p-value. But the definition of a t-test is that it is any hypothesis test in which the test statistic follows a t-distribution under the null hypothesis. My question is: why is it called a "t-test" if we are using an F-score and the F-distribution to get the p-value?
@statquest2 жыл бұрын
The F-distribution is a generalization of the t-distribution. In other words, the F-distribution can do everything we can do with a t-distribution and more.
@minederguy49324 жыл бұрын
How do you calculate the residuals for the equation + design matrix? Wouldn't that involve subtracting a matrix from a scalar?
@statquest4 жыл бұрын
The design matrix is just a general way to specify how each measurement fits into the equation.
@TheAugustinePark4 жыл бұрын
At 3:15 of the video, on the t-test graph we fit a horizontal line to get the least-squares fit. Intuitively, wouldn't having a line with the same placement but any slope (meaning also a different y-intercept) result in the same value for the least-squares fit since all the data points have the same x-value? Thank you
@statquest4 жыл бұрын
Any point at the mean of the data will have the same fit. I use a line to make it easier to see.
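A quick numerical check of this exchange: when every point in a group shares the same x-value, the sum of squared residuals only depends on what the line predicts at that x. The sketch below uses invented data and a hypothetical helper, `ss_residuals`.

```python
def ss_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the data and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical control group: every point sits at the same x-value.
xs = [1.0, 1.0, 1.0, 1.0]
ys = [1.8, 2.1, 2.4, 2.5]
group_mean = sum(ys) / len(ys)

# A flat line at the mean, and a steep line that also passes through
# the point (1.0, group_mean).
flat = ss_residuals(xs, ys, slope=0.0, intercept=group_mean)
tilted = ss_residuals(xs, ys, slope=5.0, intercept=group_mean - 5.0 * 1.0)

# A flat line that misses the mean fits worse.
off_mean = ss_residuals(xs, ys, slope=0.0, intercept=3.0)

print(f"flat: {flat:.3f}  tilted: {tilted:.3f}  off the mean: {off_mean:.3f}")
```

Both lines through the mean give the same sum of squares, so the horizontal line is a drawing convenience rather than a different fit.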
@TheAugustinePark4 жыл бұрын
In terms of when we should use linear regression vs. t-tests vs. ANOVA for testing our data, is linear regression for when our independent variable is continuous while t-tests and ANOVA for when our independent variable is discrete (e.g. categorical variables)? Thank you!
@statquest4 жыл бұрын
Technically, it is all linear regression. However, they give it different names. t-tests are when you have two distinct groups and ANOVA is when you have more than 2 distinct groups.
@DanWhalen5 жыл бұрын
still confused how do i interpret/operationalize "y=control1*2.2 + control2*3.6"? like at 6:54, are we saying "y=(4*2.2)+(4*3.6)"?
@statquest5 жыл бұрын
If you go back to 6:11, you see that the "design matrix" is formed from the 1's and 0's that turn on/off the values for the control mean and the mutant mean. So when you have y = column1 * 2.2 + column2 * 3.6, to predict a value for a new control sample, you plug in 1 for column1 and 0 for column2 and thus, the prediction is y = 1 * 2.2 + 0 * 3.6 = 2.2.
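The on/off behaviour described above can be written as a tiny prediction function. This is a sketch with made-up names. It also includes the y = mean_control * x + mean_mutant * (1 - x) form suggested in another comment, to show that it gives the same predictions.

```python
def predict(design_row, group_means):
    """Multiply each 0/1 entry by its group mean and add the results."""
    return sum(x * m for x, m in zip(design_row, group_means))


def predict_single_indicator(x, mean_control, mean_mutant):
    """Same prediction using one indicator: x = 1 for control, 0 for mutant."""
    return mean_control * x + mean_mutant * (1 - x)


group_means = [2.2, 3.6]   # control mean, mutant mean (from the video)
control_row = [1, 0]       # turns on the control mean only
mutant_row = [0, 1]        # turns on the mutant mean only

print(predict(control_row, group_means))  # 1 * 2.2 + 0 * 3.6
print(predict(mutant_row, group_means))   # 0 * 2.2 + 1 * 3.6
```

A new control sample therefore gets the control mean as its prediction and a new mutant sample gets the mutant mean. The two columns are never added together for the same sample.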
@drzun 6 years ago
Thanks for the awesome video. I have a question about the p-value generated from the DE analysis by DESeq2. According to the description in DESeq2, the p-value seems to be calculated from "negative binomial GLM fitting for βi and Wald statistics". Is this the same concept as in the video? Is negative binomial regression also a kind of general linear model, and is the variance of the negative binomial (μ + αμ^2) the same as SS(mean) and SS(fit)? Also, is the Wald test the same as the t-test in the video, except that n is large in the Wald test? Sorry for asking so many questions, I'm so confused.
@statquest 6 years ago
GLM stands for two things "General Linear Models" and "Generalized Linear Models". Unfortunately, those two things are different - but when most people say "GLM", they most frequently mean "Generalized Linear Models". Generalized Linear Models are, in essence, a way to adapt the concept of a "design matrix" to a variety of problems and models. For example, in this video, we used design matrices to do t-tests and ANOVA. However, these same design matrices can be used with Logistic Regression (see those videos if you're interested) and they can also be used for DE analysis with DESeq2. However, the underlying math is different in all three cases. So the good news is that if you understand design matrices, you can do amazing things in a wide variety of contexts. The bad news is that SS(mean) and SS(fit) in these videos may or may not correspond to something in another system, like with DESeq2 or Logistic Regression. Logistic Regression, for example, doesn't use least squares at all, but instead relies on maximum likelihood to optimize the fit. Does this make sense?
@drzun 6 years ago
@@statquest Thanks for the reply! I think I got your point. So the basic idea is to use the generalized linear model (GLM), which is more like a concept, to fit the data, and in the video the linear regression, which is more like a method, is used for the fitting. In programs like DESeq2, they use the negative binomial regression method to fit the RNA-Seq read counts, but the overall idea is still using GLM to describe how experimental factors (e.g. genotype and treatment) determine the expression of a gene (by a design matrix), and the p-value is kind of telling me how well the GLM fits ( or how convincing the result is).
@statquest 6 years ago
@@drzun You've got it!
@drzun 6 years ago
@@statquest Hooray! Before watching your videos, I had a really hard time understanding the statistics behind the data analysis of RNA-seq, and I can't express how grateful I am to you & the videos.
@statquest 6 years ago
@@drzun Hooray!!! That's great. I'm glad my videos were so helpful! :)
@chinmayagarg3977 4 years ago
We can fit a vertical line passing through all the points of control data which will give the Least sum of squared residuals, @3:09, right? If that's the case then why did we fit horizontal line? Thanks in advance P.S.: The channel is awesome. Recommended it to many.
@statquest 4 years ago
Sure, a vertical line would minimize the squared residuals, but you can't use it to make predictions. What Gene Expression value would you predict with a vertical line? All of them, and that makes vertical lines useless.
@teammdyss 2 years ago
@@statquest Sorry, wouldn't that effectively mean that for the t-test we're not really looking for a best fit? Calling something a fit when it's really not makes things confusing. In all the examples you show, the so-called "fit" is represented as a "mean", so wouldn't "just find the equation for the mean line" be a better rule of thumb than talking about "least squares"? Head melting right now.
@statquest 2 years ago
@@teammdyss Maybe a better way to say it is "best fit given some restrictions", and those restrictions are 1) the number of parameters we want to use and 2) we want a model that is useful for making predictions.
@haydrick 4 years ago
Hi Josh - I am struggling to understand what the p-value means in this scenario. What would be the hypothesis statement that the p-value enables us to accept / reject?
@statquest 4 years ago
The null hypothesis is that there is no difference between the groups. Thus, the p-value tells us whether having parameters (other than just the intercept) is useful for distinguishing between groups. If there is no difference, then we should fail to find that the estimated parameter values are significantly different from 0.
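One way to see what "no difference" means in practice is an exact permutation test. This is a swapped-in illustration, not the F-distribution lookup used in the video: if the group labels carry no information, every relabeling of the data is equally likely, and the p-value is the fraction of relabelings whose F is at least as large as the observed one. The data and the helper name `f_statistic` are assumptions made for this sketch.

```python
from itertools import combinations

def f_statistic(groups):
    """F = ((SS(mean) - SS(fit)) / (p_fit - p_mean)) / (SS(fit) / (n - p_fit))."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    overall_mean = sum(all_values) / n
    ss_mean = sum((v - overall_mean) ** 2 for v in all_values)
    ss_fit = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    p_fit, p_mean = len(groups), 1
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Hypothetical measurements (not from the video).
control = [1.8, 2.2, 2.6, 2.0]
mutant = [3.2, 3.6, 4.0, 3.4]

observed_f = f_statistic([control, mutant])

# Under the null hypothesis the labels are arbitrary, so enumerate every
# way of splitting the 8 values into two groups of 4.
pooled = control + mutant
indices = range(len(pooled))
f_values = []
for chosen in combinations(indices, len(control)):
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in indices if i not in chosen]
    f_values.append(f_statistic([group_a, group_b]))

# p-value: fraction of relabelings with an F at least as large as observed
# (the small tolerance guards against floating-point ties).
p_value = sum(f >= observed_f - 1e-9 for f in f_values) / len(f_values)
```

With these fully separated groups, only the original labeling and its mirror image reach the observed F, so the p-value is 2 out of 70 relabelings.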
@heisenbergren1556 2 years ago
Hi Josh, should it be F = (SS(mean) - SS(fit)) / (p_fit - p_mean) on the top of the formula? (One more bracket.)
@statquest 2 years ago
yep
@nikosterizakis 2 years ago
I might have missed that, but what is the value of 'n' in the formula?
@statquest 2 years ago
The number of observations or data points.
@hsinyenwu 7 years ago
Thanks so much for this video!!! Never heard anyone explain those concepts so well. Do you have any plan to make videos about multiple comparisons adjustment?
@aaryan9058 4 months ago
Hey Josh, could you please answer this? If I calculate the p-value using this method and also using Student's t-test, will it be the same? If yes, why? If not, why?
@statquest 4 months ago
It will be the same. When there are only two groups, the F-distribution is just the square of the t-distribution. For more details: coursekata.org/preview/book/fd645e20-5a0d-482e-ad16-ee689acb7431/lesson/15/6#:~:text=The%20F%2DDistribution%20and%20T%2DDistribution%20are%20Actually%20the%20Same&text=The%20reason%20is%20that%20fundamentally,get%20exactly%20an%20F%2Ddistribution!
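A small numerical check of that relationship, as a sketch with made-up data: compute Student's t (pooled variance) by hand, compute F from the linear-model formula, and compare t squared with F. The p-value lookups are omitted to keep the example dependency-free; because the statistics match in this way, the two-sided t-test p-value and the F-test p-value coincide.

```python
from math import sqrt

# Hypothetical measurements (not from the video).
control = [1.8, 2.2, 2.6, 2.0]
mutant = [3.2, 3.6, 4.0, 3.4]

def mean(values):
    return sum(values) / len(values)

def sum_sq(values, center):
    return sum((v - center) ** 2 for v in values)

n1, n2 = len(control), len(mutant)
n = n1 + n2

# Student's t-statistic with a pooled variance estimate.
pooled_var = (sum_sq(control, mean(control)) + sum_sq(mutant, mean(mutant))) / (n - 2)
t = (mean(control) - mean(mutant)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# F-statistic from the linear-model formula, with p_mean = 1 and p_fit = 2.
ss_mean = sum_sq(control + mutant, mean(control + mutant))
ss_fit = sum_sq(control, mean(control)) + sum_sq(mutant, mean(mutant))
f = ((ss_mean - ss_fit) / (2 - 1)) / (ss_fit / (n - 2))

# t squared and F agree up to floating-point rounding.
difference = abs(t ** 2 - f)
```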
@esperanzazagal7241 4 years ago
Is the overall mean always on the y-axis because it is the outcome of interest? Are we never interested in means on the x-axis?
@statquest 4 years ago
We are predicting the y-axis value, and that is why we are interested in the y-axis more than the x-axis (the stuff on the x-axis is only being used to predict y-axis values.)
@BeefLoverMan 3 years ago
This channel is a gift from the math gods. Question: I'm having a hard time linking this to Design of Experiments methods. It seems like it should be an easy connection, but I somehow can't quite work it out in my head. How would one use this to calculate the explained variation by individual terms of a linear model? 1 term == 1 "category"? And how do degrees of freedom factor into it?
@statquest 3 years ago
The next video in this series may help you understand how to design experiments: kzbin.info/www/bejne/eaKveKmtnpJohsU
@scottrjjh 3 years ago
I'm sure I'm missing something here, but why do we even need to do the whole design matrix thing? To get SS(fit) aren't we just adding the means of each group then taking the sum of squares of the residuals from that horizontal line?
@statquest 3 years ago
The design matrix makes it easy to do these tests with a computer.
@franciscopala8655 4 years ago
For a classification problem "Gene Expression" would be the feature and ("Control", "Mutant") the classes of the target variable (Control = 0, Mutant = 1) right?
@statquest 4 years ago
For a classification problem, you want to use logistic regression: kzbin.info/www/bejne/r3q8fIVqqMytf5o
@franciscopala8655 4 years ago
@@statquest Hi Josh! Thanks for answering! yes, I meant for selecting the k best features of a dataset based on the F-statistic
@statquest 4 years ago
@@franciscopala8655 I don't fully understand your question. Can you provide more details?
@franciscopala8655 4 years ago
@@statquest Hey Josh! Yes, I was trying to fit a classifier which predicted if someone's income was greater than $50K (income = 1) or under or equal (income = 0) based on a lot of different features (age, education level, marital status, occupation, native country, etc). I tried training a support vector classifier based on radial basis functions but each fit was taking ages because the dataset was huge, so I looked up different methods for eliminating the less relevant features and came across a function in Python's sklearn library called SelectKBest that computes the F-value of each feature and keeps the top k features. I didn't quite understand what the F-value meant so I checked StatQuest to see if there was a video about it and ended up here. At first I was struggling to understand the concept but I think I've finally wrapped my head around it. For a feature like age, I get an F-value of 2692.08 and a p-value below 1e-8. Since the F-value is large, it means that the age difference between observations with an income over 50K and under 50K is large and the ages are tightly centered around each group mean. If, on the other hand, the F-value was small, it could mean that either the mean ages of the two groups are very similar, or that the difference is large but with a lot of variance within each group. Also, I understand now that I probably shouldn't use the F-score to weed out irrelevant features for an RBF-based support vector classifier since it's not linear. Keep up the good work Josh, your channel is amazing.
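The idea described in this comment can be sketched by hand. This is a hand-rolled illustration of ranking features by a per-feature F-value, not scikit-learn's code; the toy data, the `f_value` helper, and the feature names are all made up for the example.

```python
def f_value(feature, labels):
    """One-way ANOVA F for one numeric feature split by class label."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(label, []).append(value)
    n = len(feature)
    overall_mean = sum(feature) / n
    ss_mean = sum((v - overall_mean) ** 2 for v in feature)
    ss_fit = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    p_fit, p_mean = len(groups), 1
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Made-up toy data: income label (1 = over 50K) and two numeric features.
income = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "age": [22, 25, 31, 28, 45, 52, 48, 41],
    "shoe_size": [40, 43, 41, 42, 42, 41, 43, 40],
}

# Score every feature, then keep the k best (here k = 1).
scores = {name: f_value(values, income) for name, values in features.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:1]
```

In this toy data the two income groups have very different mean ages but identical mean shoe sizes, so "age" gets a large F and "shoe_size" gets an F of zero.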
@statquest 4 years ago
@@franciscopala8655 Awesome! Sounds like you have it all figured out. :)
@hongdalin5953 6 years ago
Hi Joshua, thanks for sharing. These videos go step by step and make so much more sense to me than the hideous textbooks. I was wondering if you could make a video on repeated measures ANOVA, broken into small pieces. Thanks in advance.
@wisamtariq4412 5 years ago
Many thanks, great channel! I have a question, please: is the t-test approach here what's called "one-way ANOVA", and the F-test "factorial ANOVA", since there are more levels for the categorical variable?
@Dominus_Ryder 5 years ago
StatQuest, is there a version of a T-Test, or an ANOVA Test, that allows me to compare the Standard Deviation, Skewness, or Kurtosis of two or more sample means to see if there is any statistical difference between the two? If not, is there any particular reason why? To me, it seems as if knowing if these statistical quantities were different from each other would also provide useful information or features for machine learning algorithms.
@statquest 5 years ago
This is a great question. Unfortunately there are not many good or well known tests to compare standard deviations and other features (other than means). I'm not sure, but it could be that this is due to the lack of a central limit theorem like concept for standard deviations etc. (That's just a guess, so don't quote me on that).
@yimingshao4240 2 years ago
Thanks a lot for your video, it's really helpful! But I have a question: why can the equation for y be written as y = mean(control) + mean(mutant)? Where are the residuals in each set of data?
@statquest 2 years ago
I'm not sure I understand your question. The residual for each measurement is paired with that measurement, so it is easy to keep track of.
@glaswasser 3 years ago
Can you make a StatQuest about linear mixed models / random effects? I'm extremely confused about them, when to use them, and how to interpret the results...
@statquest 3 years ago
I'll keep that in mind.
@4wanys 3 years ago
Hi, thank you for the video. Is the t-test a machine learning regression with discrete inputs?
@statquest 3 years ago
I'm not sure what your question is. A t-test is a way to compare two categories of things (like "normal diet" vs "special diet") when you measure something continuous (like weight).
@krishnag5734 3 years ago
Hi Josh, Thanks for the video. :) What about adding residuals to the equations at 6:27 and 6:57? Isn't it necessary?
@statquest 3 years ago
The residuals are squared and added when we solve for the optimal parameters. For details, see: kzbin.info/www/bejne/pJyVdIR_idKSm9E
@krishnag5734 3 years ago
@@statquest thank you josh :)
@bernaridho 1 year ago
Where is part 1? I did not find it in your description.
@statquest 1 year ago
kzbin.info/www/bejne/pJyVdIR_idKSm9E
@madanmohan9907 6 years ago
What does n refer to in the F formula?
@statquest 6 years ago
'n' is the number of samples you have. So if you measured "gene expression" in 3 controls and 5 mutants, then n=8.
@madanmohan9907 6 years ago
@@statquest Thanks a lot
@mook1481 3 years ago
Please do a MANOVA video!! This was so useful. I'm doing a 2x2x3 MANOVA for my research project and would really appreciate a video :)
@statquest 3 years ago
I'll keep that in mind.
@siddharthkhattak8381 5 years ago
I have a question: while calculating F we use SS(mean) and SS(fit), but for the t-test or ANOVA there will be 2 and 5 SS(fit) values (respectively, as per this video). Do we take an average of all the SS(fit) values or add them?
@statquest 5 years ago
SS(fit) is the sum of all of the squared residuals (the difference between the actual observation and the lines we fit to the data.)
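In code, that means there is a single SS(fit) for the whole model: every observation contributes one squared residual, measured from its own group's fitted line, and they are all added together. A minimal sketch with made-up data for three groups (the group names are hypothetical):

```python
# Hypothetical measurements for three groups (not from the video).
groups = {
    "control": [1.8, 2.2, 2.6],
    "mutant_a": [3.2, 3.6, 4.0],
    "mutant_b": [2.9, 3.1, 3.3],
}

squared_residuals = []
for values in groups.values():
    group_mean = sum(values) / len(values)   # the line fit to this group
    squared_residuals += [(v - group_mean) ** 2 for v in values]

# One SS(fit) for the whole model: add every squared residual, no averaging.
ss_fit = sum(squared_residuals)
```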
@danielsobczynski2107 2 years ago
Hi Josh, great video as always. Just wanted to ask, what happens to the residual in the equations earlier in the video that had “+ residual” in them? Thanks so much for your help, definitely learning a lot.
@statquest 2 years ago
What time point, minutes and seconds, are you asking about? However, I'm guessing that you are asking about the difference between the equation that perfectly fits the data (because it includes the means + the residuals) and the equation that generates the residuals (because it only includes the means). The equation that does not include the residuals is the one we use to make predictions with future data.
@danielsobczynski2107 2 years ago
@@statquest Thanks Josh, that is the point I was asking about, I will review the video again once more
@raghavgaur8901 4 years ago
Hi Josh, I just wanted to confirm: if we have data with very high cardinality then we would use ANOVA, and if we have data with only two categories then we would use a t-test, right?
@statquest 4 years ago
When you only have 2 categories, you use a t-test. When you have more than 2 categories, you use ANOVA. However, as you can see, a t-test is just a special case of ANOVA.
@raghavgaur8901 4 years ago
@@statquest thanks for answering
@chongliwinston5225 4 years ago
Dear Josh, just to make sure: F = ((SS(mean) - SS(fit)) / (p_fit - p_mean)) / (SS(fit) / (n - p_fit)), right? That is, the numerator is "SS(mean) - SS(fit)" over (p_fit - p_mean), and not SS(mean) minus "SS(fit) over (p_fit - p_mean)", right?
@statquest 4 years ago
That's correct - I was a little sloppy with the parentheses when I made these videos.
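Written out with explicit grouping, the formula confirmed in this exchange reads as follows (the subscript notation is just a typesetting choice for this note):

```latex
F = \frac{\bigl(SS(\text{mean}) - SS(\text{fit})\bigr) / (p_{\text{fit}} - p_{\text{mean}})}
         {SS(\text{fit}) / (n - p_{\text{fit}})}
```

As a worked case, assuming the two-group t-test setup with 3 controls and 5 mutants mentioned elsewhere in this thread: n = 8, p_fit = 2 (one mean per group) and p_mean = 1 (the overall mean), so the numerator is divided by 2 - 1 = 1 and the denominator is SS(fit) divided by 8 - 2 = 6.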
@chongliwinston5225 4 years ago
StatQuest with Josh Starmer noooo, this is very helpful, really appreciate that. I have to make sure just because I am not that familiar with this.
@AkshayJain-r5f 10 months ago
Why are we fitting the mean for the t-test instead of a vertical line?
@statquest 10 months ago
The goal is to use something that will be useful for predicting the y-axis values for each data point, and a vertical line would make it impossible to predict a y-axis value for each data point.
@eye_oph 2 years ago
Hi Josh, great video as always. Just wanted to ask, How to do the post hoc tests in linear models just like post hoc tests in ANOVA to explore differences between two groups? Thank you.
@statquest 2 years ago
Post-hoc tests with ANOVA are just a matter of defining your "design matrices", which I illustrate in the next video in this series: kzbin.info/www/bejne/eaKveKmtnpJohsU
@eye_oph 2 years ago
@@statquest If there are three drugs: drug A, drug B, and drug C, we use drug A as the reference level. We then use dummy coding to compare B vs. A; C vs. A in the linear model. In the linear model, we can determine the difference of B vs. A; C vs. A by calculating the p value of the coefficient. However, it seems that we can not determine the difference of B vs C in the above linear model? Thank you for your reply.
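A sketch of how the B vs. C comparison relates to the dummy-coded model described in this comment, using made-up data. With drug A as the reference, the least-squares coefficients are differences of group means, so B vs. C is the difference between the two coefficients, and it matches the coefficient you would get by refitting with C as the reference level. Note that this only recovers the estimated difference; a p-value for it also needs the standard error of that contrast, which is not computed here.

```python
# Hypothetical responses for three drugs (not from the video).
drug_a = [5.0, 6.0, 7.0]
drug_b = [8.0, 9.0, 10.0]
drug_c = [6.0, 7.0, 8.0]

def mean(values):
    return sum(values) / len(values)

# Dummy coding with drug A as the reference level:
# y = intercept + b_coef * is_B + c_coef * is_C
intercept = mean(drug_a)
b_coef = mean(drug_b) - mean(drug_a)   # B vs. A
c_coef = mean(drug_c) - mean(drug_a)   # C vs. A

# The B vs. C difference is a contrast of the two coefficients...
b_vs_c = b_coef - c_coef

# ...and it equals the B coefficient after switching the reference to C.
b_coef_with_c_reference = mean(drug_b) - mean(drug_c)
```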
@user-hs7pv2qp2e 2 years ago
Can the SS(mean)-SS(fit) part of the F equation be written as SS(mean-fit)?
@statquest 2 years ago
Unfortunately no.
@sushrutdhiman418 3 years ago
How is F related to the p-value? Kindly explain.
@statquest 3 years ago
We explain that in part 1: kzbin.info/www/bejne/pJyVdIR_idKSm9E