Design Matrices For Linear Models, Clearly Explained!!!

Views: 138,122

Comments: 199

@statquest 2 years ago

Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

@macroxela 4 years ago

Statistics never appealed to me since it always seemed boring ... until I started watching your videos a few days ago. Now I'm hooked. Thanks for making statistics so fun and intuitive to learn!

@statquest 4 years ago

Awesome! I'm glad you are enjoying learning stats! It's a fun Quest! :)

@PunmasterSTP 8 months ago

Out of curiosity, what do you think of stats three years later?

@fanzhang3746 2 years ago

SQ is so addictive. A simple YouTube search to clarify one concept led me down hours and hours of SQ content. Thank you, thank you, thank you!

@statquest 2 years ago

Wow! Thank you!

@taotaotan5671 4 years ago

WOWW. I have watched this video at least 5 times and I always learn something new! I was confused by people saying "regress out the batch effect", but it's that simple!!! Thanks Josh.

@statquest 4 years ago

BAM!!!

@leylayim 4 years ago

Thank you! I can't believe how clearly you explain this, seriously thank you!

@statquest 4 years ago

Glad it was helpful!

@MsDontBlink 4 years ago

I'm so happy when I look for a topic and see that you've covered it.

@statquest 4 years ago

Bam! :)

@paulpaschert6215 5 years ago

"turning something on by letting it be" - some proper life advice there

@dbarkan1 5 years ago

In the last part where you combine the linear regression and the t-test, you have a regression line for each category, but the slopes of the lines are identical. Isn't this rare? How would the equation change if you had two lines with different slopes?

@LikeWaterBeWater 3 years ago

My guess is: use one parameter for each slope. For the first parameter, load the control group's weights as is but keep the mutant group's weights at 0. For the other parameter, do the opposite.

@abdoualgerian5396 1 year ago

I think he didn't want to make it more complicated than it needs to be. It looks like there is more to it than this simple explanation, so maybe he will talk about it in another Quest. PS: this reply comes 4 years after your comment, and by now he might have done it. I'm watching the videos one by one.

@Samurai_Jack__ 1 year ago

@@abdoualgerian5396 Have you found something like this by now?

@GregSteg 5 years ago

6:06, having flashbacks to weeks one and two of Andrew Ng's ML Coursera course, but now it feels more intuitive!

@statquest 5 years ago

Wow! That's quite a compliment. Thank you. :)

@laurag.6122 4 years ago

I never get tired of watching your videos, I have learned a lot. This is my favorite channel :) Thanks!!!!!! Would you consider making a video on assessing the significance of mixed models? Please! This topic is complicated.

@statquest 4 years ago

I hope to cover mixed models in the future.

@sudinroy7979 3 years ago

All the topics on StatQuest are well explained. Thank you, sir, for this nice statistics channel. Good wishes and a happy journey for this successful StatQuest YouTube channel.

@statquest 3 years ago

Thank you!

@amoghbharadwaj9252 4 years ago

Wow, so helpful! This cleared up my doubt about combining and interpreting categorical and continuous predictors. Thanks a ton :)

@statquest 4 years ago

Hooray!!! Thank you very much. :)

@Russet_Mantle 4 years ago

13:32 About the term for difference (mutant - control): is that an average of Lab A's difference (mutant - control) and Lab B's difference (mutant - control)?

@statquest 4 years ago

In this case, yes. If the data are unbalanced (i.e. we have more measurements from lab A than lab B), it might be the weighted average.

@Russet_Mantle 4 years ago

@@statquest Got it. Thanks a bunch!

@jaychan3207 4 years ago

BAM!!! crystal clear explanation! Thanks!

@statquest 4 years ago

Glad it was helpful!

@parthbhardwaj8435 5 years ago

Please do a StatQuest on Wald's test, the chi-squared test and Fisher's exact test!!

@WalyB01 4 years ago

First you would need random variables for the Wald test.

@rizkykiky7721 2 years ago

How do you calculate the p-value from that F-value? I was a bit lost there.

@statquest 2 years ago

See: kzbin.info/www/bejne/pJyVdIR_idKSm9E
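
Several commenters ask how an F-value turns into a p-value. As a rough illustration only (not the method used in the video), the sketch below approximates the p-value by simulation: it draws many F-values that could arise when the null hypothesis is true and reports the fraction at least as large as the observed one. The function names, the observed F of 10.0 and the degrees of freedom (2 and 5) are assumptions made up for this example; in practice a statistics library's F-distribution function would normally be used instead.

```python
import random

def simulate_f(df_num, df_den, rng):
    """Draw one F value under the null: a ratio of two scaled chi-squared variables."""
    num = sum(rng.gauss(0, 1) ** 2 for _ in range(df_num)) / df_num
    den = sum(rng.gauss(0, 1) ** 2 for _ in range(df_den)) / df_den
    return num / den

def approx_p_value(f_observed, df_num, df_den, n_sims=20000, seed=0):
    """Fraction of simulated null F values at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so repeated calls reuse the same draws
    hits = sum(simulate_f(df_num, df_den, rng) >= f_observed for _ in range(n_sims))
    return hits / n_sims

# Hypothetical setting: p_fancy - p_simple = 2 and n - p_fancy = 5
p_value = approx_p_value(10.0, 2, 5)
print(p_value)  # a small value: an F this large is rare when the simple model is adequate
```

The larger the observed F, the smaller the fraction of simulated values that exceed it, which is why a big F leads to a small p-value.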

@karannchew2534 2 years ago

7:46 Compare mean model vs type-only model: p>0.05
12:17 Compare size-weight-type model vs type-only model: p=0.0025

@statquest 2 years ago

bam! :)

@rrrprogram8667 6 years ago

Awesome one, Josh... Keep up the great work...

@statquest 6 years ago

Thank you!

@briankirk962 4 years ago

First of all, props for your excellent series of videos. First-rate introduction to some really hard stuff. An answer to your question at 2:46 is that the problem in general implies two distinct linear equations, but algorithms for linear models (e.g. the lm() function in R) only allow for one general linear equation. So yes, you can solve it by hand using two separate linear equations, but the algorithm won't let you enter the problem in that format. So how do you get around this problem of having only one linear equation to work with when you have two linear equations in reality? Answer: break up the linear model by using dummy variables in a thoughtful manner, or let the algorithm do it for you but check that it's not messing with you. Either way you've got to know what's going on, and here's one explanation...
For the mutant/control example we have the following linear model. (Note: i should be read as a subscript for the ith term and e is the error term. So ei is NOT some madness in the complex plane but simply the ith error term. If ei throws you, just treat it as a symbol related to irreducible error, the noise that is always around. That's all it is.)
yi = B0 + B1 xi + ei (eq 1)
(pretty much y = b + mx with e as some reality thrown in)
where B1 is the slope of the line with xi as its associated input values (e.g. the labels mutant and control, but as values in this example) and B0 is the y-intercept. As you can see, there are no input values associated with B0, so we cannot directly associate input values with B0 through the first column of the design matrix. This explains why the first column of the design matrix is fixed to all ones. This is essentially saying that B0 exists for all i and it's up to the gods of regression to determine what B0 becomes. All is not lost though. Nothing says we can't monkey with the linear model (eq 1) through its variable xi in a creative way that ends up associating B0 with a label. And that's what we're going to do.
But first we need to deal with the issue that our labels are not numbers, and this creates an opening for some linear equation monkey business without defying the gods. Since our equation won't work on labels, we need to assign numerical values (dummy variables), and by selecting the appropriate dummy variables for our labels we can separate the general equation into two separate equations, each of which corresponds uniquely to one label. Word of caution though: how we select our dummy variables determines how our labels get assigned to the separate equations, so it's not an arbitrary choice. So let's try the following:
xi = 1 if i is a mutant
xi = 0 if i is a control
(0 and 1 are the dummy variables and here we are assigning actual numerical values to xi. These are the values assigned in the second column of the design matrix.) In that case
yi = B0 + B1 xi + ei (eq 1)
becomes
yi = B0 + B1 + ei if i is a mutant (eq 2)
yi = B0 + ei if i is a control (eq 3)
(Notice that there is no longer any separate xi term in eqs 2 & 3, since xi has been assigned dummy variable values.) This allows us to interpret our controls relative to B0, whereas our mutants correspond to B0 + B1. Pretty slick and no lightning bolts from above. In this case B0 is the mean for the controls (the intercept in the summary report), whereas B0 + B1 is the mean for the mutants. It's important to note that B1 (what is returned second in the summary report) is the mean difference between mutants and controls (i.e. mean of mutants - mean of controls). If the p-value for B1 is significant, that means adding the difference of mutants - controls to our model is significant with respect to the control alone (i.e. mutants are different relative to the controls, which is what we're interested in).
Now if we switched our dummy variables, xi = 1 for controls and xi = 0 for mutants, then B0 would be the mean for mutants, B0 + B1 would be the mean for the controls, and B1 would be the mean of controls - mean of mutants (i.e. it got reversed). If the p-value for B1 is significant here, that means adding the difference of controls - mutants to our model is significant with respect to MUTANTS alone (i.e. controls are different relative to mutants, so equivalent to what we want but kind of upside down and weird). To get totally weird, we could assign xi = 1 for mutants and xi = -1 for controls; then B0 would be the overall average for the combination of mutants and controls.
Bottom line: how we set our dummy variables determines how we can interpret B0 (as well as B0 + B1 and B1), and it is a slick trick that allows us to separate out from our linear model two linear equations that uniquely correspond to our labels. An Introduction to Statistical Learning by Gareth James et al. gives some nice examples of this on pg 84 at this level of math. It is available for free online and also provides details on how to assess the quality of your model, which is critical.
And if you've gotten this far... Hey Josh, how about some banjo??? Some Ola Belle Reed would fit nicely here... I've endured, I've endured, how long can one endure!!!!
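
The effect of the three dummy codings described in this comment can be checked numerically. The following is a minimal sketch, assuming made-up measurements for four control and four mutant mice; `fit_line` is a hypothetical helper that applies the textbook least-squares formulas for one predictor. With equal group sizes, as here, the -1/+1 coding puts the intercept at the overall average.

```python
def fit_line(x, y):
    """Least-squares intercept B0 and slope B1 for y = B0 + B1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical measurements, four mice per group
control = [2.0, 3.0, 4.0, 3.0]   # mean 3.0
mutant = [5.0, 6.0, 7.0, 6.0]    # mean 6.0
y = control + mutant

# Coding A: control = 0, mutant = 1 -> B0 is the control mean, B1 is mutant - control
b0_a, b1_a = fit_line([0] * 4 + [1] * 4, y)
# Coding B: control = 1, mutant = 0 -> B0 is the mutant mean, B1 is control - mutant
b0_b, b1_b = fit_line([1] * 4 + [0] * 4, y)
# Coding C: control = -1, mutant = +1 -> B0 is the average of the two group means
b0_c, b1_c = fit_line([-1] * 4 + [1] * 4, y)

print(b0_a, b1_a)
print(b0_b, b1_b)
print(b0_c, b1_c)
```

The fitted values for each mouse are identical under all three codings; only the meaning of B0 and B1 changes.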

@statquest 4 years ago

Wow! You get a prize for longest comment ever. You even have equation numbers. Very nice! :)

@briankirk962 4 years ago

@@statquest Curious how the mathematics is short and concise whereas the exegesis on equations delves into the land of Proust. Perhaps there is something to this mathematics thing...

@summerxia7474 3 years ago

So clear!!! Thank you for clearing up my confusion in such a simple way!

@statquest 3 years ago

Glad it was helpful!

@gren287 6 years ago

I love you, it's out now, but I need a longer intro. :)

@statquest 6 years ago

This one is way too short! :)

@sharan9993 1 year ago

2:40 It might be because the standard needs only one bit to represent both values, since the only change is in the 2nd bit. We can just ignore the 1st bit while storing, thus reducing the size. Just speculation.

@statquest 1 year ago

Perhaps

@dainegai 5 years ago

Great video (as usual)! You're definitely one of my favorite "thing-explainers" I've come across :D
I was left with a question near the end though, with respect to "correcting for batch effects". After a quick online search, I see this is a common issue and many packages attempt to correct it. I could imagine two scenarios that lead to different explanations for the difference:
i) "We ran the exact same protocol in two different labs. However, the sensors were differently calibrated, so there is a bias in readouts." -> This suggests the batch-effect correction.
ii) "We ran the exact same protocol in two different labs. We ensured the sensors were equally calibrated, but there's *still* a bias in readouts." -> This could just be due to inherent variability in the sample, right? (It is probably not *too* likely for the data to be the same, just shifted down a bit. But it's possible!)
Questions:
1) This correction *assumes* that the difference in batches is *not* due to inherent variability in the features we're measuring (but is instead due to e.g. technician error), right? There would be no way to *prove* it one way or the other, would there?
2) If it's (ii), wouldn't "correcting for batch effects" throw out useful information about the response variable's distribution?
3) Ideally, hopefully both labs calibrated their sensors via e.g. blanks, so (i) shouldn't be immediately the reason. How would you suggest teasing out sensor bias (i) vs sample variability (ii)? Would we have to assume a model for the data and compare whether Lab A's two groups' parameters significantly differ from Lab B's? (Or maybe the "ideal" situation happens infrequently enough that going for (i) is usually not unreasonable?)
Thanks again! Will continue to Quest On :D

@statquest 5 years ago

If you are worried about whether or not "sensor bias" plays a big role in your measurements from two labs, you can always do technical replicates. In other words, have each lab do the experiment 3 different times. If Lab A is always higher than Lab B (or the other way around, or a t-test suggests that the results are significantly different), then you can be pretty confident you have a batch effect due to sensor bias or the technician or something like that.

@munaalhammadi4237 3 years ago

Thank you for this clear explanation. I just have a question at 8:26: if the lines had different slopes, what would the design matrix look like in that case?

@statquest 3 years ago

If the lines have different slopes, then you have something called an "interaction". This means the mutation has a different effect on different weights. So we would add an "interaction term" to help compensate for this. Interactions are a whole StatQuest for themselves and one day I'll make it.

@channel_panel193 3 years ago

@@statquest oooo +1 for a StatQuest on interaction terms plz!

@statquest 3 years ago

@@channel_panel193 If all goes according to plan this month, I should have that video out soon.

@MR-yi9us 2 years ago

@@statquest yes! +1 for a StatQuest on interaction terms too!

@dandyyu8561 4 years ago

Thank you very much! I really learned a lot from your channel. I have a question about the second term, the Lab B offset, at 13:25. Is the Lab B offset = lab B control mean - lab A control mean?

@statquest 4 years ago

Yes.

@dandyyu8561 4 years ago

@@statquest Thank you so much! By the way, can I treat the last example as a simple linear mixed model?

@statquest 4 years ago

No, mixed models are different. My understanding is that in this case, we are measuring distinct labs and not trying to generalize to other labs, and this constitutes a fixed effect. There is no random effect, so there is no "mixture" of effects.

@mesmaeili1 3 years ago

Great. Really clearly explained. Thanks.

@statquest 3 years ago

Glad you liked it!

@Han-ve8uh 4 years ago

1. The idea I get from this video is that we can use any design matrix we want to create a test between any complex vs simpler model and interpret the significance of their difference in equation terms from the p-value, right?
2. I have trouble relating the conclusion at 11:08 (p-value small --> fancy better than simple mean model) to linear regression and what the p-value here means. Why is there a p-value for every coefficient in linear regression (so a whole linear regression has multiple p-values) but only a single p-value here?

@statquest 4 years ago

I answer your question about what all of the p-values are for when doing a relatively fancy linear regression in the follow up video: kzbin.info/www/bejne/fqPVY5SkrrCSa9U

@Han-ve8uh 4 years ago

@@statquest Thanks for the heads up, I didn't know that video existed. I watched it and it clearly explained the different p-values. Something left unexplained was what the p-value of the intercept means. Is that a comparison of control group mice with a line that must pass through the origin, vs mutant mice with the mutant offset amount above that? I think someone else here asked this too: why is the slope the same for both control and mutant? Could it have been modelled as 2 different slopes, something like 2 new columns in the design matrix, slopecontrol 11110000 and slopemutant 00001111, to replace the single slope column? Does this work? If these p-values make sense, would I be able to infer anything about the single slope used in this video from the results of those 2 type-dependent slopes?

@statquest 4 years ago

@@Han-ve8uh The answers to your questions about the design matrix are in this video starting at 0:33 (if you want to see the worked-out example, see: kzbin.info/www/bejne/hHeYkJWqhMZ2n8k ). As for your question about the p-value for the intercept: this just tells us if the intercept value is significantly different from 0. Generally speaking, we are not interested in this (one way or the other) since we are more interested in comparing the two groups.

@amirwagih4797 3 years ago

10:33 Why are the degrees of freedom of the fancy model n - 3? Since we have two lines, shouldn't we have n - 2*2 = n - 4 degrees of freedom, because each line can pass through any two points?

@statquest 3 years ago

Since both lines share the exact same slope, we only need to estimate one parameter for it instead of two (one for each line).

@amirwagih4797 3 years ago

@@statquest Thanks a lot, Josh, I get it now. I always struggled with degrees of freedom and I would really love to see a StatQuest about that. Thanks again for the amazing content you produce!
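
To make the role of n - 3 concrete, here is a small sketch of the standard F statistic for comparing a simple model to a fancier one that contains it. The fancy model in this thread has three parameters (control intercept, mutant offset, one shared slope), so its residual degrees of freedom are n - 3. The sums of squares and the sample size below are invented for illustration, and `f_statistic` is a hypothetical helper name.

```python
def f_statistic(ss_simple, ss_fancy, p_simple, p_fancy, n):
    """F = [(SS_simple - SS_fancy) / (p_fancy - p_simple)] / [SS_fancy / (n - p_fancy)]."""
    numerator = (ss_simple - ss_fancy) / (p_fancy - p_simple)   # variation explained per extra parameter
    denominator = ss_fancy / (n - p_fancy)                      # leftover variation per residual degree of freedom
    return numerator / denominator

# Hypothetical values: 8 mice, mean-only model (1 parameter) vs two parallel lines (3 parameters)
f_value = f_statistic(ss_simple=100.0, ss_fancy=20.0, p_simple=1, p_fancy=3, n=8)
print(f_value)
```

If each line had its own slope, the fancy model would have four parameters and the denominator would use n - 4 instead.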

@patelprateekramesh2442 5 years ago

What if the slopes are different? Do we consider two separate terms for each slope? (and similarly for more terms when there are more variables)

@statquest 5 years ago

If the slopes are significantly different, then there is an "interaction". This means the drug, or whatever it is you are comparing, has a different effect on the different groups. In this case, you add an "interaction term"... and that's the subject of another video.

@jasperkirton6848 5 years ago

@@statquest Thanks for this! Do you have this other video or can point me in the right direction?

@AnimeshSharma1977 6 years ago

Awesome, wondering how you deal with missing values in such cases?

@statquest 6 years ago

That's a great question. I think you have to use some method to impute the missing values.

@AnimeshSharma1977 6 years ago

@@statquest Will the random forest imputation method you suggested work here? :)

@statquest 6 years ago

@@AnimeshSharma1977 Yes it would. But, depending on what you're modeling, there might be some specialized method that may work better. I'd look around and if I didn't find anything, try out the Random Forest method. I love how flexible it is.

@alexandergarcia6479 5 years ago

Maybe a moving average?

@alonsomartinez9588 2 years ago

It would be nice to see a video on how matrices and matrix multiplication in neural networks transform data, and edit the dimensionality of the inputs. Transforming data makes a lot of sense spatially, but what does changing dimensions do? What are the different ways in which you can interpret matrix notation? Talk about the special relationships once you organize the data in that particular format

@statquest 2 years ago

I'll keep that in mind.

@Harshavardhan-bu2tp 4 years ago

In the last example, for the batch effect, did you assume that the difference (mutant - control) is the same for both labs?

@statquest 4 years ago

Yes

@gabrielcournelle3055 4 years ago

Awesome video as usual. Thank you

@statquest 4 years ago

Thanks! :)

@angelamilton5134 2 years ago

Please do you have any video on how to build the A matrix from a stochastic model?

@statquest 2 years ago

Not yet!

@rayaneadam4778 29 days ago

In the case where I want to compare to the full model (to check the significance of some parameters in explaining the variability in y), is the fancy model in this example considered the full one?

@statquest 28 days ago

yep

@claradong4649 5 years ago

I love your video, really easy to understand

@statquest 5 years ago

Thank you! :)

@nicholaskiulia9649 3 years ago

Could you also kindly explain Dirichlet regression and segmented regression?

@statquest 3 years ago

I'll keep that in mind.

@karannchew2534 2 years ago

13:31 y = labA control mean + labB offset + difference
Shouldn't there be two "difference" terms, i.e. y = labA control mean + labB offset + difference.labA + difference.labB?

@statquest 2 years ago

@gabrielpadilha8638 3 years ago

Do a StatQuest on the F distribution, please.

@statquest 3 years ago

I talk about the F-distribution in this video: kzbin.info/www/bejne/pJyVdIR_idKSm9E

@kventinho 6 years ago

Statquest is getting bigger, watch out! hahaha i can't stop humming this

@statquest 6 years ago

Nice!!! :)

@RadomName3457 3 years ago

Hi Josh, could I ask which distribution table you looked at to get the p-value for the F I calculated?

@statquest 3 years ago

See: kzbin.info/www/bejne/fqPVY5SkrrCSa9U

@gianmarcolevantino1239 5 months ago

Hi! Can you briefly explain how to obtain p-values from F? I'm currently preparing for "Advanced Statistics for Business" in the management engineering course at the University of Palermo and I'm really enjoying your videos! Thanks a lot!

@statquest 5 months ago

I give the concepts in this video: kzbin.info/www/bejne/pJyVdIR_idKSm9E

@deletedacc27834 5 months ago

Hey Josh! Thanks for this video. For the first example, I was curious about the equation y = control intercept + mutant offset + slope. Suppose our slopes for our two lines were different. Would the exchange become: y = control intercept + mutant intercept offset + control slope + mutant slope offset? Thanks!

@statquest 5 months ago

If the slopes are different, then you have something called an "interaction" between the classes (control vs mutant) and the things we are measuring. Interactions have to be dealt with in a special way and would require an entire video to explain. In the meantime, check out this page: developer.nvidia.com/blog/a-comprehensive-guide-to-interaction-terms-in-linear-regression

@sudinroy7979 3 years ago

Are there any applications of design matrices other than regression analysis?

@statquest 3 years ago

Design matrices can be used for all general linear models (which includes linear regression, but also ANOVA and many other more complicated models) as well as all generalized linear models (which includes logistic regression and many other more complicated models).

@CompBioQuest 3 years ago

A video about interaction terms would be great! And how to use them for deconvolution of cell types. :-)

@statquest 3 years ago

I'd like to do that one day.

@diamagneetik 4 years ago

Sorry! But how did you get p-value = 0.003 (at 11:06)? From a table?

@statquest 4 years ago

I talk about that in Linear Models Part 1 (this is part 3!): kzbin.info/www/bejne/pJyVdIR_idKSm9E

@pendantdrop3710 5 years ago

How do you interpret R-squared if you have those 2 regression lines? Which line does R-squared refer to?

@statquest 5 years ago

By default, the R-squared value compares the residuals around the full model (in this case, that's the two lines) to the residuals around a single, horizontal line that is at the height of the average y-axis value.
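
The comparison described in this reply can be written as R-squared = (SS(mean) - SS(fit)) / SS(mean). Below is a minimal sketch with made-up numbers: `fitted` stands in for whatever the full model predicts for each mouse (for the two-line model, the point on that mouse's own line), and `r_squared` is a hypothetical helper name.

```python
def r_squared(y, fitted):
    """Share of the variation around the mean that the full model accounts for."""
    mean_y = sum(y) / len(y)
    ss_mean = sum((yi - mean_y) ** 2 for yi in y)               # residuals around the horizontal mean line
    ss_fit = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residuals around the full model
    return (ss_mean - ss_fit) / ss_mean

# Hypothetical observations and full-model predictions
y = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(y, fitted))
```

So there is a single R-squared for the whole model, not one per line.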

@yenhoeooi9 3 years ago

For the mouse weight/mouse size/mutant example, if the two slopes are different, does that mean my new equation can be: y = control intercept + intercept offset + control slope + slope offset? And now my new Pfancy = 4, is that true?

@statquest 3 years ago

In linear regression terminology, if the slopes are different we call it an "interaction" and the new equation has what we call an "interaction term", but the idea is the same - it compensates for the differences in slopes.

@manouheart4906 4 years ago

Hi, thx for all the hard work. Could you explain what the difference(mutant-control) at 13:36 is?

@statquest 4 years ago

The difference(mutant-control) is the difference between the mutant and control groups.

@manouheart4906 4 years ago

@@statquest So, is that equal to control_mean(LAB_A)-mutant_mean(LAB_A)+control_mean(LAB_B)-mutant_mean(LAB_B)?

@statquest 4 years ago

@@manouheart4906 Off the top of my head I can't remember exactly how it is calculated, but I suspect it is some sort of weighted average of the differences between control and mutant in labs A and B.

@RandomGuy-hi2jm 4 years ago

Can we use the Pythagorean theorem to find coordinates on the line?

@statquest 4 years ago

Probably, but it's easier to just plug in the x-axis values into the equation.

@dreama1375 3 years ago

Thank you very much for your video! Can you please tell me whether what you explained (when you combined the t-test and regression) is also called mixed modelling or multilevel modelling?

@statquest 3 years ago

I believe mixed models are used when you do not have enough data to create a proper design matrix like this.

@HunterDriguez 4 years ago

Awesome video! In the Control vs Mutant scatterplot @6:16, is each individual data point meant to represent the expression of a single gene within a control or mutant replicate (4 reps per group)? I'm just trying to make sense out of a design matrix that I got from my RNA-seq data with thousands of genes. I wonder if the SSmean for each sample is being calculated using the mean expression of all 18,000+ genes....I'm wondering the same for the SSfit...sigh.

@statquest 4 years ago

If you're doing RNA-seq, then you are probably using edgeR or DESeq2. Both of those methods use just the genes with similar expression to get a sense of SSmean and SSfit.

@HunterDriguez 4 years ago

@@statquest thanks! I will go read more about what edgeR does.

@this-is-bioman 2 years ago

You actually need to watch these two parts backwards as the purpose of the t-test becomes clear at the end of this video. I wish you showed us first why we do that and then explained how.

@statquest 2 years ago

Noted

@PunmasterSTP 8 months ago

Design matrices? More like "Dang good videos are these!"

@statquest 8 months ago

Ha! BAM! :)

@shamshersingh9680 9 months ago

Hi Josh, the 2 lab example is a bit confusing.
1. First of all, when we say the difference between control and mutant means, do we mean (lab A control mean - lab A mutant mean) and (lab B control mean - lab B mutant mean)?
2. Secondly, what is the lab B offset? Is it (lab A control mean - lab B control mean)?
3. How does lab B mutant equal (lab A control mean + lab B offset + difference)? As per the figure, lab B mutant should be equal to lab A control mean + the difference between lab A control mean and lab B mutant mean. Why do we have the lab B offset in this equation?

@statquest 9 months ago

1) In this case, we use the average of the control-mutant differences in lab A and lab B. Thus, we have a single difference that we use for both lab A and lab B.
2) Yes.
3) We do it the way we do it since we have a single value that represents the difference between the mutants and the control, regardless of the lab.
NOTE: The reason we use a single difference is because we assume (hypothesize) that the effect of the mutation is the same, regardless of the lab, and that the only differences are in the lab itself - possibly due to different measurement techniques.
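
A small numerical check of the batch-effect model y = lab A control mean + lab B offset + difference, as a sketch only: the data are invented (three mice per group and lab), and `least_squares` is a hypothetical helper that solves the normal equations directly rather than anything shown in the video. Each design-matrix row is [1, is_lab_b, is_mutant].

```python
def least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

# Hypothetical data: (is_lab_b, is_mutant, measurements)
groups = [
    (0, 0, [2.0, 3.0, 4.0]),    # lab A control, mean 3
    (0, 1, [6.0, 7.0, 8.0]),    # lab A mutant, mean 7 (difference 4)
    (1, 0, [5.0, 6.0, 7.0]),    # lab B control, mean 6
    (1, 1, [8.0, 9.0, 10.0]),   # lab B mutant, mean 9 (difference 3)
]
X = [[1.0, lab_b, mutant] for lab_b, mutant, ys in groups for _ in ys]
y = [v for _, _, ys in groups for v in ys]

intercept, lab_b_offset, difference = least_squares(X, y)
print(intercept, lab_b_offset, difference)
```

With balanced data like this, the fitted difference is the plain average of the two per-lab differences (4 and 3). The fitted lab B offset is the average lab-to-lab gap across both mouse types; it matches "lab B control mean - lab A control mean" exactly only when the mutant-control difference really is the same in both labs, which is the assumption the model makes.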

@relatively_random4903 2 years ago

I'd like to know more about the number F he keeps calculating. Is there a Wikipedia article about it, as a starting point? Or a name commonly used for it?

@statquest 2 years ago

See: kzbin.info/www/bejne/pJyVdIR_idKSm9E

@janakiramanbalachandran504 4 years ago

Excellent video. However, I have a few questions. How do you build a design matrix for regression + control group when the slopes are not the same for the two groups? Similarly, when comparing Lab A and Lab B measurements, which difference in the mean values (mutant - control) should be taken, since there are two values from the two labs?

@statquest 4 years ago

When you have 2 different slopes, then you need to add something called an "interaction term". That's just another column in the design matrix. And when you want to compare Lab A and Lab B, you just pick one to be the "base" and the other one will be the difference from the base.

@janakiramanbalachandran504 4 years ago

@@statquest Thank you again. The interaction term seems like a nice way to handle 2 different slopes.

@krisc9211 4 years ago

@@statquest When the 2 slopes are different, you need this interaction term. But I'm wondering how the mutant offset is defined? When the 2 slopes are the same, the mutant offset is the same everywhere. When the 2 slopes are different, then the offset will vary. So where do you define the mutant offset?
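
As a sketch of the "another column in the design matrix" idea from this thread: an interaction column can be formed by multiplying the mutant indicator by the weight, so it is 0 for controls and equal to the weight for mutants. The coefficients below are invented, and `design_row` and `predict` are hypothetical helper names. With this layout the mutant offset is the gap between the two lines at weight = 0, that is, the difference in intercepts.

```python
def design_row(weight, is_mutant):
    """One design-matrix row: [intercept, mutant indicator, weight, mutant x weight]."""
    m = 1.0 if is_mutant else 0.0
    return [1.0, m, weight, m * weight]

def predict(row, coefs):
    """Multiply each column value by its coefficient and add them up."""
    return sum(x * b for x, b in zip(row, coefs))

# Hypothetical coefficients:
# control intercept, mutant intercept offset, control slope, mutant slope offset
coefs = [1.0, 2.0, 0.5, 0.25]

control_slope = predict(design_row(3.0, False), coefs) - predict(design_row(2.0, False), coefs)
mutant_slope = predict(design_row(3.0, True), coefs) - predict(design_row(2.0, True), coefs)
offset_at_zero = predict(design_row(0.0, True), coefs) - predict(design_row(0.0, False), coefs)

print(control_slope, mutant_slope, offset_at_zero)
```

At any other weight w, the gap between the lines is the intercept offset plus w times the slope offset, which is why the gap is no longer constant once the slopes differ.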

@PeihuiBrandonYeo 6 years ago

I am here for the singing intro

@statquest 6 years ago

Hooray! :)

@raghavgaur8901 4 years ago

Hi Josh, I wanted to ask you how to choose a null hypothesis for any given case. I understood the concept of a p-value, but I didn't understand how to choose the null hypothesis.

@statquest 4 years ago

The typical null hypothesis is that there is no difference between two things. If we reject that hypothesis, then the data suggest that there is a difference.

@raghavgaur8901 4 years ago

@@statquest Thanks for answering, sir.

@ismailel-shimy7431 2 years ago

Thank you for your wonderful videos which I became addicted to recently :D I found this one particularly useful for me to understand the concept of design matrices and how one can use them not only to turn on/off certain terms in equations for categorical variables, but also scale terms for continuous variables. Now, I have a question regarding the batch effect example you kindly provided. You assume that the difference between mutant and control mice in lab A is the same as in lab B and one can represent this as the average difference of the 2 labs. You also mentioned in the comments that if we had more measurements from one lab than the other, we can use a weighted average of the differences in the 2 labs. What if the mutant-control difference in lab A really differs from that in lab B. You can already see that in the dot plot. Would it make sense to add an offset term for the difference too as you did for the lab A control mean? In this case the equation should be Y = lab A control mean + lab B offset for the mean + lab A difference + lab B offset for the difference ?

@statquest 2 years ago

I believe so.

@yaozhang8368 5 years ago

Hi Josh, I really like your videos that clearly explain lots of things! In the last example of this video, is it similar with using a mixed model? Intuitively, it seems the lab was treated as a random variable and mutant was treated as a fixed variable, and here we are interested in the difference between mutant and control after removing the impact of lab A/B.

@statquest 5 years ago

You are correct! You could definitely analyze this data with a mixed model.

@somalkant6452 4 years ago

Hi Josh, I'm highly indebted to you for all of your awesome videos. I have one doubt which is making me restless. Please help if you can. Suppose I have only one independent variable, which is categorical (control (0) / mutant (1)), and one dependent variable (DV), which is continuous, so I will get the graph shown at 0:53 in this video. There is no other independent variable (IV) to make the graph look like the one at 8:29. How do I check "linearity" between the IV and DV, as it is an important assumption of linear regression? We cannot draw a line with some slope and intercept in this case (0:53), or does this "linearity" assumption not need to be checked? And what about other linear regression assumptions such as "normality of residuals" and "homoscedasticity": do they also not need to be ensured in my example? Please help. Thanks

@statquest 4 years ago

When we talk about "linearity" with respect to "linear models" like these, the only things that are linear are the coefficients that connect the independent and the dependent variables. In this case, the dependent variable is a linear transformation of the independent variable because we are just multiplying the independent variable by a coefficient. As for the other assumptions of linear regression, like "normality of the residuals" - we still have residuals in this case (see 1:57), so we can still check if the residuals themselves are normal.

@ebateru 3 months ago

Thanks for another great video Josh. In the example where you test if mutant mice are significantly larger than control mice, couldn't you simply fit a single regression model with an added predictor that is a dummy variable saying if a mouse is control =1 otherwise = 0 and taking a look at the coefficient and p-value of that coefficient? Would that be the same as fitting a model with mouse type and comparing it to a model without a mouse type?

@statquest 2 ай бұрын

Yes - you could do that and it would be simpler. However, that's not how most programs (like R) would do it, so it's good to know the standard way, even though it is slightly more complicated.
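To illustrate why the two approaches agree, here is a rough sketch using invented data and plain NumPy (the variable names and numbers are assumptions, not anything from the video). When the fancy model adds exactly one dummy column, the F statistic from comparing the two models equals the square of the t statistic for that dummy's coefficient.

```python
import numpy as np

# Hypothetical data: weight and mouse type predicting size.
weight = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])
mutant = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
size = np.array([1.1, 1.9, 3.2, 3.8, 3.0, 4.1, 4.9, 6.2])


def fit(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)


ones = np.ones_like(size)
X_simple = np.column_stack([ones, weight])         # no mouse type
X_fancy = np.column_stack([ones, mutant, weight])  # with mouse type

_, rss_simple = fit(X_simple, size)
coef_fancy, rss_fancy = fit(X_fancy, size)

n, p_simple, p_fancy = len(size), 2, 3

# F statistic for comparing the simple model to the fancy model.
F = ((rss_simple - rss_fancy) / (p_fancy - p_simple)) / (rss_fancy / (n - p_fancy))

# t statistic for the dummy coefficient in the fancy model.
sigma2 = rss_fancy / (n - p_fancy)
cov = sigma2 * np.linalg.inv(X_fancy.T @ X_fancy)
t = coef_fancy[1] / np.sqrt(cov[1, 1])
```

Here `F` and `t**2` match up to floating-point error, so the p-values from the two routes are the same.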

@ebateru 2 ай бұрын

@@statquest Thanks for such a quick reply Josh!

@formula-box 2 ай бұрын

@@statquest Thanks for the video. I have a question related to the topic. In your video you calculated the p-value for the F statistic. The p-value is small enough to indicate that fitting two lines is better. But how do you infer that the difference between the intercepts of the two parallel fitted lines is statistically significant?

@statquest 2 ай бұрын

@@formula-box That's what the p-value represents - that the intercepts are significantly different (since the slope is the same for both lines).

@DeepakSah3.0 5 жыл бұрын

How did you calculate the p-value?

@statquest 5 жыл бұрын

With an F-distribution. The concepts are explained in Linear Models Part 1: kzbin.info/www/bejne/pJyVdIR_idKSm9E

@saileshpatra2488 4 жыл бұрын

I tried a lot, but I'm unable to digest all the concepts. Thanks for all the detailed explanation. Btw, can we expect another video on design matrices and their real use, with a slightly simpler explanation, if possible?

@statquest 4 жыл бұрын

Did you start from the very start of this series (this video is part 3), with linear regression? Earlier videos in this series cover simpler design matrices. Here's the whole playlist, in the correct order: kzbin.info/aero/PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU

@kt9509 3 жыл бұрын

pls do some videos on recommendation systems (like collaborative filtering) and distributed learning (like mapreduce)!!

@statquest 3 жыл бұрын

I'll keep those topics in mind.

@BeefLoverMan 3 жыл бұрын

I'd love to see you do a central composite design matrix for response surface modelling at some point! I seem to understand the topic enough to implement it in Python, but I'm still struggling to put it into a more general context. The final ANOVA output people usually show is the most confusing. I get that you do an ANOVA for each single term in the model (so each term is a "group"), but then there's always an extra "residual" ANOVA and I can't figure out what that is calculated on.

@statquest 3 жыл бұрын

I'll keep that in mind.

@alecryan8733 3 жыл бұрын

Ahhhhh the relationship between design matrices and dummy encoding just clicked for me. The less common case where the design matrix is all 0s in the first vector in the matrix is the same as one-hot encoding, right?

@statquest 3 жыл бұрын

I'm not sure. I can't imagine why a design matrix would ever have a column with just 0s. That just sets a parameter to 0.

@cicinindivin3689 2 жыл бұрын

Is this kind of 2 way ANOVA?

@statquest 2 жыл бұрын

Sure.

@cicinindivin3689 2 жыл бұрын

@@statquest it isn't clear to me why I should use F (whole regression) instead of T for the "slope" between the intercepts to calculate a p value... I know that for a simple single linear regression F is just T^2 (for the slope), but going "3 dimensional", like in the batch effect example you show at the end of the video, I no longer see the relation between F and the (now) 2 Ts... and I would expect to have to use the Ts since I still have only 2 groups (orthogonally, lab a and b, or WT and mutant), not 3 (which would imply ANOVA and F, at least so I've been taught)... Isn't T in this case better than F since it is "measuring" specifically the difference in intercepts instead of F that is "measuring" the quality of the fit overall (or maybe the last residual "dimension" of the fitting "surface")? I'm confused...

@statquest 2 жыл бұрын

@@cicinindivin3689 The difference between a t-test and an F-test is like the difference between a knife and a swiss army knife ( imageengine.victorinox.com/mediahub/39710/640Wx560H/SAK_1_3713__S1.jpg ). The t-test can only compare means between two groups and it can only take one variable into account when making that comparison. For example, a t-test can compare the height of two groups of people, where "height" is the only variable we measure. However, when we have more than one variable, like we measured "height" and we measured "weight", then we would not be able to use the t-test to compare the two groups. In contrast, an F-test works when we have measured 1 or more variables. In other words, the F-test is a generalization of the t-test. In these examples, we have measured more than one variable per group (size and weight) so our only choice is to use the F test (if we want to use all of the data we measured).

@cicinindivin3689 2 жыл бұрын

@@statquest Thanks for replying. I forgot to mention that the t values in my previous message are the ones Excel provides in the regression box, below the ANOVA box, for the various coefficients (intercept, slope in one axis, slope in the other axis, etc.); maybe it makes more sense now. I would have thought that using the t value for the specific dimension under analysis (e.g. the difference in intercepts in the case of this video) would provide a p-value "purified" for that specific null hypothesis (e.g. the intercepts are equal)... I don't know, I'll try to understand this better.

@statquest 2 жыл бұрын

@@cicinindivin3689 Generally speaking, we want to use as much data as we can to make decisions. Using a t-test would force us to exclude some data from making a decision, and omitting data can result in worse decision making. To be honest, I think the best thing to do is just forget about using t-tests. Think of everything as a type of F-test, and you will be much better off and can always use all of the data.

@yulinliu850 6 жыл бұрын

Thanks a lot!

@statquest 6 жыл бұрын

You're welcome! :)

@dansolpa 2 жыл бұрын

hi guys! hope you are having a beautiful day! hey, I have a question, in the last example where there are 2 different labs, the difference between mutant and control is calculated in this way: Lab A mutant mean - Lab A control mean? or (Lab A mutant mean + Lab B mutant mean) - (Lab A control mean + Lab B control mean). Hope anyone can help me. Thanks!!

@dansolpa 2 жыл бұрын

Maybe a 4th parameter is necessary: change the 3rd parameter to be the difference (Lab A mutant mean) - (Lab A control mean) and make the 4th parameter the difference (Lab B mutant mean) - (Lab B control mean)?

@statquest 2 жыл бұрын

The trick is to see how the 0's and 1's affect the equation. For a control measured in lab A, we have: 1*lab A control mean + 0*lab B offset + 0*difference = lab A control mean. For a mutant measured in lab A, we have: 1*lab A control mean + 0*lab B offset + 1*difference = lab A control mean + difference. So the mutant value in lab A is the lab A control mean plus the difference between mutants and controls. Similarly, we can measure the difference between mutant and control in lab B by including the lab B offset.
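The same arithmetic can be sketched in a few lines of Python. The three parameter values below are invented purely for illustration; only the pattern of 0's and 1's follows the explanation above.

```python
import numpy as np

# Hypothetical parameter values (assumptions, not estimates from the video).
lab_a_control_mean = 10.0
lab_b_offset = 2.0
difference = 3.0  # mutant minus control
params = np.array([lab_a_control_mean, lab_b_offset, difference])

# One design-matrix row per kind of mouse:
# [always 1, measured in lab B?, is mutant?]
design = {
    "lab A control": [1, 0, 0],
    "lab A mutant": [1, 0, 1],
    "lab B control": [1, 1, 0],
    "lab B mutant": [1, 1, 1],
}

# Multiplying each row by the parameters turns terms on (1) or off (0).
predicted = {name: float(np.dot(row, params)) for name, row in design.items()}
```

With these numbers, the mutant minus control difference comes out the same in both labs, which is exactly what a model with a single difference parameter assumes.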

@dansolpa Жыл бұрын

@@statquest thanks for your reply!!!!! I'm still confused. As I understand the lab B offset is calculated by taking the Lab B control mean, and the difference between mutant and control is calculated in the following way: (Lab A mutant mean - Lab A control mean). So, if the previous hypothesis is correct, when calculating the (Lab B mutant mean) with the design matrix we are going to get a different value than the one in the graph, because maybe the difference between control and mutant is different in lab B and in lab A. In other words: Lab A mutant mean = 1*lab A control mean + 0*lab B offset + 1*difference = lab A control + difference. Lab B mutant mean could be different from 1*lab A control mean + 1*lab B offset + 1*difference = lab A control + lab B offset + difference because the difference between control and mutant could be different in lab B and in lab A. And if we are only calculating the difference by (Lab A mutant mean - Lab A control mean) we are only taking into account the lab A difference

@statquest Жыл бұрын

@@dansolpa It is possible that the difference between control and mutant is different in different labs. If so, this is called an "interaction effect", and we would need to add an additional term to compensate for it.
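A minimal sketch of what that additional term could look like, assuming made-up data in which the mutant effect is larger in lab B than in lab A. The interaction column is simply the product of the lab B column and the mutant column; all names and numbers here are hypothetical.

```python
import numpy as np

# Hypothetical sizes, two mice per group:
# lab A control, lab A mutant, lab B control, lab B mutant.
y = np.array([9.0, 11.0, 12.0, 14.0, 11.0, 13.0, 16.0, 18.0])
lab_b = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
mutant = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
ones = np.ones_like(y)

# Three-parameter model: one shared mutant-minus-control difference.
X_additive = np.column_stack([ones, lab_b, mutant])

# Four-parameter model: the extra column is 1 only for mutants in lab B,
# so its coefficient is how much the difference changes in lab B.
X_interaction = np.column_stack([ones, lab_b, mutant, lab_b * mutant])

coef_add, *_ = np.linalg.lstsq(X_additive, y, rcond=None)
coef_int, *_ = np.linalg.lstsq(X_interaction, y, rcond=None)
```

With the interaction column, the fitted values reproduce all four group means; without it, the shared difference ends up as a compromise between the two labs.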

@mohammadalidastgheib2688 2 жыл бұрын

I didn't get the last example.

@statquest 2 жыл бұрын

What time point, minutes and seconds, was confusing?

@rohitrajora9832 3 жыл бұрын

Hi Josh, I was able to understand everything prior to the design matrices topic. Could you please suggest how I could improve my understanding? Also, if it's feasible, can you please make a "Design matrices in Python" video, because that would really help.

@rohitrajora9832 3 жыл бұрын

PS if you could also make "in python" videos for for the topics you have implemented in R before

@rohitrajora9832 3 жыл бұрын

that would be great

@statquest 3 жыл бұрын

What specific time point, minutes and seconds, is confusing?

@rohitrajora9832 3 жыл бұрын

I'm new to ML and started watching your machine learning playlist and got stuck on decision trees. Here are my overall doubts.
From the "GLMs part 2" video:
7:28 (and 10:36) - Why do we calculate the residuals using the 2 mean lines when we just fit a single line {y = mean(control) + mean(mutant)} to the data using the design matrix?
From the "design matrices" video:
8:31 - What do we do when the slopes are different?
8:23 (and 13:59) - How do we get these y equations? I mean, I'm not intuitively able to get them on my own.
9:56 - Is this how we always compute the residuals, (data - line)^2? I mean, to calculate the points on the line, do we always use the design matrix (and its corresponding equation)?
13:59 - What is the lab B offset? What is diff(mutant - control)? Which labs do the mean mutant and mean control correspond to?

@statquest 3 жыл бұрын

@@rohitrajora9832 From GLM 2, we have two lines because of how the design matrix works. In this case, to estimate the mean of the control subjects, we multiply "mean_control" by 1 and "mean_mutant" by 0, giving us just the mean for the control. To estimate the mean of the mutants, we multiply mean_control by 0 and mean_mutant by 1. This is illustrated here: kzbin.info/www/bejne/hHeYkJWqhMZ2n8k
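As a rough sketch of that design matrix, with invented numbers and NumPy's least-squares solver standing in for whatever software you use: each row switches on exactly one of the two means, so one equation produces two horizontal lines.

```python
import numpy as np

# Hypothetical measurements, invented for illustration only.
control = np.array([2.0, 3.0, 4.0])
mutant = np.array([5.0, 6.0, 7.0])
y = np.concatenate([control, mutant])

# Column 1 turns on mean_control, column 2 turns on mean_mutant.
X = np.array([
    [1, 0],
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [0, 1],
], dtype=float)

# The least-squares estimates are simply the two group means.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
mean_control, mean_mutant = coef

# Fitted value per mouse: its own group's mean, hence two lines.
fitted = X @ coef
```

The residuals are then measured from each point to its own group's line, which is why both mean lines show up in the plot.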

@claradong4649 5 жыл бұрын

your voice is different from previous

@statquest 5 жыл бұрын

This is actually an old video. I now use a better microphone.