The worst assumption to violate that nobody checks

Views: 5,069

Days ago

Comments: 71

@411dadofabres 29 days ago

The linearity assumption in linear regression does not preclude the inclusion of non-linear terms. You can apply transformations to the data, though it may complicate interpretation. What you're describing as a "violation of linearity" or "homogeneity" is actually an endogeneity problem, often caused by omitted variable bias or model misspecification. Endogeneity arises when E[u|X] ≠ 0, meaning the error term u is correlated with the explanatory variables X. This results in biased estimators: on average, the estimated coefficients will be incorrect when using random samples. The "Independence Assumption" you mentioned combines two distinct concepts: strict exogeneity and uncorrelated errors. Strict exogeneity means that E[u|X] = 0. Uncorrelated errors means that E[ui*uj|X] = 0 for all i ≠ j. Together, they imply that Cov(ui,uj|X) = E[ui*uj|X] - E[ui|X]*E[uj|X] = 0. That's your 'independence assumption'. But these should be treated as separate things: strict exogeneity, which ensures no correlation between the error term and the explanatory variables across all observations, is particularly challenging to satisfy. Meanwhile, uncorrelated errors refer to the absence of correlation between error terms in different observations. Combining these into a single assumption can be misleading because they address different aspects of the model's validity. It's OK to deal with correlated errors: they impact only inference (confidence intervals), not the bias of the OLS estimators. However, it is >very difficult< to deal with endogeneity (especially if you don't have good instruments), and endogeneity implies bias. Endogeneity may be caused by omitted variable bias (this is what you're calling 'linearity', but it is actually omitted variable bias due to the omission of interaction variables), measurement errors, reverse causality and so on. This is much more difficult to handle.

@411dadofabres 29 days ago

Just to be clear: your 'true model' example is linear because it is in fact linear in the parameters! If you have something like 'Y = b1*e^{b2*x2}', this is not linear. But Y = b1*x^2 + b2*z^3 + b3*x1*x2 is! You can do any transformations you want to capture non-linearity. It still wouldn't violate the linearity assumption of OLS, which is important to ensure the estimator properties (consistency, unbiasedness, the Gauss-Markov theorem and so on). But as there are missing variables, you'll have to deal with omitted variable bias. Example: consider you are estimating Y = b0 + b1*x. In this case, b1^ = Cov(x,y)/Var(x). But what if y (the true model) actually is b0 + b1*x + b2*z? Then: b1^ = Cov(x, b0 + b1*x + b2*z)/Var(x) = [b1 Cov(x,x) + b2 Cov(x,z)] / Var(x) = b1 + b2 Cov(x,z)/Var(x). That is, b1^ ≠ b1 whenever b2 ≠ 0 and Cov(x,z) ≠ 0: that is omitted variable bias.
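The omitted variable bias formula in the comment above can be checked with a small simulation. This is only a sketch under assumed values: the coefficients (1, 2, 3), the 0.5 link between x and z, the sample size, and the helper name `cov` are all hypothetical choices made for illustration, not anything taken from the video.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

random.seed(0)
n = 20000
b0, b1, b2 = 1.0, 2.0, 3.0  # assumed "true" coefficients

# z is correlated with x, so leaving it out should bias the slope on x.
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [b0 + b1 * xi + b2 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

# Slope from the misspecified model y ~ x (z omitted).
b1_hat = cov(x, y) / cov(x, x)

# What the formula b1 + b2 * Cov(x, z) / Var(x) says that slope should be.
b1_formula = b1 + b2 * cov(x, z) / cov(x, x)

print(f"true b1              : {b1:.3f}")
print(f"estimated b1 (no z)  : {b1_hat:.3f}")
print(f"b1 + b2*Cov/Var      : {b1_formula:.3f}")
```

With these assumed values the short regression should land near b1 + b2 * 0.5 rather than near b1, and the gap shrinks to zero only if b2 = 0 or x and z are uncorrelated.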

@411dadofabres 28 days ago

Btw: Homogeneity, in the mathematical sense, is not an important assumption for OLS estimation! A function is said to be homogeneous of degree k if f(\lambda x) = \lambda^k f(x). If you had something like Y = b0 + b1x1 + b2x2 + b3x1x2, this in fact is not homogeneous in that mathematical sense. But this is not at all relevant to the OLS estimator. Just call x1x2 = x3 and go with it! Maybe the confusion arises from the fact that people refer to the homoscedasticity assumption (an important OLS assumption) as homogeneity of variance. But homoscedasticity is simply saying that E[u_i^2 | X] = \sigma^2, the same constant for every observation, and is not related to homogeneity in the mathematical sense.
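The "call x1x2 = x3 and go with it" point can be illustrated with a toy fit. The sketch below is an assumption-laden example, not the video's analysis: the grid of predictor values, the noiseless response, and the helper names `dot` and `ols_residuals` (a small Gram-Schmidt projection used instead of a regression library) are all hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ols_residuals(y, columns):
    """Residuals of y after least-squares projection onto the given columns.

    Uses Gram-Schmidt: orthogonalize the columns, then remove each
    orthogonal direction from y. Assumes the columns are not collinear.
    """
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        basis.append(v)
    r = list(y)
    for b in basis:
        coef = dot(r, b) / dot(b, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return r

# Hypothetical data on a 5 x 5 grid with a true interaction and no noise.
points = [(a, b) for a in range(5) for b in range(5)]
x1 = [float(a) for a, _ in points]
x2 = [float(b) for _, b in points]
y = [1 + 2 * a + 3 * b + 4 * a * b for a, b in zip(x1, x2)]

ones = [1.0] * len(y)
x3 = [a * b for a, b in zip(x1, x2)]  # the interaction, treated as one more column

sse_additive = sum(r * r for r in ols_residuals(y, [ones, x1, x2]))
sse_with_x3 = sum(r * r for r in ols_residuals(y, [ones, x1, x2, x3]))

print(f"SSE, additive model     : {sse_additive:.4f}")
print(f"SSE, with x3 = x1 * x2  : {sse_with_x3:.4f}")
```

The model with x3 is still linear in its parameters, so ordinary least squares handles it without any change; only the additive model leaves systematic error behind.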

@jonasenglund38 29 days ago

To my ears, what you describe is model mis-specification. I agree with you that violating additive assumptions is a big problem, but the more general problem is that of omitted variable bias. Or, even more general, model mis-specification.

@QuantPsych 28 days ago

I guess I've always thought of omitted variable bias as a bit different, in that you may not have access to the omitted variable, whereas with the additivity problem, all the information you need is right there.

@michaelha2005 27 days ago

How does this relate to multicollinearity?

@OnLyhereAlone 6 days ago

I watched this the day it was published but didn't have my question then. Here is the question: I already knew there would be an interaction between the two predictors I was using to model an outcome using a generalized linear mixed effects model. However, there was a crazy-high (more than a couple of hundred) variance inflation factor for one of the variables and, of course, the interaction. I used the performance package and the full model with interaction was clearly much better, so I can't just get rid of the interaction. In this case, which imperfect option is better? Keep the high VIFs or violate multiplicative effects? Hope you can address this soon. By the way, I was the commenter who suggested that you provide references. Thank you.

@anne-katherine1169 18 days ago

In mixed models, should you then check interactions between predictors and the random effects? You kind of include the random effects because you assume there would be interactions, or not? I'm a bit confused.

@broonzy2006 28 days ago

Hey Dustin, the code is either not there or you need to do some kind of mixed model to reveal the answer? Above it says " " is the code? Which I tried, btw.

@QuantPsych 28 days ago

Oops. It's black_friday_24

@tagheuer001 29 days ago

Homogeneity of regression slopes? But lmer() random effects can get around this, right?

@QuantPsych 28 days ago

Different context. In mixed models (lmer) we allow the slopes to vary by cluster. Another way to think of it is modeling a cluster by x interaction. I'm talking about lm's, but the idea is the same--we have to model an interaction.

@crashteens 28 days ago

You are wrong when you say here that it is useless to check for independence of the residuals. Non-independence can even be a consequence of violating the additivity postulate. Take dataset II of Anscombe's quartet, for example, and plot the autocorrelation of the residuals against lag, then compare the two models: the simple linear regression, and the model including the main effect of the IV and the squared IV after centering. You will see that the autocorrelation is higher for the linear model. Non-independence of the residuals isn't always a matter of design and can follow from the omission of an interaction. In any case, we agree that you should always plot raw data first, but for these kinds of models, where you have way too many covariates, autocorrelation against lag can be useful.
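The Anscombe comparison described above can be sketched in a few lines. The data are the published values for dataset II of Anscombe's quartet; everything else is an assumption made for illustration: residuals are ordered by x, only the lag-1 autocorrelation is computed (not a full plot against lag), and `dot`, `ols_residuals` and `lag1_autocorr` are hypothetical helper names.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def ols_residuals(y, columns):
    """Least-squares residuals via Gram-Schmidt (assumes non-collinear columns)."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        basis.append(v)
    r = list(y)
    for b in basis:
        coef = dot(r, b) / dot(b, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return r

def lag1_autocorr(r):
    """Lag-1 autocorrelation of a zero-mean residual series."""
    return sum(r[i] * r[i + 1] for i in range(len(r) - 1)) / dot(r, r)

# Anscombe's quartet, dataset II.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Order the observations by x so that "lag" means "next larger x".
pairs = sorted(zip(x, y))
xs = [float(p[0]) for p in pairs]
ys = [p[1] for p in pairs]

mean_x = sum(xs) / len(xs)
xc = [v - mean_x for v in xs]      # centered predictor
xc2 = [v * v for v in xc]          # squared centered predictor
ones = [1.0] * len(ys)

res_linear = ols_residuals(ys, [ones, xc])
res_quadratic = ols_residuals(ys, [ones, xc, xc2])

print(f"lag-1 autocorrelation, linear model : {lag1_autocorr(res_linear):.3f}")
print(f"largest |residual|, linear model    : {max(abs(r) for r in res_linear):.3f}")
print(f"largest |residual|, quadratic model : {max(abs(r) for r in res_quadratic):.4f}")
```

The residuals of the straight-line fit follow a smooth arch, so neighbouring residuals are strongly alike. Once the squared term is added, the residuals shrink to roughly the rounding error of the published values, so there is little structure left for an autocorrelation to pick up.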

@yuzaR-Data-Science 27 days ago

I agree with Dr. Scott :) The independence assumption is the most important and the most ignored. Folks often just run ANOVA over ... everything ... horrible. Not even RM ANOVA, just ANOVA :))) That's why I think the independence assumption is the most important. In my opinion, you are wrong. The difference between the means of independent samples can be huge and not significant, while the difference between the means of dependent samples could be small but significant, simply because all data points went up by 1% or so.

@OnLyhereAlone 6 days ago

Agreed; just a simple t-test example in Excel for independent vs. paired types (formula ending with 2, 2 vs. 2, 1) demonstrates this issue.
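The independent-versus-paired contrast in the two comments above can be reproduced with made-up numbers. The sketch below assumes eight hypothetical subjects whose scores all rise by roughly 1%; the values and the function names `pooled_t` and `paired_t` are invented for illustration, and only the t statistics are computed (no p-values).

```python
import math
import statistics as st

# Hypothetical scores: large spread between subjects, tiny shift within each.
before = [100, 120, 90, 150, 110, 130, 95, 140]
after = [101.2, 121.0, 91.1, 151.4, 111.0, 131.5, 95.8, 141.6]

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (treats samples as independent)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    return (st.mean(b) - st.mean(a)) / se

def paired_t(a, b):
    """Paired t statistic computed on the within-subject differences."""
    d = [bi - ai for ai, bi in zip(a, b)]
    return st.mean(d) / (st.stdev(d) / math.sqrt(len(d)))

print(f"t, treated as independent samples : {pooled_t(before, after):.2f}")
print(f"t, treated as paired samples      : {paired_t(before, after):.2f}")
```

The mean difference is identical in both calculations. What changes is the standard error: the independent version divides by the between-subject spread, while the paired version divides by the much smaller spread of the differences.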

@Gaborio1 29 days ago

I agree with the plot-first idea. That's such an important step in any modelling task. But that does not work as well for an interaction as it does for a non-linear relationship. If you plot something, you can eyeball a line and see that the best fit is a square or whatever. But an interaction requires some prior knowledge about what third variable might be conditioning the effect. And sometimes we just don't know that.

@taotaotan5671 29 days ago

Exactly, you can only plot your data if you live in a 2d/3d space. Plus, it’s hard to specify the “right” model

@cocotte4746 29 days ago

pairs(data) gives you what you need.

@Gaborio1 29 days ago

@cocotte4746 But it still assumes that you have measured all your relevant data. The example in the video is pretty good for my point, actually: when would you actually measure the best friends' suicidal ideation?

@CainisUponUs 29 days ago

I agree that this assumption is ignored for the most part in the biological sciences, but in my experience engineers (especially industrial engineers) who are more interested in prediction rather than inference tend to fit second order models more often, and so almost always test the additive models against the multiplicative models (see response surface methodology)

@dangernoodle2868 29 days ago

I stayed because he asked me nicely

@zebittyangryman1746 29 days ago

Hi Dustin. I was wondering if you could comment on the seeming paradox of avoiding median splits, but also being able to visualize potential interaction effects. In the plot you show in this video, the probable multiplicative effects are only apparent once the splits on a continuous data set are added. You have another video on avoiding median splits in continuous data. How do these two things balance? How do you determine whether you should split to visualize multiplicative effects or not split to avoid power loss?

@QuantPsych 28 days ago

Very good question! There's a difference between splitting your data for analysis (which I'm not doing) and splitting your data for visualization (which I am doing). The problem is that there aren't many (or any) good ways to preserve the continuous nature of the data when visualizing multivariate data. But we can feel assured that we're not deceiving ourselves because we can always look at the model estimates (which, again, are based on un-categorized data), and we can always "rotate" the view so that the variable we categorize shows up on the x-axis in the next plot.

@zebittyangryman1746 28 days ago

@QuantPsych Thanks. Much appreciated.

@yulia6354 28 days ago

Thanks for adding the name of the coupon. I tried it but the website wouldn't accept it. It says that it has expired :(

@QuantPsych 27 days ago

Okay, I updated it to have a much later expiration date. You should be good.

@yulia6354 27 days ago

@QuantPsych Thanks a lot! It worked! Best present for Christmas ever! Btw, is there a limit on how long the course will be available to me? (I got the self-guided simplistics for now.)

@prinzessinadana9309 26 days ago

Nice video! I have a question about the independence assumption. When doing psychological studies with designs containing within and between factors, where I model random slopes and intercepts by subject, how can I know that the random effects by subject are not correlated with the between-subject factor, so that there is no endogeneity, especially when I have, for example, EEG data and the between factor is a psychological trait? In theory I cannot rule out that they are dependent, so how can I specify the model? Shouldn't I use random effects by subject in this case? And what is worse: violating the assumption or not using random effects by subject?

@parthosen5942 29 days ago

Non-linearity is admittedly pretty easy to visualize, but don't we get the same problem of a large number of plots to test interactions if we try to visualize a multiplicative effect? It's still easier if the interaction is with a dummy variable, but an interaction variable with 3+ categories, and especially continuous ones, would be impractical to visualize. Even if, theoretically speaking, certain interactions would make sense (e.g., the effect of belonging to a certain sex on health outcomes interacted with age), it seems quite impossible to validate this assumption if the number of predictors is large.

@QuantPsych 28 days ago

True...kinda. Take a look at these two papers: osf.io/preprints/psyarxiv/avu2n link.springer.com/article/10.3758/s13428-022-01901-9 The basic gist is that we use random forest (2nd paper) to figure out which variables are worth exploring and hope that the number of important predictors is less than four. If so, then we can use paneled plots (paper 1) with some other tools to figure out which variables are interacting.

@yulia6354 28 days ago

Yeah, I also don't quite get how to use the coupon code. Is it a mistake, or do you need to crack it somehow to be able to use it? 😅

@QuantPsych 28 days ago

Oops. It's black_friday_24

@Blackmuhahah 29 days ago

So people generally ignore fit residuals or they don't worry that there is some wonky distribution in their residuals?

@QuantPsych 28 days ago

Correct, unfortunately.

@Blackmuhahah 28 days ago

@QuantPsych Wow... someone should tell the universities/research entities, they could save SO much money and time by just plotting the data and not bothering to pretend like they're doing any thinking/explanatory investigations

@tagheuer001 29 days ago

My suspicion is that the occurrence of squared effects and 2-way interactions is much higher in survey data, but not transaction/point of sale data (business/sales/IRI/Nielsen) which represent a much larger % of data analyzed by the private sector. i.e. every market basket purchased everyday across every major retail chain in North America is a huge % of all data analysis efforts (and people focused on it). Now there are still huge needs for mixed models and complex random effects to account for geography, retail chain, category, brand, etc grouping variables, but the interactions to test for are a little more intuitive because there's a certain level of rationality assumed and commonly accepted (but of course this doesn't change reality!....I'm just saying I've rarely found a lower BIC in models in which I tested non intuitive interaction and squared effects). But I'm a sample size of 1. :)

@Cor97 27 days ago

The takeaway, as I see it, is: be aware of what your assumptions are, check them, and show how that actually turns out. In addition: often there is the assumption that the phenomena you are studying are time-independent, like assuming that a medicine with a certain effect last year will have the same effect next year and the year thereafter, thereby conveniently denying the need to ever reassess its effects.

@Pedritox0953 28 days ago

What if there is not just 1 model but many models over the "y axis", like classes or something? Great video!

@marcoghiotti7153 22 days ago

In physics you always start by looking at the raw data. From there you try to come up with your own fancy theoretical model, which must be confirmed by the data itself. Otherwise it's rubbish, no matter how fancy and elegant. And therefore you try again with different assumptions until something reasonable comes up. With all of this in mind, no sane physicist would assume a linear relationship by default. Linear laws make up a vanishingly small share of the realistic laws in nature. For reasons unknown to me, when we apply statistics to our business necessities, we always force our equations to be linear, or multi-linear should you have more independent variables. I think it is a combination of laziness, mathematical ignorance and the absurdity of layman's terms as an excuse to hide our own lack of knowledge. Let's try to be a bit more creative and professional when modeling data. Thanks

@TheMartinontario 29 days ago

What the Central Limit Theorem (CLT) states is that sample means will tend to be normally distributed, even if the original data distribution is not, provided that the following conditions are met: 1. The sample size is sufficiently large; 2. The observations or records are independent of each other; 3. All observations follow the same distribution. It is also essential to note that the CLT allows us to assume that the sample means will be normally distributed, **not the data itself** nor other measures such as medians or standard deviations. Finally, the CLT does not solve the issue of outliers if they are present. In general, if we truly want to understand our population, we should analyze why the distribution of our variable (which we theoretically assume to be normal) is not. In my experience, when a variable is not normal, it is often because significant biases exist that prevent the observations from being independent.
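The distinction drawn above, that the CLT concerns sample means and not the data themselves, can be seen in a short simulation. This is a sketch under assumed settings: exponential data, 4,000 samples of size 50, a fixed seed, and a hypothetical `skewness` helper. None of these choices come from the video or the comment.

```python
import random

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(1)
n_samples, sample_size = 4000, 50

# Independent, identically distributed draws from a strongly skewed distribution.
samples = [[random.expovariate(1.0) for _ in range(sample_size)]
           for _ in range(n_samples)]

raw_data = [v for s in samples for v in s]
sample_means = [sum(s) / len(s) for s in samples]

print(f"skewness of the raw data     : {skewness(raw_data):.2f}")
print(f"skewness of the sample means : {skewness(sample_means):.2f}")
```

The raw values stay as skewed as the distribution they came from, no matter how many are collected. Only the distribution of the means moves toward symmetry, and it does so more slowly when the sample size is small or the data are heavily skewed.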

@vazquez-borsetti 29 days ago

I think the same.

@writtenlike 29 days ago

Can I ask why you're writing all this? The dude in the video correctly said the central limit theorem is about the (theoretical) distribution of sample means, not the actual data. Also, later in your comment you write "the distribution of our variable (which we theoretically assume to be normal)", which is confusing to me since it seems to contradict what you're trying to convey earlier: that it's not about the actual data ("the variable") but about the distribution of the sample means!

@vazquez-borsetti 29 days ago

@writtenlike What he said around minute 2 about skewed data is wrong, and many people who know statistics make that same mistake. In fact, the abstract of the paper he cited says the opposite of what he claimed: "Parameter estimates were mostly unbiased and precise except if sample sizes were small or the distribution of the predictor was highly skewed"

@criticallyunderfunded3707 29 days ago

@vazquez-borsetti You are correct. The CLT states that the sum of an i.i.d. set of variables tends toward a normal distribution as the set grows. E.g., the distribution of the sum of many dice approaches normal. This does NOT mean that every sufficiently large i.i.d. variable set will be normal. E.g., the distribution of the product of two dice is not normal. As a matter of fact, this formulation would violate the law of large numbers as, in Dustin's formulation, every large set of variables would become normal. Practically, he is right, because oftentimes a measurement is true score plus error. However, many other times the errors are not linearly independent (e.g., higher and lower traits show higher errors) and thus the CLT does not apply.

@qv33nr0cks1 29 days ago

@criticallyunderfunded3707 It is irrelevant because the normality assumption is not required for unbiasedness of the estimates. Additionally, the central limit theorem does not say anything about the distribution of the errors; rather, it says something about the convergence in distribution of the SAMPLE MEAN, which is not what regression uses for computation, and so again is irrelevant.

@tagheuer001 29 days ago

Sorry Dr. Del!, you just got school'd by Dustin.

@QuantPsych 28 days ago

Well, @411dadofabres suggested maybe he was right. It was a very econometrics way of looking at it, so I may grudgingly agree :)

@ricardovillela2175 28 days ago

Dear Professor, thank you for your interesting videos. I am not a statistician, so forgive me if I write something stupid. The example you showed with the two graphs at 5:10 looks like a misspecified model. It is interesting to note that this modeling error has the effect that real independent errors in the correct nonlinear model will be seen as dependent residuals in the incorrect linear model. So could we conclude that there is also a problem in considering the dependence of the errors (residuals)?

@QuantPsych 28 days ago

Yes! That's something I hadn't thought of until someone else said it (@411dadofabres). It's a very economist-way of looking at it, which we psychologists don't often do :)

@pppauliiii6843 29 days ago

I thought that's the reason why we use DAGs to determine which variables to condition on and which ones not. Although I get the point of plotting before modeling rather than the reverse, why isn't that considered HARKing? Am I wrong in assuming modeling should be informed by theory in the first place? Still one of my favourite statistics channels on YouTube!❤

@QuantPsych 28 days ago

Such a good question! A few comments: DAGs don't specify whether a relationship between, say, A/B/Y is linear, nonlinear, or multiplicative. They just say that there is a causal relationship, and it's left up to the analyst to figure out the nature of the relationships, so we're still in a situation where we need to figure out whether the effects are additive. As for HARKing... in some sense, yes, this is HARKing. But I'm assuming the user is in EDA mode, where HARKing is permitted. If the user is in strict CDA mode, presumably the additivity of the model has already been established, so they'd still plot things to ensure these assumptions are still met. Make sense?

@pppauliiii6843 27 days ago

Makes total sense, thanks for clarifying

@JohnVKaravitis 29 days ago

Huh?

@QuantPsych 28 days ago

Hi.

@filiproch3653 29 days ago

i really enjoy the content but you just stretch the intro and the whole video waaaay too much

@QuantPsych 28 days ago

It's a 14 minute video.

@ronburgundy9712 25 days ago

@QuantPsych "That is 13 minutes and 30 seconds longer than modern attention spans!" (not OP, but I enjoy your content)

@filiproch3653 22 days ago

@QuantPsych What I mean is that I wish you'd make those 14 minutes richer in information and explanations rather than stretching out the content. See the 15-minute-long videos from the "mutual information" channel; I'd love to see more content like that in the context of fields like biology or psychology. My problem is not with the length of your video but with the content/information per unit of time.

@rugbybeef 22 days ago

It's an admission of fraud

@cocotte4746 29 days ago

Answer: Use GAMs for nonlinearity

@qv33nr0cks1 29 days ago

It's almost like you make clickbait statistics videos for people who don't understand statistics. Maybe explain what additivity even is before you start giving examples, instead of just giving the jargon definition.

@ricardogomes9528 29 days ago

I think your videos would be much better if they actually didn't have the high-pitched voice and had the fast-forward speech set to False... good content still, though.

@batesthommie2660 29 days ago

I like it just the way it is.

@olenapo4895 29 days ago

You will get used to it. We all did.

@CoolChannelName 16 days ago

I think you should start counting lottery numbers. If you graph the winning numbers you will find that some numbers are drawn almost daily and some numbers never appear. I have considered that on the atomic level that the balls may be different weights and that may explain why the number 3 is drawn so much. My other thought is the universe is mathematical in nature and numbers such as 3,6,9,12 are just inherent in all things. When the lottery bans you for winning too much, send me a message and I will play some tickets for you.

@CoolChannelName 16 days ago

This is a lesson in rhetoric. Is the non-existent social contract an assumption or presumption? Understanding of the law is a presumption. It is presumed that you are a member of a body politic and that you have delegated your authority to speak before government to another man or woman who acts as a re-presentation of you and speaks on your behalf; don't violate that one, the entire system will descend upon you. Was Iraq having weapons of mass destruction an assumption or presumption? It's only been 23 years of searching and they may find those WMDs any day now.

@CoolChannelName 16 days ago

asleep at the wheel