@Brady: In Jason A. Roy's Coursera course, both the no-interference assumption and "only one way of getting treatment" are clubbed under SUTVA;
@Theviswanath57 · 4 years ago
Whereas in your "golden retriever or other dog" example, which I guess violates the "only one way of getting treatment" assumption, you're putting it under the consistency assumption
@BradyNealCausalInference · 4 years ago
@@Theviswanath57 Not entirely sure I understand your comment, but are you saying this: "SUTVA is satisfied if unit (individual) i's outcome is simply a function of unit i's treatment. Therefore, SUTVA is a combination of consistency and no interference (and also deterministic potential outcomes)." If so, that sounds right to me. That's taken from Section 2.3.5 of the course book (not everything makes it into the lecture)
@Theviswanath57 · 4 years ago
@@BradyNealCausalInference Makes sense, thanks
@sahilverma1635 · 4 years ago
Hello Brady. I have a silly doubt: what is the difference between Y(0) and Y | T = 0?
@BradyNealCausalInference · 4 years ago
Y(0) corresponds to "take a random person from the whole population and force them to take treatment 0." Y | T = 0 corresponds to "take a random person from the subpopulation that happened to take treatment 0." Some of the comments in the threads on this video might also be helpful: kzbin.info/www/bejne/m5iQk3meg7CVpLs
@michelspeiser5789 · 1 year ago
@@BradyNealCausalInference This is a very helpful formulation that I'd recommend including in the course (unless it's already there and I missed it)
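The distinction in this thread can be seen in a small simulation. This is a hypothetical sketch (the numbers and the drunk/shoes encoding are made up for illustration): a confounder makes E[Y | T=0] differ from E[Y(0)] even though the treatment does nothing.

```python
import numpy as np

# Hypothetical sketch of the Y(0) vs. Y | T=0 distinction.
# A confounder X ("drunk") raises both the chance of treatment T
# ("sleep with shoes on") and the outcome Y ("headache severity").
rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n)                      # confounder: drunk or not
t = rng.binomial(1, np.where(x == 1, 0.8, 0.1))  # drunk people wear shoes more often
y0 = 1.0 * x + rng.normal(0, 0.1, n)             # potential outcome without treatment
y1 = y0                                          # zero true treatment effect here
y = np.where(t == 1, y1, y0)                     # observed outcome (consistency)

e_y0 = y0.mean()                 # E[Y(0)]: whole population, around 0.5
e_y_given_t0 = y[t == 0].mean()  # E[Y|T=0]: mostly sober subpopulation, much lower
print(e_y0, e_y_given_t0)
```

Even though the treatment has zero effect here, E[Y | T=0] and E[Y(0)] come apart: the two quantities answer different questions.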
2 years ago
For unconfoundedness, does conditioning on X mean the following: if we fill the "went to sleep with shoes" group with ONLY DRUNK PEOPLE, and also fill the "went to sleep without shoes" group with ONLY DRUNK PEOPLE, is that a workaround for filling both groups with random people selected by a coin flip? The downside is that some data is lost, because we only care about a subset of the dataset (e.g. DRUNK = 1, ignoring all data with DRUNK = 0)?
@YashSharma-yw9er · 3 years ago
How is the two groups (shoe sleepers and non-shoe sleepers) not being comparable considered a separate reason for association not being causation? Isn't that indirectly a confounder as well?
@Fhoneysuckle · 1 year ago
Hi Brady, thanks for your awesome lecture. But I have a question about ignorability and exchangeability. In Causal Inference: What If, the joint independence of the potential outcomes from treatment under randomization is called full exchangeability: randomization makes the potential outcomes jointly independent of the treatment T, which implies, but is not implied by, (marginal) exchangeability. So why does randomization/ignorability mean joint independence rather than just marginal independence?
@Theviswanath57 · 4 years ago
On slide #40, regarding estimation: I feel it should be Sigma_i rather than Sigma_x. Currently it's
(1/n) * Sigma_x ( E[Y | T=1, X=x] - E[Y | T=0, X=x] )
I feel it should be
(1/n) * Sigma_i ( E[Y | T=1, X=x_i] - E[Y | T=0, X=x_i] )
which can be rewritten as
Sigma_x ( P(X=x) * ( E[Y | T=1, X=x] - E[Y | T=0, X=x] ) )
@BradyNealCausalInference · 4 years ago
You are absolutely right. Unfortunately, some typos might stay in the videos, even if they have been fixed in the book.
@Theviswanath57 · 4 years ago
Reason: say there are four subgroups with conditional average treatment effects 1, 0.5, 1.5, 2.5 and P(X=x) = [0.5, 0.2, 0.2, 0.1], with 100 subjects in total. With the first equation, the ATE would be (1/100) * (1 + 0.5 + 1.5 + 2.5) = (1/100) * 5.5 = 0.055. With the second equation, the ATE is 0.5*1 + 0.2*0.5 + 0.2*1.5 + 0.1*2.5 = 1.15.
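The arithmetic in this comment can be checked directly. A small sketch using the commenter's hypothetical subgroup CATEs and probabilities, showing that the per-individual sum agrees with the P(X=x)-weighted sum but not with the unweighted sum over values of x:

```python
import numpy as np

# Commenter's hypothetical numbers: 4 subgroups, 100 subjects total.
cate = np.array([1.0, 0.5, 1.5, 2.5])   # CATE for each subgroup
p_x = np.array([0.5, 0.2, 0.2, 0.1])    # P(X=x) for each subgroup
n = 100

# Wrong reading: sum over the 4 *values* of x, divided by n.
ate_wrong = cate.sum() / n               # 5.5 / 100 = 0.055

# Sum over *individuals*: each subject contributes their subgroup's CATE.
counts = (p_x * n).astype(int)           # [50, 20, 20, 10] subjects per subgroup
ate_i = np.repeat(cate, counts).mean()   # 1.15

# Equivalent weighted form: Sigma_x P(X=x) * CATE(x).
ate_weighted = (p_x * cate).sum()        # 1.15

print(ate_wrong, ate_i, ate_weighted)
```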
@sourajmishra1450 · 3 years ago
Hey Brady, thanks for the great course!! On slide 17: why does E[Y(1)|T=1] become E[Y|T=1]? And likewise E[Y(0)|T=0] = E[Y|T=0]?
@shipan5940 · 2 years ago
My understanding: because the condition is T=1, Y(T) = Y(1) = Y, the observed outcome. That's my own way of explaining it. If T could be either 1 or 0, it couldn't be simplified like this.
@rajeevbhatt7415 · 4 months ago
It's after applying the consistency assumption: because we are guaranteed that for T=t we observe Y(t), Y | T = t is sufficient.
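The two steps being discussed in this thread can be written compactly, applying ignorability first and then consistency (as in the lecture):

```latex
\begin{align*}
\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]
  &= \mathbb{E}[Y(1) \mid T = 1] - \mathbb{E}[Y(0) \mid T = 0] && \text{(ignorability)} \\
  &= \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0] && \text{(consistency)}
\end{align*}
```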
@edisonge9311 · 4 years ago
Hi Brady, on page 18 I understand your point, but I have a question about the definition of E[Y(1)|T=0]. If we observe T=0, then what is the meaning of Y(1) here?
@BradyNealCausalInference · 4 years ago
Y(1) given that you observe T = 0 is the outcome you would have observed if you had taken T = 1. It isn't something that we can observe (usually)! I think I give the intuition for this on the potential outcomes intuition slide.
@edisonge9311 · 4 years ago
@@BradyNealCausalInference So the observation T=0 is independent of Y(1); then we can also get E[Y(1)] - E[Y(0)] = E[Y(1)|T=0] - E[Y(0)|T=1], right? But we cannot use the consistency law there; therefore, in ICI Eq. (2.3), it's E[Y(1)] - E[Y(0)] = E[Y(1)|T=1] - E[Y(0)|T=0]. Is my understanding correct?
@scotth.hawley1560 · 4 years ago
Great lecture, but starting at 20:02 I become lost: how is E[Y(1) | T=0] not a contradiction? If you do(T=1), doesn't that force T=1?
@BradyNealCausalInference · 4 years ago
Yes, but T=0 here is *conditioning* on T=0, not doing T=0. Conditioning on T=0 means "look at the people who happened to not take the treatment." Then, for those people, Y(1) means "what would have happened had they taken the treatment?"
@scotth.hawley1560 · 4 years ago
@@BradyNealCausalInference Thanks so much for taking the time to respond! This clarification helped me move forward.
@BradyNealCausalInference · 4 years ago
@@scotth.hawley1560 Glad to hear it! Thanks for bearing with me on the slow response time haha.
@Ptilu2 · 2 years ago
Hi Brady! Thank you so much for these lovely pedagogical videos! There is something I am struggling to wrap my head around, though, and I was wondering if somebody (you or some other kind soul) could help me with it. You presented ignorability as resulting from an assumption of independence between the potential outcomes Y(1), Y(0) and the treatment, leading to E[Y(1)|T=0] = E[Y(1)|T=1]. Doesn't this independence basically mean that the treatment has no causal effect on Y? Instead of removing the arrow from X to T, aren't we removing all arrows leading to T? To put my confusion another way: if the expectation of the outcome Y(1) does not change whether we give T or not, doesn't that mean T is not causal for Y? I obviously have a logic flaw somewhere, so I would be glad if someone could help me see it :)
@Ptilu2 · 2 years ago
I think I was confusing Y(1) with Y=1 here, while in fact it is Y | do(T=1). Takes some getting used to...
@kangchenghou5027 · 4 years ago
Thanks for the great lecture again! I learned a lot, and I have a few questions: 1. The fundamental problem of causal inference is that, for each individual, we only get to observe one potential outcome. The way around this is to make assumptions, thereby converting a causal estimand into a statistical estimand. So far the course deals with average treatment effects. To estimate individual treatment effects, do we need more assumptions? Will we cover that in the course? 2. For the positivity assumption: if for some covariates P(T = 1 | X = x) is very close to 0 or 1, estimation is still fine if we have access to the full distribution, but estimation from finite samples will lead to large variance. So to get a good estimate of the treatment effect, we would want P(T = 1 | X = x) not to go to the extremes, is that correct? This also reminds me of the bias-variance tradeoff: including more covariates reduces confoundedness (bias), but may lead to estimates with high variance. Does this make sense? 3. This is more of a comment: I think the lecture mentions that including more covariates is better (correct me if I am wrong). It may be worth mentioning that this is not always the case, for example X -> C
@BradyNealCausalInference · 4 years ago
1. Awesome question. Makes me think you already know the answer haha ;). To move from ATEs to ITEs, we do need to make stronger assumptions. The stronger assumptions we need to make have to do with the specific functional form and noise distribution (in addition to the causal graph). This corresponds to moving from Level 2 to Level 3 of Pearl's ladder. We will see this later in the course when we get to counterfactuals.
@BradyNealCausalInference · 4 years ago
2. You are exactly right on both counts. When we get to estimation in week 5, we will actually see that people sometimes just drop specific examples where P(T = 1 | X = x) is too close to 0 or 1. Your bit about the bias-variance tradeoff is also right (usually).
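The dropping that Brady mentions can be sketched as simple propensity trimming. The 0.05 threshold, the function name, and the toy numbers below are illustrative choices, not something prescribed by the course:

```python
import numpy as np

def trim_by_propensity(t, y, propensity, eps=0.05):
    """Drop units whose estimated propensity P(T=1|X=x) is within eps of 0 or 1,
    since those units make finite-sample ATE estimates high-variance."""
    keep = (propensity > eps) & (propensity < 1 - eps)
    return t[keep], y[keep], propensity[keep]

# Toy data: two units have extreme propensity scores (0.99 and 0.01).
t = np.array([1, 0, 1, 0, 1])
y = np.array([3.0, 1.0, 2.5, 0.5, 4.0])
e = np.array([0.5, 0.4, 0.99, 0.6, 0.01])

t2, y2, e2 = trim_by_propensity(t, y, e)
print(len(t2))  # 3 units survive trimming
```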
@BradyNealCausalInference · 4 years ago
3. Right again. I mention this in sidenote 8 of Chapter 2 in the book (www.bradyneal.com/Introduction_to_Causal_Inference-Sep1_2020-Neal.pdf). I think I meant to use weak language in the lecture (e.g. "there is a general perception that this is the case"). If I used strong language (e.g. "this is the case"), would you mind linking me to it, as I should probably correct that with an annotation.
@BradyNealCausalInference · 4 years ago
4. I do everything with PowerPoint and TikZ (since I use TikZ for the book, might as well just reuse those figures in the slides). I sometimes use Inkscape when I need more flexibility than both of those can easily provide.
@kangchenghou5027 · 4 years ago
@@BradyNealCausalInference Thanks for the detailed explanation! For 3, it could just be my perceptual bias :) You did mention this is not always the case. But just for reference, at 34:32: "for unconfoundedness, the general idea (which is not always true) is that the more covariates you condition on, the more likely you are to have satisfied unconfoundedness." For 4, may I ask how you integrate LaTeX with PowerPoint?
@tOo_matcha · 2 years ago
31:13 that split second when you see the Death Star 😂
@KyleReevesSci · 1 year ago
Was looking for this comment 😂
@charismaticaazim · 4 years ago
Brady, does the causal inference literature say anything about "knowing confounding variables are present, but not being able to know or measure what they are"? This would hint to the domain expert that something else is influencing the decision. Also, in the shoe example, since we know being drunk contributes to the outcome, it wouldn't really be a confounder if we know it, right?
@Theviswanath57 · 4 years ago
On the final estimation example:
Question 1: By controlling for age, our estimated ATE matches the actual ATE; whereas by controlling for both age & 'protein excreted in urine', our estimated ATE is just 0.85.
Question 2: What's the causal graph with both age & protein excreted in the urine? { age blood_pressure } where age is the confounding variable. Actual ATE: 1.05 & estimated ATE: 1.05 (both from the "mean of differences" & from the regression coefficient)
@BradyNealCausalInference · 4 years ago
I'm not sure I see a question in there haha. It sounds like you are describing the code. Note: some of that code is for Chapter 4, where we actually write down the causal graph, so it might not all make sense without Chapters 3 and 4.
@Theviswanath57 · 4 years ago
@@BradyNealCausalInference Cool, will wait for Chapters 3 & 4 to be covered
@RobertKwapich · 3 years ago
Great course! Any particular books or review papers you could recommend for reading in more detail?
@alialthiab7527 · 1 year ago
Have you found any?
@gwillis3323 · 3 years ago
Hey, you say that the approach at the end, where you train a regression of the form y = at + bx, only works because the treatment effect is the same for all individuals (ATE = CATE). I don't think this is correct. In fact, the paper which introduced the double machine learning approach starts off by showing that, for the case y = at + g(x), standard approaches which predict y well give biased estimators of a (although, granted, double machine learning really starts to shine when y = f(x)t + g(x)). Do you have any intuition for why the linear regression approach works so well here? Is it because the outcome variable depends linearly on both the treatment and the feature? Will it always work well in such cases? My intuition says no, that confoundedness can still mess you up. Maybe it's just a quirk of this exact dataset?
@tyflehd · 2 years ago
Hello Brady, thank you for the awesome video :) I came here to get an intuitive understanding of causality. I have a question about lecture slide 14. If the groups with T=1 and T=0 are comparable, shouldn't the person on the right be drunk if the one on the left is sober? Based on my understanding, say I am the topmost guy in both groups (T=1, T=0). How can I be included in the group 'go to sleep with shoes on' and in the group 'without shoes on' under the same condition 'drunk'? Please correct me if I am wrong. Thanks!
@rajeevbhatt7415 · 4 months ago
The same person cannot be included in both groups; it's just that the composition of the two groups is almost the same, due to randomization.
@jitingjiang7401 · 4 years ago
Hi Brady, thanks for this lecture. It is super great. I have one question about the fourth assumption for identification, i.e. consistency. To illustrate the concept, you mentioned an example with two different types of dog as multiple versions of the treatment. I am wondering, is it really a problem? I guess one can always define a specific version of treatment as T, right? Thank you!
@BradyNealCausalInference · 4 years ago
Yes, that just means being sufficiently specific about how you define the treatment.
@galaxystat · 4 years ago
Hi Brady, thanks for the great lectures! I read The Book of Why by Judea Pearl. Is there any difference between the potential outcomes framework and the counterfactual calculation in Pearl's book? I saw some comments in the book where Judea argued that the missing-value interpretation is wrong. What methodology do you recommend in practical applications? Or are they just the same?
@BradyNealCausalInference · 4 years ago
I think the two languages share a lot more than a lot of people seem to think. To me, they are simply different notations and different ways to formulate the assumptions. You should be able to understand both, so I include them both in the first month of the course. I use both, depending on the setting or who I'm talking to.
@adrianoyoshino · 3 years ago
In the consistency example, I got the point that we can't have multiple versions of treatment (like different types of drug as a treatment). But does it have to have the same outcome always? I mean, is it possible to have a case where I take a pill one day and get better, but I take a pill another day and the headache does not get better?
@rajeevbhatt7415 · 4 months ago
Violating consistency is like needing more nodes in the causal graph. For example, the dog type in the given example, along with whether the person got a dog. Similarly, if the pill's effect is different each day, a day node needs to be added to the causal graph.
@Theviswanath57 · 4 years ago
Slide #40: the naive estimate might have been obtained through the following regression equation: Y_i = alpha + beta * T_i; so alpha_hat is 5.33?
@BradyNealCausalInference · 4 years ago
Not quite. That simple regression, taking the coefficient from it, is actually what I describe for slide *41*. And in your notation, it's *beta* hat that is the ATE estimate (5.33), not alpha hat. In the notation I use in slide 41 (different from yours), it is alpha hat that is 5.33.
@Theviswanath57 · 4 years ago
@@BradyNealCausalInference Yeah, that's right; I was a little confused. Thanks
@Theviswanath57 · 4 years ago
Where can I get the data?
@BradyNealCausalInference · 4 years ago
@@Theviswanath57 See the GitHub link in Section 2.5 of the book for the data generation and estimation code.
@charismaticaazim · 4 years ago
Reporting a mistake: around 5:03 Brady said T=0 for taking the pill. It should be T=1.
@souradipchakraborty7071 · 4 years ago
Can we have a non-linear cause-and-effect relationship? In that case, how do we estimate the exact effect?
@BradyNealCausalInference · 4 years ago
Yes! You'd use the same estimator that is used in slide 40, but with a nonlinear model instead of linear regression. You can also use any of the other estimators that we discuss in week 6 of the course.
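A minimal sketch of that plug-in idea: fit a model for E[Y | T, X], then average the predicted difference f(1, x_i) - f(0, x_i) over all units. The data-generating numbers below are made up, and plain least squares stands in for whatever (possibly nonlinear) regressor you would actually use:

```python
import numpy as np

# Simulated data with a known true ATE of 2.0 (illustrative numbers only).
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, n)
y = 2.0 * t + 3.0 * x + rng.normal(0, 0.5, n)

# Fit a model for E[Y | T, X]; here, ordinary least squares on [1, T, X].
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def f(t_val, x_arr):
    """Model prediction E[Y | T=t_val, X=x] (any regressor could go here)."""
    return coef[0] + coef[1] * t_val + coef[2] * x_arr

# Slide-40-style estimate: average the predicted individual differences.
ate_hat = np.mean(f(1, x) - f(0, x))
print(ate_hat)  # close to the true ATE of 2.0
```

With a linear model the predicted difference is the same for every unit (the T coefficient), but with a nonlinear regressor the same averaging recovers the ATE even when effects vary with x.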
@souradipchakraborty7071 · 4 years ago
@@BradyNealCausalInference Thanks, will definitely check the week 6 material. I asked because if there is non-linearity with respect to T, then Y_hat = alpha * T + alpha' * T^2 + alpha'' * T^3 + ... + beta * X. Which coefficient would then give us the causal effect of T on Y?
@Theviswanath57 · 4 years ago
@Brady: On slide #41, I am wondering whether the estimate should be Sigma_x ( P(X=x) * ( E[Y | T=1, X=x] - E[Y | T=0, X=x] ) )
@Theviswanath57 · 4 years ago
In your variant, essentially we are saying that P(X=x) is the same for all x; please correct me if I am wrong
@BradyNealCausalInference · 4 years ago
@@Theviswanath57 On slide 40 it is that equation you write, assuming you meant "E[Y | T=1, X=x] - E[Y | T=0, X=x]" when you wrote "E[Y/T=1, X=x] - P(Y/T=0, X=x)." However, on slide 41 we use a completely different way to estimate the ATE: linear regression, taking the coefficient of the regression. In general it is not equal to the correct equation from slide 40. It is only equal when E[Y | T=1, X=x] - E[Y | T=0, X=x] is the same for all x (i.e. the treatment effect is the same for all individuals). I don't actually include the specific equation for the estimate on slide 41, but you can get it using the closed-form solution to linear regression. You can see the exact code I used for this in Section 2.5 of the course book.
@Theviswanath57 · 4 years ago
@@BradyNealCausalInference Regarding P(Y/T=0, X=x): yes, I meant E[Y | T=0, X=x].
@Theviswanath57 · 4 years ago
Understood, re: "It is only equal when E[Y | T=1, X=x] - E[Y | T=0, X=x] is the same for all x (i.e. the treatment effect is the same for all individuals)."
@Theviswanath57 · 4 years ago
@Brady: If we have P(X=x) as part of the equation, is the ATE an unbiased estimate even if E[Y | T=1, X=x] - E[Y | T=0, X=x] is not the same for all x?
Thanks for the lecture! I have a question around kzbin.info/www/bejne/a6nCoYOboqaJrtU: is E[Y(1) - Y(0)] (here the individual subscript i is implicit) properly defined, since some data are missing?
@DailySFY · 11 months ago
@mingmingchen7154 As you have pointed out, it is a biased estimate. And Brady explains this clearly afterwards.
@chadpark9248 · 4 years ago
Thanks for the great lecture again. I have a few questions about the textbook. On page 8: "A natural quantity that comes to mind is the associational difference: ~~~~ Then, maybe E[Y(1)]-E[Y(0)] equals E[Y|T=1]-E[Y|T=0]." From these sentences, I got a little confused about what "maybe ~ equals" means.
@chadpark9248 · 4 years ago
In addition, I have a question about the description of "Consistency" on page 14. I understand Y(t) intuitively, but I don't intuitively understand "whereas Y(T) is the potential outcome for the actual value of treatment that we observe." Do you have an example?
@BradyNealCausalInference · 4 years ago
Basically, it's a train of thought that is common to go down. "Maybe E[Y(1)]-E[Y(0)] equals E[Y|T=1]-E[Y|T=0]" is the more formal way of writing "maybe causation equals association (correlation equals causation)." Of course, this thinking is often incorrect :)
@BradyNealCausalInference · 4 years ago
@@chadpark9248 For a given individual, they will observe a specific value, say t', of the random variable T. That means they will observe the potential outcome Y(t'). So the realized value t' of T gets connected to the observed outcome Y in that way (assuming consistency). Similarly, Y(T) corresponds to the potential outcome we observe once we know the realized value of the treatment random variable T. It is distinct from Y(1), Y(0), or Y(t), each of which denotes a specific potential outcome that isn't related to the random variable T at all (even though we use the same letter, in lower case, for Y(t)).
@chadpark9248 · 4 years ago
@@BradyNealCausalInference Thank you for your detailed explanation.
Amazing video. One question: the example at the end of the lecture seems like simple linear regression. Does that mean that when we run linear regression, we are doing causal inference? What is the difference between regression and causal inference here?