A/B Testing Made Easy: Real-Life Example and Step-by-Step Walkthrough for Data Scientists!

115,142 views

Emma Ding

This is a comprehensive walkthrough of an A/B testing example. I will go through the details of designing A/B testing experiments, running experiments, and interpreting results. The idea is inspired by the book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. I'm sure it will be extremely helpful for data science interview preparation.
Part 1 of the tutorial:
• A/B Testing Fundamenta...
Derivation of sample size equation:
• Sample Size Estimation...
🟢Get all my free data science interview resources
www.emmading.com/resources
🟡 Product Case Interview Cheatsheet www.emmading.com/product-case...
🟠 Statistics Interview Cheatsheet www.emmading.com/statistics-i...
🟣 Behavioral Interview Cheatsheet www.emmading.com/behavioral-i...
🔵 Data Science Resume Checklist www.emmading.com/data-science...
✅ We work with Experienced Data Scientists to help them land their next dream jobs. Apply now: www.emmading.com/coaching
// Comment
Got any questions? Something to add?
Write a comment below to chat.
// Let's connect on LinkedIn:
/ emmading001
====================
Contents of this video:
====================
0:00 Intro
1:00 Background
2:02 Prerequisites
5:50 Experiment design
11:08 Result to decision

Comments: 96
@alexlindroos828 2 years ago
There are so many youtube videos on these subjects that are barely scratching the surface. They absolutely serve a purpose, but this is incredibly thorough for a 14 minute video. Thank you!
@tangled55 2 years ago
You are AMAZING Emma. This is a WONDERFUL example.
@prithviprakash1110 2 years ago
This was a great video! Thank you for such a great explanation, love the passion you show.
@kelvinscomps 3 years ago
Great video. Exactly what I was looking for and well explained.
@amadyba6761 2 years ago
So well explained! Great example, thank you!
@aoihana1042 2 years ago
Thorough and precise explanation! Thank you
@ramavishwanathan7214 2 years ago
Thanks for the great video. I really liked your transparent approach of attributing the source that you have built upon.
@praveensebastian2107 2 years ago
Hi Emma, I recently started watching your videos. Good explanation. Keep going!
@Kkidnappedd 9 months ago
Thank you, I like your format!
@biniyampauloschamiso1411 9 months ago
Thank you for sharing your valuable knowledge on this topic. Very helpful.
@Edward_Adams 11 months ago
Emma, thank you for your channel and your videos! I love how practical and clear your videos are. They are very easy to understand and very useful for understanding the topic and getting ready for an interview. BTW, I wanted to share that when your videos are example-oriented and precise they are more understandable than theoretical ones.
@emma_ding 11 months ago
Thanks for your feedback, Edward! I will keep this in mind. 😊
@vincentyin4992 2 years ago
Thanks for the great video! A little question about your final conclusion: it looks like Treatment 1 is neither statistically nor practically significant, so we should definitely not launch it; Treatment 2 is not practically significant but it is statistically significant, so we may consider launching it if the cost is low.
@mingzhouzhu4668 2 years ago
super useful! thank you!
@kennethleung4487 2 years ago
Hi Emma, great video! Would like to ask whether we need to worry about multiple testing problem here (i.e. apply Bonferroni correction), since there are now >2 variants in this context?
@victoradewoyinva 1 year ago
Thank you Emma!
@Jonwisniewski04 1 year ago
Excellent! very helpful, ty.
@Vikash_Kumar8090 1 year ago
Excellent Video!
@vonderklaas 1 year ago
Thanks, it became much clearer
@sophiezheng4850 2 years ago
Hi Emma, for this particular example, when choosing which population to target, I have always had one question: how do we control the self-selection bias if we only include users who visit the checkout page? I would assume that those who visit the checkout page might have a higher propensity to purchase and might not be a representative subset of the population. One way, I think, is to correct this bias during post-experiment analysis (e.g. adjusting the distribution of both groups to make inference for the overall population). Do you think my concern is valid? Would love to hear your thoughts.
@sitongchen6688 3 years ago
Thanks Emma for taking the time to create this great video!! I have a quick question regarding defining metrics. Do we also need to mention the timeframe associated with revenue per user? In your example, it seems that average daily revenue per user was used, or is it average weekly revenue? In a following video, it would be super helpful if you could do another example like this for a referral or promo program in a two-sided market like Uber (many of us find it difficult to answer A/B testing questions in that case)!! Thanks a lot for your help.
@ktyt684 3 years ago
It really depends on the setting of the question. If your question is about a feature that is used by users every day, then you can use the daily time frame. If you think there's a weekly pattern, say users tend to use this feature more on the weekend, then it makes sense to use weekly. For this problem, I think it makes sense to just use the average revenue per user over the experiment period, because I am assuming shopping is not a frequent user behavior.
@emma_ding 3 years ago
Thanks ktyt. Adding to the answer above: generally, we want to keep experiments short. You can get more "per day" measurements than "per week" measurements over the same period. In our example, the "per day" measurement is preferred.
@sitongchen6688 3 years ago
@@emma_ding Thanks Emma for the explanation! Another question I have after rewatching this video: the one-week duration mentioned here is for doing the random assignment and collecting samples for each group, right? But if the metric to analyze is a weekly metric, then it will take additional time for the metric to be realized for each cohort. So I guess that also resonates with your point that we prefer daily metrics?
@Alex-tv8cf 3 years ago
Thank you, Emma, for the A/B testing video series. I was asked about some of these basic concepts in an interview a while ago, but because I had never worked on a real project, I didn't answer them well. Starting to learn again from the beginning.
@suhascrazy805 2 years ago
Hey Emma, you mentioned sanity checks in the video. Could you please elaborate on how we can do these checks?
@nianyiwang8993 2 years ago
Hi Emma, you're the best!!!! I have a question about a smaller significance level though. When α is smaller, that means my margin of error is higher, which means I need a smaller sample size. Please let me know if I've understood this wrong. Keep going!
@goryglory729 3 years ago
Can you disambiguate expected effect size and practical significance boundary? Thank you!
@sharanupst 26 days ago
Great primer
@karthicthangarasu6766 2 years ago
Hey Emma - one question on segmenting results. Does each segment require us to run an SRM check? E.g. the SRM check passes at the overall level but fails at the segment level.
@xinxinli8779 2 years ago
Hi Emma, I might've missed this in your video, but which hypothesis test did you choose for this example, a t-test or a z-test, and why?
@quentingallea166 5 months ago
Great job! For the sample size, I would seriously advise using a more precise approach like G*Power (or R/Python). It would tremendously increase the precision.
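For anyone who wants to reproduce this in Python rather than G*Power, here is a minimal sketch using statsmodels' power module; the delta, sigma, alpha, and power values below are placeholders, not the video's exact inputs.

```python
# Sketch of a power-based sample size calculation, analogous to G*Power.
# delta = $2, sigma = $30, alpha = 0.05, power = 0.80 are assumed values.
from statsmodels.stats.power import TTestIndPower

delta = 2.0    # minimum detectable effect (revenue per user)
sigma = 30.0   # assumed standard deviation estimated from historical data
effect_size = delta / sigma  # Cohen's d

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                    alpha=0.05, power=0.8,
                                    alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")
```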
@DePhpBug 2 years ago
Hi, this subject triggered my interest in studying A/B testing a bit more. But to practice it, how do we go about learning A/B testing if we do not have a digital product? Is there mock data to begin testing with, just to be able to learn A/B testing?
@jessehe9286 3 years ago
When you mentioned "follow-up tests" in the end, would you just keep the current experiment running for more power? Or are you referring to something else? Thanks! (love the video!!!)
@nanfengbb 3 years ago
Good point! I have the same question. For "follow-up tests", do we let the experiment run longer so more samples can be collected?
@emma_ding 3 years ago
Great question! It depends. Generally speaking, when the experiment hasn't stopped, the simplest way is to keep it running to get more users. If it has stopped or there are some major changes to the experiments or assumptions, we need to rerun the experiment, which will introduce some overhead.
@Crtg17 3 years ago
@@emma_ding Hey Emma. Thanks for sharing! Just to follow up: I was wondering why people always talk about "power", like "underpowered" or "adding more power". I'm thinking that the only way to increase power is to increase the sample size, so why not just say "the sample size is too small" or "add more samples"? Is there any other way to increase power? Thanks!
@goryglory729 3 years ago
@@Crtg17 a diluted experiment can also be underpowered.
@mindfuel-ness 2 years ago
Need your advice on this: should we instead pick customers for this experiment by customer profiling, to make sure we cover a wide array of customer behaviors and represent the change introduced by the new features? I have personally had to do a lot of explaining when I run experiments with randomly picked user groups.
@SimoneIovane 2 years ago
Thank you for your clear tutorials. Do you know of any resources that go into detail on practical significance? Thanks and keep it up 😉.
@emma_ding 2 years ago
Hi Simone, I would recommend checking the blog posts on www.datainterviewpro.com/blogpost.
@yiminglee4372 6 days ago
Hello Emma, thank you so much for this amazing A/B testing series. I have a quick question about the ramp-up plan: on the first day we only use 5% of traffic for each variant, so we use 5% of traffic for variant 1, another 5% for variant 2, and the remaining 90% of traffic goes to the control group? I am just really confused about this part. Much appreciated if you can give me some hints! :)
@sandeepgupta2 2 years ago
Hi Emma, amazing content!! It cleared a lot of doubts. I have a question though: how can we estimate the standard deviation of the population? By looking at the historical data?
@emma_ding 2 years ago
Hi Sandeep! Apologies for the delayed response but yes, we can get that from the historical data.
@kellykuang6122 3 years ago
Hi Emma, thank you for the great video! I have some questions after watching it and hope you can help. In the example you mentioned, there are 2 treatment groups: one displays similar products below the checkout cart, the other is a pop-up window. Therefore shouldn't the number of variants be 2? Should we use the Bonferroni correction to divide the significance level by 2? Thanks!
@emma_ding 3 years ago
It's a great point to consider the multiple testing problem. However, the Bonferroni correction is often criticized for being too conservative. A more recommended way to deal with the problem is to control the false discovery rate. More details in this video: kzbin.info/www/bejne/jmnYZ56oacurjsU. Also, the number of variants is 3 -- control, Treatment I, and Treatment II.
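For reference, here is a hedged sketch of what controlling the false discovery rate (Benjamini-Hochberg) looks like in Python, compared with a Bonferroni correction; the two p-values are the ones quoted later in this thread and are used purely for illustration.

```python
# Compare Bonferroni and Benjamini-Hochberg (FDR) adjustments on the two
# treatment-vs-control p-values quoted in this thread (0.055 and 0.0001).
from statsmodels.stats.multitest import multipletests

p_values = [0.055, 0.0001]  # Treatment I vs control, Treatment II vs control

for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, "adjusted p-values:", p_adj.round(4), "reject:", reject)
```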
@suhascrazy805 2 years ago
@@emma_ding Hey Emma, so would it be correct to consider this example a multiple testing problem?
@ashwinmanickam 2 years ago
Thanks for the great video! Very informative! 10:23 Can customers belong to more than one group? What I mean is: in cycle 1 there are 100 users per group, so a total of 300 users will be tested; in cycle 2, 200*3 = 600 users. Can the 100 users that belonged to control or Treatment 1 in cycle 1 (day 1) belong to Treatment 2 in another cycle, or do we fix them into these categories only when starting the experiment?
@Funnylukn 3 years ago
Great video, and really clear. Thank you so much! Question: one of the big assumptions is knowing the population standard deviation, when in reality we don't. Can you give a real-life example of how to estimate that too? Thank you!
@emma_ding 3 years ago
The standard deviation has to be estimated from historical data. You could assume it's the same in both the control and treatment groups. For more info, you can refer to kzbin.info/www/bejne/gHakpKKLp71pgbM
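To make the "estimate it from historical data" step concrete, here is a minimal sketch; the file name and column names ("user_id", "revenue") are hypothetical.

```python
# Estimate sigma (std of revenue per user) from a historical orders table.
import pandas as pd

hist = pd.read_csv("historical_orders.csv")               # one row per purchase
rev_per_user = hist.groupby("user_id")["revenue"].sum()   # revenue per user over the period
sigma = rev_per_user.std(ddof=1)                          # sample standard deviation
print(f"Estimated sigma: {sigma:.2f}")
```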
@TheCsePower 1 year ago
@@emma_ding The video doesn't really explain how to get the variance from previous data. Check kzbin.info/www/bejne/hX6YfZWYZ8yNnaM
@Leon_1218 2 years ago
Hey Emma, can I ask what the rationale is for running another test with "more power"?
@charlottelee5831 2 years ago
This video is super helpful! One question: why is it recommended to be overpowered? If the test gets too overpowered, wouldn't the effect size you are detecting be too small/trivial?
@zachmanifold 1 year ago
Being overpowered would significantly reduce the variance of the sampling distribution of whatever you’re measuring. It’s very possible you can detect trivial effects like you mentioned, but that’s where practical significance plays a role. If my final sampling distribution has an interval [0.04, 0.07], that’s a completely trivial effect which is “statistically significant” but way lower than what we’d consider practical. On the other hand, it could be [2.22, 2.25] where we have much more certainty and we can see in this case it’s comfortable enough to say that it’s a practical change to make. It’s only recommended to be overpowered if you can manage all of the costs associated with the design and of course, if you have the users for it
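To see why an overpowered test can flag trivial effects, here is an illustrative sketch (the sigma and effect values are assumptions, not numbers from the video) showing how the confidence interval for a difference in means narrows as the sample size grows.

```python
# With a huge sample, even a tiny (impractical) true difference produces a
# narrow CI that excludes zero, i.e. "statistically significant".
import numpy as np
from scipy import stats

sigma, true_diff = 30.0, 0.05   # assumed values for illustration
z = stats.norm.ppf(0.975)       # 95% two-sided critical value

for n_per_group in (5_000, 5_000_000):
    se = sigma * np.sqrt(2 / n_per_group)          # SE of the difference in means
    lo, hi = true_diff - z * se, true_diff + z * se
    print(f"n={n_per_group:>9,}: CI = [{lo:.3f}, {hi:.3f}]")
```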
@anuragnegi9636 5 months ago
​@@zachmanifold Such a good example☝
@yuanliu1290 2 years ago
Thanks for the great video! One question on delta (the business requirement): what if there's no minimum required delta value, e.g. maybe the cost of the test is extremely low and we just want to see if any more revenue can be generated? How should we estimate the sample size then?
@zachmanifold 1 year ago
For this case where you don't have a specific delta in mind, I would probably target a specific margin of error for both groups. In the case where both groups have the same size, this reduces to one calculation. For example, maybe you want to target a margin of error of 0.3 (i.e., point estimate +/- 0.3); then you can use the formula for the variance of the sampling distribution to determine the appropriate sample size to reach that margin of error.
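A small sketch of the sizing approach described above, assuming a 95% confidence level and sigma = 30 estimated from historical data (both assumptions, not numbers from the video):

```python
# Choose n so that each group's margin of error on the mean is at most 0.3.
from math import ceil
from scipy import stats

sigma, target_moe, alpha = 30.0, 0.3, 0.05
z = stats.norm.ppf(1 - alpha / 2)

n = ceil((z * sigma / target_moe) ** 2)   # from MOE = z * sigma / sqrt(n)
print(f"Users needed per group: {n}")
```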
@ajitkirpekar4251 2 years ago
Ahh...AB testing. Depending on your career path, you may not have experienced these kinds of challenges since you tackled them in grad school. Thanks for the videos.
@emma_ding 1 year ago
You are welcome Ajit! I am glad you are finding them useful!
@ruthrugezhao862 2 years ago
IMO it's more natural to choose the randomization unit (randomize by user) before the metric (avg revenue per user).
@kandulareddy5394 2 years ago
Just a question on the final formula, sample size = 16(sigma)^2/(delta)^2. I get that for the delta part we can determine the minimal effect we want to measure, and it can be set by the business. But without knowing the sample size, how can we get the sample SD (sigma)?
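One way to see that this is not a chicken-and-egg problem: sigma is estimated from pre-experiment (historical) data, so it is available before the sample size is chosen. A minimal sketch with made-up historical values:

```python
# Estimate sigma from historical revenue-per-user data, then apply the
# rule-of-thumb formula n = 16 * sigma^2 / delta^2 from the video.
import numpy as np

historical_rev_per_user = np.array([0.0, 0.0, 12.5, 40.0, 3.2, 0.0, 75.9, 8.4])  # toy data
sigma = historical_rev_per_user.std(ddof=1)

delta = 2.0                              # minimum effect the business cares about
n_per_group = 16 * sigma**2 / delta**2
print(f"sigma ≈ {sigma:.1f}, n per group ≈ {n_per_group:.0f}")
```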
@AlirezaAminian 11 months ago
At the end of the video, the p-value for Treatment I vs Control is 0.055, which is larger than 5%. So statistically it should be significant, while practically it may not be. In the video, you mentioned otherwise. Can you please advise?
@tinawang1291 1 year ago
Need some help understanding the sample size calculation. It seems to me the MDE ($2/user) is missing information about the time period. Is it $2/user per day, per month, or per testing period? If it's per testing period, then how can we use this MDE to calculate how many days we need to run the test?
@hongyuweng7786 3 years ago
Hi Emma, thanks for your inspiring and helpful explanation of A/B testing. I'm confused about the result of Treatment II vs. the control group. It seems like Treatment II is both practically and statistically significant. Why would we still want to run more tests before making the final decision? Is the follow-up test like a fine-tuned model that gives us better design details and results?
@emma_ding 3 years ago
The C.I. of Treatment II overlaps with the practical significance boundary, meaning it's likely NOT practically significant. The follow-up test is to get more users to increase the power -- either keep running the ongoing experiment or rerun with more users.
@qietang2701 3 years ago
@@emma_ding Then why is Treatment I likely to be practically significant? The C.I. of Treatment I also overlaps with the practical significance boundary.
@kellykuang6122 3 years ago
From my understanding, ideally we want to launch a change if it is (1) statistically significant and (2) practically significant. For Treatment I, the point estimate, 2.45, is greater than 2, indicating the change may be practically significant, but we are not sure since the confidence interval [-0.1, 5] overlaps the practical significance boundary [-2, 2]. Most importantly, the p-value = 0.055 indicates that it's not statistically significant. Since the change may be practically significant but not statistically significant, we are not sure about the launch, so we want to increase the power of the test to see if the change becomes statistically significant, and also to observe whether the CI moves outside the practical significance boundary. For Treatment II, unlike Treatment I, the p-value = 0.0001, indicating the change is statistically significant. However, the CI still overlaps the practical significance boundary. We know there's a positive change, but we're not sure whether such a change meets the business goal. Is my understanding correct? Comments and feedback welcome!
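A small sketch of this decision logic in Python; the p-values, CIs, and the $2 boundary come from the numbers quoted in this thread, except the Treatment II interval, which is assumed purely for illustration.

```python
# Classify a launch decision from the p-value, confidence interval for the
# revenue lift, and the practical significance boundary.
def launch_decision(p_value, ci, boundary=2.0, alpha=0.05):
    stat_sig = p_value < alpha
    ci_low, ci_high = ci
    if stat_sig and ci_low > boundary:
        return "launch: statistically and practically significant"
    if stat_sig and ci_high > boundary:
        return "statistically significant, practical significance unclear -> more power / judgment call"
    if not stat_sig and ci_high > boundary:
        return "not statistically significant but potentially practical -> rerun with more power"
    return "do not launch"

print("Treatment I: ", launch_decision(0.055, (-0.1, 5)))
print("Treatment II:", launch_decision(0.0001, (0.5, 4)))  # Treatment II CI is assumed
```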
@scottqian7687 3 years ago
@@qietang2701 I was also confused by this. I felt Treatment I is NOT practically significant if we use the criteria mentioned later. I tried to convince myself that maybe Emma meant to bring in the CI as a condition only for Treatment II.
@sinaamininiaki3995 3 years ago
your videos are the best
@nikhilsahay895 2 years ago
Great video! I didn't understand why Treatment 2 was practically insignificant. The p-value was less than 0.05 and the point estimate was also greater than $2. What am I missing?
@ramavishwanathan7214 2 years ago
The lower bound of the CI was less than 2. For practical significance, in this case, the lower bound of the CI had to be at least equal to 2.
@nikhilsahay895 2 years ago
@@ramavishwanathan7214 Then what's the difference between statistical and practical significance? If the lower bound were at least 2, then it would be statistically significant too.
@user-hw8gx9vh5v 11 months ago
Hi @emma_ding , thanks for the video. I have an A/B test with two groups, treatment and control, but I analyze the statistical significance of multiple subgroups within these groups, such as different shift segments throughout the day: morning shift from 9 am to 12 pm, lunchtime shift from 12 pm to 3 pm, snack time shift from 3 pm to 6 pm, and dinner time shift after 6 pm. If I perform an A/B test by comparing the results of control and variant within each of these subgroups, should I conduct an ANOVA, or can I still use a t-test to compare means with a correction, similar to multiple A/B testing (A/B/n tests)?
@jasminbogatinovski8391 10 months ago
ANOVA will do the trick. You are using one independent categorical variable to analyze the dependent variable. In case the normality and homoscedasticity assumptions are violated, consider using the Kruskal-Wallis test.
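A minimal sketch of that suggestion in Python, with made-up revenue arrays standing in for the four shift segments:

```python
# One-way ANOVA across shift segments, with Kruskal-Wallis as the fallback
# when normality/homoscedasticity look doubtful. Data below is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
morning, lunch, snack, dinner = (rng.normal(loc=m, scale=5, size=200)
                                 for m in (20, 25, 22, 27))

f_stat, p_anova = stats.f_oneway(morning, lunch, snack, dinner)
h_stat, p_kw = stats.kruskal(morning, lunch, snack, dinner)
print(f"ANOVA p = {p_anova:.4f}, Kruskal-Wallis p = {p_kw:.4f}")
```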
@user-hw8gx9vh5v 9 months ago
@@jasminbogatinovski8391 thank you!
@haiyentruong1510 2 years ago
Hi Emma, for both treatments the point estimates are larger than the practical significance boundary, so why is Treatment 1 likely to be practically significant and Treatment 2 not? I am not quite clear on that part. Hope you can clarify. Thank you.
@codylian9009 2 years ago
For Treatment 1: the difference between treatment and control is $2.45, which is more than the practical boundary of $2, but this is only an estimate based on the sample and does not incorporate the margin of error, so it's more reliable to look at the CI, which in this case is [-0.1, 5]. This CI suggests we could observe anything from an average change of -$0.1 (i.e., a loss of revenue) all the way up to an average increase of $5, so it is likely not practically significant. However, if the minimum of the CI were above 2 (e.g., [3, 6]), we would say it is practically significant. Thus, Treatment 1 is neither statistically nor practically significant. For Treatment 2: using the same logic as above, Treatment 2 should be considered statistically significant but not practically significant. I'm a new learner as well, so feel free to correct me.
@moashtari7619 2 years ago
Hi Emma, I love your videos: short, clear, and to the point. However, I tend not to agree with the conclusion you made. No point-estimate comparison can establish either statistical or practical significance; in both cases, the CI needs to be used. Generally, if the CI overlaps a value, chances are high that the data are consistent with that value. For TG1, the CI includes both the control value of "0" and the practical significance boundary of "2", so it is NOT significant in either sense. For TG2, the CI doesn't cover "0" but covers "2", so it is statistically significant but practically NOT significant.
@gbting1988 2 years ago
Hi, can somebody explain why Treatment 2 is not practically significant? 2.25 > 2.
@adachen5319 2 years ago
Thanks Emma! I have one question: your formula for the sample size only contains the difference between the control and treatment groups and the standard deviation. You said that if the significance level is set to 2.5%, the sample size will be larger. How much larger? Double or triple? You didn't say clearly. Thanks.
@emma_ding 2 years ago
Hey, you would need to find the corresponding z-score to calculate it. I hope this answers your question. Thanks!
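For a concrete answer to "how much larger", here is a sketch using the standard two-sample formula n = 2(z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2; sigma and delta below are placeholders. At alpha = 5% this reproduces the 16*sigma^2/delta^2 rule of thumb, and at alpha = 2.5% the requirement grows by roughly 20%, not double.

```python
# Sample size per group for two significance levels at 80% power.
from scipy import stats

sigma, delta, power = 30.0, 2.0, 0.8
z_beta = stats.norm.ppf(power)

for alpha in (0.05, 0.025):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    print(f"alpha = {alpha}: n per group ≈ {n:.0f}")
```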
@ningxinhuang1924 2 years ago
Hi Emma do you still provide mock interview services? I checked your website and they are marked sold-out.
@emma_ding 2 years ago
Hi Ningxin, currently all my services are sold out. But I will soon launch my masterclass which will introduce you to my full course, and that includes mock interviews in the cohort!
@jushkunjuret4386 1 year ago
Shouldn't the selection of the segment of users fall under step 2 (experiment design)? In the video, it still falls under the prerequisites.
@liuauto 2 years ago
I always wondered where the delta comes from when computing the sample size. It looked like a chicken-and-egg problem to me. Now I realize that delta is actually tied to the practical significance level, which comes from the business requirement.
@GrantKing-px2bn 1 year ago
Thank you! This wasn't clicking for me
@viviandoggy07 1 year ago
Why the gradual ramp-up?
@PriyankaSingh 2 years ago
What is practical significance?
@modhua4497 3 years ago
Hi, how can we ensure that the assignment of users to each group is carried out randomly? Thanks
@emma_ding 3 years ago
You can use hypothesis testing: t-test or chi-squared test.
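As a concrete example of such a check (often called a sample ratio mismatch, or SRM, check), here is a sketch with made-up assignment counts; a very small p-value would suggest the split deviates from the intended 1:1:1.

```python
# Chi-squared goodness-of-fit test comparing observed group sizes against
# the intended equal split across control and the two treatments.
from scipy import stats

observed = [10_050, 9_980, 9_895]       # control, Treatment I, Treatment II (made up)
expected = [sum(observed) / 3] * 3      # intended 1:1:1 split
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```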
@modhua4497 3 years ago
@@emma_ding For an A/B test, do web pages A and B run on the same day, or do we run A today and B tomorrow, and so on? Do you have a video showing the entire process, plus who the key parties involved in an A/B test are? Thanks
@Rapha_Carpio 2 years ago
Have you ever done any A/B testing on YouTube thumbnails?
@frankchen5093 3 years ago
Eventually we reach a point where you've listed all of your t-shirts! 🤣 This looks great on you! (Wait, am I watching a tech video??)
@dydx3741 3 years ago
the more i watch you...more i fall in love with you 😓😭