FAQ:
1. 7:46 It should be 10 false positives in 200 metrics rather than 1, thanks Ayank for pointing it out!
2. Running one A/B test with 10 variants vs. running 10 A/B tests: testing 10 variants means you have 1 control and 9 treatments. For example, you want to test 10 colors of a button, so each group of users sees a different color. This is different from 10 A/B tests (each with 2 variants). For the 10-color example, you could run 10 A/B tests, each with two variants (1 control and 1 treatment), but it's less efficient. This article may help you understand the multiple testing concept: home.uchicago.edu/amshaikh/webfiles/palgrave.pdf
3. 5:50 Probability of "no false positive": for details of how it's computed as (1 - alpha)^n, you can read more at home.uchicago.edu/amshaikh/webfiles/palgrave.pdf
4. 12:20 In two-sided markets the treatment effect would be overestimated. Why is that? For example, if a small group of Uber users receives incentives to take more rides, there will be enough drivers to accommodate the additional demand. However, if the incentive extends to all users, it's likely there will not be enough drivers to meet the huge increase in demand (in the short term). Therefore, the treatment effect would likely be overestimated.
Feel free to ask questions below. Your questions may help others as well! If you have specific questions in your job search, feel free to reach out to me here: data-interview-questions.web.app/
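To make items 1-3 concrete, here is a small Python sketch of the arithmetic (alpha = 0.05 and the comparison counts are illustrative assumptions, not figures from the video):

```python
alpha = 0.05  # per-comparison false positive rate (Type I error)

for n in (1, 3, 9, 200):
    p_no_fp = (1 - alpha) ** n     # P(no false positive across n comparisons)
    p_any_fp = 1 - p_no_fp         # P(at least one false positive)
    expected_fp = alpha * n        # expected number of false positives
    bonferroni = alpha / n         # one common per-comparison correction
    print(f"n={n:3d}  P(>=1 FP)={p_any_fp:.3f}  "
          f"E[FP]={expected_fp:4.1f}  Bonferroni alpha={bonferroni:.4f}")
```

With n = 200, the expected count is 200 * 0.05 = 10 false positives, which is where the correction in item 1 comes from.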
@rachitsingh3299 · 4 years ago
5:50 Why is .05 subtracted? I understand the "no false positive" part, but why .05 specifically?
@jasonchen3062 · 4 years ago
@@rachitsingh3299 A 5% Type I error rate is a commonly used value.
@leoyuanluo · 3 years ago
Hey Emma, at 12:06 you said, "...a new product that attracts more drivers in the treatment group...". Is the objective of the treatment to attract more drivers, or to make Uber users request more rides?
@sitongchen6688 · 3 years ago
Hi Emma, thanks for the great sharing! Regarding point 4 above, I feel this is a comparison between pre- and post-launch of a new feature. What about bias during the A/B test between the control and treatment groups? I think that should also overestimate the true treatment effect, since there will be fewer available drivers for the control group, which will cause fewer completed rides than in a normal scenario for control-group riders.
@oliverxu5134 · 2 years ago
For the false positive rate, I want to know how you can tell that a rejection is a false positive. Unlike classification, where we know the true label and the prediction and therefore whether a prediction is a false positive, in this case we don't know whether the null hypothesis is true. So how do we know a rejection is a false positive? Besides, for each rejection, should we use the original criterion (p = 0.05) to reject?
@lexichen4131 · 3 years ago
These 16 minutes saved me at least 3 hours, thanks so much!
@abtestingvideos2259 · 4 years ago
This is more helpful than a paid A/B testing course on Udemy! Emma, you are so awesome!
@klimmy. · 3 years ago
Hey Emma, thank you, that's really helpful! Please note, for the multiple testing problem there is a common confusion between the p-value and the false positive risk, and what you calculated at 5:58 is, I believe, not exactly a false positive rate. They are related, but not the same (for reference, see pages 41 and 186 of Trustworthy Online Controlled Experiments, or the article "A dirty dozen: twelve p-value misconceptions"). False positives depend on the p-value and on the prior belief in the hypothesis. This example helped me: if you are trying to convert steel to gold, you may get p-value = 0.05 in the experiment. But our prior belief is that this is chemically impossible, so 100% of rejections will be false, i.e., the false positive risk is 1.00 for our experiment (not 0.05). In probability terms (H0 means the null hypothesis is true, D means the data observed):
False positive risk = P(H0 | D)
p-value = P(D | H0) by definition
They are related through Bayes' rule: P(H0 | D) = P(D | H0) * P(H0) / P(D)
Hope that'd help :)
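A quick simulation of this distinction (the prior, effect size, and sample sizes below are made-up numbers): among statistically significant results, the fraction that are false positives depends on how often tested ideas are truly null, not just on alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_experiments, n_per_group = 0.05, 20_000, 200
p_h0_true = 0.9     # prior: 90% of tested ideas have no real effect (assumed)
effect = 0.3        # true effect size when H0 is false (assumed)

false_pos = true_pos = 0
for _ in range(n_experiments):
    h0_true = rng.random() < p_h0_true
    shift = 0.0 if h0_true else effect
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(shift, 1.0, n_per_group)
    _, p = stats.ttest_ind(control, treatment)
    if p < alpha:                  # a "significant" result
        false_pos += h0_true       # rejection with a true null: false positive
        true_pos += not h0_true

print(false_pos / (false_pos + true_pos))   # ~0.3, far above alpha = 0.05
```

With 90% true nulls, roughly a third of rejections are false, even though every single test uses alpha = 0.05.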
@goodjuju2132 · 4 years ago
I was really struggling with A/B testing. This video + your friend Kelly's post just helped me ace an interview on it! You are a treasure
@jeoffleonora4612 · 4 years ago
This is the best ab testing video. Period.
@kellypeng17 · 4 years ago
Very comprehensive content! Honored to be mentioned in Emma’s video! 😄😄😊
@tattwadarshipanda6029 · 4 years ago
You both are an inspiration.
@1-person-startup · 3 years ago
this channel is a goldmine
@zenofall4455 · 4 years ago
Emma, your channel is brilliant. Thanks for creating this content. I had a quick follow-up question. Let's say we make a small format change on posts at FB and want to measure whether it has any effect on user interaction. We choose the metric #UsersWhoEngagedInAction / #TotalUsers. Based on your A/B testing video, where you used the approximate formula N = 16 * var / d^2 to determine sample size: for a binomially distributed metric like the one we chose, var = p * (1 - p), so if p = 0.2 and dmin = 2%, the sample size comes to ~6400. For a big company like FB with 2.5B DAU, that's approximately 30K users active per minute (assumption: ignoring any other splitting of users by characteristics or time of day). If we decide to use only 1% of active users per minute (30K * 1% = 300) and split them into two groups of 150 each, the minimum number of samples would be collected in 21 minutes (6400 / 300). Is that correct? Are experiment durations really this short for a problem like this on a high-traffic platform?
@emma_ding · 4 years ago
You are right on the math. But in reality, companies don't assign all users to either the control or treatment group of a single test. That's due to a few reasons: 1. They may run hundreds (if not thousands) of experiments in parallel (especially at companies such as FB), so each test doesn't get that many users. 2. In practice, it's more common to use a "ramping" process to control risk rather than splitting all users into control or treatment at once, so the duration will be longer than the calculated value.
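For anyone following along, the back-of-envelope arithmetic in this thread looks like this in code (figures come from the comment above; note that conventions differ on whether N counts one group or both):

```python
# Sample size and duration arithmetic from the thread above (illustrative).
p = 0.2                        # baseline rate of the Bernoulli metric
var = p * (1 - p)              # variance of a Bernoulli(p) metric
dmin = 0.02                    # minimum detectable effect (absolute)

n = 16 * var / dmin**2         # rule-of-thumb sample size -> 6400.0
users_per_min = 30_000 * 0.01  # 1% of ~30K active users per minute
minutes = n / users_per_min    # ~21 min if n is the total across both groups
print(n, round(minutes, 1))
```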
@lanaherman · 3 years ago
Why did you take var = p * (1 - p) instead of var = n * p * (1 - p), and why dmin = 2%?
@Theartsygalslays · 3 years ago
So well articulated and enlightening! This is the vocabulary I wish I had to explain A/B testing stats to less technical folks in the past. Thank you!
@emma_ding · 3 years ago
Thank you for your kind words Veronica! :)
@santoshbiswal6567 · 2 years ago
Thanks Emma for putting this up. One question: if we want to compare total revenue/acquisition of the test and control groups, what test (z-test, chi-square, etc.) can be used to test the hypothesis? Population size > 1M.
@linhe5896 · 4 years ago
I enjoyed this one a lot, Emma. You are becoming a pro at YouTube content and style. You show more facial expression => user engagement. The second part I like is how relevant it is to real interview questions. Please keep going, and perhaps consider a case study combining product sense and A/B testing as a future topic.
@jfjoubertquebec · 2 years ago
Subscribed, liked. Finally, someone who talks like an adult. Thank you for your professionalism!
@emma_ding · 2 years ago
Thank you JJ!
@karundeep07 · 3 years ago
Hey Emma, one more quick question: at 3:50, when we are calculating the sample size, you said we can get the variance from the sample. Just wondering how we can get the variance while we are still in the design phase of the A/B test, before we have run the experiment and have any sample. How do we get the sample variance? Please help me here as well.
@tejashshah5202 · 2 years ago
Hi @Karundeep Yadav, did you find the answer to your question? Would love to hear it in that case. I had the same question too!
@cl2hanovastar · 3 years ago
At 7:43, what does "200 metrics" mean? According to the definition of FDR, it should be 200 rejected null hypotheses, not 200 tests. Could you please clarify?
@diegozpulido · 3 years ago
Hi Emma. Thank you very much for your videos. Thanks to them I got a Senior Data Scientist position at Facebook. I will forever be thankful for your exceedingly good work.
@emma_ding · 3 years ago
Congrats! I'm so glad to hear it, best of luck with your new job!
@aspark47 · 4 years ago
Awesome content. I appreciate the structured walk-through of potential problems in designing A/B testing. I also like the idea of summarizing "trustworthy online controlled experiments." Looking forward to it!
@tech-n-data · 4 months ago
Quality AB video, thank you.
@taylorlee8196 · 3 years ago
Best video ever! Very organized and well structured! Looking forward to seeing more!
@ceciliaxu · 3 years ago
This is very helpful. Your voice is like one of my teachers at Bittiger. Her name is also Emma. 😊😊
@ayankgupta4796 · 4 years ago
7:46 Should it not be 10 false positives in 200 metrics? Am I missing something?
@emma_ding · 4 years ago
Thanks for pointing it out!
@stellaying5483 · 3 years ago
Thanks. Had the same question.
@rogerzhang6296 · 3 years ago
same question here
@yihongsui4525 · 1 year ago
Hey Emma, thanks so much for the great video! 9:34 When the test is already running and you want to deal with the novelty and primacy effects, would it be better to compare "first-time users in treatment" vs. "first-time users in control"? Or even compare "first-time in treatment vs. first-time in control" against "old in treatment vs. old in control"?
@afridmondal3454 · 2 years ago
Amazing Explanation! Loved it ☺
@xingchenwang1471 · 4 years ago
I just read the summary article by Kelly a few days ago
@jaysun2654 · 3 years ago
I found a typo at 8:23: the word "lager" should be "larger".
@zhefeijin9627 · 4 years ago
Hi Emma. One more such useful video!! I have a question about splitting the control and treatment groups by cluster. I know that clustering by geo-location can introduce some selection bias. For example, we do not know if a feature works in the U.S. when we test it in Spain. That's why Facebook and LinkedIn build clusters according to the social graph. My question is: if we randomly take some of these clusters (social graphs) for testing, will that also introduce selection bias? Thank you so much!
@alanzhu7538 · 3 years ago
Love the content! Keep going!
@SerenaKong · 1 year ago
Thanks for sharing these videos. They are really clear and helpful! I have a question: how can we know if there is a spillover effect between the control and treatment groups? Is there any way to detect it?
@MrBlackitalian · 3 years ago
Thank you so much for the resources!!
@hameddadgour · 2 years ago
Great content! Thank you for sharing.
@carloschavez9740 · 3 years ago
I've read a lot of articles, and this video is amazing.
@jonathanloganmoran · 3 years ago
Fantastic video, thank you, Emma, for your help! Just an FYI, you forgot to reference LinkedIn's ego-cluster paper in the description (14:45).
@emma_ding · 3 years ago
I added a link to the paper in the description. Thanks!
@mussdroid · 4 years ago
I want to be a data scientist. Emma rocks the industry 🙏
@iOSGamingDynasties · 3 years ago
Great video, Emma, some of the best A/B testing material I've seen. However, I have some questions. When we say sample size, does it mean control + treatment groups combined? I read somewhere that it is just the number of participants in a single group. Also, why, when we calculate the time it takes to run an A/B test, do we use the formula (sample size / # of users in a group)? Does "group" here mean control/treatment, or just a batch of users we show the experiment to at a single time? Do you think it's a good idea to expose all users at the same time when the required sample size is small? Thanks!
@tinos0330 · 9 months ago
Wow, it's very informative, Emma!
@jieyuwang5120 · 3 years ago
Really great video! Thanks for making it available to everyone!
@shelllu6888 · 3 years ago
Hey Emma, thanks a lot for creating the video. TBH this is the most applicable A/B testing video I've watched on YouTube! Great job, and thanks for helping the data science community grow. 1. A quick question on determining the number of days to run an A/B test: you mentioned dividing the sample size by the number of users in each group. If we have multiple groups with unequal numbers of users, how do we decide the number of days to run the test? 2. About FDR: I'm still a bit confused by the definition. Why does the formula involve an expectation? Is FDR a random variable? (If I'm lagging far behind, could you throw me a link so that I can read more and catch up?) Thanks so much again!
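Partially answering my own question 2 after some reading: in any one batch of tests, the ratio of false discoveries V to all discoveries R is itself a random quantity, so FDR is defined as its expectation, FDR = E[V / max(R, 1)]. A small simulation makes this concrete (the true/false null split and the Beta-shaped p-values under the alternative are modeling assumptions, not from the video):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_metrics, n_batches = 0.05, 200, 5_000
n_null = 180                      # metrics with no real effect (assumption)

ratios = []
for _ in range(n_batches):
    p_null = rng.random(n_null)                      # p-values uniform under H0
    p_alt = rng.beta(0.1, 1.0, n_metrics - n_null)   # skewed toward 0 under H1
    v = (p_null < alpha).sum()                       # false discoveries
    r = v + (p_alt < alpha).sum()                    # all discoveries
    ratios.append(v / max(r, 1))                     # V/R is random per batch

print(np.mean(ratios))            # Monte Carlo estimate of FDR = E[V/max(R,1)]
```

Each batch gives a different V/R; the printed number is its average over batches, which is the quantity FDR procedures control.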
@minma1987 · 3 years ago
This was very helpful, thank you!
@lisawenyingliu3801 · 2 years ago
Hi Emma, thanks a lot for making these high-quality tutorial videos, very helpful. But they are hard for me to understand because I don't have the basic knowledge. Can I ask whether you have any books to recommend so that I can better understand your videos?
@emma_ding · 2 years ago
Hi Lisa, please check out this blog! towardsdatascience.com/how-i-got-4-data-science-offers-and-doubled-my-income-2-months-after-being-laid-off-b3b6d2de6938#6f86
@littlemida · 4 years ago
Great! I am also reading the book you recommended. Looking forward to the next video.
@hasantao · 2 years ago
Very well done.
@Han-ve8uh · 4 years ago
At 5:50 it shows (1 - 0.05)^3 for 3 groups (I assume that means variants too), so is the formula for 2 groups (1 - 0.05)^2? But this seems wrong, because "no false positive" for 2 groups should just be 0.95? Something confusing here is the relationship between the number of tests and the number of variants within a test; I'm not sure if these are the same thing. At 5:30 I interpret it as 2 variants in a single test, then suddenly at 5:50 the word "variant" disappears and changes to 3 groups, making me think it's 3 variants in a single test, but it also looks like 3 tests, each containing 1 group/variant plus the "no change" null group.
@emma_ding · 4 years ago
Sorry for the confusion, I should've made it clearer. Group refers to the treatment group, so 3 groups at 5:50 means there are 4 variants in total. The multiple testing problem is about more than two variants in a single test; it does not relate to multiple A/B tests (each with two variants). This may help you understand the concept better: home.uchicago.edu/amshaikh/webfiles/palgrave.pdf "But this seems wrong because no false positive for 2 groups should just be 0.95?" - Why? If we have 2 variants (i.e., one control and one treatment), the false positive rate (Type I error or significance level) is exactly 0.05, so the probability of seeing no false positive is 0.95.
@allison-hd1fg · 3 years ago
Is minimum detectable effect the same thing as practical significance?
@judyhe686 · 3 years ago
Hi Emma, thanks for this video, it's super helpful! I have a question about the ego-network randomization used to solve the network effect. I don't understand how it works: even if a user is not assigned the feature, aren't they still likely to be affected by users in the treatment group when the effect spills over? Can you elaborate on that? Thanks!
@yidanhu7889 · 3 years ago
Hi Emma, I do not understand ego-network randomization. What is the difference between it and the "create network effect" method? I also don't understand your sentence in the video, "meaning the effect of my immediate connections' treatment on me". Could you please help? The paper is too long. Thank you!!
@amneymnr6455 · 3 years ago
Thanks Emma! I got this question in a previous interview and would love your thoughts: "What methods can you use when an A/B test cannot be or has not been conducted?"
@emma_ding · 3 years ago
Ideas could be comparing before and after, or implementing and comparing variants in different geo-regions (or based on other user segmentation methods). You can google and explore more ideas. Depending on the problem, the downside of not using A/B testing is that you may need more effort on analysis and/or bias correction.
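One standard way to combine "before vs. after" with a comparison region is a difference-in-differences estimate. That technique isn't covered in the video, and the numbers below are purely illustrative, but it sketches the idea:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical per-user metric values, before/after launch, in two regions:
treated_before = rng.normal(10.0, 2.0, 1_000)
treated_after  = rng.normal(10.8, 2.0, 1_000)   # launch effect + seasonal drift
control_before = rng.normal(10.0, 2.0, 1_000)
control_after  = rng.normal(10.3, 2.0, 1_000)   # seasonal drift only

# Subtracting the control region's drift removes the shared time trend:
did = ((treated_after.mean() - treated_before.mean())
       - (control_after.mean() - control_before.mean()))
print(f"diff-in-diff estimate of the launch effect: {did:.2f}")  # ~0.5
```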
@zzzs5545 · 4 years ago
Great! Looking forward to more A/B testing content.
@emma_ding · 3 years ago
More to come!
@RobertoAnzaldua · 2 years ago
Great video, thanks for posting :D
@emma_ding · 2 years ago
My pleasure! So happy it was helpful for you Roberto!
@plttji2615 · 3 years ago
Hi Emma, thank you for the video. What if we want to decide between two features? How can we design the A/B test for that, or is it multivariate testing? Thank you.
@alanzhu7538 · 3 years ago
14:40 When you talked about splitting the clusters, do you mean randomly splitting people within a cluster into treatment and control groups?
@nipundiwan · 3 years ago
Let's say there are a total of n clusters in the entire sample. You randomly assign n/2 clusters to the treatment group and the remaining n/2 clusters to the control group.
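In code, that assignment might look like this (the cluster count and the 50/50 split are illustrative assumptions):

```python
import random

random.seed(0)
cluster_ids = list(range(100))     # e.g., 100 social-graph clusters (assumed)
random.shuffle(cluster_ids)
treatment_clusters = set(cluster_ids[: len(cluster_ids) // 2])

def assign(user_cluster: int) -> str:
    """Every user inherits the variant of their cluster."""
    return "treatment" if user_cluster in treatment_clusters else "control"

print(assign(7), assign(42))
```

Randomizing whole clusters rather than individual users keeps connected users in the same variant, which is the point of cluster-based randomization.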
@LauraLigmail · 3 years ago
Hey Emma, for the 5% FDR, would you mind helping me understand how you got to "at least 1 false positive for 200 metrics"?
@jessesong9546 · 3 years ago
I think she meant that the probability of observing at least 1 false positive among 200 metrics is .05. Hope this makes sense.
@XuJiBoY · 4 years ago
Hi Emma, thank you very much for the great, informative video! I have a question: at 12:20 you mentioned that in two-sided markets the treatment effect would be overestimated. May I ask why that is? I can't quite figure it out.
@emma_ding · 4 years ago
For example, if a small group of Uber users receives incentives to take more rides, there will be enough drivers to accommodate the additional demand. However, if the incentive extends to all users, it's likely there will not be enough drivers to meet the huge increase in demand (in the short term). Therefore, the treatment effect would likely be overestimated.
@XuJiBoY · 4 years ago
@@emma_ding Thank you very much for the explanation. This makes sense. So it's the resource competition in the population of all users, which was not an issue in the sub-population of the experiment. I guess it's probably assumed that the treatment effect focuses on the increase in successful ride transactions, instead of pure ride demand from users (regardless of fulfillment of the demand).
@haowu6918 · 3 years ago
How do we estimate the variance from the dataset?
@goelnikhils · 2 years ago
Thanks a lot. Amazing content
@neeru1196 · 3 years ago
It would help if you explained the variables and talked about "parameters" in detail. Thanks for the video!
@emma_ding · 3 years ago
Noted! Thanks for the feedback!
@halflearned2190 · 4 years ago
Excellent content, thanks!
@sophial.4488 · 3 years ago
Quality content in each and every video. Emma, you are great at condensing information into a digestible format.
@yingyingxu9926 · 4 years ago
Question: when you talk about the multiple testing problem, does that require the exact same test among 10 groups, like 10 A/A tests? If we have 10 different variants, can't we think of it as 10 different A/B tests conducted simultaneously? Am I missing something here?
@emma_ding · 4 years ago
No, multiple testing means you have 10 variants, i.e., 1 control and 9 treatments. For example, you want to test 10 colors of a button, so each group of users sees a different color. It's different from 10 A/B tests (each with 2 variants). For the 10-color example, you could run 10 A/B tests, each with two variants (1 control and 1 treatment), but there's no need to do it.
@nope4881 · 3 years ago
Hi, great topic! I have a question. You mentioned that the difference between treatment and control, delta, can be obtained from the MDE. How do we get it? How do we estimate delta from the MDE? Also, can you show a worked example of the sample size = 16 * sample variance / delta^2 formula: obtaining delta from the MDE and getting a value for the sample size? Hope you understand the question :)
@emma_ding · 3 years ago
You can refer to this video kzbin.info/www/bejne/gHakpKKLp71pgbM for the derivation of the sample size.
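For quick reference, the standard derivation (assuming the usual conventions of a two-sided alpha = 0.05 and 80% power; see the linked video for the exact setup) gives, per group:

$$ n \;=\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}} \;\approx\; \frac{2\,(1.96 + 0.84)^{2}\,\sigma^{2}}{\delta^{2}} \;\approx\; \frac{16\,\sigma^{2}}{\delta^{2}} $$

where sigma^2 is the variance of the metric and delta is set to the minimum detectable effect. Plugging the MDE in for delta and a pre-experiment estimate of sigma^2 gives the sample size directly.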
@rachitsingh3299 · 3 years ago
Hey Emma! Can you explain the difference between A/B testing and experimental design?
@emma_ding · 3 years ago
A/B testing is the same as an online controlled experiment.
@roshanpatnaik1902 · 2 years ago
Hi Emma, in the sample size discussion, i.e., where you mentioned that the sample size is 16 sigma squared / delta squared: is sigma squared the sample variance of the test group or the control group?
@emma_ding · 2 years ago
Hi Roshan, thank you for your question. Have you checked out my video -> kzbin.info/www/bejne/jKG3nYGIish8etE, where I discuss the basics of A/B testing? Have a look and let me know if you still have questions! Thanks for watching and sharing!
@kelseyarthur6421 · 3 years ago
Great video
@karundeep07 · 3 years ago
Thanks a lot, Emma. One quick question: at 3:45, since we haven't run the test yet, how can we get the values of sigma and delta? Delta we can get from the minimum detectable effect, but what about sigma? Please help me understand this. Thanks again.
@emma_ding · 3 years ago
Both sigma and delta are predetermined. They should be known before running the experiment.
@guancan · 2 years ago
@@emma_ding I wonder how we can know what the samples are if the sample size is not yet determined. If we don't know the samples, how can we observe the sample variance? Could you please explain further, Emma?
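A common resolution (standard practice rather than something stated in the video) is to estimate sigma^2 before the experiment from historical data on the same metric, or from an A/A test, and then plug it into the formula. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for logged historical values of the metric (hypothetical data):
historical_metric = rng.normal(10.0, 2.0, 50_000)

sigma2 = historical_metric.var(ddof=1)   # pre-experiment variance estimate
delta = 0.2                              # chosen minimum detectable effect
n_per_group = 16 * sigma2 / delta**2     # plug into the rule of thumb
print(round(n_per_group))                # ~1600 users per group
```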
@halflearned2190 · 3 years ago
Nice video, thanks!
@omid9422 · 3 years ago
Excellent
@freya_yuen · 1 year ago
Why can't I save this video to my playlist? /.\
@karencao1538 · 3 years ago
Hi Emma, one question on the sample variance used when calculating sample size: are we referring to the sample variance of the treatment group before the experiment? Just a bit confused about what we're actually referring to here...
@emma_ding · 3 years ago
The statistic we are testing is delta (the difference), so the "variance" is the variance of delta.
@rogerzhao1158 · 3 years ago
@@emma_ding Hi Emma, the video is super helpful. I have one question: if the sample variance is the variance of the delta, can we only calculate the sample size after the experiment has started and data has been collected? But shouldn't we decide the sample size before we start the experiment? I'm confused about the order and hope you can help clarify. Thank you.
@janeli2487 · 3 years ago
Hi Emma, thanks for your video, it's very comprehensive. I am wondering what you would do, or how you would communicate with PMs, if the p-value just barely misses, e.g., you got 0.051 while you set your significance level at 0.05? Thanks.
@emma_ding · 3 years ago
The situation is debatable. An option could be to run the experiment a little longer to see if the p-value changes. The bottom line is you don't want to compromise the criteria (i.e., the significance level) after seeing the results.
@janeli2487 · 3 years ago
@@emma_ding Thanks!
@nplgwnm · 10 months ago
The video was made in 2021, and I burst into laughter when "company X" was mentioned 😂 Who would have known, right? 😂😂😂
@ВоловичМихаил · 3 years ago
Brilliant!
@thegreatlazydazz · 3 years ago
I would like to say that I wholeheartedly support the idea of making a video on the book with the picture of the hippo. I am from a statistics background, but never quite understood how stats were being used in this A/B testing setting. Thanks a ton!!!!
@lingli8999 · 4 years ago
Emma, another great video, thank you! I had a question. You mentioned in this video that referral programs are usually considered long-term. I understand that referral programs for things like housing take a long time. What about other referral programs, like Uber Eats or Robinhood's new-user program with a random stock? Can those be tested with A/B testing?
@emma_ding · 4 years ago
Even Uber Eats and Robinhood referral programs still take longer compared with an instantaneous change, e.g., a feature update. You can A/B test those, but with a longer feedback loop.
@lingli8999 · 4 years ago
@@emma_ding Thanks a lot Emma!
@ARJUN-op2dh · 3 years ago
Amazing!
@tejas5872 · 4 years ago
Hey Emma, thank you for the valuable content. I've been following your channel and it's helping me understand what to expect in interviews! I just have a question: you mentioned a coding round will be conducted in the first round. Will the coding round be based on data structures (linked lists, queues, stacks, dynamic programming, etc.) or basic coding challenges like printing a palindrome? Please help.
@emma_ding · 4 years ago
Good question! This blog summarizes all the different kinds of coding interviews and I think it may clarify things: towardsdatascience.com/the-ultimate-guide-to-acing-coding-interviews-for-data-scientists-d45c99d6bddc
@Alexandra-he8ol · 3 years ago
Thank you very much🙏🏻
@nathannguyen2041 · 3 years ago
Informative video! What is the difference between A/B testing and analysis of variance (design of experiments) topics? All of these topics are essentially the same, e.g., treatments/factors, randomization, experiment design, Bonferroni/Kimball inequality, etc. Is there a particular reason why A/B testing is distinguished from the general ANOVA framework? I may have just answered my own question ("the general ANOVA framework"), but it doesn't hurt to ask someone with more education and work experience than me.
@seant7907 · 4 years ago
Emma, I don't mean any offense. Can you add subtitles to your videos? I find it hard to follow what you're saying because I am not a native English speaker myself. Thank you!!
@emma_ding · 4 years ago
Thanks for the feedback! YouTube has a subtitles function (a "CC" icon at the bottom right of the video) that may help with understanding the content. It may have some errors though; I'll try to upload subtitles as soon as I can.