FAQ:
1. 7:46 It should be 10 false positives in 200 metrics rather than 1, thanks Ayank for pointing it out!
2. Running one A/B test with 10 variants vs. running 10 A/B tests: testing 10 variants means you have 1 control and 9 treatments. For example, you want to test 10 colors of a button, so each group of users sees a different color. This is different from 10 A/B tests (each with 2 variants). For the 10-color example, you could run 10 A/B tests, each with two variants (1 control and 1 treatment), but it's less efficient. This article may help you understand the multiple testing concept: home.uchicago.edu/amshaikh/webfiles/palgrave.pdf
3. 5:50 Probability of "no false positive": for details of how it's computed as (1 - alpha)^n, you can read more at home.uchicago.edu/amshaikh/webfiles/palgrave.pdf
4. 12:20 In two-sided markets the treatment effect would be overestimated. Why is that? For example, if a small group of Uber users receives incentives to take more rides, there will be enough drivers to accommodate the additional demand. However, if the incentive extends to all users, it's likely there will not be enough drivers to meet the huge increase in demand (in the short term). Therefore, the treatment effect would likely be overestimated.
Feel free to ask questions below. Your questions may help others as well! If you have specific questions in your job search, feel free to reach out to me here: data-interview-questions.web.app/
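To make items 1-3 concrete, here is a small Python sketch of the arithmetic (alpha = 0.05 and the comparison counts are illustrative assumptions, not figures from the video):

```python
alpha = 0.05  # per-comparison false positive rate (Type I error)

for n in (1, 3, 9, 200):
    p_no_fp = (1 - alpha) ** n     # P(no false positive across n comparisons)
    p_any_fp = 1 - p_no_fp         # P(at least one false positive)
    expected_fp = alpha * n        # expected number of false positives
    bonferroni = alpha / n         # one common per-comparison correction
    print(f"n={n:3d}  P(>=1 FP)={p_any_fp:.3f}  "
          f"E[FP]={expected_fp:4.1f}  Bonferroni alpha={bonferroni:.4f}")
```

With n = 200, the expected count is 200 * 0.05 = 10 false positives, which is where the correction in item 1 comes from.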
@rachitsingh3299 · 4 years ago
5:50 Why is .05 subtracted? I understand the "no false positive" part, but why .05 specifically?
@jasonchen3062 · 4 years ago
@@rachitsingh3299 A 5% Type I error rate is a commonly used value.
@leoyuanluo · 3 years ago
Hey Emma, at 12:06 you said, "...a new product that attracts more drivers in the treatment group...". Is the objective of the treatment to attract more drivers, or to make Uber users request more rides?
@sitongchen6688 · 3 years ago
Hi Emma, thanks for the great sharing! Regarding point 4 above, I feel this is a comparison between pre- and post-launch of a new feature. What about bias during the A/B test between the control and treatment groups? I think that should also overestimate the true treatment effect, since there will be fewer available drivers for the control group, which will cause fewer completed rides than in a normal scenario for control-group riders.
@oliverxu5134 · 2 years ago
For the false positive rate, I want to know how you can tell that a rejection is a false positive. Unlike classification, where we know the true label and the prediction and therefore whether a prediction is a false positive, in this case we don't know whether the null hypothesis is true. So how do we know a rejection is a false positive? Besides, for each rejection, should we use the original criterion (p = 0.05) to reject?
@lexichen4131 · 3 years ago
These 16 minutes saved me at least 3 hours, thanks so much!
@abtestingvideos2259 · 4 years ago
This is more helpful than a paid A/B testing course on Udemy! Emma, you are so awesome!
@klimmy. · 3 years ago
Hey Emma, thank you, that's really helpful! Please note, for the multiple testing problem there is a common confusion between the p-value and the false positive risk, and what you calculated at 5:58 is, I believe, not exactly a false positive rate. They are related, but not the same (for reference, see pages 41 and 186 of Trustworthy Online Controlled Experiments, or the article "A dirty dozen: twelve p-value misconceptions"). False positives depend on the p-value and on the prior belief in the hypothesis. This example helped me: if you are trying to convert steel to gold, you may get p-value = 0.05 in the experiment. But our prior belief is that this is chemically impossible, so 100% of rejections will be false, i.e., the false positive risk is 1.00 for our experiment (not 0.05). In probability terms (H0 means the null hypothesis is true, D means the data observed):
False positive risk = P(H0 | D)
p-value = P(D | H0) by definition
They are related through Bayes' rule: P(H0 | D) = P(D | H0) * P(H0) / P(D)
Hope that'd help :)
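A quick simulation of this distinction (the prior, effect size, and sample sizes below are made-up numbers): among statistically significant results, the fraction that are false positives depends on how often tested ideas are truly null, not just on alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_experiments, n_per_group = 0.05, 20_000, 200
p_h0_true = 0.9     # prior: 90% of tested ideas have no real effect (assumed)
effect = 0.3        # true effect size when H0 is false (assumed)

false_pos = true_pos = 0
for _ in range(n_experiments):
    h0_true = rng.random() < p_h0_true
    shift = 0.0 if h0_true else effect
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(shift, 1.0, n_per_group)
    _, p = stats.ttest_ind(control, treatment)
    if p < alpha:                  # a "significant" result
        false_pos += h0_true       # rejection with a true null: false positive
        true_pos += not h0_true

print(false_pos / (false_pos + true_pos))   # ~0.3, far above alpha = 0.05
```

With 90% true nulls, roughly a third of rejections are false, even though every single test uses alpha = 0.05.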
@goodjuju2132 · 4 years ago
I was really struggling with A/B testing. This video + your friend Kelly's post just helped me ace an interview on it! You are a treasure
@jeoffleonora4612 · 4 years ago
This is the best ab testing video. Period.
@kellypeng17 · 4 years ago
Very comprehensive content! Honored to be mentioned in Emma’s video! 😄😄😊
@tattwadarshipanda6029 · 4 years ago
You both are an inspiration.
@1-person-startup · 3 years ago
this channel is a goldmine
@zenofall4455 · 4 years ago
Emma, your channel is brilliant. Thanks for creating this content. I had a quick follow-up question. Let's say we make a small format change on posts at FB and want to measure whether it has any effect on user interaction. We choose the metric #UsersWhoEngagedInAction / #TotalUsers. Based on your A/B testing video, where you used the approximate formula N = 16 * var / d^2 to determine sample size: for a binomially distributed metric like the one we chose, var = p * (1 - p), so if p = 0.2 and dmin = 2%, the sample size comes to ~6400. For a big company like FB with 2.5B DAU, that's approximately 30K users active per minute (assumption: ignoring any other splitting of users by characteristics or time of day). If we decide to use only 1% of active users per minute (30K * 1% = 300) and split them into two groups of 150 each, the minimum number of samples would be collected in 21 minutes (6400 / 300). Is that correct? Are experiment durations really this short for a problem like this on a high-traffic platform?
@emma_ding · 4 years ago
You are right on the math. But in reality, companies don't assign all users to either the control or treatment group of a single test. That's due to a few reasons: 1. They may run hundreds (if not thousands) of experiments in parallel (especially at companies such as FB), so each test doesn't get that many users. 2. In practice, it's more common to use a "ramping" process to control risk rather than splitting all users into control or treatment at once, so the duration will be longer than the calculated value.
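For anyone following along, the back-of-envelope arithmetic in this thread looks like this in code (figures come from the comment above; note that conventions differ on whether N counts one group or both):

```python
# Sample size and duration arithmetic from the thread above (illustrative).
p = 0.2                        # baseline rate of the Bernoulli metric
var = p * (1 - p)              # variance of a Bernoulli(p) metric
dmin = 0.02                    # minimum detectable effect (absolute)

n = 16 * var / dmin**2         # rule-of-thumb sample size -> 6400.0
users_per_min = 30_000 * 0.01  # 1% of ~30K active users per minute
minutes = n / users_per_min    # ~21 min if n is the total across both groups
print(n, round(minutes, 1))
```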
@lanaherman · 3 years ago
Why did you take var = p * (1 - p) instead of var = n * p * (1 - p), and why dmin = 2%?
@Theartsygalslays · 3 years ago
So well articulated and enlightening! This is the vocabulary I wish I had to explain A/B testing stats to less technical folks in the past. Thank you!
@emma_ding · 3 years ago
Thank you for your kind words Veronica! :)
@santoshbiswal6567 · 2 years ago
Thanks Emma for putting this up. One question: if we want to compare total revenue/acquisition of the test and control groups, what test (z-test, chi-square, etc.) can be used to test the hypothesis? Population size > 1M.
@linhe5896 · 4 years ago
I enjoyed this one a lot, Emma. You are becoming a pro at YouTube content and style. You show more facial expression => user engagement. The second part I like is how relevant it is to real interview questions. Please keep going, and perhaps consider a case study combining product sense and A/B testing as a future topic.
@jfjoubertquebec · 2 years ago
Subscribed, liked. Finally, someone who talks like an adult. Thank you for your professionalism!
@emma_ding · 2 years ago
Thank you JJ!
@karundeep07 · 3 years ago
Hey Emma, one more quick question: at 3:50, when we are calculating the sample size, you said we can get the variance from the sample. Just wondering how we can get the variance while we are still in the design phase of the A/B test, before we have run the experiment and have any sample. How do we get the sample variance? Please help me here as well.
@tejashshah5202 · 2 years ago
Hi @Karundeep Yadav, did you find the answer to your question? Would love to hear it in that case. I had the same question too!
@cl2hanovastar · 3 years ago
At 7:43, what does "200 metrics" mean? According to the definition of FDR, it should be 200 rejected null hypotheses, not 200 tests. Could you please clarify?
@diegozpulido · 3 years ago
Hi Emma. Thank you very much for your videos. Thanks to them I got a Senior Data Scientist position at Facebook. I will forever be thankful for your exceedingly good work.
@emma_ding · 3 years ago
Congrats! I'm so glad to hear it, best of luck with your new job!
@aspark47 · 4 years ago
Awesome content. I appreciate the structured walk-through of potential problems in designing A/B testing. I also like the idea of summarizing "trustworthy online controlled experiments." Looking forward to it!
@tech-n-data · 4 months ago
Quality AB video, thank you.
@taylorlee8196 · 3 years ago
Best video ever! Very organized and well structured! Looking forward to seeing more!
@ceciliaxu · 3 years ago
This is very helpful. Your voice is like one of my teachers at Bittiger. Her name is also Emma. 😊😊
@ayankgupta4796 · 4 years ago
7:46 Should it not be 10 false positives in 200 metrics? Am I missing something?
@emma_ding · 4 years ago
Thanks for pointing it out!
@stellaying5483 · 3 years ago
Thanks. Had the same question.
@rogerzhang6296 · 3 years ago
same question here
@yihongsui4525 · 1 year ago
Hey Emma, thanks so much for the great video! 9:34 When the test is already running and you want to deal with the novelty and primacy effects, would it be better to compare "first-time users in treatment" vs. "first-time users in control"? Or even compare "first-time in treatment vs. first-time in control" against "old in treatment vs. old in control"?
@afridmondal3454 · 2 years ago
Amazing Explanation! Loved it ☺
@xingchenwang1471 · 4 years ago
I just read the summary article by Kelly a few days ago
@jaysun2654 · 3 years ago
I found a typo at 8:23: the word "lager" should be "larger".
@zhefeijin9627 · 4 years ago
Hi Emma. One more such useful video!! I have a question about splitting the control and treatment groups by cluster. I know that clustering by geo-location can introduce some selection bias. For example, we do not know if a feature works in the U.S. when we test it in Spain. That's why Facebook and LinkedIn build clusters according to the social graph. My question is: if we randomly take some of these clusters (social graphs) for testing, will that also introduce selection bias? Thank you so much!
@alanzhu7538 · 3 years ago
Love the content! Keep going!
@SerenaKong · 1 year ago
Thanks for sharing these videos. They are really clear and helpful! I have a question: how can we know if there is a spillover effect between the control and treatment groups? Is there any way to detect it?
@MrBlackitalian · 3 years ago
Thank you so much for the resources!!
@hameddadgour · 2 years ago
Great content! Thank you for sharing.
@carloschavez9740 · 3 years ago
I've read a lot of articles, and this video is amazing.
@jonathanloganmoran · 3 years ago
Fantastic video, thank you, Emma, for your help! Just an FYI, you forgot to reference LinkedIn's ego-cluster paper in the description (14:45).
@emma_ding · 3 years ago
I added a link to the paper in the description. Thanks!
@mussdroid · 4 years ago
I want to be a data scientist. Emma rocks the industry 🙏
@iOSGamingDynasties · 3 years ago
Great video, Emma, some of the best A/B testing material I've seen. However, I have some questions. When we say sample size, does it mean control + treatment groups combined? I read somewhere that it is just the number of participants in a single group. Also, why, when we calculate the time it takes to run an A/B test, do we use the formula (sample size / # of users in a group)? Does "group" here mean control/treatment, or just a batch of users we show the experiment to at a single time? Do you think it's a good idea to expose all users at the same time when the required sample size is small? Thanks!
@tinos0330 · 9 months ago
Wow, it's very informative, Emma!
@jieyuwang5120 · 3 years ago
Really great video! Thanks for making it available to everyone!
@shelllu6888 · 3 years ago
Hey Emma, thanks a lot for creating the video. TBH this is the most applicable A/B testing video I've watched on YouTube! Great job, and thanks for helping the data science community grow. 1. A quick question on determining the number of days to run an A/B test: you mentioned dividing the sample size by the number of users in each group. If we have multiple groups with unequal numbers of users, how do we decide the number of days to run the test? 2. About FDR: I'm still a bit confused by the definition. Why does the formula involve an expectation? Is FDR a random variable? (If I'm lagging far behind, could you throw me a link so that I can read more and catch up?) Thanks so much again!
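Partially answering my own question 2 after some reading: in any one batch of tests, the ratio of false discoveries V to all discoveries R is itself a random quantity, so FDR is defined as its expectation, FDR = E[V / max(R, 1)]. A small simulation makes this concrete (the true/false null split and the Beta-shaped p-values under the alternative are modeling assumptions, not from the video):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_metrics, n_batches = 0.05, 200, 5_000
n_null = 180                      # metrics with no real effect (assumption)

ratios = []
for _ in range(n_batches):
    p_null = rng.random(n_null)                      # p-values uniform under H0
    p_alt = rng.beta(0.1, 1.0, n_metrics - n_null)   # skewed toward 0 under H1
    v = (p_null < alpha).sum()                       # false discoveries
    r = v + (p_alt < alpha).sum()                    # all discoveries
    ratios.append(v / max(r, 1))                     # V/R is random per batch

print(np.mean(ratios))            # Monte Carlo estimate of FDR = E[V/max(R,1)]
```

Each batch gives a different V/R; the printed number is its average over batches, which is the quantity FDR procedures control.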
@minma1987 · 3 years ago
This was very helpful, thank you!
@lisawenyingliu3801 · 2 years ago
Hi Emma, thanks a lot for making these high-quality tutorial videos, very helpful. But they are hard for me to understand because I don't have the basic knowledge. Can I ask whether you have any books to recommend so that I can better understand your videos?
@emma_ding · 2 years ago
Hi Lisa, please check out this blog! towardsdatascience.com/how-i-got-4-data-science-offers-and-doubled-my-income-2-months-after-being-laid-off-b3b6d2de6938#6f86
@littlemida · 4 years ago
Great! I am also reading the book you recommended. Looking forward to the next video.
@hasantao · 2 years ago
Very well done.
@Han-ve8uh · 4 years ago
At 5:50 it shows (1 - 0.05)^3 for 3 groups (I assume that means variants too), so is the formula for 2 groups (1 - 0.05)^2? But this seems wrong, because "no false positive" for 2 groups should just be 0.95? Something confusing here is the relationship between the number of tests and the number of variants within a test; I'm not sure if these are the same thing. At 5:30 I interpret it as 2 variants in a single test, then suddenly at 5:50 the word "variant" disappears and changes to 3 groups, making me think it's 3 variants in a single test, but it also looks like 3 tests, each containing 1 group/variant plus the "no change" null group.
@emma_ding · 4 years ago
Sorry for the confusion, I should've made it clearer. Group refers to the treatment group, so 3 groups at 5:50 means there are 4 variants in total. The multiple testing problem is about more than two variants in a single test; it does not relate to multiple A/B tests (each with two variants). This may help you understand the concept better: home.uchicago.edu/amshaikh/webfiles/palgrave.pdf "But this seems wrong because no false positive for 2 groups should just be 0.95?" - Why? If we have 2 variants (i.e., one control and one treatment), the false positive rate (Type I error or significance level) is exactly 0.05, so the probability of seeing no false positive is 0.95.
@allison-hd1fg · 3 years ago
Is minimum detectable effect the same thing as practical significance?
@judyhe686 · 3 years ago
Hi Emma, thanks for this video, it's super helpful! I have a question about the ego-network randomization used to solve the network effect. I don't understand how it works: even if a user is not assigned the feature, aren't they still likely to be affected by users in the treatment group when the effect spills over? Can you elaborate on that? Thanks!
@yidanhu7889 · 3 years ago
Hi Emma, I do not understand ego-network randomization. What is the difference between it and the "create network effect" method? I also don't understand your sentence in the video, "meaning the effect of my immediate connections' treatment on me". Could you please help? The paper is too long. Thank you!!
@amneymnr6455 · 3 years ago
Thanks Emma! I got this question in a previous interview and would love your thoughts: "What methods can you use when an A/B test cannot be or has not been conducted?"
@emma_ding · 3 years ago
Ideas could be comparing before and after, or implementing and comparing variants in different geo-regions (or based on other user segmentation methods). You can google and explore more ideas. Depending on the problem, the downside of not using A/B testing is that you may need more effort on analysis and/or bias correction.
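One standard way to combine "before vs. after" with a comparison region is a difference-in-differences estimate. That technique isn't covered in the video, and the numbers below are purely illustrative, but it sketches the idea:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical per-user metric values, before/after launch, in two regions:
treated_before = rng.normal(10.0, 2.0, 1_000)
treated_after  = rng.normal(10.8, 2.0, 1_000)   # launch effect + seasonal drift
control_before = rng.normal(10.0, 2.0, 1_000)
control_after  = rng.normal(10.3, 2.0, 1_000)   # seasonal drift only

# Subtracting the control region's drift removes the shared time trend:
did = ((treated_after.mean() - treated_before.mean())
       - (control_after.mean() - control_before.mean()))
print(f"diff-in-diff estimate of the launch effect: {did:.2f}")  # ~0.5
```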
@zzzs5545 · 4 years ago
Great! Looking forward to more A/B testing content.
@emma_ding · 3 years ago
More to come!
@RobertoAnzaldua · 2 years ago
Great video, thanks for posting :D
@emma_ding · 2 years ago
My pleasure! So happy it was helpful for you Roberto!
@plttji2615 · 3 years ago
Hi Emma, thank you for the video. What if we want to decide between two features? How can we design the A/B test for that, or is it multivariate testing? Thank you.
@alanzhu7538 · 3 years ago
14:40 When you talked about splitting the clusters, do you mean randomly splitting people within a cluster into treatment and control groups?
@nipundiwan · 3 years ago
Let's say there are a total of n clusters in the entire sample. You randomly assign n/2 clusters to the treatment group and the remaining n/2 clusters to the control group.
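In code, that assignment might look like this (the cluster count and the 50/50 split are illustrative assumptions):

```python
import random

random.seed(0)
cluster_ids = list(range(100))     # e.g., 100 social-graph clusters (assumed)
random.shuffle(cluster_ids)
treatment_clusters = set(cluster_ids[: len(cluster_ids) // 2])

def assign(user_cluster: int) -> str:
    """Every user inherits the variant of their cluster."""
    return "treatment" if user_cluster in treatment_clusters else "control"

print(assign(7), assign(42))
```

Randomizing whole clusters rather than individual users keeps connected users in the same variant, which is the point of cluster-based randomization.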
@LauraLigmail · 3 years ago
Hey Emma, for the 5% FDR, would you mind helping me understand how you got to "at least 1 false positive for 200 metrics"?
@jessesong9546 · 3 years ago
I think she meant that the probability of observing at least 1 false positive among 200 metrics is .05. Hope this makes sense.
@XuJiBoY · 4 years ago
Hi Emma, thank you very much for the great, informative video! I have a question: at 12:20 you mentioned that in two-sided markets the treatment effect would be overestimated. May I ask why that is? I can't quite figure it out.
@emma_ding · 4 years ago
For example, if a small group of Uber users receives incentives to take more rides, there will be enough drivers to accommodate the additional demand. However, if the incentive extends to all users, it's likely there will not be enough drivers to meet the huge increase in demand (in the short term). Therefore, the treatment effect would likely be overestimated.
@XuJiBoY · 4 years ago
@@emma_ding Thank you very much for the explanation. This makes sense. So it's the resource competition in the population of all users, which was not an issue in the sub-population of the experiment. I guess it's probably assumed that the treatment effect focuses on the increase in successful ride transactions, instead of pure ride demand from users (regardless of fulfillment of the demand).
@haowu6918 · 3 years ago
How do we estimate the variance from the dataset?
@goelnikhils · 2 years ago
Thanks a lot. Amazing content
@neeru1196 · 3 years ago
It would help if you explained the variables and talked about "parameters" in detail. Thanks for the video!
@emma_ding · 3 years ago
Noted! Thanks for the feedback!
@halflearned2190 · 4 years ago
Excellent content, thanks!
@sophial.4488 · 3 years ago
Quality content in each and every video. Emma, you are great at condensing information into a digestible format.
@yingyingxu9926 · 4 years ago
Question: when you talk about the multiple testing problem, does that require the exact same test among 10 groups, like 10 A/A tests? If we have 10 different variants, can't we think of it as 10 different A/B tests conducted simultaneously? Am I missing something here?
@emma_ding · 4 years ago
No, multiple testing means you have 10 variants, i.e., 1 control and 9 treatments. For example, you want to test 10 colors of a button, so each group of users sees a different color. It's different from 10 A/B tests (each with 2 variants). For the 10-color example, you could run 10 A/B tests, each with two variants (1 control and 1 treatment), but there's no need to do it.
@nope4881 · 3 years ago
Hi, great topic! I have a question. You mentioned that the difference between treatment and control, delta, can be obtained from the MDE. How do we get it? How do we estimate delta from the MDE? Also, can you show a worked example of the sample size = 16 * sample variance / delta^2 formula: obtaining delta from the MDE and getting a value for the sample size? Hope you understand the question :)
@emma_ding · 3 years ago
You can refer to this video kzbin.info/www/bejne/gHakpKKLp71pgbM for the derivation of the sample size.
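For quick reference, the standard derivation (assuming the usual conventions of a two-sided alpha = 0.05 and 80% power; see the linked video for the exact setup) gives, per group:

$$ n \;=\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}} \;\approx\; \frac{2\,(1.96 + 0.84)^{2}\,\sigma^{2}}{\delta^{2}} \;\approx\; \frac{16\,\sigma^{2}}{\delta^{2}} $$

where sigma^2 is the variance of the metric and delta is set to the minimum detectable effect. Plugging the MDE in for delta and a pre-experiment estimate of sigma^2 gives the sample size directly.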
@rachitsingh3299 · 3 years ago
Hey Emma! Can you explain the difference between A/B testing and experimental design?
@emma_ding · 3 years ago
A/B testing is the same as an online controlled experiment.
@roshanpatnaik1902 · 2 years ago
Hi Emma, in the sample size discussion, i.e., where you mentioned that the sample size is 16 sigma squared / delta squared: is sigma squared the sample variance of the test group or the control group?
@emma_ding · 2 years ago
Hi Roshan, thank you for your question. Have you checked out my video -> kzbin.info/www/bejne/jKG3nYGIish8etE, where I discuss the basics of A/B testing? Have a look and let me know if you still have questions! Thanks for watching and sharing!
@kelseyarthur6421 · 3 years ago
Great video
@karundeep07 · 3 years ago
Thanks a lot, Emma. One quick question: at 3:45, since we haven't run the test yet, how can we get the values of sigma and delta? Delta we can get from the minimum detectable effect, but what about sigma? Please help me understand this. Thanks again.
@emma_ding · 3 years ago
Both sigma and delta are predetermined. They should be known before running the experiment.
@guancan · 2 years ago
@@emma_ding I wonder how we can know what the samples are if the sample size is not yet determined. If we don't know the samples, how can we observe the sample variance? Could you please explain further, Emma?
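A common resolution (standard practice rather than something stated in the video) is to estimate sigma^2 before the experiment from historical data on the same metric, or from an A/A test, and then plug it into the formula. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for logged historical values of the metric (hypothetical data):
historical_metric = rng.normal(10.0, 2.0, 50_000)

sigma2 = historical_metric.var(ddof=1)   # pre-experiment variance estimate
delta = 0.2                              # chosen minimum detectable effect
n_per_group = 16 * sigma2 / delta**2     # plug into the rule of thumb
print(round(n_per_group))                # ~1600 users per group
```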
@halflearned2190 · 3 years ago
Nice video, thanks!
@omid9422 · 3 years ago
Excellent
@freya_yuen · 1 year ago
Why can't I save this video to my playlist? /.\
@karencao1538 · 3 years ago
Hi Emma, one question on the sample variance used when calculating sample size: are we referring to the sample variance of the treatment group before the experiment? Just a bit confused about what we're actually referring to here...
@emma_ding · 3 years ago
The statistic we are testing is delta (the difference), so the "variance" is the variance of delta.
@rogerzhao1158 · 3 years ago
@@emma_ding Hi Emma, the video is super helpful. I have one question: if the sample variance is the variance of the delta, can we only calculate the sample size after the experiment has started and data has been collected? But shouldn't we decide the sample size before we start the experiment? I'm confused about the order and hope you can help clarify. Thank you.
@janeli2487 · 3 years ago
Hi Emma, thanks for your video, it's very comprehensive. I am wondering what you would do, or how you would communicate with PMs, if the p-value just barely misses, e.g., you got 0.051 while you set your significance level at 0.05? Thanks.
@emma_ding · 3 years ago
The situation is debatable. An option could be to run the experiment a little longer to see if the p-value changes. The bottom line is you don't want to compromise the criteria (i.e., the significance level) after seeing the results.
@janeli2487 · 3 years ago
@@emma_ding Thanks!
@nplgwnm · 10 months ago
The video was made in 2021, and I burst into laughter when "company X" was mentioned 😂 Who would have known, right? 😂😂😂
@ВоловичМихаил · 3 years ago
Brilliant!
@thegreatlazydazz · 3 years ago
I would like to say that I wholeheartedly support the idea of making a video on the book with the picture of the hippo. I am from a statistics background, but never quite understood how stats were being used in this A/B testing setting. Thanks a ton!!!!
@lingli8999 · 4 years ago
Emma, another great video, thank you! I had a question. You mentioned in this video that referral programs are usually considered long-term. I understand that referral programs for things like housing take a long time. What about other referral programs, like Uber Eats or Robinhood's new-user program with a random stock? Can those be tested with A/B testing?
@emma_ding · 4 years ago
Even Uber Eats and Robinhood referral programs still take longer compared with an instantaneous change, e.g., a feature update. You can A/B test those, but with a longer feedback loop.
@lingli8999 · 4 years ago
@@emma_ding Thanks a lot Emma!
@ARJUN-op2dh · 3 years ago
Amazing!
@tejas5872 · 4 years ago
Hey Emma, thank you for the valuable content. I've been following your channel and it's helping me understand what to expect in interviews! I just have a question: you mentioned a coding round will be conducted in the first round. Will the coding round be based on data structures (linked lists, queues, stacks, dynamic programming, etc.) or basic coding challenges like printing a palindrome? Please help.
@emma_ding · 4 years ago
Good question! This blog summarizes all the different kinds of coding interviews and I think it may clarify things: towardsdatascience.com/the-ultimate-guide-to-acing-coding-interviews-for-data-scientists-d45c99d6bddc
@Alexandra-he8ol · 3 years ago
Thank you very much🙏🏻
@nathannguyen2041 · 3 years ago
Informative video! What is the difference between A/B testing and analysis of variance (design of experiments) topics? All of these topics are essentially the same, e.g., treatments/factors, randomization, experiment design, Bonferroni/Kimball inequality, etc. Is there a particular reason why A/B testing is distinguished from the general ANOVA framework? I may have just answered my own question ("the general ANOVA framework"), but it doesn't hurt to ask someone with more education and work experience than me.
@seant7907 · 4 years ago
Emma, I don't mean any offense. Can you add subtitles to your videos? I find it hard to follow what you're saying because I am not a native English speaker myself. Thank you!!
@emma_ding · 4 years ago
Thanks for the feedback! YouTube has a subtitles function (a "CC" icon at the bottom right of the video) that may help with understanding the content. It may have some errors though; I'll try to upload subtitles as soon as I can.