Dissecting GitHub Outage - Master failover failed

1,562 views

Arpit Bhayani

A day ago

Comments: 18
@AkshayKumar-yc2ls 2 years ago
This is just a brilliant dissection! I've seen countless creators making content about software engineering, but you outdo them all. The passion you show while explaining is contagious. Looking forward to learning more from you 😊
@karthikdinne1078 2 years ago
Thanks a lot, Arpit, for all the effort you are putting in. Hope you continue doing this. A small token of appreciation from my side - Karthik Dinne :)
@AsliEngineering 2 years ago
Thank you so so so much Karthik. Means a ton :)
@d4devotion 2 years ago
I am laughing, because during a beer party my friend and I suddenly started discussing master failover, and we had no idea how to make sure you don't lose data if the new master also crashes while serving write requests. You explained it so simply that a five-year-old could follow it. Great job, dude.
@AsliEngineering 2 years ago
You folks talk about Master failover when drunk 🤣🤣 insane!!
@d4devotion 2 years ago
@AsliEngineering :D I'm laughing again because I saw you posted this on LinkedIn. Now how do I comment there and tell the world that we are those folks :D :D Great to see that, bro.
@shinnosukenohara1201 2 years ago
Saw a similar incident last year on one of the products I was working on. You really explained it very well in simple words.
@AbhijeetSachdev 4 months ago
Thank you! Very good video. [Question] Some critical information is missing: what happens if an update lands in the 6-second window on the new master, and then, as soon as we switch back to the old master, another update happens to the same row on the old master? How do we handle this scenario? The WAL entry on the new master is no longer valid because it is stale, so we cannot blindly replay the new master's WAL. One possible approach I can think of: apply the WAL first, then switch back to the old master.
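To make the scenario in this question concrete, here is a minimal, hypothetical sketch (plain Python, with made-up row values and timestamps) of why the new master's WAL from the 6-second window cannot be replayed blindly once the same row has been updated again on the old master:

```python
# Hypothetical illustration of the conflict described above.
# During the ~6-second window the new master accepted a write; after failback
# the old master accepted a newer write to the same row.

wal_from_new_master = [
    {"row_id": 42, "value": "written-on-new-master", "ts": 100.0},  # inside the 6-sec window
]

old_master_rows = {
    42: {"value": "written-on-old-master-after-failback", "ts": 106.5},  # newer write
}

# Blind replay overwrites the newer value with the stale one:
for entry in wal_from_new_master:
    old_master_rows[entry["row_id"]] = {"value": entry["value"], "ts": entry["ts"]}

print(old_master_rows[42]["value"])  # "written-on-new-master" -- the newer update is lost
```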
@kushalkamra3803 2 years ago
Awesome! Thanks for sharing
@sritejaparimi6605 2 years ago
Hi Arpit, when we switch back from the new master to the old one after the new master crashes, do we first sync the binlog and then start serving traffic on the old master? Until the binlog is synced, the old master shouldn't be serving traffic, right?
@AsliEngineering 2 years ago
Great point. But during an outage you do whatever it takes to accept new writes, which is why a common strategy is to switch first and then sync. In an ideal world you would sync first and then switch.
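A minimal sketch of the two orderings described in this reply, using hypothetical helpers (route_writes_to, replay_binlog) that stand in for whatever the real failover tooling does; this is an illustration of the trade-off, not GitHub's actual procedure:

```python
# Hypothetical stand-ins for real failover tooling (orchestrator, proxy config, etc.).
def route_writes_to(host):
    print(f"writes now routed to {host}")

def replay_binlog(target, binlog_entries):
    print(f"replaying {len(binlog_entries)} binlog entries onto {target}")

def sync_then_switch(old_master, missed_binlog):
    """Ideal-world ordering: no writes are lost, but the write outage lasts longer."""
    replay_binlog(old_master, missed_binlog)  # catch the old master up first
    route_writes_to(old_master)               # only then accept new writes

def switch_then_sync(old_master, missed_binlog):
    """Outage ordering: accept writes immediately, reconcile the missed window later."""
    route_writes_to(old_master)               # stop the bleeding: writes flow again
    replay_binlog(old_master, missed_binlog)  # then carefully sync the missed window
```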
@sritejaparimi6605 2 years ago
Got it, thank you so much Arpit! It would be really great if you could expand on how switch-and-sync works internally. What if, after switching, we get updates to the rows that were written during the 6-second window on the (crashed) new master? When we sync, we have to be careful not to overwrite those updates, right? Would we have to check a timestamp or something else to avoid that?
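One hedged way to realise the "check a timestamp" idea from this comment is a last-write-wins guard during replay: skip any stale entry whose row has since been updated more recently on the serving master. This is only a sketch, assuming every row carries a reliable updated-at timestamp (and that clocks are trustworthy enough to compare), which is itself a non-trivial assumption:

```python
# Hypothetical last-write-wins replay: apply an entry from the crashed master's
# binlog/WAL only if the serving master has not updated that row more recently.

def replay_with_timestamp_guard(serving_rows, stale_entries):
    """serving_rows: {row_id: {"value": ..., "ts": ...}} on the master now taking writes.
    stale_entries: entries captured on the crashed master during the missed window."""
    for entry in stale_entries:
        current = serving_rows.get(entry["row_id"])
        if current is not None and current["ts"] >= entry["ts"]:
            continue  # the serving master already has a newer write; keep it
        serving_rows[entry["row_id"]] = {"value": entry["value"], "ts": entry["ts"]}
    return serving_rows

# Example: row 42 was updated again after failback, so its stale entry is skipped.
rows = {42: {"value": "newer-write-on-old-master", "ts": 106.5}}
stale = [{"row_id": 42, "value": "write-from-6-sec-window", "ts": 100.0},
         {"row_id": 7,  "value": "write-from-6-sec-window", "ts": 101.0}]
print(replay_with_timestamp_guard(rows, stale))
```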
@dhruvilshah9098 2 years ago
Really loving your content, want to see more and more. If possible, make a video on the Cloudflare outage and Uber's database change from PostgreSQL to MySQL. Thanks...
@AsliEngineering 2 years ago
Thanks. The Cloudflare outage video is coming soon. I am trying to wrap my head around it, diving really deep to understand what exactly happened so that I can explain it in simpler language :)
@dhruvilshah9098 2 years ago
@AsliEngineering Thanks a lot
@LeoLeo-nx5gi 2 years ago
Hi Arpit bhaiya, I wanted to know: are all these things handled only by the SRE team engineers? Sorry, I am a fresher so I don't know much. Can someone from another team also get to know about this or contribute to it?
@AsliEngineering 2 years ago
Not really. Even backend engineers do this. It depends on the company though.
@LeoLeo-nx5gi 2 years ago
@AsliEngineering Ohh, great