How GitHub's Database Self-Destructed in 43 Seconds

  Views: 1,033,292

Kevin Fang

1 day ago

Comments: 853
@sollybunn 1 year ago
"We can't delete user data, we aren't gitlab" This video is a goldmine
@robloxboxertblocked 1 year ago
gitlab*
@YashendraShuklaTheOG 1 year ago
I literally choked on my breakfast.
@pratikkore7947 1 year ago
sounds like I missed something, can I have some keywords to look up?
@yungifez 1 year ago
Haha i saw that golden statement
@christianbarnay2499 1 year ago
And yet they did actually delete user data by one-sidedly deciding to rollback east coast servers and reintroduce deleted changes manually later. The right course of action was to simply let the replication system merge the vast majority of projects that had no conflicting data at all, then reach out to the very few project admins that needed manual reconciliation. Instead of discussing and letting the situation aggravate for hours, this would have been resolved in a couple minutes for almost all clients with no impact other than the downtime. And still less than a full day for those few that needed extra work. Deciding to alter user data without proper informed consent is a huge ethical no-no.
@kalebbruwer 1 year ago
It's bold to assume that a) 50% of Github users are active on any given day b) Their time is worth an average of $50/hr c) Not syncing with remote for one day would affect the average user
@mews75 11 months ago
That's what i was thinking lol
@opfipip3711 11 months ago
yeah, one of the great things about git is that it is trivial to set up a new remote and even no problem to code for weeks without an internet connection at all. I'd say GitHub could only be up ~20% of the time without that having a strong (financial) impact on most of the projects hosted there. Would piss off lots of devs, tho.
@Ignacio_DB 10 months ago
im no IT guy, but 40 mins of lost data is a better sacrifice than hours of slow time, they couldn't just freeze the west db, and see what was different, transfer and boom everything has been solved
@mennoltvanalten7260 8 months ago
I push maybe 3 times a week... but I'm basically using GitHub as a backup for some personal projects. So long as my computer survives I can handle not pushing for a few days
@iheartlreoy8134 7 months ago
don’t you just hate when your andromeda integration service fails causing all writes made after the American civil war to be lost
@MaxwellHay 1 year ago
The assumption that 50% of total github users are active is too optimistic
@Backtrack3332 1 year ago
Yea, I'm guessing 2% max
@FiksIIanzO 1 year ago
It's good to grossly overestimate potential issues
@KaidenBird 1 year ago
As someone who hasn't pushed in weeks, that hurts, but is too true.
@lightning_11 1 year ago
@Backtrack3332 That's still a lot, though!
@RMDragon3 1 year ago
Yeah, those assumptions seem very off to me. I feel like less than 50% of GitHub users are active daily between abandoned users and people who rarely use it. On top of that, a significant percentage of users will be students or personal projects that don't really have a monetary impact. Also, most users likely didn't lose anywhere near 2 hours, especially because the website wasn't fully down for anywhere close to those 24 hours. I'm sure it didn't work great during that time, but it was usable. If it happened to me, I would likely test for 5 minutes, check with colleagues and just work locally, testing every hour or so. Some people may have been affected more, but 2 hours of lost productivity seems way too high to me. With that in mind, the estimate would likely be a few orders of magnitude lower.
@Justin-jm2fd 1 year ago
As a former bitbucket employee I can confirm we have disaster recovery plans for a lunar data center outage
@KangJangkrik 1 year ago
Now what?
@fatrobin72 1 year ago
Last I checked it was a disaster plan, there was no recovery...
@DaveParr 1 year ago
I'd assume you would use IPFS.
@jaythecoderx4623 1 year ago
@DaveParr Those have a lot of latency tho, don't they?
@siliconcassettes3369 1 year ago
As a time traveller from the future I can confirm the recovery plans are insufficient and the situation becomes irrecoverable
@RichieYT 1 year ago
These problems always occur during routine maintenance. That's why I don't do any maintenance whatsoever and my systems have never experienced downtime (although I've never checked)
@tisaconundrum 1 year ago
can't have a problem if you don't see a problem
@kurdtpage 1 year ago
This is the way
@zsoltsz2323 1 year ago
Even Chernobyl was routine maintenance.
@PieJee1 1 year ago
That makes your system full of security exploits as security issues are not patched too. You will also face a huge issue if you are forced to update if you use versions that are too old
@elle9834 1 year ago
Out of sight out of mind
@axelboberg 1 year ago
Interplanetary failovers are a struggle, not gonna lie.
@__dm__ 1 year ago
ipfs is (was?) a project with interplanetary, high-latency connection in mind with Merkle DAG datastructures for well, unstructured object data. It got adopted by the crypto crowd because memes and idk where it's going
@philip3963 1 year ago
@__dm__ I work with IT solutions and I swear I've seen IPFS support in the industry before, just can't remember where
@ExEBoss 1 year ago
@philip3963 *Cloudflare* says they have support for it.
@muhammadyusoffjamaluddin 1 year ago
PHP Devs: YOU THINK SOO??????
@LinhNguyen-zg9kn 1 year ago
bruh they had the option to rollback 40 mins of write on the promoted db and sync both db. They pretty much fucked themselves in the ass tbh
@ericlizama8552 1 year ago
Honestly I'm impressed that Bitbucket was able to lower the Earth-Mars latency down to 60 milliseconds.
@Fenhum 1 year ago
they must've found a cheap way to build those einstein rosen bridges ey?
@wesleyeberly228 1 year ago
@Fenhum something akin to hyper pulse relays from battletech
@shippo72 1 year ago
@mikicerise6250 Ansible is instantaneous, no matter the distance. It even allows you to communicate both upstream and downstream of your current dimensional position.
@AR-yd2nd 10 months ago
Faster than light bitbucket
@runforitman 7 months ago
Those wormhole generators give you cancer you know
@riddixdan5572 1 year ago
What a goldmine of a channel. I'm here with you all, witnessing the birth of a great channel
@0tiii 1 year ago
dude almost sounds like fireship
@manzenshaaegis8783 1 year ago
This is one of those things that in hindsight, it is so easy to see how they set themselves up for failure. But I bet you a lot of brilliant people looked at this and still did not see the issue until it (inevitably) blew up. It do be like that sometimes...
@christianbarnay2499 1 year ago
I know at least one org that can't have that kind of failure. Their standard operating procedure is to actually force the primary switch on a regular basis. Every 2 or 3 months they power off all primary servers and check that all secondaries have promoted and are now fully operating as primaries with no data loss. Then they restart the old primaries, which become the new secondaries. It covers all possible kinds of failures of the primaries. This is also used for the upgrade procedure. Whenever you need to upgrade a server, you upgrade the secondary first, do some offline tests, then promote it to primary, keep the old primary/new secondary ready with the old version for a few days in case a rollback is needed. And finally update it. The first time I saw that choice of having the failover procedure being an integral part of normal operations I thought it was genius. When you have an incident, you don't need to panic and look up exceptional procedures you are not familiar with. You just change the schedule of the regular routine. And if needed you can do forensics on the system you just put offline while users are working unaffected by the incident.
@travcollier 1 year ago
@christianbarnay2499 Good idea. Of course, it is also expensive AF. Robustness always costs short term efficiency.
@smugfaced 1 year ago
it really do be
@checker297 1 year ago
@christianbarnay2499 everyone can have this kind of failure, it just is the level of extremes. It isn't in normal situations when you get pressured as an engineer, it's when shit is on fire and suddenly all your plans which required something you assumed would be working due to its robustness, forces your hand to pull a rabbit out of your arse.
@bogdancolesiu4853 1 year ago
@christianbarnay2499 But don't you run into the very same issue that GitHub had? It is great that promoting the secondaries to become primaries works nicely, but what about synchronising the data? GitHub did the same thing that you are describing which is, in case of a primary failure, promote somebody else quickly to be a primary. But the real issue was the fact that, because of the disconnection, DB A had received data that was not synced with DB B when the connection was lost. Your scheme promotes DB B to become primary when DB A fails, but how do you synchronise the data that DB A has been receiving later after DB B has been updated and the 'timelines' are out of sync? Is it just me or it seems like you would have the very same problem that GitHub had? Their issue was not simply promotion to a new primary, but everything else
@CoryKing 1 year ago
I worked at a website that handles millions of write transactions per day across like 7 global data centers. We were starting to think of a way to drop into a “read only” mode in the event something like this happened. Then we wouldn’t need to paw through the mess of uncommitted transactions…
@yuhyi0122 1 year ago
that actually sounds good
@xpusostomos 1 year ago
@yuhyi0122 sure it's good ... If this is the rare web site where it even makes sense to be read only
@GeorgeTsiros 11 months ago
when you say millions of transactions per day, is there something difficult about these? I mean, even if you do 100 million per day, that's on the order of 1k transactions per second, that's reasonable, yes?
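A quick back-of-the-envelope check of that figure, as a minimal sketch. It assumes the load is spread perfectly evenly over a 24-hour day, which real traffic is not, so peak rates would be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_tps(transactions_per_day: int) -> float:
    """Average transactions per second if the load were perfectly uniform."""
    return transactions_per_day / SECONDS_PER_DAY


rate = average_tps(100_000_000)
print(f"{rate:.0f} transactions/second on average")
```

So "on the order of 1k per second" holds as an average; whether that is easy depends on what each transaction does, not only on the count.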
@xpusostomos 11 months ago
@GeorgeTsiros the difficult part, if you watched the video, is reconciling conflicting changes
@edhahaz 1 year ago
imagine being github and being unable to... MERGE two databases
@littleloner1159 1 year ago
It's GitHub. Didn't they delete their whole code like twice?
@joelpww 1 year ago
@littleloner1159 might be thinking of gitlab
@casev799 1 year ago
Yeah, but you'd expect them to learn at some point. They have their whole library of users that could help too....
@ko-Daegu 1 year ago
@casev799 typical YT reply everything is easy in their eyes yet they accomplished nothing
@Paulo27 1 year ago
git push --force -----FORCE ----------FOOOOOOOOORCEEEEEPLEEEEEAAAAASSSSEEEEEE
@JohnAlbertRigali 1 year ago
Considering the scope of the GitHub disaster, it seems to me that recovery within 30 hours is very impressive. I've had to engineer recoveries from much smaller disasters and every one of them took me at least 48 hours if I remember correctly.
@icedlava7063 4 months ago
yee, i think this was very well handled
@TuMadre8000 1 year ago
11:55 i'd say getting 60ms of latency over a 10 light-minute distance is still pretty good
@dybdab 1 year ago
One of the greatest "history" channels on YouTube, love the content.
@l-l 1 year ago
Absolutely
@namansoood 1 year ago
Internet Historian: 👀
@ccthomas 1 year ago
When the east coast database recovered and started accepting writes again from applications, they dodged the very common bullet of those apps pushing work at the database as fast as they can and overwhelming it, causing a second wave of outage. In this case, it looks like the controls over the work rate (whether implicit in the nature and scale of the apps, or an explicit mechanism) were sufficient to prevent that.
@rajarshichattopadhyay1728 1 year ago
I love how in the last 30 sec, Kevin was not only able to explain how interplanetary network would work but how a random command would blow everything up in exactly 30 sec 😆
@Hopgop1 1 year ago
I love these videos, I work in IT but for a much smaller national company, really interesting to learn some lessons from, plus the editing and storytelling makes it very entertaining.
@thebeber2546 1 year ago
The ending was hilarious. Great video overall.
@kuroodo_ 1 year ago
The explosion at the end threw me into tears lol
@acoolnameemm 1 year ago
This video is full of explosions and memes but in a tempered manner and it hits all the nerves in my brain. I need more videos like this.
@Geolaminar 1 year ago
Well, it could have been worse. The automated lunar relay launch could have been misconfigured such that it did not alert US STRATCOM, and therefore appeared to be a ballistic missile launch against a domestic target, which immediately would lead to global thermonuclear war due to improper database failover configuration.
@MrLastlived 1 year ago
I swear to god if all of humanity gets wiped out over a stupid accident and not because of a grand painstaking political catastrophe I'ma be real disappointed in hell.
@mattheholic2 1 year ago
@MrLastlived That was close to happening multiple times over the course of history. It's a miracle we haven't already done that.
@hchris96 1 year ago
Thank you! This was perfect. I love this. And the amount of explosions is tasteful and not overdone
@LolWutMikehSM 1 year ago
That interplanetary loop was good
@XxBuzzkill77xX 1 year ago
This content is incredible! Really has me thinking about some of my architecture and how to think about planning infrastructure going forward, keep up the awesome work!
@radiosification 1 year ago
I love these incident analysis videos. Please keep making more!
@rigell2764 1 year ago
These graphics make me laugh. 1, 2, 4, 5, red among us guy, purple among us guy, pizza, 8 ball 😂. Also the Ace Attorney part was great.
@IroAppe 1 year ago
This was definitely not a failure. I've seen other videos where "they did everything wrong they could". In this case, in the circumstances, they did exactly what they had to do. Except for those few discussing prioritizing uptime over data consistency, which is a no-no. It's good that the right engineers prevailed. A laggy service is just so much better than a nightmare collapse or massive inconsistency nightmare that will plague customers all over for weeks. I get that they're paid for uptime and fluidity of the service, but in a case that is equivalent to a survival situation, you have to prioritize. Worrying about a "laggy service" in the east-coast is then equivalent to complaining about the lack of ice cream in an apocalypse scenario. In fact, I see this as a huge win! How many times have short measures without much thinking trying to treat the superficial symptoms as fast as possible, that are merely an extension of the underlying real problem, led to a full-scale disaster? For once, there were finally people thinking critically before doing something! Treating the core of the problem.
@Penfolduk001 1 year ago
The worry here was that they had to spend the time coming up with the plan to respond. Whilst I realise you can't plan for every contingency, cross-hub failure like this should have already been considered and planned for. From the video this doesn't appear to have been the case. Guess they were lucky the initial fault didn't last more than 49 seconds.
@xpusostomos 1 year ago
Nobody was arguing for inconsistency. The argument was getting back up fast vs losing 40 minutes of changes
@leaffinite2001 1 year ago
@xpusostomos losing 40 minutes of changes is i think the inconsistency in question
@xpusostomos 1 year ago
@leaffinite2001 that's not a data inconsistency
@leaffinite2001 1 year ago
@xpusostomos why dont you define the term then, get us on even ground
@CubemasterXD 1 year ago
these videos are so underrated the (visual) humor keeps getting better and better
@LemonGingerHoney 1 year ago
I felt their pain. What a fantastic job on the recovery and post mortem.
@eantropix 1 year ago
Bro backing up data to Mars sounds so unbelievably awesome and impractical at the same time, I love it
@majesticcok 1 year ago
I love these videos, but as a DevOps Engineer I get anxious if I watch too many in a short period of time :)
@jure. 1 year ago
I love your videos so much. They're so informative, interesting, well-made and even funny. Keep it up!
@mr_darkeye 1 year ago
always nice to see a new video from you
@IceTank 1 year ago
The editing is on point. Very nice video.
@fairlyfactual451 1 year ago
This is why you always should practice regional failovers of your cloud architecture and make doing so mandatory company events (or even random events).
@alexischicoine2072 1 year ago
My company practices that once a year I believe. I had a senior colleague take part in it.
@darthollie 1 year ago
This video is waaaay longer than 43 seconds
@gleep23 1 year ago
I like how you turned this technical issue into an enjoyable story. Great storytelling skill.
@whynotanyting 1 year ago
"For instance, how am I gonna stop some big mean Mother-Hubber from tearin' me a structurally superfluous data center?"
@kriterer 1 year ago
$50 an hour is a wild overstatement
@fir3cl4w 1 year ago
Love the Ace Attorney bit, keep up the good work ❤
@benbrist 1 year ago
We're not GitLab had me in stitches
@theowinters6314 1 year ago
I think the biggest surprise in this was the fact that they had daily tests of restoring from backup, when most companies only test that after they need it.
@kblt94 1 year ago
Please…. More of these videos of software disasters! Facebook outage etc. !! As a developer myself, it’s somehow calming that such big players fall into these „oh shit….“ situations too! ❤️
@MexicanSkynet 5 months ago
and HERE WE GO AGAIN!
@kiro_f 1 year ago
Can't wait for another video, just kinda wanna go on a binge watch of them but there isn't that many, hopefully in the future though :)
@vikaskrishnan4018 1 year ago
I loved the whole breakdown of the issue Github faced, but it's the last 30 seconds of the video that gained you a Sub! Keep up the crisp K.I.S.S explanation and subtle humour combined with the accurate images and editing!
@jermunitz3020 1 year ago
Nice editing Kevin. Really looking forward to the next one.
@Gabriel-kl6bt 1 year ago
The thought of being amidst these people recovering from this kind of chaos gives me stomachache.
@koustubs6561 1 year ago
This is why you don't TOUCH A FUCKING WIRE WHEN THEY TAKE YOU THRU A TRIP IN THE EAST COAST SERVER
@henkfinkers3931 1 year ago
I absolutely love this channel.
@VaraNiN 1 year ago
This channel gonna be big soon with these high quality vids and the algorithm starting to push em
@kubajurka 1 year ago
I understood virtually nothing but still found the video absolutely exhilarating.
@CoryKing 1 year ago
These videos are hilarious! I look forward to more! It’s like the dark net diaries podcast but different and super funny. Good stuff! I watched all of these and am disappointed there isn’t more to binge watch. I hope you keep this format, this is an excellent concept for a YouTube channel!
@kanal7523 1 year ago
I love the animations and goofiness, pls never stop making these videos
@EdwardChan.999 1 year ago
I hate dealing with databases, but watching your database stories is a pleasure 👍🏻
@MaNameizJeff 1 year ago
I am loving your videos so much. You make describing how exactly these internet exploits are done in the most entertaining way. Even someone who only knows basics like myself can follow along and understand.
@ellieban 1 year ago
“They expected X to follow a linear trajectory rather than the actually observed power function” can be applied to most of what’s wrong with humanity 🤣
@arcaneblackwood3602 1 year ago
The humor in this video is 120%. We need news actors like you in this world.
@not_hehe__ 1 year ago
i just noticed today is the anniversary of this incident
@elatedemu 1 year ago
Your visuals are probably the best and most entertaining I've ever seen
@ForcefighterX2 1 year ago
2nd video from your channel. Realized it's awesome. You've got a new subscriber, bro!
@AdroSlice 1 year ago
That last part is gold. Thank you so much.
@JxH 1 year ago
We do have to admire the self-confidence of the system designers. They plunged right in, built a highly complex system, blissfully unaware of their own naïveté. Failure control is about 30x more complex than they had assumed.
@RaphaelDDL 1 year ago
Thank you youtube algorithm for suggesting this piece of art
@joshhuang2279 1 year ago
This man is more chaotic fire ship
@Markyroson 1 year ago
I love the "until next time" segment at the end lolol
@matthewschuster4600 1 year ago
That last 30 seconds or whatever just earned you a sub. Lmao.
@christianbarnay2499 1 year ago
Github is designed at its core to allow for loss of connectivity anywhere in the network. In this event they completely failed at handling the exact type of issue their system was designed to overcome seamlessly. As mentioned in the video this should have resulted in a 43s downtime for the vast majority of clients. And only a handful of clients having to reconcile data by hand between the west and east coast centers.
The major problem is they clearly never tested the primary database loss scenario. They would have identified that they needed to replicate not only the database but the entire infrastructure to the west coast so it could still work during an east coast downtime. Or deactivate cross-country failover.
The second problem is they one-sidedly decided they had to reconcile all user data by themselves. Client data belongs to clients. You should never alter client data without full information and consent. Deciding to manually rollback and backup east coast commits was altering client data and a big no-no.
The right course of action should be:
1. Inform clients that there is a potential discrepancy between servers and you are building a list of affected projects.
2. Let the system reconcile projects that have no issue at all (no commits during the downtime or only west coast commits that can be pushed to the east with a fast-forward) and inform those clients that everything is fine for them and the system is back to normal operations.
3. Tell clients that need manual reconciliation that you propose the following plan: keep the branch with the most recent commit as is, and rename the conflicting branch as _ so they will have both accessible in the same repo and can reconcile their data as it suits them. And ask them to reply with their approval of the plan or a proposition for an alternate plan before some reasonable deadline. And give them contact info if they need help and/or advice.
That way instead of going all in manipulating all clients data, they would only need a small taskforce ready to help those that actually need it.
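The triage in step 2 of that proposal can be sketched as a toy model. This is only an illustration of the commenter's idea, not GitHub's actual process: `RepoState`, `east_only`, `west_only` and `classify` are hypothetical names, and each project is assumed to track which writes only one site saw during the split.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepoState:
    """Hypothetical record of what each site saw during the split."""
    name: str
    east_only: List[str] = field(default_factory=list)  # writes only the east site has
    west_only: List[str] = field(default_factory=list)  # writes only the west site has


def classify(repo: RepoState) -> str:
    """Sort a project into one of the three buckets from the proposal."""
    if not repo.east_only and not repo.west_only:
        return "unchanged"  # nothing happened during the split
    if repo.east_only and repo.west_only:
        return "needs manual reconciliation"  # both sides diverged
    return "fast-forward"  # only one side moved, the other can catch up


repos = [
    RepoState("quiet-project"),
    RepoState("west-only-project", west_only=["c1", "c2"]),
    RepoState("diverged-project", east_only=["e1"], west_only=["w1"]),
]
for repo in repos:
    print(repo.name, "->", classify(repo))
```

Only the last bucket would need a human; the open question in the replies is whether the real data could be split this cleanly.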
@eekee6034 1 year ago
*Git* is designed to allow for loss of connectivity. Git*hub* was designed by the kind of crazies who jump on open source bandwagons.
@samuellourenco1050 1 year ago
One question about your point 3. How to reconcile two divergent branches?
@christianbarnay2499 1 year ago
@samuellourenco1050 There are tons of ways to do it. Simplest is git merge with manual resolution of conflicts. Most tedious is creating a new branch at the diverging point and cherry pick from each side, then destroy both incomplete branches and rename the new branch to the original name. The right strategy is up to each client depending on the state of their data and their own standards for repo cleanliness. Some will want to remove all traces of the incident. Others will consider it's part of the project life and should stay visible in the history.
@JohnSmith-fz1ih 1 year ago
@christianbarnay2499 Where did you get the notion that they altered client data? My understanding from watching this video is that they rolled back to a consistent state, then restored the two lots of data that ended up split over the two data centres. The result being all data restored. I’m not certain in what users with the data spread across both the east coast and west coast servers experienced. But your post reads to me of “I watched a 12-minute summary and now I think I know better than the staff that worked with the product every day”.
@christianbarnay2499 1 year ago
@JohnSmith-fz1ih In a history tool like GIT client data is not limited to the content of latest commit. Client data is the entire tree with all branches, commit dates, comments and commit order. Dealing with conflicting data is an important decision. And the way you want the data to appear and be accessible after the resolution is a decision by the project owner. Each project owner will have a different approach on the way they want to deal with such a situation. And GIT allows for all those approaches. The Github team making a single universal decision for all projects is barring project owners from making their own decision on the matter. What I say doesn't come from just watching a 12 minute video. It comes from using GIT on a daily basis, including a few occasions in which I migrated entire projects from old tech repos like CVS or SVN to GIT. And on some of those occasions I had to retrieve commits that were split over several repos and reconcile them using dates and comments. With the help of some low level GIT commands I could easily automate that process. That's why I am fully confident that GIT has all the tools needed to allow the Github team to automatically rename conflicting branches, regroup everything in the master repo, replicate to all mirrors, and then let project owners do the merge the way they want instead of forcing their own single decision for everyone. The main benefits of GIT over all other versioning systems are its high resilience to conflicts and the possibility for project admins to do absolutely everything with their repo on any PC and push the result to the central repo. This incident was the perfect occasion to highlight those features and display complete transparency by rapidly giving control of the 2 branches of their repos to project owners.
@LukeeGD 1 year ago
11:30 That 2047 bit is probably one of the greatest things I have ever heard haha
@Pixelhurricane 1 year ago
your joke at the end about the martian servers had me in tears, too real
@Bozebo 1 year ago
I mean, cross region issues are something you're meant to have tested disaster recovery from and this is a really obvious point of failure they shouldn't have missed. That's the issue here, not necessarily an architecture problem itself.
@PolskaChild 1 year ago
Everything about the video was great lmao. The humor, the animations, and not stupidly complicated.
@miklov 1 year ago
Fascinating. Love the bit at the end too! Thank you.
@Crocsx058 1 year ago
Man your videos are so good and it's so cool to see other companies' post mortems and the cause so well explained. Thanks
@nickdaboss03 1 year ago
Loving these new documentary type videos!
@MizManFryingP 1 year ago
The bitbucket bit killed me haha
@PrivateUsername 1 year ago
And the funny part of all this is that this is a well-known issue among database, storage, and server engineers. It has been solved many ways decades before. In short, hire the crusty old Unix geeks every once in a while.
@dorinsuletea1928 1 year ago
A genuine question : Is it even possible to use async replication for the primary without bricking consistency on fail-over? (the root cause of this entire mess). Btw, great video and the end segment was glorious!
@kevinfaang 1 year ago
Nope (pretty much what CAP theorem states). Only way to prevent that would be to semisynchronously replicate to all remote DCs, which GitHub doesn't do (see 2018 blog post in description on MySQL High Availability)
@jamesgrant3343 1 year ago
Async- there is a time between one instance having data and the other instances having the data. If you pull the plug in that time then that data isn’t available to the other instances. It’s possible (and very desirable) to have synchronous knowledge of data and asynchronous transfer of that data - so your non-primary instances get a ‘journal’ entry which can ease reconciliation pain at the cost of read after write speed to the cluster (primary). Given that this is only useful occasionally but the overhead cost in worse performance is always, the approach is uncommon
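The window described in these two replies can be shown with a toy model. This is a minimal sketch, not how MySQL or GitHub's tooling works: the `Cluster` class and its fields are hypothetical, and the "synchronous" mode simply refuses writes while the link to the replica is down, which is the availability cost the CAP trade-off implies.

```python
from typing import List


class Cluster:
    """Toy primary/replica pair (hypothetical, for illustration only)."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.replica: List[str] = []  # writes the replica has received
        self.acked: List[str] = []    # writes acknowledged to clients
        self.link_up = True

    def write(self, value: str) -> bool:
        if self.synchronous and not self.link_up:
            return False  # refuse rather than acknowledge an unreplicated write
        if self.link_up:
            self.replica.append(value)
        self.acked.append(value)  # in async mode this happens even with the link down
        return True

    def lost_on_failover(self) -> List[str]:
        """Acknowledged writes the replica would be missing if promoted now."""
        return [w for w in self.acked if w not in self.replica]


for mode in (False, True):
    cluster = Cluster(synchronous=mode)
    cluster.write("w1")
    cluster.link_up = False  # the network partition
    accepted = cluster.write("w2")
    print("synchronous" if mode else "asynchronous",
          "| w2 accepted:", accepted,
          "| lost on failover:", cluster.lost_on_failover())
```

The asynchronous cluster keeps accepting writes and loses `w2` on failover; the synchronous one loses nothing but rejects `w2` during the partition.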
@Ganerrr 1 year ago
*drums start beating*
@teamwolfyta 1 year ago
That Bitbucket joke was the funniest thing I've heard in coding terms, Keep up the awesome stuff mate! 🤣
@draakisback 1 year ago
This type of scenario is why I've been building a CRDT backed nosql database. To make it so that you can have a ridiculously complex topology and recover from any failover. Fault tolerance is extremely important for apps this size and it seems like they had some very janky setups.
@omniphage9391 1 year ago
You always end on one side of the DB triangle. The trick is to choose the right approach to the problem.
@draakisback 1 year ago
@omniphage9391 that's what the crdts are for, they give you strong eventual consistency. If you accept SEC you can have all three as well.
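For readers unfamiliar with CRDTs, here is a minimal sketch of one of the simplest, a grow-only counter (G-Counter). It is a textbook structure and not the commenter's database; the node names are made up. Each node only increments its own slot, and merging takes the per-node maximum, so replicas converge regardless of merge order or repetition.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever bumps its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative and idempotent, which is what
        # makes replicas converge after a partition heals.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


east, west = GCounter("east"), GCounter("west")
east.increment(2)  # writes accepted in the east during a partition
west.increment(3)  # writes accepted in the west during the same partition
east.merge(west)
west.merge(east)
print(east.value(), west.value())
```

Both replicas end at the same total without any coordination during the split. Richer data (rows, branches, balances) needs more elaborate CRDTs, and not every operation can be expressed this way.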
@fernandososterbortolotto7315 1 year ago
2 month update perhaps?
@MarekKnapek 1 year ago
Lessons learned: Design your topology such that any single node could fail and have an (automated) plan what to do when it does fail. Second: Why they had single primary + multiple replicas architecture? That seems like obvious single point of failure. If I designed this, any node could accept new data (act as primary) and only after the data would be provably replicated to few (not all) random other nodes, it would tell the user: yes, I have received your data. Then continue replicating to the remaining replicas in background. This strategy seems a bit slower but it seems to scale better and is more resistant to failures. But I know nothing about designing services to be running across multiple data centers. I still have the idea in my head I heard in the 90's: The Internet is mesh/web of many computers and many interconnected networks. If one fails, other can take its role. But this is not the case. Single BGP or DNS misconfiguration can bring down big portion of it.
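The "acknowledge only after a few nodes have it" idea in this comment can be sketched roughly as below. It is a toy under stated assumptions: `quorum_write`, `nodes` and `reachable` are hypothetical names, the first `k` reachable nodes are used instead of random ones to keep the example deterministic, and background replication to the rest is left out.

```python
from typing import Dict, List, Set


def quorum_write(value: str, nodes: Dict[str, List[str]],
                 reachable: Set[str], k: int) -> bool:
    """Acknowledge a write only once it is stored on at least k reachable nodes."""
    targets = [name for name in nodes if name in reachable]
    if len(targets) < k:
        return False  # too few live nodes: refuse instead of risking a lost write
    for name in targets[:k]:
        nodes[name].append(value)
    return True  # the remaining replicas would be caught up in the background


nodes: Dict[str, List[str]] = {"a": [], "b": [], "c": []}
print(quorum_write("x", nodes, reachable={"a", "b", "c"}, k=2))  # enough nodes
print(quorum_write("y", nodes, reachable={"a"}, k=2))            # partitioned
```

This does not answer the conflict question raised in the replies: two nodes can each reach a quorum for writes that contradict each other unless the quorums are made to overlap.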
@TheRZOM 1 year ago
Probably because creating and maintaining such a system is VASTLY expensive and time-consuming and may have been deemed too infeasible until they got so big that such an infrastructure was necessary.
@ccthomas 1 year ago
It's often very difficult to design a system that allows any node to accept a change when those changes are based on data that may be out of sync across nodes. The classic example is a bank balance: if you don't see right away that someone has already withdrawn the last $100 from a $100 account, then you will also be allowed to withdraw that $100. When all the nodes communicate and reconcile the changes, you have -$100. It can be done, but it takes careful design of the structure of your data, the kind of changes that can be applied, and the workflow of the applications to deal with unexpected results when they occur. Imagine if your ATM receipt said, "Your current bank balance is probably $0, but we'll get back to you to return the money if you're overdrawn." Come to think of it, this is exactly how checks work.
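The bank-balance problem described above is easy to reproduce in a few lines. This is only a toy model: the `Replica` class and `reconcile` function are invented for illustration, and each replica checks withdrawals against its own, possibly stale, copy of the balance.

```python
class Replica:
    """Hypothetical node that validates withdrawals against its local balance."""

    def __init__(self, balance):
        self.balance = balance
        self.pending = []  # withdrawals accepted locally, not yet shared

    def withdraw(self, amount):
        # The check only sees this replica's view of the account.
        if amount > self.balance:
            return False
        self.balance -= amount
        self.pending.append(amount)
        return True


def reconcile(start_balance, replicas):
    # Apply every withdrawal that any replica accepted.
    return start_balance - sum(sum(r.pending) for r in replicas)


# Both replicas start from the same $100 account and cannot see each other.
node_a = Replica(100)
node_b = Replica(100)
ok_a = node_a.withdraw(100)  # accepted: node_a sees $100
ok_b = node_b.withdraw(100)  # also accepted: node_b still sees $100
final_balance = reconcile(100, [node_a, node_b])
```

Both withdrawals pass their local check, and reconciliation leaves the account at -$100, which is exactly the outcome the comment warns about.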
@MarekKnapek 1 year ago
@ccthomas In the bank scenario I would imagine it behaving like this: any node could accept your request to transfer the $100. The node would respond very quickly with something like: "yes, I successfully accepted your request to transfer $100". Quickly, because it is only one node out of many and thus not overloaded. This means the request was accepted, not that the transaction was completed. Then, some time later, after the node has communicated with enough other nodes to verify that you actually have at least $100, it will either perform or refuse the transaction: "eventual consistency". This may seem no better than a single node for writes, but it is still better, because your transfer request is independent of other people's requests and thus can be processed "in parallel" with them. Again: better horizontal scaling, no single point of failure. Or something like this. I know I make this sound "easy-peasy-lemon-squeezy", but it actually is not.
@MadiganTech 4 months ago
The MCO reference was 10/10. -Dylan
@JAMBUILDER08 1 year ago
This is a great example of what to do after a major IT issue: make plans to handle such a situation better and more easily should it occur again.
@jardeshna 1 year ago
Deleting servers? No, on this channel we nuke them. Instant subscription.
@0xEmmy 1 year ago
To be fair, I doubt systems will ever be designed to require live interplanetary interoperation, given that the latency is already measured in minutes (10ish for Earth -> Mars, 2x for a round trip). Even translunar (1.3ish seconds, 2.6ish for a round trip) is kinda pushing it. And this is a fundamental limitation of general relativity, so it's not changing anytime soon unless someone has a warp drive they're keeping secret.
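A quick back-of-the-envelope check of the light-time figures. The distances are approximate, commonly cited values and are assumptions added here, not numbers from the comment; the Earth-Mars distance in particular swings widely with orbital position.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact by definition

# Approximate distances in km (assumed, commonly cited figures).
EARTH_MOON_KM = 384_400        # average
EARTH_MARS_NEAR_KM = 54.6e6    # around closest approach
EARTH_MARS_FAR_KM = 401e6      # around maximum separation


def one_way_delay_seconds(distance_km):
    """Minimum signal delay: distance divided by the speed of light."""
    return distance_km / SPEED_OF_LIGHT_KM_S


moon_s = one_way_delay_seconds(EARTH_MOON_KM)
mars_near_min = one_way_delay_seconds(EARTH_MARS_NEAR_KM) / 60
mars_far_min = one_way_delay_seconds(EARTH_MARS_FAR_KM) / 60

print(f"Moon: {moon_s:.2f} s one way, {2 * moon_s:.2f} s round trip")
print(f"Mars: {mars_near_min:.1f} to {mars_far_min:.1f} min one way")
```

Under these assumptions the one-way Earth-Mars delay ranges from about 3 to about 22 minutes, so "10ish" is a fair middle-of-the-road figure.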
@MmmHuggles 1 year ago
Sounds like a major headache. One "oopsie" and all hell breaks loose.
@Buizie 1 year ago
Clears the table after dinner Everyone in the database team:
@iheuzio 1 year ago
Fireship has really nailed this video
@xihua12370 4 months ago
I like how an issue that could easily be fixed on a personal PC becomes such a hassle when scaled.
@bloeckmoep 1 year ago
The problem with this setup is that both databases can assume a "write to" status. My company has two SQL servers running our CAD part library and some lower services. One is the master SQL database some 400 km away, while the other is local to the region I live in. Internet outages are somewhat common, so having two databases, one master and the other slave, has proven golden. The only downside is that when adding new parts or reworking existing CAD parts, I have to connect to the master SQL server, import them, and wait for synchronization. But even this "downside" has a silver lining: synchronization happens at fixed times, which lets me try out different parts and see how they behave on our slave server without fear of destroying anything, because the next synchronization will put everything right again. If the tested CAD parts work as expected on the slave database server, I simply connect to the master and repeat the adding process. In GitHub's case they should, IMHO, automatically fall back to a read-only database if the connection to the current master is lost. 43 seconds of read-only is much less destructive than 6 hours of ongoing unsync. I doubt that anyone worldwide would have noticed in those 43 seconds that the GitHub database was read-only and not accepting any commits. Even if someone committed at that exact moment, it could have been explained as an internet hiccup from one's ISP.
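The read-only fallback proposed here could look roughly like the sketch below. It is an assumption-laden toy, not how GitHub or any particular database implements it: `ReplicaSite`, `ReadOnlyError`, and the `primary_reachable` callback are all hypothetical names, and the local dictionary stands in for both the forwarded write and the replicated copy.

```python
class ReadOnlyError(Exception):
    """Raised when a write arrives while the primary is unreachable."""


class ReplicaSite:
    """Hypothetical site that serves reads locally and never promotes itself."""

    def __init__(self, primary_reachable):
        # primary_reachable: callable returning True while the link is up.
        self._primary_reachable = primary_reachable
        self._rows = {}

    def read(self, key):
        # Reads keep working from the local copy during an outage.
        return self._rows.get(key)

    def write(self, key, value):
        # Refuse writes instead of becoming a second primary,
        # so the two sites can never diverge.
        if not self._primary_reachable():
            raise ReadOnlyError("primary unreachable, site is read-only")
        self._rows[key] = value


link = {"up": True}
site = ReplicaSite(lambda: link["up"])
site.write("k", "v")      # accepted while the primary is reachable

link["up"] = False        # simulate the 43-second connectivity loss
try:
    site.write("k", "v2")
    rejected = False
except ReadOnlyError:
    rejected = True
```

The cost of this design is availability: every write fails for the duration of the outage. The commenter's argument is that a short window of failed writes is cheaper than hours of reconciling two diverged primaries.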
@ironized 1 year ago
Found this video today. Please keep these up! I work in business resilience/crisis management and find this very helpful.
@Caphalem 1 year ago
This channel is way too small for content this good
@HyperMario64 1 year ago
This kind of incident is pretty much a nightmare for any on-call engineer. Not eager to do any of this kind of work.
@FrozenMilkOnACloudyDay 1 year ago
Inches instead of centimeters, I love this channel
@zyxwvutsrqponmlkh 1 year ago
11:34 During the solar occlusion, is backup data transmitted via gravity waves or by neutrinos? That part has always been a bit confusing to me.
@Epausti 1 year ago
Love your stuff! Your channel will blow up
@MikeHarris1984 1 year ago
LMAO @ the bitbucket skit at the end... that had me cracking up!
@jackhammer915 1 year ago
I saw this and was like "Damn, he was quick, this was like a day ago!"… Then I realized that this was from 2 weeks ago and GitHub was just down for a second time on Thursday 😂😂
@lbgstzockt8493 1 year ago
The outro was hilarious. I fully expect this to happen with the colonisation of the solar system.