An engineering deep-dive into Atlassian's Mega Outage of April 2022

Views: 10,327


Comments: 36

@IshanUpamanyu 2 years ago

I loved this breakdown. What I loved most was that you did not just regurgitate what was in Atlassian’s blog post but explained key concepts very well. Super.👏

@shervilgupta92 2 years ago

An amazing analysis and break down of the outage. Thank you so much Arpit !!

@1littlecoder 2 years ago

Great Content! For someone who's not much into System Designs and Engineering (I work with Data), I could understand very well! Thanks a lot! Subscribed!

@krum.00 2 years ago

Good stuff! There are very few channels that try to do this on the fly, not in an interview prep style. Hope your channel grows like Hussein Nasser’s, who also does detailed deep dives like this.

@AsliEngineering 2 years ago

Thanks 🙌

@mayankpant5376 2 years ago

He is better than Hussein Nasser, IMO.

@krum.00 2 years ago

Backend, especially distributed systems, is a pretty broad domain. So "better" is relative, like always. You could excel in a certain domain and just know stuff about others. Anyway, I'm not here to pass judgement, just to learn something new.

@mayankpant5376 2 years ago

@krum.00 Yeah, for sure. But Arpit has a better way of explaining things. Hussein's videos are very slow-paced for my taste.

@harryinsin 2 years ago

@mayankpant5376 Same here :) I too felt that he did a better job with content and explaining things with diagrams.

@CodeGenAI 2 years ago

Hey Arpit. Really insightful video. Well explained. Looking forward to a video on what could have been done for a better recovery.

@manoranjithp3187 2 years ago

Very good dissection. I liked the format too. The explanations were brief enough not to shift focus from the dissection and detailed enough to understand the context. The interesting thing is (speculatively) looking at the system architecture from both the development and operations points of view. And that Hacktoberfest T-shirt 😉

@ArunKumar-gy5wz 2 years ago

Wow!! A detailed explanation man! keep it up.

@Riteshsharma-tw9ov 2 years ago

Hi Arpit, I think one important aspect you might want to highlight is the violation of the single responsibility principle in this case. The script in question must have been too smart for its own good. One takeaway for me was to keep my production scripts dumb and aligned to SRP. Thanks for the great video. Subscribed :)

@AsliEngineering 2 years ago

Yeah. The script should have been responsibly focussed.

@utsavprabhakar5072 2 years ago

Great video, as always !! And this is just you getting started. Excited for where this channel goes :D

@saurabhjagtap 2 years ago

Damn! This shows that even such a giant org makes rookie mistakes which lead to such a catastrophe, and how impactful our profession really is. Great content! This is what I'm going to binge-watch this long weekend. Thanks for this! Also, we would love a video on CDC as well! :) PS: that Hacktoberfest T-shirt makes me feel sad, my mom converted mine into a pochaa without asking me :'(

@1littlecoder 2 years ago

Haha, I guess all households hate Hacktoberfest T-shirts :)

@mohitkumartoshniwal 2 years ago

Loved it. Learnt a lot from this.♥️

@damercy 2 years ago

Loved it! Thanks a lot for this ❤️

@architshukla8076 2 years ago

Thanks Arpit for sharing the amazing analysis 👏 👍

@arai_19999 2 years ago

Amazing Video! 👌

@neelabhrabhattacharyya 2 years ago

Amazing content. Earned a subscriber ❤️

@hridyanshpareek 2 years ago

30 minutes passed like a breeze, such an interesting video.

@anubhavgoel4884 11 months ago

Since you mentioned that a permanent delete had happened, how come the data didn't get deleted from the replica database and was still available for restoration?

@AsliEngineering 11 months ago

It was deleted from transactional systems (including replicas) but backups, archives, and snapshots had it.

@pawanmahalle6298 2 years ago

Thanks for the step-by-step walkthrough using the official release! System-wise they did have CDC (assumption) and mechanisms to restore customer data (based on the doc). So, can the delay in restoring data be attributed to a lack of automation, and thus scalability, for restoring customer data? One could also suggest architectural shifts like pure multi-tenancy over hybrid, but that would be too costly and may not be practical, as you pointed out.

@AsliEngineering 2 years ago

Slowness in restoration is purely because multiple customers share the database, so their rows are intermingled. To extract the rows they are interested in, they have to load the entire backup and then fetch the necessary rows while scanning nearly the entire table. Yes, if the architecture were pure multi-tenant it would have been much simpler. Even logical sharding would have helped. Looking at their restoration times, it seems (pure speculation) that they do not have logically sharded DBs.
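To make the point about intermingled rows concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `issues` table, its columns, and the tenant names are all made up for illustration and say nothing about Atlassian's actual schema. It only shows that pulling one customer's rows out of a shared backup means filtering the whole table.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory database with a single shared (multi-customer) table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE issues (id INTEGER PRIMARY KEY, tenant_id TEXT, payload TEXT)"
    )
    db.executemany("INSERT INTO issues VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def restore_tenant_rows(backup, live, tenant_id):
    """Copy one tenant's rows from a loaded backup into the live database."""
    # Rows of all tenants sit side by side, so with no index or shard on
    # tenant_id this query scans the whole backup table to find a few rows.
    rows = backup.execute(
        "SELECT id, tenant_id, payload FROM issues WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()
    live.executemany(
        "INSERT INTO issues (id, tenant_id, payload) VALUES (?, ?, ?)", rows
    )
    live.commit()
    return len(rows)


# Hypothetical data: the backup still has every tenant, live has lost "acme".
backup = make_db(
    [(1, "acme", "a1"), (2, "globex", "g1"), (3, "acme", "a2"), (4, "initech", "i1")]
)
live = make_db([(2, "globex", "g1"), (4, "initech", "i1")])

restored = restore_tenant_rows(backup, live, "acme")
```

With one database (or one logical shard) per customer, the same restore would be closer to copying a whole unit back, with no filtering step at all.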

@NikunjGupta9 2 years ago

Great content again!!! One question about synchronised standby replicas: are they for higher availability or for higher consistency?

@krum.00 2 years ago

Consistency. You would go for async updates to the replicas for eventual consistency.

@AsliEngineering 2 years ago

Availability. They are not serving any traffic; they are there just to take the writes and be ready for a failover.

@krum.00 2 years ago

@AsliEngineering I thought that if you are ready to pay 2x in latency to make sure you don't lose any commit, consistency would be your end objective.

@NikunjGupta9 2 years ago

@AsliEngineering IMO if you let clients know their data has been written only after flushing the writes to both master and replica, it implies consistency, because if, say, there is a network partition (the connection between master and replica gets disconnected), your application will not be available.

@AsliEngineering 2 years ago

I am not denying the possibility. The hint is purely because of the word "standby". Strong consistency across DBs is good to have only when the other DB is serving traffic. But because the other DB is a standby replica, it implies that it is not serving any traffic. Hence I think it would be used as a pure synchronous backup. But there is no restriction on interpretation or usage.
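Both readings in this thread can be seen in a toy model of a synchronous standby. This is only a sketch: `Node` and `SyncReplicatedDB` are hypothetical names, and real databases do this with write-ahead logs and network acknowledgements rather than Python lists.

```python
class Node:
    """A database node that can durably record writes while it is reachable."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.reachable = True

    def flush(self, record):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.log.append(record)


class SyncReplicatedDB:
    """Acknowledge a write only after primary and standby have both flushed it."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, record):
        self.primary.flush(record)
        # The standby serves no reads; it only has to confirm the write.
        # If it cannot, this raises and the client never receives an ack.
        self.standby.flush(record)
        return "ack"


primary = Node("primary")
standby = Node("standby")
db = SyncReplicatedDB(primary, standby)
ack = db.write("order-1")
```

Every acknowledged write is already on the standby, so a failover loses no commits (the availability argument). If the standby becomes unreachable, `write` raises instead of acknowledging, which is the trade-off the consistency argument points at.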

@RohanShahHere 2 years ago

Hi Arpit, Rohan from Airtel xLabs. Great video and content, kudos. A quick question, though: how could client-site-ids match client-app-ids? I would guess these two would be the primary keys of different collections, client-sites and client-apps, and the script should have run on the client-apps collection with the list of client IDs in permanent delete mode, which ideally should not have deleted any record (if the IDs were of UUID type). Am I understanding or assuming something wrong here?
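As an illustration of the kind of guard being discussed here, and not a claim about how Atlassian's script actually worked, a deletion helper could refuse any ID that is not present in the collection it is meant to operate on, and default to a recoverable soft delete. The `Store` class, the `apps` and `sites` collections, and the ID values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    apps: dict = field(default_factory=dict)   # app_id -> record
    sites: dict = field(default_factory=dict)  # site_id -> record


def delete_apps(store, app_ids, permanent=False):
    """Delete app records only; reject the whole batch if any ID is not an app."""
    unknown = [i for i in app_ids if i not in store.apps]
    if unknown:
        # Fail before touching anything rather than delete a partial or wrong set.
        raise ValueError(f"not app IDs: {unknown}")
    for app_id in app_ids:
        if permanent:
            del store.apps[app_id]
        else:
            store.apps[app_id]["deleted"] = True  # soft delete, still recoverable
    return len(app_ids)


store = Store(
    apps={"app-1": {"deleted": False}, "app-2": {"deleted": False}},
    sites={"site-9": {"deleted": False}},
)
marked = delete_apps(store, ["app-1"])
```

Passing a site ID such as `"site-9"` raises `ValueError` and leaves both collections untouched, and the single-purpose function is in the spirit of the "keep production scripts dumb" takeaway earlier in the thread.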