I loved this breakdown. What I loved most was that you did not just regurgitate what was in Atlassian’s blog post but explained the key concepts very well. Super. 👏
@shervilgupta92 2 years ago
An amazing analysis and breakdown of the outage. Thank you so much, Arpit!!
@1littlecoder 2 years ago
Great Content! For someone who's not much into System Designs and Engineering (I work with Data), I could understand very well! Thanks a lot! Subscribed!
@krum.00 2 years ago
Good stuff! There are very few channels that try to do this on the fly, not in an interview prep style. Hope your channel grows like Hussein Nasser’s, who also does detailed deep dives like this.
@AsliEngineering 2 years ago
Thanks 🙌
@mayankpant5376 2 years ago
He is better than Hussein Nasser imo.
@krum.00 2 years ago
Backend, especially distributed systems, is a pretty broad domain. So "better" is relative, like always. You could excel in a certain domain and just know stuff about others. Anyway, I'm not here to pass judgement, just to learn something new.
@mayankpant5376 2 years ago
@krum.00 Yeah, for sure. But Arpit has a better way of explaining things. Hussein's videos are very slow-paced for my taste.
@harryinsin 2 years ago
@mayankpant5376 Same here :) I too felt that he did a better job with the content and with explaining things using diagrams.
@CodeGenAI 2 years ago
Hey Arpit. Really insightful video. Well explained. Looking forward to a video on what could have been done for a better recovery.
@manoranjithp3187 2 years ago
Very good dissection. I liked the format too. The explanations were brief enough not to shift focus from the dissection and detailed enough to understand the context. The interesting thing is (speculatively) looking at the system architecture from both the development and operations points of view. And that Hacktoberfest T-shirt 😉
@ArunKumar-gy5wz 2 years ago
Wow!! A detailed explanation, man! Keep it up.
@Riteshsharma-tw9ov 2 years ago
Hi Arpit, I think one important aspect you might want to highlight is the violation of the single responsibility principle in this case. The script in question must have been too smart for its own good. One takeaway for me was to keep my production scripts dumb and aligned to SRP. Thanks for the great video. Subscribed :)
@AsliEngineering 2 years ago
Yeah. The script should have been focused on a single responsibility.
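To illustrate the single-responsibility point in this exchange, here is a minimal Python sketch. It is not Atlassian's actual script: the `store` layout, the IDs, and both function names are hypothetical. It contrasts a delete helper that removes whatever an ID happens to resolve to with one that only ever looks at apps.

```python
def delete_anything(store, object_id):
    """'Too smart' helper: deletes from whichever collection contains the ID."""
    for collection in store.values():
        if object_id in collection:
            del collection[object_id]
            return True
    return False


def delete_app(store, app_id):
    """Focused helper: only touches the apps collection, fails loudly otherwise."""
    apps = store["apps"]
    if app_id not in apps:
        raise KeyError(f"{app_id!r} is not an app ID; refusing to delete")
    del apps[app_id]


def make_store():
    # Hypothetical in-memory stand-in for two separate collections.
    return {
        "apps": {"app-1": "legacy plugin"},
        "sites": {"site-1": "customer site"},
    }


if __name__ == "__main__":
    store = make_store()
    delete_anything(store, "site-1")   # a site ID passed by mistake
    print(store["sites"])              # the site has been removed

    store = make_store()
    try:
        delete_app(store, "site-1")    # same mistake, focused helper
    except KeyError as err:
        print("rejected:", err)
    print(store["sites"])              # the site is untouched
```

With the focused helper, a wrong kind of ID turns into an error instead of a deletion, which is the "keep production scripts dumb" takeaway in a few lines.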
@utsavprabhakar5072 2 years ago
Great video, as always !! And this is just you getting started. Excited for where this channel goes :D
@saurabhjagtap 2 years ago
Damn! This shows that even such a giant org makes rookie mistakes that lead to such a catastrophe, and how impactful our profession really is. Great content! This is what I'm going to binge-watch this long weekend. Thanks for this! Also, we would love a video on CDC as well! :) PS: that Hacktoberfest T-shirt makes me feel sad; my mom converted mine into a pochaa (floor-cleaning rag) without asking me :'(
@1littlecoder 2 years ago
Haha, I guess all households hate Hacktoberfest T-shirts :)
@mohitkumartoshniwal 2 years ago
Loved it. Learnt a lot from this.♥️
@damercy 2 years ago
Loved it! Thanks a lot for this ❤️
@architshukla8076 2 years ago
Thanks Arpit for sharing the amazing analysis 👏 👍
@arai_19999 2 years ago
Amazing Video! 👌
@neelabhrabhattacharyya 2 years ago
Amazing content. Earned a subscriber ❤️
@hridyanshpareek 2 years ago
30 minutes passed like a breeze, such an interesting video.
@anubhavgoel4884 11 months ago
You mentioned that a permanent delete had happened, so how come the data wasn't deleted from the replica database and was still available for restoration?
@AsliEngineering 11 months ago
It was deleted from transactional systems (including replicas) but backups, archives, and snapshots had it.
@pawanmahalle6298 2 years ago
Thanks for the step-by-step walkthrough using the official release! System-wise, they did have CDC (assumption) and mechanisms to restore customer data (based on the doc). So, for the delay in restoring data, can the problem be attributed to a lack of automation, and thus scalability, in restoring customer data? One can also suggest architectural shifts like pure multi-tenancy over hybrid, but that would be too costly and may not be practical, as you pointed out.
@AsliEngineering 2 years ago
Slowness in restoration is purely because multiple customers share the database, because of which rows are intermingled. To extract the rows they are interested in, they will have to load the entire backup and then fetch the necessary rows while scanning nearly the entire table. Yes. If the architecture was pure multi-tenant it would have been much simpler. Even logical sharding would have helped. Looking at their restoration times, it seems (pure speculation) that they do not have logically sharded DBs.
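A rough Python sketch of the restoration cost described in this reply, under stated assumptions: the row layout, the `tenant_id` field, and all function names are hypothetical, and a backup is modelled as a plain in-memory list. It compares pulling one customer's rows out of a shared backup with reading a per-tenant (logically sharded) backup.

```python
from collections import defaultdict


def restore_from_shared_backup(backup_rows, tenant_id):
    """Pull one tenant's rows out of a backup where all tenants are intermingled.

    Returns (restored_rows, rows_scanned); every row has to be inspected.
    """
    restored = []
    scanned = 0
    for row in backup_rows:
        scanned += 1
        if row["tenant_id"] == tenant_id:
            restored.append(row)
    return restored, scanned


def shard_by_tenant(backup_rows):
    """Group rows per tenant, standing in for a logically sharded backup."""
    shards = defaultdict(list)
    for row in backup_rows:
        shards[row["tenant_id"]].append(row)
    return dict(shards)


def restore_from_sharded_backup(shards, tenant_id):
    """Restore a tenant by reading only that tenant's shard."""
    rows = shards.get(tenant_id, [])
    return list(rows), len(rows)


def make_backup(tenants=3, rows_per_tenant=4):
    # Hypothetical data: rows from different tenants interleaved in one table.
    return [
        {"tenant_id": f"t{t}", "row": r}
        for r in range(rows_per_tenant)
        for t in range(tenants)
    ]
```

In this toy model the shared restore inspects every row in the backup regardless of how few belong to the customer, while the sharded restore reads only that customer's rows. Real restores also involve loading the backup itself, which the sketch leaves out.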
@NikunjGupta9 2 years ago
Great content again!!! One question here about synchronised standby replicas. Are these synchronised standby replicas for higher availability or for higher consistency?
@krum.00 2 years ago
Consistency. You would go for async updates to the replicas for eventual consistency.
@AsliEngineering 2 years ago
Availability. They are not serving any traffic; they are there just to take the writes and be ready for a failover.
@krum.00 2 years ago
@AsliEngineering I thought that if you are ready to pay 2x in latency to make sure you don't lose any commit, consistency would be your end objective.
@NikunjGupta9 2 years ago
@AsliEngineering IMO, if you let the clients know that their data has been written only after flushing the writes to both master and replica, it implies consistency, because if, let's say, there is a network partition (the connection between master and replica gets disconnected), your application will not be available.
@AsliEngineering 2 years ago
I am not denying the possibility. The hint is purely because of the word "standby". Strong consistency across DBs is good to have only when the other DB is serving traffic. But because the other DB is a standby replica, it implies that it is not serving any traffic. Hence I think that it would be used as a pure synchronous backup. But there is no restriction on interpretation or usage.
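A toy Python model of the setup being debated here, assuming a primary with one synchronous standby. Class and method names are hypothetical, and this is not how any particular database implements replication. A write is acknowledged only once both nodes have it, the standby serves no reads, and a failover promotes it.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []
        self.reachable = True

    def apply(self, record):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.log.append(record)


class SyncStandbyPair:
    """Primary plus one synchronous standby that serves no traffic."""

    def __init__(self):
        self.primary = Node("primary")
        self.standby = Node("standby")

    def write(self, record):
        # Ship to the standby first: if it cannot confirm, the write is
        # rejected before the primary applies it.
        self.standby.apply(record)
        self.primary.apply(record)
        return "ack"

    def read(self):
        # Reads only ever go to the current primary.
        return list(self.primary.log)

    def failover(self):
        # Promote the standby; it already holds every acknowledged write.
        self.primary, self.standby = self.standby, self.primary
```

Both readings in the thread show up in the sketch: no acknowledged write is missing after `failover()`, which is the availability argument, and `write()` fails while the standby is unreachable, which is the trade-off raised in the consistency argument.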
@RohanShahHere 2 years ago
Hi Arpit. Rohan from Airtel xLabs. A great video and great content on this, kudos. A quick question, though: how could client-site-ids match client-app-ids? I would guess these two would be the primary keys of different collections, client-sites and client-apps, and the script should have run on the client-apps collection with the list of client IDs in permanent-delete mode, which ideally should not have deleted any records (if the IDs were of UUID type). Am I misunderstanding or assuming something wrong here?