Google Systems Design Interview With An Ex-Googler

Рет қаралды 771,032

Күн бұрын

Пікірлер: 869

@clem 4 жыл бұрын

I hope you all find this mock systems design interview insightful! And if you're in the business of acing your systems design interviews (who isn't? 😎), be sure to check out www.systemsexpert.io where we have more questions like this one as well as a full course on all of the fundamental systems design topics that you need to understand to tackle these types of interviews.

@ajr1791ze 4 жыл бұрын

who is the guy taking interview? where he works ?

@Andrew-ez9ft 4 жыл бұрын

Clement, I have some questions. What is the fastest way to become a junior developer in 2020??? What is the best way to learn programming to build advanced projects???

@KeyrptNetwork 4 жыл бұрын

it will be very useful to make another Systems Design Interview?

@VishalSingh-kc1jn 4 жыл бұрын

Dude grow some beard. Would look good🔥

@techwithwhiteboard3483 4 жыл бұрын

make sure you buy other domains with the same name 🤔✌

@lasophistique3513 4 жыл бұрын

What I loved most is, it actually went on like a real interview, where candidate doesn't know all the answers, and kept building on the assumptions, kept extending the designs. Thank you for making it look realistic!

@charliel.5794 4 жыл бұрын

I'd like to hear the feedback from the interviewer, I think that's the important part on distinguishing what went well and what didn't

@bluespeckcf5949 4 жыл бұрын

The feedback is probably that he's missed a lot of corner cases (interviewer mentions at least 2 of them in the video), and I don't know why he's not using a messaging queue instead. In short, this is probably a weak interview.

@vishnuv4813 4 жыл бұрын

Normally, the interviewers not sure about the solution themselves or they would have a template solutions which they feel that would fit every problem. So, they would not be able to comment on our design.

@gavindmello 4 жыл бұрын

@@bluespeckcf5949 I thought he could have used a message queue too, but, it wouldn't allow monitoring workers which have gone awol.

@cloud15487 4 жыл бұрын

@@bluespeckcf5949 can you elaborate a bit more? What exact corner cases did he miss?

@SpencerYork1534 4 жыл бұрын

@@cloud15487 For Example, one that I caught, was the worker heartbeat, what if the job is successfully completed but didn't send a heartbeat due to a network error, you wouldn't want to run the build again. Another is he never mentioned storage for the build artifacts, and how to resume a build from the artifacts, or use them, if you are building a build system, you definitely need to cover some sort of distributed storage for build artifacts, if you are building 100k times a day you don't want to have to download JS, or .NET, 100k times, even if you are pulling from your own repository you want to eliminate as much bandwidth usage as possible. BUT overall this is a good interview, maybe not perfect, but that is typical for most interviews, I think it was more important to show how he interacted with the interviewer, asking questions, clarifying requirements. Most companies want to see how well you handle overall project communication so as to best deliver according to the stated requirements.

@jlove1296 2 жыл бұрын

I used to have zero understanding of the importance of clarifying requirements and used to ask typical dimension clarifying question without knowing how that could actually help me solving the problem at hand. You really showed how you could incorporate the information you gained by asking clarifying questions through this mock interview - appreciate it

@nishantt6419 4 жыл бұрын

wow...simply amazing.....this is like a real world experience for the system design interviews most of the videos on youtube are just about explaining the system design about some popular applications like messenger, tinyurl, dropbox etc... all of them have one thing in common....all of them explain the design very well in only one aspect....they are all well structured, there are no hiccups, there is not much thinking to do.....they design the system without any flaws...why? because they know the design and there is no interviewer who is asking for in depth details or looking for more clarifying answers...they just explain the system in only one direction This is totally different and the one which we actually need. This is by far the best for multiple reasons 1. It shows how a real world interview looks like with all the white boarding sessions 2. the question given by the interviewer is very abstract....the interviewee may know the design in one particular aspect, but the interviewer is looking for some different aspect and asking more in depth details at every stage 3. the interviewee actually asks a series of questions at the start and also keep asking different questions time to time 4. at times the interviewer stops the interviewee because he missed explaining few details and missed details about fault tolerance and keeps asking for more in depth details about how the interviewee is approaching his design 5. its more of a collaborative brainstorming which actually happens in real world 6. it also showed how the interviewee picked some technologies like SQL, GCS, KVS etc and validated it with his explanation, getting approval of the interviewer and proceeding to the next stage at every point in the interview. Totally worth watching the whole video.....thank you Clement and kudos for creating such a valuable content hope to see more such videos

@elliemay1748 4 жыл бұрын

Yeah this is exactly how this goes in real interviews. When I interviewed at Amazon for engineering manager, I was asked to write an analytics system with 1 million writes per second. It was a bit more challenging than this example in terms of scale, but still similar in concept.

@khoavo5758 4 жыл бұрын

"Ex-Googler" means that you now use Bing or what? /s

@444haluk 4 жыл бұрын

hahahahh underrated joke

@DarkH4X0 4 жыл бұрын

Yep, now he uses Bing to see StackOverflow answers

@atharvparlikar8765 4 жыл бұрын

maybe duckduckgo

@amazingvipul8392 3 жыл бұрын

He still uses KZbin though 🙂

@carlhsieh2615 3 жыл бұрын

It means he's not evil anymore lol

@zss123456789 4 жыл бұрын

This is VERY helpful. This is literally better than any video that just tells you general tips that are hard to follow in an interview setting.

@crankerson 3 жыл бұрын

The SQL transaction would ensure that all of the SQL statements would run together or not at all (rollback), but it doesn't ensure atomicity. In this design, to prevent multiple workers from running the same job concurrently, wouldn't you need to specify the isolation level on the transaction in order to prevent a non-repeatable read? Example: worker 1 selects the oldest "queued" job. worker 2 selects the oldest "queued" job. worker 1 updates the job status to "running".

@anant543210 2 ай бұрын

Or use a row level lock?

@TheCycloneTrooper 29 күн бұрын

Table locks will throttle the throughput of your system but it’s true concurrency control will be needed to prevent effort duplications

@charan775 4 жыл бұрын

Every time clement was the interviewer but now for a change clement is the interviewee.

@clem 4 жыл бұрын

Gotta flex my owns skills _sometimes_ ! 😎

@kevinsheng8775 4 жыл бұрын

Lol clement does sound like a interviewee

@kevinsheng8775 4 жыл бұрын

It seems like the interviewer is getting educated from clement lol

@yunjiehong4649 4 жыл бұрын

Chinglin Sheng An interviewer has no motivation to speak a lot, while an interviewee needs to think loud

@animalrocket4809 2 жыл бұрын

This was really good. I often paused the video to offer my 2 cents and thought out loud. Really liked the collaboration between Clement and the interviewer. Also, one subtle thing is Clement’s frightlessness in asking even the most simple questions and repeating things in his own words. This is what allowed him to really build and understanding of the problem, as the interviewer’s words may not click the first time and they are meant to be ambiguous so it fosters clarifying questions. His interviewing skills really showed off here. So much to learn!

@thefunkyhermit 3 жыл бұрын

Great video Clement. Thank you for sharing an insight into how these interviews are conducted. 2 things I would have prodded more for as the interviewer are: 1) How do you deal with downtime while switching between the versions? 2) How do you manage DB changes as well as the code changes discussed here.

@AmrMostafaY 4 жыл бұрын

I loved this. Thanks so much for taking the time to put this high quality video! I've a small comment that I can't get out of my head: The SQL transaction that workers use to grab jobs seem to have a race condition that might make 2 or more workers grab the same job. I don't think ACID here protects from that. Locking needs to be done somehow to prevent that, and the first option I'd explore, as it's the simplest, is to change the SELECT to SELECT FOR UPDATE, assuming the underlying SQL engine supports it which is guaranteed to lock the resulting row for a single worker until it had a chance to perform the update on the status. If SELECT FOR UPDATE is not supported, we can explore other options such as manual row-level or table-level locking. Great stuff, thanks again!

@ihorhula 4 жыл бұрын

Was thinking the same :)

@jiaruitian677 3 жыл бұрын

Same thought. The transaction's atomicity does not mean only one of the concurrent update requests will be success.

@nameunknown007 3 жыл бұрын

Exactly my thought. Plus I was thinking why he wouldn’t use a tool built for queues instead of making sql fit into this use case. Especially when the interviewer asked him do you think this will serve the purpose and he doubled down on it lol. Note to myself: listen carefully when such a question comes up haha

@JohnTowell 3 жыл бұрын

Came here to say this

@swoogan 2 жыл бұрын

You can actually build a queue in SQL without SELECT FOR UPDATE, or any implicit locks. You do an update on the row, setting its state to RUNNING, where the id matches and the row is in the QUEUED state. If the affected rows is 0, the task has already been picked up. If it is 1, you have picked up the task. Note that autocommits need to be enabled on the UPDATE, or there is no difference from SELECT FOR UPDATE.

@kkkarki6880 4 жыл бұрын

Clement, I really like the way how questioned to dissect every aspect of the question, to collect as much as information which is a super motivating tool for all developers.

@akkshayganesh1136 4 жыл бұрын

Brilliant setup. Great cross questioning and really makes the viewer think. Good going Clement!

@ashrafkotbofficial 4 жыл бұрын

This is really beneficial, thank you for taking the time to go through a full mock interview.

@ShubhamSingh-ku2ow 3 жыл бұрын

Such an AMAZING video Clement. Before this video, I went through few others but this one literally beated them all. The quality of the questions asked and the way you responded gave a Google feel. Thanks ALOT for bringing such high quality content. Absolutely terrific.

@aritraroy4220 4 жыл бұрын

Your help and free content is appreciated, and I am rooting for the success your entire business. There is nothing brighter than an honest and smart effort. Keep shining.

@AshutoshAnandashu 4 жыл бұрын

This was so good. Thank you @Clement for giving us an insight on what goes on in a real system design interview. Kudos to the interviewer as well. He asked brilliant questions and put up some interesting extra questions to make it difficult yet understandable. Subscribed ✅

@ramkrishnakulkarni8289 4 жыл бұрын

Clement you nailed it man. Fabulously awesome content. I’ve learned a lot from this. Please make more videos like this. 👏👏👏

@l_karuhanga 4 жыл бұрын

No one: Clément: Thinking back to my experience with code deployment systems.

@ascientistanengineer8038 3 жыл бұрын

I will leverage that :D

@samehghanmi4487 4 жыл бұрын

Thank you Clément for the great video. I learned so much about System Design in less than an hour. Keep it up!

@kalmurza 4 жыл бұрын

Great video. I totally recommend pausing the video after each feedback from interviewer and try to continue the design based on his comments. And great job, Clement. You basically recaptured Gitlab CI’s structure :)

@jeffhsipepi 3 жыл бұрын

Thanks Clément! What I really enjoyed about this video is that someone non-technical like myself, can still gain valuable insight/understanding into how to tackle these tricky system design questions. Great video!

@iorekby 4 жыл бұрын

Great to see this sort of thing. So many people trying to get in to tech are obsessed just with software engineering. There's a whole gamut of careers and skill sets in tech, and hopefully this will open people's eyes to the opportunities aside from software engineering (nothing wrong with software engineering, btw).

@danielrocks234 3 жыл бұрын

This is the kind of interview *is* for software engineering. This is what senior SWEs at major tech companies go through in the interview process.

@ankitGrover-LTD 4 жыл бұрын

Kudos to you Clement and team. This is one of the best interviews I have seen in a long time. Thanks for sharing!

@Adam-ro2zv 4 жыл бұрын

Hi Clement you're really doing great works for people, I like the way you shares these insightful contents.

@uskro 4 жыл бұрын

I don't get why you'd set a heartbeat to write to the db that frequently. You could just set a timestamp of assignment and your health check could just check the timestamp delta. You could have a read query executed every 5 minutes, to find the running operations updated over 15 minutes ago to find the problem operations. But you could also benefit from a worker identifier in the table, and your health check could also be a queue solution to check both with your table and the workers, maybe the operation is just taking longer than expected so you also wouldn't want to reassign the operation to another worker and have the code deployed by two different workers.

@mikedqin 2 жыл бұрын

Great interview video: I liked the most is the interaction between the interviewer and interviewee. Some thoughts on the design: 1) the SQL database can be a Queue system such as Kafka or Pub/Sub from GCP, which handles the scale very well with partitions. Given the large scale, the frequent query and updates to the SQL database will harm the performance, especially update the worker's heartbeats every second in the table. So a message broker will be the best solution. Worker status / heartbeat should be monitored and reported to a monitoring system such as Prometheus. Worker can push its health check status to Prometheus. 2) The interviewer tried to ask system design in Non-Abstract way, meaning how many servers are needed, do an analysis on bandwidth, timing, storage in a datacenter, and non-uniformed load (x1.25), then 3 datacenter deployment in a region, to derive the footprint per datacenter and global footprint. I hope the interviewee can elaborate on the Non-Abstract design more. Overall, a great system design interview. Thank you very much.

@artemtraer7615 2 жыл бұрын

I guess problem with Kafka is that it stores the msges in topic only for some period of time, and also how would you update the record inside the topic . But there could be a sql - Kafka replica tho

@mputcha 2 жыл бұрын

I had imagined a distributed queue and a separate process to store the jobs for retaining history if needed. DB for queue will be expensive. Maybe even a DB+ Redis Cache where the server pulled off from a queue structure if access to cloud services is a constraint

@dorcohen3522 2 жыл бұрын

100% right. I can add two more things on top of this. 1. Why building SQL queries is important for system design interview? We should deal with more abstract design, and maybe, later, it's possible to talk about SQL in depth, but they wasted so much time on it. 2. The presented SQL query will now work properly under load, this is why a queue system is by far superior, I couldn't understand why he wouldn't just use that, we can retain the data for as long as we want and slap extra metadata on top.

@Core433 2 жыл бұрын

I think SQL is indeed a better choice than message queue (i.e., Kafka) here. MQ for stream processing might be appropriate for super high QPS workflows, but assuming a giant tech company with 40k devs each with 2 commits per day, that's only 0.9 QPS builds coming into this build system, which is well within the abilities of a SQL DB and complete overkill for MQ. Additional reasons why MQ is not an ideal option: 1) MQ tends to have low retention, you'll need to end up storing your jobs in some persistent store like SQL or NoSQL DB anyway to preserve full history 2) It's difficult to parallelize MQ unless you have multiple partitions / topics which match the number of workers -- this is because coordinating multiple workers consuming the same MQ partition is difficult. Yet the low QPS of input to this system does not warrant partitioning to me. 3) As the presenter explained, SQL DB has built-in consistency where build workers need to acquire a read/write lock to obtain the rights to build a job. MQ does not provide this out of the box, so it's up to client to coordinate worker pool. See point 2 above on why this is difficult. 4) MQ does not preserve order of jobs between partitions, and it was a requirement that jobs be roughly started in the same order they come in. I think jumping to the conclusion that we need MQ for *any* kind of queue can be a detrimental mindset in system design, and could hurt a candidate in a system design interview. It's worth understanding the tradeoffs and explaining this to the interviewer.

@xtzyshuadog 2 жыл бұрын

@@Core433 How does one enter the career path of system design? What prerequisite knowledge, certifications, skills would you think they would need? Are there particular programming languages or tools for home situations one can practice in by designing their own simple small systems using free tools?

@fbeer23 4 жыл бұрын

Best way for the SQL-Statement at 27:00 would be: BEGIN TRANSACTION; SELECT * FROM jobs GROUP BY status HAVING status = "QUEUED" AND created_at = MIN(created_at) LIMIT 1; UPDATE jobs SET status = "RUNNING" where id = our_id; COMMIT;

@Nyquiiist 4 жыл бұрын

I am so inspired and motivated by your videos Clement. Gonna dive into AlgoExpert starting tomorrow, super excited!

@waynegreen7970 4 жыл бұрын

Thanks for the great content. Even if someone doesn't get a job at a FAANG company, it will help them become a better programmer. Your courses are definitely a step up from the vague books, tutorials, and blogs that were available sparingly just a few short years ago. While some of those resources were adequate for people who learn well from books, the videos help those who are visual learners by actually seeing the diagrams, and hearing what to say, how to say it, and what questions to ask. Thanks for providing this service.

@BiancaAguglia 4 жыл бұрын

This was very useful. Not only did it help me get a better understanding of systems design from a software development perspective, it also gave me some ideas for better designing data science specific projects. Thanks to both of you. Keep up the great work. 😊

@stepanseliuk8042 4 жыл бұрын

I can recommend “grokking the system design interview” on educative.io and github.com/donnemartin/system-design-primer#index-of-system-design-topics

@PoulJulle-wb9iu 4 жыл бұрын

@@stepanseliuk8042 bra

@guiyuxu4084 2 жыл бұрын

Going for multiple onsite interview all with system design interview, this is the most helpful material I have seen so far

@venkateshTD 4 жыл бұрын

When we got to the part where we were trying to figure out a way to find out if a worker died, I was shouting "heartbeat mechanism" sitting in front of the screen, coz that was the first concept we learnt in our software architecture class. It was so satisfying when I heard you said the same haha. Great video and it does feel like a very real interview

@achilles165 4 жыл бұрын

Most time spent on build system, not too many details covered about the deployment systems like safety of deployment, health checks, configurable deployment speeds, rollbacks, global service health \ availability checks to pause, incident management. These things are highest order bit for any large scale company as no one wants a global or regional outage.

@saranyaks6436 4 жыл бұрын

So many questions :) 1. How do you handle jobs which require building in multiple architectures (x64, arm, etc) 1.5 How would you handle scenarios involving multiple branches? 2. Why use SQL while there are so many distributed queue alternatives (SQS, for example) which let you handle scenarios like a builder failed elegantly? 3. Why store heartbeats in the same table? Why not have the builders register with a configuration service like Zookeeper which can take care of heartbeating? 4. How are build logs stored? How often are they garbage collected? 5. With this approach the SQL table will continue to grow infinitely which will make queries more and more expensive over time. What would be the table cleanup strategy. 6. How do you ensure that all machines have received a certain build? 7. How do you ensure that a malicious system doesn't pose to be a 'seed' and try to deploy malicious binaries?

@josephwong2832 4 жыл бұрын

Thanks a lot Clem, we need more mock SD interviews just like this for different size companies and different apps

@saadakhtar2000 4 жыл бұрын

You're great. Keep creating quality contents for us !

@clem 4 жыл бұрын

I'll do my best!

@dudukiwi 4 жыл бұрын

This is really awesome. This video is an eye-opener. Now I consider the system design question is not just a question, but it is a real task in the company. And the interviewers already know what they want, but they don't speak for themself. So an interviewee should find out what they want using additional questions like real life!

@chimpu92 4 жыл бұрын

Thank you so much for this mock interview. This is one of the best video about system-design interview for starters to learn on what to expect in a design interview and how to respond to make the interview a success....

@michalfuka5322 4 жыл бұрын

I think "begin transaction" at ~27:39 does not really mean that select is concurent safe. I would rather use something like "select ... for update" statement (depends on db). I know he assume "serial sql job service" but it's not really prepared for pooling or horizontal scaling from the sql point of view.

@ankurlucknowi 4 жыл бұрын

yes. you need some kind of locking mechanism.

@eduardolucas9896 4 жыл бұрын

Very good content as usual. I'm doing all mock interviews available in SystemsExpert and I'm very impressed by how much effort you put on this.

@sergiiolshanetskyi5488 4 жыл бұрын

Clement, thanks for the video. It would be nice to have a summary at the end - what the interviewer thinks you did well/what could have been better. Or the assumption is that you did everything perfectly fine? One specific question/concern I have is - the interviewer asked you if you can have concurrency problems when querying your queue log table from 100 workers. You said 'no' and wrote a query you would use. What transaction isolation level you meant when you were creating a transaction in your query? If you use MySQL for instance, and REPEATABLE_READ (the default one), two or more workers can read the same record as the 'oldest job' at the same time (as read is non-locking operation by default), and all such workers will update the same record as 'queued' and will pick up the same job. If you use, SERIALIZABLE you will run into a bunch of deadlocks, two or more workers can read the same record and then will wait on each other to unlock the record so it can be updated.

@heraldo623 4 жыл бұрын

Each Worker should own a private Job Queue. Make an Scheduler push new Jobs and balance the load between Queues. At any pont in time the max number of concurrent queries over the same queue is 2, the Scheduler and the Worker. In most of the cases the Scheduler will query for non running jobs to move them around and balance the load, so I think queries which will change the rows should lock the rows because the Worker may be querying for the same rows at the same time. This way just one Worker is locked down, the remaining keep working on their own queue. I made a comment explaining my solution.

@sergiiolshanetskyi5488 4 жыл бұрын

@@heraldo623 Thanks for the reply. I was not asking for an alternative implementation description (I can come up with several myself). I was specifically mentioning that 1) there was no feedback at the end 2) there seems to be an apparent concurrency problem with the existing approach. Unless I am missing something and it might still work. That is why I was asking in the first place.

@heraldo623 4 жыл бұрын

@@sergiiolshanetskyi5488 Clement made a very high level design of that system. There was a constraint of time to do the entire process, 30 min. But they did not show that the system does its job in 30 min. They could have described more the constraints (like available network bandwidth of workers and target machine) and details about the builder itself. How many time the builder takes to convert the code into a binary? Vague description of the system's responsabilities and constraints, so one can't tell if that works or does not.

@danielboteroc 4 жыл бұрын

Clement, Nice interview, I liked it. Only one thing, I am not sure about the fact that two servers won't pick up the same job using the code you've done. You might do a more complicated logic to pick up that job. If two jobs do the select at the same time, both will get the same job unless you are blocking the table.

@FlaviusAspra 3 жыл бұрын

The interviewee was quite weak I have to say. He could have done at least a SELECT FOR UPDATE SKIP LOCKED or equivalent.

@hbgl8889 3 жыл бұрын

This is more like an interview about implementing a scalable queue.

@superbmood 2 жыл бұрын

Learning some of the ideas in here helped me land a FAANG job. As a junior engineer I didn't think to use a SQL database to implement a queue for real systems at scale.

4 жыл бұрын

I didn't understand why you took so long to start designing the actual system (40 minutes to design the queue). Also, a couple of times the "Binger" (ex-Googler) asked you a couple of times to explain what would back your queue. And when you said it would be a SQL database, he asked you how the dequeue operation would be, you spent more time than I expected to explain it. I would use some already existent queue system and use another database to store the building/deploying data. Despite that, thank you for the amazing content, Clement! keep it up!

@Izma9 4 жыл бұрын

I wish I had an interviewer like this guy. Mine was just silently nodding to everything that I've said. And when I asked questions like "Should I care about this or that? What load can I expect? What are the system parameters?" he had one answer to all of them: "You can assume that anything is possible".

@iamworstgamer 4 жыл бұрын

man they are so calm and composed. Indian interviewers get irritated when I ask them questions to understand their questions

@CesarOzuna 4 жыл бұрын

Thank you for showing your prejudice for all to see. Btw this is a mock interview, not a real one. The Indians you speak so “highly” of are under pressure of a real interview.

@mohanish35 3 жыл бұрын

Another solution to the healthcheck problem of the workers could be: We can have a column called 'worker_id' in the jobs table. This will contain the id of worker a running job is assigned to. Healthcheck service will contain the information of worker_id as well. Once it finds that one of the worker is dead, it will update in the jobs table based on the worker_id.

@SamyakMehta1508 2 жыл бұрын

Fantastic video. Full of knowledge and very engaging. Thanks for this Clement! One different approach for managing the workers could be to have just one service poll the SQL table (queue) and spawn new instances on demand (Spot instances, may be) to run the jobs. Rest of the architecture remains the same i.e the spawned instances still send heartbeat signals. This could prevent over provisioning of workers especially during times when no builds are run (code freeze) or very few builds are run. But it comes with an additional waiting time equivalent to that taken by a new instance to boot up when none of the available instances are available to take up a new job.

@ritvikindupuri2388 7 ай бұрын

riigtht and mathemtiacally it may not be suitable

@zeno_aratus 4 жыл бұрын

I've always found estimating workload to be a challenging part of design, especially when you do not have as much information as you'd like to go on. So in one instance I have gone for determining the nightmare scenario and developing system specs based on that.

@alexandrudascal5585 4 жыл бұрын

A great example of a System Design interview, really looking forward to more this kind of videos

@timurkhadimullin8211 3 жыл бұрын

I am not entirely convinced that SQL scales horizontally. Especially when you realise how many records one needs to update on this single table. What I think it does well though - it illustrates the point and allows to throw some concrete details into the solution

@Mocha. 3 жыл бұрын

Realistically you would move the completed transactions into a permanent storage log through a separate service to maintain SoC. Definitely do not want to keep them in a working table. IMO it's good that Clement makes oversights because that is what makes the interview authentic.

@murtuzasyed5796 2 жыл бұрын

Thanks for making this great video. While watching the video these two points came to my mind 1. Do we need an event system that generates an event whenever a new commit is made instead of the worker checking for new jobs? 2. Having a time limit for a job to run can solve the issue when the system goes down.

@redayoub 4 жыл бұрын

thank you, i learned a lot and you really refreshed my SQL knowledge 1) in building code sub-system i think (38:30) you can check the heart beet when you enqueue the job so you don't need another aux workers ex ( select * from jobs where (heart_b < current_time -10m and status="running ") or ( status=="enqued" and create_at...) ) 2) the problem in the deployment system is what if multiple engineers at small periods click on deploying a build ( each machine will stop and start downloading the build again) i would love to hear your opinion

@heraldo623 4 жыл бұрын

About engineers trying to deploy when there is already a deploy going on. Well the Build Subsystem could build the code and the Deploy Subsystem could distribute the build as a new release to all regions as usual. It's up to the engineers to decide if the ongoing deploy should be cancelled or not. New releases are asynchronously made, only the deploy to the production machine, in this scenario, is synchronous (containers could be used to isolate the application environment from its hosting machine, this way multiple releases could be deployed asynchronously)

@TheVigilancer 4 жыл бұрын

Feels like you completely ignored “under 30 minutes” requirement. Am I missing something?

@RahulKomuravelly 4 жыл бұрын

52:15

@danielgerlag2369 4 жыл бұрын

There is a race condition in this design... simply selecting the next queued job does not give you a write lock on that row, so 2 workers could select it and try to update it, and they will both succeed, resulting in a duplicate parallel build on a second worker. Could add and 'AND status = "queued"' to the update clause to avoid this.

@ArjunGovind 3 жыл бұрын

13:06 - Why to need a queue to maintain the commits. You always pick the latest commit if there one. If there are multiple commits, you pick the latest commit since you don't want to build on intermittent commit which is anyways going to be replaced in the next cycle of 30 mins. Right? If it is deploying to multiple applications to multiple systems, the queue makes sense.

@nasserrcorcuera93 4 жыл бұрын

Awesome. Just like the interviewer said, this architecture is very close to the one's being used in the real industry - just like in my company. I was so amazed you just thought of the design on the spot. Again, great content!

@JM_utube 4 жыл бұрын

lol i mean this is definitely based on experience.

@cliffrosen3605 3 жыл бұрын

There is a pretty glaring issue with this discussion. The discussion/solution proceeded as though there was only a single application, which was deployed to every server. Maybe that’s not absurd in its own right, but the problem states that there were around 5,000 new builds per day. That would mean this one application was updated every 15-20 seconds! Clearly, a system that can deploy 5,000 different builds per day only makes sense if the system is supporting multiple different applications. And, providing such support would require a much more involved design than what was discussed.

@CFXTBogard 4 жыл бұрын

Questions I have: 1. from 11:27 is it not better to build the code when you have merge a set of new features in a release? It seems not "safe" to build on every commit, that wastes resources to build a binary of 10GB just for one single commit per developer. 2. 14:30 If you use a CI/CD script you could build and deliver the code built ready for deployment. you could use a SQS message where the regions are subscribed so once the servers read the message they can fetch the binary. 3. 18:08 Basically you could use the commit id as key since each deployment is unique and has different ID's, the release tag version for that commit. Date and time when the binary is built and if the regions already read such SQS message should be there. once the binary is downloaded to each region, set some value to get that specific commit build out of the queue 4. 47:23 you could use a SQS message subscription in each region, so when the deployment is ready, each region could import and build the code. the SQL Table should keep a record who and when it was fetched. that works for each machine connected to each regional server. Any news about the feedback? Thanks.

@Victor-gj5hy 3 жыл бұрын

great presentation. looks like a totally real interview. guy on left is even little bit annoyed. very real

@ralphwbms 2 жыл бұрын

Great video man. Thanks! About the TRANSACTIONAL part, here is how I'd have done: Instead of SELECT the next JOB in the QUEUE, I'd have the WORKER UPDATING the table with its WORKER ID, and than SELECTING the only row that got affected. If you SELECT first and then UPDATE, there is a chance of to WORKERs getting the same JOB ID at the SELECT step. Another point: I thing there is no need to begin a TRANSACTION just for one INSERT. A transaction is made by multiple steps (the SELECT does not count as a step).

@mullergyula4174 2 жыл бұрын

This is more about how to communicate than anything else. It is hard to take such thing seriously, real life tasks are usually more complex and you don't start from scratch, still seems useful.

@altereago1 4 жыл бұрын

Clement, I am curious why you didn't use a 'message queue' instead of the SQL table for task queueing?

@karimhasebou4252 4 жыл бұрын

I have the same question, why not something similar to kafka.

@knightwfu 4 жыл бұрын

I don't know what the answer is, but I would imagine that a message queue works for maintaining which requests are available, but by itself it's not enough. For example, what happens when the request fails or times out? How do you make sure that exactly one request is being processed by the workers? How can you represent the status of all jobs, past and present? My first reaction is to implement using AWS SQS; by itself, SQS FIFO might handle exactly once, some basic request error handling, but based on the read/write volume and the use cases, I think a SQL might be a great first pass.

@CodeCampaign 4 жыл бұрын

Using Database for queuing is famous anti-pattern.

@skiller1339 4 жыл бұрын

@@knightwfu I think a message queue would actually be a better option. You could set the prefetch to 1 to make sure the worker only consumes one message and when a worker starts building you could dispatch an event. You would only acknowledge the message after the build is done, in that case if the worker stops and a new one would start up it will just try the message again. Whenever something happend you could dispatch an event (for example BuildStarted, BuildFailed, BuildDone and BuildCancelled). These events would be consumed by another service which builds an representation of the current status. Also when you want to cancel the current build you could make sure that all messages for a certain job would end up at the same consumer (for example with a consisten hash exchange in rabbitmq).

@ronquan3730 4 жыл бұрын

I think the thread above is great, a point I want to call out is Messaging systems like Kafka are durable message queues in the sense that messages are persisted to disk. This actually could have been expanded on rather than going with the SQL Table approach. If this system is used by the entire company for every commit of every repository this won't scale well with a SQL Table as you'd have more writes than reads due to status updates. Messaging system would be more scalable not just with durability but also performance as you can partition on the hash of the repository and preserve order of the commits.

@abcd-dcba 2 жыл бұрын

The way you delivered intro deserves a like :)

@alexfeguson1 3 жыл бұрын

Great job. I would like to see a feedback at the end of the interview, but it is just my humble opinion. Thanks

@nitinagrawal6637 4 жыл бұрын

Will be interesting to hear the honest feedback of the interviewer after this discussion.

@suboii2978 4 жыл бұрын

First of all, thanks for making these available Clem. I seen a lot of videos about the systems design interview in preparation for my interviews and I'm a SystemsExpert customer, but I'm wondering if you could explain the difference between a novice , intermediate, and expert level performance. What criteria can be used to determine what level someone is and how able they are at designing good systems. Thanks!

@X-Dime 4 ай бұрын

I'm going to do my second Systems Design interview this week and this video clarifies a lot of things whys i wasn't so good in the first interview! I think one thing i would done different here is assuming we are using some kind of git, maybe have some kind of CDN for builds and some sort of builders for region and than some kind of a worker per server in each region, assuming we might have a cluster of machines per region and than the machine calls the CDN or something like that here

@TreeLuvBurdpu Жыл бұрын

You're building queue functionality into af database, presumably to achieve multi region fail over. You'd have to use a serverless db, to avoid massive replication overhead for a single table, use a Running TTL column updated by the worker that picked up the job to set the TTL to 20 minutes later, the worker-pull query selects the latest job with TTL after now(). You need a worker-finished query to delete the record when the worker process COMPLETES.

@jweyrich 4 жыл бұрын

About the update on the RUNNING job when the node fails... if there's a network problem in the node and the hearbeat period expired, you update the job back to QUEUED, but that node might come back alive after another node picks up the re-queued job, finish it's job, and update the job to SUCCESS, and after a while, the new node that started re-processing the job, would also update the job. Not a huge problem, but might interfere with the build history.

@warrenwongww Жыл бұрын

Just a side note, for the SQL, if multiple workload executed at the exact same second, they may pick up the same "Queued Job ID" because the transaction will not Lock the Record until Commit. Need to do it with pessimistic locking.

@pushpendrahirwani9105 4 жыл бұрын

I really like this simulated version of the system design interview. I do have more clarity now on what I should be emphasizing on more in terms of approaching the problem statement. Thanks Clement.

@ohanfillbach7106 4 жыл бұрын

I might have missed something, admittedly the video was pretty long, but you asked for feedback in the comments so here we go: Given how in depth a lot of the solution was, it seemed that how the binaries get into the regional peer to peer networks was missing. You covered that GCS will magically replicate them for you, but then you jumped right into those binaries going from host to host. How does it get into the network? I am assuming that it would just be the first host to poll KVS and see a new state, then poll the network and nobody see nobody else has it so it goes to fetch from regional GCS. Why not expand the enum in the jobs table so that you can have a single source of truth with regards to deployment? Something like: Queued, Building, Replicating, Deployable, Complete, Failed. Also with that DB, why not just use the last_h column as a phase_started time. Since you are still benchmarking against expected worst runtime, you can can keep your query basically the same but also track failures across the whole process (and not just building), without needing the heartbeat system for builders/workers. I probably also would have asked a little more about the code that we are deploying. Is it only one codebase, or should this work for multiple applications? Do we only need to support one deployed version at a time? Especially if the answer to the second question is yes, then the whole goal state solution would likely need to be changed. Overall great interview, I think the most important part was watching how you took your time and thought through everything slowly and deliberately while asking for clarification at basically every step

@StupidNub 4 жыл бұрын

I'm not getting how the 10gb binary gets to all the regional GCS either. Or how p2p networks work. I would think a 10gb file going from country to country would take a while, especially at if it's done ad-hoc at the time of the app release? What if it's a 5TB binary? My guess from what was meant in this video is that regional host 1 (rh1) sends to host 2 (rh2), then rh1 and rh2 send to rh3 and rh4, incrementally/exponentially being able to send binary/replicate more at a time? Alternatively, with an entirely different take: I heard that making incremental patches to source code as patches is very quick. So, why can't the version control system live in each region, and we just ensure they all have the latest commit. Then, each regional host just builds the binary prior to the scheduled release time of the app, and there wouldn't be a need to download a huge binary across countries.

@UniqPtr 3 жыл бұрын

About the heath check part. correct me if I'm wrong. I think the ideal solution will be using a zookeeper or a zookeeper-like coordinator. Reasons: 1. your surveillance service can also die. 2. Times stamps rely on system clocks. In some cases, clock may pause or even go backward. In those cases, you are killing a healthy job

@Tourach74 4 жыл бұрын

I would also have asked to which platform/OS we deploy: linux, windows, ... etc. Thoughts?

@nobytes2 3 жыл бұрын

Is probably one OS, most likely Linux unless is some dotnet or iis product.

@shabyparveen1649 3 жыл бұрын

@@muthukumarankothandaraman2371 when you say global there first thing I would ask is what do you mean by global? And this question was aked by the candidate in this mock interview, then what kind of devices or machines are target , along with geo-locations. Also, how devices or machines distributed across are supposed to receive or trigger the deployment build etc etc. At last there has to be some very generic speaking format among systems like XML, html and scripts basically the last part of design will define the technology stack. Very realistic interview. Love it. Btw I also when try ing to get into doing some design it turn out to be rabbit hole but then if one has clarity in thoughts then exit and entry points will always be clear to them. Having a clear thought process is a challenge. 😔

@konstantinrebrov675 4 жыл бұрын

Thank you Clement. You are producing many high quality videos with useful information for software engineers. I always learn something interesting. Hats off to you.

@ThatCrazzyRussian 4 жыл бұрын

It is odd that Clément worked at Google because the deployment system he created has nothing to do with what is used at Google. (Contrary to what the interviewer suggested, I worked there) In fact, it would not be used anywhere. Simply put, it is very rarely acceptable to deploy to all of the machines at the same time. Any non-trivial application would have a non-zero startup time (often measured in minutes) and you don't want your service to go dark while the fleet is ramping up. You would use a gradual rollout, often with multiple stages, and have a dedicated system which manages the whole process. I was going to buy SystemsExpert, but this and a couple of smaller misses gave me a pause. The video does give me some interesting ideas on how to communicate in a system design interview, though. Start with the birds-eye view, identify the blocks and then address them one by one. Be very methodical, very pedantic, very thorough. Go into obscene amount of detail. Check back with the interviewer regularly. Clarify the constraints in the beginning, even the ones that seem self-evident. I tend to jump quickly to the trickiest areas and bottlenecks, so this is quite helpful for me.

@IhsanMujdeci Жыл бұрын

Instead of polling for the last bit we can potentially have kafka broadcast the signal out and then write to a table once all the systems have been successfully deployed and this can update the UI via SSE or something similar.

@dhruvbansal3 Жыл бұрын

Video looks pretty good and clean. Only thing missing from my perspective is that deployment do not happen all at once but in an incremental way in large organization that too different for different services, would love to see that as well incorporated into this design a bit.

@jewjonny5621 2 жыл бұрын

The last part of the build trigger P2p network can be handles by a consensus protocol like RAFT, most disributed consensus systems today (Consul, Kubernete, nosql DBs etc) are handling state in that way and sending commands to the rest of the workers to set the desired state.

@cloud15487 4 жыл бұрын

Question: Clemente says at 41:40 that this design is super horizontally scalable. How is it super well scalable when we use a SQL database? I thought SQL databases scale bad horizontally.

@mayuragrawal4352 3 жыл бұрын

SQL databases are not horizontally scalable in true sense. however, you are not really talking about volume of data here. a simple sql db can easily meet his requirements as data volume is pretty low.

@umiakalevinem 3 жыл бұрын

Probably the only realistic System Design interview on whole of youtube

@samuelchien821 4 жыл бұрын

Instead of using SQL database to backup queue and store worker servers health, couldn’t we use off shelf solution like SQS which is durable and handling queue dying. Then use Lambda to run the queue, which will handle the server dying as well as the concurrency problem.

@ikucuk 3 жыл бұрын

Tip: Don't reinvent the wheel, just use a queue service or job scheduler service. Experienced senior engineers should know when to implement something and when not to. I wouldn't hire this candidate if it was a senior engineer interview

@HallaBhalla 3 жыл бұрын

Why do interviewers ask you to write merge ? When sort() is available in all the languages. Because interview is all about knowing if the candidate knows the basics well and if they can INVENT THE WHEEL if need be. You senior engineer would probably get stuck, if GCP / AWS stop their services for some god saken reason.

@ikucuk 3 жыл бұрын

@@HallaBhalla If the question is to design a Queue service, then you implement it by using the basics. If question involves using a queue service then, you don't re implement the queue service. I said a senior engineer should know when to implement something and when not to.

@YawarMurtaza-z9k 6 ай бұрын

@30:31 - I dont think rollback is correct unless you are using try-catch block then under catch use Rollback. In your case, I think select will return nothing as status will not match to "queued" and you should be just fine.

@rbianchidev 4 жыл бұрын

A very complete and useful analysis of a complex problem like this one. Thanks a lot!

@Kitsuki42 4 жыл бұрын

Bravo pour ta vidéo Clément, tu fais un excellent travail ! C'est précieux d'avoir accès à de telles connaissances, merci de rendre cela plus accessible. Let's go!

@kobew1351 2 жыл бұрын

One thing to be aware of is, the sql statement to get a queued job and update it to running actually doesn't address concurrency issue. Two workers may end up fetching the same job. A "for update" can be added to fix it i believe.

@randeepwalia1507 2 жыл бұрын

That could have been addressed by setting the transaction isolation level or by changing the SQL update statement to include a suffix: "WHERE state = 'queued'"

@scubasteve117 3 жыл бұрын

So I felt like you had an overly complicated solve for the monitoring for the workers. The problem parameters said that the deployment needs to be completed in 30 minutes. Figure out how long it takes to move built code to servers let's say 10 minutes, and allow 20 minutes for build. Record the last time status changed, and when the rows are RUNNING and status was set > 20 minutes ago, update status to FAILED: Update StatusTable SET Status = "FAILED" where Status = "RUNNING" AND LastStatusUpdate > 20 minutes;

@deathbombs 10 ай бұрын

For the Queue design, instead of your approach using SQL Queue, I thought maybe instead use CQRS to separately scale read and writes: inmemory queue back by SQL for active tasks (fast update active tasks- remove when done) noSQL to track a history of all builds. Reading is very fast from noSql

@notangryatanyone 4 жыл бұрын

"My SQL is a bit rusty..." proceeds to write perfectly valid SQL

@andre1sk 4 жыл бұрын

SQL is not valid for the intended purpose multiple workers can grab same row so you need select for update or something like FOR UPDATE SKIP LOCKED if supported

@siobhanahbois 4 жыл бұрын

Should've said "COMMIT TRANSACTION;" but ok

@SAKronikle 3 жыл бұрын

Also SQL Server uses single quotes for strings, not double quotes.

@Sudo4Life 3 жыл бұрын

Starting a transaction isn't sufficient when using typical isolation level. The interview was trying to hint at this and that's why he asked the SQL to be written. A correct SQL should have been UPDATE ... WHERE id=? AND status="queued". When there are competing workers nodes, only one of them would have a result of 1 row updated whereas the other node would have 0 row updated. Otherwise both nodes would attempt to build and deploy over each other. Listen to your interviewer! When they ask "Do you think your current solution is good given the concurrency level you have?", you should probably stop and give it a thought

@北斗星-y4x 3 жыл бұрын

I think order all then limit 1 has bad performance . should select * from jobs where created_at = (select min(created_at) from jobs);

@rajasubasubramanian9365 4 жыл бұрын

@Clément Mihailescu : How do we ensure the goal state set right - what I mean is say commit1 we generate build B1 and would like to deploy it - followed by another engineer who do commit2 and generates build B2 and would like to deploy it. When B2 gets deployed which means that B1 changes would also get included. So when it's deployed in the following order B1 -> B2 then there will not be an issue. But what happens when the build is deployed in the order B2 -> B1? How does the system here captures it? Or am I missing something here?

@resilientprogrammer8466 3 жыл бұрын

This is a great initiative, having open programming interview. It's my first interview I witnessed live. Imagine how long I have to go. But that doesn't intimidate me either.

@shellysclan1901 4 жыл бұрын

Hi Clement! Thank you for your effort.

@clem 4 жыл бұрын

Hi Shelly's Clan! You're very welcome.

@Firewall422 5 ай бұрын

Would it make sense to split your table into a jobs table and a workers table? And then your workers table has a job mapping, heartbeat timestamp, etc. 37:38 Also… doesn’t the presented design kind of assume that the source code produces a monolith binary? That might have been a good clarifying question, I think the design would change if there were micro services.

@reedipbanerjee 2 жыл бұрын

I appreciate this a lot, it helps cleaar some ideas and helps reinforce our thoughts as well. One thing though, the number of writes for last_h seems high, a liveness/heartbeat check may work as well with the monitoring/auxillary service just subscribing to a MQ with an event set ( start/stop ). And just add the Worker information in the DB table. That ways, any build logs on the worker can also be easily verifiable and writes for Last_h would reduce a lot. However, this is just a small thought, and this system design itself is splendid enough