Watching this course while brushing up my system design skills. Very useful. Thank you
@rahulsoshte929924 days ago
Such practical things to learn tysm @martin
@bermick 7 months ago
brilliant! thanks a lot for the content Martin!
@sarathkumarmutnuru1177 3 years ago
At 6:51, how can any fault detector label a node as correct if it has actually crashed? The fault detector labels a node correct only if it receives an acknowledgment of some sort, and a crashed node cannot acknowledge anything. The only case I can see is a node that crashes between two of the fault detector's probes.
@khaldrogo9451 2 years ago
Well one example is to think of the time in between messages being passed. A sends a message to B, asking if B is still up. B responds by saying "yes, I'm good", and crashes right away. Now, A will get a message saying that B is up, but in reality B has actually crashed. So, until A goes around and asks B for its status again, it will never know and will have marked it as correct.
@GooseBerry390 1 year ago
@khaldrogo9451 Excellent response. Note that there is the timeout period itself as well, so even after A has asked B, it will wait for a particular length of time until it decides that a timeout has actually occurred.
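
To put rough numbers on that delay, here is a minimal sketch in Python. It assumes a simplified model that is not from the lecture: A probes B at fixed intervals starting at time 0, the network delay is zero, and B answers a probe only if it has not yet crashed when the probe arrives. The function name and parameters are hypothetical.

```python
def detection_time(crash_time, probe_interval, timeout):
    """Time at which A declares B failed, in the simplified model above.

    All arguments are non-negative integers in the same time unit.
    Probes go out at 0, probe_interval, 2 * probe_interval, ...
    """
    # Index of the first probe sent at or after the crash (integer ceiling).
    first_unanswered = (crash_time + probe_interval - 1) // probe_interval
    # A only gives up on that probe once the timeout has elapsed.
    return first_unanswered * probe_interval + timeout


crash = 11
detected = detection_time(crash, probe_interval=5, timeout=2)
# Between the crash and the detection, A still labels B as correct.
stale_window = detected - crash
```

With a crash at time 11, probes every 5 units, and a timeout of 2, the first unanswered probe goes out at 15 and A gives up at 17, so B is wrongly labelled correct for 6 time units.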
@mantistoboggan537 4 years ago
So wait, how does the eventual failure detection get implemented? Don't we still fundamentally have the same problem if we have asynchronous timings? How would I know that my node has failed, as opposed to just going through a huge garbage collection protocol, or thrashing, or anything else?
@AZAssazin 3 years ago
I think the idea is that *eventually* may mean a very long time, e.g. if you don't get a response in a few weeks, the node crashed. Alternatively, you could probably enforce (maybe via an SLA) what a failed node will look like, especially if the service you're calling is another service your company owns. "If we don't respond within 1 minute, then even if we were just stalled due to garbage collection, we'll discard the message and consider the node faulty."
@yogeshedekar6078 3 years ago
You can simply send a heartbeat signal to every node, usually called a liveness probe in cloud terminology. If the node does not reply to the heartbeat, say, 3 times in a row, you consider the node failed and can trigger an automatic restart. If the restart does not fix the issue either, you take that node out of rotation and put another node in its place.
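
A minimal sketch of that counting rule in Python, with hypothetical names (`HeartbeatMonitor`, `record`) and the threshold of 3 taken from the comment above. It only tracks consecutive misses; sending the heartbeats, restarting, and replacing nodes are left out. Note that a node flagged this way is only suspected: it may just be slow.

```python
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats per node (illustrative only)."""

    def __init__(self, max_misses=3):
        self.max_misses = max_misses
        self.misses = {}  # node name -> consecutive missed heartbeats

    def record(self, node, replied):
        """Record one heartbeat round; return True if the node is now suspected."""
        if replied:
            # Any reply resets the streak.
            self.misses[node] = 0
        else:
            self.misses[node] = self.misses.get(node, 0) + 1
        return self.misses[node] >= self.max_misses


monitor = HeartbeatMonitor(max_misses=3)
results = [monitor.record("node-b", replied) for replied in (False, False, True, False, False, False)]
```

Here `results` only becomes true on the last round, because the reply in the third round resets the streak.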
@allyourcode 3 years ago
I think the answer is in the title of the slide: a PARTIALLY SYNCHRONOUS model is being considered, not async.
@kleppmann 3 years ago
That's exactly the point: if you don't get a reply from some node within some timeout, it might be that the node crashed, but it could also be that the node or the network is just temporarily being slow. And we can't definitively distinguish between crash and slowness. However, if slowness is only temporary, then eventually the node will start responding again if it's not crashed. The problem is that in an asynchronous or partially synchronous system, we don't know how long that might take.
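
One common way to cope with this, sketched below in Python, is to treat a timeout as a suspicion rather than a verdict: the detector withdraws the suspicion if the node replies later, and it lengthens its timeout each time it turns out to have been wrong. The doubling rule, the class name, and the method names are assumptions made for illustration and are not taken from the lecture.

```python
class EventualFailureDetector:
    """Timeout-based detector for a single node that can revise its suspicion."""

    def __init__(self, initial_timeout=1.0):
        self.timeout = initial_timeout
        self.last_heard = 0.0
        self.suspected = False

    def on_reply(self, now):
        """Call when a reply from the node arrives at time `now`."""
        if self.suspected:
            # The node was only slow, not crashed: withdraw the suspicion
            # and be more patient from now on (assumed doubling policy).
            self.suspected = False
            self.timeout *= 2
        self.last_heard = now

    def check(self, now):
        """Return True if the node is currently suspected of having crashed."""
        if now - self.last_heard > self.timeout:
            self.suspected = True
        return self.suspected


detector = EventualFailureDetector(initial_timeout=1.0)
detector.on_reply(0.0)
suspected_early = detector.check(1.5)      # silent for 1.5 > 1.0
detector.on_reply(2.0)                     # late reply: timeout becomes 2.0
suspected_after_reply = detector.check(3.5)  # silent for 1.5, within 2.0
```

A node that really has crashed never replies again, so it stays suspected; a node that is merely slow is wrongly suspected fewer and fewer times as the timeout grows, provided the slowness is temporary.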