Distributed Systems 2.4: Fault tolerance

41,563 views

Martin Kleppmann

1 day ago

Comments: 13

@eyadkhayat 1 year ago

Watching this course while brushing up my system design skills. Very useful. Thank you

@rahulsoshte9299 24 days ago

Such practical things to learn tysm @martin

@bermick 7 months ago

brilliant! thanks a lot for the content Martin!

@sarathkumarmutnuru1177 3 years ago

At 6:51, how can any fault detector label a node as correct if it has actually crashed? The fault detector labels a node as correct only if it receives an acknowledgment of some sort, and a crashed node cannot acknowledge anything. The only exception I can see is a node that crashes between two of the fault detector's probes.

@khaldrogo9451 2 years ago

Well one example is to think of the time in between messages being passed. A sends a message to B, asking if B is still up. B responds by saying "yes, I'm good", and crashes right away. Now, A will get a message saying that B is up, but in reality B has actually crashed. So, until A goes around and asks B for its status again, it will never know and will have marked it as correct.

@GooseBerry390 1 year ago

@khaldrogo9451 Excellent response. Note that there is the timeout period itself as well, so even after A has asked B, it will wait for a particular length of time before it decides that a timeout has actually occurred.
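
To put rough numbers on the two replies above, here is a minimal sketch of the window during which A still labels B as correct after B has crashed. The `Node` class, the function name, and all timing values are hypothetical, and the model assumes instant message delivery, which a real network does not give you.

```python
from dataclasses import dataclass


@dataclass
class Node:
    crash_time: float  # hypothetical: the moment this node crashes

    def replies_to_ping(self, sent_at: float) -> bool:
        # A node only answers pings sent before it crashed.
        return sent_at < self.crash_time


def detection_time(node: Node, ping_interval: float, timeout: float) -> float:
    """Time at which the monitor first labels `node` as faulty."""
    t = 0.0
    while True:
        if not node.replies_to_ping(t):
            # The monitor still waits a full timeout before giving up.
            return t + timeout
        t += ping_interval


b = Node(crash_time=1.5)  # B answers the ping at t=0, then crashes
detected = detection_time(b, ping_interval=5.0, timeout=2.0)
stale_window = detected - b.crash_time  # B is wrongly labelled correct this long
print(detected, stale_window)
```

With these made-up values the next ping goes out at t=5 and times out at t=7, so B stays marked as correct for 5.5 time units after it crashed.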

@mantistoboggan537 4 years ago

So wait, how does the eventual failure detection get implemented? Don't we still fundamentally have the same problem if we have asynchronous timings? How would I know that my node has failed, as opposed to just going through a huge garbage collection pause, or thrashing, or anything else?

@AZAssazin 3 years ago

I think the idea is that *eventually* may mean a very long time, e.g. if you don't get a response in a few weeks, the node crashed. Alternatively, you could probably enforce (maybe via an SLA) what a failed node will look like, especially if the service you're calling is another service your company owns. "If we don't respond within 1 minute, then even if we were just stalled due to garbage collection, we'll discard the message and consider the node faulty."

@yogeshedekar6078 3 years ago

You can simply send a heartbeat signal to every node, usually called a liveness probe in cloud terminology. If the node does not reply to the heartbeat, say, 3 times in a row, you treat the node as failed and can trigger an automatic restart. If the restart does not fix the issue either, you take that node out of rotation and put another node in its place.
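
A minimal sketch of the "3 missed heartbeats in a row" rule described above. The class and method names are made up for illustration and are not taken from any particular cloud platform; the probe results are passed in as booleans so the logic can be followed without a network.

```python
class HeartbeatMonitor:
    """Suspects a node after `max_missed` consecutive unanswered heartbeats."""

    def __init__(self, max_missed: int = 3) -> None:
        self.max_missed = max_missed
        self.missed = 0  # consecutive heartbeats without a reply

    def record(self, replied: bool) -> bool:
        """Record one probe result; return True if the node is now suspected."""
        if replied:
            self.missed = 0  # any reply resets the streak
        else:
            self.missed += 1
        return self.missed >= self.max_missed


monitor = HeartbeatMonitor(max_missed=3)
probes = [True, False, False, True, False, False, False]
results = [monitor.record(r) for r in probes]
print(results)
```

Only the last probe trips the detector, because the reply in the middle resets the count. Keep in mind that a suspected node may just be slow, so whatever restarts or replaces it has to tolerate false positives.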

@allyourcode 3 years ago

I think the answer is in the title of the slide: a PARTIALLY SYNCHRONOUS model is being considered, not async.

@kleppmann 3 years ago

That's exactly the point: if you don't get a reply from some node within some timeout, it might be that the node crashed, but it could also be that the node or the network is just temporarily being slow. And we can't definitively distinguish between crash and slowness. However, if slowness is only temporary, then eventually the node will start responding again if it's not crashed. The problem is that in an asynchronous or partially synchronous system, we don't know how long that might take.
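
One common way to turn this into an eventual failure detector is to suspect a node on timeout, but to withdraw the suspicion and lengthen the timeout whenever the node turns out to have been merely slow. The sketch below illustrates that idea only; it is not taken from the lecture, and the class name, the doubling policy, and the timing values are all assumptions.

```python
class AdaptiveDetector:
    """Timeout-based detector that backs off after each false suspicion."""

    def __init__(self, initial_timeout: float = 1.0) -> None:
        self.timeout = initial_timeout
        self.last_heard = 0.0
        self.suspected = False

    def on_reply(self, now: float) -> None:
        if self.suspected:
            # The node was slow, not crashed: undo the label and be more patient.
            self.suspected = False
            self.timeout *= 2  # assumed policy: double on every false suspicion
        self.last_heard = now

    def check(self, now: float) -> bool:
        """Return True if the node is currently suspected of having crashed."""
        if now - self.last_heard > self.timeout:
            self.suspected = True
        return self.suspected


d = AdaptiveDetector(initial_timeout=1.0)
d.on_reply(0.0)
first = d.check(1.5)    # silent for 1.5 > 1.0, so suspected
d.on_reply(2.0)         # late reply arrives: suspicion withdrawn, timeout becomes 2.0
second = d.check(3.5)   # silent for 1.5 <= 2.0, so not suspected
third = d.check(10.0)   # silent for 8.0 > 2.0, so suspected again
print(first, second, third, d.timeout)
```

If delays eventually stay below some bound, the timeout eventually grows past it and the false suspicions stop, while a node that really crashed never replies and stays suspected. The detector still cannot tell in advance how long that will take.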