This is one of the most complex areas in distributed software. How do you define "dead", and what do you do when a node begins to behave strangely? How do you detect brown-outs and network partitions? What do you do when these things hit? The list of questions goes on and on.
Traditionally, the simplest solutions work best. Set a timeout on communication between the nodes writing the same key segment. When the timeout expires, the node that has access to a quorum device lives on, while the other one commits suicide (or goes into read-only mode). Of course, you need a good quorum device; a lock service based on Paxos is usually a good provider of that sort of thing. Moving to read-only mode (and coming back from the dead remembering the previous state of the data on the node) sounds like the better option. Unfortunately, people often forget that the next step is an update from the active node. That update puts extra load on the active node and can (and in many implementations does) push it over if it was already heavily loaded.
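To make the timeout-plus-quorum rule concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `QuorumDevice`, `Node`, `try_acquire` and `check_peer` are hypothetical, and the in-process lock is only a toy stand-in for a real lock service (which would be a separate, replicated system with leases and expiry).

```python
import time
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    READ_ONLY = "read-only"


class QuorumDevice:
    """Toy stand-in for a lock service (hypothetical, in-process only)."""

    def __init__(self):
        self._holder = None

    def try_acquire(self, node_id):
        # The first node to ask wins; the holder can re-acquire its own lock.
        if self._holder is None or self._holder == node_id:
            self._holder = node_id
            return True
        return False


class Node:
    def __init__(self, node_id, quorum, timeout_s, clock=time.monotonic):
        self.node_id = node_id
        self.quorum = quorum
        self.timeout_s = timeout_s
        self.clock = clock
        self.role = Role.ACTIVE
        self.last_heard = clock()

    def on_peer_message(self):
        # Any message from the peer resets the timeout.
        self.last_heard = self.clock()

    def check_peer(self):
        # Peer heard from recently: nothing to decide.
        if self.clock() - self.last_heard < self.timeout_s:
            return self.role
        # Timeout expired: whoever reaches the quorum device lives on,
        # the other node steps down to read-only.
        if self.quorum.try_acquire(self.node_id):
            self.role = Role.ACTIVE
        else:
            self.role = Role.READ_ONLY
        return self.role
```

Each node would call `check_peer()` periodically. The sketch deliberately leaves out the hard part mentioned above: bringing the read-only node up to date without overloading the active one.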
A node that never comes back from the dead (e.g. one that gets its drive formatted on returning to service) is simpler, but you might lose data that way AND you still need to solve the problem of adding a node to the affected key segment. I personally favor a resurrection attempt because network partitioning happens much more often than a true death.
