Fault tolerance is the ability of a system to cope with
and recover from faults. The K2 Fault Tolerant
Architecture covers the following faults:
- Fatal Server Process Fault: A server process fault
is usually caused by a serious programming error, which
causes the Operating System to terminate the process.
- Server Hardware Fault: A server hardware fault causes
the complete machine to shut down unexpectedly. All
processes running on it are lost as well.
The following points are taken into
account by the implementation of the K2 Fault Tolerance
system:
- Detection of the fault and notification
of the instance that is responsible for the recovery.
- Recovery from the fault itself.
- Fault tolerance of the fault detection
and recovery system.
The K2 Component Server
supports detection of the above-mentioned faults and implements
support for recovery. However, a Component programmer must still
use the offered functionality in order to implement a fully
recoverable application. That means recovery from a severe
fault must also be considered at the application design and
implementation level.
The Server process fault handling assumes
that the underlying operating system provides protection domains
between different server processes, so one server crash doesn't
affect other servers running on the same node. This is the
case for the platforms K2 is running on. The detection itself
is done by the K2daemon, which uses Operating System features
to detect a process termination.
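As an illustration of detection through Operating System
features, the following is a minimal sketch in Python and not
the K2daemon implementation. It starts a server as a child
process and blocks until the Operating System reports its
termination. The names classify_exit and watch are hypothetical,
and the negative return code convention for signals applies to
POSIX platforms.

```python
import subprocess
import sys


def classify_exit(returncode):
    """Map a child's exit status to a coarse fault category."""
    if returncode == 0:
        return "clean-exit"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N,
        # e.g. -11 for a segmentation fault.
        return "killed-by-signal-%d" % -returncode
    return "error-exit-%d" % returncode


def watch(cmd):
    """Start a server process and wait until the OS reports that it ended.

    Hypothetical helper: a real daemon would supervise many processes
    and notify the instance responsible for recovery.
    """
    proc = subprocess.Popen(cmd)
    returncode = proc.wait()  # blocks until the process terminates
    return classify_exit(returncode)
```

A supervisor could call watch for each server it starts and
trigger a restart whenever the result is not "clean-exit".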
The fault recovery mechanism
makes use of basic CORBA® functionality for redirecting
a client to a restarted component after a crash. However,
current CORBA implementations each provide a proprietary
instance called the Implementation Repository that works only
with servers of that particular implementation. The K2 Fault
Tolerance System makes use of the standardised stub behaviour
but is independent of any particular ORB implementation. Thus
a single instance of the K2 Fault Tolerance System per node
can handle multiple servers that are implemented using
different ORB implementations.
The recovery of a Server Process depends
on what kind of Components the process was implementing and
how this server process was configured to behave during a
fatal fault. The following cases are considered:
- A server process that
contains Entity Components and Session Components crashes.
Entity Components are expected to manage their state entirely
with a database or another kind of persistent storage. In
case of a server error the K2daemon can restart the appropriate
Server either immediately or when the next request arrives.
The Component Implementation can restore its state from
its persistent storage and continue processing events.
The server process restart is completely transparent to
the client, as the client stub will contact the K2daemon Fault
Tolerance system.
- A server process that
implements Service Components crashes. The server process
can be restarted immediately by the K2daemon. However,
the object references used by the client are not valid
across a server restart for Service Components, which results
in a well-defined error condition that is reported to the
client. If the client code doesn't use a MultiProxy stub,
this error is visible to the client, so the client can
acquire a new valid reference and retry the request. If
the client is using a MultiProxy stub in order to
take advantage of load balancing, the MultiProxy will
delete the reference from its collection of possible object
references and issue a retry transparently.
- The main entity of
the fault detection and server restart is the Fault Tolerance
system of the K2daemon. The K2daemon manages its state
in a persistent manner. This allows the K2daemon to be
monitored and restarted by the Operating System using
a script.
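The transparent retry described for the MultiProxy stub can be
sketched as follows. This is a simplified illustration in
Python, not the K2 MultiProxy itself: the class and exception
names are hypothetical, object references are modelled as plain
callables, and load balancing is reduced to always trying the
first remaining reference.

```python
class StaleReference(Exception):
    """Stand-in for the well-defined error raised for an invalid reference."""


class NoReferencesLeft(Exception):
    """Raised when every known reference has turned out to be stale."""


class MultiProxySketch:
    """Holds several object references and retries on stale ones."""

    def __init__(self, references):
        self._references = list(references)

    def remaining(self):
        """Number of references still considered valid."""
        return len(self._references)

    def invoke(self, request):
        # Try references until one answers. A stale reference is deleted
        # from the collection and the request is retried with the next one.
        while self._references:
            ref = self._references[0]
            try:
                return ref(request)
            except StaleReference:
                self._references.remove(ref)
        raise NoReferencesLeft("no valid object reference left")
```

A client without such a stub would instead catch the error
itself, acquire a new reference, and repeat the request.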

The K2daemon also detects the
Server Hardware fault. Its federation is
spread across the complete K2 Component Server Cluster. The
various instances of the Trader-Arbitrator are connected by
the Reliable Multicast Protocol (RMP). RMP uses periodic
ping messages as a protocol-internal reliability feature.
These ping messages are also used to trigger a watchdog for
every node. If the watchdog expires, the corresponding node is
declared dead and mapped out of the cluster. The main action
for mapping a node out of the cluster is masking its offers
as dead in the remaining Traders in the system, so they are
not returned in any query sent to the Trader until the node
revives and the offers are mapped in again.
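The watchdog behaviour described above can be illustrated with
a small sketch. It is written in Python under stated
assumptions and does not reflect the RMP or Trader code: the
class name, the timeout value, and the representation of offers
as (node, service) pairs are all hypothetical, and time is
passed in explicitly to keep the example deterministic.

```python
class ClusterWatchdog:
    """Tracks the last ping per node and masks offers of dead nodes."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ping = {}   # node name -> time of the last ping seen
        self.dead = set()     # nodes currently mapped out of the cluster

    def on_ping(self, node, now):
        # Every ping re-arms the node's watchdog. A node that was
        # declared dead is mapped in again as soon as it pings.
        self.last_ping[node] = now
        self.dead.discard(node)

    def expire(self, now):
        # Declare every node dead whose watchdog has run out.
        for node, seen in self.last_ping.items():
            if now - seen > self.timeout:
                self.dead.add(node)

    def query(self, offers, now):
        # Offers of dead nodes are masked and never returned.
        self.expire(now)
        return [offer for offer in offers if offer[0] not in self.dead]
```

With a timeout of three time units, a node that has been silent
for four units disappears from query results and reappears
after its next ping.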
The client is informed about
the failed session through an appropriate exception.
The fault detection and
recovery system itself has no single point of failure,
as it is implemented in a distributed and decentralised way.
A revived node is detected
by the RMP and mapped into the cluster dynamically.
CORBA® is a Registered Trade Mark of Object Management
Group, Inc. in the USA and other countries.