Fault Tolerance is the ability of a system to cope with and to recover from faults. The K2 Fault Tolerant Architecture covers the following kinds of faults:

Fatal Server Process Fault: A server process fault is usually caused by a serious programming error, which causes the Operating System to terminate the process.

Server Hardware Fault: A server hardware fault causes the complete machine to shut down unexpectedly. All running processes are lost as well.


The implementation of the K2 Fault Tolerance system takes the following points into account:
  • Detection of the fault and notification of the instance that is responsible for the recovery.

  • Recovery from the fault itself.

  • Fault tolerance of the fault detection and recovery system.

The K2 Component Server supports detection of the above-mentioned faults and implements support for recovery. However, a Component programmer still has to use the offered functionality in order to implement a completely recoverable application. That means recovery from a severe fault must also be considered at the application design and implementation level.

Fatal Server Process Fault

The server process fault handling assumes that the underlying operating system provides protection domains between different server processes, so that one server crash does not affect other servers running on the same node. This is the case for the platforms K2 runs on. The detection itself is done by the K2daemon, which uses Operating System features to detect a process termination.
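
As an illustration of the detection idea, the following sketch shows how a supervising daemon on a POSIX platform can learn that a child server process was terminated by the Operating System. The binary name k2server is a hypothetical placeholder; the actual K2daemon implementation is not published and will differ in detail.

    // Hedged sketch: detect abnormal termination of a supervised server process
    // on a POSIX system. "k2server" is a hypothetical binary name.
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        pid_t server = fork();                         // launch the server process
        if (server == 0) {
            execlp("k2server", "k2server", (char*)nullptr);
            _exit(127);                                // exec failed
        }

        int status = 0;
        waitpid(server, &status, 0);                   // block until the child terminates

        if (WIFSIGNALED(status)) {
            // e.g. SIGSEGV caused by a serious programming error: a fatal process fault
            std::printf("server killed by signal %d - scheduling restart\n",
                        WTERMSIG(status));
            // a real daemon would now restart the server immediately or on demand
        } else {
            std::printf("server exited normally with code %d\n", WEXITSTATUS(status));
        }
        return 0;
    }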

The fault recovery mechanism makes use of basic CORBA® functionality for redirecting a client to a restarted component after a crash. However, all current CORBA implementations provide a proprietary component, called the Implementation Repository, that works only with servers of that particular implementation. The K2 Fault Tolerant System relies on the standardised stub behaviour but is independent of a particular ORB implementation. A single instance of the Fault Tolerance system per node can therefore handle multiple servers that are implemented using different ORB implementations.
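
From the client's point of view no special code is required for the redirection; the standardised stub behaviour handles it inside the ORB. The sketch below shows a plain invocation plus a defensive retry on the standard CORBA::TRANSIENT exception for the short window in which a server is being restarted. The Account interface, its deposit operation, the stub header name and the IOR file are assumptions made for illustration only, not part of the K2 API.

    // Hedged client-side sketch: the stub transparently follows the standard
    // redirection to the restarted server; a retry on CORBA::TRANSIENT covers
    // the restart window. Interface and file names are hypothetical.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include "AccountC.h"                      // IDL-generated stub header (name depends on the ORB)

    int main(int argc, char* argv[])
    {
        CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);

        std::ifstream in("account.ior");       // assumed location of the object reference
        std::string ior;
        in >> ior;

        CORBA::Object_var obj = orb->string_to_object(ior.c_str());
        Account_var account = Account::_narrow(obj.in());

        for (int attempt = 0; attempt < 3; ++attempt) {
            try {
                account->deposit(100);         // assumed IDL operation
                break;                         // success, possibly after a transparent redirect
            } catch (const CORBA::TRANSIENT&) {
                // server restart in progress: retry the same reference
                std::cerr << "transient failure, retrying...\n";
            }
        }
        orb->destroy();
        return 0;
    }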

The recovery of a server process depends on the kind of Components the process implements and on how the server process is configured to behave after a fatal fault. The following cases are considered:

  1. A server process that contains Entity Components and Session Components crashes. Entity Components are expected to manage their state entirely in a database or another kind of persistent storage. In case of a server error the K2daemon can restart the appropriate server either immediately or when the next request arrives. The Component implementation can restore its state from its persistent storage and continue processing events. The server process restart is completely transparent to the client, as the client stub will contact the K2daemon Fault Tolerance system.

  2. A server process that implements Service Components crashes. The server process can be restarted immediately by the K2daemon. However, for Service Components the object references used by the client are not valid across a server restart, which results in a well-defined error condition that is reported to the client. If the client code does not use a MultiProxy stub, this error is visible to the client, so the client can acquire a new valid reference and retry the request (see the sketch after this list). If the client is using a MultiProxy stub in order to take advantage of the load balancing, the MultiProxy deletes the reference from its collection of possible object references and issues a retry transparently.

  3. The main entity of the fault detection and server restart is the Fault Tolerance system of the K2daemon. The K2daemon manages its state in a persistent manner. This allows the K2daemon itself to be monitored and restarted by the Operating System using a script.
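
The following sketch illustrates the client-side handling described in case 2, assuming that the well-defined error condition surfaces as the standard CORBA system exception CORBA::OBJECT_NOT_EXIST. The PrintService interface, its print operation and the corbaloc address are hypothetical placeholders; a real K2 client would obtain the fresh reference from the Trader.

    // Hedged sketch: re-acquire a Service Component reference after a restart.
    // Names and the exact exception type are assumptions, not the K2 API.
    #include <iostream>
    #include "PrintServiceC.h"                  // IDL-generated stub header (assumed name)

    // Hypothetical re-lookup; a real client would query the K2 Trader instead.
    PrintService_ptr lookup_service(CORBA::ORB_ptr orb)
    {
        CORBA::Object_var obj =
            orb->string_to_object("corbaloc::somehost:2809/PrintService");
        return PrintService::_narrow(obj.in());
    }

    void print_with_recovery(CORBA::ORB_ptr orb, PrintService_var& svc)
    {
        try {
            svc->print("hello");                // assumed IDL operation
        } catch (const CORBA::OBJECT_NOT_EXIST&) {
            // Service Component references do not survive a server restart:
            // acquire a new valid reference and retry the request once.
            svc = lookup_service(orb);
            svc->print("hello");
        }
    }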

Server Hardware Fault

A Server Hardware fault is detected by the K2daemon, whose federation is spread across the complete K2 Component Server Cluster. The various instances of the Trader-Arbitrator are connected by the Reliable Multicast Protocol (RMP). RMP uses periodic ping messages as a protocol-internal reliability feature. These ping messages are also used to trigger a watchdog for every node. If the watchdog expires, the corresponding node is declared dead and mapped out of the cluster. The main action for mapping a node out of the cluster is marking its offers as dead in the remaining Traders in the system, so that they are not returned in any query sent to the Trader until the node revives and the offers are mapped in again.
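
The per-node watchdog idea can be sketched as follows. RMP and the Trader interfaces are not public, so the class and member names here are illustrative only, assuming a ping handler that resets the watchdog and a periodic check that declares a node dead when its watchdog expires.

    // Hedged sketch of a per-node watchdog driven by RMP ping messages.
    #include <chrono>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    using Clock = std::chrono::steady_clock;

    struct NodeState {
        Clock::time_point last_ping;            // updated whenever an RMP ping arrives
        bool alive = true;
    };

    class NodeWatchdog {
    public:
        explicit NodeWatchdog(std::chrono::seconds timeout) : timeout_(timeout) {}

        void on_ping(const std::string& node) { // called for every RMP ping
            NodeState& s = nodes_[node];
            s.last_ping = Clock::now();
            if (!s.alive) {
                s.alive = true;                 // revived node: map its offers back in
                std::cout << node << " revived\n";
            }
        }

        void check() {                          // called periodically
            const Clock::time_point now = Clock::now();
            for (auto& entry : nodes_) {
                NodeState& s = entry.second;
                if (s.alive && now - s.last_ping > timeout_) {
                    s.alive = false;            // watchdog expired: declare the node dead
                    std::cout << entry.first << " declared dead, offers marked as dead\n";
                }
            }
        }

    private:
        std::chrono::seconds timeout_;
        std::unordered_map<std::string, NodeState> nodes_;
    };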

The client is informed about the failed session by an appropriate exception.

The fault detection and recovery mechanism itself does not introduce a single point of failure, as it is implemented in a distributed and decentralised way.

A revived node is detected by the RMP and mapped into the cluster dynamically.

