Running a Pacemaker cluster without STONITH is not the right way to go, and if you’re using OCFS2 in that context, you’re even forced to provide a STONITH service, as otherwise the cluster won’t start. We have some SNMP-controlled PDUs (power distribution units – the IT term for multi-socket extension leads) made by Gude, so here’s how we got them to work with Pacemaker’s stonithd.
STONITH – computerized assassins
STONITH is the commonly used term for technology that reliably shuts down (and, if possible, restarts) seemingly malfunctioning nodes of a computer cluster. Fully written out, it stands for “shoot the other node in the head” – and while computers usually have no heads, “stonithing” aims at the same goal: to make sure that a cluster node no longer participates in cluster operation once the other nodes have noticed it misbehaving or seemingly gone. This is to prevent the “ill” node from continuing on its own, possibly even thinking that the remainder of the cluster has gone away and starting to offer services that may only run once (but are actually still active on the other part of the cluster as well).
There are many ways to achieve this. Some rely on the node shutting itself down (e.g. via ssh-issued commands, or via storage-based death (sbd), which sends commands through a specially set up shared storage area), but two rather final approaches are stopping the node via remote out-of-band management (IPMI) or simply shutting off its power at the source.
We have a number of cluster nodes that are individual servers, rather than blades in a chassis with a common power supply, and we have connected them to the UPS-backed power network via PDUs from Gude – “Expert Power Control NET 8x”, designed and made in Germany. Each of the 8 PDU ports can be switched off (and on) individually, either via a web interface or via SNMP. Of course, the latter option is just perfect for controlling the PDUs from a computer program – just what we need to integrate them into our cluster environment.
stonithd – the component used by Pacemaker
We’re talking about a Linux-based cluster, run on top of SLES11 SP3 and controlled by the Pacemaker software, which comes as part of the High-Availability Extension (HAE) of SLES11.
To keep things short: “stonithd” (the component responsible for executing the final calls of death) has a modular design, allowing you to create “device drivers” to integrate whatever STONITH device you’d like. While the “fully integrated” drivers are “shared object” pieces of binary code, the “external” drivers are called as separate programs and can even be bash scripts, as long as they support the calling convention.
A device driver for Gude PDUs
It’s all about choice – we’ve chosen the Gude PDUs (and are very happy with them, even after years of use), but there was no ready-to-run driver to integrate them with stonithd. Unfortunately, there isn’t much documentation on how to write a stonithd device driver either… unless you count the source code of the “external” drivers shipped with stonithd.
To us that source code was enough to grasp the concept and to write a little shell script offering the support we need, and that’s what this article is about. And even if you do not happen to have Gude PDUs, you might find our snippet of code useful if you’re running a different, but SNMP-controlled, PDU yourself.
The design principles are easily described:
- you can have one or more PDUs that control your cluster servers (think of different power circuits or redundant UPS)
- each server can have one or more connections to one or more of these PDUs (think of redundant power supplies)
stonithd contacts the driver, giving the name of the cluster node to be shut down or started, so all we have to do is identify the ports that are connected to that server. Gude’s PDUs allow setting a label per port, which can of course be queried via SNMP, so we’ve been setting these labels to match the node names. If your PDUs don’t support this, you’ll have to find some other way to get that information – parse an external configuration file, use a database, query some daemon – whatever fits your needs. In addition, the script recognizes a number of labels marking unused ports and skips them, namely “leer”, “unbekannt”, “empty”, “unused” and “unknown”.
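To illustrate the idea, here’s a minimal sketch of such a label lookup – the object name “GUDE-MIB::portName” is a placeholder (take the real identifier from your device’s MIB), and “gudeip” and “community” are the driver parameters described further below:

# Sketch only: find the PDU and port index whose label matches the node name.
node="$2"                       # the node name, as handed over by stonithd
for pdu in $gudeip; do
  # -Oq makes snmpwalk print plain "OID value" pairs, which keeps parsing simple
  port=$(snmpwalk -v1 -c "$community" -Oq "$pdu" GUDE-MIB::portName 2>/dev/null \
         | awk -v n="$node" '$2 == n { sub(/.*\./, "", $1); print $1; exit }')
  [ -n "$port" ] && echo "node $node found on port $port of PDU $pdu"
done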
Of course you’ll want to know the SNMP OIDs and values to set in order to control the ports’ states. We’re using the SNMP variable names and thus need the Gude MIB installed. Fortunately, Gude offers a download link for the MIB file right within the web interface of the devices themselves, which is a very convenient feature.
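Switching a port is then a single snmpset call. Again a hedged sketch: “GUDE-MIB::portState” and the values 0 (off) / 1 (on) are placeholders standing in for whatever your MIB actually defines:

# Sketch only: switch port $port of PDU $pdu according to the requested action
case "$action" in
  off) value=0 ;;              # placeholder value – verify against the MIB
  on)  value=1 ;;
esac
snmpset -v1 -c "$community" "$pdu" GUDE-MIB::portState."$port" i "$value"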
In addition to the “on”, “off” and “reset” commands, the driver has to handle a number of further calls for administrative features. Three of these deserve explicit mention: there’s a call to return an XML-formatted description of the driver configuration, one is used to report back the names of all supported parameters (I’ve never seen that one called, but you’d better be prepared to report the correct variables, in case some cluster configuration checker uses that API to verify the list of driver parameters), and then there’s the status check – which doesn’t check the state of a cluster node, but the state of the STONITH device itself.
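For orientation, here’s a bare skeleton of what such an “external” driver looks like – a sketch of the calling convention as we read it from the drivers shipped with stonithd, not a drop-in replacement for the attached script:

#!/bin/bash
# Skeleton of an "external" stonithd driver: the action comes in as $1,
# the configured parameters (gudeip, community) arrive as environment variables.
case "$1" in
  gethosts)
    # print the names of all nodes this device can fence, one per line
    ;;
  on|off|reset)
    # $2 is the node name – look up its port(s) and switch them accordingly
    ;;
  status)
    # check the STONITH device itself, e.g. a quick snmpget against each PDU
    ;;
  getconfignames)
    # one supported parameter name per line
    echo "gudeip"
    echo "community"
    ;;
  getinfo-devid|getinfo-devname)
    echo "GudeEPC STONITH device"
    ;;
  getinfo-devdescr)
    echo "Fences cluster nodes by switching ports of Gude PDUs via SNMP"
    ;;
  getinfo-devurl)
    # a pointer to documentation for the device
    echo "http://www.example.com/"
    ;;
  getinfo-xml)
    # XML description of the supported parameters goes here
    ;;
  *)
    exit 1
    ;;
esac
exit 0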
I’ve attached our little script (bzip2-compressed) to this blog article, so you can download, analyze and modify it according to your needs. Even if you’re using Gude PDUs – don’t just drop the script into “/usr/lib64/stonith/plugins/external/GudeEPC” (or wherever your stonithd expects such scripts); do at least run a quick scan for anything that might hurt your installation and, of course, test the script.
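Testing is easy, by the way: since the parameters are just environment variables, you can call the driver by hand, outside of the cluster – for example (values adjusted to your setup):

# ask the driver which nodes it can fence – no cluster involvement needed
gudeip="192.168.1.98 192.168.1.99" community="verysecret" \
  /usr/lib64/stonith/plugins/external/GudeEPC gethosts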
Please note that the script keeps a log of the more important calls in /var/log/gude.log, so if you’re running a busy cluster (with lots of forced node ups and downs), you should be prepared to truncate that log from time to time. A simple logrotate rule will do; the file isn’t kept open all the time, so there’s no daemon you need to restart upon log truncation.
Currently the script uses SNMP V1 only – but as it uses the “net-snmp” command-line tools under the hood, adapting it to anything supported by your version of these tools (e.g. SNMP V3) should be fairly straightforward.
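In practice that mostly means replacing the “-v1 -c $community” part of the snmpget/snmpset/snmpwalk calls with the corresponding V3 options – roughly like this (user name, protocols and passphrases are placeholders, of course, as is the “GUDE-MIB::portState” object name from the sketch above):

# SNMP V1, as the script uses it today:
snmpset -v1 -c "$community" "$pdu" GUDE-MIB::portState."$port" i 0
# a possible SNMP V3 equivalent (authentication/encryption settings are placeholders):
snmpset -v3 -l authPriv -u stonithuser -a SHA -A "authpassphrase" -x AES -X "privpassphrase" \
  "$pdu" GUDE-MIB::portState."$port" i 0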
Cluster configuration
Defining the STONITH resource in the cluster is as simple as can be: the only required parameter is the (list of) IP address(es) of the PDU device(s) to use. If you use an SNMP community other than “private” (which I strongly suggest, for security reasons), you can specify that as well.
A sample XML snippet of the configuration is:
<clone id="stonith"> <primitive id="stonith-gude" class="stonith" type="external/GudeEPC"> <instance_attributes id="stonith-gude-instance_attributes"> <nvpair name="gudeip" value="192.168.1.98 192.168.1.99" id="stonith-gude-instance_attributes-gudeip"/> <nvpair name="community" value="verysecret" id="stonith-gude-instance_attributes-community"/> </instance_attributes> <operations> <op name="monitor" interval="15" timeout="15" start-delay="15" id="stonith-gude-monitor-15"/> </operations> </primitive> </clone>
Once you have this in your cluster information base, you’re ready to rock… or to be more precise, you’re ready to have your nodes killed (and hopefully restarted) at Pacemaker’s will.
One word of advice: for the unlikely situation that all nodes somehow shut down all other nodes at the same moment, you need to be prepared to re-enable the PDU ports manually – especially in a two-node cluster, there won’t be any node left to re-enable the other node’s power source. But in our experience, this is a rather unlikely situation, especially with a growing number of cluster nodes.