It’s been a while, and I’ve been thinking about new topics for our blog… it’s not that we lack things to lament about, but those are better saved for a cool beer and some chit-chat than used to annoy you.
Nevertheless, a subject resurfaced in the last few days that had me wondering and doubting my skills for quite some time: running a Linux Pacemaker cluster to provide high availability for our Xen VMs, I’d like to combine two basic features:
1. create resource dependencies between VMs, and
2. use “live migration” of VMs.
Feature 1 is really helpful if you split services across multiple VMs, like running MySQL in one VM and the applications using that database in separate VMs. We have a complete infrastructure of VMs running in our cluster, with user machines, development servers, compiler platforms, multi-machine test beds… not to forget the virtual base infrastructure like databases, DNS, DHCP, LDAP, proxies, code repositories, issue tracking and so on.
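To make the first feature a bit more tangible: in crm shell syntax, such a dependency could look roughly like the line below. The resource names vm_mysql and vm_app are placeholders for the corresponding VM resources, not taken from our actual configuration.

```
# Start the MySQL VM before any VM whose applications need the database;
# "inf:" makes this a mandatory ordering (equivalent to kind="Mandatory").
crm configure order vm_app_after_vm_mysql inf: vm_mysql vm_app
```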
Feature 2 is the way to go if you want to send a Xen server (the Dom0) into maintenance mode without interrupting the services. This is “standard practice” and supported by Xen once you’ve enabled it, and Pacemaker claims to support it, too. (It’s worth noting that in our experience, live migration of Xen DomUs is not for the faint of heart: for one, the actual migration may take up to a couple of minutes. And even when it completes, not all services inside the VM may have survived the move; MySQL is one of the candidates that seems particularly picky. But that’s a separate topic for a separate entry.)
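On the Xen side, “enabling it” essentially means allowing relocation in xend’s configuration. A minimal sketch of the relevant entries in /etc/xen/xend-config.sxp; the host name patterns are assumptions for this example and need to match your own setup:

```
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address '')
# space-separated list of regular expressions for hosts that may
# send migrations to this node
(xend-relocation-hosts-allow '^localhost$ ^xen1$ ^xen2$')
```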
Since all this is said to be supported, why am I taking up your time stating the well-known? Because the combination of these two features doesn’t work the way you might expect.
Take the following scenario:
- two Xen servers (xen1 and xen2), both running Pacemaker and joined in a cluster
- two VMs (domU1 and domU2) up & running, i.e. domU1 on xen1 and domU2 on xen2
- both VMs defined as cluster resources with live migration enabled
- domU1 defined as required by domU2 (technically speaking, an order constraint like <rsc_order first="domU1" id="someid" kind="Mandatory" symmetrical="true" then="domU2"/>); a configuration sketch in crm shell syntax follows right after this list
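For reference, here’s a sketch of how the two VM resources and the constraint might look in crm shell configure syntax. Config file paths, operation timeouts and IDs are made up for the example, not copied from our cluster:

```
primitive domU1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/domU1.cfg" \
        op monitor interval="30s" timeout="60s" \
        meta allow-migrate="true"
primitive domU2 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/domU2.cfg" \
        op monitor interval="30s" timeout="60s" \
        meta allow-migrate="true"
# crm shell equivalent of the rsc_order constraint shown above
order domU2_after_domU1 inf: domU1 domU2
```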
Now let’s assume we want to update something on our cluster nodes and, for the sake of demonstration, want to do things manually. We’ll start with xen2 by migrating domU2 to server xen1, take xen2 offline and do our thing. Once xen2 is back online, we migrate domU2 back to xen2, then migrate domU1 to xen2 and take xen1 offline. The last step would be to migrate domU1 back to xen1. Or so one might think.
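Expressed as crm shell commands, the plan looks roughly like this (a sketch using the node and resource names from above; with allow-migrate="true" set, “migrate” normally results in a live migration rather than a stop/start):

```
crm resource migrate domU2 xen1   # live-migrate domU2 out of the way
crm node standby xen2             # take xen2 out of service and update it
crm node online xen2              # bring xen2 back into the cluster
crm resource migrate domU2 xen2   # restore the initial placement
crm resource migrate domU1 xen2   # now clear xen1 the same way
crm node standby xen1
crm node online xen1
crm resource migrate domU1 xen1   # and move domU1 back home
crm resource unmigrate domU1      # drop the location constraints that
crm resource unmigrate domU2      # "migrate" leaves behind
```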
Live migrating domU2 is no problem at all; everything works as expected, so we can update xen2, bring it back online and restore the initial situation. But when migrating domU1, things get awkward (which at first made me believe I had goofed somehow when setting up xend or configuring the cluster): instead of just invoking xend to migrate the DomU to the other node, Pacemaker will first stop domU2! Only once it’s stopped is domU1 live migrated, and then domU2 is started again.
This is completely against the intended design of our cluster… after all, the whole reason for live migration is the uninterrupted operation of all resources. So why this counter-productive behaviour? At first I believed this to be a bug in Pacemaker and hoped for fixes in upcoming releases, but then I found a thread from the beginning of this year. In its conclusion, David Vossel points out that a more or less complete re-engineering of the corresponding part of Pacemaker is required to fix this… something that surely isn’t going to happen very soon.
So for those of you in the same pit as we are: there’s an open enhancement request in the ClusterLabs Bugzilla where you can follow the progress on this subject. Or, since no progress has been recorded since the initial analysis in April this year, you might rather want to voice your opinion there…