initrd woes with SLES11 updates

As the holiday season is a time of few active users, it’s considered a good time for systems maintenance. Updating a cluster of SLES servers is no exception to this, and so we took a few moments to bring a set of servers to the latest level of SLES 11 SP3.

Interestingly, unlike with earlier updates, the command-line update (“zypper up”) reported problems with “mkinitrd”:

Installation of drbd-kmp-default-8.4.4_3.0.101_0.8-0.20.1 failed:
 (with --nodeps --force) Error: Subprocess failed. Error: RPM failed:
 Kernel image:   /boot/vmlinuz-3.0.101-0.8-default
 Initrd image:   /boot/initrd-3.0.101-0.8-default
 KMS drivers:     radeon
 Root device:    /dev/mapper/system-root (mounted on / as ext3)
 Resume device:  /dev/md2
 Device disk!by-id!md-uuid-e76f9f53:b7553f91:5ed64d64:e65d3ab5 not found in sysfs
 Script /lib/mkinitrd/setup/72-block.sh failed!
 There was an error generating the initrd (1)
 error: %post(drbd-kmp-default-8.4.4_3.0.101_0.8-0.20.1.x86_64) scriptlet failed, exit status 1
 Abort, retry, ignore? [a/r/i] (a):

This wasn’t the only place where “mkinitrd” was called, but interestingly it was the only one where it caused the update process to fail – go figure… and always check your update logs!

Running “mkinitrd” manually reported the same difficulties, which was not much of a surprise:

server01:/boot/grub # mkinitrd
 Kernel image:   /boot/vmlinuz-3.0.101-0.8-default
 Initrd image:   /boot/initrd-3.0.101-0.8-default
 KMS drivers:     radeon
 Root device:    /dev/mapper/system-root (mounted on / as ext3)
 Resume device:  /dev/md2
 Device disk!by-id!md-uuid-e76f9f53:b7553f91:5ed64d64:e65d3ab5 not found in sysfs
 Script /lib/mkinitrd/setup/72-block.sh failed!
 There was an error generating the initrd (1)
 server01:/boot/grub #

But what’s all this about? The disruption is obviously caused by the “missing” device (“Device disk!by-id!md-uuid-e76f9f53:b7553f91:5ed64d64:e65d3ab5 not found in sysfs”), so here’s some background on the setup:

The servers all come with a local disk plus a Fibre Channel connection to a SAN server. In order to provide optimum availability, the local disk’s partitions are mirrored to a similarly partitioned SAN LUN via Linux MD. There are three partitions: one for /boot, one as an LVM physical volume, and lastly a swap partition for the sake of completeness. These partitions (from both the local disk and the LUN) are used to create /dev/md0, /dev/md1 and /dev/md2.
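
For illustration, such a layout might be assembled roughly as follows (the disk names /dev/sda for the local disk and /dev/sdc for the SAN LUN are assumptions; only the volume group name “system” is taken from the root device shown in the output above):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdc1   # /boot
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdc2   # LVM physical volume
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdc3   # swap / resume device
pvcreate /dev/md1
vgcreate system /dev/md1   # "system" as in /dev/mapper/system-root
mkswap /dev/md2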

The device in question is /dev/md1, as we could confirm by looking at /dev/disk/by-id:

server0103:~ # ls -l /dev/disk/by-id/md-uuid-*3ab5
lrwxrwxrwx 1 root root 9 Jan  3 16:43 /dev/disk/by-id/md-uuid-e76f9f53:b7553f91:5ed64d64:e65d3ab5 -> ../../md1
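
The PV-to-VG mapping can be double-checked with pvs; on a setup like this it should report /dev/md1 as the only physical volume of the “system” volume group (command shown as a sketch):

pvs -o pv_name,vg_name   # expect /dev/md1 listed as the PV of VG "system"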

So it was the LVM “physical volume” that somehow caused problems… but why? Up to that point, everything had gone quite smoothly, both during normal operation and during previous upgrades.

The riddle’s solution can be found in /etc/lvm/lvm.conf: these servers handle a large number of dynamically attached LUNs that carry file systems for Xen virtual machines. To keep the Dom0 (“host”) LVM from picking up volume groups on these additional disks, the servers were configured to use only the specific RAID1 created from the two dedicated partitions:

[... /etc/lvm/lvm.conf ...]
# we know what we're looking for
filter = [ "a|/dev/disk/by-id/md-uuid-e76f9f53:b7553f91:5ed64d64:e65d3ab5|", "r|.*|" ]

It seems that the mkinitrd scripts pick up that information and try to look up the named device in sysfs; the “!” characters in the error message are simply how slashes in a device path get encoded there. Why the scripts don’t find the obviously existing device has yet to be determined, but as a quick work-around we modified lvm.conf to use the (in our case persistent) generic device name:

[... /etc/lvm/lvm.conf ...]
# we know what we're looking for
#filter = [ "a|/dev/disk/by-id/md-uuid-e76f9f53:b7553f91:5ed64d64:e65d3ab5|", "r|.*|" ]
filter = [ "a|/dev/md1|", "r|.*|" ]

Once that change was in effect, both “mkinitrd” and the whole update could be completed.
