LVM metadata inconsistency trouble on boot

This is what “Murphy’s Law” is all about: You set out to fix a minor issue and end up spending the day fixing big trouble.

In this specific case, I had to reboot one of our more important servers because of problems with persistent network names. But after the reboot, the machine didn’t come up as expected (or should I rather say “hoped for”?), but instead went into Dracut rescue mode, AKA “initrd shell”.

We quickly noticed that only some of the LVM volume groups had been activated and that the volume group carrying the system LVs was unavailable. “lvm pvscan” not only listed too few PVs (the two disks of the system VG were missing), but also spat out a number of error messages: that it wouldn’t activate our system volume group, that it could not work with a standalone PV, and “read-only locking type set. write locks are prohibited”. Now what’s that about?

Some quick research pointed to a peculiarity of the Dracut environment, where LVM is deliberately set to treat its metadata as read-only. But why this message at all, and how to get our server back online?

As a first check, we booted from an up-to-date openSUSE Leap 15.1 image, and to our great relief, all data on these disks was still available. As a matter of fact, the disks appeared to give no trouble at all, and all logical volumes, even those from the system volume group, could be mounted and accessed. So we unmounted everything, cleanly deactivated the volume groups via “vgchange -an”, rebooted and hoped for the best.

Unfortunately, the original SLES 12 SP4 system again booted into rescue mode and gave the same symptoms as before.

In the end, the solution in our case was to configure LVM in that Dracut environment to use locking type 1, which allowed LVM not only to open the LVs, but also to change the metadata of the PVs. During the manual “pvscan” right after changing lvm.conf, it reported a metadata inconsistency between the two PVs required for the system volume group, and it also seemed to have repaired it auto-magically. Once this was done, a simple reboot of the machine (via “systemctl reboot” issued while in rescue mode) brought the system back online.
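For those who would rather script the lvm.conf change than edit the file by hand, here is a minimal sketch. The helper name “set_locking_type” is made up for illustration, and the sketch assumes that “cp” and “sed” are available in your rescue shell, which may not be true for every initrd.

```shell
# Hypothetical helper (not part of LVM): rewrite the locking_type value
# in an lvm.conf-style file.
# Usage: set_locking_type FILE VALUE
set_locking_type() {
    conf=$1
    value=$2
    # Keep a backup so the original read-only setting can be restored.
    cp "$conf" "$conf.bak" || return 1
    # Replace only the uncommented locking_type line, keeping its indentation.
    # Commented lines ("# locking_type = ...") are left untouched.
    sed "s/^\([[:space:]]*\)locking_type[[:space:]]*=.*/\1locking_type = $value/" \
        "$conf.bak" > "$conf"
}
```

In the rescue shell this would be called as “set_locking_type /etc/lvm/lvm.conf 1”, followed by the manual “lvm pvscan” and the “systemctl reboot” described above. As far as I understand, the edit only touches the copy of lvm.conf inside the initrd, which lives in RAM, so it should be gone again after the reboot.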

LVM’s locking mode is set in /etc/lvm/lvm.conf: Looking at the “global” section, you may see a statement “locking_type = 4” when in rescue mode – unlike during normal operations, where it is set to “1”, typically.
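To check which value is actually in effect before touching anything, something along the following lines could be used. This is only a sketch: “get_locking_type” is a hypothetical helper name, and it assumes “sed” is present in the environment you are in. According to the comments in lvm.conf, type 1 stands for local file-based locking, while type 4 is read-only locking that forbids any operation that might change metadata.

```shell
# Hypothetical helper (not part of LVM): print the locking_type value
# from an lvm.conf-style file, ignoring commented-out lines.
# Usage: get_locking_type FILE
get_locking_type() {
    # -n suppresses default output; the trailing "p" prints only lines
    # where the substitution matched, reduced to the numeric value.
    sed -n 's/^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}
```

Called as “get_locking_type /etc/lvm/lvm.conf”, it would be expected to print “4” in a rescue shell like ours and “1” on a normally running system.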

I can understand the idea of restricting Dracut to read-only access to LVM metadata at that early stage of boot. But unlike file system problems, which get fixed by calling “fsck” during boot, LVM metadata problems therefore cannot be fixed automatically during boot, even though LVM itself would be able to repair them. On top of that, the admin has to work out what LVM is trying to say: “Hey, there’s an inconsistency that I’d like to fix, but I’m set to read-only mode.”

If someone from the LVM team reads this: Please update the error message 🙂
