Sorry for the weird title – it’s about booting a SLES server when the device for the root file system initializes later than expected. We’ve experienced this issue with our Fibre Channel setup and raised a bug report with SUSE support, which yielded a quickly delivered fix.
The symptoms
Most of our servers have access to SAN-based disks, but usually boot from local media. With our new servers, we’ve started keeping the boot record on the local disk while using LVM with a SAN LUN as the only physical volume in the volume group. The root file system is on a logical volume in that VG.
When we boot the server, we get dropped to the initrd’s shell, because the root file system could not be found. Immediately running “vgchange -ay <volgroupname> && exit” in that shell lets the system boot successfully, but who’d want to intervene manually on every server reboot, especially when these servers are remote?
The cause
It was easy to determine that the volume group had no active logical volumes when we were dropped to the shell, and the boot messages (still on screen) indicated what had happened: the boot scripts do detect that the root file system is on LVM and try to activate the corresponding volume group – but this happens before the LUN is actually presented to the system (initialization of the Fibre Channel HBA takes a few seconds). This would be no problem if the boot script waited a moment until udevd had set things up properly, but it didn’t – it simply finished, and boot proceeded to the next step, where mounting the root fs failed due to the missing LV.
According to the boot messages still on screen, the HBA had finished its initialization even before we were dropped to the initrd shell; udev had set up all device entries, and thus running “vgchange -ay” successfully resolved the situation.
While our problem was related to LVM, this problem might hit anyone having their root or resume file system on a device with delayed activation.
The solution
We suggested adding a step to initrd’s “boot/61-lvm2.sh” to wait and retry until all required VGs could be activated, and received a PTF from SUSE with this functionality. I asked them to rework it a bit to make sure you still get to the initrd shell if someone goofed up the kernel parameters (specifying a bad root or resume fs) – but the current PTF at least fixes our problem.
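To illustrate the idea, here’s a minimal sketch of such a wait-and-retry step – function name, timing and usage are my own assumptions, not SUSE’s actual PTF code:

```shell
#!/bin/sh
# Sketch of the retry idea for initrd's boot/61-lvm2.sh.
# Retries a command until it succeeds or the attempts are used up.
wait_for_activation() {
    _cmd="$1"       # command to retry, e.g. "vgchange -ay <volgroupname>"
    _tries="$2"     # maximum number of attempts
    while [ "$_tries" -gt 0 ]; do
        if eval "$_cmd"; then
            return 0                 # LUN showed up, VG activated
        fi
        _tries=$(( _tries - 1 ))
        sleep 1                      # give the HBA / udevd a moment
    done
    return 1                         # still missing: drop to the initrd shell
}
```

In the boot script, something like this would wrap the volume group activation and fall through to the initrd shell on failure, just as before.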
This solution will only work with LVM-based setups, but it could be extended to always look for the specified root / resume devices and wait for their activation. I guess that’d be another call for support, though.
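Such a device-agnostic extension might look like the following sketch – rather than knowing about LVM, it just waits for the device node named on the kernel command line to appear (function and parameter names are illustrative assumptions):

```shell
#!/bin/sh
# Hypothetical generic variant: wait for the root/resume device node itself,
# whatever is backing it (LVM, multipath, plain disk).
wait_for_device() {
    _dev="$1"       # e.g. the device named in root= or resume=
    _timeout="$2"   # seconds to wait before giving up
    while [ "$_timeout" -gt 0 ]; do
        [ -e "$_dev" ] && return 0   # node exists, safe to mount
        sleep 1
        _timeout=$(( _timeout - 1 ))
    done
    return 1                         # not there: drop to the initrd shell
}
```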
Of course I don’t know when the change will hit the update channel, so if you run into this problem and want to contact SUSE for support, get in touch with me and I’ll hand out the “service request” number to make reporting to SUSE easier.
Update: New lvm2 patches have hit the update channels on July 29, bringing it to version “2.02.98-0.29.1”. Included with this patch is the fix for the problems described in this article.