When setting up our new NAS/SAN servers, we decided to use OpenSUSE 13.1 and to create a configuration with all HDDs in a RAID6 array, which then is expanded with SSD caching support via “bcache”. As this isn’t available right out of the box, here’s our description of the steps required to get this up and running.
What we basically need to have set up is
- a RAID6 array of all HDDs, used as the actual long-term data storage
- a RAID1 array of two SSDs, used as a caching device for the data storage
- joining these two in a “bcache”-conforming way
- and a kernel and initial ram disk that support this during system boot (we put “/boot” on a separate thumb drive, to have more drive bays for “real” disks)
Interestingly, OpenSUSE 13.1 does support booting from a “bcache”-based root device, at least once the latest updates are applied. Unfortunately, the setup routines do not support this out of the box, so some manual action is required. And last (but definitely not least) we found out that even the current mkinitrd scripts will not detect the run-time dependencies of our setup, so we had to “convince” them to create a functional initrd.
Creating the RAID arrays
The YaST installation routines start by collecting all setup information before actually changing anything, and we decided not to dig too deep into that to find a proper place to introduce “bcache” configuration. Instead, we chose to set up the disks manually during setup (you can do this in a separate step before booting into the YaST installer, if you like).
So once you can access a separate shell during setup (i.e. by pressing Ctrl-Alt-F2 while the screen to confirm the license text is displayed), the first step is to create the RAID arrays. In our sample case, the kernel named the six HDDs “/dev/sda” to “/dev/sdf”, then our thumb drive as “/dev/sdg”, followed by the SSDs named “/dev/sdh” and “/dev/sdi”. Please make sure to use the correct disk names for your specific situation – these are assigned dynamically by the kernel during boot.
- create a partition on every disk (HDD and SSD) that you want to use in a RAID array using “fdisk”, and set its type to “fd” (called “Linux raid autodetect”). This will give us “/dev/sda1” to “/dev/sdi1” (but without “/dev/sdg1” – the thumb drive remains as it is).
- create the two RAID arrays using “mdadm”
- The “cache device” is a RAID1: mdadm -C san01-cache -n 2 -l raid1 /dev/sd[hi]1
- The “backing store device” is a RAID6: mdadm -C san01-data -n 6 -l raid6 /dev/sd[a-f]1
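As a sketch, the partitioning step above could be scripted as follows. The device names match this article’s sample layout – double-check yours first, since sfdisk will happily overwrite the wrong disk:

```shell
#!/bin/sh
# Create one full-size partition of type "fd" (Linux raid autodetect)
# on each HDD (sda..sdf) and SSD (sdh, sdi); the thumb drive (sdg) is
# deliberately skipped. Destructive -- verify device names before running!
for disk in /dev/sd[a-f] /dev/sdh /dev/sdi; do
    echo ',,fd' | sfdisk "$disk"
done
```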
Depending on the size and speed of your disks, creating the arrays may take quite some time. It’s up to you whether you want to install the system in parallel, or would rather wait until the initialization has completed. I tend to opt for the latter, just to keep things simple.
The RAID arrays get created with a host-specific name, which is why you’ll see them reported as i.e. “linux:san01-data” (“linux” is the default host name at the early stages of the installer). If you happen to reboot your machine in the middle of the RAID init process, you may end up in a situation where the host name recorded in the array does not match your current host setup, and the MD subsystem suspends the RAID initialization… but only until the first write operation has happened or you force the array into write mode.
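To check the initialization progress, and to force a suspended array back into write mode, the standard mdadm tooling is enough (array names follow this article’s sample setup):

```shell
# Watch the resync/init progress of both arrays:
cat /proc/mdstat

# If an array sits in "auto-read-only" after a mid-init reboot, force it
# into write mode so the initialization resumes immediately:
mdadm --readwrite /dev/md/linux:san01-data
```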
Setting up bcache
Even the original OpenSUSE 13.1 installation image contains the module required to use “bcache” devices, unfortunately two things are missing: The module isn’t loaded per default, and the utility to set up the bcache devices is missing as well.
The utility is called “make-bcache” and, on current installations of the distribution, is found in “/usr/sbin/make-bcache” if the “bcache-tools” RPM is installed. We simply copy it over to our installation machine and set up the bcache device via the following steps, still in the shell from the previous step:
- set up minimum network configuration using “ifconfig” so you can use “scp”
- grab a copy of “/usr/sbin/make-bcache” from any other system running the same distribution
- load the “bcache” module (which is available with the installation image, but unfortunately not loaded)
- create the bcache device using “make-bcache”
make-bcache -C /dev/md/linux\:san01-cache -B /dev/md/linux\:san01-data
You can verify that the module loaded OK by checking for the existence of “/sys/fs/bcache”. Once you’ve run “make-bcache”, you should see a device file “/dev/bcache0”. If, for some reason, the device did not get registered automatically (or you’ve rebooted the system to start over), you need to echo the two RAID array device file names to the bcache registration file:
echo /dev/md/linux\:san01-cache > /sys/fs/bcache/register
echo /dev/md/linux\:san01-data > /sys/fs/bcache/register
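A few quick checks to confirm the bcache setup is in place (these are the sysfs paths current bcache kernels expose):

```shell
# The bcache module is loaded if this directory exists:
ls -d /sys/fs/bcache

# After registration, the composite device should be present:
ls -l /dev/bcache0

# State of the backing device ("clean", "dirty", ...):
cat /sys/block/bcache0/bcache/state
```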
Unfortunately, LVM in OpenSUSE 13.1 is not yet prepared to handle “bcache” devices. Fortunately, you can change this by extending lvm.conf. You’ll have to do this twice, once for the installation environment, and later on in the installed system (we’ll get to that step later on).
The installer has most of the tools and directories mounted read-only, but don’t despair: there’s an overlay mechanism, and all you have to do is follow these steps:
- “mv /etc/lvm /etc/lvmx && mkdir /etc/lvm && cp -rp /etc/lvmx/* /etc/lvm”
This will make sure you have a writable copy of the LVM config file available.
- Edit “/etc/lvm/lvm.conf” and add the following “types” statement to the “devices” section:
types = [ "bcache", 16 ]
- now you can create the LVM physical volume
pvcreate /dev/bcache0
- and lastly, create your LVM volume group
vgcreate san01 /dev/bcache0
If “pvcreate” complains that the device cannot be found, and you verified that “/dev/bcache0” exists, then it’s time to verify that you have properly extended “/etc/lvm/lvm.conf”.
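A quick sanity check after both commands – in the installer shell, the LVM sub-commands are reached through the “lvm” wrapper:

```shell
# Both should list /dev/bcache0 and the new volume group "san01":
lvm pvs
lvm vgs
```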
Continue the YaST installation
Actually, it’s more like starting the YaST installation, as we hadn’t done much there yet… It is essential to let the system use the updated packages via network, as otherwise you won’t catch the patches to initrd that are mandatory to support “bcache” during system boot. So on the installer screen, select to add online repositories right from the start.
On the next screen (“list of online repositories”), select the OSS and non-OSS update repositories in addition to the main repositories.
When you get to the step where you can set up your file systems, go for the manual “expert” route – click on “create partitioning” and, when asked to select a disk device to partition, select the expert mode (the option is below the disk list) before continuing to the next screen.
The installer should have detected the existing volume group (“san01”, which you created during the manual setup described above), so that you can use it to build your logical volumes. Another thing to keep in mind is that you’ll need a separate /boot device – we’re using a USB thumb drive for that, but any plain disk will do.
The typical layout we select for our LVMs is
- “root” with 1 GB, mounted as “/”
- “usr” with 3 GB, mounted on /usr
- “opt” with 500 MB (rarely used, thus you could do without on most servers), mounted on “/opt”
- “var” with 1 GB, mounted on “/var”
- “varlog” with 2 GB, mounted on “/var/log”
- “tmp” with 1 GB, mounted on “/tmp”
- “swap0” with 8 GB, used as swap
You can use the file system you trust most – I went for Ext4. In addition, I prefer to mount the file systems by label, which can be configured in the “fstab options” for each file system (including swap).
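For illustration, a label-based /etc/fstab for the layout above might look like the following excerpt (the labels are hypothetical – use whatever you assigned when creating the file systems):

```shell
# Hypothetical /etc/fstab excerpt, mounting by label:
# LABEL=root    /         ext4  defaults  1 1
# LABEL=usr     /usr      ext4  defaults  1 2
# LABEL=var     /var      ext4  defaults  1 2
# LABEL=varlog  /var/log  ext4  defaults  1 2
# LABEL=swap0   swap      swap  defaults  0 0
```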
When you’re at the stage where you can select the RPMs to install, make sure you include “bcache-tools” as well, as omitting this will leave out required scripts from the initial ram disk and you’ll have a rough time correcting things.
And beware: Do not reboot the system yet. Neither the created initrd nor the /etc/lvm/lvm.conf on the installed system are set up to properly boot the system.
A short break before the boot loader is set up
Once the installer has copied all the files into the logical volumes (or partitions) of your choice, it will try to reboot the system. The system will not be able to complete that boot unless you make the following modifications using the shell, optimally before the installer gets to the step of configuring the boot loader:
- Update the on-disk version of /etc/lvm/lvm.conf to include the “types” line mentioned above. If it’s not included, the volume group will not get activated during boot.
- Add the statement “need_mdadm=1” at the end of /etc/sysconfig/kernel, to make sure that the boot stage will actually try to activate the MD RAIDs and hence can set up bcache0 and LVM on top of them.
- Add the “bcache” module to the “INITRD_MODULES” line in “/etc/sysconfig/kernel”.
- If your machine needs special device drivers to support your disks, chances are high the automated scripts will not pick that up. This leads to missing devices for your disks during boot, which will come to a screeching halt. You can look up the required driver(s) in the installation shell by calling “hwinfo --block” and checking the entries for your disks. There, on a line starting with “Driver Modules:”, you should see the name of the module required to handle the according disk; add the name of each module to the “INITRD_MODULES” line in “/etc/sysconfig/kernel”.
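Put together, the relevant part of /etc/sysconfig/kernel could end up looking like this (the controller driver “ahci” is only an example – use the module names hwinfo reported for your hardware):

```shell
# /etc/sysconfig/kernel excerpt (module names are illustrative):
# - disk controller driver(s) as reported by "hwinfo --block"
# - "bcache" so the caching layer is available in the initrd
INITRD_MODULES="ahci bcache"

# Make the initrd assemble the MD RAID arrays during boot:
need_mdadm=1
```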
Typically, I do all this in a chroot environment during install, having my future file systems all mounted in all the right locations, and am able to manually run “mkinitrd” as the final step. If you hit the right spot (that is, prior to the first reboot), the installer will have mounted all the on-disk file systems, so that you’ll just have to invoke “chroot”, pointing it to the directory where your future root FS is currently mounted.
When you have to do this step at a later point in time, or generally have the need to repair the system via the according boot option of the OpenSUSE installer, you’ll have to prepare the mount points manually. The steps are described below in the “Repair” section of this article.
Reboot and enjoy
If you followed these steps and used a good amount of common sense to get around any differences that may have come up during your install, you’ll have a system that will boot up fine and you can both continue the installation and use it after.
And what to do if the system won’t boot?
While the basics behind all this are pretty simple, a lot can go wrong along the way. Usually, you’ll then end up with a system that cannot access its root file system and drops you into a shell at that stage of the boot.
There, you need to assess what has worked and what is missing. Can you see the disk devices in /dev (i.e. “/dev/sda1” for one of the partitions that make up your RAID array)? Did the RAID get set up, so that you can see the RAID devices in “/dev” (i.e. “/dev/md127” and “/dev/md126”)? Did the “bcache” module load correctly, so that “/sys/fs/bcache” shows up? Did “/dev/bcache0” get created, indicating the bcache device was already set up? Can you see the LVM physical volume, volume group and logical volumes (use “lvm” and its sub-commands to check)?
If you miss your disk devices, you probably didn’t include the proper module name in “/etc/sysconfig/kernel” in the INITRD_MODULES variable.
If you have the disks, but no RAID devices, run “mdadm -Ia” to see what happens… if this creates your array(s), then “need_mdadm=1” in /etc/sysconfig/kernel is either missing or didn’t take effect.
If you have the RAIDs, but no /sys/fs/bcache directory, try to run “modprobe bcache”. If that fails, you probably have omitted “bcache” from the INITRD_MODULES variable.
If it loads well, then something else went wrong and you’ll have to debug the initrd scripts to find out what. But you can get your running system back manually by issuing “echo /dev/md127 > /sys/fs/bcache/register” (and do this for any device you used to create your bcache during installation), which should make “/dev/bcache0” available.
If you have “/dev/bcache0”, “lvm pvscan” should report that the device was found. If not, then probably /etc/lvm/lvm.conf wasn’t edited correctly with regard to the “types” line.
Once you have successfully, albeit manually, reached the stage where you can see the “/dev/bcache0” PV, you should be able to continue the boot process by exiting the shell via “exit”.
Once you’ve identified what went wrong, or at least have a well-educated guess, you can access your disks by booting into the “Repair” mode of the OpenSUSE installation:
- log in as “root” (no password is required)
- verify that your disk devices and MD RAID are accessible
- “modprobe bcache”
- create the bcache device via “echo /dev/yourdevicename > /sys/fs/bcache/register” for the devices the bcache consists of (in our case, the two RAID arrays /dev/md127 and /dev/md126)
- make /dev/mapper writable via “mv /dev/mapper /dev/mapper.ro; mkdir /dev/mapper; mknod /dev/mapper/control c 10 236”
- modify lvm.conf via the steps described above (the “types” line in the “devices” section)
- activate your volume group (“vgchange -ay yourVGname”)
- mount your root fs to /mnt
- mount all other logical volumes to their respective places (relative to /mnt, i.e. /dev/vgname/usr to /mnt/usr)
- mount /sys, /proc and /dev via “mount -o bind /sys /mnt/sys” (and the others accordingly)
- don’t forget to mount your /boot device to /mnt/boot
- and then run “chroot /mnt”
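The repair steps above, condensed into a single session sketch (device, VG and LV names follow this article’s sample setup – adjust them to yours):

```shell
#!/bin/sh
# Assumes disk devices and MD arrays are already up in the repair system.
modprobe bcache
echo /dev/md127 > /sys/fs/bcache/register
echo /dev/md126 > /sys/fs/bcache/register

# Make /dev/mapper writable for device-mapper:
mv /dev/mapper /dev/mapper.ro
mkdir /dev/mapper
mknod /dev/mapper/control c 10 236

# ... add the "types" line to /etc/lvm/lvm.conf before this step ...
vgchange -ay san01

# Mount the on-disk file systems and enter the chroot:
mount /dev/san01/root /mnt
mount /dev/san01/usr  /mnt/usr
mount -o bind /sys  /mnt/sys
mount -o bind /proc /mnt/proc
mount -o bind /dev  /mnt/dev
mount /dev/sdg1 /mnt/boot   # your separate /boot device
chroot /mnt
```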
You now should have an environment where you can edit your on-disk configuration files, /lib/mkinitrd scripts (if ever required, i.e. for debugging) and even run “mkinitrd”. You may have to create symlinks in (chroot’s) /dev/mapper, if you didn’t make it writable before running “vgchange -ay”, but “mkinitrd” will tell you what’s missing.
Of course, creating and activating “bcache” isn’t all there is to it. For instance, your “/dev/bcache0” will be in write-through mode per default – you’ll have to create some script that enables write-back mode, should you decide you want that feature. There’s some system tuning you can do, and of course there’s the main feature of the evening: Enjoy the speed increase when using your system, especially for random accesses across your RAID 🙂
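As a minimal sketch, switching to write-back mode boils down to a single sysfs write (the attribute path is the one current bcache kernels expose; the setting may need to be reapplied after each registration, hence the suggestion of a boot script):

```shell
# Switch /dev/bcache0 from the default write-through to write-back caching.
# Caution: with write-back, dirty data sits on the SSD cache until flushed.
echo writeback > /sys/block/bcache0/bcache/cache_mode

# Verify -- the active mode is shown in brackets:
cat /sys/block/bcache0/bcache/cache_mode
```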