Creating a Linux-based SAN/NAS Server

When our former SAN server crashed in 2014, we set up a new SAN/NAS environment based on servers that were available to us at that time. While those servers were definitely suitable, they have their limits, so we decided to move on to “real” hardware.

This article is intended to describe our goals, our hardware setup and how we configured the various Open Source software bits and pieces, namely OpenSUSE Linux, NFS, Samba, and SCST.

Where to go

Setting up a reliable, reconstructable SAN/NAS environment can be tricky – especially if you have some non-standard requirements.

First of all, especially in small environments, the SAN/NAS tends to be a single resource that many other systems depend upon. On the other hand, one typically doesn’t have the resources to build highly redundant SAN server systems – more often than not, you don’t even have two separate locations to run such servers.

Then there’s the need for an “all-in-one” solution, providing access to “disk space” via many protocols and to many types of users: Windows shares come to mind, NFS may play an important role, and HTTP- and FTP-based access may be required as well. But there’s more than NAS (file-based access): block-level access, i.e. providing “virtual disks” to servers and virtual machines, leads on to iSCSI and/or Fiber Channel as the typical SAN technologies.

Speedy access to all the data is what everyone wants. But when you concentrate your data on a few (or even a single) NAS/SAN servers used by more than a very few “clients”, you quickly hit the “performance ceiling” of traditional setups. Trying to cope with this the traditional way means spending lots of money on, for example, many fast disks, which drives the overall cost into regions intolerable for most smaller companies.

When the office is relatively small, “high availability” of your NAS/SAN may not be a top priority – as long as your recovery times are relatively short and, more importantly, you can face temporary outages without losing existing data. Hence a certain amount of data redundancy quickly enters the picture, both “online” (creating a shadow copy of your current data) and offline (backing up your data to media you can take off-site, and keeping historic versions of your data).

All of the above pretty well describes what we need, but may not be willing to spend tons of money on: we have an even mix of Fiber Channel-attached clients (Xen servers running two or three dozen virtual machines) and NFS clients (mostly those same VMs, plus some physical desktop machines mounting central resources and home directories). Add a few Windows clients (we’re an almost 100% Linux shop, fortunately) and upcoming requests to run cloud infrastructure services, which tend to require iSCSI.

Current disk space demand is at about 5 TB, but growing fast – we expect a 150% increase by the end of 2016, so we’ll be facing 12 TB or more by then.

The vehicles

Originally, we were running a single server machine with dual Xeon E5410 processors, an Areca 1280ML RAID controller and a chassis housing up to sixteen 3.5 inch SATA disks. It was running “open-e DSS V6” as a Linux-based, pre-customized NAS/SAN software stack, supporting Fiber Channel clients via a QLogic QLE-2462 HBA. This was a “grown” solution, with 11 active 1 TB disks (and two hot spares) and a fair amount of manual tuning in addition to what open-e’s DSS brought along.

After a massive disk failure that required us to recover from tape backup, we switched the hardware platform to two separate servers (two “blades” from a Super Micro Twin² solution, SYS-2027TR-HTRF) that we had originally reserved to expand our pool of Xen servers. These servers already sported the required Fiber Channel HBAs (we had originally fitted them to provide FC access for our virtual machines), but they only have room for six 2.5 inch HDDs (only two of which run as SATA3), and they could not take another add-on card to provide the extra Ethernet bandwidth we expect to need when offering iSCSI LUNs.

Since the Super Micro Twin servers did a pretty good job overall, we decided to stick with the brand and product line, but went for a 2U twin chassis with only two server modules. That gives us 12 disk bays per module and room for two (low-profile) add-on cards. As we could go for a new hardware revision, we decided on the Super Micro Twin SYS-2028TP-DECR, giving us the best value for money, with up-to-date CPUs (we went for Intel Xeon E5-2609 v3), low power consumption and still some room for upgrades.

Important update: See this article for the need to update the SAS extender firmware of these servers!

Drop in QLogic HBAs (QLE25xx line, but with Hewlett-Packard branding… those are way less expensive than original QLogic cards) and you have what we’re running now.

But wait – an important part is missing: the disks!

When we moved to the Twin servers, we decided on 2.5 inch disk models. Now there isn’t a huge market for 24×7 NAS/SAN 2.5 inch drives, and we decided to stick with a lower-cost disk base. Hence we bought a stack of Western Digital “Red” drives: 1 TB of disk space in a small form factor, fairly low power consumption and not crawlingly slow. These “WDC WD10JFCX” drives do have an explicitly documented limit on the number of supported drives per chassis, which was at four or six disks when we started using them and has since been lifted to 16. The chassis we use sports 24 bays… but we’ve talked to WD representatives and were told that those limits are “based on customer experience” – in other words, the disks needn’t fail just because there are more than 16 in a chassis. It’s just that not enough folks have tried so far and reported back.

We’ve had our share of tuning experience, and of course these disks feel pretty slow when dealing with tons of requests from various clients at the same time. A good indicator is the I/O wait time reported on the current servers, which usually sits at 5 to 10 percent and peaks at 70 percent or more under load. We’re using large amounts of RAM on the servers to provide some caching, but with plenty of random accesses, things start to lag noticeably at these high I/O wait levels.
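
For reference (not part of our actual tuning work), watching that figure is easy with the standard tools shipped with openSUSE – vmstat from procps and iostat from sysstat:

    vmstat 5       # the "wa" column shows the CPU time spent waiting for I/O
    iostat -x 5    # "%iowait" plus per-device utilisation and await times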

So with the bigger number of HDD bays in the new chassis, we turned toward SSD caching. We dedicated two of the twelve bays to two “Samsung 850 Pro MZ” SSDs, leaving ten bays for real HDDs.

Another specialty (left over from the disk-restricted installation on the interim servers) is an additional USB thumb drive, offering 16 GB of space to hold the /boot file system. The reason for this was that we wanted (and still want) to run the HDDs in a RAID6 configuration, leaving no room for an extra OS disk in a 6-bay configuration (which is limited to only 4 TB of available disk space in RAID6 anyhow). As /boot is rarely accessed, using the thumb drive proved to be no problem at all, so we decided to keep that design for the new servers, too.
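
Just to illustrate (the label is a made-up example, not our actual configuration), the /etc/fstab entry for such a thumb-drive /boot could look like this – referencing the file system by label or UUID avoids surprises when USB device names shift between boots:

    # hypothetical /etc/fstab entry for /boot on the USB thumb drive
    LABEL=BOOT   /boot   ext4   defaults,noatime   1 2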

In addition to the two (new) SAN/NAS servers, there’s the former single server with its 16-bay 3.5 inch chassis. Let me just say that it’s a decent old server, sufficiently reliable to run as a backup machine. It’s equipped with enough disks to shadow the disks of the two active SAN/NAS servers and has a tape library attached via SAS.

And just to round off the picture: all our servers are covered by a UPS with sufficient capacity to handle 30 minutes of power outage.

Pick your card

When deciding on the operating system, there’s an easy part and a complex part.

The operating system had to be Linux. We’ve been down in the depths of that OS for years; we know how to break things and how to fix them. We’re Linux specialists.

The distribution is another matter. We have run SUSE Linux Enterprise for years, but given that we’d be on our own when it comes to support anyway, and that the price tag has grown significantly, we decided to run a free distribution. OpenSUSE has been on our list of supported OSes even longer than SLES, so it went to the top of our pick list. And knowing that 13.1 is on the Evergreen radar as the next long-term support release, we chose it over the currently released 13.2.

From disk ’til dawn… or at least up to the file system layer

Let me elaborate on our disk setup: per node, we have six (extendable to ten) 1 TB SATA HDDs, two 128 GB SSDs and a 16 GB thumb drive. In order to get high performance and a proper “available space” to “cost” ratio out of this scenario, we decided on the following configuration (a command-level sketch of the resulting stack follows after the list):

  • All HDDs are combined into a RAID6 array. Level six gives us two parity disks, so we can cover the loss of two active disks without the array going down immediately. With the starter configuration, we have about 4 TB of net capacity and can double that to 8 TB by adding new disks.
    We’ll use Linux MD-RAID to create the array, for two reasons: moving the array to new hardware (e.g. in case of a machine failure) is unproblematic as long as the target is a compatible Linux server (the CPUs are and will be powerful enough to handle the extra load of calculating the array data), and the hardware RAID support of the new servers seems to be unavailable for the chassis-mounted disks. The latter can be attributed to the SAS expander that is used to offer 12 bays per server node, while the mainboard only has 4 plus 6 SATA ports that the controller can see directly.
    This array will be referred to as “RAID-data”.
  • The SSDs will be combined into a RAID1 array. This is mainly for reliability reasons: we will use the SSDs as a block-level caching device, so a single failing SSD might cause data loss… a mirror reduces that risk significantly.
    This array will be referred to as “RAID-cache”.
  • Using “bcache”, we’ll create a cached block device that uses RAID-data as its (slow) main storage and RAID-cache to significantly speed up random data accesses.
    We’ll refer to this as the “bcache block device”.
  • We’ll use LVM to partition the bcache block device into appropriate chunks for the various file systems. These file systems serve both regular server operation (root fs, /usr, /var, /var/log, /opt, /tmp and swap) and the NAS and SAN operation (one LV per “share” and per FC/iSCSI target).
  • To boot the machine in this configuration, a separate /boot partition is required. We’ll put that onto the USB thumb drive.
  • To create further redundancy and a separate backup base, all NAS/SAN-oriented logical volumes will be shadowed to the “backup server” via DRBD (a minimal resource sketch follows below). That way we have a live shadow copy of our current data, and we can periodically pause replication per share to create a consistent 1:1 recovery image of that share.
  • We’ll stick to using Ext4 file systems, rather than, say, Btrfs. This decision was based on reading various comparisons on the ’net, which seem to indicate that ext4 is a tad more “stable” and probably a bit faster.
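
To make the above list a bit more tangible, here’s a minimal sketch of how such a stack could be assembled on the command line. Device names (/dev/sd?, /dev/md?, /dev/bcache0), volume group and LV names and all sizes are made-up examples, not our actual configuration – the real installation is the topic of part 2.

    # RAID-data: six 1 TB HDDs in a RAID6 array (to be grown to ten disks later)
    mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]

    # RAID-cache: the two SSDs, mirrored
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdh /dev/sdi

    # bcache: RAID-data becomes the backing device, RAID-cache the caching device
    make-bcache -B /dev/md0        # backing device, shows up as /dev/bcache0
    make-bcache -C /dev/md1        # caching device
    # attach the cache set to the backing device (the set UUID is printed by
    # make-bcache and can also be found under /sys/fs/bcache/)
    echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # LVM on top of the cached device: one VG, then one LV per system file system,
    # per NAS share and per FC/iSCSI target
    pvcreate /dev/bcache0
    vgcreate vg_san /dev/bcache0
    lvcreate -L 20G  -n lv_root    vg_san
    lvcreate -L 500G -n lv_share1  vg_san    # example NAS share
    lvcreate -L 200G -n lv_vmdisk1 vg_san    # example FC/iSCSI target
    mkfs.ext4 /dev/vg_san/lv_share1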
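
The DRBD shadowing towards the backup server then works per logical volume – one resource per NAS/SAN LV. Again just a hedged sketch: host names, addresses, ports and device paths below are placeholders, not our real configuration.

    # /etc/drbd.d/share1.res – hypothetical resource definition for one share
    resource share1 {
      net { protocol A; }          # asynchronous replication towards the backup box
      device    /dev/drbd10;
      disk      /dev/vg_san/lv_share1;
      meta-disk internal;
      on san1    { address 192.168.10.1:7790; }
      on backup1 { address 192.168.10.2:7790; }
    }

    # then, on both nodes:
    drbdadm create-md share1        # initialise the DRBD metadata
    drbdadm up share1               # bring the resource up
    # and on the SAN server only, to start the initial sync:
    drbdadm primary --force share1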

There are alternatives to “bcache”, and since we’re using LVM anyhow, you may wonder why we’re not using a combination of dm-cache and dm-raid. When comparing these options, dm-cache was the deciding element: from what I can tell, you need to create a caching area for each and every LVM logical volume (see the sketch below). Thus we’d have to prepare for a growing number of LVs by not allocating all SSD space to caching from the start. On the other hand, when using bcache the devices need to be specially prepared (unlike with dm-cache, where you can simply add caching to a data-carrying LV), but as we’re starting from scratch, this isn’t the greatest of problems.
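
For comparison, a (purely hypothetical) dm-cache setup via LVM’s lvmcache front end would look roughly like the following – note how every data LV gets its own cache-pool LV, which is exactly the property described above. This assumes an alternative layout where the volume group sits directly on RAID-data and RAID-cache instead of on the bcache device:

    # carve a cache pool for one LV out of the SSD mirror (/dev/md1)
    lvcreate -L 20G -n share1_cpool      vg_san /dev/md1    # cache data
    lvcreate -L 64M -n share1_cpool_meta vg_san /dev/md1    # cache metadata
    lvconvert --type cache-pool --poolmetadata vg_san/share1_cpool_meta vg_san/share1_cpool
    # attach the pool to the existing data LV
    lvconvert --type cache --cachepool vg_san/share1_cpool vg_san/lv_share1
    # ...and the same again for every further share or target LV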

Using “bcache” will make us use three layers of storage handling (MD-RAID, bcache and LVM), but it’s a structure that’s rather simple to maintain and has a long-standing history of reliability. And… bcache is said to be faster than dm-cache ;).

Read on in part 2: Installing the operating system.

 
