Optimizing, it ain’t easy, but somebody’s got to do it!

Some problems can be solved by throwing big money at them. When it comes to storage performance, that could mean spending your bucks on faster and/or more disks, or paying someone to optimize your environment.

For people like me, that’s like chickening out: I’d rather understand, learn and then fix it myself than hire one of the IT voodoo priests. And I don’t like spending big bucks when I can get by without ;).

This article is about optimization steps we took in our own production environment. It is not about absolute numbers (even if one or another may appear below), but rather about the screws you may adjust to get things to fit your own requirements. Tuning is about you and your system; you need to find out where you want to go. What works for us may not work for you, or may even turn things for the worse. But once you start to understand your options, you can get your set of screws turned to the right point.

Thankfully, there already are quite a few good articles covering many of the details out there on the net, and I recommend reading them as well. I just decided to sum up our experiences in this article.

Starting point

It all starts with: problems. We had increasingly been experiencing a lack of “desktop responsiveness”, be it in GUIs or CLIs. This was accompanied by sometimes high loads reported by our Xen servers (and the storage server) and by increasing i/o waits reported at the storage server itself.

If you have been browsing through this blog, you’ll already know that we’re using a storage server running open-e’s DSS software, currently still at version 6. It’s used to feed (in order of importance)

  • Fiber Channel LUNs for virtual machine disks
  • NFS shares via Ethernet
  • SMB shares via Ethernet
  • rsync services
  • iSCSI

iSCSI is currently only used for some tests; most of our traffic (90%) is Fiber Channel plus NFS.

Though it’s not overly important at the details level, here’s a general picture of the environment in question:

The storage server is a 64bit dual-“Intel Xeon” system with 4 GB of main memory, a QLogic 4Gbps Fibre Channel card, four Gigabit Ethernet ports and an Areca ARC-1261 RAID card with 12 1TB SATA disks (used as a single RAID6 volume set). It is currently running Open-E DSSv6 (but waiting to be upgraded to V7 ;)), mostly used as an NFS and Fibre Channel server.

A QLogic Fiber Channel switch is used to connect our (currently two productive) Xen servers to the storage server, forming a SAN. The FC switch is required as we use NPIV for the virtual machines, which is not available in non-switched environments without special support on all hosts.

Our Xen servers are typical Intel servers, with sufficient memory and more-than-needed CPU, using FC to access back-end storage for the virtual machines. (This is done by creating NPIV vHBAs per VM, which are only active when the VM is up, and only on the server where the VM is active.) Some OCFS2 (backed by a FC LUN) is used within the Xen servers to provide shared disk storage for the VM configuration files and Xen cluster locking mechanisms; NFS is used to access NAS-stored installation images and tools.

Throw in some more servers for various services, and you have the general idea. One especially mentionable server is the “backup server”, which uses SAN-based disk space as a cache before spooling the backup data to a tape library. Many resources are on NFS shares, which in turn are mounted on several, if not all, virtual servers and on the client systems, too.

IOPS difficulties

Once we got a stable DSS setup, it was only a matter of time until increasing workloads made performance tuning measures inevitable.

Our main problem seemed to be slow responses, sometimes with NFS, sometimes with the virtual disks of the VMs, sometimes both. Our first, rather uneducated guess was “add more spindles, mechanical disks are sooooo slooooooooow”. That’s how we got so many disks into our RAID… and, you guessed it, they gave only short relief.

To cut a month-long story short: our IOPS were increasing more and more, and so was the time spent waiting. We found many knobs to turn; most of them are specific to a certain environment, and there are quite a few inter-dependencies. Time to learn…

Design goal

(This chapter slipped in after our first approach, and with it the first draft of this article, failed to get us where we needed to go. We had hoped to improve overall performance by tuning a few settings, but things actually got worse: we hadn’t thought it through to the end, which brought the problems back at us like a boomerang.)

Our first lesson learned the hard way: Find out what you’re trying to achieve.

In our case, this turned out to be “responsive reads” and not “quick writes to the RAID array”.

The key elements to our solution were

  • a massive increase of the write cache (at the storage server OS level)
  • prioritization of reads at the device level (the RAID storage as seen by the storage server OS)
  • provisioning of a sufficient number of NFS server threads

and after we were done with all that,

  • optimizing our client applications to kill unnecessary load

Of course it was helpful to improve the write speed of the RAID array and to fine-tune the various i/o schedulers and the like, but the first three items above really made the difference at the SAN/NAS level, and the last one gave things the finishing touch.

That first list came from looking at the technical requirements, imposed by our workload setup: We have a large number of virtual machines, all their virtual disks are hosted on the storage server, accessed via Fiber Channel. So that “SAN server” faces high numbers of small requests, scattered across the RAID (all those virtual disk LUNs). This calls for a RAID setup with small stripe size. On the other hand, the same server (the NAS part) feeds files from local file systems to many clients via NFS – this calls for moderate to large stripe sizes. We were unsure which way to go.

More important than the RAID stripe size proved to be a proper cache configuration on the storage server, at the OS level: With enough RAM cache, immediate writes to disk are less important than responsive reads. Of course our storage server is UPS-backed, so we decided to trade guaranteed integrity for speed.

Below are descriptions of the individual tuning steps we took. As always, take these as a report of what worked for us; you’ll have to find out for yourself what works best for you.

Fixing the RAID setup

The RAID came pre-configured and pre-initialized by the shop where we bought the system. They had heard “file server” and chose a moderate RAID stripe size of 64 KB. That’s fine if you have (large) files to serve, but our system was mostly used for Fiber Channel with its traditional 512-byte blocks. When we noticed “60 to 100 % i/o wait” at the storage server, even under moderate access to the VM disks or moderate writes to NFS, we assumed this was caused by the increasing number of FC LUNs accessed in parallel by all these virtual machines we have.

We decided to change the stripe size to its minimum (4 KB), which can be done live with the Areca controller. At least with the CLI – our card’s BIOS (v1.47) still featured a bug where it didn’t detect that the “Confirm the operation” check box of the HTTP interface was actually selected, and therefore aborted the requested operation. Luckily, the CLI program did its work just fine.

Changing the stripe size of our 5 TB volume set took us around two and a half days… and would have stopped us from working, hadn’t we been able to do some other optimizations, too. As the “migration job” ran in the background, we had set its background priority to “low”. Had we switched that to “medium” or even “high”, at least during off-peak hours, the migration might have completed much faster.

In the end, I believe it wasn’t worth the hassle: Having a rather mixed work load (partly calling for larger and partly calling for smaller stripe sizes), we didn’t notice any immediate improvements: You just can’t do it right.

Caching at the storage server

We had decided to give the storage server an ample amount of memory, which is mostly used as a file system cache. The obvious use case is NFS and Samba, the traditional file-based services. But we also decided to use file-based back-ends for our Fibre Channel LUNs, which made them “files on XFS” from the OS point of view. That way, they can profit from the OS cache as well.

There’s a lot you can tune at the OS level, but a major thing to keep in mind is that Linux is distributed with workstations in mind. When you’re operating a server, especially a storage server, you most likely will want to change some of the default settings:

  • If you’re using volumes off a RAID controller, the default CFQ scheduler isn’t ideal. You’ll find recommendations to use “noop” instead, leaving all sorting etc. to the RAID controller, but our best results were with the “deadline” scheduler, which favors reads while keeping write requests from starving. It doesn’t hurt to keep some dirty pages in the cache a little longer, but clients will notice if their read requests take longer.
  • With tons of memory for caching (and a UPS, rounded off by some faith), you may consider turning up the share of memory available to dirty pages (“echo 80 > /proc/sys/vm/dirty_ratio”) and the time you’re willing to just keep them in memory rather than writing them to disk (“echo 1000 > /proc/sys/vm/dirty_writeback_centisecs; echo 6000 > /proc/sys/vm/dirty_expire_centisecs”).

There’s a pretty good round-up of the caching semantics and variables by Frank Rysanek, hosted at the “FCC prumyslove systemy s.r.o” site. I truly recommend reading that stuff!

Here’s what we ended up with: our complete RAID is available through a single disk device, where we decided to use the “deadline” scheduler, setting a read time-out of 250 ms and a write time-out of 60 seconds, and allowing reads to jump ahead 10 times before a waiting write request gets served:

echo deadline > /sys/block/sda/queue/scheduler
echo 250 > /sys/block/sda/queue/iosched/read_expire
echo 60000 > /sys/block/sda/queue/iosched/write_expire
echo 10 > /sys/block/sda/queue/iosched/writes_starved
echo 16384 > /sys/block/sda/queue/nr_requests

To balance write caching against flushing the dirty pages out to disk, the following values were set:

echo 80 > /proc/sys/vm/dirty_ratio
echo 80 > /proc/sys/vm/dirty_background_ratio
echo 6000 > /proc/sys/vm/dirty_expire_centisecs
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
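
One caveat worth noting: all of these /proc and /sys settings are runtime-only and are lost on reboot. How to persist them depends on your distribution; on a plain Linux box, an rc.local-style snippet like the following sketch would be one option (re-using the exact values from above, with “sda” as an assumption for the RAID device name):

#!/bin/sh
# re-apply the i/o scheduler and dirty-page tuning at boot time
# (adjust "sda" if your RAID volume shows up under a different name)
Q=/sys/block/sda/queue
echo deadline > $Q/scheduler
echo 250 > $Q/iosched/read_expire
echo 60000 > $Q/iosched/write_expire
echo 10 > $Q/iosched/writes_starved
echo 16384 > $Q/nr_requests
echo 80 > /proc/sys/vm/dirty_ratio
echo 80 > /proc/sys/vm/dirty_background_ratio
echo 6000 > /proc/sys/vm/dirty_expire_centisecs
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs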

NFS server tuning

Depending on the number of NFS clients, they may have to queue up to get their work (read and write requests) handled by the server. Some statistical information can be found in “/proc/net/rpc/nfsd”; especially the line starting with “th” is of interest:

th 256 10711 1131.920 136.950 91.430 162.380 1577.840 5.980 3.570 4.940 3.670 60.900

The first number (“256”) states the number of NFS server threads currently available. The default value is 128, so the example already has that number doubled via “echo 256 > /proc/fs/nfsd/threads”. The second number (“10711”) states how many times all of these threads have been busy at once (obviously, a few more threads would have kept some clients from waiting). The last ten numbers state how long (in seconds) “0 to 10%”, “10 to 20%”, …, “90 to 100%” of these threads have been busy. According to the sample numbers, some more threads seem desirable – but they come with a memory trade-off. And depending on the number of clients versus the number of threads, you may ask yourself *why* these threads were busy: maybe it’d be better to increase the speed of your i/o subsystem, which would help the threads answer more quickly and be ready to serve new requests.
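
If you don’t want to decode that “th” line by hand every time, a small awk one-liner (just a sketch, assuming the line format shown above) prints the interesting bits:

# print thread count, "all threads busy" count, and the top histogram bucket
awk '/^th/ { printf "threads: %s, all-busy count: %s, seconds with 90-100%% busy: %s\n", $2, $3, $NF }' /proc/net/rpc/nfsd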

Like the initial rant about throwing money at a problem, increasing servers (or threads, in this case) may not be the solution, however obvious it may seem at first.

With DSS, the proper way to adjust that server thread number is described here – the tuning options are available via the console.

With our large number of VMs and a good handful of Linux workstations and servers, all of them making heavy use of NFS mounts, the default number of NFS server threads was far too low to handle all requests in time. We decided to increase that number to 256 and still see times where even that number of threads gets used up.

echo 256 > /proc/fs/nfsd/threads

Ethernet tuning

Lots of traffic on our SAN/NAS server is generated via NFS mounts, so looking into TCP/IP and Ethernet tuning seemed worthwhile. We’re using Intel network adapters in the storage server, so the information from e.g. NuclearCat’s Wiki came in very handy.

With server-grade network adapters, there is support for off-loading some tasks related to sending and receiving IP packets. Depending on the actual brand and model, there may be various things to fine-tune, too. The tool of choice this time is “ethtool”, which is most probably part of your Linux distribution.

You’ll run “ethtool” against the adapter(s) you’re actually using. If you have joined interfaces into a bond, you still have to run ethtool against the individual adapters, not against the bonding device. Changing settings via “ethtool” may temporarily bring down the interface, even when done correctly, so be prepared for short service interruptions. If you’re using a bond, I suggest monitoring it and holding off on further interface changes until the bond has settled. And while it worked for us, I’ve seen reports on the net that some Ethernet drivers react strangely to changes on bonded adapters, so YMMV.
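
For a bonded setup, a loop like the following (a sketch, assuming a bond named bond0 and the sysfs interface of the Linux bonding driver) saves you from forgetting one of the slaves:

# inspect the offload settings of every slave of bond0, not of the bond itself
for IF in $(cat /sys/class/net/bond0/bonding/slaves); do
    echo "== $IF =="
    ethtool -k $IF
done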

Using “ethtool -k eth0”, you can check the current offloading settings. We decided to activate all forms of offloading except for “UDP fragmentation offload” (there’s a bug in kernel 2.6.32 for sure, corrupting NFS sessions, and I’m not certain it isn’t in earlier kernels, too):

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
#
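
For reference, here is a sketch of how such offloads can be switched with ethtool’s “-K” option; the exact set of supported options depends on your driver and ethtool version, so check your own “ethtool -k” output first:

# enable the offloads we want to use...
ethtool -K eth0 rx on tx on sg on tso on gso on
# ...but keep UDP fragmentation offload disabled (NFS corruption bug mentioned above)
ethtool -K eth0 ufo off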

Another thing to check is adapter statistics, via “ethtool -S eth0”. Some values worth looking at (from the tuning point of view) are “rx_no_buffer_count”, “tx_restart_queue” and “alloc_rx_buff_failed”:

# ethtool -S eth0
NIC statistics:
     rx_packets: 3804176178
     tx_packets: 3211333743
     rx_bytes: 2550877840562
     tx_bytes: 1454273624765
     rx_broadcast: 26420944
     tx_broadcast: 198774
     rx_multicast: 302601
     tx_multicast: 295502
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 302601
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 7866
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 602502
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 185624548
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 2550877840562
     rx_csum_offload_good: 3776185645
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 27622129
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0
#

Since we saw some missing buffers, we decided to increase the receive ring buffer from its default of 512 to 2048 descriptors:

ethtool -G eth0 rx 2048
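
To see what the adapter actually supports before resizing, “ethtool -g” reports the pre-set maximums alongside the currently configured ring sizes:

# show maximum and current rx/tx ring sizes for eth0
ethtool -g eth0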

Another topic worth looking at: “jumbo frames”. Those are Ethernet frames with a size larger than the standard 1500 octets – giving you a better payload/overhead ratio throughout the whole transmission, at least if your read/write payload exceeds that size. But as they are non-standard, all components have to support them, and that’s something we’d like to make sure of before implementing this feature in our production network. So bringing “jumbo frames” into production will have to wait until we find some spare time to test things thoroughly.
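
For the record, the host-side part of that change is just a larger MTU on every interface along the path (switches and all other components need matching support, too); a sketch using iproute2:

# enable jumbo frames on one interface -- only useful once the whole path supports them
ip link set dev eth0 mtu 9000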

SCST tuning

DSS V6 still runs the “old” SCST software, i.e. the one without “sysfs” administration. But even with the “procfs” implementation you can fine-tune SCST: depending on the number of Fiber Channel clients, the standard number of worker threads might easily be too low to handle all requests concurrently. So if you, like us, have tons of VMs and each has its own vHBA, then you may gain some responsiveness by increasing the number of SCST worker threads. One thread per vHBA seems a good maximum, so check your “/proc/scsi_tgt/sessions” and increase “/proc/scsi_tgt/threads” accordingly:

# wc -l /proc/scsi_tgt/sessions
36 /proc/scsi_tgt/sessions
# echo 40 > /proc/scsi_tgt/threads
#
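
If you want to tie the two values together, a tiny sketch like this (counting the session list and adding a little headroom, just as we did manually above) would do:

# size the SCST thread pool from the current session count, plus some headroom
SESSIONS=$(wc -l < /proc/scsi_tgt/sessions)
echo $((SESSIONS + 4)) > /proc/scsi_tgt/threads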

As with many things, this comes with a memory penalty – if you’re short on memory for caching, increasing the number of threads will most probably do no good, unless your disk subsystem is faster than your write requests ;).

Hardware tuning

While at it, we took a look to the left and to the right as well. One thing that caught our eye was the extremely uneven distribution of interrupts across the CPUs/cores of the storage server for the RAID card driver. (This could be seen by looking at the contents of “/proc/interrupts”, at the line for “arcmsr”: only a single core handled interrupts for this controller.)

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:       3844       3859       2828       3813       3889       3918       3910       3620   IO-APIC-edge      timer
  1:          1          4          1          1          2          1          1         30   IO-APIC-edge      i8042
  4:       1799         10  168446240         24         33         32         32         32   IO-APIC-edge      serial
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 12:         17         16         16         16         17         17         16         16   IO-APIC-edge      i8042
 14:         20         19         20         20         18         19         22         21   IO-APIC-edge      ide0
 15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
 18:          0          0          0          0          0          0  135219273          0   IO-APIC-fasteoi   arcmsr
 [...]
#

We decided to manually permit distribution across all CPUs by changing the SMP affinity mask of the card’s IRQ (a hexadecimal bitmask of the CPUs allowed to handle it, so “ff” covers cores 0 through 7):

# echo ff > /proc/irq/18/smp_affinity
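
To verify the effect, you can simply watch the “arcmsr” line for a while; the counters should now increase on all CPUs:

# re-read the arcmsr interrupt counters every few seconds
watch -n 5 'grep arcmsr /proc/interrupts'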

Now the interrupts distribute equally across all cores. But did performance improve? I actually cannot tell any difference… so most probably, this has only a small (if any) effect.

Applications and users

After applying all of the above changes, the most noticeable impact was achieved by analyzing the remaining load and identifying the processes that created most of it… we found two scripts that did their job quite badly. To make matters worse, one of them ran every quarter of an hour during daytime and kept quite a few NFS server threads busy. With a simple change inside the script, its run time dropped from several tens of seconds to under a second – most of that excess time had originally been spent waiting for the NFS server to respond. Nasty.

After optimizing these scripts, we were back to statistical values we hadn’t seen in years. Tuning the system itself will definitely have helped, but running high-load clients is likely to bring down your infrastructure even if it’s well optimized ;).

Conclusion

This article, unlike many available on the net, isn’t about absolute performance numbers, nor did we start our optimization with such numbers in mind. I wanted to point out some of the areas that are worth looking at, with some hints at how we did it, and why. Optimization always has to start with a bottleneck analysis, and the measures have to fit the cause; they are very specific to the technical environment. So any general advice will probably not only be imprecise, but potentially dangerous when implemented without thinking it over. Be careful!

Currently, we seem to be running fairly well with our newly tuned values. The storage server “load” (processes waiting for i/o, mostly nfsd and SCST threads in our case) has dropped to values mostly below 4 during work hours, which corresponds to the good level of responsiveness felt at the client stations. I/o waits, as reported by “vmstat”, are considerably lower and more often below 5 percent than not. So to us this tuning effort has been a success, and hopefully yours will be, too!

 
