Horde 5 and Kolab 2.2 – yes, we can!

I’ve been trying for months to get the latest Horde groupware client to integrate with our Kolab server. While this had been promised for Horde version 4 already (and the first implementation attempts reach back to Horde version 1, if I recall correctly), it had always been more effort than I was able to invest. But at least I tried 😉

Kolab once shipped with Horde as its web-based client, but that has changed, so maybe there were political games being played in the background without me noticing. And while I once was told that Kolab integration was to be expected for Horde 4 (H4), we now have Horde 5 out there and Kolab integration still doesn’t come right out of the box.

Horde 5 can be made to talk to Kolab

Nevertheless, with Horde 5 I’ve been able to get as far as having an almost fully working integration: It took me the better part of half a working day to debug, (more or less) understand and change some code, but then it was “Eureka!” time: Horde 5 and Kolab 2.2 – yes, we can!

(And like with the original “main user” of that slogan, when it comes to re-election time and you look back at what has been achieved, not all is golden. There are some minor problems still waiting to be solved. But I keep up my vote for Horde 😉 )

Let’s take a look at the half-full glass of wine, rather than whine: With the changes I’ve made (see further down this article), not only do the standard web-mail features work, but I can also use my Kolab/IMAP-stored calendar, address books, notes and task lists, together with the shared resources of other users. Great!

I’ve been discussing this privately and was told that “Kolab works with Horde, you just have to do the manual configuration” – looking at some other reports, there actually seem to be some non-developers out there interfacing Horde with Kolab servers, using groupware functions. Must be some kind of configuration artwork, because as soon as I turn on all the Kolab drivers in Horde’s configuration options, I start getting PHP errors in the logs and white browser screens.

For completeness, here are the things that don’t work for me (yet):

  • Persisting the address book “name format” settings doesn’t work for me: Although I receive a positive response message after saving, the stored value is always “no format”. (Update: This problem turned out to be related to my account’s preferences as stored in the corresponding IMAP folder – after re-initializing the preferences, the above error could no longer be recreated.)
  • Maybe completely unrelated to the Kolab changes, but who knows: When I sync my mobile phone with Horde, I can see changes made on the phone within Horde. But when I change that data (e.g. a calendar entry) inside Horde, the change doesn’t find its way back to the phone. (Update: This is no Horde bug, but PHP’s fault.)

While testing my setup, I came across two nasty problems:

There seems to be a bug in PHP, causing error messages (and corresponding failures of Horde) like:

PHP Fatal error:  Base lambda function for closure not found

Once I ran into this, I had to restart my Apache web server to get things back to normal… but a few requests later, I got struck by the same error. There was an update to PHP on the server machine, but the problem still manifests much too often. (Just for reference: I’m talking about a SLES 11 SP2 machine, 64bit, currently using php53-5.3.8-0.39.1)
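Until that bug is fixed for good, a blunt workaround is to watch the logs for that message and restart Apache when it shows up. A minimal sketch, assuming the default SLES 11 log location and init script:

# grep -c "Base lambda function for closure not found" /var/log/apache2/error_log
# rcapache2 restart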

The other problem was more of the “user at keyboard detected – system halted!” type: The task application (“nag”) seemed to be unable to create new tasks. As I was testing my own changes, I attributed these errors to them, but it actually turned out that you need to have mod_rewrite activated on your web server and to allow .htaccess overrides for those rewrite rules, too. This seems to be documented somewhere, but it also seems many others missed that info.

I solved this by creating a new file /etc/apache2/conf.d/horde5.conf:

# ActiveSync support
Alias /Microsoft-Server-ActiveSync /srv/www/htdocs/horde5/rpc.php
Alias /autodiscover/autodiscover.xml /srv/www/htdocs/horde5/rpc.php

# allow .htaccess with individual mod_rewrite settings
<Directory /srv/www/htdocs/horde5>
    Options FollowSymLinks
    AllowOverride All
</Directory>
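
Don’t forget that mod_rewrite actually has to be loaded for those .htaccess rules to take effect. Something like this should do on a SUSE-style system (commands are SUSE-specific, other distributions differ):

# a2enmod rewrite
# apachectl configtest
# rcapache2 reload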

There may be many other problems I haven’t come across yet, those are just the few I noticed after spending a few hours with Horde 5. I’ll put the modified “Horde 5 / Kolab” system up as an alternative to our “old” H4-based web interface, we’ll see how things work out after some productive use.

The changes

This being version 2 of this article, the required changes have dropped drastically thanks to the good and continuous work of the Horde development team. It basically came down to two code changes and tuning the (not too up-to-date) “kolab-webmail” bundle. I’ve attached three patch files to this article, for those who would like to apply them against the git version or manually apply them to the files coming via PEAR.

Here’s what I had to do to bring up a functional Horde5 with working Kolab 2.2 integration, and why I deemed it necessary:

  1. Install the current Horde5 from its PEAR repositories
    This ought to be straight-forward and is documented on the Horde site. Just make sure to install the webmail package and don’t actually access the site via browser before making the changes documented below.

    # pear install horde/horde_role
    # pear run-scripts horde/Horde_Role
    Including external post-installation script "/usr/share/php5/PEAR/PEAR/Installer/Role/Horde/Role.php" - any errors are in this script
    Inclusion succeeded
    running post-install script "Horde_Role_postinstall->init()"
    
    init succeeded
    
    Filesystem location for the base Horde application : /srv/www/htdocs/horde5
    
    Configuration successfully saved to PEAR config.
    
    Install scripts complete
    
    # pear install -a -B horde/webmail
  2. Integrate the kolab_webmail bundle files from git
    The git repository of Horde has all the latest code, which in the case of the kolab_webmail bundle doesn’t automatically mean “current” :(. Some of the stuff outright caused errors, some of it seems to do no harm, some of it is really helpful. I’ve created and submitted a patch to the Horde team with my modifications; hopefully they’ll find some or all of it worthy of integrating with the rest of the code.
    You’ll find the files under “bundles/kolab_webmail/” in the base directory of your git clone and can copy them to the corresponding directories beneath the “Filesystem location for the base Horde application”.
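    A minimal sketch of that copy step, assuming the git clone lives in /usr/src/horde-git and the base application location from above:

    # cp -a /usr/src/horde-git/bundles/kolab_webmail/. /srv/www/htdocs/horde5/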
  3. Fix a few files, see the linked patch files for details:
    1. /usr/share/php5/PEAR/Horde/Core/Factory/Group.php
      Somehow, the file “/usr/share/php5/PEAR/Horde/Group/Kolab.php” isn’t included at run-time, thus the data types defined there are unavailable. I’m sure there must be a better way to do this, but I simply know of none – the straight-forward, but ugly, include_once() at least fixed the situation. (Update: There are Horde5 updates on their way to fix this differently – but definitely they’re fixing it :))
    2. /usr/share/php5/PEAR/Horde/Group/Kolab.php
      It seems that along the way, some refactoring was done. The class named “Horde_Group_Kolab” is now called as “Horde_Core_Group_Kolab”, but no one changed its definition in this file. To make matters worse, the constructor didn’t establish the proper layout of the instance; I fixed the affected places by borrowing from “Ldap.php” in the same directory. The patch file contains both changes. (Update: There are Horde5 updates on their way to fix this differently – but definitely they’re fixing it :))
    3. $WEBROOT/nag/lib/Driver/Kolab.php
      Maybe our installation contains too old entries, but “nag” seems to have a problem with some of them (those with empty start and/or completion dates). Again, a simple fix will help with that. (Update: Once I re-created the preferences of my account (dating back to Horde1 times), this error could no longer be reproduced.)
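    To apply the attached patch files from the command line, something along these lines should work (the patch file names are made up – use the ones attached to this article, and check the path prefix with “--dry-run” first):

    # cd /usr/share/php5/PEAR
    # patch -p0 --dry-run < horde5-kolab-group-factory.patch
    # patch -p0 < horde5-kolab-group-factory.patch
    # patch -p0 < horde5-kolab-group.patch
    # cd $WEBROOT
    # patch -p0 < horde5-nag-kolab.patch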
  4. Set the proper ownership to all files and directories
    When you try to configure Horde via the web interface, the web server invokes all PHP routines to update the config files. To make those updates possible, you need to make sure that all config files (and probably the containing directories, too) are write-accessible to the user id running the web server.
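    A minimal sketch, assuming SUSE’s default Apache user “wwwrun” and the installation paths used above:

    # chown -R wwwrun:www /srv/www/htdocs/horde5/config
    # chown -R wwwrun:www /srv/www/htdocs/horde5/*/config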
  5. Configure
    Once the code is patched, you can start configuring the Horde installation by invoking the bundle configuration script (copied earlier) and then the web interface.

    # $WEBROOT/bin/kolab-webmail-install 
    
    Installing Horde Kolab Edition
    
    Configuring database settings
    
    What database backend should we use?                                                                          
        (false) [None]
        (mysql) MySQL / PDO
        (mysqli) MySQL (mysqli)
        (pgsql) PostgreSQL
        (sqlite) SQLite
    
    Type your choice []: mysql
    Request persistent connections?
        (1) Yes
        (0) No
    
    Type your choice [0]:
    
    Username to connect to the database as* [] horde5
    Password to connect with
    How should we connect to the database?
        (unix) UNIX Sockets
        (tcp) TCP/IP
    
    Type your choice [unix]: tcp
    
    Database server/host* [] mysql.company.com
    
    Port the DB is running on, if non-standard [3306]
    
    Database name to use* [] horde5
    
    Internally used charset* [utf-8]     
    Use SSL to connect to the server?
        (1) Yes
        (0) No
    
    Type your choice [0]: 0
    
    Certification Authority to use for SSL connections [] /etc/ssl/certs/CA_company.cert.pem
    Split reads to a different server?
        (false) Disabled
        (true) Enabled
    
    Type your choice [false]:
    
    Writing main configuration file... done.
    Configuring administrator settings
    
    Provide the host name of your Kolab server: kolab.company.com
    
    Provide the host name of your SMTP server (may be your Kolab server name): kolab.company.com
    
    Provide the host name of your LDAP server (may be your Kolab server name): kolab.company.com
    
    Provide the primary mail domain of your Kolab server: company.com
    
    Provide the base DN of your Kolab server: dc=company,dc=com
    
    Provide the PHP DN of your Kolab LDAP nobody user, relative to base DN (may
    be "cn=nobody,cn=internal,"): cn=horde,ou=virtual,ou=people,
    
    Provide the PHP pw of your Kolab LDAP nobody user: *******
    
    Writing main configuration file... done.
    
    Creating and updating database tables... done.
    Thank you for using Horde Kolab Edition!

    I made the following changes via the web administration interface (using Kolab’s “manager” account to log in to Horde):

    • General
      • adjusting the path values according to our server situation
    • Mailer
      • set the localhost information

    Of course there are plenty of other settings you may need or want to modify to adapt to your specific situation – I’ve only mentioned the changes that seemed necessary to get things up & running.

  6. Now you can save the configuration, create the database schemata etc as required… and are ready to rock!

Copyrights and disclaimer

You may use any information on this page without any license restrictions, for private, public and commercial projects. If you find any error or a better way to do things, you are kindly asked to leave a corresponding comment below and/or contact the Horde project about it.

Disclaimer: If you use anything from this page, it’s at your own risk. I specifically do not claim that the information provided here is free of errors nor that it is suitable for any type of use. Applying the changes described above may lead to data corruption, mail loss or worse. YMMV.


Failing a RAID disk drive via Areca web interface

Once in a while, there’s a disk in a RAID array that we believe to be “suspicious” and would like to mark as a failed drive. But using Areca’s web interface, there’s no obvious possibility to do so.

You may have already guessed: According to Areca’s FAQ section, there’s a hidden way to do so:

We have added the “FailDisk” hidden on the “Rescue Raid Set” function. The “FailDisk” function can kill off the member disk of RAID set that you enter on the Rescue Raid Set. Please use the “FailDisk Device Location” on the Rescue Raid Set. You can get the correctly “Device Location” from the area “Device Location” of the “Device Information”. Such as, you want to kill off the “Enclosure#2 SLOT 01” HDD. You can get the right disk channel from the Device-Information and enter “FailDisk Enclosure#2 SLOT 01”.

But when you look at the web interface of the Areca ARC-1261ML… where’s the “Device Information”, to look up the “Device Location”?

Let me translate that FAQ answer to the ARC-1261ML: You simply make use of the channel identification, as found on the “RaidSet hierarchy” page in the “IDE Channels” column and duplicated in the “Channel” column of the “IDE Channels” listing:

Raid Set Hierarchy

Raid Set     IDE Channels   Volume Set(Ch/Id/Lun)     Volume State   Capacity
Raid6-0000   Ch01           RAID6-0000v0000 (0/0/0)   Normal         45000GB
             Ch02

IDE Channels

Channel   Usage        Capacity   Model
Ch01      Raid6-0000   15000GB    MySuperDuperFastDisk
Ch02      Raid6-0000   15000GB    MySuperDuperFastDisk
Ch03      Raid6-0000   15000GB    MySuperDuperFastDisk
Ch16      N.A.         N.A.       N.A.

So all you have to do to mark e.g. disk #8 as “failed” is to enter “FailDisk Ch08” as the “keyword” in “RaidSet Functions” – “Rescue Raidset”, check the “Confirm Operation” box and submit the form.

Invoking the same function via the Areca CLI failed for me: According to the help message, it should be invoked as “disk fail drv=8”, but unfortunately the only response I receive is “GuiErrMsg: Invalid Parameter.”, even after setting the admin password first. (Of course I tried “drv=Ch08” as well, which gave a “ErrMsg: Drive# Is Missing.” response.)

Be careful with this command, you get what you ask for: The controller will mark the drive as “failed” and start to recover the RAID set. Not only may this take a long time (degrading your RAID’s performance during the rebuild) – it may be dangerous, too. So use this at your own risk and don’t toy with it in a productive environment!


Optimizing, it ain’t easy, but somebody’s got to do it!

Some problems can be solved by throwing big money at them. When it comes to storage performance, it could be spending your bucks on faster and/or more disks, or paying someone to optimize your environment.

For people like me, that’s like chickening out: I’d rather understand, learn and then fix it myself than hire one of the IT voodoo priests. And I don’t like spending big bucks when I can get by without ;).

This article is about optimization steps we took in our own production environment. It is not about absolute numbers (even if one or another may appear below), but rather about the screws you may adjust to get things to fit your own requirements. Tuning is about you and your system; you need to find out where you want to go. What works for us may not work for you, or may even turn things for the worse. But if you start to understand your options, you can get your set of screws turned to the right point.

Thankfully, there already are quite a few good articles on many of the details out there on the net, and I recommend reading them as well. I just decided to sum up our experiences in this article.

Starting point

It all starts with: problems. We’ve increasingly experienced a lack of “desktop responsiveness”, be it GUIs or CLIs. This was accompanied by sometimes high loads reported by our Xen servers (and the storage server) and increasing i/o waits as reported at the storage server itself.

If you have been browsing through this blog, you’ll already know that we’re using a storage server running open-e’s DSS software, currently still at version 6. It’s used to feed (in order of importance)

  • Fiber Channel LUNs for virtual machine disks
  • NFS shares via Ethernet
  • SMB shares via Ethernet
  • rsync services
  • iSCSI

iSCSI is currently only used for some tests, most of our traffic (90%) is Fiber Channel plus NFS.

Though it’s not overly important at the details level, here’s a general picture of the environment in question:

The storage server is a 64bit dual-“Intel Xeon” system with 4 GB of main memory, a QLogic 4Gbps Fibre Channel card, four Gigabit Ethernet ports and an Areca ARC-1261 RAID card with 12 1TB SATA disks (used as a single RAID6 volume set). It is currently running Open-E DSSv6 (but waiting to be upgraded to V7 ;)), mostly used as an NFS and Fibre Channel server.

A QLogic Fiber Channel switch is used to connect our (currently two productive) Xen servers to the storage server, forming a SAN. The FC switch is required as we use NPIV for the virtual machines, which is not available in non-switched environments without special support on all hosts.

Our Xen servers are typical Intel servers, with sufficient memory and more-than-needed CPU, using FC to access back-end storage for the virtual machines. (This is done by creating NPIV vHBAs per VM, which are only active when the VM is up, and only on the server where the VM is active.) Some OCFS2 (backed by a FC LUN) is used within the Xen servers to provide shared disk storage for the VM configuration files and Xen cluster locking mechanisms, NFS is used to access NAS-stored installation images and tools.

Throw in some more servers for various services, and you have the general idea. One especially mentionable server is the “backup server”, which uses SAN-based disk space as a cache before spooling the backup data to a tape library. Many resources are on NFS shares, which in turn are mounted on several, if not all, virtual servers and on the client systems, too.

IOPS difficulties

Once we got a stable DSS setup, it was only a matter of time until increasing workloads made performance tuning measures inevitable.

Our main problem seemed to be slow responses, sometimes with NFS, sometimes the virtual disks of the VM, sometimes both. Our first, rather un-educated guess was “add more spindles, mechanical disks are sooooo slooooooooow”. That’s how we got so many disks in our RAID… and you guessed it, they gave only short relief.

To cut a month-long story short: Our IOPS were increasing more and more, and so was our time spent waiting. We found many knobs to turn; they are mostly specific to a certain environment, and there are quite some inter-dependencies. Time to learn…

Design goal

(This chapter slipped in after our first approach, and with it my corresponding draft of this article, didn’t get us where we needed to go. We had hoped to improve the overall performance by tuning a few settings, but actually things got worse: We hadn’t thought it through to the end, which brought the problems back to us like a boomerang.)

Our first lesson learned the hard way: Find out what you’re trying to achieve.

In our case, this turned out to be “responsive reads” and not “quick writes to the RAID array”.

The key elements to our solutions were

  • a massive increase of the write cache (at the storage server OS level)
  • prioritization of reads at the device (RAID storage as seen by the storage server OS) level
  • provisioning of a sufficient number of NFS server threads

and after we were done with all that,

  • optimizing our client applications to kill unnecessary load

Of course it was helpful to improve the write speed of the RAID array and to fine-tune the various i/o schedulers and alike, but those first three above really made the difference at the SAN/NAS level and the latter gave things the finishing touch.

That first list came from looking at the technical requirements, imposed by our workload setup: We have a large number of virtual machines, all their virtual disks are hosted on the storage server, accessed via Fiber Channel. So that “SAN server” faces high numbers of small requests, scattered across the RAID (all those virtual disk LUNs). This calls for a RAID setup with small stripe size. On the other hand, the same server (the NAS part) feeds files from local file systems to many clients via NFS – this calls for moderate to large stripe sizes. We were unsure which way to go.

More important than the RAID stripe size proved to be a proper cache configuration on the storage server, at the OS level: With enough RAM cache, immediate writes to disk are less important than responsive reads. Of course our storage server is UPS-backed, so we decided to trade guaranteed integrity for speed.

Below are descriptions of individual tuning steps we made, as always, take these as a report of what worked for us. You’ll have to find out yourself what works best for you.

Fixing the RAID setup

The RAID came pre-configured and pre-initialized by the shop where we bought the system. They had heard “file server” and chose a moderate RAID stripe size of 64k bytes. That’s fine if you have (large) files to serve, but our system was mostly used for Fiber Channel with its traditional 512-byte blocks. When we noticed “60 to 100% i/o wait” at the storage server, even when moderately accessing the VMs’ disks or writing to NFS, we assumed this to be caused by the increasing number of FC LUNs accessed in parallel by all these virtual machines we have.

It was decided to change the stripe size to its minimum (4k), which can be done live with the Areca controller. At least with the CLI – our card’s BIOS (v1.47) still featured a bug where it didn’t detect that the “Confirm the operation” check box of the HTTP interface was actually selected and therefore aborted the requested operation. Luckily, the CLI program did its work just fine.

Changing the stripe size of our 5 TB volume set took us around two and a half days… and would have stopped us from working, hadn’t we been able to do some other optimizations, too. As the “migration job” was run in the background, we had set the background priority to “low”. Had we switched that to “medium” or even “high”, at least during off-peak hours, the migration might have completed much faster.

In the end, I believe it wasn’t worth the hassle: Having a rather mixed work load (partly calling for larger and partly calling for smaller stripe sizes), we didn’t notice any immediate improvements: You just can’t do it right.

Caching at the storage server

We had decided to give the storage server an ample amount of memory, which is mostly used as a file system cache. The obvious use case is NFS and SaMBa, traditional file-based services. But we also decided to use file-based back-ends to our Fibre Channel LUNs, which made them “files on XFS” from the OS point of view. That way, they could profit from the OS cache as well.

There’s a lot you can tune at the OS level, but a major thing to keep in mind is that Linux is distributed with workstations in mind. When you’re operating a server, especially a storage server, you most likely will want to change some of the default settings:

  • If you’re using volumes off a RAID controller, using the CFQ scheduler isn’t ideal. You’ll find recommendations to use “noop” instead, leaving all sorting etc. to the RAID controller, but our best results were with the “deadline” scheduler, which prioritizes reads while making sure writes don’t starve indefinitely. It doesn’t hurt to keep some dirty pages in the cache for a longer time, but clients will notice if their read requests take longer.
  • With tons of memory for caching (and a UPS, rounded up by some faith), you may consider to turn up the amount of cache available to dirty pages (“echo 80 > /proc/sys/vm/dirty_ratio”) and the time you’re willing to just keep it in memory, rather than writing it to the disk (“echo 1000 > /proc/sys/vm/dirty_writeback_centisecs; echo 6000 > /proc/sys/vm/dirty_expire_centisecs”).

There’s a pretty good round-up of the caching semantics and variables by Frank Rysanek, hosted at the “FCC prumyslove systemy s.r.o” site. I truly recommend reading that stuff!

Here’s what we ended up with: Our complete RAID is available through a single disk device, where we decided to use the “deadline” scheduling algorithm, set a read time-out of 250 ms and a write time-out of 60 seconds, and allowed reads to be preferred 10 times before a starved write request gets served:

# switch the RAID device from the default scheduler to "deadline"
echo deadline > /sys/block/sda/queue/scheduler
# serve reads within 250 ms, writes within 60 seconds
echo 250 > /sys/block/sda/queue/iosched/read_expire
echo 60000 > /sys/block/sda/queue/iosched/write_expire
# prefer reads up to 10 times before dispatching starved writes
echo 10 > /sys/block/sda/queue/iosched/writes_starved
# allow a much deeper request queue than the default
echo 16384 > /sys/block/sda/queue/nr_requests

To balance between write caching and write load balancing, the following values were set:

# let dirty pages use up to 80% of memory before writers must flush synchronously
echo 80 > /proc/sys/vm/dirty_ratio
# start background write-out only at that same 80% mark
echo 80 > /proc/sys/vm/dirty_background_ratio
# consider dirty data "old" after 60 seconds...
echo 6000 > /proc/sys/vm/dirty_expire_centisecs
# ...and wake the write-back threads every 10 seconds
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
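
Keep in mind that all these “echo” settings are volatile and lost on reboot. On a plain SUSE-style system you could re-apply them at boot time as sketched below; DSS stores such settings its own way, so take this as an illustration only:

# append the tuning commands to the boot-time script (plain SUSE systems)
cat >> /etc/init.d/boot.local <<'EOF'
echo deadline > /sys/block/sda/queue/scheduler
echo 80 > /proc/sys/vm/dirty_ratio
echo 80 > /proc/sys/vm/dirty_background_ratio
EOF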

NFS server tuning

Depending on the number of NFS clients, these may have to queue up to have their work (read & write requests) handled by the server. Some statistical information can be found in “/proc/net/rpc/nfsd”. Especially the line starting with “th” is of interest:

th 256 10711 1131.920 136.950 91.430 162.380 1577.840 5.980 3.570 4.940 3.670 60.900

The first number (“256”) states the number of NFS server threads available. The default value is 128, so the example already has that number doubled via “echo 256 > /proc/fs/nfsd/threads”. The second number (“10711”) states how many times all these threads have been busy at once (obviously, some more threads would have kept some clients from waiting). The last ten numbers state for how long (in seconds) “0 to 10%”, “10 to 20%”, …, “90 to 100%” of these threads have been busy. According to the sample numbers, some more threads seem desirable – but they come with a memory trade-off. And depending on the number of clients versus the number of threads, you may ask yourself *why* these threads were busy: Maybe it’d be better to increase the speed of your i/o subsystem, which would help the threads answer more quickly and be ready to serve new requests.

Like the initial rant about throwing money at a problem, increasing servers (or threads, in this case) may not be the solution, however obvious it may seem at first.

With DSS, the proper way to adjust that server thread number is described here – the tuning options are available via the console.

With our large number of VMs and a big handful of Linux stations and servers, all those making strong use of NFS mounts, the default number of NFS server threads was far too low to handle all the requests in time. We decided to increase that number to 256 and still see sufficient times where even that number of threads get used up.

echo 256 > /proc/fs/nfsd/threads
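
If you want to keep an eye on those numbers without decoding the whole line by hand, a small awk one-liner does the trick (field positions as described above):

awk '/^th/ { print "threads:", $2, " times all busy:", $3 }' /proc/net/rpc/nfsd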

Ethernet tuning

Lots of traffic on our SAN/NAS server is generated via NFS mounts, so looking into TCP/IP and Ethernet tuning seemed worthwhile. We’re using Intel network adapters on the storage server, so the information from e.g. NuclearCat’s Wiki came in very handy.

With server-grade network adapters, there is support for off-loading some tasks related to sending and receiving IP packets. Depending on the actual brand and model, there may be various things to fine-tune, too. The tool of choice this time is “ethtool”, which is most probably part of your Linux distribution.

You’ll run “ethtool” against the adapter(s) you’re actually using. If you have joined interfaces in a bond, you still have to run ethtool against all individual adapters and not against the bonding device. Changing settings via “ethtool” may temporarily bring down the interface, even when done correctly, so be prepared for short service interruptions. If you’re using a bond, I suggest you monitor it and hold off further interface updates until the bond has settled. And while it worked for us, I’ve seen reports on the net that some Ethernet drivers react strangely to changes of bonded adapters, so YMMV.

Using  “ethtool -k eth0”, you can check the current offloading settings. We decided to activate all forms of support, except for “UDP fragmentation offload” (there’s a bug in kernel 2.6.32 for sure, corrupting NFS sessions, and I’m not certain it isn’t in earlier kernels, too.)

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
#
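
Individual offloads can be toggled via “ethtool -K” (upper-case -K sets, lower-case -k only prints). This is a sketch of our target state, with UDP fragmentation offload explicitly kept off because of the kernel bug mentioned above:

# ethtool -K eth0 rx on tx on sg on tso on gso on
# ethtool -K eth0 ufo off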

Another thing to check is adapter statistics, via “ethtool -S eth0”. Some values worth looking at (from the tuning point of view) are “rx_no_buffer_count”, “tx_restart_queue” and “alloc_rx_buff_failed”:

# ethtool -S eth0
NIC statistics:
     rx_packets: 3804176178
     tx_packets: 3211333743
     rx_bytes: 2550877840562
     tx_bytes: 1454273624765
     rx_broadcast: 26420944
     tx_broadcast: 198774
     rx_multicast: 302601
     tx_multicast: 295502
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 302601
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 7866
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 602502
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 185624548
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 2550877840562
     rx_csum_offload_good: 3776185645
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 27622129
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0
#

Since we saw some missing buffers, we decided to increase the receive ring buffer size from 512 to 2048 entries:

ethtool -G eth0 rx 2048
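
Before changing the value, “ethtool -g” shows both the adapter’s maximum and the currently set ring sizes, so you can see how much headroom there is:

ethtool -g eth0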

Another topic that is worth looking at: “jumbo frames”. Those are Ethernet frames with a size larger than the standard 1,500 octets – giving you a better data/overhead ratio throughout the whole transmission, at least if your read/write payload exceeds that size. But as it is non-standard, all components have to support them, and that’s something we’d like to make sure of before implementing this feature in our production network. So bringing “jumbo frames” into production will have to wait until we find some spare time to test things thoroughly.
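Once we do get around to it, a first smoke test could look like this (a 9000-byte MTU and the host name “storageserver” are stand-ins; 8972 is 9000 minus 20 bytes of IP and 8 bytes of ICMP header):

ip link set eth0 mtu 9000
ping -M do -s 8972 -c 3 storageserver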

SCST tuning

There’s still the “old” SCST software active on DSS V6, notably without “sysfs” administration. But even with the “procfs” implementation you can fine-tune SCST: Depending on the number of Fiber Channel clients, the standard number of worker threads might easily be too low to handle all requests concurrently. So if you, like us, have tons of VMs and each has its own vHBA, then you may gain some responsiveness by increasing the number of SCST worker threads. One per vHBA seems a good maximum, so check your “/proc/scsi_tgt/sessions” and increase “/proc/scsi_tgt/threads” accordingly.

# wc -l /proc/scsi_tgt/sessions
36 /proc/scsi_tgt/sessions
# echo 40 > /proc/scsi_tgt/threads
#

Like many things, this comes with a memory penalty – if you’re short on memory for caching, increasing the number of threads will most probably do no good, unless your disk subsystem is faster than your write requests ;).

Hardware tuning

While at it, we took a look to the left and the right as well. One thing that caught the eye was the extremely uneven distribution of interrupts across the CPUs/cores of the storage server for the RAID card driver. (This could be seen when looking at the contents of “/proc/interrupts”, at the line for “arcmsr”. Only a single core listed interrupts for this controller.)

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:       3844       3859       2828       3813       3889       3918       3910       3620   IO-APIC-edge      timer
  1:          1          4          1          1          2          1          1         30   IO-APIC-edge      i8042
  4:       1799         10  168446240         24         33         32         32         32   IO-APIC-edge      serial
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 12:         17         16         16         16         17         17         16         16   IO-APIC-edge      i8042
 14:         20         19         20         20         18         19         22         21   IO-APIC-edge      ide0
 15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
 18:          0          0          0          0          0          0  135219273          0   IO-APIC-fasteoi   arcmsr
 [...]
#

We decided to manually permit distribution across all CPUs, by changing the affinity mask of the card’s IRQ:

# echo ff > /proc/irq/18/smp_affinity
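
To verify the mask took and to watch the counters spread, just re-check the two proc files (note that a running irqbalance daemon may overwrite manual affinity settings):

# cat /proc/irq/18/smp_affinity
# grep arcmsr /proc/interrupts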

Now the interrupts distribute equally across all cores. But did performance improve? I actually cannot tell any difference… so most probably, this has only a small (if any) effect.

Applications and users

After applying all the above changes, the most noticeable impact was achieved by analyzing the remaining load and identifying those processes that created most of it… we found two scripts that did their job quite badly. To make matters worse, one of these ran every quarter of an hour during daytime and kept quite a few NFS server threads busy. With a simple change inside the script, its run time dropped from several tens of seconds to under a second – most of that excess time was originally spent waiting for the NFS server to respond. Nasty.

After optimizing these scripts, we were back to statistical values we hadn’t seen in years. Tuning the system itself will definitely have helped, but running high-load clients is likely to bring down your infrastructure even if well optimized ;).

Conclusion

This article, unlike many available on the net, isn’t about absolute performance numbers, nor did we start our optimization with these in mind. I wanted to point out some of the areas that are worth looking at, with some hints at how we did it, and why. Optimization always has to start with a bottleneck analysis, and the measures have to fit the cause and are very specific to the technical environment. So any advice will probably not only be imprecise, but is potentially dangerous when implemented without thinking it over. Be careful!

Currently, we seem to be running fairly well with our newly tuned values. The storage server “load” (processes waiting for I/O, mostly nfsd and SCST threads in our case) has dropped to values mostly below 4 during work hours, which corresponds to the good level of responsiveness felt at the client stations. I/O waits, as reported by “vmstat”, are considerably lower and more often below a value of 5 (percent) than not. So to us this tuning effort has been a success, and hopefully yours will be too!

 
