You might think I’m a lazy guy for not continuing with my articles on the bcache-assisted SAN server. But believe me, it’s far from that: shortly after putting the SAN server into production, we ran into serious trouble. One of our two servers would reboot at random, both had reports of failed disks at least once a week, and sometimes the machines would just hang for a couple of seconds.
Please note that this article was updated, as the steps originally described within only reduced the risk of hitting the problem, rather than fixing the issue!
We tried a lot of things to get stable servers: upgrading the OS, switching to special kernels, blaming bcache and more. But in the end, it turned out to be the servers themselves that caused these symptoms.
The symptoms
The most obvious problems were when the servers rebooted out of the blue. We had no kernel crash reports, and the SEL had no entries indicating anything out of order. There was no obvious timing issue, no specific load scenario. It was like lightning out of a blue sky killing our servers – now you see ’em, now you don’t. Most of the time the reboot went well and needed no manual intervention on the SAN server, but it never was fun for the dependent server systems and clients.
IIRC, it all started with a reboot not so random: We had created a new logical volume and wanted to create a file system on it – something we had done various times on the exact same server. But this time, it caused a reboot. Reproducibly!
Then there were many reports from MD-RAID about failing disks. Re-adding these disks, even immediately, was possible, and after a RAID resync things would go on for a while again – until the next strike. I remember a particularly bad one: Saturday night, I had a really bad connection for remote access, and only an hour after the first failure report came in via email, I received the notification that another disk had failed. This was on the RAID6 array, so not all was lost, but I wasn’t much fun that night (my apologies to the party’s host!).
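For the record, getting such a “failed” disk back into service was nothing more than the usual mdadm routine; the array and device names below are placeholders for whatever member got kicked:

# see which member the array kicked out
mdadm --detail /dev/md0
# drop the faulty entry and put the disk back in; a resync follows
mdadm /dev/md0 --remove /dev/sdd1
mdadm /dev/md0 --re-add /dev/sdd1
# watch the resync progress
cat /proc/mdstat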
On very rare occasions, we experienced actual server hangs, requiring a power cycle to the server systems. I do remember one in those weeks, but it may have been two.
On top of that, the system rather often (sometimes multiple times a day) felt sluggish, as if it was about to hang.
Since we have a rather complex setup on these SAN servers, “syslog” had tons of messages for all these incidents. One common denominator was that the subsystems reported some sort of time-out, or at least gave the impression that something along those lines had happened. Among all those follow-up messages about missing DRBD updates, SCST command aborts and the like, we spotted a few message lines in syslog that pointed at a SCSI problem, too:
ndesan01 kernel: [569626.956074] sd 10:0:3:0: attempting task abort! scmd(ffff880037054170)
ndesan01 kernel: [569626.956080] sd 10:0:3:0: [sdd] CDB:
ndesan01 kernel: [569626.956083] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
ndesan01 kernel: [569626.956097] scsi target10:0:3: handle(0x000d), sas_address(0x5003048001a08b83), phy(3)
ndesan01 kernel: [569626.956100] scsi target10:0:3: enclosure_logical_id(0x5003048001a08bbf), slot(3)
ndesan01 kernel: [569627.338075] sd 10:0:3:0: task abort: SUCCESS scmd(ffff880037054170)
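If you want to check your own logs for that pattern, something along these lines will give you a quick count per day (the log path and the awk field positions may need adjusting for your distribution’s syslog format):

# count "task abort" events per day in the kernel log
grep "attempting task abort" /var/log/messages | awk '{print $1, $2}' | sort | uniq -c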
Please note that it wasn’t always the same disk that was affected, and it didn’t restrict itself to HDDs or SSDs – any of them could be struck. And not all of these “events” led to reboots or to MD-RAID reporting bad disks.
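To figure out which block device the sas_address from such a message actually belongs to, lsscsi can help; the output line below is only a sketch of what it roughly looks like:

lsscsi --transport
# [10:0:3:0]   disk   sas:0x5003048001a08b83   /dev/sdd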
Things that didn’t help
Throughout the diagnosis, we tried plenty of approaches. As we had similar servers, minus “bcache”, active for months, it was natural to blame that new element. Actually, we found a number of patches on the mailing list, and even today there’s an open (but probably fixed) issue that might lead to system instability. But to make a long story short: despite all the problems in bcache’s code (at least in the version we originally used), bcache wasn’t the culprit.
So maybe it was a problem with one of the other key elements, MD-RAID, DRBD, or SCST? Probably some interaction between these? The kernel?
Looking closer, it seemed unlikely that anything would ever work at all, given the amount of bug fixes rolling in via the respective mailing lists. We chose to upgrade both the kernel (to 3.18.8) and DRBD (the tools are now 8.9.2, the kernel module is 8.4.6). This helped a lot – with issues completely unrelated to the hangs and reboots. Nope, wrong track again.
And so it went on. We even asked SuperMicro for the current BIOS and the corresponding change log. Nothing there caught our eye, and SuperMicro’s support said the same.
Oh, their support team did notice something sub-optimal in our setup: the Samsung SSD 850 Pro is considered “consumer grade” (even by Samsung), and SuperMicro strongly advised against these drives. As the SSDs were by far the most affected by the “false RAID disk failures”, it did look like a good catch and we decided to fight the budget dragon. The fight was fierce, we returned with lots of scratches, and with four brand-new TOSHIBA PX02SMF020 SSDs. Man, these babies are fast… being SAS probably helps, too. The first MD-RAID1 cache rebuild (“850 Pro” copied to “PX02”) took about twice as long as the final rebuild (“PX02” to “PX02”).
But in the end, we still had our disk failures and reboots. Interestingly, now only the HDDs were reported as failing. And the common denominator of all failing components was – SATA.
The (update: not so) final solution
After getting some support from the SCSI gurus at SCST, we contacted SuperMicro again and asked if anything was known to be wrong with the SAS expander built into the server backplane: the server board (“X10DRT-P”) comes with a SAS/SATA chipset that will drive a number of locally attached disks, but not all 12 disks you can mount in the drive bays. Hence SuperMicro decided to add in an LSI3008 SAS expander, which by the book should be able to drive both SAS and SATA devices. And of course it does, else we wouldn’t have been able to build our server with SATA disks in the first place.
The support team immediately focused on the SAS firmware. The servers were shipped to us in May this year, so we wouldn’t have expected any major updates in that area. From the boot log we saw that we had the v6 firmware active on both servers:
mpt3sas0: LSISAS3008: FWVersion(06.00.00.00), ChipRevision(0x02), BiosVersion(08.13.00.00)
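By the way, you don’t have to dig through the boot log to get at these numbers: at least with the mpt3sas driver versions we were running, the same information should be available via sysfs (the host number is of course specific to your setup):

# firmware and BIOS version as seen by the mpt3sas driver
cat /sys/class/scsi_host/host10/version_fw
cat /sys/class/scsi_host/host10/version_bios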
Interestingly, that v6 firmware somehow made the support folks cringe, as they had a much more recent version at hand: 08.00.00.00. Judging by the file name, that version is from July 2015, and jumping two major versions on a three-month-old system is something that makes me cringe, too.
We applied the new firmware to one server, and the next day to the other, without any problems. The installer will ask you for a controller address – this is the SAS controller address, not the SAS expander address. I cannot tell whether giving the wrong address will harm the system or simply lead to an error message. After rebooting our servers, we could see the new firmware reported, and the BIOS version went up a bit as well:
mpt3sas0: LSISAS3008: FWVersion(08.00.00.00), ChipRevision(0x02), BiosVersion(08.17.00.00)
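For reference: I can’t recall whether the SuperMicro package wrapped Avago’s sas3flash or brought its own installer, but with the plain sas3flash tool the equivalent steps would look roughly like this (the image file names are placeholders for whatever your download contains):

# list all controllers with their current firmware and SAS addresses
sas3flash -listall
# flash the new firmware and the matching BIOS image to controller 0
sas3flash -c 0 -f 3008IT08.bin -b mptsas3.rom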
Since that update, we have not had a single reported problem with our disks (update: at least for quite a few days).
On a side note: without changing the access patterns, we noticed in our monitoring that I/O wait went down by 50% on the server we updated first. On the other hand, this wasn’t reproducible on the second server, which has a very different access pattern. So YMMV.
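The 50% figure came from our regular monitoring; if you just want to eyeball the same effect, iostat from the sysstat package is enough:

# extended per-device statistics every 5 seconds; watch the await and %util columns
iostat -x 5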
Update: Checking the disks
After two weeks, we were struck by the error again. As SuperMicro didn’t really want to pay any attention to the problem and insisted that their servers were working correctly, we contacted Western Digital and were offered the option to hand in the internal logs of the disks that had been reported as failed.
After only a few days, the technicians came back with their (not so unexpected) results: the disks did not show any signs of hangs, read errors or the like. And after all, these disks were “RAID-ready”, thanks to their support for TLER:
sanserver:~ # smartctl -l scterc /dev/sda
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-4.2.3-1.gef1562d-default] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read: 70 (7.0 seconds)
          Write: 70 (7.0 seconds)
sanserver:~ #
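If one of your disks reports SCT ERC as disabled, it can usually be switched on with the same tool (70 means 7.0 seconds); note that on many consumer drives the setting does not survive a power cycle, so it would have to go into a boot script:

# limit read/write error recovery to 7 seconds each
smartctl -l scterc,70,70 /dev/sda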
The identical disks run fine in a similar server that comes without a SAS expander – week after week, with very similar load patterns.
I’m impressed by Western Digital’s quick response to our contact request and by their eagerness to help us diagnose the problem. Thank you, WD!
Update: The final work-around
As the behavior affected our production servers, we bit the bullet and exchanged the disks for SAS versions. The vendor had offered to exchange them for SATA “enterprise grade” disks mentioned on SuperMicro’s compatibility list for the server, and to take those back if we could reproduce the problems, but we had spent too much time on this already and didn’t want to see more interruptions of our daily work or even put our data at risk.
That setup has now been running for weeks without any problems, showing that the SAS expander can work fine with SAS disks. Big surprise.
And SuperMicro still hides behind their list of certified disks.
Update: The final solution
In September 2017, I was pointed to the LSI3008 firmware release notes for a completely unrelated issue. But one reported issue caught my eye while scanning the docs:
ID: SCGCQ01310561 (Port Of Defect SCGCQ01085128)
Headline: IO Timeout when running IOs and TMs
Description Of Change: When an internal TM completes and there are no resources available to send the completion event to the driver, code was added to send the event when resources become available.
Issue Description: After running a workload of IOs and random TMs, all IO to one device will stop for more than 30 seconds, causing the test to fail.
Steps To Reproduce: Run IOs to a topology with large queue depths and large IO sizes while issuing random TMs to devices such that the controller runs out of resources.
SuperMicro had adapted their BIOS for our server systems, and after installing the included firmware, our combination of LSI 3008, SAS expander and SATA disks no longer showed any symptoms of disks suddenly being thrown out. As a matter of fact, we seem to have had stable server systems ever since.