After moving our Xen cluster to new servers, we noticed a strange behavior: while the first VMs started immediately, VMs started later (once about 20 VMs were running) would not come up: the NPIV-configured LUNs could not be attached in Dom0.
Or, to be more precise: they could not be activated in a timely fashion. We saw in syslog that Xen created the NPIV vHBA; 10 seconds later, the Xen time-out struck and DomU creation was aborted. And still a few seconds later, we could see in syslog that the LUN was actually detected and reported to the system – only too late. We were able to isolate the behavior to plain vHBA creation; Xen was not part of the picture (except for that 10-second time-out, which seems to be hard-coded – bad design, IMO).
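To take Xen out of the picture, a vHBA can be created by hand through the FC transport's vport_create attribute in sysfs. A minimal sketch, assuming a qla2xxx host named host6 and made-up WWPN/WWNN values (substitute your own):

```shell
# Create an NPIV virtual port by hand, bypassing Xen entirely.
# The vport_create attribute expects "WWPN:WWNN", each as 16 hex digits.
create_vport() {
    # $1 = fc_host name, $2 = WWPN, $3 = WWNN,
    # optional $4 = sysfs root (parameterized for dry runs)
    root=${4:-/sys/class/fc_host}
    echo "$2:$3" > "${root}/$1/vport_create"
}

# Example (hypothetical WWNs) -- afterwards, watch syslog for the
# "vport-... created" message and time how long the LUNs take to appear:
#   create_vport host6 2101001b32a90001 2001001b32a90001
```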
The symptoms
With only a few vHBAs configured, just a few seconds passed between adapter creation and LUN detection:
Jul 1 20:15:20 xxx102 kernel: [ 666.797439] qla2xxx [0000:04:00.0]-7080:9: VP entry id 2 assigned.
Jul 1 20:15:20 xxx102 kernel: [ 666.797445] scsi9 : qla2xxx
Jul 1 20:15:20 xxx102 kernel: [ 666.799171] scsi host6: vport-6:0-1 created via shost6 channel 0
Jul 1 20:15:23 xxx102 kernel: [ 669.797359] scsi 9:0:0:0: Direct-Access SCST_FIO PN9kCLUa2RlAK25O 300 PQ: 0 ANSI: 6
Jul 1 20:15:23 xxx102 kernel: [ 669.797656] sd 9:0:0:0: Attached scsi generic sg4 type 0
Jul 1 20:15:23 xxx102 kernel: [ 669.797927] sd 9:0:0:0: [sde] 41873408 512-byte logical blocks: (21.4 GB/19.9 GiB)
Jul 1 20:15:23 xxx102 kernel: [ 669.797932] sd 9:0:0:0: [sde] 4096-byte physical blocks
Jul 1 20:15:23 xxx102 kernel: [ 669.798317] scsi 9:0:0:1: Direct-Access SCST_FIO qpexjedY4FRGwTpG 300 PQ: 0 ANSI: 6
Jul 1 20:15:23 xxx102 kernel: [ 669.798540] sd 9:0:0:1: Attached scsi generic sg5 type 0
Jul 1 20:15:23 xxx102 kernel: [ 669.799180] sd 9:0:0:0: [sde] Write Protect is off
Jul 1 20:15:23 xxx102 kernel: [ 669.799186] sd 9:0:0:0: [sde] Mode Sense: 83 00 10 08
Jul 1 20:15:23 xxx102 kernel: [ 669.799203] sd 9:0:0:1: [sdf] 209539072 512-byte logical blocks: (107 GB/99.9 GiB)
Jul 1 20:15:23 xxx102 kernel: [ 669.799207] sd 9:0:0:1: [sdf] 4096-byte physical blocks
Jul 1 20:15:23 xxx102 kernel: [ 669.799326] sd 9:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
You’ll notice the three-second delay after the “vport-6:0-1 created” message.
But once the server (or the fabric?) got full of vHBAs, that time went up significantly:
Jul 1 20:31:02 xxx102 kernel: [ 1608.961533] qla2xxx [0000:04:00.0]-7080:21: VP entry id 7 assigned.
Jul 1 20:31:02 xxx102 kernel: [ 1608.961537] scsi21 : qla2xxx
Jul 1 20:31:02 xxx102 kernel: [ 1608.967112] scsi host6: vport-6:0-13 created via shost6 channel 0
Jul 1 20:31:14 xxx102 kernel: [ 1620.987787] scsi 21:0:0:0: Direct-Access SCST_FIO NqQuMjRlCRellleH 300 PQ: 0 ANSI: 6
Jul 1 20:31:14 xxx102 kernel: [ 1620.988142] sd 21:0:0:0: Attached scsi generic sg11 type 0
Jul 1 20:31:14 xxx102 kernel: [ 1620.989090] sd 21:0:0:0: [sdl] 41873408 512-byte logical blocks: (21.4 GB/19.9 GiB)
Jul 1 20:31:14 xxx102 kernel: [ 1620.989096] sd 21:0:0:0: [sdl] 4096-byte physical blocks
Jul 1 20:31:14 xxx102 kernel: [ 1620.989794] sd 21:0:0:0: [sdl] Write Protect is off
Jul 1 20:31:14 xxx102 kernel: [ 1620.989801] sd 21:0:0:0: [sdl] Mode Sense: 83 00 10 08
Jul 1 20:31:14 xxx102 kernel: [ 1620.989920] sd 21:0:0:0: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jul 1 20:31:15 xxx102 kernel: [ 1621.614299] sdl: sdl1 sdl2
Jul 1 20:31:15 xxx102 kernel: [ 1621.615296] sd 21:0:0:0: [sdl] Attached SCSI disk
Twelve seconds – too much for Xen’s time-out of 10 seconds.
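The delay is easy to quantify from the kernel timestamps; here is a small sketch (POSIX shell plus awk) that reads such log lines on stdin and prints the seconds between the “created via shost” message and the first “Direct-Access” LUN report:

```shell
# Compute the vHBA-creation-to-LUN-detection delay from kernel log lines.
# Relies on the bracketed kernel timestamps, e.g. "[ 1608.967112]".
vhba_lun_delay() {
    awk -F'[][]' '
        /created via shost/            { start = $2 + 0 }
        /Direct-Access/ && start != "" { printf "%.0f\n", $2 - start; exit }
    '
}

# Usage: dmesg | vhba_lun_delay
```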
Our first guess was load on the server, but we couldn’t spot anything significant. So we turned to the HBA side of things and enabled verbose logging, which flooded our syslog with blocks of (error) messages like the following during that delay:
Jun 30 16:31:32 xxx102 kernel: [77989.423240] qla2xxx [0000:04:00.0]-2824:11: Failed management_server login: loopid=c mb[0]=4008 mb[1]=0 mb[2]=5 mb[6]=7d8f mb[7]=8000.
Jun 30 16:31:32 xxx102 kernel: [77989.496258] qla2xxx [0000:04:00.0]-2865:11: Device wrap (020300).
Jun 30 16:31:32 xxx102 kernel: [77989.496267] qla2xxx [0000:04:00.0]-2869:11: LOOP READY.
Jun 30 16:31:32 xxx102 kernel: [77989.496271] qla2xxx [0000:04:00.0]-286b:11: qla2x00_configure_loop: exiting normally.
Jun 30 16:31:32 xxx102 kernel: [77989.496276] qla2xxx [0000:04:00.0]-481b:11: Loop resync end.
Jun 30 16:31:32 xxx102 kernel: [77989.496286] qla2xxx [0000:04:00.0]-481a:13: Loop resync scheduled.
Jun 30 16:31:32 xxx102 kernel: [77989.497567] qla2xxx [0000:04:00.0]-8837:13: F/W Ready - OK.
Jun 30 16:31:32 xxx102 kernel: [77989.497597] qla2xxx [0000:04:00.0]-883a:13: fw_state=3 (3, 8000, 0, 4) curr time=101285ca7.
Jun 30 16:31:32 xxx102 kernel: [77989.497604] qla2xxx [0000:04:00.0]-2814:13: Configure loop -- dpc flags = 0xa0.
Jun 30 16:31:32 xxx102 kernel: [77989.568864] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 2000001b32125F50 pn 2100001b32125F50 portid=010100.
Jun 30 16:31:32 xxx102 kernel: [77989.570399] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 2000001b3212F983 pn 2100001b3212F983 portid=010200.
Jun 30 16:31:32 xxx102 kernel: [77989.571974] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 200000e08b1261e6 pn 210000e08b1261e6 portid=010300.
Jun 30 16:31:32 xxx102 kernel: [77989.573601] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 500143802668E609 pn 500143802668E608 portid=010400.
Jun 30 16:31:32 xxx102 kernel: [77989.575143] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 29011100aa122001 pn 29001100aa122001 portid=010401.
[... quite some more of our vHBA nodes & ports ...]
Jun 30 16:31:32 xxx102 kernel: [77989.630658] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 29011100aa013055 pn 29001100aa003055 portid=01052d.
Jun 30 16:31:32 xxx102 kernel: [77989.632242] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 500143802668E825 pn 500143802668E824 portid=010700.
Jun 30 16:31:32 xxx102 kernel: [77989.633860] qla2xxx [0000:04:00.0]-2858:13: GID_PT entry - nn 500143802668E869 pn 500143802668E868 portid=020300.
At some point this comes to an end: you see “Async-login” initiation messages (some marked “failed”), then the corresponding “Async-login complete” messages (and “Assigning new loopid” for those that failed earlier), and finally the LUN becomes available:
Jun 30 16:31:40 xxx102 kernel: [77997.346424] qla2xxx [0000:04:00.0]-5836:84: Async-login complete - hdl=1a portid=010300 iop0=3a2.
Jun 30 16:31:40 xxx102 kernel: [77997.346441] qla2xxx [0000:04:00.0]-287d:84: FCPort state transitioned from UNCONFIGURED to LOST - portid=010300.
Jun 30 16:31:40 xxx102 kernel: [77997.346444] qla2xxx [0000:04:00.0]-2867:84: Port login retry 210000e08b1261e6, id = 0x0147 retry cnt=16.
Jun 30 16:31:40 xxx102 kernel: [77997.346477] qla2xxx [0000:04:00.0]-287d:84: FCPort state transitioned from LOST to ONLINE - portid=020300.
Jun 30 16:31:40 xxx102 kernel: [77997.346774] qla2xxx [0000:04:00.0]-287d:84: FCPort state transitioned from LOST to ONLINE - portid=010200.
Jun 30 16:31:40 xxx102 kernel: [77997.347041] qla2xxx [0000:04:00.0]-287d:84: FCPort state transitioned from UNCONFIGURED to ONLINE - portid=010100.
Jun 30 16:31:40 xxx102 kernel: [77997.347329] qla2xxx [0000:04:00.0]-287d:84: FCPort state transitioned from LOST to ONLINE - portid=010300.
Jun 30 16:31:40 xxx102 kernel: [77997.347590] qla2xxx [0000:04:00.0]-4818:84: Relogin needed scheduled.
Jun 30 16:31:40 xxx102 kernel: [77997.347592] qla2xxx [0000:04:00.0]-4819:84: Relogin needed end.
Jun 30 16:31:40 xxx102 kernel: [77997.347594] qla2xxx [0000:04:00.0]-4800:6: DPC handler sleeping.
Jun 30 16:31:40 xxx102 kernel: [77997.349433] scsi 84:0:0:0: Direct-Access SCST_FIO NIGkSLNLlbZGY17q 300 PQ: 0 ANSI: 6
Jun 30 16:31:40 xxx102 kernel: [77997.349693] sd 84:0:0:0: Attached scsi generic sg26 type 0
Jun 30 16:31:40 xxx102 kernel: [77997.350350] sd 84:0:0:0: [sdaa] 20916224 512-byte logical blocks: (10.7 GB/9.97 GiB)
Jun 30 16:31:40 xxx102 kernel: [77997.350353] sd 84:0:0:0: [sdaa] 4096-byte physical blocks
Jun 30 16:31:40 xxx102 kernel: [77997.350729] sd 84:0:0:0: [sdaa] Write Protect is off
Jun 30 16:31:40 xxx102 kernel: [77997.350733] sd 84:0:0:0: [sdaa] Mode Sense: 83 00 10 08
Jun 30 16:31:40 xxx102 kernel: [77997.350863] sd 84:0:0:0: [sdaa] Write cache: enabled, read cache: enabled, supports DPO and FUA
Jun 30 16:31:40 xxx102 kernel: [77997.351882] sdaa: sdaa1 sdaa2 sdaa3 sdaa4
Jun 30 16:31:40 xxx102 kernel: [77997.352777] sd 84:0:0:0: [sdaa] Attached SCSI disk
Resolution attempts
As soon as we realized that the problem lay in the communication between HBA and FC fabric (and the fabric log did not point out any specific problem), we focused on the HBAs’ firmware. The cards in question are HP 81Q (part number AK344A), basically rebranded QLogic QLE2560 cards. Finding information on the currently supported firmware level on HP’s support site was not an easy task, so we contacted HP’s phone support in Germany. Once we had found our way to the actual support engineers (we use these HP cards in non-HP servers, which made it a bit more difficult than anticipated), we received very friendly and competent support (despite the typically bad line quality of centralized support centers with meager VoIP channels):
When using SLES (or other OSes with similar drivers), the firmware is downloaded to the adapter by the OS driver. Included with SLES11SP3 (currently the latest version) is a firmware file for version 5.08:
<4>[ 6.916266] qla2xxx [0000:04:00.0]-00fc:6: ISP2532: PCIe (5.0GT/s x8) @ 0000:04:00.0 hdma- host#=6 fw=5.08.00 (90d5).
So we downloaded a newer firmware file from QLogic, put it in place as /lib/firmware/ql2500_fw.bin (after moving aside the original ql2500_fw.bin and ql2500_fw.bin.sig files), ran “mkinitrd” and rebooted the system:
<4>[ 7.393623] qla2xxx [0000:04:00.0]-00fc:6: ISP2532: PCIe (5.0GT/s x8) @ 0000:04:00.0 hdma- host#=6 fw=7.03.00 (90d5).
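The file shuffling described above can be scripted; a sketch, keeping the original files around for an easy rollback (only ql2500_fw.bin and ql2500_fw.bin.sig are names from the actual system, the rest are assumptions):

```shell
# Swap in a new qla2xxx firmware image, preserving the SLES originals.
swap_qla_firmware() {
    fwdir=$1    # normally /lib/firmware
    new_fw=$2   # the firmware file downloaded from QLogic
    mv "$fwdir/ql2500_fw.bin"     "$fwdir/ql2500_fw.bin.orig" &&
    mv "$fwdir/ql2500_fw.bin.sig" "$fwdir/ql2500_fw.bin.sig.orig" &&
    cp "$new_fw" "$fwdir/ql2500_fw.bin"
}

# On the real system (then reboot so the driver loads the new image):
#   swap_qla_firmware /lib/firmware ./ql2500_fw-new.bin && mkinitrd
```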
To make a long story short: it didn’t bring any noticeable improvement. On the other hand, moving to a more current firmware doesn’t seem to be a bad choice, at least as long as we don’t experience problems with the driver distributed with SLES11SP3 and don’t need “secure boot” (we have no signature file for the new firmware, so secure-booting the system is bound to fail).
The actual resolution
So if changing the HBA side of things doesn’t provide immediate help, you start looking at the other end of the link: the Fibre Channel switch. As already mentioned above, the logs of our SANbox 5602s didn’t provide obvious clues, but looking at the available port settings, two options caught our eye: “I/O Streamguard” and “device scan”. “I/O Streamguard” defaulted to “auto-detect”, and “device scan” was turned on for all FC ports.
We had noticed earlier that during the many seconds between vHBA activation and LUN detection, many calls concerning node and port names were made to the fabric. Searching the net, we found a few documents that, while not related to our specific situation, recommend explicit configuration settings:
- For “initiator” ports, “I/O Streamguard” should be turned on, while “device scan” should be turned off.
- For “target” ports, it’s just the other way around: “I/O Streamguard” should be turned off, while “device scan” should be turned on.
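On a SANbox 5602, these per-port settings can be changed from the switch’s admin CLI. The sketch below is from memory of the SANbox CLI guide; command and prompt names (such as “IOStreamGuard” and “DeviceScanEnabled”) and their exact spelling should be verified against the CLI reference for your firmware level:

```
admin start
config edit
set config port 4                # an initiator port; the number is an example
  ...                            # answer the interactive prompts, among them:
  IOStreamGuard     (Enable / Disable / Auto) : Enable
  DeviceScanEnabled (True / False)            : False
config save
config activate
admin end
```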
We were able to change the settings of our Fibre Channel switch ports during live operation, and since then the long delays during vHBA activation have been a thing of the past!
In addition, monitoring the servers over the following hours showed that their load dropped significantly and all disk-bound operations were much faster, probably because those errors in communicating with the fabric occurred not only during vHBA activation, but spilled over into normal operation as well.
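A quick way to keep an eye on this is to count the qla2xxx fabric-login noise in the kernel log; a small sketch, with the grep patterns taken from the verbose-logging messages shown earlier:

```shell
# Count qla2xxx login/resync error noise in kernel log lines read from stdin.
# After the switch-port change, this should stay near zero.
qla_noise_count() {
    grep -c -E 'qla2xxx.*(Failed management_server login|Loop resync)'
}

# Usage: dmesg | qla_noise_count
```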
Some background information
Interestingly, while the “I/O Streamguard” setting is documented quite explicitly (“I/O Stream Guard prevents disruption by suppressing the communication of Registered State Change Notifications (RSCNs) between initiators”) and in some detail in a QLogic PDF, we’ve found no details on the “device scan” option. It is only mentioned in QLogic documentation as:
The Device Scan feature queries the connected device during login for FC-4 descriptor information. Disable this parameter only if the scan creates a conflict with the connected device.
So it’s hard to tell what we might be missing since we have turned it off.
Conclusion
If you happen to see similar symptoms, I’d strongly recommend checking your FC switch for these or similar settings!