Since last month, our Nexenta-based storage cluster has been deployed, and I have now moved production data onto it.

We took a bump and a bruise last weekend (I had already sent out an announcement about that one), and yesterday things burped again.

The problem? It looks like the 2 mirrored boot drives of the first head unit are … bad. (Each head manages its own volumes, and HA is used to make sure a single head failure doesn’t cause an outage.)

One drive has a full-on SMART failure reported by the BIOS. Interesting… so replace that drive with another from the shelf… and the other drive is showing something ‘odd’. Yank it out, replace it, and move it to another machine for testing.

And it is failing as well! Four hard disks, 2 per head unit, and both failures land in the same head unit. What are the chances of that? (The second head unit does not exhibit any of the same symptoms.)

Now how do failing boot drives really cause problems? I don’t know exactly, but if a scrub is started on the boot volume (with the bad disks), the system stalls intermittently, which delays I/O to the NFS-mounted volumes.
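
For anyone who wants to watch for this kind of stall from the client side, here is a rough sketch in Python. It is not what I ran at the time. It simply times small synced writes into a directory and flags the slow ones. The function names, the 4 KB payload, and the 1 second threshold are all my own assumptions for illustration; point it at a directory on the NFS mount you care about.

```python
import os
import tempfile
import time


def probe_write_latency(directory, samples=5, interval=0.0, payload=b"x" * 4096):
    """Time small synced writes in `directory`; return latencies in seconds.

    Hypothetical helper: sample count, interval, and payload size are
    arbitrary illustration values, not tuned recommendations.
    """
    latencies = []
    for _ in range(samples):
        fd, path = tempfile.mkstemp(dir=directory, prefix="ioprobe-")
        try:
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data out to stable storage
            latencies.append(time.monotonic() - start)
        finally:
            os.close(fd)
            os.unlink(path)  # clean up the probe file
        time.sleep(interval)
    return latencies


def stalls(latencies, threshold=1.0):
    """Return the samples slower than `threshold` seconds (assumed cutoff)."""
    return [t for t in latencies if t > threshold]
```

Running `stalls(probe_write_latency("/some/nfs/mount", samples=60, interval=1.0))` during a scrub should show whether writes are pausing, with the path being a placeholder for your own mount point.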

VMware ESX handles this gracefully, but some older operating system releases don’t like to wait on I/O to complete. FreeBSD 6-STABLE (6.x with x > 2) is not doing so well here, while 7-STABLE and 8-STABLE are doing great. Ubuntu 8.04 and 10.04 also weathered the issue just fine. One Windows Server 2003 system was burping; the rest of the Windows Server systems were unaffected.

Very frustrating to find this issue and quite difficult to track down.

Log entries from the VMware virtuals helped immensely, but it took a while to correlate them and see that the issue was coming from that head unit, and even longer to find what was wrong with the head unit itself.
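
The correlation step is the sort of thing that is easy to script after the fact. Below is a minimal sketch in Python, assuming you have already pulled the stall messages out of each guest’s logs into `(timestamp, guest, head)` tuples. That tuple layout, the `correlate` name, and the 60 second window are my assumptions, not something taken from the actual logs. It groups events into time windows per head unit and reports the windows where more than one guest complained.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def correlate(events, window=timedelta(seconds=60), min_guests=2):
    """Find time windows where several guests on the same head stalled.

    `events` is an iterable of (timestamp, guest, head) tuples; this
    layout is a hypothetical one chosen for the example.
    Returns a sorted list of (head, window_start, [guests]).
    """
    buckets = defaultdict(set)
    for ts, guest, head in events:
        slot = (ts - EPOCH) // window  # index of the window holding ts
        buckets[(head, slot)].add(guest)

    hits = []
    for head, slot in sorted(buckets):
        guests = buckets[(head, slot)]
        if len(guests) >= min_guests:
            hits.append((head, EPOCH + slot * window, sorted(guests)))
    return hits
```

A head unit that keeps showing up in the output while the other stays quiet is the one to look at first. A single noisy guest never triggers a hit on its own, which helps separate a storage-side stall from one unhappy VM.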

Now the issue is resolved and I’ll need to rebuild the now ‘bad’ head unit with new disks (I don’t trust the data on the replacement disks right now as there were multiple problems on the previous drives) and bring it fully back into the cluster.

Statement: Nexenta is not at fault at all. It is only mentioned because that is what we are using. The issue is hardware, not software, and no blame should be attributed to Nexenta.