It is time to add space to our VMware cluster for storage of VMs for our customers.

We initially started out on Compellent SAN storage. It worked well and had a lot of space, but the performance was not what I was looking for (it is all SATA based), F/C is far more complicated than we need for this project, and it is just plain expensive to upgrade (adding a shelf of 16 450GB SAS disks is more expensive than the solution presented here).

Spin forward a few months: I took one of our older NetApp FAS270c systems, ripped out the 73GB 10K disks, went to eBay and purchased 2 shelves of 300GB 10K F/C disks, and after an afternoon of shuffling we had ourselves a cluster for storage. Performance is absolutely great (and predictable). Expanding it? Expensive, just like the Compellent, and I also wanted to investigate some of the new things that can be done with storage, like inline (and synchronous) compression and data deduplication.

History lesson: Years and years ago I did a lot with Solaris and I have kept my feet wet playing with OpenSolaris and ZFS. I won’t bore you with the great details of why ZFS is the shit (links at the bottom of the post) or why Solaris needs to live on forever (because no one can thread at the kernel level like Solaris), but I will tell you that using Solaris (or OpenSolaris) with ZFS is a combination that is very tough to beat. So last year I used the Sun Try-and-Buy program to test out a 7110 and I absolutely loved the interface, and the price drop that occurred while I had it! But someone within Sun decided that no, as a TaB customer the new pricing is not available. I was floored by this. I shipped back the 7110, and anyway, I really wanted a cluster, not a single head, single JBOD enclosure.

In the end I wanted a cluster!

That’s 2 heads, automatic failover, etherchannel/trunk/IPMP/LACP (oh hell, I just wanted multiple ethernets bound together if possible), basic reporting (I can SNMP for the real stuff), and finally, a price tag that I can feel good about for storage of our customer data.

This is where NexentaStor comes in! They have the pieces all put together so I don’t have to self-engineer something. I have a vendor I can harass or ask questions of. I can focus on what makes my business work instead of working to make my business stable.

I chose NexentaStor because I wanted something more than just some hand built item and I wanted a working implementation of high availability without a lot of hackery(tm) on my part. I actually do not like building servers, messing with a ton of configurations and creating job security because things get so complicated that I can’t go on vacation. I had run up OpenSolaris with ZFS on multiple systems. I am comfortable with it, I am happy with it, but I want something that others can handle if I am not around. This includes web GUI management and basic reporting and a simplified command line without all of the UNIXisms that drive a layman batty!

Do check out their website. We are using the commercial version, and there are developer and community editions available as well.

My parts list for the head units (there are 2 so double all parts):

  • 1 SuperMicro SYS-6016T-NTRF4+ 1U chassis+MB originally specced, later changed because of availability
  • 2 Intel Xeon E5520 Nehalem
  • 6 4GB registered ECC (24GB total)
  • 1 SuperMicro AOC-USAS-L8i SAS HBA
  • 1 CABL-0167L Backplane Cable
  • 2 Western Digital 500GB RE3 SATA (in mirror for boot)

My parts list for the JBOD shelving units (there are 2 so double all parts):

  • 1 SuperChassis SC846E2-R900B 4U 24 bay SAS with SAS Expander
  • 2 CABL-0166L external SAS cables
  • 1 SuperMicro CSE-PTJBOD-CB1 power card
  • 2 CABL-0168L internal->external SAS cables
  • 24 Seagate ST31000424SS 1TB SAS disks (22 installed, 2 on shelf)
  • 2 Crucial RealSSD 256GB 2.5″ SSD for mirrored ZIL
  • 2 AOC-SMP-LSISS9253 SAS interposer cards for SATA->SAS interconnections

Basic nerd data…

Note – benchmarking is very hard to do. I can easily show you bonnie++ output saturating a gig ethernet, I can show you output from multiple VMs saturating 2 gig ethernets but that didn’t really tell me anything.

In fact, I can show you output from 8 VMs pushing a total of 4Gbps of aggregate throughput, 2Gbps of write traffic with 2Gbps of read traffic (well, not a full base 2 version of such, more like base 10), but instead I’d rather show you what things look like with what I have on the server right now. That is, the numbers are really boring and making flat graphs just doesn’t look cool.
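For anyone who wants the base 10 versus base 2 remark spelled out, here is a quick sketch of the arithmetic. The function names are my own invention for illustration, and it assumes "Gbps" means 10^9 bits per second on the wire, before any protocol overhead.

```python
# Rough conversion of a link rate in Gbps (decimal, 10^9 bits/s)
# into megabytes (base 10) and mebibytes (base 2) per second.
# Helper names are illustrative, not taken from any tool.

def gbps_to_mb_per_s(gbps: float) -> float:
    """Decimal megabytes per second: 10^6 bytes."""
    return gbps * 1e9 / 8 / 1e6

def gbps_to_mib_per_s(gbps: float) -> float:
    """Binary mebibytes per second: 2^20 bytes."""
    return gbps * 1e9 / 8 / (1024 ** 2)

if __name__ == "__main__":
    for rate in (2, 4):
        print(f"{rate} Gbps = {gbps_to_mb_per_s(rate):.0f} MB/s "
              f"= {gbps_to_mib_per_s(rate):.1f} MiB/s")
```

So the 4Gbps aggregate works out to 500 MB/s in base 10 terms, and a bit less than that when counted in base 2 units.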

(as an aside, reaching over 10K I/Os per second reads and writes is easy with this solution, reaching 20K I/Os per second reads is much harder while the scaling of writes per second continued to move towards the 30K mark before service transaction time started creeping up over 1.2ms – yah, thems are nerd numbers)

One head, one volume, 4 Windows Server 2008R2 systems installed, patched, sitting idle. Each VM is configured with 2GB RAM and a 200GB disk, served over an NFS connection on bonded gigabit ethernet.

/$ df -k /volumes/volume02/volume02
Filesystem             size   used  avail capacity  Mounted on
volume02/volume02       13T    16G    13T     1%   volume02/volume02

So about 16GB in use, but the volume and share have deduplication turned on and this is where the real fun starts.

The systems are as I described above: installed, patched, and idle. Nothing is going on, and before you ask ‘why’, it is because I have torn down all of my benchmarking VMs. I got tired of seeing that I could saturate the ethernet network repeatedly while the storage system sat there fine, quiet, and bored out of its proverbial skull. I just can’t create enough synthetic load to really create problems, whether from I/O or from pure streaming reads and writes.

/$ zpool list volume02
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
volume02  18.9T  17.7G  18.9T     0%  1.29x  ONLINE  -

1.29x – if I read this right – means the same data would take about 29% more space without dedupe. Put the other way, dedupe is saving a bit over 22% (1 - 1/1.29) of what would otherwise be stored. From reading the different white papers and blog posts, the deduped data doesn’t impose any performance overhead for reading whatsoever, while writing is only marginally affected. If that is better than 20% savings
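A dedup ratio like the 1.29x above can be read two ways, and it is easy to mix them up, so here is a small sketch of the arithmetic. It assumes the DEDUP column is logical data divided by physical data actually stored. The function names are made up for illustration, and treating the whole ALLOC figure as deduped data is a simplification.

```python
# Two ways to read a ZFS dedup ratio (assumed here to be logical / physical).
# Function names are illustrative only.

def dedup_savings(ratio: float) -> float:
    """Fraction of the logical data that dedup avoids storing."""
    if ratio < 1.0:
        raise ValueError("a dedup ratio is expected to be >= 1.0")
    return 1.0 - 1.0 / ratio

def extra_space_without_dedup(ratio: float) -> float:
    """How much more space the same data would need with dedup off."""
    if ratio < 1.0:
        raise ValueError("a dedup ratio is expected to be >= 1.0")
    return ratio - 1.0

if __name__ == "__main__":
    ratio = 1.29          # DEDUP column from `zpool list`
    allocated_gb = 17.7   # ALLOC column; simplification: treat it all as deduped data

    print(f"savings from dedup:       {dedup_savings(ratio):.1%}")
    print(f"extra space without it:   {extra_space_without_dedup(ratio):.1%}")
    print(f"approx. logical data:     {allocated_gb * ratio:.1f}G")
```

The "29%" is the extra space needed without dedupe, while the saving relative to the un-deduped size is the smaller of the two figures.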

Whew. While this post isn’t all that informative, it is a summary of what I have been playing with for the last couple of weeks.

Below are pictures of the system(s) and links to pertinent data you may enjoy.

Deduplication now in ZFS – Virtually All The Time

ZFS Deduplication : Jeff Bonwick’s Blog

First Look at ZFS Deduplication – The Blog of Ben Rockwood

Nexenta Adds Dedupe To Open-Source ZFS Storage – Storage – IT …

NexentaStor FAQ

Nexenta Systems announces record growth in 2009 | NAS Storage Server

Nexenta Systems

and onto…

taking the twins to the data center for mounting

Head units on cart to the data center

new storage system in the rack