A storage cluster is born
It is time to add space to our VMware cluster for storage of VMs for our customers.
We started out on Compellent SAN storage. It worked well and had a lot of space, but the performance was not what I was looking for (it is all SATA based), F/C is far more complicated than we need for this project, and it is just plain expensive to upgrade (adding a shelf of 16 450GB SAS disks costs more than the solution presented here).
Spin forward a few months: I took one of our older NetApp FAS270c systems, ripped out the 73GB 10K disks, went to eBay and purchased 2 shelves of 300GB 10K F/C disks, and after an afternoon of shuffling we had ourselves a cluster for storage. Performance is absolutely great (and predictable). Expanding it? Expensive, just like the Compellent, and I also wanted to investigate some of the new things that can be done with storage, like inline (and synchronous) compression and data deduplication.
History lesson: Years and years ago I did a lot with Solaris and I have kept my feet wet playing with OpenSolaris and ZFS. I won’t bore you with the great details of why ZFS is the shit (links at the bottom of the post) or why Solaris needs to live on forever (because no one can thread at the kernel level like Solaris), but I will tell you that using Solaris (or OpenSolaris) with ZFS is a combination that is very tough to beat. So last year I used the Sun Try-and-Buy program to test out a 7110 and I absolutely loved the interface, and the price drop that occurred while I had it! But someone within Sun decided that no, as a TaB customer the new pricing is not available. I was floored by this. I shipped back the 7110, and anyway, I really wanted a cluster, not a single head, single JBOD enclosure.
In the end I wanted a cluster!
That’s 2 heads, automatic failover, etherchannel/trunk/IPMP/LACP oh hell, I just wanted multiple ethernets bound together if possible, basic reporting (I can SNMP for the real stuff), and finally, a price tag that I can feel good about for storage of our customer data.
This is where NexentaStor comes in! They have the pieces all put together so I don’t have to self-engineer something. I have a vendor I can harass or ask questions of. I can focus on what makes my business work instead of working to make my business stable.
I chose NexentaStor because I wanted something more than just some hand built item and I wanted a working implementation of high availability without a lot of hackery(tm) on my part. I actually do not like building servers, messing with a ton of configurations and creating job security because things get so complicated that I can’t go on vacation. I had run up OpenSolaris with ZFS on multiple systems. I am comfortable with it, I am happy with it, but I want something that others can handle if I am not around. This includes web GUI management and basic reporting and a simplified command line without all of the UNIXisms that drive a layman batty!
Do check out their website. We are using the commercial version, and there are developer and community editions available as well.
My parts list for the head units (there are 2 so double all parts):
- 1 SuperMicro SYS-6016T-NTRF4+ 1U chassis+MB originally specced, later changed because of availability
- 2 Intel Xeon E5520 Nehalem
- 6 4GB registered ECC (24GB total)
- 1 SuperMicro AOC-USAS-L8i SAS HBA
- 1 CABL-0167L Backplane Cable
- 2 Western Digital 500GB RE3 SATA (in mirror for boot)
My parts list for the JBOD shelving units (there are 2 so double all parts):
- 1 SuperChassis SC846E2-R900B 4U 24 bay SAS with SAS Expander
- 2 CABL-0166L external SAS cables
- 1 SuperMicro CSE-PTJBOD-CB1 power card
- 2 CABL-0168L internal->external SAS cables
- 24 Seagate ST31000424SS 1TB SAS disks (22 installed, 2 on shelf)
- 2 Crucial RealSSD 256GB 2.5″ SSD for mirrored ZIL
- 2 AOC-SMP-LSISS9253 SAS interposer cards for SATA->SAS interconnections
Basic nerd data…
Note – benchmarking is very hard to do. I can easily show you bonnie++ output saturating a gig ethernet, I can show you output from multiple VMs saturating 2 gig ethernets but that didn’t really tell me anything.
In fact, I can show you output from 8 VMs pushing a total of 4Gbps of aggregate throughput, 2Gbps of write traffic with 2Gbps of read traffic (well, not a full base 2 version of such, more like base 10), but instead I’d rather show you what things look like with what I have on the server right now. I.e., the numbers are really boring and making flat graphs just doesn’t look cool.
(as an aside, reaching over 10K I/Os per second reads and writes is easy with this solution, reaching 20K I/Os per second reads is much harder while the scaling of writes per second continued to move towards the 30K mark before service transaction time started creeping up over 1.2ms – yah, thems are nerd numbers)
One head, one volume, 4 Windows Server 2008R2 systems installed, patched, sitting idle. Each VM is configured with 2GB RAM, 200GB disk shared over NFS connection on bonded gigabit ethernet.
/$ df -k /volumes/volume02/volume02
Filesystem           size   used  avail capacity  Mounted on
volume02/volume02     13T    16G    13T     1%    volume02/volume02
So about 16GB in use, but the volume and share have deduplication turned on and this is where the real fun starts.
The systems are as I described above: installed, patched, and idle. Nothing is going on, and before you ask ‘why’, it is because I have torn down all of my benchmarking VMs. I got tired of seeing that I could saturate the ethernet network repeatedly while the storage system was fine, quiet, and bored out of its proverbial skull. I just can’t create enough synthetic load to really create problems, not from I/O, not from pure streaming reads and writes.
/$ zpool list volume02
NAME       SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
volume02   18.9T  17.7G  18.9T  0%   1.29x  ONLINE  -
1.29x: if I read this right, the pool references about 1.29 times as much data as it has actually allocated, which works out to a savings of roughly 22% over not using dedupe (1 - 1/1.29). From reading the different white papers and blog posts, the deduped data doesn’t impose any performance overhead for reading whatsoever while writing is only marginally affected. If that is better than 20% savings…
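A quick sanity check of that figure, as a minimal sketch: assuming the DEDUP column is the ratio of logical (referenced) data to physical (allocated) data, the fraction saved is 1 - 1/ratio, so 1.29x comes out nearer 22% than 30%. The function name is made up for illustration.

```python
def dedup_savings(ratio: float) -> float:
    """Fraction of logical data that dedup avoided storing.

    Assumes `ratio` is logical size / physical size, which is how the
    DEDUP column of `zpool list` is read here (an assumption).
    """
    if ratio < 1.0:
        raise ValueError("dedup ratio should be >= 1.0")
    return 1.0 - 1.0 / ratio


ratio = 1.29  # DEDUP value from the zpool list output
print(f"{dedup_savings(ratio):.1%} of the logical data never hit disk")
print(f"{ratio - 1:.0%} more data fits in the same space")
```

Both numbers describe the same thing from different ends: about 22% less disk used, or about 29% more data stored per disk.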
Whew, while this post isn’t all that informative, it is a summary of what I was playing with for the last couple of weeks.
Below are pictures of the system(s) and links to pertinent data you may enjoy.
Deduplication now in ZFS – Virtually All The Time
ZFS Deduplication : Jeff Bonwick’s Blog
First Look at ZFS Deduplication – The Blog of Ben Rockwood
Nexenta Adds Dedupe To Open-Source ZFS Storage – Storage – IT …
Nexenta Systems announces record growth in 2009 | NAS Storage Server
and onto…


about 3 months ago
Looks like a fun build!
A few geeky questions.. ;)
Why 3gb SAS instead of 6gb SAS for future-proof’in, especially since your RealSSD’s can’t get maxed out on 3gb! :) (Also hope you don’t run into OpenSolaris bug 6900767 – that bit us on a few OpenSolaris boxes with the 1068 controllers. Annoying that is.) I have a hunch the reason is “lack of availability of 6gb backplanes”, but hey, gotta ask. ;)
Are you spacing the SSDs out across the 4 3gb lanes, since each of the SSDs could easily saturate a single lane on reads?
How are the RealSSDs performing as ZILs? I worry about the slowdown over time with writes to them without TRIM, etc. Are you planning on removing and scrubbing the disks occasionally to get the performance back? I would’ve loved to go with them for ZIL along with cache, but was a bit nervous about that. If you’ve found a way to deal with it that’d make my next build easier!
How are you cabling it? Cascade shelf-to-shelf, and then connect a HBA on each end of it?
In any case, nice build! Care to post a total cost? :)
about 3 months ago
When will I reach 3Gbps?
Guessing never! So was not really worried about that.
If (huge if at that) I hit a performance issue I’ll build another and move the VMs at that time. I am not interested in THE ONE storage system to rule them all.
Pricing? I have spreadsheets! Part numbers were listed and easy to extrapolate :)
about 3 months ago
Well, it’s pretty easy to hit 3gbps on a channel with 6-12 drives on it (depending on how things are stacked), running with some sort of RAID level (as that causes more data to go across the drives than just the actual amount of traffic being written or read) especially if 1-2 of those drives are SSDs.. but I suppose the real question is will it actually *matter* [ie - cause slowness to the point where it's visible to the end user], and the answer is almost certainly no. ;)
Are you running compression in combination with dedupe? I’ve been quite impressed with the I/O that is still possible through a system that is compressing and deduplicating at the same time.
about 3 months ago
Yep, compression and dedupe on, one of the reasons I went with the ZFS based testing in the first place.
I think ZFS should get more love from that L community and swallow that earnest pride that makes the GPL such a POS at times. FreeBSD can do it!
But I ain’t using FreeBSD for this as I want HA.
As for the 3Gbps stuff, I built my raid groups around 7 drives per and while technically true I can hit that internally I think that reality will show that this won’t be an issue. If it becomes an issue I can rebuild heads into 2U systems and spin up another HBA and multiple-path, or better yet…build another basket for my eggs :)
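A rough back-of-the-envelope sketch of why the SAS lanes are unlikely to be the first thing to run out. The certain parts: a 3Gbps SAS lane carries about 300 MB/s after 8b/10b encoding, and a gigabit ethernet link tops out at 125 MB/s before protocol overhead. Everything else is an assumption for illustration: one x4 wide port to the shelf, 2 bonded gig links, and a guessed 100 MB/s of streaming per drive.

```python
SAS_LANE_MBPS = 300    # 3Gbps SAS lane after 8b/10b encoding
GIGE_LINK_MBPS = 125   # 1Gbps ethernet, before protocol overhead


def bottleneck(sas_lanes, gige_links, drives, mb_per_drive):
    """Return (name, MB/s) of the slowest stage between disks and clients."""
    stages = {
        "disks": drives * mb_per_drive,
        "sas": sas_lanes * SAS_LANE_MBPS,
        "ethernet": gige_links * GIGE_LINK_MBPS,
    }
    name = min(stages, key=stages.get)
    return name, stages[name]


# Assumed layout: x4 SAS wide port, 2 bonded gig links, 21 data drives
# at ~100 MB/s each (a guess, not a measurement).
print(bottleneck(sas_lanes=4, gige_links=2, drives=21, mb_per_drive=100))
# → ('ethernet', 250)
```

Under those assumptions the network gives out at 250 MB/s, long before the 1200 MB/s the SAS side could carry.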
about 3 months ago
I’m totally with you – wishing for a ZFS port for, well, everything. ;) btrfs is interesting, but it’d be nice to have the effort go into improving zfs.. ah well.
Totally agreed on the more eggs statement! I do like the approach of limiting the size and cost of an individual basket so more can be created as needed.. ;)
about 3 months ago
One more geeky question – where did the interposer card fit in? Not finding anything but “buy” links.. is it a small adapter card that fits in between the drive and the backplane in the sled, or something else? Pictures would be appreciated, but probably difficult to get as I’m imagining this is in heavy use already. ;)
Thx!
about 3 months ago
Oh that’s exactly where the interposer fits.
SATA disk (or SSD) -> interposer -> SAS bus
As far as pictures – oops, everything is mounted and running, sorry :( Not heavy yet, but that will change in the coming days!
about 3 months ago
Yeah, that’s kind of what I figured! :)
Did you need to use a special sled to fit the interposer in? Or do the standard sleds in the enclosure you picked up have a second set of screw holes that give you enough room to add the interposer?
Thanks!
about 3 months ago
We used them for the SSD, so no idea what would happen on a normal SATA disk.
Probably just move up one set of holes for the screws…
about 3 months ago
Well, are the SSDs mounted in the 2.5″ adapters, or are they mounted internally or somethin’?
about 3 months ago
They are mounted inside of adapters, then mounted into the drive sleds, then mounted in the external JBOD enclosures.
about 3 months ago
Are you using fletcher4 with verify or sha256 for the dedup?
about 2 months ago
SHA256 – skipping verify at this time though I can flip the bit easily enough.
about 3 months ago
The RealSSD drives have a pretty big DRAM cache sans supercap. Doesn’t that make the ZIL volatile?
about 2 months ago
It could, but the ZIL devices are housed in the external enclosure and will migrate with the volume in the case of an HA failover.
The chance of a power outage is pretty low (though not impossible) and there are 2 ZIL devices in mirror configuration to help prevent issues in the case of some kind of failure.
Not perfect though close enough for this kind of thing.
This link has some data to help: http://www.anandtech.com/show/2909
about 1 month ago
Great info in this post! I’m going to build out a storage system and am using your config as a guide. I’m not that familiar with SuperMicro and wondered how you attach the external storage to the 1U 6016? The SuperMicro AOC-USAS-L8i SAS HBA looks like an internal RAID card (and also looks like some custom SuperMicro UIO format where the server you listed looks like it uses PCI-E only)? Sorry for the noob questions, but there seem to be a lot of moving pieces in this storage thing.
about 1 month ago
There are a ton of parts to this.
The external storage is connected via external SAS connectors.
We used PCIe cards for things, the UIO card we did use was for ethernet. As we didn’t need hardware raid for things, the SAS HBA is JBOD.
about 1 month ago
Hello,
This is Spandana Goli from Nexenta Systems.
Having read your blogs on NexentaStor, we are happy to learn about your positive experience.
Is there an email address I could reach you at to discuss further opportunities?
Regards,
Spandana
about 1 month ago
You can go to our website and find our phone number….
Then call me.
I don’t need more email…. :)
about 1 month ago
Thank you for your post. I’ve played around with OpenSolaris, Nexenta Core and NexentaStor in the past and all of them look very promising.
What NexentaStor licenses are required for your HA configuration? I would guess 2x Enterprise [Silver/Gold] Edition xTB + 1x HA Plugin?
about 1 month ago
For me…
32TB plus 8TB, then a 4TB for the other head, plus HA.
We have 40TB of managed disk in total.
about 1 month ago
Also, how did you configure your zpools? Mirrored RAID-Zs?
about 1 month ago
Did raidz2 groups, three per volume, 2 volumes, plus hot and cold spares.
Also have write logs mirrored per volume.
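For anyone checking the math against the zpool and df output in the post, here is a small sketch. It assumes three raidz2 groups of 7 disks each (the 7-per-group figure comes from an earlier reply in this thread), 1TB disks counted in base 10, and tools that report in base 2. It ignores labels, reserved space and metadata, so real numbers land a little lower. The function name is hypothetical.

```python
TB = 10**12   # disk vendors count in base 10
TIB = 2**40   # zpool and df report in base 2


def raidz_pool_tib(groups, disks_per_group, parity, disk_bytes=1 * TB):
    """Return (raw, usable) capacity in TiB for a pool of raidz groups.

    raw counts every disk in the groups; usable subtracts the parity
    disks in each group. Spares and log devices are not included.
    """
    raw = groups * disks_per_group * disk_bytes
    usable = groups * (disks_per_group - parity) * disk_bytes
    return raw / TIB, usable / TIB


raw, usable = raidz_pool_tib(groups=3, disks_per_group=7, parity=2)
print(f"raw    {raw:.1f} TiB")     # zpool list showed 18.9T
print(f"usable {usable:.1f} TiB")  # df showed 13T
```

That gives roughly 19.1 TiB raw and 13.6 TiB usable, which lines up with what the tools reported once overhead is taken out.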