On 2012-01-02 21:45, lee@frumble.claremont.edu wrote:
> I would configure the drives in a RAID-5 array with 3 for data, 1 for
> parity, and 1 hot spare drive. I'd also ensure that the system hosting
> the controller card and the drive enclosure were on a UPS and,
> preferably, in a temperature-controlled room.

The chances of not being able to recover the RAID after a single drive
failure are too high. You are likely to find an uncorrectable error
elsewhere when you go to rebuild the lost array member. These consumer
drives have uncorrectable read error rates of 1 in 10^14 bits ... and
they hold 10^13 bits of data. That is not enough of a margin for me!

> If you have the budget, you could build out dual RAID-5 setups
> and either mirror them via RAID-1 or layer ZFS on top of both.
> [ZFS sounds quite nice; thanks to whoever for mentioning it.]

There are few reasons to layer ZFS on top of hardware RAID, and many
reasons not to. Its correction and repair mechanisms are much stronger
than RAID-5's, not to mention that it resolves the RAID-5 "write hole"
and stripe-write performance issues via variable stripe size.
Additionally, ZFS can and will keep all those drives busy with
intelligent IO reordering, which it can only do reliably if there is
not a RAID controller underneath "lying" about what spindles exist and
what data has made it to disk!

In my experience, a RAID controller will offline a device on the first
encountered read error. At that point, the entire device is
inconsistent and invalid (as more writes occur on the other members),
and a "rebuild" requires a simultaneous, reliable read of the entire
surface of all the other disks. A strenuous task, to say the least,
happening at just the wrong time. In contrast, when ZFS encounters a
block read error, it retries from a redundant copy wherever one exists
(possibly on the same device), and immediately re-writes the offending
block.
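(As an aside, the error-rate margin above is easy to check with a
back-of-envelope calculation. This assumes the quoted spec of 1 URE per
10^14 bits, independent bit errors, and 10^13 bits per drive; real
drives cluster errors, so treat the number as rough.)

```python
# Back-of-envelope: chance of hitting at least one uncorrectable read
# error (URE) while rebuilding a 5-drive RAID-5 from consumer disks.
# Assumes 1 URE per 10^14 bits read, independent errors, and 10^13
# bits (~1.25 TB) of data per drive -- rough numbers only.
ure_rate = 1e-14          # probability any single bit read is uncorrectable
bits_per_drive = 1e13     # data held per drive
surviving_drives = 4      # drives that must be read perfectly to rebuild

bits_to_read = surviving_drives * bits_per_drive
p_clean = (1 - ure_rate) ** bits_to_read   # every bit reads back correctly
p_ure = 1 - p_clean                        # at least one URE during rebuild

print(f"chance of hitting a URE during rebuild: {p_ure:.0%}")  # ~33%
```

Roughly a one-in-three chance of a failed rebuild. Not enough margin.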
For most drives, this fixes it. The block is added to the internal bad
list, and the re-write causes an immediate reallocation from the spares
area. The original device is not taken offline, and the replacement
block can come from the very same device (by default, all metadata
blocks are allocated in two places, so the filesystem structure can
survive significant damage even on a single drive).

Frankly, it's odd to me: the whole idea that you need a complicated
"super reliable" controller, with its own OS, firmware, and
battery-backed RAM no less, to keep the storage system consistent in
the face of a crash, with no input or advice from the OS or filesystem
layer, all so you can present the "illusion" of a single dumb block
device... It seems so brittle.

ZFS is fantastic. One backup strategy feasible under ZFS is a 3-way
mirror with a rotating member that you pull and take to the vault. ZFS
will "resilver" the out-of-date device by copying only the data needed
to bring it up to date since the last transaction group recorded on
the device. If you do this with normal RAID and a 3TB drive, you'll
tie up your storage system for 12-16 hours while ~50MB/s moves from
disk 1 to disk 2. Additionally, ZFS can "stream" the difference
between two snapshots, so you can have a master taking snapshots every
minute or every month and "sending" the differences to a slave, which
"receives" them into its own filesystem -- just like log shipping of a
database. Since ZFS supports countless filesystems per storage pool,
you can create one per user, one per installed app, one per database
instance, etc., to make this kind of backup and data management easier.

I have used ZFS for years on my FreeBSD servers, and I have to say
that it's very comforting to snapshot the entire system before doing
something like an OS or database upgrade.
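(For the curious, the snapshot-shipping workflow looks roughly like
this. Pool, dataset, and host names here are made up -- substitute
your own.)

```shell
# Take a snapshot of a dataset on the master (names are hypothetical).
zfs snapshot tank/users@2012-01-02

# First time: ship the full snapshot to the slave.
zfs send tank/users@2012-01-02 | ssh slave zfs receive backup/users

# Thereafter: ship only the difference between two snapshots,
# just like log shipping of a database.
zfs snapshot tank/users@2012-01-03
zfs send -i tank/users@2012-01-02 tank/users@2012-01-03 \
    | ssh slave zfs receive backup/users
```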
I trust the reliability mechanisms, so that's all the backup I need
before proceeding to wipe out the system, and I can recover by
changing one line in the boot loader config file to mount the snapshot
as root. I have a SATA enclosure and buy drives in pairs from
different manufacturers, and just add them to the pool as a mirrored
pair. I.e., "zpool add pool0 mirror /dev/ada3 /dev/ada4" is all that's
required to grow the pool, and the new space is available to every FS.
I enable ZFS compression, so I don't feel that I must "squeeze" more
space out with parity RAID.

It's also comforting that I can take these disks and plug them
directly into another FreeBSD, Solaris, Linux, or MacOS machine and
get my data. ZFS is freely available on those OSes, and these are just
standard disks with standard GPT partitions. I have lost important
data in the past because a replacement could not be found for a failed
controller, or because the replacement machine overwrote the very RAID
array it was supposed to recover, not "knowing" there was an array on
those disks. With ZFS I don't have to match the controller model,
controller firmware version, OS driver, or OS version. ZFS itself is
versioned and upgradeable in place, so moving to a newer OS is
graceful.

Joe

--
http://www.piclist.com PIC/SX FAQ & list archive
View/change your membership options at
http://mailman.mit.edu/mailman/listinfo/piclist