RaidFailureRecovery

The Joy of RAID

For over a year now, this site had been running on a FreeBSD system with four 250 gig hard drives in a RAID5 configuration -- two IDE drives (master and slave) sharing a controller and two SATA drives. I knew using IDE slaves in a RAID configuration was a bad idea for performance, but performance wasn’t a major concern. I subsequently learned that it’s also a bad idea from a reliability standpoint.

A couple of months ago the IDE slave drive suffered a catastrophic failure. Not only would it not come out of reset, it prevented the other IDE drive on the same controller from coming up! One drive failing on a RAID5 system is an expected event. Two is not. And booting up in that situation (the root fs was on a non-RAID partition) left FreeBSD's gvinum software RAID system dazed and confused.

Pulling the IDE slave drive got me back to three operational drives, but the damage was done. Nothing I did would persuade the array to come up. In fact, most gvinum commands I tried did little more than panic the kernel. I installed a new good drive in the IDE slave position in hopes that the array would automatically rebuild. No such luck.

Feelings of panic began to set in. I had no backups -- the RAID solution was supposed to keep my data safe from non-physical threats. But I kept reminding myself that this is software RAID and that all my data was still on the three good disks. Worst case I could use the FreeBSD kernel source to write a program that sequentially read the disk sectors and rebuilt the original filesystem image. Not something I was eager to do, but one way or another I would get my data back as long as I didn’t do something stupid that further corrupted the disks.
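The reason three good disks out of four are enough is that RAID5 parity is a plain XOR across each stripe, so any one missing block can be recomputed from the rest. Here is a minimal sketch of that idea in Python. It assumes one block per disk per stripe and does not model gvinum's actual on-disk layout, stripe size, or parity rotation; the function names and sample data are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recompute the one missing block of a stripe from the surviving ones.

    The surviving blocks may be any mix of data and parity; XOR of all of
    them yields whatever block is absent.
    """
    return xor_blocks(surviving_blocks)


# Hypothetical stripe on a four-disk array: three data blocks plus parity.
d0, d1, d2 = b"root", b"data", b"logs"
parity = xor_blocks([d0, d1, d2])

# Pretend the disk holding d1 died; rebuild it from the other three.
recovered = rebuild_missing([d0, d2, parity])
```

A real recovery tool would also need to know the stripe size and which disk holds parity for each stripe, which is exactly the information kept in the RAID metadata.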

After a week reading the kernel source and very careful poking around with a disk editor, I discovered that the gvinum metadata on the three good drives was inconsistent. The two SATA drives matched, but the good IDE master was corrupt. Could this explain why gvinum couldn’t bring up the array? The metadata binary format was simple enough, so I recreated the metadata on the IDE master to match the two SATA disks. That did the trick and I was finally able to get the array to come online. A quick fsck to repair a few file system inconsistencies and all of my data was once again accessible.

I immediately backed everything up to a USB drive. Now with my data safe and the failed IDE slave drive replaced, I set about looking for a way to rebuild the data on the new drive. The gvinum rebuildparity command looked like just what I needed. Imagine my disappointment when I discovered it is not actually implemented in FreeBSD 6.x! There is a convoluted way to accomplish the same thing, but it looked absurdly complicated.

At this point I had a complete non-RAID copy of my data. Facing the complex task of restoring the array and the newly discovered limitations in gvinum's recovery capabilities, I decided it was time to rethink my server strategy. I originally went with RAID5 because it minimized the cost of redundancy. With a collection of cheap 200-250 gig drives, I could get about 600 gig of storage. But drives are cheaper now, and I already had the 750 gig external with my backup. I decided to simply pick up a cheap 1 TB drive and go with a much simpler mirroring solution. A little more expensive, but it freed up the 200 gig drives for backup duty and should lower my electricity cost.
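For anyone wondering where the 600 gig figure comes from, the arithmetic can be sketched as follows. RAID5 spends one drive's worth of space on parity, and every member is limited to the size of the smallest drive. The exact mix of drive sizes below is an assumption for illustration, as are the function names.

```python
def raid5_usable(drive_sizes_gb):
    """Usable RAID5 capacity: (n - 1) drives' worth, sized by the smallest."""
    if len(drive_sizes_gb) < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (len(drive_sizes_gb) - 1) * min(drive_sizes_gb)


def mirror_usable(drive_sizes_gb):
    """Usable capacity of a simple mirror: the size of the smallest member."""
    if len(drive_sizes_gb) < 2:
        raise ValueError("a mirror needs at least two drives")
    return min(drive_sizes_gb)


# Assumed mix of four drives in the 200-250 gig range.
old_array = raid5_usable([200, 200, 250, 250])

# Hypothetical mirror of two 1 TB drives, for comparison.
new_mirror = mirror_usable([1000, 1000])
```

With those assumed sizes the RAID5 array comes out at 600 gig, while a mirror gives the capacity of a single drive at twice the drive cost.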

All that remains now is to choose a mirroring solution and rebuild the server.