djrb.org | mdadm woes

2017/07/08 - mdadm woes | linux

I have a Linux (Ubuntu) server which I use for hosting my Perforce service. The server is built on an ASRock C2750D4I, a board with an 8-core Atom CPU in a passively cooled package. I run it with 16GB of ECC RAM and 4x4TB Western Digital drives in RAID6, and boot from an 8GB SanDisk USB drive.

A huge amount of inspiration for the build came from this post.

Power surge, dead board

A few weeks back we had a power outage at my home. These happen very infrequently (I live in the UK) and it lasted for barely a second. All my computer hardware rebooted as a result, and all seemed fine except this server didn't come back up.

After extensive diagnostics I decided the motherboard was dead and RMA'ed it to ASRock in The Netherlands. A couple of weeks later I received a replacement, as they'd been unable to repair the original.

Lesson learned: at a minimum, plug your computer hardware into surge-protected power bars, and preferably into UPSes.

Rebuild

After checking the board without any HDDs/etc plugged in, and running into some difficulties with it POSTing, I reconnected all the drives in the same way they were previously plugged in, plugged in the USB drive to boot off of, and powered it up.

Linux booted mostly as normal, but then went into "emergency mode" since it failed to bring up the RAID array. Oh dear.

After a bit of investigation, it seemed that two of the drives had lost their md superblocks and so couldn't be added to the RAID. Reasons for this are currently unclear, and I'm concerned I may have a latent issue with the new motherboard, perhaps trying to be "helpful" and trying to automatically set up a couple of my drives in a RAID; we'll see.
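One way to see which drives still carry md metadata is to run mdadm --examine against each member. The following is only a rough sketch, not the exact steps I ran: has_md_superblock and report_members are hypothetical helper names, the MDADM variable is a hypothetical hook for dry runs, and the script assumes mdadm exits non-zero when it detects no superblock.

```shell
#!/bin/sh
# MDADM is a hypothetical override hook so the check can be dry-run
# without touching real disks; it defaults to the real mdadm binary.
MDADM="${MDADM:-mdadm}"

# has_md_superblock DEVICE
# Succeeds when "mdadm --examine" finds md metadata on DEVICE.
# Assumption: mdadm exits non-zero when no superblock is detected.
has_md_superblock() {
  $MDADM --examine "$1" >/dev/null 2>&1
}

# report_members DEVICE...
# Print one line per device saying whether md metadata was found.
report_members() {
  for dev in "$@"; do
    if has_md_superblock "$dev"; then
      echo "$dev: md superblock present"
    else
      echo "$dev: no md superblock found"
    fi
  done
}

# Example (needs root):
#   report_members /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Running mdadm --examine by hand on a single drive prints the full metadata, which is more useful than a yes/no answer when deciding how to proceed.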

The fix

In the end, I had to add the broken drives back to the RAID and let them be recovered.

sudo mdadm /dev/md0 --add /dev/sdc
sudo mdadm /dev/md0 --add /dev/sdd
watch cat /proc/mdstat

Output:

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid6 sdd[5] sdc[4] sda[0] sdb[1]
      7813774336 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/2] [UU__]
      [=>...................]  recovery =   7.9% (310168544/3906887168) finish=407.7min speed=147013K/sec

unused devices: <none>
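The recovery line can be sanity-checked by hand: the blocks still to rebuild, divided by the reported speed, should roughly match the finish estimate. Here is a minimal sketch in POSIX shell and awk, where recovery_eta_minutes is a hypothetical helper name and the sample line is copied from the recovery output above. It assumes the block counts and the speed are both in units of 1K, as the K/sec suffix suggests.

```shell
#!/bin/sh
# Sample recovery line, copied from the /proc/mdstat output above.
line='[=>...................]  recovery =   7.9% (310168544/3906887168) finish=407.7min speed=147013K/sec'

# recovery_eta_minutes LINE
# Recompute the whole minutes remaining from the (done/total) block
# counts and the speed in K/sec: (total - done) / speed / 60.
recovery_eta_minutes() {
  printf '%s\n' "$1" | awk '
    # Pull "done/total" out of the parenthesised pair.
    match($0, /\([0-9]+\/[0-9]+\)/) {
      split(substr($0, RSTART + 1, RLENGTH - 2), blocks, "/")
    }
    # Pull the number out of "speed=<n>K/sec".
    match($0, /speed=[0-9]+K\/sec/) {
      speed = substr($0, RSTART + 6, RLENGTH - 11)
    }
    END {
      if (speed > 0)
        printf "%d\n", (blocks[2] - blocks[1]) / speed / 60
    }
  '
}

recovery_eta_minutes "$line"
```

For the line above this works out to a little over 407 minutes, in line with the finish=407.7min that the kernel reported.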

Once recovered, everything was back to normal.