Sometimes a hard disk hints at an upcoming failure. Some disks start to make unexpected sounds; others stay silent and only cause some noise in your syslog. In most cases the disk will automatically reallocate one or two damaged sectors, and you should start planning to buy a new disk while your data is still safe. Sometimes, however, the disk won’t reallocate these sectors automatically and you’ll have to do it manually. Luckily, this isn’t rocket science.
A few days ago, one of my disks reported some problems in my syslog while rebuilding a RAID5-array:
Jan 29 18:19:54 dragon kernel: [66774.973049] end_request: I/O error, dev sdb, sector 1261069669
Jan 29 18:19:54 dragon kernel: [66774.973054] raid5:md3: read error not correctable (sector 405431640 on sdb6).
Jan 29 18:19:54 dragon kernel: [66774.973059] raid5: Disk failure on sdb6, disabling device.
Jan 29 18:20:11 dragon kernel: [66792.180513] sd 3:0:0:0: [sdb] Unhandled sense code
Jan 29 18:20:11 dragon kernel: [66792.180516] sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 29 18:20:11 dragon kernel: [66792.180521] sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Jan 29 18:20:11 dragon kernel: [66792.180547] sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Jan 29 18:20:11 dragon kernel: [66792.180553] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 4b 2a 6c 4c 00 00 c0 00
Jan 29 18:20:11 dragon kernel: [66792.180564] end_request: I/O error, dev sdb, sector 1261071601
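As a cross-check, the CDB line in the log above can be decoded by hand. A Read(10) command block carries the starting LBA in bytes 2 to 5 and the transfer length in bytes 7 and 8, both big-endian. The following is a minimal sketch in POSIX shell; the function name decode_read10 is made up for this illustration and is not part of any tool.

```shell
# Decode the starting LBA and transfer length from a SCSI Read(10) CDB.
# Byte layout (0-based): 0 = opcode (0x28), 2-5 = LBA, 7-8 = transfer length.
# decode_read10 is a hypothetical helper name used only in this example.
decode_read10() {
    # Expect the ten CDB bytes as separate two-digit hex arguments.
    [ "$#" -eq 10 ] || { echo "expected 10 CDB bytes" >&2; return 1; }
    [ "$1" = "28" ] || { echo "not a Read(10) CDB" >&2; return 1; }
    lba=$(( 0x$3$4$5$6 ))    # bytes 2-5, big-endian
    count=$(( 0x$8$9 ))      # bytes 7-8, big-endian
    echo "$lba $count"
}
```

Calling decode_read10 28 00 4b 2a 6c 4c 00 00 c0 00 prints "1261071436 192": a read of 192 sectors starting at sector 1261071436, a range that includes the failing sector 1261071601 from the last log line.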
Modern hard disk drives are equipped with a small number of spare sectors onto which damaged sectors can be reallocated. However, a sector only gets reallocated when a write operation fails. A failing read operation will, in most cases, only throw an I/O error. In the unlikely event that a second read does succeed, some disks perform an auto-reallocation and the data is preserved. In my case, the second read failed miserably (“Unrecovered read error - auto reallocate failed”).
The read errors surfaced during the initial sync of a new RAID5 array, which was running in degraded mode (on /dev/sdb and /dev/sdc, with /dev/sdd missing). Obviously, mdadm kicked sdb out of the already degraded RAID5 array, leaving nothing but sdc. That’s not something to be very happy about…
The only solution to this problem was to force sdb to reallocate the damaged sectors. That way, mdadm wouldn’t encounter the read errors and the initial sync of the array would succeed. A tool like hdparm can force a disk to reallocate a sector by simply issuing a write command to the damaged sector. First, check the number of reallocated sectors on the disk:
$ smartctl -a /dev/sdb | grep -i reallocated
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
The zeroes at the end of the lines indicate that there are no reallocated sectors on /dev/sdb. Let’s check whether sector 1261069669 is really damaged:
$ hdparm --read-sector 1261069669 /dev/sdb
/dev/sdb: Input/Output error
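The sector numbers to test come from the kernel's "end_request: I/O error" lines. When there are more than a couple of them, extracting them from the log may be easier than copying them by hand. The following is a sketch that assumes the log lines look like the excerpt above; the function name bad_sectors is hypothetical, and the log location differs between distributions.

```shell
# Print the unique sector numbers the kernel reported as unreadable on a disk.
# bad_sectors is a hypothetical helper; it assumes lines of the form
#   "... I/O error, dev sdb, sector 1261069669"
bad_sectors() {
    disk=$1   # kernel device name without /dev/, e.g. sdb
    log=$2    # log file to scan (location is an assumption, e.g. /var/log/syslog)
    sed -n "s|.*I/O error, dev $disk, sector \([0-9][0-9]*\).*|\1|p" "$log" |
        sort -n -u
}
```

For example, bad_sectors sdb /var/log/syslog would list each failing sector of sdb once, in ascending order.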
Now issue the write command to the damaged sector(s). hdparm completely bypasses the regular block layer read/write mechanisms, and the data on these sectors will be lost forever!
$ hdparm --write-sector 1261069669 /dev/sdb
Use of --write-sector is VERY DANGEROUS.
You are trying to deliberately overwrite a low-level sector on the media
This is a BAD idea, and can easily result in total data loss.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
$ hdparm --write-sector 1261069669 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb: re-writing sector 1261069669: succeeded
$ hdparm --write-sector 1261071601 --yes-i-know-what-i-am-doing /dev/sdb
/dev/sdb: re-writing sector 1261071601: succeeded
Now, use hdparm again to check the availability of the reallocated sectors:
$ hdparm --read-sector 1261069669 /dev/sdb
reading sector 1261069669: succeeded
(a lot of zeroes should follow)
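To re-test several rewritten sectors in one go, the read check can be wrapped in a small loop. This is only a sketch: it assumes hdparm exits with a non-zero status when the read fails, and both the function name check_sectors and the HDPARM override variable are inventions for this example (the override just makes it possible to try the loop with a stub command).

```shell
# Re-read each given sector and report whether the read succeeded.
# check_sectors and the HDPARM variable are hypothetical names; the loop
# assumes hdparm returns a non-zero exit status on a failed read.
check_sectors() {
    dev=$1
    shift
    for sector in "$@"; do
        if ${HDPARM:-hdparm} --read-sector "$sector" "$dev" >/dev/null 2>&1; then
            echo "$sector ok"
        else
            echo "$sector FAILED"
        fi
    done
}
```

For the two sectors above, the call would be check_sectors /dev/sdb 1261069669 1261071601, run as root.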
And using SMART we can check whether the disk has registered two reallocated sectors:
$ smartctl -a /dev/sdb | grep -i reallocated
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 2
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 2
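If you want to keep an eye on this counter from a script, the raw value can be pulled out of the attribute table. The sketch below assumes the usual ten-column attribute layout shown above, with the raw value in the tenth column; the function name smart_raw is made up for this example.

```shell
# Print the raw value of one SMART attribute from a smartctl attribute table.
# smart_raw is a hypothetical helper; it reads the table on stdin and assumes
# the raw value is the tenth whitespace-separated column.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}
```

A possible use is smartctl -A /dev/sdb | smart_raw 5, which for the output above would print 2.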
It’s actually quite simple to force mdadm to continue using sdb as if nothing ever happened:
$ mdadm --assemble --force /dev/md3 /dev/sdb6 /dev/sdc6
(mdadm will complain about being forced to increase the event counter of sdb6)
$ mdadm /dev/md3 --add /dev/sdd6
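While the array rebuilds, the kernel reports progress in /proc/mdstat. The following sketch extracts the percentage from that file; it assumes the progress line contains text of the form "recovery = 12.6%", and the function name rebuild_progress is hypothetical.

```shell
# Print the rebuild percentage found in /proc/mdstat-style text on stdin.
# rebuild_progress is a hypothetical helper; it assumes a progress line
# containing "recovery = <number>%". Prints nothing when no rebuild is running.
rebuild_progress() {
    sed -n 's/.*recovery *= *\([0-9.]*\)%.*/\1/p'
}
```

Running rebuild_progress < /proc/mdstat every now and then shows how far the sync has come.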
And a few minutes later, the array is as good as new!