Computer Hung
My Linux box hung yesterday! When a Linux box hangs, you normally blame (1) a program leaking memory, (2) an I/O-hungry program, or (3) a hardware failure. I checked my Conky monitor: no program was swallowing my memory or I/O bandwidth. A hardware problem? I checked my /var/log/messages, and something really had gone wrong:
Mar 10 15:53:11 peace-desktop kernel: [ 2434.881018] ata3: hard resetting link
Mar 10 15:53:18 peace-desktop kernel: [ 2442.392316] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 10 15:53:23 peace-desktop kernel: [ 2447.400079] ata3.00: qc timeout (cmd 0xec)
Mar 10 15:53:23 peace-desktop kernel: [ 2447.400092] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 10 15:53:23 peace-desktop kernel: [ 2447.400110] ata3: hard resetting link
Mar 10 15:53:24 peace-desktop kernel: [ 2447.940060] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 10 15:53:24 peace-desktop kernel: [ 2447.943275] ata3.00: configured for UDMA/33
Mar 10 15:53:24 peace-desktop kernel: [ 2447.943311] ata3: EH complete
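Picking the abnormal events out of a long log by eye gets tedious, so here is a rough Python sketch of how they could be tallied per ATA port. The function name `count_abnormal_events` and the marker list are my own illustration, based only on the messages in the excerpt above; the list is certainly not a complete catalogue of libata errors.

```python
import re
from collections import Counter

# Substrings treated as "abnormal" here. This list is an assumption taken
# from the log excerpt in this post, not an exhaustive set of libata errors.
ABNORMAL_MARKERS = (
    "hard resetting link",
    "qc timeout",
    "failed to IDENTIFY",
)

# Matches the port prefix of a kernel ATA message, e.g. "ata3:" or "ata3.00:",
# and captures the port name plus the rest of the message.
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)$")

# The excerpt from /var/log/messages shown in this post.
SAMPLE_LOG = [
    "Mar 10 15:53:11 peace-desktop kernel: [ 2434.881018] ata3: hard resetting link",
    "Mar 10 15:53:18 peace-desktop kernel: [ 2442.392316] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    "Mar 10 15:53:23 peace-desktop kernel: [ 2447.400079] ata3.00: qc timeout (cmd 0xec)",
    "Mar 10 15:53:23 peace-desktop kernel: [ 2447.400092] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "Mar 10 15:53:23 peace-desktop kernel: [ 2447.400110] ata3: hard resetting link",
    "Mar 10 15:53:24 peace-desktop kernel: [ 2447.940060] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    "Mar 10 15:53:24 peace-desktop kernel: [ 2447.943275] ata3.00: configured for UDMA/33",
    "Mar 10 15:53:24 peace-desktop kernel: [ 2447.943311] ata3: EH complete",
]


def count_abnormal_events(lines):
    """Return a Counter keyed by (port, marker) for abnormal ATA messages."""
    counts = Counter()
    for line in lines:
        match = PORT_RE.search(line)
        if not match:
            continue  # not an ATA message
        port, message = match.groups()
        for marker in ABNORMAL_MARKERS:
            if marker in message:
                counts[(port, marker)] += 1
    return counts


# Print one summary line per (port, marker) pair found in the sample.
for (port, marker), n in sorted(count_abnormal_events(SAMPLE_LOG).items()):
    print(f"{port}: {marker} x{n}")
```

To scan a real log, the same function could be fed the lines of /var/log/messages instead of `SAMPLE_LOG`; reading that file usually needs root.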
Was it really a harddisk problem? It turned out not to be that simple!
Diagnosis
Firstly, what harddisk is connected to ata3? I did a dmesg | grep ata3:
[ 0.719901] ata3: SATA max UDMA/133 abar m8192@0xf7f76000 port 0xf7f76200 irq 27
[ 1.251283] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.252639] ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133
[ 1.252642] ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[ 1.254419] ata3.00: configured for UDMA/133
...
[ 1828.599143] ata3.00: exception Emask 0x10 SAct 0xfffd SErr 0x1990000 action 0xe frozen
[ 1828.599153] ata3.00: irq_stat 0x08400000, interface fatal error, PHY RDY changed
[ 1828.599163] ata3: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[ 1828.599180] ata3.00: cmd 61/00:00:3f:c0:0c/04:00:00:00:00/40 tag 0 ncq 524288 out
[ 1828.599189] ata3.00: status: { DRDY }
[ 1828.599202] ata3.00: cmd 61/00:10:3f:48:0c/04:00:00:00:00/40 tag 2 ncq 524288 out
[ 1828.599211] ata3.00: status: { DRDY }
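The line that answers the question is the identification line (the one containing "ATA-8:"), which names the drive model and firmware on that port. As a small sketch, assuming the identification lines always follow the "ataN.NN: ATA-N: model, firmware, max mode" shape seen above, a few lines of Python can build a port-to-drive map from dmesg output. The name `drives_by_port` is made up for this example.

```python
import re

# Matches the identification line libata prints for a detected drive, e.g.
# "ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133".
# Assumption: model and firmware never contain a comma.
IDENT_RE = re.compile(r"(ata\d+)\.\d+: ATA-\d+: ([^,]+), ([^,]+), max (\S+)")

# A few of the dmesg lines shown in this post.
SAMPLE_DMESG = """\
[ 0.719901] ata3: SATA max UDMA/133 abar m8192@0xf7f76000 port 0xf7f76200 irq 27
[ 1.251283] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1.252639] ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133
[ 1.252642] ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[ 1.254419] ata3.00: configured for UDMA/133
"""


def drives_by_port(dmesg_text):
    """Map each ATA port (e.g. "ata3") to the drive identified on it."""
    drives = {}
    for match in IDENT_RE.finditer(dmesg_text):
        port, model, firmware, max_mode = match.groups()
        drives[port] = {
            "model": model.strip(),
            "firmware": firmware.strip(),
            "max_mode": max_mode,
        }
    return drives


# Print one line per port found in the sample.
for port, info in sorted(drives_by_port(SAMPLE_DMESG).items()):
    print(f"{port}: {info['model']} (firmware {info['firmware']}, max {info['max_mode']})")
```

For the sample above this reports the ST3500320AS on ata3, which matches what the grep showed.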
Well, it's my second harddisk. I fired up Palimpsest Disk Utility, and ran a short self-test for the harddisk. I got this:

What bad luck! It is a Seagate Barracuda 500GB harddisk, and it had been used for only 1.5 years. The harddisk was not heavily accessed, as it was used for backup only, and it was usually kept at a healthy temperature below 40°C. Luckily it is still within the warranty period, but in the meantime I would have to use my computer without any backup space. This is not safe, so I decided to buy a new harddisk.
New Harddisk Failed
I bought a Western Digital Caviar Green Power 1TB harddisk for RM289. After connecting it to a SATA socket and partitioning the harddisk, I rebooted the computer. What?! My box didn't use the harddisk! I checked my /var/log/messages to see what joke my box was playing on me!
Mar 10 19:53:26 peace-desktop kernel: [ 563.831340] ata3: hard resetting link
Mar 10 19:53:32 peace-desktop kernel: [ 569.421348] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 10 19:54:02 peace-desktop kernel: [ 599.421657] ata3.00: qc timeout (cmd 0xec)
Mar 10 19:54:02 peace-desktop kernel: [ 599.421670] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Mar 10 19:54:02 peace-desktop kernel: [ 599.421684] ata3.00: disabled
Mar 10 19:54:02 peace-desktop kernel: [ 599.421708] ata3: hard resetting link
Mar 10 19:54:07 peace-desktop kernel: [ 604.960046] ata3: link is slow to respond, please be patient (ready=0)
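One detail worth noticing in these logs is the SStatus value: 123 at boot (link up at 3.0 Gbps) but 113 after each reset (link up at only 1.5 Gbps). The sketch below decodes that number. It assumes the kernel prints SStatus in hexadecimal and that the fields follow the SATA SStatus register layout as I understand it (bits 0-3 device detection, bits 4-7 negotiated speed, bits 8-11 power state). The function name `decode_sstatus` and the wording of the descriptions are mine.

```python
# Field meanings of the SATA SStatus register, as I understand the layout.
# Values not listed here are reported as "reserved".
DETECTION = {
    0x0: "no device detected",
    0x1: "device detected, no PHY communication",
    0x3: "device detected, PHY communication established",
    0x4: "PHY offline",
}
SPEED = {
    0x0: "no negotiated speed",
    0x1: "1.5 Gbps",
    0x2: "3.0 Gbps",
    0x3: "6.0 Gbps",
}
POWER = {
    0x0: "not present",
    0x1: "active",
    0x2: "partial",
    0x6: "slumber",
}


def decode_sstatus(hex_text):
    """Decode an SStatus value as printed in the kernel log (hex, no 0x prefix)."""
    value = int(hex_text, 16)
    det = value & 0xF          # bits 0-3: device detection
    spd = (value >> 4) & 0xF   # bits 4-7: negotiated link speed
    ipm = (value >> 8) & 0xF   # bits 8-11: interface power state
    return {
        "detection": DETECTION.get(det, f"reserved ({det})"),
        "speed": SPEED.get(spd, f"reserved ({spd})"),
        "power": POWER.get(ipm, f"reserved ({ipm})"),
    }


# Compare the value seen at boot with the value seen after the resets.
for raw in ("123", "113"):
    print(raw, decode_sstatus(raw))
```

Under these assumptions, both values say a device is present and the link is active; only the speed field differs, which agrees with the "3.0 Gbps" and "1.5 Gbps" text the kernel prints next to them.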
Similar errors? Wait, was it a motherboard or cable problem rather than a harddisk problem? I shut down my box and connected the harddisk to another SATA socket. I booted up my box again and this time the harddisk was detected! Running dmesg confirmed the change and showed no more error messages:
[ 1.242115] ata2.00: ATA-8: WDC WD10EARS-00Z5B1, 80.00A80, max UDMA/133
[ 1.242118] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
[ 1.244401] ata2.00: configured for UDMA/133
Great! Note that the harddisk is now connected to ata2 instead of ata3. I then rsync-ed 300GB of important data to the harddisk, and it survived perfectly after nearly 3 hours of continuous copying. So, could it be that my old Seagate harddisk didn't fail at all, and the ata3 socket was the problem instead?
I have since checked the harddisk again and posted my findings in Part 2.

