Write anything I want to write...


Harddisk Failure? -- Part 1

Computer Hung

My Linux box hung yesterday! When a Linux box hangs, you normally blame (1) a program leaking memory, (2) an I/O-hungry program, or (3) a hardware failure. I checked my Conky monitor: no program was swallowing my memory or I/O bandwidth. Hardware problem? I checked my /var/log/messages, and something had really gone wrong (note the link resets, the timeout and the failed IDENTIFY):

  Mar 10 15:53:11 peace-desktop kernel: [ 2434.881018] ata3: hard resetting link
  Mar 10 15:53:18 peace-desktop kernel: [ 2442.392316] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  Mar 10 15:53:23 peace-desktop kernel: [ 2447.400079] ata3.00: qc timeout (cmd 0xec)
  Mar 10 15:53:23 peace-desktop kernel: [ 2447.400092] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  Mar 10 15:53:23 peace-desktop kernel: [ 2447.400110] ata3: hard resetting link
  Mar 10 15:53:24 peace-desktop kernel: [ 2447.940060] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  Mar 10 15:53:24 peace-desktop kernel: [ 2447.943275] ata3.00: configured for UDMA/33
  Mar 10 15:53:24 peace-desktop kernel: [ 2447.943311] ata3: EH complete
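
If you would rather not eyeball the log, a rough Python sketch like the one below could pick out the suspicious lines. The function name abnormal_events and the list of "abnormal" message fragments are my own assumptions, taken only from the excerpt above; they are not a complete catalogue of libata errors.

```python
import re

# Message fragments treated as abnormal. This list is an assumption based on
# the log excerpt in this post, not an exhaustive set of kernel messages.
ABNORMAL_PATTERNS = (
    "hard resetting link",
    "qc timeout",
    "failed to IDENTIFY",
)

# Captures the port ("ata3") and the message that follows "ata3:" or "ata3.00:".
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)$")


def abnormal_events(lines):
    """Return (port, message) pairs for log lines that look abnormal."""
    events = []
    for line in lines:
        match = PORT_RE.search(line)
        if match is None:
            continue  # not an ataN line
        port, message = match.groups()
        if any(pattern in message for pattern in ABNORMAL_PATTERNS):
            events.append((port, message))
    return events


# A few lines copied from the excerpt above.
SAMPLE = [
    "Mar 10 15:53:11 peace-desktop kernel: [ 2434.881018] ata3: hard resetting link",
    "Mar 10 15:53:18 peace-desktop kernel: [ 2442.392316] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    "Mar 10 15:53:23 peace-desktop kernel: [ 2447.400079] ata3.00: qc timeout (cmd 0xec)",
    "Mar 10 15:53:23 peace-desktop kernel: [ 2447.400092] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "Mar 10 15:53:24 peace-desktop kernel: [ 2447.943311] ata3: EH complete",
]

for port, message in abnormal_events(SAMPLE):
    print(port, "|", message)
```

To run it against a real log, the same function can be fed the lines of an open file instead of SAMPLE.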

Was it really a harddisk problem? It turned out not to be that simple!

Diagnosis

First, which harddisk is connected to ata3? I ran dmesg | grep ata3:

  [    0.719901] ata3: SATA max UDMA/133 abar m8192@0xf7f76000 port 0xf7f76200 irq 27
  [    1.251283] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  [    1.252639] ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133
  [    1.252642] ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32)
  [    1.254419] ata3.00: configured for UDMA/133
  ...
  [ 1828.599143] ata3.00: exception Emask 0x10 SAct 0xfffd SErr 0x1990000 action 0xe frozen
  [ 1828.599153] ata3.00: irq_stat 0x08400000, interface fatal error, PHY RDY changed
  [ 1828.599163] ata3: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
  [ 1828.599180] ata3.00: cmd 61/00:00:3f:c0:0c/04:00:00:00:00/40 tag 0 ncq 524288 out
  [ 1828.599189] ata3.00: status: { DRDY }
  [ 1828.599202] ata3.00: cmd 61/00:10:3f:48:0c/04:00:00:00:00/40 tag 2 ncq 524288 out
  [ 1828.599211] ata3.00: status: { DRDY }
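
The port-to-drive lookup can also be scripted. The sketch below is only an illustration: the helper drives_by_port is hypothetical, and the regular expression assumes the identification line looks like the "ataN.00: ATA-8: MODEL, FIRMWARE, max ..." lines shown in this post, which may differ on other kernels.

```python
import re

# Matches identification lines such as
# "ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133".
# The exact format is assumed from the dmesg output in this post.
IDENT_RE = re.compile(r"\b(ata\d+)\.\d+: ATA-\d+: ([^,]+), ([^,]+), max")


def drives_by_port(dmesg_lines):
    """Map each port name to the (model, firmware) reported on it."""
    drives = {}
    for line in dmesg_lines:
        match = IDENT_RE.search(line)
        if match:
            port, model, firmware = match.groups()
            drives[port] = (model.strip(), firmware.strip())
    return drives


# Lines copied from the dmesg output above.
SAMPLE = [
    "[    1.251283] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "[    1.252639] ata3.00: ATA-8: ST3500320AS, SD15, max UDMA/133",
    "[    1.254419] ata3.00: configured for UDMA/133",
]

print(drives_by_port(SAMPLE))
```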

Well, it's my second harddisk. I fired up Palimpsest Disk Utility and ran a short self-test on it. I got this:

[screenshot of the self-test result]

What bad luck! It is a Seagate Barracuda 500GB harddisk, and it had been used for only 1.5 years. It was not heavily accessed, as it was used for backups only, and it was usually kept at a healthy temperature below 40°C. Luckily it is still within its warranty period, but in the meantime I would have to use my computer without any backup space. This is not safe, so I decided to buy a new harddisk.

New Harddisk Failed

I bought a Western Digital Caviar Green Power 1TB harddisk for RM289. After connecting it to a SATA socket and partitioning it, I rebooted the computer. What?! My box didn't use the harddisk! I checked my /var/log/messages to see what joke my box was playing on me:

  Mar 10 19:53:26 peace-desktop kernel: [  563.831340] ata3: hard resetting link
  Mar 10 19:53:32 peace-desktop kernel: [  569.421348] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
  Mar 10 19:54:02 peace-desktop kernel: [  599.421657] ata3.00: qc timeout (cmd 0xec)
  Mar 10 19:54:02 peace-desktop kernel: [  599.421670] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  Mar 10 19:54:02 peace-desktop kernel: [  599.421684] ata3.00: disabled
  Mar 10 19:54:02 peace-desktop kernel: [  599.421708] ata3: hard resetting link
  Mar 10 19:54:07 peace-desktop kernel: [  604.960046] ata3: link is slow to respond, please be patient (ready=0)

Similar errors? Wait, was it a motherboard or cable problem rather than a harddisk problem? I shut down my box and connected the harddisk to another SATA socket. I booted up my box again and this time the harddisk was detected! Running dmesg confirmed the change and showed no more error messages:

  [    1.242115] ata2.00: ATA-8: WDC WD10EARS-00Z5B1, 80.00A80, max UDMA/133
  [    1.242118] ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 31/32)
  [    1.244401] ata2.00: configured for UDMA/133

Great! Note that the harddisk is now connected to ata2 instead of ata3. I then rsync-ed 300GB of important data to the harddisk, and it survived perfectly after nearly 3 hours of continuous copying. So, could it be that my old Seagate harddisk didn't fail at all, and the problem is the ata3 socket instead?
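
The reasoning here is a simple cross-check: if two different drives both throw errors on the same port, and one of them behaves on another port, the port (or its cable) becomes the prime suspect. Here is a minimal sketch of that rule, assuming the observations are written down by hand as (port, model, had_errors) tuples. The function suspect_ports and the tuple layout are my own invention for illustration, and the result is only a hint, not a diagnosis.

```python
from collections import defaultdict


def suspect_ports(observations):
    """Return ports on which at least two different drive models had errors.

    observations is an iterable of (port, model, had_errors) tuples.
    """
    failed_models = defaultdict(set)
    for port, model, had_errors in observations:
        if had_errors:
            failed_models[port].add(model)
    return sorted(
        port for port, models in failed_models.items() if len(models) >= 2
    )


# What was observed in this post, entered manually.
OBSERVATIONS = [
    ("ata3", "ST3500320AS", True),           # old Seagate on ata3: errors
    ("ata3", "WDC WD10EARS-00Z5B1", True),   # new WD on ata3: errors
    ("ata2", "WDC WD10EARS-00Z5B1", False),  # new WD on ata2: clean
]

print(suspect_ports(OBSERVATIONS))  # prints ['ata3']
```

Note that this cannot tell the socket from the cable plugged into it; separating those two would need one more swap.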

I checked the harddisk again and posted my findings in Part 2.
