Programming Answer: Persistent problem with losing SATA harddiscs after weeks/months of uptime

I've got a problem which has been keeping me somewhat busy for about a year. At random intervals, SATA harddiscs simply get lost after several unsuccessful hard resets. This is not related to thermal issues (I keep a complete log of all temperature sensors), nor to the load on the system (in fact it seems more likely to happen on an idle system). I recently switched from 2.6.26 to 2.6.32 and the problem has become noticeably worse, with about biweekly crashes (before: once every three months on average).

A typical log excerpt for such an event looks like this (from the 2.6.32 kernel):

ata1: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
ata1: irq_stat 0x00400000, PHY RDY changed
ata1: SError: { Persist PHYRdyChg 10B8B }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: link online but device misclassifed
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: link online but device misclassifed
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: link online but device misclassifed
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1: hard resetting link
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 01 06 9b 8f 00 00 10 00
end_request: I/O error, dev sda, sector 17210255
Buffer I/O error on device sda1, logical block 2151274
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 2151275
lost page write due to I/O error on sda1
sd 0:0:0:0: rejecting I/O to offline device
Aborting journal on device sda1.
sd 0:0:0:0: rejecting I/O to offline device
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: link online but device misclassifed
ata1: link online but 1 devices misclassified, retrying
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
ata1: reset failed (errno=-11), retrying in 5 secs
ata1: hard resetting link
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: link online but device misclassifed
ata1: link online but 1 devices misclassified, retrying
ata1: reset failed (errno=-11), retrying in 5 secs
sd 0:0:0:0: rejecting I/O to offline device
ata1: hard resetting link
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: link online but device misclassifed
ata1: link online but 1 devices misclassified, retrying
ata1: reset failed (errno=-11), retrying in 30 secs
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: rejecting I/O to offline device
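
For reading the first lines of that log: the SErr value in the exception line is a bit mask, and the names in the "SError: { ... }" line are just its set bits spelled out. Below is a minimal Python sketch of that decoding. The bit-to-name table is written from memory of the labels libata prints and the SATA SError register layout, so treat it as an assumption and check it against your kernel source; `decode_serror` is a hypothetical helper name, not part of any tool.

```python
# Bit positions in the SATA SError register, with the short names that
# libata prints in "SError: { ... }" lines (table is an assumption, see above).
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


# SErr value taken from the log above.
print(decode_serror(0x90200))  # -> ['Persist', 'PHYRdyChg', '10B8B']
```

All three bits describe the link itself (a persistent communication error, the PHY ready state changing, and a 10b-to-8b decode error) rather than a failed command on the drive.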

Kernel 2.6.26 gives slightly different logs (these are the last log entries before losing the root partition, which is where /var/log is stored... I'm thinking of using a software RAID across all my harddiscs for /var/log :->>)

Jun  3 15:49:57 athlon64 kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
Jun  3 15:50:10 athlon64 kernel: ata1: irq_stat 0x00400000, PHY RDY changed
Jun  3 15:50:10 athlon64 kernel: ata1: SError: { Persist PHYRdyChg 10B8B }
Jun  3 15:50:10 athlon64 kernel: ata1: hard resetting link
Jun  3 15:50:10 athlon64 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun  3 15:50:10 athlon64 kernel: ata1: failed to recover some devices, retrying in 5 secs
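
The SStatus and SControl numbers differ between the two kernels' logs, so it may help to split them into their fields. A rough Python sketch follows, assuming the values are printed in hex without a prefix and follow the usual SATA layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8); `split_scr` and the lookup table are names made up for this illustration.

```python
def split_scr(value):
    """Split a SATA status/control register value into its three 4-bit fields."""
    return {
        "det": value & 0xF,         # bits 3:0, device detection
        "spd": (value >> 4) & 0xF,  # bits 7:4, speed (negotiated or allowed)
        "ipm": (value >> 8) & 0xF,  # bits 11:8, interface power management
    }


# Meaning of the SPD field values that occur in the logs above (assumed layout).
SPEED = {
    0: "none negotiated / no limit",
    1: "Gen1 (1.5 Gbps)",
    2: "Gen2 (3.0 Gbps)",
}

# Register values copied from the 2.6.32 and 2.6.26 logs.
for label, raw in [("SStatus", "113"), ("SControl", "310"),
                   ("SStatus", "0"), ("SControl", "300")]:
    fields = split_scr(int(raw, 16))
    print(label, raw, fields, SPEED.get(fields["spd"], "unknown"))
```

Read this way, "SStatus 113" means a device is detected with the PHY up at Gen1 speed, and "SControl 310" means the link is limited to Gen1 with the partial and slumber power states disabled. In the 2.6.26 log, "SStatus 0" means no device was detected at all after the hard reset, and "SControl 300" shows no speed limit was set at that time.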

I've tested multiple harddiscs on multiple ports of the same controller (including one SSD), reduced the SATA bandwidth to 1.5 Gbps, switched off NCQ for all devices, applied several BIOS updates (and an SSD firmware update), and switched from pure AHCI to IDE mode in the BIOS - no effect. I'm using the Asus M3A79-T Deluxe motherboard, which has an AMD SB750 SATA controller...

00:11.0 SATA controller [0106]: ATI Technologies Inc Unknown device [1002:4390] (prog-if 01 [AHCI 1.0])
        Subsystem: ASUSTeK Computer Inc. Unknown device [1043:81ef]
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 22
        I/O ports at 9000 [size=8]
        I/O ports at 8000 [size=4]
        I/O ports at 7000 [size=8]
        I/O ports at 6000 [size=4]
        I/O ports at 5000 [size=16]
        Memory at f6fff800 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [60] Power Management version 2
        Capabilities: [70] #12 [0010]

... and seems to need the SB600 PMP workaround during startup and sometimes during operation. In some cases the SB600 PMP SRST workaround appears before a device goes dead, but in most cases it does not. This is what it looks like in dmesg during boot:

ata3: softreset failed (device not ready)
ata2: softreset failed (device not ready)
ata2: applying SB600 PMP SRST workaround and retrying
ata1: softreset failed (device not ready)
ata1: applying SB600 PMP SRST workaround and retrying
ata3: applying SB600 PMP SRST workaround and retrying
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: FORCE: horkage modified (noncq)
ata1.00: ATA-8: OCZ-VERTEX, 1.41, max UDMA/133
ata1.00: 62533296 sectors, multi 1: LBA48 NCQ (not used)
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access     ATA      OCZ-VERTEX       1.41 PQ: 0 ANSI: 5
ata2.00: FORCE: horkage modified (noncq)
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
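
To check more systematically whether the SB600 workaround message shows up before a device goes dead, the kernel log can be grouped into per-port incidents. The following Python sketch is one way to do that, under the assumption that an incident starts at an "exception Emask" line and that the message texts match the excerpts above; `summarize` and the dictionary keys are hypothetical names chosen for this example.

```python
import re

# Matches "ata1: ..." as well as "ata1.00: ...", with or without a syslog prefix.
PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)")


def summarize(lines):
    """Group kernel log lines into per-port incidents, each starting at an exception line."""
    incidents = []
    current = {}  # port name -> most recent incident on that port
    for line in lines:
        match = PORT_RE.search(line)
        if not match:
            continue
        port, msg = match.group(1), match.group(2).strip()
        if msg.startswith("exception Emask"):
            current[port] = {"port": port, "hard_resets": 0,
                             "workaround": False, "disabled": False}
            incidents.append(current[port])
        elif port in current:
            incident = current[port]
            if msg.startswith("hard resetting link"):
                incident["hard_resets"] += 1
            elif "SB600 PMP SRST workaround" in msg:
                incident["workaround"] = True
            elif msg == "disabled":
                incident["disabled"] = True
    return incidents


if __name__ == "__main__":
    # A few lines in the style of the 2.6.32 log above.
    sample = [
        "ata1: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen",
        "ata1: hard resetting link",
        "ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
        "ata1: hard resetting link",
        "ata1.00: disabled",
    ]
    for incident in summarize(sample):
        print(incident)
```

Boot-time workaround messages are ignored here because they appear before any exception line on the port; only a workaround logged between the exception and the "disabled" line is counted for that incident.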

I'm really stumped by this problem, any ideas? I'm even thinking of getting a PCIe SATA controller that is well supported and stable under 2.6.32 (any suggestions?) and no longer using the on-board controller at all.

From serverfault Alexsee

What drives? Velociraptors have a known issue with some firmware versions that results in malfunctions roughly every 50 days, lasting a few seconds. If that happens at the wrong time... well... they drop out. Other drives sometimes have problems too. Looking for firmware updates sounds like a decent idea.

Alexsee : This problem appears for all kinds of drives, including Western Digital (WD15EADS-00R6B0), Seagate (ST32000542AS rev. CC32, ST31500341AS rev. CC1H), Hitachi (HTS541680J9SA00) and the OCZ Vertex SSD (rev. 1.41). I initially had the problem with the OCZ Vertex: I removed it, checked it, and replaced it with a notebook harddisc (Hitachi). Then the same problem appeared with the first Seagate, which was in turn replaced by a 2TB Hitachi, and so on. No single HD was in the system the whole time, and none of the HDs showed any problem in another system under extensive read/write tests.

From TomTom
We're now having this issue as well, on hardware that has been running Linux (Debian, then Ubuntu) stably for years. Suddenly, in the last few weeks, 5 identical machines exhibit this problem. And these are the simplest imaginable servers: no hardware RAID, Xeon, SATA.

It smacks of a newly introduced kernel bug.



Tuesday, January 25, 2011
