2009-09-07 12:29:37

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Sorry for the late follow-up to this thread, but I'm also seeing symptoms that
look identical to these and would be grateful for any advice. I think I can
reasonably rule out a single faulty drive, controller or cable set, as I'm
seeing it across a cluster of Supermicro machines, each with six Seagate
ST3750523AS SATA drives, and the drive that times out is apparently randomly
distributed across the cluster. (Of course, since the hardware is identical, it
could still be a hardware design or firmware problem.)

We're running x86-64 2.6.30.5 (and previously 2.6.30.4, where we also saw the
problem) with these drives on top of an on-motherboard ahci controller:

ahci 0000:00:1f.2: version 3.0
alloc irq_desc for 19 on cpu 0 node 0
alloc kstat_irqs on cpu 0 node 0
ahci 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f impl RAID mode
ahci 0000:00:1f.2: flags: 64bit ncq pm led pmp slum part
ahci 0000:00:1f.2: setting latency timer to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
scsi4 : ahci
scsi5 : ahci
ata1: SATA max UDMA/133 abar m1024@0xd8500400 port 0xd8500500 irq 19
ata2: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 19
ata3: SATA max UDMA/133 irq 19
ata4: SATA max UDMA/133 irq_stat 0x00400040, connection status changed
ata5: SATA max UDMA/133 irq_stat 0x00400040, connection status changed
ata6: SATA max UDMA/133 irq_stat 0x00400040, connection status changed
[...]
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: ST3750523AS, CC34, max UDMA/133
ata1.00: 1465149168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
scsi 0:0:0:0: Direct-Access ATA ST3750523AS CC34 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 1465149168 512-byte hardware sectors: (750 GB/698 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda:<5>sd 0:0:0:0: Attached scsi generic sg0 type 0
sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
[etc]

These machines host large numbers of kvm virtual machines and have three RAID
arrays: a RAID1 of /dev/sd[abcdef]1 for /, a RAID10 of /dev/sd[abcdef]2 for
swap, and a large RAID10 of /dev/sd[abcdef]3 which is used as an LVM2 PV, out
of which the virtual drives are carved.

Everything will be running fine when suddenly:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: softreset failed (device not ready)
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
end_request: I/O error, dev sda, sector 1465147272
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sda3, disabling device.
raid10: Operation continuing on 5 devices.

The drive shows no errors in the SMART log, it doesn't show any read or write
problems at sector 1465147272 (verified with an O_DIRECT dd), and the
reallocated sector count is still zero. mdadm --remove followed by mdadm --add
on the component brings everything back to life fine.
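For reference, those verification and re-add steps look like this (a hedged
sketch: the sector number is the one from the log above, while /dev/sda,
/dev/md2 and /dev/sda3 stand in for whichever device, array and member are
affected):

```python
# Hedged sketch of the verification and recovery steps described above.
# /dev/sda, /dev/md2 and /dev/sda3 are examples (the affected member
# varies); the sector number is the one reported in the log.
SECTOR = 1465147272

def verify_cmd(dev="/dev/sda", sector=SECTOR, count=8):
    # O_DIRECT read around the reported sector, bypassing the page
    # cache so the read really hits the platters
    return (f"dd if={dev} of=/dev/null bs=512 "
            f"skip={sector} count={count} iflag=direct")

def readd_cmds(array="/dev/md2", member="/dev/sda3"):
    # drop the kicked component, then add it back so md resyncs it
    return [f"mdadm {array} --remove {member}",
            f"mdadm {array} --add {member}"]

print(verify_cmd())
for cmd in readd_cmds():
    print(cmd)
```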

I'm using the deadline IO scheduler with the default 5s timeout value for these
disks, in case that could be part of the problem. LVM2 does O_DIRECT reads and
writes to the md array to manipulate its metadata, and my virtual machines are
doing quite a lot of IO to the logical volumes too (they open them O_SYNC), so
the arrays are reasonably heavily loaded.

I think there are two things that concern me here. One, obviously, is that the
timeouts and resets are happening at all, and I'd like to get to the bottom
of this. The other is that the response to a SCSI disk having to be reset is
to chuck it out of the array, even though it'll be fine following the reset!
I'd very much like to stop this happening somehow.

A couple of other problems I'm seeing, which may be connected, and make this
problem more painful...

Perhaps it's inevitable given the IO load the machine is under, but I find I
need to set /proc/sys/dev/raid/speed_limit_min right down to 0 before any
resync, otherwise all disk accesses seem to deadlock completely, including the
RAID resync itself. In this state, all attempts to dd even a single block
from /dev/md2 will hang forever. With the speed limit minimum reduced to 0,
resync proceeds fine most of the time, occasionally dropping down to zero
speed when other IO is going on (which is fine), but apparently not
deadlocking.
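To make that workaround persistent across reboots, the same knob can be set
via sysctl (a hedged sketch, assuming the standard dev.raid sysctl names that
mirror /proc/sys/dev/raid):

```
# /etc/sysctl.conf fragment -- equivalent to
# echo 0 > /proc/sys/dev/raid/speed_limit_min
dev.raid.speed_limit_min = 0
```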

I have a bitmap on the array, but sometimes when I remove and re-add a
failed component, it doesn't seem to use the bitmap and does a lengthy full
recovery instead. One example that's ongoing at the moment:

[=>...................] recovery = 5.7% (40219648/703205312) finish=7546.3min speed=1463K/sec
bitmap: 34/126 pages [136KB], 8192KB chunk

which is rather painful and has to be throttled back with speed_limit_max to
avoid the virtual machines running on top of it suffering extremely poor IO
latency.
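For what it's worth, the finish= figure is consistent with the raw numbers on
that line: the (done/total) counts in /proc/mdstat are in 1K blocks and the
speed is in K/sec, so the units cancel and the ETA can be re-derived directly
(a quick sanity-check helper; md smooths its speed estimate, hence the small
difference from its own finish= value):

```python
import re

# Re-derive the recovery ETA from the raw numbers in the mdstat line
# quoted above. done/total are 1K blocks, speed is K/sec.
LINE = ("[=>...................] recovery = 5.7% "
        "(40219648/703205312) finish=7546.3min speed=1463K/sec")

def recovery_eta_minutes(line):
    m = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    done, total, speed = (int(g) for g in m.groups())
    return (total - done) / speed / 60.0

print(round(recovery_eta_minutes(LINE)))  # ~7553, vs md's finish=7546.3min
```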

Best wishes,

Chris.


2009-09-07 12:03:56

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Chris Webb <[email protected]> writes:

> I have a bitmap on the array, but sometimes when I remove and re-add a
> failed component, it doesn't seem to use the bitmap and does a lengthy full
> recovery instead. One example that's ongoing at the moment:-
>
> [=>...................] recovery = 5.7% (40219648/703205312) finish=7546.3min speed=1463K/sec
> bitmap: 34/126 pages [136KB], 8192KB chunk
>
> which is rather painful and has to be throttled back with speed_limit_max to
> avoid the virtual machines running on top of it from having extremely poor IO
> latency.

I've also noticed that during this recovery, I'm seeing lots of timeouts but
they don't seem to interrupt the resync:

05:47:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
05:47:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
05:47:39 res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
05:47:39 ata5.00: status: { DRDY }
05:47:39 ata5: hard resetting link
05:47:49 ata5: softreset failed (device not ready)
05:47:49 ata5: hard resetting link
05:47:49 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
05:47:49 ata5.00: configured for UDMA/133
05:47:49 ata5: EH complete

08:17:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
08:17:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
08:17:39 res 40/00:00:35:83:f8/00:00:4d:00:00/40 Emask 0x4 (timeout)
08:17:39 ata5.00: status: { DRDY }
08:17:39 ata5: hard resetting link
08:17:49 ata5: softreset failed (device not ready)
08:17:49 ata5: hard resetting link
08:17:49 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
08:17:49 ata5.00: configured for UDMA/133
08:17:49 ata5: EH complete

10:22:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
10:22:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
10:22:39 res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
10:22:39 ata5.00: status: { DRDY }
10:22:39 ata5: hard resetting link
10:22:49 ata5: softreset failed (device not ready)
10:22:49 ata5: hard resetting link
10:22:50 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
10:22:51 ata5.00: configured for UDMA/133
10:22:51 ata5: EH complete

Cheers,

Chris.

2009-09-07 17:05:21

by Allan Wind

Subject: Re: MD/RAID time out writing superblock

On 2009-09-07T12:44:42, Chris Webb wrote:
> Sorry for the late follow up to this thread, but I'm also seeing symptoms that
> look identical to these and would be grateful for any advice. I think I can
> reasonably rule out a single faulty drive, controller or cabling set as I'm
> seeing it across a cluster of Supermicro machines with six Seagate ST3750523AS
> SATA drives in each and the drive that times out is apparently randomly
> distributed across the cluster. (Of course, since the hardware is identical, it
> could still be a hardware design or firmware problem.)

Seeing the same thing with a Supermicro motherboard and a pair of WDC 2 TB
drives. Disabling NCQ does not resolve the issue, nor does increasing
the safe_mode_delay. This is with 2.6.30.4. This machine is
sitting on its hands (i.e. no significant load).


/Allan
--
Allan Wind
Life Integrity, LLC
<http://lifeintegrity.com>

2009-09-07 23:26:56

by Thomas Fjellstrom

Subject: Re: MD/RAID time out writing superblock

On Mon September 7 2009, Allan Wind wrote:
> On 2009-09-07T12:44:42, Chris Webb wrote:
> > Sorry for the late follow up to this thread, but I'm also seeing symptoms
> > that look identical to these and would be grateful for any advice. I
> > think I can reasonably rule out a single faulty drive, controller or
> > cabling set as I'm seeing it across a cluster of Supermicro machines with
> > six Seagate ST3750523AS SATA drives in each and the drive that times out
> > is apparently randomly distributed across the cluster. (Of course, since
> > the hardware is identical, it could still be a hardware design or
> > firmware problem.)
>
> Seeing the same thing with a Supermicro motherboard and a pair WDC 2 TB
> drives. Disabling NCQ does not resolve the issue, nor increasing
> the safe_mode_delay. This is with 2.6.30.4. This machine is
> sitting on its hand (i.e. no significant load).

I have the same issue with a single WD 2TB Green drive. Technically two, but
the errors always come from the same drive, so I was assuming it was the
drive. I only have to set up the raid0 array and put some light load on it for
the kernel to start complaining, and eventually it just kicks the drive out
completely with the following messages:

sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
end_request: I/O error, dev sdb, sector 202026972

The drive does work fine prior to the frozen timeout errors, and I was using
it in Windows (same raid0 config) just fine with no errors whatsoever.

>
> /Allan
>


--
Thomas Fjellstrom
[email protected]

2009-09-09 12:02:20

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Chris Webb <[email protected]> writes:

> I've also noticed that during this recovery, I'm seeing lots of timeouts but
> they don't seem to interrupt the resync:
>
> 05:47:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> 05:47:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
> 05:47:39 res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
> 05:47:39 ata5.00: status: { DRDY }
> 05:47:39 ata5: hard resetting link
> 05:47:49 ata5: softreset failed (device not ready)
> 05:47:49 ata5: hard resetting link
> 05:47:49 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 05:47:49 ata5.00: configured for UDMA/133
> 05:47:49 ata5: EH complete
>
> 08:17:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> 08:17:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
> 08:17:39 res 40/00:00:35:83:f8/00:00:4d:00:00/40 Emask 0x4 (timeout)
> 08:17:39 ata5.00: status: { DRDY }
> 08:17:39 ata5: hard resetting link
> 08:17:49 ata5: softreset failed (device not ready)
> 08:17:49 ata5: hard resetting link
> 08:17:49 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 08:17:49 ata5.00: configured for UDMA/133
> 08:17:49 ata5: EH complete
>
> 10:22:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> 10:22:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
> 10:22:39 res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
> 10:22:39 ata5.00: status: { DRDY }
> 10:22:39 ata5: hard resetting link
> 10:22:49 ata5: softreset failed (device not ready)
> 10:22:49 ata5: hard resetting link
> 10:22:50 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 10:22:51 ata5.00: configured for UDMA/133
> 10:22:51 ata5: EH complete

... the difference being that a timeout which causes a super_written failure
seems to return an I/O error whereas the others don't:

ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
end_request: I/O error, dev sde, sector 1465147272
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.

I wonder what's different about these two timeouts such that one causes an I/O
error while the other just causes a retry after reset? Presumably if the
failing one were also just retried, everything would be (closer to being) fine.

Cheers,

Chris.

2009-09-14 07:45:09

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Hello, Chris.

Chris Webb wrote:
> Chris Webb <[email protected]> writes:
>
>> I've also noticed that during this recovery, I'm seeing lots of timeouts but
>> they don't seem to interrupt the resync:
>>
>> 05:47:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> 05:47:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
>> 05:47:39 res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
>> 05:47:39 ata5.00: status: { DRDY }
>> 05:47:39 ata5: hard resetting link
>> 05:47:49 ata5: softreset failed (device not ready)
>> 05:47:49 ata5: hard resetting link
>> 05:47:49 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> 05:47:49 ata5.00: configured for UDMA/133
>> 05:47:49 ata5: EH complete
>>
>> 08:17:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> 08:17:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
>> 08:17:39 res 40/00:00:35:83:f8/00:00:4d:00:00/40 Emask 0x4 (timeout)
>> 08:17:39 ata5.00: status: { DRDY }
>> 08:17:39 ata5: hard resetting link
>> 08:17:49 ata5: softreset failed (device not ready)
>> 08:17:49 ata5: hard resetting link
>> 08:17:49 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> 08:17:49 ata5.00: configured for UDMA/133
>> 08:17:49 ata5: EH complete
>>
>> 10:22:39 ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> 10:22:39 ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
>> 10:22:39 res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
>> 10:22:39 ata5.00: status: { DRDY }
>> 10:22:39 ata5: hard resetting link
>> 10:22:49 ata5: softreset failed (device not ready)
>> 10:22:49 ata5: hard resetting link
>> 10:22:50 ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> 10:22:51 ata5.00: configured for UDMA/133
>> 10:22:51 ata5: EH complete
>
> ... the difference being that a timeout which causes a super_written failure
> seems to return an I/O error whereas the others don't:

The above are IDENTIFY commands. Who's issuing IDENTIFY regularly? It isn't
coming from the regular IO paths or md; it's probably being issued via SG_IO
from userland. These failures don't affect normal operation.
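The opcode is the first byte of the "cmd" field in those log lines; a minimal
decoder (opcode table deliberately limited to the commands that appear in this
thread) distinguishes the two cases:

```python
# Decode the ATA opcode from libata's "cmd XX/..." log lines. The
# table is tiny on purpose -- just the commands seen in this thread.
ATA_OPCODES = {
    0xEC: "IDENTIFY DEVICE",
    0xEA: "FLUSH CACHE EXT",
    0xE7: "FLUSH CACHE",
}

def decode_cmd(line):
    tf = line.split("cmd ", 1)[1]           # e.g. "ec/00:01:..."
    opcode = int(tf.split("/", 1)[0], 16)   # first byte is the command
    return ATA_OPCODES.get(opcode, f"unknown (0x{opcode:02x})")

# The resync-time timeouts are IDENTIFY; the array-killing one is FLUSH:
print(decode_cmd("ata5.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in"))
print(decode_cmd("ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"))
```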

> ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
> ata5.00: status: { DRDY }
> ata5: hard resetting link
> ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata5.00: configured for UDMA/133
> ata5: EH complete
> end_request: I/O error, dev sde, sector 1465147272
> md: super_written gets error=-5, uptodate=0
> raid10: Disk failure on sde3, disabling device.
>
> I wonder what's different about these two timeouts such that one causes an I/O
> error and the other just causes a retry after reset? Presumably if the latter
> was also just a retry, everything would be (closer to being) fine.

Because this error is actually seen by the md layer, and FLUSH in
general can't be retried cleanly: on retry, the drive goes on from the
sectors after the point of failure. I'm not sure whether FLUSH is
actually failing here or it's a communication glitch. At any rate, if
FLUSH is failing or timing out, the only right thing to do is to kick
the drive out of the array, as keeping it and retrying may lead to
silent data corruption. Seriously, it's most likely a hardware
malfunction, although I can't tell where the problem is with the given
data. Get the hardware fixed.
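Why a failed FLUSH can't simply be retried can be seen with a toy model
(purely illustrative, not real firmware behaviour; it assumes a retried FLUSH
resumes past the failure point rather than starting over):

```python
def flush(cache, media, fail_at=None):
    """Write dirty cached sectors to media; optionally abort at index
    fail_at. Returns the sector it failed on, or None on success."""
    for i, (sector, data) in enumerate(list(cache.items())):
        if i == fail_at:
            return sector              # timeout: this sector never lands
        media[sector] = data
        del cache[sector]
    return None

cache, media = {100: "a", 101: "b", 102: "c"}, {}
failed = flush(cache, media, fail_at=1)    # FLUSH times out on sector 101
del cache[failed]                          # model: drive resumes past it
flush(cache, media)                        # retry "succeeds"...
assert 101 not in media                    # ...but sector 101 was lost
```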

Thanks.

--
tejun

2009-09-14 07:44:55

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
>> I wonder what's different about these two timeouts such that one causes an I/O
>> error and the other just causes a retry after reset? Presumably if the latter
>> was also just a retry, everything would be (closer to being) fine.
>
> Because this error is actually seen by the md layer and FLUSH in
> general can't be retried cleanly. On retrial, the drive goes on and
> retry the sectors after the point of failure. I'm not sure whether
> FLUSH is actually failing here or it's a communication glitch. At any
> rate, if FLUSH is failing or timing out, the only right thing to do is
> to kick it out of the array as keeping after retrying may lead to
> silent data corruption. Seriously, it's most likely a hardware
> malfunction although I can't tell where the problem is with the given
> data. Get the hardware fixed.

Oooh, another possibility is the continuous IDENTIFY tries above.
Doing things like that generally isn't a good idea, because vendors
don't expect IDENTIFY to be mixed regularly with normal IO, and
firmware isn't tested against that. Even SMART commands sometimes
cause problems. So, finding out what is obsessed with the identity of
the drive and stopping it might help.

Thanks.

--
tejun

2009-09-14 07:53:23

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Hello,

Thomas Fjellstrom wrote:
> I have the same issue with a single WD 2TB Green drive. Technically two, but
> it always only gets errors from the same drive, so I was assuming it was the
> drive. I only have to setup the raid0 array, and put some light load on it for
> the kernel to start complaining, and eventually it just kicks the drive
> completely with the following messages:
>
> sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> end_request: I/O error, dev sdb, sector 202026972
>
> The drive does work fine prior to the frozen timeout errors. And I was using
> it in windows (same raid0 config) just fine with no errors what so ever.

Can you post full dmesg output? The above doesn't tell much about ATA
side of things.

Thanks.

--
tejun

2009-09-14 12:48:28

by Mark Lord

Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
..
> Oooh, another possibility is the above continuous IDENTIFY tries.
> Doing things like that generally isn't a good idea because vendors
> don't expect IDENTIFY to be mixed regularly with normal IOs and
> firmwares aren't tested against that. Even smart commands sometimes
> cause problems. So, finding out the thing which is obsessed with the
> identity of the drive and stopping it might help.
..

Bullpucky. That sort of thing, specifically with IDENTIFY,
has never been an issue.

I wonder if the IDENTIFY is actually coming from libata EH
after something else failed?

2009-09-14 13:05:23

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Mark Lord wrote:
> Tejun Heo wrote:
> ..
>> Oooh, another possibility is the above continuous IDENTIFY tries.
>> Doing things like that generally isn't a good idea because vendors
>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>> firmwares aren't tested against that. Even smart commands sometimes
>> cause problems. So, finding out the thing which is obsessed with the
>> identity of the drive and stopping it might help.
> ..
>
> Bullpucky. That sort of thing, specifically with IDENTIFY,
> has never been an issue.

With SMART it has. I wouldn't be too surprised if some new firmware
chokes on repeated IDENTIFY mixed with a stream of NCQ commands. It's
just not something people (including vendors) do regularly.

> I wonder if the IDENTIFY is actually coming from libata EH
> after something else failed ?

In that case, libata-eh would print "ataP.DD: failed to IDENTIFY
(blah, err_mask=0x%x)" instead of the full TF dump.

Thanks.

--
tejun

Subject: Re: MD/RAID time out writing superblock

On Mon, 14 Sep 2009, Tejun Heo wrote:
> Oooh, another possibility is the above continuous IDENTIFY tries.
> Doing things like that generally isn't a good idea because vendors
> don't expect IDENTIFY to be mixed regularly with normal IOs and

IMHO that means the kernel should be special-casing such commands, then (i.e.
quiesce the drive, do the command, then start IO again), probably
rate-limiting them for good measure.

This is the kind of stuff that userspace should NOT have to worry about
(because it will get it wrong and cause data corruption eventually).

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2009-09-14 13:24:50

by Gabor Gombas

Subject: Re: MD/RAID time out writing superblock

On Mon, Sep 14, 2009 at 04:41:56PM +0900, Tejun Heo wrote:

> Because this error is actually seen by the md layer and FLUSH in
> general can't be retried cleanly. On retrial, the drive goes on and
> retry the sectors after the point of failure. I'm not sure whether
> FLUSH is actually failing here or it's a communication glitch. At any
> rate, if FLUSH is failing or timing out, the only right thing to do is
> to kick it out of the array as keeping after retrying may lead to
> silent data corruption.

Hmm, how's that supposed to work with TLER on WD enterprise drives?
Isn't the idea behind TLER to prevent drives being kicked out of the
array because the RAID system can have a much more intelligent
retry/recovery logic than a single drive?

AFAIK md RAID can already take advantage of TLER if the operation that's
failing due to TLER is a READ, but I don't know what happens if TLER
kicks in during a WRITE or a FLUSH.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2009-09-14 13:24:59

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Henrique de Moraes Holschuh wrote:
> On Mon, 14 Sep 2009, Tejun Heo wrote:
>> Oooh, another possibility is the above continuous IDENTIFY tries.
>> Doing things like that generally isn't a good idea because vendors
>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>
> IMHO that means the kernel should be special-casing such commands, then (i.e
> quiesce drive, do command, quiesce driver, start IO again), probably
> rate-limiting it for good effect.
>
> This is the kind of stuff that userspace should NOT have to worry about
> (because it will get it wrong and cause data corruption eventually).

If this indeed is the case (as Mark pointed out, there hasn't been any
precedent involving IDENTIFY, but it's also the first time I've seen
IDENTIFY timeouts issued from userland), this is the kind of thing
that userspace shouldn't do to begin with.

There was another similar problem. Some ACPI package in Ubuntu issues
APM adjustment commands whenever power-related stuff changes. The
firmware on the drive which shipped in the Samsung NC10 for some reason
locks up after being hit with enough of those commands. It's just not
safe to assume this kind of stuff will work reliably. If you're
ready to do some research and experiments, it's fine. If you're doing
OEM customization with specific hardware and QA, sure, why not (this
is basically what Windows OEMs do too). But doing things which
aren't _usually_ done that way repeatedly _by default_ is asking for
trouble. There's a reason why these operations are root only. :-)

Thanks.

--
tejun

Subject: Re: MD/RAID time out writing superblock

On Mon, 14 Sep 2009, Tejun Heo wrote:
> Henrique de Moraes Holschuh wrote:
> > On Mon, 14 Sep 2009, Tejun Heo wrote:
> >> Oooh, another possibility is the above continuous IDENTIFY tries.
> >> Doing things like that generally isn't a good idea because vendors
> >> don't expect IDENTIFY to be mixed regularly with normal IOs and
> >
> > IMHO that means the kernel should be special-casing such commands, then (i.e
> > quiesce drive, do command, quiesce driver, start IO again), probably
> > rate-limiting it for good effect.
> >
> > This is the kind of stuff that userspace should NOT have to worry about
> > (because it will get it wrong and cause data corruption eventually).
>
> If this indeed is the case (As Mark pointed out, there hasn't been any
> precedence involving IDENTIFY but it's also the first time I see
> IDENTIFY timeouts which are issued from userland), this is the kind
> that userspace shouldn't do to begin with.

There are many reasons why userspace would issue IDENTIFY (note: I didn't
say they are good reasons), and offhand I recall hddtemp as a likely
culprit. Also, sometimes the local admin runs hdparm -I for whatever
reason. So I am not surprised someone found a way to cause many IDENTIFY
commands to be issued.

Other SMART-maintenance utilities might issue IDENTIFY as well. And if this
is an issue with SMART in general, smartd issues SMART commands (I don't
know if it uses IDENTIFY) once per hour to check attributes, and can be
configured to fire off SMART short/long/offline tests automatically. The
local admin sends SMART commands (through smartctl) with the disks hot to
check the error log after EH, etc.

IMHO, the kernel really should be protecting userland against data
corruption here, even if it means a massive hit on disk performance while
the SMART commands are being processed.

> There was another similar problem. Some acpi package in ubuntu issues
> APM adjustment commands whenever power related stuff changes. The

Yes. If you fail to do this on ThinkPads (many models, but probably not
all), your disk will break in 1-2 years maximum, and THAT assumes you have
Hitachi notebook HDs, which are supposed to take 600k head unloads before
croaking... most other vendors say they can only do 300k head unloads in
their datasheets (if you can find a datasheet at all). If you need a reason
to buy Hitachi HDs, this is it: they give you full, proper datasheets.

The *firmware* of these laptops will issue these annoying APM commands by
itself when the power state changes, and not even setting the BIOS to
"performance" mode stops the destructive behaviour. So any disk that cannot
take receiving APM commands many times per day will cause problems on such
laptops.

Now, why Ubuntu would do this outside of the ThinkPads, or target anything
other than magnetic disk media, I don't know. Maybe other laptop vendors
also had the same idea. Maybe Ubuntu was simplistic on their approach when
they added this defensive feature. Maybe it was considered a PM feature and
it is not even related to the ThinkPad APM annoyance. You'd have to ask
them.

> firmware on the drive which shipped on Samsung NC10 for some reason
> locks up after being hit with enough of those commands. It's just not
> safe to assume these kind of stuff would reliably work. If you're

Maybe we can blacklist such commands on drives known to misimplement them?

> ready to do some research and experiments, it's fine. If you're doing
> OEM customization with specific hardware and QA, sure, why not (this
> is basically what windows OEMs do too). But, doing things which
> aren't _usually_ used that way repeatedly _by default_ is asking for
> trouble. There's a reason why these operations are root only. :-)

There are real user cases for APM commands, and for SMART commands...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2009-09-14 14:25:17

by Mark Lord

Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>> ..
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that. Even smart commands sometimes
>>> cause problems. So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>> ..
>>
>> Bullpucky. That sort of thing, specifically with IDENTIFY,
>> has never been an issue.
>
> With SMART it has. I wouldn't be too surprised if some new firmware
> chokes on repeated IDENTIFY mixed with stream of NCQ commands. It's
> just not something people (including vendors) do regularly.
..

Yeah, some drives really don't like SMART commands (hddtemp & smartctl).
That's a strange one, too, because the whole idea of SMART
is that it gets used to periodically monitor drive health.

IDENTIFY is much safer -- usually no media access after initial spin-up,
and lots of things exercise it quite regularly.

Pretty much any hdparm command triggers an IDENTIFY beforehand now,
hddtemp and smartctl both use it too.

I suspect we're missing some info from this specific failure.
Looking back at Chris's earlier posting, the whole thing started
with a FLUSH_CACHE_EXT failure. Once that happens, all bets are
off on anything that follows.

> Everything will be running fine when suddenly:
>
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1: hard resetting link
> ata1: softreset failed (device not ready)
> ata1: hard resetting link
> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata1.00: configured for UDMA/133
> ata1: EH complete
> end_request: I/O error, dev sda, sector 1465147272
> md: super_written gets error=-5, uptodate=0
> raid10: Disk failure on sda3, disabling device.
> raid10: Operation continuing on 5 devices.

2009-09-14 14:34:46

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Hello,

Henrique de Moraes Holschuh wrote:
>>> This is the kind of stuff that userspace should NOT have to worry about
>>> (because it will get it wrong and cause data corruption eventually).
>> If this indeed is the case (As Mark pointed out, there hasn't been any
>> precedence involving IDENTIFY but it's also the first time I see
>> IDENTIFY timeouts which are issued from userland), this is the kind
>> that userspace shouldn't do to begin with.
>
> There are many reasons why userspace would issue identify (note: I didn't
> say they are good reasons), and off the hand I recall hddtemp as a likely
> culprit. Also, sometimes the local admin does hdparm -I for whatever
> reason. So, I am not surprised someone found a way to cause many IDENTIFY
> commands to be issued.

Heh... and there have been plenty of IO errors and timeouts coming
from hddtemp. :-)

> Other SMART-maintenance utilities might issue IDENTIFY as well. And if this
> is an issue with SMART in general, smartd issues SMART commands (I don't
> know if it uses IDENTIFY) once per hour to check attributes, and can be
> configured to fire off SMART short/long/offline tests automatically. The
> local admin sends SMART commands (through smartctl) with the disks hot to
> check the error log after EH, etc.
>
> IMHO, the kernel really should be protecting userland against data
> corruption here, even if it means a massive hit on disk performance while
> the SMART commands are being processed.

I don't know. The problem is test coverage. Because those commands
aren't used very often, they don't get tested much, so the coverage of
any blacklist wouldn't be good either. And there's a very good reason
why they aren't used often: they're not all that useful for most
people.

>> There was another similar problem. Some acpi package in ubuntu issues
>> APM adjustment commands whenever power related stuff changes. The
>
> Yes. If you fail to do this on ThinkPads (many models, but probably not
> all), your disk will break in 1-2yr maximum, and THAT assumes you have
> Hitachi notebook HDs that are supposed to take 600k head unloads before
croaking... most other vendors say they can only do 300k head unloads in
> their datasheets (if you can find a datasheet at all). If you need a reason
> to buy Hitachi HDs, this is it: they give you full, proper datasheets.

There are plenty of drives and configurations like that, and different
drives need different APM values to function properly. storage-fixup
deals with exactly this problem.

http://git.kernel.org/?p=linux/kernel/git/tj/storage-fixup.git;a=summary

But please note that it's only done once during boot and resume on
machines which are known to specifically need it and with values
reported to work.
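To illustrate that set-once-at-boot idea, here is a minimal shell sketch with a made-up model-to-value table (storage-fixup's real data differs, and the function name is just for illustration); for safety it only prints the hdparm command it would run rather than executing it:

```shell
# apply one APM value per known drive model, once, at boot.
# The model strings and APM values below are hypothetical examples.
apply_apm() {
    local dev=$1 model=$2
    case $model in
        "ExampleModel A") echo "hdparm -B 254 $dev" ;;  # near-max, no unloads
        "ExampleModel B") echo "hdparm -B 128 $dev" ;;  # moderate power saving
        *) return 0 ;;  # unknown model: leave APM alone
    esac
}
```

In a real boot script the echoed command would be executed for each disk after matching its reported model.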

> The *firmware* of these laptops will issue these annoying APM commands by
> itself when power state changes, and not even setting the BIOS to
> "performance" mode makes it stop with the destructive behaviour. So any
> disk that cannot take receiving APM commands many times per day on such
> laptops will cause problems.

Yeap, well, that's what vendors do. They put together a specific subset
of components and try to figure out configurations which work. If you
replace components on your own, they won't guarantee it will work.
Sucky, but that's the way it is.

> Now, why Ubuntu would do this outside of the ThinkPads, or target anything
> other than magnetic disk media, I don't know. Maybe other laptop vendors
> also had the same idea. Maybe Ubuntu was simplistic on their approach when
> they added this defensive feature. Maybe it was considered a PM feature and
> it is not even related to the ThinkPad APM annoyance. You'd have to ask
> them.

The feature probably doesn't have much to do with the frequent head
unload problem. Unplugging or plugging in the AC cord also triggered
APM commands to be issued, so it's more likely they were trying to
optimize the performance / power balance. The only problem is that APM
setting values aren't clearly defined and just aren't well tested.

>> firmware on the drive which shipped on Samsung NC10 for some reason
>> locks up after being hit with enough of those commands. It's just not
>> safe to assume this kind of stuff will reliably work. If you're
>
> Maybe we can blacklist such commands on drives known to misimplement them?

Yes, that's a possibility, but we're unlikely to build meaningful
coverage and likely to block valid usages too. E.g. a firmware might
lock up when APM settings are adjusted continuously while setting them
once after booting is fine. I really want to avoid implementing that
kind of per-drive logic in the kernel.

>> ready to do some research and experiments, it's fine. If you're doing
>> OEM customization with specific hardware and QA, sure, why not (this
>> is basically what windows OEMs do too). But, doing things which
>> aren't _usually_ used that way repeatedly _by default_ is asking for
>> trouble. There's a reason why these operations are root only. :-)
>
> There are real user cases for APM commands, and for SMART commands...

Yeap, sure, but it just doesn't work very well, not yet at least.
SMART is usually better tested than APM, but given the number of
reports I've seen from hddtemp users, certain aspects of it are broken
on many drives. There isn't a clear answer. The usual parts of SMART
are probably pretty safe, but then again don't go too far with it: do
it every several hours or every day, not every ten seconds. APM is
far more dangerous; if your machine needs it, use it minimally. If a
certain combination of values is known to work for your particular
configuration, go ahead and use it. In other cases, just stay away
from it.

What people use often gets tested, verified by vendors and promptly
fixed. What people don't use often won't be, and will be unreliable.
If you want to do things people usually don't do, it's your
responsibility to ensure it actually works.

Thanks.

--
tejun

2009-09-14 21:13:47

by Thomas Fjellstrom

Subject: Re: MD/RAID time out writing superblock

On Mon September 14 2009, Tejun Heo wrote:
> Hello,
>
> Thomas Fjellstrom wrote:
> > I have the same issue with a single WD 2TB Green drive. Technically two,
> > but it always only gets errors from the same drive, so I was assuming it
> > was the drive. I only have to setup the raid0 array, and put some light
> > load on it for the kernel to start complaining, and eventually it just
> > kicks the drive completely with the following messages:
> >
> > sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> > end_request: I/O error, dev sdb, sector 202026972
> >
> > The drive does work fine prior to the frozen timeout errors. And I was
> > using it in windows (same raid0 config) just fine with no errors
> > whatsoever.
>
> Can you post full dmesg output? The above doesn't tell much about ATA
> side of things.
>
> Thanks.
>

Sure, I've attached the full dmesg from a full test I ran today (I couldn't
find the old log where that bit came from). I'm running 2.6.31-rc9 right now,
and will probably update to the final 31 release soonish. The test I ran
actually finished (dd if=/dev/sdc of=/dev/null bs=8M), whereas with earlier
kernels it was completely failing. Of course, I was actually trying to bring
up the md raid0 array (2x2TB), mount the filesystem, and copy the files off
before. mdraid is probably more sensitive to the end_request errors than dd
is.

--
Thomas Fjellstrom
[email protected]


Attachments:
disk.dmesg (86.92 kB)

2009-09-14 22:24:10

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Thomas Fjellstrom wrote:
> Sure, I've attached the full dmesg from a full test I ran today (I couldn't
> find the old log where that bit came from). I'm running 2.6.31-rc9 right now,
> and will probably update to the final 31 release soonish. The test I ran
> actually finished (dd if=/dev/sdc of=/dev/null bs=8M), whereas with earlier
> kernels it was completely failing. Of course, I was actually trying to bring
> up the md raid0 array (2x2TB), mount the filesystem, and copy the files off
> before. mdraid is probably more sensitive to the end_request errors than dd
> is.

[ 2.056357] ata5: softreset failed (device not ready)
[ 2.056412] ata5: applying SB600 PMP SRST workaround and retrying

The above two are expected. It's a bug in the SB600 controller being
worked around.

[ 2.220160] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2.269157] ata5.00: ATA-8: WDC WD20EADS-00R6B0, 01.00A01, max UDMA/133
[ 2.269214] ata5.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 2.275112] ata5.00: configured for UDMA/133

All seem well.

[ 7089.781711] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 7089.781731] ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
[ 7089.781735] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

This is SMART ENABLE OPERATIONS and the command gets retried a lot of
times with the same result.
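For reference, the opcode and feature bytes can be read straight off the "cmd" log line: opcode 0xb0 is SMART, and feature 0xd8 selects ENABLE OPERATIONS. A small shell sketch of that decoding (the function name is just for illustration):

```shell
# extract the ATA opcode and feature bytes from a libata "cmd" log line,
# e.g. "ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0"
decode_ata_cmd() {
    local line=$1 tf rest
    tf=${line#*cmd }      # strip everything up to "cmd "
    cmd=${tf%%/*}         # opcode byte (b0 = SMART)
    rest=${tf#*/}
    feat=${rest%%:*}      # feature byte (d8 = ENABLE OPERATIONS)
    echo "opcode=0x$cmd feature=0x$feat"
}
```

Running it on the log line above prints `opcode=0xb0 feature=0xd8`.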

[32410.780251] ata5.00: status: { DRDY }
[32410.780262] ata5: hard resetting link
[32411.264544] ata5: softreset failed (device not ready)
[32411.264554] ata5: applying SB600 PMP SRST workaround and retrying
[32411.428072] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[32411.440112] ata5.00: configured for UDMA/33
[32411.440148] ata5: EH complete
[32452.781180] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[32452.781199] ata5.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
[32452.781202] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Then, one SMART RETURN STATUS gets timed out.

[32464.106741] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[32464.106751] ata5.00: irq_stat 0x40000001
[32464.106769] ata5.00: cmd 25/00:08:00:88:e0/00:00:e8:00:00/e0 tag 0 dma 4096 in
[32464.106772] res 41/04:00:00:88:e0/00:00:e8:00:00/e0 Emask 0x1 (device error)

Then, the device fails READ DMA EXT.

[32510.730059] Descriptor sense data with sense descriptors (in hex):
[32510.730064] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[32510.730082] e8 e0 88 00
[32510.730090] sd 5:0:0:0: [sdc] Add. Sense: No additional sense information
[32510.730098] end_request: I/O error, dev sdc, sector 3907028992
[32510.730106] Buffer I/O error on device sdc, logical block 488378624

After several retries, libata gives up and sd does too.

[32510.730142] ata5: EH complete
[32526.780076] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[32526.780097] ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
[32526.780100] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[32526.780107] ata5.00: status: { DRDY }
[32526.780119] ata5: hard resetting link
[32536.785177] ata5: softreset failed (device not ready)
[32536.785189] ata5: hard resetting link
[32546.789238] ata5: softreset failed (device not ready)
[32546.789249] ata5: hard resetting link
[32557.360064] ata5: link is slow to respond, please be patient (ready=0)
[32573.836192] ata5: softreset failed (device not ready)
[32573.836202] ata5: applying SB600 PMP SRST workaround and retrying
[32581.792026] ata5: softreset failed (device not ready)
[32581.792039] ata5: hard resetting link
[32587.000775] ata5: softreset failed (device not ready)
[32587.000784] ata5: reset failed, giving up
[32587.000790] ata5.00: disabled
[32587.000822] ata5: EH complete

Then, SMART ENABLE again, which now pushes the drive over the edge and
it never comes back.

Does disabling whatever is issuing those SMART commands make any
difference?

Thanks.

--
tejun

2009-09-16 22:34:32

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Hi Tejun. Thanks for following up to this. We've done some more experimentation
over the last couple of days based on your suggestions and thoughts.

Tejun Heo <[email protected]> writes:
> Seriously, it's most likely a hardware malfunction although I can't tell
> where the problem is with the given data. Get the hardware fixed.

We know this isn't caused by a single faulty piece of hardware, because we have
a cluster of identical machines and all have shown this behaviour. This doesn't
mean that there isn't a hardware problem, but if there is one, it's a design
problem or firmware bug affecting all of our hosts.

There have also been a few reports of problems which look very similar in this
thread from people with somewhat different hardware and drives to ours.

> The aboves are IDENTIFY. Who's issuing IDENTIFY regularly? It isn't
> from the regular IO paths or md. It's probably being issued via SG_IO
> from userland. These failures don't affect normal operation.
[...]
> Oooh, another possibility is the above continuous IDENTIFY tries.
> Doing things like that generally isn't a good idea because vendors
> don't expect IDENTIFY to be mixed regularly with normal IOs and
> firmwares aren't tested against that. Even smart commands sometimes
> cause problems. So, finding out the thing which is obsessed with the
> identity of the drive and stopping it might help.

We tracked this down to some (excessively frequent!) monitoring we were doing
using smartctl. Things were improved considerably by stopping smartd and
disabling all callers of smartctl, although it doesn't appear to have been a
cure. The frequency of these timeouts during resync seems to have gone from
about once every two hours to about once a day, which means we've been able to
complete some resyncs whereas we were unable to before.
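For anyone who still wants some monitoring rather than disabling it outright, one middle ground is to rate-limit how often SMART commands actually reach a drive. A rough sketch, assuming smartctl is available (the stamp-file location and function name are arbitrary choices for this example):

```shell
# only issue a SMART query if the previous one is older than $2 seconds;
# the /tmp stamp-file path is an arbitrary choice for this sketch
throttled_smart() {
    local dev=$1 interval=$2
    local stamp="/tmp/smart-$(basename "$dev").stamp"
    local now last
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null || echo 0)
    if [ $((now - last)) -ge "$interval" ]; then
        echo "$now" > "$stamp"
        smartctl -A "$dev"   # read attributes; swap in whatever query you need
    fi
}
# usage: throttled_smart /dev/sda 3600   # at most one poll per hour
```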

What we still see are (fewer) 'frozen' exceptions leading to a drive reset and
an 'end_request: I/O error', such as [1]. The drive is then promptly kicked out
of the raid array.

Some of these timeouts also leave us with a completely dead drive, and we
need to reboot the machine before it can be accessed again. (Hot-unplugging
it and plugging it back in isn't sufficient to bring it back to life, so
maybe it's a controller problem, although other drives on the same
controller stay alive?) An example is [2].

There are two more symptoms we are seeing on the same machines which may be
connected, or may be separate bugs in their own right:

- 'cat /proc/mdstat' sometimes hangs before returning during normal
operation, although most of the time it is fine. We have seen hangs of
up to 15-20 seconds during resync. Might this be a less severe example
of the lock-up which causes a timeout and reset after 30 seconds?

- We've also had a few occasions of O_SYNC writes to raid arrays (from
qemu-kvm via LVM2) completely deadlocking against resync writes when the
maximum md resync speed is set sufficiently high, even where the minimum
md resync speed is set to zero (although this certainly helps). However,
I suspect this is an unrelated issue as I've seen this on other hardware
running other kernel configs.
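For the record, the md resync speed limits mentioned above are the global knobs under /proc/sys/dev/raid. A small helper for setting both at once (the directory argument exists only so the sketch can be exercised outside /proc; the function name is ours):

```shell
# bound md resync bandwidth (KB/s per device); a min of 0 lets resync
# back off fully, and a lower max keeps it from starving O_SYNC writers
set_resync_limits() {
    local min_kbs=$1 max_kbs=$2 dir=${3:-/proc/sys/dev/raid}
    echo "$min_kbs" > "$dir/speed_limit_min"
    echo "$max_kbs" > "$dir/speed_limit_max"
}
# usage (as root): set_resync_limits 0 50000
```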

For reference, we're using the ahci driver and deadline IO scheduler with the
default tuning parameters, our motherboards are SuperMicro X7DBN (Intel ESB2
SATA 3.0Gbps Controller) and we have six 750GB Seagate ST3750523AS drives
attached to each motherboard. Also, since first reporting this, I've managed
to reproduce the problem whilst running Linux 2.6.29.6, 2.6.30.5 and the
newly released 2.6.31.

What do you think our next steps in tracking this one down should be? My
only ideas are:

- We could experiment with NCQ settings. I've already briefly changed
/sys/block/sd*/device/queue_depth down from 31 to 1. It didn't seem to stop
the delays in getting info back from /proc/mdstat, so I put it back up again,
fearing that the performance hit would make the problem worse. Perhaps I
should leave NCQ disabled for a longer period, long enough to verify whether
we still get timeouts without it?

- We could try replacing the drives that are currently kicked out of one of the
arrays with drives from another manufacturer to see if the drive model
is implicated. Is the drive or the controller a more likely problem?
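The NCQ experiment in the first idea is easy to script so it can be flipped back and forth across the cluster; a sketch (the sysfs base directory is parameterized only so the function can be tested, and the name is ours):

```shell
# write DEPTH to every SATA disk's queue_depth: 1 disables NCQ,
# 31 restores the default reported in our boot logs
set_queue_depth() {
    local depth=$1 base=${2:-/sys/block} q
    for q in "$base"/sd*/device/queue_depth; do
        [ -e "$q" ] || continue
        echo "$depth" > "$q"
    done
}
# usage (as root): set_queue_depth 1    # later: set_queue_depth 31
```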

Any advice would be very gratefully received.

Cheers,

Chris.

[1] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: softreset failed (device not ready)
ata5: hard resetting link
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata5.00: configured for UDMA/133
ata5: EH complete
end_request: I/O error, dev sde, sector 1465147264
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.
raid10: Operation continuing on 4 devices.

[2] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: softreset failed (device not ready)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: hard resetting link
ata1: softreset failed (device not ready)
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: limiting SATA link speed to 1.5 Gbps
ata1: hard resetting link
ata1: softreset failed (device not ready)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1: hard resetting link
ata1: softreset failed (device not ready)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1: EH complete
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sda, sector 1465147272
end_request: I/O error, dev sda, sector 1465147272
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sda3, disabling device.
raid10: Operation continuing on 4 devices.
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sda, sector 8396584
end_request: I/O error, dev sda, sector 8396584
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sda1, disabling device.
raid1: Operation continuing on 5 devices.
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sda, sector 32
raid1: sda1: rescheduling sector 0
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sda, sector 8396800
raid10: sda2: rescheduling sector 0
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sda, sector 8396800
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sda, sector 8396800
raid10: Disk failure on sda2, disabling device.
raid10: Operation continuing on 5 devices.

2009-09-16 23:23:59

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Mark Lord <[email protected]> writes:

> I suspect we're missing some info from this specific failure.
> Looking back at Chris's earlier posting, the whole thing started
> with a FLUSH_CACHE_EXT failure. Once that happens, all bets are
> off on anything that follows.
>
> >Everything will be running fine when suddenly:
> >
> > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> > ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> > res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
> > ata1.00: status: { DRDY }
> > ata1: hard resetting link
> > ata1: softreset failed (device not ready)
> > ata1: hard resetting link
> > ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > ata1.00: configured for UDMA/133
> > ata1: EH complete
> > end_request: I/O error, dev sda, sector 1465147272
> > md: super_written gets error=-5, uptodate=0
> > raid10: Disk failure on sda3, disabling device.
> > raid10: Operation continuing on 5 devices.

Hi Mark. Yes, when the first timeout after a clean boot happens, it's with
an 0xea flush command every time:

[...]
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: ATA-8: ST3750523AS, CC34, max UDMA/133
ata5.00: 1465149168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata5.00: configured for UDMA/133
scsi 4:0:0:0: Direct-Access ATA ST3750523AS CC34 PQ: 0 ANSI: 5
sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors: (750 GB/698 GiB)
sd 4:0:0:0: [sde] Write Protect is off
sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors: (750 GB/698 GiB)
sd 4:0:0:0: [sde] Write Protect is off
sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sde: sde1 sde2 sde3
sd 4:0:0:0: [sde] Attached SCSI disk
sd 4:0:0:0: Attached scsi generic sg4 type 0

[later]
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete
sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors: (750 GB/698 GiB)
sd 4:0:0:0: [sde] Write Protect is off
sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
end_request: I/O error, dev sde, sector 1465147264
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.
raid10: Operation continuing on 4 devices.

Best wishes,

Chris.

2009-09-16 23:48:57

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Hello,

Chris Webb wrote:
> Hi Tejun. Thanks for following up to this. We've done some more
> experimentation over the last couple of days based on your
> suggestions and thoughts.
>
> Tejun Heo <[email protected]> writes:
>> Seriously, it's most likely a hardware malfunction although I can't tell
>> where the problem is with the given data. Get the hardware fixed.
>
> We know this isn't caused by a single faulty piece of hardware,
> because we have a cluster of identical machines and all have shown
> this behaviour. This doesn't mean that there isn't a hardware
> problem, but if there is one, it's a design problem or firmware bug
> affecting all of our hosts.

If it's multiple machines, it's much less likely to be faulty drives,
but if the machines are configured mostly identically, hardware
problems can't be ruled out either.

> There have also been a few reports of problems which look very
> similar in this thread from people with somewhat different hardware
> and drives to ours.

I wouldn't connect the reported cases too eagerly at this point. Too
many different causes end up showing similar symptoms especially with
timeouts.

>> The aboves are IDENTIFY. Who's issuing IDENTIFY regularly? It isn't
>> from the regular IO paths or md. It's probably being issued via SG_IO
>> from userland. These failures don't affect normal operation.
> [...]
>> Oooh, another possibility is the above continuous IDENTIFY tries.
>> Doing things like that generally isn't a good idea because vendors
>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>> firmwares aren't tested against that. Even smart commands sometimes
>> cause problems. So, finding out the thing which is obsessed with the
>> identity of the drive and stopping it might help.
>
> We tracked this down to some (excessively frequent!) monitoring we
> were doing using smartctl. Things were improved considerably by
> stopping smartd and disabling all callers of smartctl, although it
> doesn't appear to have been a cure. The frequency of these timeouts
> during resync seems to have gone from about once every two hours to
> about once a day, which means we've been able to complete some
> resyncs whereas we were unable to before.

That's interesting. One important side effect of issuing IDENTIFY is
that it serializes the command stream, as it is not an NCQ command,
and thus could change command patterns significantly.

> What we still see are (fewer) 'frozen' exceptions leading to a drive
> reset and an 'end_request: I/O error', such as [1]. The drive is
> then promptly kicked out of the raid array.

That's flush timeout and md is right to kick the drive out.

> Some of these timeouts also leave us with a completely dead drive,
> and we need to reboot the machine before it can be accessed
> again. (Hot plugging it out and back in again isn't sufficient to
> bring it back to life, so maybe a controller problem, although other
> drives on the same controller stay alive?) An example is [2].

Ports behave mostly independently and it sure is possible that one
port locks up while others operate fine. I've never seen such
incidents reported for intel ahci's tho. If you hot unplug and then
replug the drive, what does the kernel say?

> There are two more symptoms we are seeing on the same which may be
> connected, or may be separate bugs in their own right:
>
> - 'cat /proc/mdstat' sometimes hangs before returning during normal
> operation, although most of the time it is fine. We have seen hangs of
> up to 15-20 seconds during resync. Might this be a less severe example
> of the lock-up which causes a timeout and reset after 30 seconds?
>
> - We've also had a few occasions of O_SYNC writes to raid arrays (from
> qemu-kvm via LVM2) completely deadlocking against resync writes when the
> maximum md resync speed is set sufficiently high, even where the minimum
> md resync speed is set to zero (although this certainly helps). However,
> I suspect this is an unrelated issue as I've seen this on other hardware
> running other kernel configs.

I think these two will be best answered by Neil Brown. Neil?

> For reference, we're using the ahci driver and deadline IO scheduler with the
> default tuning parameters, our motherboards are SuperMicro X7DBN (Intel ESB2
> SATA 3.0Gbps Controller) and we have six 750GB Seagate ST3750523AS drives
> attached to each motherboard. Also, since first reporting this, I've managed
> to reproduce the problem whilst running Linux 2.6.29.6, 2.6.30.5 and the
> newly released 2.6.31.
>
> What do you think are our next steps in tracking this one down should be? My
> only ideas are:
>
> - We could experiment with NCQ settings. I've already briefly
> changed /sys/block/sd*/device/queue_depth down from 31 to 1. It
> didn't seem to stop delays in getting back info from
> /proc/mdstat, so put it back up again fearing that the
> performance hit would make the problem worse, but perhaps I
> should leave it off for a more extended period to verify that we
> still get timeouts long enough to leave slots without it?
>
> - We could try replacing the drives that are currently kicked out
> of one of the arrays with drives from another manufacturer to
> see if the drive model is implicated. Is the drive or the
> controller a more likely problem?

The most common cause for FLUSH timeouts has been power related issues.
This problem becomes more pronounced in RAID configurations because
FLUSHes end up being issued to all drives in the array simultaneously,
causing concurrent power spikes from the drives. When proper barrier
support was introduced to md earlier this year, I got two separate
reports where brief voltage drops caused by simultaneous FLUSHes led to
drives powering off briefly and losing the data in their buffers,
leading to data corruption. People always think their PSUs are good
because they are rated at high wattage and bear a hefty price tag, but
many of the people reporting problems which end up being diagnosed as
power issues have these fancy PSUs.

So, if your machines share the same configuration, the first thing I'd
do would be to prepare a separate PSU, power it up, connect half of the
drives (including what used to be the offending one) to it, and see
whether the failure pattern changes.

Thanks.

--
tejun

2009-09-17 13:29:45

by Mark Lord

Subject: Re: MD/RAID time out writing superblock

Chris Webb wrote:
> Mark Lord <[email protected]> writes:
>
>> I suspect we're missing some info from this specific failure.
>> Looking back at Chris's earlier posting, the whole thing started
>> with a FLUSH_CACHE_EXT failure. Once that happens, all bets are
>> off on anything that follows.
>>
>>> Everything will be running fine when suddenly:
>>>
>>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>> ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>> res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
>>> ata1.00: status: { DRDY }
>>> ata1: hard resetting link
>>> ata1: softreset failed (device not ready)
>>> ata1: hard resetting link
>>> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>> ata1.00: configured for UDMA/133
>>> ata1: EH complete
>>> end_request: I/O error, dev sda, sector 1465147272
>>> md: super_written gets error=-5, uptodate=0
>>> raid10: Disk failure on sda3, disabling device.
>>> raid10: Operation continuing on 5 devices.
>
> Hi Mark. Yes, when the first timeout after a clean boot happens, it's with
> an 0xea flush command every time:
..

Yes. Is this still happening from time to time now?
If so, disable the smartmontools daemon (smartd) and see if the problem goes away.
And especially disable hddtemp (which issues SMART commands) if that is also around.

It would be good to discover if those are the triggers for what's happening here.

Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?
If not, then I think we may need to add code to do it.


Cheers

2009-09-17 13:32:44

by Mark Lord

Subject: Re: MD/RAID time out writing superblock

Mark Lord wrote:
>
> Is this still happening from time to time now?
> If so, disable the smartmontools daemon (smartd) and see if the problem
> goes away.
> And especially disable hddtemp (which issues SMART commands) if that is
> also around.
>
> It would be good to discover if those are the triggers for what's
> happening here.
..

Ah.. I've just now read your other recent posting, so no need to answer again here.



> Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?
> If not, then I think we may need to add code to do it.

2009-09-17 13:35:19

by Mark Lord

Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
> Hello,
>
> Chris Webb wrote:
>> Hi Tejun. Thanks for following up to this. We've done some more
>> experimentation over the last couple of days based on your
>> suggestions and thoughts.
>>
>> Tejun Heo <[email protected]> writes:
>>> Seriously, it's most likely a hardware malfunction although I can't tell
>>> where the problem is with the given data. Get the hardware fixed.
>> We know this isn't caused by a single faulty piece of hardware,
>> because we have a cluster of identical machines and all have shown
>> this behaviour. This doesn't mean that there isn't a hardware
>> problem, but if there is one, it's a design problem or firmware bug
>> affecting all of our hosts.
>
> If it's multiple machines, it's much less likely to be faulty drives,
> but if the machines are configured mostly identically, hardware
> problems can't be ruled out either.
>
>> There have also been a few reports of problems which look very
>> similar in this thread from people with somewhat different hardware
>> and drives to ours.
>
> I wouldn't connect the reported cases too eagerly at this point. Too
> many different causes end up showing similar symptoms especially with
> timeouts.
>
>>> The aboves are IDENTIFY. Who's issuing IDENTIFY regularly? It isn't
>>> from the regular IO paths or md. It's probably being issued via SG_IO
>>> from userland. These failures don't affect normal operation.
>> [...]
>>> Oooh, another possibility is the above continuous IDENTIFY tries.
>>> Doing things like that generally isn't a good idea because vendors
>>> don't expect IDENTIFY to be mixed regularly with normal IOs and
>>> firmwares aren't tested against that. Even smart commands sometimes
>>> cause problems. So, finding out the thing which is obsessed with the
>>> identity of the drive and stopping it might help.
>> We tracked this down to some (excessively frequent!) monitoring we
>> were doing using smartctl. Things were improved considerably by
>> stopping smartd and disabling all callers of smartctl, although it
>> doesn't appear to have been a cure. The frequency of these timeouts
>> during resync seems to have gone from about once every two hours to
>> about once a day, which means we've been able to complete some
>> resyncs whereas we were unable to before.
>
> That's interesting. One important side effect of issuing IDENTIFY is
> that they will serialize command streams as they are not NCQ commands
> and thus could change command patterns significantly.
..

SMART is the opcode most frequently implicated here, not IDENTIFY.
Note that even a barrier FLUSH CACHE is non-NCQ and will serialize the
stream.

Cheers

2009-09-17 13:37:24

by Chris Webb

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Mark Lord <[email protected]> writes:

> Yes. Is this still happening from time to time now?

Hi Mark. We're still seeing the flush timeouts (0xea) with accompanying
errors, but no more IDENTIFYs. We no longer run smartd or smartctl on
these systems, and nothing else in userspace touches the drives except
through block reads and writes via the md arrays.

Cheers,

Chris.

2009-09-17 15:35:23

by Tejun Heo

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Hello,

Mark Lord wrote:
> Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?

Nope.

> If not, then I think we may need to add code to do it.

Hmm... can you explain a bit more? That seems rather extreme to me.

Thanks.

--
tejun

2009-09-17 15:48:32

by Tejun Heo

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Hello,

Mark Lord wrote:
>> That's interesting. One important side effect of issuing IDENTIFY is
>> that they will serialize command streams as they are not NCQ commands
>> and thus could change command patterns significantly.
> ..
>
> SMART is the opcode that is most frequently implicated here, not IDENTIFY.

Yeap, any non-NCQ commands would do it.

> Note that even a barrier FLUSH CACHE is non-NCQ and will serialize
> the stream.

Yeah, I was just thinking that issuing non-NCQ commands mixed with NCQ
commands would make the command stream fluctuate more. Modern drives
are pretty good at lowering power consumption while idle, so a more
erratic command stream might have something to do with the problem.
Just a wild speculation, though.

Thanks.

--
tejun

2009-09-17 16:16:11

by Mark Lord

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Tejun Heo wrote:
> Hello,
>
> Mark Lord wrote:
>> Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?
>
> Nope.
>
>> If not, then I think we may need to add code to do it.
>
> Hmm... can you explain a bit more? That seems rather extreme to me.
..

You may recall that I first raised this issue about a year ago,
when my own RAID0 array (MythTV box) started showing errors very
similar to what Chris is reporting.

These were easily triggered by running hddtemp once every few seconds
to log drive temperatures during Myth recording sessions.

hddtemp uses SMART commands.

The actual errors in the logs were command timeouts,
but at this point I no longer remember which opcode was
actually timing out. Disabling the onboard write cache
immediately "cured" the problem, at the expense of MUCH
slower I/O times.

My theory at the time was that some non-NCQ commands might be triggering
an internal FLUSH CACHE within the (Hitachi) drive firmware, which then
caused the original command to timeout in libata (due to the large amounts
of data present in the onboard write-caches).

Now that more people are playing the game, we're seeing more and more
reports of strange interactions with smartd running in the background.

I suspect more and more now that this is an (avoidable) interaction
between the write-cache and the SMART opcode, and it could perhaps be
avoided by doing a FLUSH CACHE before any SMART (or non-data command) opcode.
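[Editor's note: Mark's pre-flush proposal can be sketched as a toy timing model. This is illustrative Python, not a real libata patch; the cache size, flush rate and timeout budget are all invented numbers, chosen only to show the mechanism. The idea is that with an explicit FLUSH CACHE issued first (under its own timeout), the SMART command never finds a large dirty write cache queued in front of it.]

```python
# Toy model of the proposed pre-flush: illustration only, invented numbers.
def time_non_ncq_command(dirty_cache_mb, pre_flush, cmd_cost_s=0.1,
                         flush_rate_mb_s=2.0):
    """Rough time (s) until the non-NCQ command completes.

    Without a pre-flush, the drive may flush its write cache internally
    while the command's timeout clock is already running. Scattered
    cached writes flush slowly (2 MB/s here is an invented figure for
    random-write flushing).
    """
    flush_time = dirty_cache_mb / flush_rate_mb_s
    if pre_flush:
        # FLUSH CACHE goes out first under its own timeout; by the time
        # the SMART command issues, the cache is empty.
        return cmd_cost_s
    return flush_time + cmd_cost_s

TIMEOUT_S = 7.0  # invented timeout budget for the sketch

without    = time_non_ncq_command(dirty_cache_mb=32, pre_flush=False)
with_flush = time_non_ncq_command(dirty_cache_mb=32, pre_flush=True)
```

Under these made-up numbers the bare SMART command blows its timeout while the pre-flushed one completes comfortably, which is the shape of the failure Mark describes, and also why disabling the write cache "cured" it.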

Cheers

2009-09-17 16:17:19

by Mark Lord

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Mark Lord wrote:
> Tejun Heo wrote:
>> Hello,
>>
>> Mark Lord wrote:
>>> Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?
>>
>> Nope.
>>
>>> If not, then I think we may need to add code to do it.
>>
>> Hmm... can you explain a bit more? That seems rather extreme to me.
> ..
>
> You may recall that I first raised this issue about a year ago,
> when my own RAID0 array (MythTV box) started showing errors very
> similar to what Chris is reporting.
>
> These were easily triggered by running hddtemp once every few seconds
> to log drive temperatures during Myth recording sessions.
>
> hddtemp uses SMART commands.
>
> The actual errors in the logs were command timeouts,
> but at this point I no longer remember which opcode was
> actually timing out. Disabling the onboard write cache
> immediately "cured" the problem, at the expense of MUCH
> slower I/O times.
..

Speaking of which..

Chris: I wonder if the errors will also vanish in your situation
by disabling the onboard write-caches in the drives ?

Eg. hdparm -W0 /dev/sd?

2009-09-18 17:05:20

by Chris Webb

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Mark Lord <[email protected]> writes:

> Speaking of which..
>
> Chris: I wonder if the errors will also vanish in your situation
> by disabling the onboard write-caches in the drives ?
>
> Eg. hdparm -W0 /dev/sd?

Hi Mark. I've got a test machine on its way at the moment, so I'll make sure
I check this one out on it too.

Cheers,

Chris.

2009-09-20 18:36:41

by Robert Hancock

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

On 09/17/2009 10:16 AM, Mark Lord wrote:
> Tejun Heo wrote:
>> Hello,
>>
>> Mark Lord wrote:
>>> Tejun.. do we do a FLUSH CACHE before issuing a non-NCQ command ?
>>
>> Nope.
>>
>>> If not, then I think we may need to add code to do it.
>>
>> Hmm... can you explain a bit more? That seems rather extreme to me.
> ..
>
> You may recall that I first raised this issue about a year ago,
> when my own RAID0 array (MythTV box) started showing errors very
> similar to what Chris is reporting.
>
> These were easily triggered by running hddtemp once every few seconds
> to log drive temperatures during Myth recording sessions.
>
> hddtemp uses SMART commands.
>
> The actual errors in the logs were command timeouts,
> but at this point I no longer remember which opcode was
> actually timing out. Disabling the onboard write cache
> immediately "cured" the problem, at the expense of MUCH
> slower I/O times.
>
> My theory at the time was that some non-NCQ commands might be triggering
> an internal FLUSH CACHE within the (Hitachi) drive firmware, which then
> caused the original command to timeout in libata (due to the large amounts
> of data present in the onboard write-caches).
>
> Now that more people are playing the game, we're seeing more and more
> reports of strange interactions with smartd running in the background.

Well, unless the SMART commands are using a non-standard timeout, it'll
be the same as the timeout for the flush cache, so the flush cache would
have timed out too...

>
> I suspect more and more now that this is an (avoidable) interaction
> between the write-cache and the SMART opcode, and it could perhaps be
> avoided by doing a FLUSH CACHE before any SMART (or non-data command)
> opcode.
>
> Cheers

2009-09-21 10:31:47

by Chris Webb

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Chris Webb <[email protected]> writes:

> Mark Lord <[email protected]> writes:
>
> > Speaking of which..
> >
> > Chris: I wonder if the errors will also vanish in your situation
> > by disabling the onboard write-caches in the drives ?
> >
> > Eg. hdparm -W0 /dev/sd?
>
> Hi Mark. I've got a test machine on its way at the moment, so I'll make sure
> I check this one out on it too.

Our test machine is still being built, but we had an opportunity to try this on
a couple of the live machines when their RAID arrays failed over the weekend.
We still got timeouts, but (predictably!) they're not on flushes any more:

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata2.00: cmd 35/00:08:98:c6:00/00:00:4e:00:00/e0 tag 0 dm
res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/33
ata2: EH complete
[...]
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata2.00: cmd 35/00:08:18:94:68/00:00:3d:00:00/e0 tag 0 dm
res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/33
ata2: EH complete
[...]

all the way through the night.

I also have these in the log, but they came immediately after turning off
the write caching on all drives, so they may be a red herring caused by
data still being written out.

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata2.00: cmd c8/00:08:00:20:80/00:00:00:00:00/e0 tag 0 dm
res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/133
ata2: EH complete
ata2.00: limiting speed to UDMA/100:PIO4
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata2.00: cmd 25/00:08:80:3e:2d/00:00:4e:00:00/e0 tag 0 dm
res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: link is slow to respond, please be patient (ready=0)
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: configured for UDMA/100
ata2: EH complete

On another machine, I saw this with write caching turned off:

ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata2.00: cmd 61/08:00:28:1f:80/00:00:00:00:00/40 tag 0 ncq 4096 out
res 40/00:00:40:1f:80/00:00:00:00:00/40 Emask 0x4 (timeout)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata2.00: cmd 61/08:00:28:1f:80/00:00:00:00:00/40 tag 0 ncq 4096 out
res 40/00:00:20:1f:80/00:00:00:00:00/40 Emask 0x4 (timeout)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: qc timeout (cmd 0xef)
ata2.00: failed to set xfermode (err_mask=0x4)
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: link is slow to respond, please be patient (ready=0)
ata2: softreset failed (device not ready)
ata2: limiting SATA link speed to 1.5 Gbps
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: reset failed, giving up
ata2.00: disabled
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: link is slow to respond, please be patient (ready=0)
ata2: softreset failed (device not ready)
ata2: hard resetting link
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: EH complete
sd 1:0:0:0: [sdb] Unhandled error code
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sdb, sector 8396584
end_request: I/O error, dev sdb, sector 8396584
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sdb1, disabling device.
raid1: Operation continuing on 5 devices.
sd 1:0:0:0: [sdb] Unhandled error code
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sdb, sector 8396632
end_request: I/O error, dev sdb, sector 8396632
md: super_written gets error=-5, uptodate=0
sd 1:0:0:0: [sdb] Unhandled error code
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sdb, sector 654934840
raid10: sdb3: rescheduling sector 1788594488
sd 1:0:0:0: [sdb] Unhandled error code
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sdb, sector 1311583568
raid10: Disk failure on sdb3, disabling device.
raid10: Operation continuing on 3 devices.
Buffer I/O error on device dm-51, logical block 31930
lost page write due to I/O error on dm-51
Buffer I/O error on device dm-51, logical block 31931
lost page write due to I/O error on dm-51
Buffer I/O error on device dm-51, logical block 31932
lost page write due to I/O error on dm-51
Buffer I/O error on device dm-51, logical block 31933
lost page write due to I/O error on dm-51
sd 1:0:0:0: [sdb] Unhandled error code
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sdb, sector 1465147272
end_request: I/O error, dev sdb, sector 1465147272
md: super_written gets error=-5, uptodate=0
sd 1:0:0:0: [sdb] Unhandled error code
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
end_request: I/O error, dev sdb, sector 8396584
end_request: I/O error, dev sdb, sector 8396584
md: super_written gets error=-5, uptodate=0
sd 1:0:0:0: [sdb] READ CAPACITY(16) failed
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
sd 1:0:0:0: [sdb] Sense not available.
sd 1:0:0:0: [sdb] READ CAPACITY failed
sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
sd 1:0:0:0: [sdb] Sense not available.
sd 1:0:0:0: [sdb] Asking for cache data failed
sd 1:0:0:0: [sdb] Assuming drive cache: write through
sdb: detected capacity change from 750156374016 to 0
raid10: sdb: unrecoverable I/O read error for block 1788594488
Buffer I/O error on device dm-59, logical block 204023
Buffer I/O error on device dm-59, logical block 204023
Buffer I/O error on device dm-43, logical block 24845
lost page write due to I/O error on dm-43
Buffer I/O error on device dm-62, logical block 558722
lost page write due to I/O error on dm-62
Buffer I/O error on device dm-43, logical block 24846
lost page write due to I/O error on dm-43
RAID1 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sda1
disk 1, wo:1, o:0, dev:sdb1
disk 2, wo:0, o:1, dev:sdc1
RAID1 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sda1
disk 2, wo:0, o:1, dev:sdc1
disk 3, wo:0, o:1, dev:sdd1
disk 4, wo:0, o:1, dev:sde1
disk 5, wo:0, o:1, dev:sdf1
raid10: Disk failure on sdb2, disabling device.
raid10: Operation continuing on 3 devices.
raid10: sdb: unrecoverable I/O read error for block 0
RAID10 conf printout:
--- wd:3 rd:6
disk 1, wo:1, o:0, dev:sdb2
disk 2, wo:0, o:1, dev:sdc2
disk 3, wo:0, o:1, dev:sdd2
disk 5, wo:0, o:1, dev:sdf2
RAID10 conf printout:
--- wd:3 rd:6
disk 2, wo:0, o:1, dev:sdc2
disk 3, wo:0, o:1, dev:sdd2
disk 5, wo:0, o:1, dev:sdf2
md: md2: resync done.
RAID10 conf printout:
--- wd:3 rd:6
disk 1, wo:1, o:0, dev:sdb3
disk 2, wo:0, o:1, dev:sdc3
disk 3, wo:0, o:1, dev:sdd3
disk 5, wo:0, o:1, dev:sdf3
RAID10 conf printout:
--- wd:3 rd:6
disk 2, wo:0, o:1, dev:sdc3
disk 3, wo:0, o:1, dev:sdd3
disk 5, wo:0, o:1, dev:sdf3

Cheers,

Chris.

2009-09-21 19:47:51

by Mark Lord

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

Chris Webb wrote:
> Chris Webb <[email protected]> writes:
>
>> Mark Lord <[email protected]> writes:
>>
>>> Speaking of which..
>>>
>>> Chris: I wonder if the errors will also vanish in your situation
>>> by disabling the onboard write-caches in the drives ?
>>>
>>> Eg. hdparm -W0 /dev/sd?
>> Hi Mark. I've got a test machine on its way at the moment, so I'll make sure
>> I check this one out on it too.
>
> Our test machine is still being built, but we had an opportunity to try this on
> a couple of the live machines when their RAID arrays failed over the weekend.
> We still got timeouts, but (predictably!) they're not on flushes any more:
>
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.00: cmd 35/00:08:98:c6:00/00:00:4e:00:00/e0 tag 0 dm
...
> all the way through the night.
>
> I also have these in the log, but they are immediately after turning off the
> write caching in all drives, so may be a red herring with data still being
> written out.
>
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
> ata2.00: cmd c8/00:08:00:20:80/00:00:00:00:00/e0 tag 0 dm
...
> On another machine, I saw this with write caching turned off:
>
> ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> ata2.00: cmd 61/08:00:28:1f:80/00:00:00:00:00/40 tag 0 ncq 4096 out
...

0x35 is a 48-bit DMA WRITE, 0xc8 is a 28-bit DMA READ,
and 0x61 is an NCQ WRITE.
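[Editor's note: the opcode, and for non-NCQ commands the LBA, can be read straight off the `ata2.00: cmd ...` lines. The following is a small decoder sketch for libata's taskfile dump format `cmd/feat:nsect:lbal:lbam:lbah/hob.../dev`, using the log lines quoted above and standard ATA opcode names; treat the field layout as an assumption verified only against these logs.]

```python
# Decode libata's "cmd XX/ff:ns:ll:lm:lh/hf:hn:hl:hm:hh/dev" taskfile dump.
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xb0: "SMART",
    0xc8: "READ DMA",
    0xca: "WRITE DMA",
    0xe7: "FLUSH CACHE",
    0xea: "FLUSH CACHE EXT",
    0xec: "IDENTIFY DEVICE",
    0xef: "SET FEATURES",
}

def decode_cmd(dump):
    """Parse e.g. '35/00:08:98:c6:00/00:00:4e:00:00/e0' from an ata error."""
    cmd, low, high, dev = dump.split("/")
    opcode = int(cmd, 16)
    feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hfeat, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in high.split(":"))
    # LBA48: high-order bytes come from the "hob" (high order byte) fields.
    # (For NCQ commands the nsect field carries the tag, not a count.)
    lba = ((hlbah << 40) | (hlbam << 32) | (hlbal << 24) |
           (lbah << 16) | (lbam << 8) | lbal)
    return ATA_OPCODES.get(opcode, hex(opcode)), lba

name, lba = decode_cmd("35/00:08:98:c6:00/00:00:4e:00:00/e0")
```

Run against Chris's logs this confirms the mix: plain 48-bit and 28-bit DMA transfers and NCQ writes all timing out, i.e. nothing opcode-specific left once SMART was out of the picture.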

Looks like some kind of hardware trouble to me.
And as Tejun suggested, it's difficult to guess at
a cause other than the PSU.

Cheers, and good luck.

2009-09-22 06:16:20

by Robert Hancock

[permalink] [raw]
Subject: Re: MD/RAID time out writing superblock

On 09/21/2009 04:26 AM, Chris Webb wrote:
> On another machine, I saw this with write caching turned off:
>
> ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> ata2.00: cmd 61/08:00:28:1f:80/00:00:00:00:00/40 tag 0 ncq 4096 out
> res 40/00:00:40:1f:80/00:00:00:00:00/40 Emask 0x4 (timeout)
> ata2.00: status: { DRDY }
> ata2: hard resetting link
> ata2: softreset failed (device not ready)
> ata2: hard resetting link
> ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata2.00: configured for UDMA/133
> ata2: EH complete
> ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
> ata2.00: cmd 61/08:00:28:1f:80/00:00:00:00:00/40 tag 0 ncq 4096 out
> res 40/00:00:20:1f:80/00:00:00:00:00/40 Emask 0x4 (timeout)
> ata2.00: status: { DRDY }
> ata2: hard resetting link
> ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata2.00: qc timeout (cmd 0xef)
> ata2.00: failed to set xfermode (err_mask=0x4)
> ata2: hard resetting link
> ata2: softreset failed (device not ready)
> ata2: hard resetting link
> ata2: softreset failed (device not ready)
> ata2: hard resetting link
> ata2: link is slow to respond, please be patient (ready=0)
> ata2: softreset failed (device not ready)
> ata2: limiting SATA link speed to 1.5 Gbps
> ata2: hard resetting link
> ata2: softreset failed (device not ready)
> ata2: reset failed, giving up
> ata2.00: disabled

Basically, an NCQ command timed out, and then the drive stopped talking
to the controller even after being banged on with multiple resets.
That failure especially looks like some kind of hardware issue..