2009-09-17 00:33:40

by NeilBrown

Subject: Re: MD/RAID time out writing superblock

On Thursday September 17, [email protected] wrote:
>
> > There are two more symptoms we are seeing on the same which may be
> > connected, or may be separate bugs in their own right:
> >
> > - 'cat /proc/mdstat' sometimes hangs before returning during normal
> > operation, although most of the time it is fine. We have seen hangs of
> > up to 15-20 seconds during resync. Might this be a less severe example
> > of the lock-up which causes a timeout and reset after 30 seconds?
> >
> > - We've also had a few occasions of O_SYNC writes to raid arrays (from
> > qemu-kvm via LVM2) completely deadlocking against resync writes when the
> > maximum md resync speed is set sufficiently high, even where the minimum
> > md resync speed is set to zero (although this certainly helps). However,
> > I suspect this is an unrelated issue as I've seen this on other hardware
> > running other kernel configs.
>
> I think these two will be best answered by Neil Brown. Neil?
>

"cat /proc/mdstat" should only hang if the mddev reconfig_mutex is
held for an extended period of time.
The reconfig_mutex is held while superblocks are being written.

So yes, an extended device timeout while updating the md superblock
can cause "cat /proc/mdstat" to hang for the duration of the timeout.

For the O_SYNC:
I think this is a RAID1 - is that correct?
With RAID1, as soon as any IO request arrives, resync is suspended and
as soon as all resync requests complete, the IO is permitted to
proceed.
So normal IO takes absolute precedence over resync IO.

So I am very surprised to hear that O_SYNC writes deadlock
completely.
As O_SYNC writes are serialised, there will be a moment between
every pair when there is no IO pending. This will allow resync to
get one "window" of resync IO started between each pair of writes.
So I can well believe that a sequence of O_SYNC writes are a couple
of orders of magnitude slower when resync is happening than without.
But it shouldn't deadlock completely.
Once you get about 64 sectors of O_SYNC IO through, the resync
should notice and back off, and resync IO will be limited to the
'minimum' speed.
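The behaviour described above suggests a simple way to observe it: time each serialised O_SYNC write and compare latencies with and without a resync running. A minimal probe (a sketch, not from the thread; it defaults to a throwaway temp file, and any real target path must be a disposable scratch volume, never live data):

```python
import os
import sys
import tempfile
import time

def o_sync_write_latencies(path, count=10, size=4096):
    """Issue serialised O_SYNC writes and record per-write latency.

    Per the description above, each write should wait for any in-flight
    resync window, so latencies during a resync should be visibly
    higher than on an idle array."""
    buf = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    latencies = []
    try:
        for _ in range(count):
            t0 = time.monotonic()
            os.write(fd, buf)
            latencies.append(time.monotonic() - t0)
    finally:
        os.close(fd)
    return latencies

if __name__ == "__main__":
    # Default target is a throwaway temp file; pass a scratch device or
    # LV path to probe a real array (destructive -- never a live volume).
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkstemp()[1]
    print("max write latency: %.3fs" % max(o_sync_write_latencies(target)))
```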

NeilBrown


2009-09-17 11:58:38

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Tejun Heo <[email protected]> writes:

> The most common cause for FLUSH timeout has been power related issues.
> This problem becomes more pronounced in RAID configurations because
> FLUSHes end up being issued to all drives in the array simultaneously
> causing concurrent power spikes from the drives. When proper barrier
> was introduced to md earlier this year, I got two separate reports
> where brief voltage drops caused by simultaneous FLUSHes led to drives
> powering off briefly and losing data in their buffers, leading to data
> corruption. People always think their PSUs are good because they are
> rated high wattage and bear hefty price tag but many people reporting
> problems which end up being diagnosed as power problem have these
> fancy PSUs.

Hi Tejun. This sounds very plausible as a diagnosis. Six drives hanging off the
single power supply is the maximum that can be fitted in this Supermicro
chassis, and we have 32GB of RAM and two 4-core Xeon processors in there too,
so we could well be right at the limit for the rating of the power supply.

> So, if your machines share the same configuration, the first thing I'll do
> would be to prepare a separate PSU, power it up and connect half of the
> drives including what used to be the offending one to it and see whether
> the failure pattern changes.

It's quite hard for us to do this with these machines as we have them managed
by a third party in a datacentre to which we don't have physical access.
However, I could very easily get an extra 'test' machine built in there,
generate a workload that consistently reproduces the problems on the six
drives, and then retry with an array built from 5, 4, 3 and 2 drives
successively, taking the unused drives out of the chassis, to see if reducing
the load on the power supply with a smaller array helps.

When I try to write a test case, would it be worth me trying to reproduce
without md in the loop, e.g. do 6-way simultaneous random-seek+write+sync
continuously, or is it better to rely on md's barrier support and just do
random-seek+write via md? Is there a standard work pattern/write size that
would be particularly likely to provoke power overload problems on drives?
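The workload pattern being proposed (N workers doing random-seek + write + fsync in parallel, one per drive, so that cache flushes line up simultaneously) could be sketched as below. This is an illustration only: the temp-file targets and iteration counts are placeholders, and on a real test box each worker would be pointed at a separate scratch drive.

```python
import os
import random
import tempfile
from multiprocessing import Pool

def seek_write_sync(args):
    """One worker: random-seek + write + fsync in a tight loop.

    fsync() forces the kernel to flush the drive cache each iteration,
    so running one worker per drive lines up near-simultaneous FLUSH
    commands -- the suspected power-spike trigger."""
    path, span, iterations = args
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(iterations):
            os.lseek(fd, random.randrange(span), os.SEEK_SET)
            os.write(fd, b"\0" * 4096)
            os.fsync(fd)
    finally:
        os.close(fd)
    return path

if __name__ == "__main__":
    # Placeholder targets: scratch temp files stand in for the six
    # drives; on the real test box these would be raw scratch devices.
    targets = [tempfile.mkstemp()[1] for _ in range(6)]
    with Pool(len(targets)) as pool:
        pool.map(seek_write_sync, [(t, 1 << 20, 50) for t in targets])
    print("all workers finished")
```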

Neil Brown <[email protected]> writes:

> [Chris Webb <[email protected]> wrote:]
>
> > 'cat /proc/mdstat' sometimes hangs before returning during normal
> > operation, although most of the time it is fine. We have seen hangs of
> > up to 15-20 seconds during resync. Might this be a less severe example
> > of the lock-up which causes a timeout and reset after 30 seconds?
>
> "cat /proc/mdstat" should only hang if the mddev reconfig_mutex is
> held for an extended period of time.
> The reconfig_mutex is held while superblocks are being written.
>
> So yes, an extended device timeout while updating the md superblock
> can cause "cat /proc/mdstat" to hang for the duration of the timeout.

Thanks Neil. This implies that when we see these fifteen-second hangs reading
/proc/mdstat without write errors, there are genuinely successful superblock
writes which are taking fifteen seconds to complete, presumably corresponding
to flushes which complete but take a full 15s to do so.

Would such very slow (but ultimately successful) flushes be consistent with the
theory of power supply issues affecting the drives? It feels like the 30s
timeouts on flush could be just a more severe version of the 15s very slow
flushes.

Tejun Heo <[email protected]> writes:

> > Some of these timeouts also leave us with a completely dead drive,
> > and we need to reboot the machine before it can be accessed
> > again. (Hot plugging it out and back in again isn't sufficient to
> > bring it back to life, so maybe a controller problem, although other
> > drives on the same controller stay alive?) An example is [2].
>
> Ports behave mostly independently and it sure is possible that one
> port locks up while others operate fine. I've never seen such
> incidents reported for intel ahci's tho. If you hot unplug and then
> replug the drive, what does the kernel say?

We've only tried this once, and on that occasion there was nothing in the
kernel log at all. (I actually telephoned the data centre engineer to ask when
he was going to do it for us because I didn't see any messages, and it turned
out he already had!)

Cheers,

Chris.

2009-09-17 12:01:11

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Neil Brown <[email protected]> writes:

> For the O_SYNC:
> I think this is a RAID1 - is that correct?

Hi Neil. It's a six-disk RAID10 with the n2 layout, but I've also seen the
behaviour on a two-disk RAID1 around the time of 2.6.27.

> With RAID1, as soon as any IO request arrives, resync is suspended and
> as soon as all resync requests complete, the IO is permitted to
> proceed.
> So normal IO takes absolute precedence over resync IO.
>
> So I am very surprised to hear that O_SYNC writes deadlock
> completely.
> As O_SYNC writes are serialised, there will be a moment between
> every pair when there is no IO pending. This will allow resync to
> get one "window" of resync IO started between each pair of writes.
> So I can well believe that a sequence of O_SYNC writes are a couple
> of orders of magnitude slower when resync is happening than without.
> But it shouldn't deadlock completely.
> Once you get about 64 sectors of O_SYNC IO through, the resync
> should notice and back-off and resync IO will be limited to the
> 'minimum' speed.

The symptoms seem to be that I can't read or write to /dev/mdX but I can
read from the underlying /dev/sd* devices fine, at pretty much full speed. I
didn't try writing to them as there's lots of live customer data on the RAID
arrays!

The configuration is lvm2 (i.e. device-mapper linear targets) on top of md
on top of sd, and we've seen the symptoms with the virtual machines
accessing the logical volumes configured to open in O_SYNC mode, and with
them configured to open in O_DIRECT mode. During the deadlock, cat
/proc/mdstat does return promptly (i.e. not blocked), and shows a slow and
gradually falling sync rate---I think that there's no sync writing going on
either and the drives are genuinely idle. We have to reset the machine to
bring it back to life and a graceful reboot fails.

Anyway, I see this relatively infrequently, so what I'll try to do is to
create a reproducible test case and then follow up to you and the RAID list
with that. At the moment, I understand that my report is a bit anecdotal,
and without a proper idea of what conditions are needed to make it happen
it's pretty much impossible to diagnose or work on!

Cheers,

Chris.

2009-09-17 15:45:34

by Tejun Heo

Subject: Re: MD/RAID time out writing superblock

Hello,

Chris Webb wrote:
> It's quite hard for us to do this with these machines as we have
> them managed by a third party in a datacentre to which we don't have
> physical access. However, I could very easily get an extra 'test'
> machine built in there, generate a workload that consistently
> reproduces the problems on the six drives, and then retry with an
> array built from 5, 4, 3 and 2 drives successively, taking the
> unused drives out of the chassis, to see if reducing the load on the power
> supply with a smaller array helps.

Yeap, that also should shed some light on it.

> When I try to write a test case, would it be worth me trying to
> reproduce without md in the loop, e.g. do 6-way simultaneous
> random-seek+write+sync continuously, or is it better to rely on md's
> barrier support and just do random-seek+write via md? Is there a
> standard work pattern/write size that would be particularly likely
> to provoke power overload problems on drives?

Taking it out of the chain would be helpful, but if md can reproduce
the problem reliably, trying with md first would be easier. :-)

>> So yes, an extended device timeout while updating the md superblock
>> can cause "cat /proc/mdstat" to hang for the duration of the timeout.
>
> Thanks Neil. This implies that when we see these fifteen second
> hangs reading /proc/mdstat without write errors, there are genuinely
> successful superblock writes which are taking fifteen seconds to
> complete, presumably corresponding to flushes which complete but
> take a full 15s to do so.
>
> Would such very slow (but ultimately successful) flushes be
> consistent with the theory of power supply issues affecting the
> drives? It feels like the 30s timeouts on flush could be just a more
> severe version of the 15s very slow flushes.

Probably not. Power problems usually don't resolve themselves with
longer timeout. If the drive genuinely takes longer than 30s to
flush, it would be very interesting tho. That's something people have
been worrying about but hasn't materialized yet. The timeout is
controlled by SD_TIMEOUT in drivers/scsi/sd.h. You might want to bump
it up to, say, 60s and see whether anything changes.

>>> Some of these timeouts also leave us with a completely dead drive,
>>> and we need to reboot the machine before it can be accessed
>>> again. (Hot plugging it out and back in again isn't sufficient to
>>> bring it back to life, so maybe a controller problem, although other
>>> drives on the same controller stay alive?) An example is [2].
>> Ports behave mostly independently and it sure is possible that one
>> port locks up while others operate fine. I've never seen such
>> incidents reported for intel ahci's tho. If you hot unplug and then
>> replug the drive, what does the kernel say?
>
> We've only tried this once, and on that occasion there was nothing
> in the kernel log at all. (I actually telephoned the data centre
> engineer to ask when he was going to do it for us because I didn't
> see any messages, and it turned out he already had!)

Hmmm... that means the host port was dead. Strange, I've never seen
intel ahci doing that. If possible, it would be great if you could
verify it.

Thanks.

--
tejun

2009-09-18 17:08:25

by Chris Webb

Subject: Re: MD/RAID time out writing superblock

Tejun Heo <[email protected]> writes:

> Chris Webb wrote:
>
> > Would such very slow (but ultimately successful) flushes be
> > consistent with the theory of power supply issues affecting the
> > drives? It feels like the 30s timeouts on flush could be just a more
> > severe version of the 15s very slow flushes.
>
> Probably not. Power problems usually don't resolve themselves with
> longer timeout. If the drive genuinely takes longer than 30s to
> flush, it would be very interesting tho. That's something people have
> been worrying about but hasn't materialized yet. The timeout is
> controlled by SD_TIMEOUT in drivers/scsi/sd.h. You might want to bump
> it up to, say, 60s and see whether anything changes.

I'll add that to the list of things to check out on the test machine with a
more disposable installation on it! The 15s flushes we're seeing on
superblock barrier writes do already feel dangerously close to the 30s
hardcoded timeout to me: it's only a factor of two.

Cheers,

Chris.

2009-09-20 18:46:31

by Robert Hancock

Subject: Re: MD/RAID time out writing superblock

On 09/17/2009 09:44 AM, Tejun Heo wrote:
>> Thanks Neil. This implies that when we see these fifteen second
>> hangs reading /proc/mdstat without write errors, there are genuinely
>> successful superblock writes which are taking fifteen seconds to
>> complete, presumably corresponding to flushes which complete but
>> take a full 15s to do so.
>>
>> Would such very slow (but ultimately successful) flushes be
>> consistent with the theory of power supply issues affecting the
>> drives? It feels like the 30s timeouts on flush could be just a more
>> severe version of the 15s very slow flushes.
>
> Probably not. Power problems usually don't resolve themselves with
> longer timeout. If the drive genuinely takes longer than 30s to
> flush, it would be very interesting tho. That's something people have
> been worrying about but hasn't materialized yet. The timeout is
> controlled by SD_TIMEOUT in drivers/scsi/sd.h. You might want to bump
> it up to, say, 60s and see whether anything changes.

It's possible that if the power dip only slightly disrupted the drive, it
might just take longer to complete the write. I've also seen reports of
vibration issues causing problems in RAID arrays (there's a video on
Youtube of a guy yelling at a Sun disk array during heavy I/O and the
resulting vibrations causing an immediate spike in I/O service times).
Could be something like that causing issues with simultaneous media
access to all drives in the array, too.

2009-09-21 00:02:24

by Kyle Moffett

Subject: Re: MD/RAID time out writing superblock

On Sun, Sep 20, 2009 at 14:46, Robert Hancock <[email protected]> wrote:
> On 09/17/2009 09:44 AM, Tejun Heo wrote:
>>>
>>> Thanks Neil. This implies that when we see these fifteen second
>>> hangs reading /proc/mdstat without write errors, there are genuinely
>>> successful superblock writes which are taking fifteen seconds to
>>> complete, presumably corresponding to flushes which complete but
>>> take a full 15s to do so.
>>>
>>> Would such very slow (but ultimately successful) flushes be
>>> consistent with the theory of power supply issues affecting the
>>> drives? It feels like the 30s timeouts on flush could be just a more
>>> severe version of the 15s very slow flushes.
>>
>> Probably not.  Power problems usually don't resolve themselves with
>> longer timeout.  If the drive genuinely takes longer than 30s to
>> flush, it would be very interesting tho.  That's something people have
>> been worrying about but hasn't materialized yet.  The timeout is
>> controlled by SD_TIMEOUT in drivers/scsi/sd.h.  You might want to bump
>> it up to, say, 60s and see whether anything changes.
>
> It's possible if the power dip only slightly disrupted the drive it might
> just take longer to complete the write. I've also seen reports of vibration
> issues causing problems in RAID arrays (there's a video on Youtube of a guy
> yelling at a Sun disk array during heavy I/O and the resulting vibrations
> causing an immediate spike in I/O service times). Could be something like
> that causing issues with simultaneous media access to all drives in the
> array, too..

There have been a rather large number of reported firmware problems
lately with various models of Seagate SATA drives; typically they
cause command timeouts and occasionally they completely brick the
drive (restart does not fix it). I possessed 3 of these for a while
and they pretty consistently fell over (even with just 3 in a
low-power-CPU box with a good PSU rated for 8 drives).

You might check with the various Seagate tech support lines to see if
your drive firmwares are affected by the bugs (Some were related to
NCQ command processing, others were just single-command failures).
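Checking which firmware revision each drive carries can be scripted around smartmontools, which reports it in its identify output. A sketch (the sample text is illustrative of the output shape only, and the model/revision strings in it are made up):

```python
import re
import subprocess

def parse_firmware(text):
    """Extract the firmware revision from `smartctl -i` output."""
    m = re.search(r"^Firmware Version:\s*(\S+)", text, re.MULTILINE)
    return m.group(1) if m else None

def firmware_version(device):
    """Ask smartmontools for the drive's identify data and pull out the
    firmware revision, to compare against the vendor's affected list."""
    out = subprocess.run(["smartctl", "-i", device],
                         capture_output=True, text=True).stdout
    return parse_firmware(out)

if __name__ == "__main__":
    # Illustrative output shape only; run firmware_version("/dev/sdX")
    # on the real machine (the model and revision here are made up).
    sample = "Device Model:     ST3500000AA\nFirmware Version: SD99\n"
    print(parse_firmware(sample))  # prints SD99
```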

Cheers,
Kyle Moffett