LinuxLists.cc - Monthly md check == hung machine; how do I debug?

2008-02-03 21:28:49

Subject: Monthly md check == hung machine; how do I debug?

I've got a machine with a 4 disk SATA raid10 configuration using md.
The entire disk is loop-AES encrypted, but that shouldn't matter
here.

Once a month, Debian runs:

/usr/share/mdadm/checkarray --cron --all --quiet

and the machine hangs within 30 minutes of that starting.

It seems that I can avoid the hang by not having "mdadm --monitor"
running, but I'm not certain if that's the case or if I've just been
lucky this go-round.

I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
Athlon(tm) 64 Processor 3700+.

I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
can't find anything that looks relevant.

So, how can I (help you all) debug this?

-Robin

--
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

2008-02-04 06:09:46

by martin f krafft

Subject: Re: Monthly md check == hung machine; how do I debug?

also sprach Robin Lee Powell <[email protected]> [2008.02.04.1021 +1300]:
> /usr/share/mdadm/checkarray --cron --all --quiet

FYI:
http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=blob;f=debian/checkarray

It basically does

echo check > /sys/block/$array/md/sync_action

for all arrays.
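
As a rough sketch of what that amounts to (this is not the checkarray
script itself; the function name, the optional sysfs-root argument and
the "only if idle" guard are assumptions made for illustration):

```shell
#!/bin/sh
# Start a "check" pass on every md array found under a sysfs root.
# The optional first argument is a hypothetical override so the loop
# can be tried against a scratch directory; the real tree is /sys.
start_md_checks() {
    sysfs_root="${1:-/sys}"
    for action in "$sysfs_root"/block/md*/md/sync_action; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$action" ] || continue
        # Extra guard in this sketch: leave arrays alone unless they are idle.
        if [ "$(cat "$action")" = "idle" ]; then
            echo check > "$action"
        fi
    done
}
```

Run as root, "start_md_checks" with no argument would write to the real
/sys/block/md*/md/sync_action files.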

--
martin | http://madduck.net/ | http://two.sentenc.es/

i feel like i'm diagonally parked in a parallel universe.

spamtraps: [email protected]

2008-02-04 06:59:38

by Robin Lee Powell

Subject: Re: Monthly md check == hung machine; how do I debug?

On Mon, Feb 04, 2008 at 06:37:02PM +1300, martin f krafft wrote:
> also sprach Robin Lee Powell <[email protected]> [2008.02.04.1021 +1300]:
> > /usr/share/mdadm/checkarray --cron --all --quiet
>
> FYI:
> http://git.debian.org/?p=pkg-mdadm/mdadm.git;a=blob;f=debian/checkarray
>
> It basically does
>
> echo check > /sys/block/$array/md/sync_action
>
> for all arrays.

Thanks for the clarification.

I've tried a few more times, by the way, and it seems that without
"mdadm --monitor" running, the hang doesn't occur.

I'd certainly prefer being notified of state changes, though, so
that's not much of a solution.

-Robin

--
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

2008-02-04 10:42:59

by Nick Piggin

Subject: Re: Monthly md check == hung machine; how do I debug?

On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> I've got a machine with a 4 disk SATA raid10 configuration using md.
> The entire disk is loop-AES encrypted, but that shouldn't matter
> here.
>
> Once a month, Debian runs:
>
> /usr/share/mdadm/checkarray --cron --all --quiet
>
> and the machine hangs within 30 minutes of that starting.
>
> It seems that I can avoid the hang by not having "mdadm --monitor"
> running, but I'm not certain if that's the case or if I've just been
> lucky this go-round.
>
> I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> Athlon(tm) 64 Processor 3700+.
>
> I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> can't find anything that looks relevant.
>
> So, how can I (help you all) debug this?

Do you have a serial console? Does it respond to pings?

Can you try to get sysrq+T traces, and sysrq+P traces, and post
them?
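
If a shell still responds during the hang, the same dumps can usually be
requested without a keyboard by writing the command letter to
/proc/sysrq-trigger (this needs root and a kernel built with
CONFIG_MAGIC_SYSRQ); the traces then go to the kernel log. A minimal
sketch, where the function name and the optional second argument are
assumptions added for illustration:

```shell
#!/bin/sh
# Request a SysRq dump: "t" for task traces, "p" for CPU registers.
# The optional second argument is a hypothetical override so the function
# can be exercised against a scratch file instead of /proc/sysrq-trigger.
sysrq_dump() {
    trigger="${2:-/proc/sysrq-trigger}"
    case "$1" in
        t|p) printf '%s\n' "$1" > "$trigger" ;;
        *)   echo "usage: sysrq_dump t|p [trigger-file]" >&2; return 1 ;;
    esac
}
```

For example, "sysrq_dump t" followed by "dmesg" should show the task
traces, provided the machine is still alive enough to run them.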

2008-02-05 17:11:00

by Robin Lee Powell

Subject: Re: Monthly md check == hung machine; how do I debug?

On Mon, Feb 04, 2008 at 09:40:55PM +1100, Nick Piggin wrote:
> On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> > I've got a machine with a 4 disk SATA raid10 configuration using
> > md. The entire disk is loop-AES encrypted, but that shouldn't
> > matter here.
> >
> > Once a month, Debian runs:
> >
> > /usr/share/mdadm/checkarray --cron --all --quiet
> >
> > and the machine hangs within 30 minutes of that starting.
> >
> > It seems that I can avoid the hang by not having "mdadm
> > --monitor" running, but I'm not certain if that's the case or if
> > I've just been lucky this go-round.
> >
> > I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> > Athlon(tm) 64 Processor 3700+.
> >
> > I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> > can't find anything that looks relevant.
> >
> > So, how can I (help you all) debug this?
>
> Do you have a serial console? Does it respond to pings?

No and yes.

> Can you try to get sysrq+T traces, and sysrq+P traces, and post
> them?

I played with those after you suggested it, but without serial
console had no way to capture them.

I was able to solve the problem, however, like so:

132c133
< # CONFIG_PREEMPT_NONE is not set
---
> CONFIG_PREEMPT_NONE=y
134,135c135,136
< CONFIG_PREEMPT=y
< CONFIG_PREEMPT_BKL=y
---
> # CONFIG_PREEMPT is not set
> # CONFIG_PREEMPT_BKL is not set
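
Read as a standard diff, the lines marked "<" come from the old .config
and the lines marked ">" from the new one, so the change appears to be
switching off kernel preemption (CONFIG_PREEMPT and CONFIG_PREEMPT_BKL)
in favour of CONFIG_PREEMPT_NONE. A minimal sketch for checking which
preemption model a given .config selects; the function name and the
default path are assumptions, and a self-compiled kernel may keep its
.config somewhere else:

```shell
#!/bin/sh
# Print the preemption model selected in a kernel .config file.
# The default path is an assumption; many distributions install the
# config as /boot/config-<release>, but a custom build may not.
preempt_model() {
    config="${1:-/boot/config-$(uname -r)}"
    # "=y" directly after the name keeps CONFIG_PREEMPT_BKL from matching.
    if grep -q '^CONFIG_PREEMPT=y' "$config"; then
        echo "CONFIG_PREEMPT"
    elif grep -q '^CONFIG_PREEMPT_VOLUNTARY=y' "$config"; then
        echo "CONFIG_PREEMPT_VOLUNTARY"
    elif grep -q '^CONFIG_PREEMPT_NONE=y' "$config"; then
        echo "CONFIG_PREEMPT_NONE"
    else
        echo "unknown"
    fi
}
```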

-Robin

--
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

2008-02-05 18:55:37

by Lennart Sorensen

Subject: Re: Monthly md check == hung machine; how do I debug?

On Tue, Feb 05, 2008 at 09:10:05AM -0800, Robin Lee Powell wrote:
> On Mon, Feb 04, 2008 at 09:40:55PM +1100, Nick Piggin wrote:
> > On Monday 04 February 2008 08:21, Robin Lee Powell wrote:
> > > I've got a machine with a 4 disk SATA raid10 configuration using
> > > md. The entire disk is loop-AES encrypted, but that shouldn't
> > > matter here.
> > >
> > > Once a month, Debian runs:
> > >
> > > /usr/share/mdadm/checkarray --cron --all --quiet
> > >
> > > and the machine hangs within 30 minutes of that starting.
> > >
> > > It seems that I can avoid the hang by not having "mdadm
> > > --monitor" running, but I'm not certain if that's the case or if
> > > I've just been lucky this go-round.
> > >
> > > I'm on kernel 2.6.23.1, my own compile thereof, x86_64, AMD
> > > Athlon(tm) 64 Processor 3700+.
> > >
> > > I've looked through all the 2.6.23 and 2.6.24 Changelogs, and I
> > > can't find anything that looks relevant.
> > >
> > > So, how can I (help you all) debug this?
> >
> > Do you have a serial console? Does it respond to pings?
>
> No and yes.
>
> > Can you try to get sysrq+T traces, and sysrq+P traces, and post
> > them?
>
> I played with those after you suggested it, but without serial
> console had no way to capture them.
>
> I was able to solve the problem, however, like so:

I tend to adjust the max disk speed raid is allowed to use, since the
default of 200MB/s makes the system close to unusable while it is taking
place. Could having slow disk access be causing things to lock up?

Things made much more sense some years ago when the default was 10MB/s
or something along those lines.

Who has 200MB/s capable hardware anyhow?
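
For reference, the limit in question is exposed as
/proc/sys/dev/raid/speed_limit_max, in KiB/s (so the 200MB/s default
reads as 200000). A minimal sketch of lowering it; the function name and
the optional file argument are assumptions for illustration, and writing
the real file needs root:

```shell
#!/bin/sh
# Cap md resync/check throughput. The value is in KiB/s.
# The optional second argument is a hypothetical override so the function
# can be tried against a scratch file instead of the real /proc entry.
set_md_speed_limit_max() {
    limit_kib="$1"
    limit_file="${2:-/proc/sys/dev/raid/speed_limit_max}"
    case "$limit_kib" in
        # Reject empty or non-numeric input before touching the file.
        ''|*[!0-9]*) echo "limit must be a whole number of KiB/s" >&2; return 1 ;;
    esac
    echo "$limit_kib" > "$limit_file"
}
```

"set_md_speed_limit_max 10000" would bring it back to roughly the old
10MB/s figure.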

--
Len Sorensen

2008-02-05 19:18:50

by Robin Lee Powell

Subject: Re: Monthly md check == hung machine; how do I debug?

On Tue, Feb 05, 2008 at 01:55:17PM -0500, Lennart Sorensen wrote:
> I tend to adjust the max disk speed raid is allowed to use, since
> the default of 200MB/s makes the system close to unusable while it
> is taking place. Could having slow disk access be causing things
> to lock up?

I don't know if it could or not, but I have no performance problems
at all when the sync is running, as long as it doesn't lock up, so
it seems unlikely to me.

(Shout out to a fellow CSC sysadmin, btw.)

-Robin

--
Lojban Reason #17: http://en.wikipedia.org/wiki/Buffalo_buffalo
Proud Supporter of the Singularity Institute - http://singinst.org/
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/

2008-02-05 20:28:23

by NeilBrown

Subject: Re: Monthly md check == hung machine; how do I debug?

On Tuesday February 5, [email protected] wrote:
>
> I was able to solve the problem, however, like so:
>
> 132c133
> < # CONFIG_PREEMPT_NONE is not set
> ---
> > CONFIG_PREEMPT_NONE=y
> 134,135c135,136
> < CONFIG_PREEMPT=y
> < CONFIG_PREEMPT_BKL=y
> ---
> > # CONFIG_PREEMPT is not set
> > # CONFIG_PREEMPT_BKL is not set
>

This suggests that there is some sort of race.
Given that I've never hit it on SMP machines, it is probably a very
small window that opens immediately after some event that triggers
kernel preemption.

The only thing "mdadm --monitor" does in the kernel is read /proc/mdstat
and maybe make some GET_ARRAY_INFO / GET_DISK_INFO ioctl calls.

They don't do much more than grab the reconfig_mutex.....
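
For anyone trying to reproduce this without mdadm, the /proc/mdstat read
can be approximated from a shell. A minimal sketch that prints the
progress line of any running check, resync or recovery; the function
name is hypothetical and the pattern assumes the usual "check = NN.N%"
layout of that line:

```shell
#!/bin/sh
# Print the progress lines for any check, resync or recovery in an
# mdstat-style file. The optional argument is an override for testing;
# the real file is /proc/mdstat.
md_progress() {
    mdstat="${1:-/proc/mdstat}"
    grep -E '(check|resync|recovery) *=' "$mdstat"
}
```

Calling "md_progress" in a loop with a short sleep gives a crude stand-in
for the polling side of the monitor; it does not issue any ioctls.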

What sort of hardware do you have? x86? SMP or uni-processor?
Also, exactly what kernel are you running?

I might see if I can reproduce it... so if you can send me the broken
.config, that might help too.

Thanks,
NeilBrown

2008-02-05 21:18:26

by Robin Lee Powell

Subject: Re: Monthly md check == hung machine; how do I debug?

On Wed, Feb 06, 2008 at 07:27:56AM +1100, Neil Brown wrote:
> On Tuesday February 5, [email protected] wrote:
> >
> > I was able to solve the problem, however, like so:
> >
> > 132c133
> > < # CONFIG_PREEMPT_NONE is not set
> > ---
> > > CONFIG_PREEMPT_NONE=y
> > 134,135c135,136
> > < CONFIG_PREEMPT=y
> > < CONFIG_PREEMPT_BKL=y
> > ---
> > > # CONFIG_PREEMPT is not set
> > > # CONFIG_PREEMPT_BKL is not set
> >
>
> This suggests that there is some sort of race. Given that I've
> never hit it on SMP machines, it is probably a very small window
> that opens immediately after some event that triggers kernel
> preemption.
>
> The only thing "mdadm --monitor" does

Going to stop you right there; "mdadm --monitor" wasn't it, nor was
smartd as I thought at one point. I honestly don't know what was
triggering it, except maybe disk access. The fact that backups were
running at the same time as the sync seemed to make it happen
faster; that's the best I've got at this point.

> What sort of hardware do you have? x86? SMP or uni-processor?
> Also, exactly what kernel are you running?

rlpowell@chain> uname -a
Linux chain.digitalkingdom.org 2.6.23.1-dk3 #4 SMP Mon Feb 4 06:14:44 PST 2008 x86_64 GNU/Linux
rlpowell@chain> cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 39
model name : AMD Athlon(tm) 64 Processor 3700+
stepping : 1
cpu MHz : 2210.251
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflu
t fxsr_opt lm 3dnowext 3dnow up rep_good pni lahf_lm
bogomips : 4422.66
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

> I might see if I can reproduce it... so if you can send me the
> broken .config, that might help too.

http://teddyb.org/~rlpowell/media/regular/config-2.6.23.1-dk2.txt

-Robin