2007-06-13 18:23:26

by Mike Snitzer

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

On 6/13/07, Mike Snitzer <[email protected]> wrote:
> On 6/12/07, Neil Brown <[email protected]> wrote:
...
> > > > On 6/12/07, Neil Brown <[email protected]> wrote:
> > > > > On Tuesday June 12, [email protected] wrote:
> > > > > >
> > > > > > I can provide more detailed information; please just ask.
> > > > > >
> > > > >
> > > > > A complete sysrq trace (all processes) might help.

Bringing this back to a wider audience. I provided the full sysrq
trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
following trace:

md0_raid1 D ffff810026183ce0 5368 31663 11 3822 29488 (L-TLB)
ffff810026183ce0 ffff810031e9b5f8 0000000000000008 000000000000000a
ffff810037eef040 ffff810037e17100 00043e64d2983c1f 0000000000004c7f
ffff810037eef210 0000000100000001 000000081c506640 00000000ffffffff
Call Trace:
[<ffffffff8003e371>] keventd_create_kthread+0x0/0x61
[<ffffffff801b9364>] md_super_wait+0xa8/0xbc
[<ffffffff8003e711>] autoremove_wake_function+0x0/0x2e
[<ffffffff801b9adb>] md_update_sb+0x1dd/0x23a
[<ffffffff801bed2a>] md_check_recovery+0x15f/0x449
[<ffffffff882a1af3>] :raid1:raid1d+0x27/0xc1e
[<ffffffff80233209>] thread_return+0x0/0xde
[<ffffffff8023279c>] __sched_text_start+0xc/0xa79
[<ffffffff8003e371>] keventd_create_kthread+0x0/0x61
[<ffffffff80233a9f>] schedule_timeout+0x1e/0xad
[<ffffffff8003e371>] keventd_create_kthread+0x0/0x61
[<ffffffff801bd06c>] md_thread+0xf8/0x10e
[<ffffffff8003e711>] autoremove_wake_function+0x0/0x2e
[<ffffffff801bcf74>] md_thread+0x0/0x10e
[<ffffffff8003e5e7>] kthread+0xd4/0x109
[<ffffffff8000a505>] child_rip+0xa/0x11
[<ffffffff8003e371>] keventd_create_kthread+0x0/0x61
[<ffffffff8003e513>] kthread+0x0/0x109
[<ffffffff8000a4fb>] child_rip+0x0/0x11

To which Neil had the following to say:

> > md0_raid1 is holding the lock on the array and trying to write out the
> > superblocks for some reason, and the write isn't completing.
> > As it is holding the locks, mdadm and /proc/mdstat are hanging.
> >
> > You seem to have nbd-servers running on this machine. Are they
> > serving the device that md is using (i.e. a loop-back situation)? I
> > would expect memory deadlocks would be very easy to hit in that
> > situation, but I don't know if that is what has happened.
> >
> > Nothing else stands out.
> >
> > Could you clarify the arrangement of nbd. Where are the servers and
> > what are they serving?
>
> We're using MD+NBD for disaster recovery (one local scsi device, one
> remote via nbd). The nbd-server is not contributing to md0. The
> nbd-server is connected to a remote machine that is running a raid1
> remotely

To take this further I've now collected a full sysrq trace of this
hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
md0_raid1 trace is comparable to the RHEL5 trace from above:

md0_raid1 D ffff810001089780 0 8583 51 8952 8260 (L-TLB)
ffff810812393ca8 0000000000000046 ffff8107b7fbac00 000000000000000a
ffff81081f3c6a18 ffff81081f3c67d0 ffff8104ffe8f100 000044819ddcd5e2
000000000000eb8b 00000007028009c7
Call Trace: <ffffffff801e1f94>{generic_make_request+501}
<ffffffff8026946c>{md_super_wait+168}
<ffffffff80145aa2>{autoremove_wake_function+0}
<ffffffff8026f056>{write_page+128} <ffffffff80269ac7>{md_update_sb+220}
<ffffffff8026bda5>{md_check_recovery+361}
<ffffffff883a97f5>{:raid1:raid1d+38}
<ffffffff8013ad8f>{lock_timer_base+27}
<ffffffff8013ae01>{try_to_del_timer_sync+81}
<ffffffff8013ae16>{del_timer_sync+12}
<ffffffff802d9adf>{schedule_timeout+146}
<ffffffff801456a9>{keventd_create_kthread+0}
<ffffffff8026d5d8>{md_thread+248}
<ffffffff80145aa2>{autoremove_wake_function+0}
<ffffffff8026d4e0>{md_thread+0}
<ffffffff80145965>{kthread+236} <ffffffff8010bdce>{child_rip+8}
<ffffffff801456a9>{keventd_create_kthread+0}
<ffffffff80145879>{kthread+0}
<ffffffff8010bdc6>{child_rip+0}

Taking a step back, here is what was done to reproduce on SLES10:
1) establish a raid1 mirror (md0) using one local member (sdc1) and
one remote member (nbd0)
2) power off the remote machine, thereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce
the MD layer to mark the nbd device as "faulty"
4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
above md0_raid1 trace (rough commands below).
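
For reference, the above maps to roughly the following commands. This
is a sketch of my setup, not a recipe: the remote host name, the nbd
port (2000), the mount point and the dd sizes are just what I happen
to use.

  # 1) build the mirror from one local member and one nbd member
  nbd-client <remote-host> 2000 /dev/nbd0      # illustrative port
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/nbd0
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt/md0

  # 2) power off the remote machine (out of band)

  # 3) generate writes so MD has a reason to update the superblocks
  dd if=/dev/zero of=/mnt/md0/io_test bs=1M count=64
  sync

  # 4) observe the hang and capture the task traces (sysrq must be enabled)
  cat /proc/mdstat                # hangs
  echo t > /proc/sysrq-trigger    # dumps all task traces to the console/dmesg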

To be clear, the MD superblock update hangs indefinitely on RHEL5.
But with SLES10 it eventually succeeds (and MD marks the nbd0 member
faulty); and the other tasks that were blocking waiting for the MD
lock (e.g. 'cat /proc/mdstat') then complete immediately.

It should be noted that this MD+NBD configuration has worked
flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
RHEL4U4 distro). Steps have not been taken to try to reproduce with
2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
others to suggest I do so.

2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
yet both SLES10 and RHEL5 kernels do:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7

If not this specific NBD change, something appears to have changed
with how NBD behaves in the face of its connection to the server
being lost. Almost like the MD superblock update that would be
written to nbd0 is blocking within nbd or the network layer because of
a network timeout issue?

I will try to get a better understanding of what is _really_ happening
with systemtap; but others' suggestions/insight is very welcome.
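
For example, the kind of probe I have in mind is just a minimal sketch
along these lines (it assumes kernel and nbd debuginfo are installed;
the function names are lifted from the traces above):

  # log entry into the nbd request path and the md superblock wait
  stap -e 'probe module("nbd").function("nbd_send_req"),
                 kernel.function("md_super_wait")
           { printf("%s(%d) -> %s\n", execname(), pid(), probefunc()) }'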

regards,
Mike


2007-06-13 23:31:16

by Mike Snitzer

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

On 6/13/07, Mike Snitzer <[email protected]> wrote:
> On 6/13/07, Mike Snitzer <[email protected]> wrote:
> > On 6/12/07, Neil Brown <[email protected]> wrote:
> ...
> > > > > On 6/12/07, Neil Brown <[email protected]> wrote:
> > > > > > On Tuesday June 12, [email protected] wrote:
> > > > > > >
> > > > > > > I can provide more detailed information; please just ask.
> > > > > > >
> > > > > >
> > > > > > A complete sysrq trace (all processes) might help.
>
> Bringing this back to a wider audience. I provided the full sysrq
> trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
> following trace:
>
> [md0_raid1 stack trace snipped; see above]
>
> To which Neil had the following to say:
>
> > > md0_raid1 is holding the lock on the array and trying to write out the
> > > superblocks for some reason, and the write isn't completing.
> > > As it is holding the locks, mdadm and /proc/mdstat are hanging.
...

> > We're using MD+NBD for disaster recovery (one local scsi device, one
> > remote via nbd). The nbd-server is not contributing to md0. The
> > nbd-server is connected to a remote machine that is running a raid1
> > remotely
>
> To take this further I've now collected a full sysrq trace of this
> hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
> md0_raid1 trace is comparable to the RHEL5 trace from above:
>
> [md0_raid1 stack trace snipped; see above]
>
> Taking a step back, here is what was done to reproduce on SLES10:
> 1) establish a raid1 mirror (md0) using one local member (sdc1) and
> one remote member (nbd0)
> 2) power off the remote machine, thereby severing nbd0's connection
> 3) perform IO to the filesystem that is on the md0 device to induce
> the MD layer to mark the nbd device as "faulty"
> 4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
> above md0_raid1 trace.
>
> To be clear, the MD superblock update hangs indefinitely on RHEL5.
> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
> faulty); and the other tasks that were blocking waiting for the MD
> lock (e.g. 'cat /proc/mdstat') then complete immediately.
>
> It should be noted that this MD+NBD configuration has worked
> flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
> RHEL4U4 distro). Steps have not been taken to try to reproduce with
> 2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
> others to suggest I do so.
>
> 2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
> yet both SLES10 and RHEL5 kernels do:
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7
>
> If not this specific NBD change, something appears to have changed
> with how NBD behaves in the face of its connection to the server
> being lost. Almost like the MD superblock update that would be
> written to nbd0 is blocking within nbd or the network layer because of
> a network timeout issue?

Just a quick update; it is really starting to look like there is
definitely an issue with the nbd kernel driver. I booted the SLES10
2.6.16.46-0.12-smp kernel with maxcpus=1 to test the theory that the
nbd SMP fix that went into 2.6.16 was in some way causing this MD/NBD
hang. But it _still_ occurs with the 4-step process I outlined above.

The nbd0 device _should_ feel an NBD_DISCONNECT because the nbd-server
is no longer running (the node it was running on was powered off)...
however the nbd-client is still connected to the kernel (meaning the
kernel didn't return an error back to userspace).
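
For what it's worth, I'm judging "still connected" with checks along
these lines (nbd-client's -c option, if I'm remembering it correctly,
reports whether a device is still in use):

  ps -C nbd-client           # the client process is still alive
  nbd-client -c /dev/nbd0    # still prints a pid, i.e. still attached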

Also, MD is still blocking waiting to write the superblock (presumably
to nbd0).

Mike

2007-06-14 21:05:55

by Bill Davidsen

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

Mike Snitzer wrote:
> On 6/13/07, Mike Snitzer <[email protected]> wrote:
>> On 6/13/07, Mike Snitzer <[email protected]> wrote:
>> > On 6/12/07, Neil Brown <[email protected]> wrote:
>> ...
>> > > > > On 6/12/07, Neil Brown <[email protected]> wrote:
>> > > > > > On Tuesday June 12, [email protected] wrote:
>> > > > > > >
>> > > > > > > I can provide more detailed information; please just ask.
>> > > > > > >
>> > > > > >
>> > > > > > A complete sysrq trace (all processes) might help.
>>
>> Bringing this back to a wider audience. I provided the full sysrq
>> trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
>> following trace:
>>
>> [md0_raid1 stack trace snipped; see above]
>>
>> To which Neil had the following to say:
>>
>> > > md0_raid1 is holding the lock on the array and trying to write out the
>> > > superblocks for some reason, and the write isn't completing.
>> > > As it is holding the locks, mdadm and /proc/mdstat are hanging.
> ...
>
>> > We're using MD+NBD for disaster recovery (one local scsi device, one
>> > remote via nbd). The nbd-server is not contributing to md0. The
>> > nbd-server is connected to a remote machine that is running a raid1
>> > remotely
>>
>> To take this further I've now collected a full sysrq trace of this
>> hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
>> md0_raid1 trace is comparable to the RHEL5 trace from above:
>>
>> [md0_raid1 stack trace snipped; see above]
>>
>> Taking a step back, here is what was done to reproduce on SLES10:
>> 1) establish a raid1 mirror (md0) using one local member (sdc1) and
>> one remote member (nbd0)
>> 2) power off the remote machine, thereby severing nbd0's connection
>> 3) perform IO to the filesystem that is on the md0 device to induce
>> the MD layer to mark the nbd device as "faulty"
>> 4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
>> above md0_raid1 trace.
>>
>> To be clear, the MD superblock update hangs indefinitely on RHEL5.
>> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
>> faulty); and the other tasks that were blocking waiting for the MD
>> lock (e.g. 'cat /proc/mdstat') then complete immediately.
>>
>> It should be noted that this MD+NBD configuration has worked
>> flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
>> RHEL4U4 distro). Steps have not been taken to try to reproduce with
>> 2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
>> others to suggest I do so.
>>
>> 2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
>> yet both SLES10 and RHEL5 kernels do:
>> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7
>>
>>
>> If not this specific NBD change, something appears to have changed
>> with how NBD behaves in the face of its connection to the server
>> being lost. Almost like the MD superblock update that would be
>> written to nbd0 is blocking within nbd or the network layer because of
>> a network timeout issue?
>
> Just a quick update; it is really starting to look like there is
> definitely an issue with the nbd kernel driver. I booted the SLES10
> 2.6.16.46-0.12-smp kernel with maxcpus=1 to test the theory that the
> nbd SMP fix that went into 2.6.16 was in some way causing this MD/NBD
> hang. But it _still_ occurs with the 4-step process I outlined above.
>
First, running an smp kernel with maxcpus=1 is not the same as running a
uni kernel, nor is the nosmp option. The code is different.

Second, AFAIK nbd hasn't worked in a while. I haven't tried it in ages,
but was told it wouldn't work with smp and I kind of lost interest. If
Neil thinks it should work in 2.6.21 or later I'll test it, since I have
a machine which wants a fresh install soon, and is both backed up and
available.
> The nbd0 device _should_ feel an NBD_DISCONNECT because the nbd-server
> is no longer running (the node it was running on was powered off)...
> however the nbd-client is still connected to the kernel (meaning the
> kernel didn't return an error back to userspace).
> Also, MD is still blocking waiting to write the superblock (presumably
> to nbd0).

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-06-14 21:57:19

by Mike Snitzer

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

On 6/14/07, Bill Davidsen <[email protected]> wrote:
> Mike Snitzer wrote:
> > On 6/13/07, Mike Snitzer <[email protected]> wrote:
> >> On 6/13/07, Mike Snitzer <[email protected]> wrote:
> >> > On 6/12/07, Neil Brown <[email protected]> wrote:
> >> ...
> >> > > > > On 6/12/07, Neil Brown <[email protected]> wrote:
> >> > > > > > On Tuesday June 12, [email protected] wrote:
> >> > > > > > >
> >> > > > > > > I can provide more detailed information; please just ask.
> >> > > > > > >
> >> > > > > >
> >> > > > > > A complete sysrq trace (all processes) might help.
> >>
> >> Bringing this back to a wider audience. I provided the full sysrq
> >> trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
> >> following trace:
> >>
> >> [md0_raid1 stack trace snipped; see above]
> >>
> >> To which Neil had the following to say:
> >>
> >> > > md0_raid1 is holding the lock on the array and trying to write out the
> >> > > superblocks for some reason, and the write isn't completing.
> >> > > As it is holding the locks, mdadm and /proc/mdstat are hanging.
> > ...
> >
> >> > We're using MD+NBD for disaster recovery (one local scsi device, one
> >> > remote via nbd). The nbd-server is not contributing to md0. The
> >> > nbd-server is connected to a remote machine that is running a raid1
> >> > remotely
> >>
> >> To take this further I've now collected a full sysrq trace of this
> >> hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
> >> md0_raid1 trace is comparable to the RHEL5 trace from above:
> >>
> >> [md0_raid1 stack trace snipped; see above]
> >>
> >> Taking a step back, here is what was done to reproduce on SLES10:
> >> 1) establish a raid1 mirror (md0) using one local member (sdc1) and
> >> one remote member (nbd0)
> >> 2) power off the remote machine, thereby severing nbd0's connection
> >> 3) perform IO to the filesystem that is on the md0 device to induce
> >> the MD layer to mark the nbd device as "faulty"
> >> 4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
> >> above md0_raid1 trace.
> >>
> >> To be clear, the MD superblock update hangs indefinitely on RHEL5.
> >> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
> >> faulty); and the other tasks that were blocking waiting for the MD
> >> lock (e.g. 'cat /proc/mdstat') then complete immediately.
> >>
> >> It should be noted that this MD+NBD configuration has worked
> >> flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
> >> RHEL4U4 distro). Steps have not been taken to try to reproduce with
> >> 2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
> >> others to suggest I do so.
> >>
> >> 2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
> >> yet both SLES10 and RHEL5 kernels do:
> >> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7
> >>
> >>
> >> If not this specific NBD change, something appears to have changed
> >> with how NBD behaves in the face of its connection to the server
> >> being lost. Almost like the MD superblock update that would be
> >> written to nbd0 is blocking within nbd or the network layer because of
> >> a network timeout issue?
> >
> > Just a quick update; it is really starting to look like there is
> > definitely an issue with the nbd kernel driver. I booted the SLES10
> > 2.6.16.46-0.12-smp kernel with maxcpus=1 to test the theory that the
> > nbd SMP fix that went into 2.6.16 was in some way causing this MD/NBD
> > hang. But it _still_ occurs with the 4-step process I outlined above.
> >
> First, running an smp kernel with maxcpus=1 is not the same as running a
> uni kernel, nor is the nosmp option. The code is different.

I tried nosmp and this dell 8-way I'm using wouldn't boot...

> Second, AFAIK nbd hasn't worked in a while. I haven't tried it in ages,
> but was told it wouldn't work with smp and I kind of lost interest. If
> Neil thinks it should work in 2.6.21 or later I'll test it, since I have
> a machine which wants a fresh install soon, and is both backed up and
> available.

I'm fairly certain that this is an nbd issue and MD is hanging as a
side-effect of nbd getting wedged. As for nbd not working on SMP,
I thought Herbert Xu fixed it in 2.6.16?

Is that to say that his fix was incomplete and/or useless?

Who is the maintainer of the nbd code in the kernel?

regards,
Mike

2007-06-15 00:41:14

by Paul Clements

[permalink] [raw]
Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

Bill Davidsen wrote:

> Second, AFAIK nbd hasn't worked in a while. I haven't tried it in ages,
> but was told it wouldn't work with smp and I kind of lost interest. If
> Neil thinks it should work in 2.6.21 or later I'll test it, since I have
> a machine which wants a fresh install soon, and is both backed up and
> available.

Please stop this. nbd is working perfectly fine, AFAIK. I use it every
day, and so do 100s of our customers. What exactly is it that's not
working? If there's a problem, please send the bug report.

Thank You,
Paul

2007-06-15 01:00:56

by Paul Clements

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

Mike Snitzer wrote:

> Just a quick update; it is really starting to look like there is
> definitely an issue with the nbd kernel driver. I booted the SLES10
> 2.6.16.46-0.12-smp kernel with maxcpus=1 to test the theory that the
> nbd SMP fix that went into 2.6.16 was in some way causing this MD/NBD
> hang. But it _still_ occurs with the 4-step process I outlined above.
>
> The nbd0 device _should_ feel an NBD_DISCONNECT because the nbd-server
> is no longer running (the node it was running on was powered off)...

What do you mean, nbd should _feel_ an NBD_DISCONNECT?

NBD_DISCONNECT is a manual process, not an automatic one.
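
The disconnect is driven from userspace, e.g. something along the lines of:

  nbd-client -d /dev/nbd0    # asks the kernel to drop the device (the NBD_DISCONNECT ioctl)

Nothing in the kernel does that for you just because the server went away.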

--
Paul

2007-06-15 01:02:08

by Mike Snitzer

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

On 6/14/07, Paul Clements <[email protected]> wrote:
> Bill Davidsen wrote:
>
> > Second, AFAIK nbd hasn't worked in a while. I haven't tried it in ages,
> > but was told it wouldn't work with smp and I kind of lost interest. If
> > Neil thinks it should work in 2.6.21 or later I'll test it, since I have
> > a machine which wants a fresh install soon, and is both backed up and
> > available.
>
> Please stop this. nbd is working perfectly fine, AFAIK. I use it every
> day, and so do 100s of our customers. What exactly is it that's not
> working? If there's a problem, please send the bug report.

Paul,

This thread details what I've experienced using MD (raid1) with 2
devices: one a local scsi device, the other an NBD device.
I've yet to put effort into pinpointing the problem in a kernel.org
kernel; however both SLES10 and RHEL5 kernels appear to be hanging in
either 1) nbd or 2) the socket layer.

Here are the steps to reproduce reliably on SLES10 SP1:
1) establish a raid1 mirror (md0) using one local member (sdc1) and
one remote member (nbd0)
2) power off the remote machine, thereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce
the MD layer to mark the nbd device as "faulty"
4) cat /proc/mdstat hangs, sysrq trace was collected

To be clear, the MD superblock update hangs indefinitely on RHEL5.
But with SLES10 it eventually succeeds after ~5min (and MD marks the
nbd0 member faulty); and the other tasks that were blocking waiting
for the MD lock (e.g. 'cat /proc/mdstat') then complete immediately.

If you look back in this thread you'll see traces for md0_raid1 for
both SLES10 and RHEL5. I hope to try to reproduce this issue on
kernel.org 2.6.16.46 (the basis for SLES10). If I can I'll then git
bisect back to try to pinpoint the regression; I obviously need to
verify that 2.6.16 works in this situation on SMP.
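
Assuming stock 2.6.16 reproduces it and 2.6.15 does not, the bisect
would look roughly like:

  git bisect start
  git bisect bad v2.6.16    # first release where the hang shows up (to be confirmed)
  git bisect good v2.6.15   # matches our working 2.6.15.7-based setup
  # build/boot the suggested revision, rerun the 4-step reproduction, then
  git bisect good           # ...or "git bisect bad"; repeat until it converges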

Mike

2007-06-15 01:05:34

by Paul Clements

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

Mike Snitzer wrote:

> Here are the steps to reproduce reliably on SLES10 SP1:
> 1) establish a raid1 mirror (md0) using one local member (sdc1) and
> one remote member (nbd0)
> 2) power off the remote machine, thereby severing nbd0's connection
> 3) perform IO to the filesystem that is on the md0 device to induce
> the MD layer to mark the nbd device as "faulty"
> 4) cat /proc/mdstat hangs, sysrq trace was collected

That's working as designed. NBD works over TCP. You're going to have to
wait for TCP to time out before an error occurs. Until then I/O will hang.

--
Paul

2007-06-15 01:10:58

by Mike Snitzer

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

On 6/14/07, Paul Clements <[email protected]> wrote:
> Mike Snitzer wrote:
>
> > Here are the steps to reproduce reliably on SLES10 SP1:
> > 1) establish a raid1 mirror (md0) using one local member (sdc1) and
> > one remote member (nbd0)
> > 2) power off the remote machine, thereby severing nbd0's connection
> > 3) perform IO to the filesystem that is on the md0 device to induce
> > the MD layer to mark the nbd device as "faulty"
> > 4) cat /proc/mdstat hangs, sysrq trace was collected
>
> That's working as designed. NBD works over TCP. You're going to have to
> wait for TCP to time out before an error occurs. Until then I/O will hang.

With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the
kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is
indefinite on RHEL5 and ~5min on SLES10.

Should/can I be playing with TCP timeout values? Why was this not a
concern with kernel.org 2.6.15.7? I was able to "feel" the nbd
connection break immediately; no MD superblock update hangs, no
long-winded (or indefinite) TCP timeout.
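
The only knob I can think of (and this is a guess on my part, not
something I've verified against nbd) is the retransmission limit for
established connections:

  sysctl net.ipv4.tcp_retries2          # default is 15 retries (~13-30 minutes)
  sysctl -w net.ipv4.tcp_retries2=8     # give up on unacked data much sooner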

regards,
Mike

2007-06-15 01:16:26

by Paul Clements

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

Mike Snitzer wrote:
> On 6/14/07, Paul Clements <[email protected]> wrote:
>> Mike Snitzer wrote:
>>
>> > Here are the steps to reproduce reliably on SLES10 SP1:
>> > 1) establish a raid1 mirror (md0) using one local member (sdc1) and
>> > one remote member (nbd0)
>> > 2) power off the remote machine, thereby severing nbd0's connection
>> > 3) perform IO to the filesystem that is on the md0 device to induce
>> > the MD layer to mark the nbd device as "faulty"
>> > 4) cat /proc/mdstat hangs, sysrq trace was collected
>>
>> That's working as designed. NBD works over TCP. You're going to have to
>> wait for TCP to time out before an error occurs. Until then I/O will
>> hang.
>
> With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the
> kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is
> indefinite on RHEL5 and ~5min on SLES10.
>
> Should/can I be playing with TCP timeout values? Why was this not a
> concern with kernel.org 2.6.15.7? I was able to "feel" the nbd
> connection break immediately; no MD superblock update hangs, no
> long-winded (or indefinite) TCP timeout.

I don't know. I've never seen nbd immediately start returning I/O
errors. Perhaps something was different about the configuration?
If the other machine rebooted quickly, for instance, you'd get a
connection reset, which would kill the nbd connection.

--
Paul

2007-06-15 01:21:22

by Mike Snitzer

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

On 6/14/07, Paul Clements <[email protected]> wrote:
> Mike Snitzer wrote:
> > On 6/14/07, Paul Clements <[email protected]> wrote:
> >> Mike Snitzer wrote:
> >>
> >> > Here are the steps to reproduce reliably on SLES10 SP1:
> >> > 1) establish a raid1 mirror (md0) using one local member (sdc1) and
> >> > one remote member (nbd0)
> >> > 2) power off the remote machine, thereby severing nbd0's connection
> >> > 3) perform IO to the filesystem that is on the md0 device to induce
> >> > the MD layer to mark the nbd device as "faulty"
> >> > 4) cat /proc/mdstat hangs, sysrq trace was collected
> >>
> >> That's working as designed. NBD works over TCP. You're going to have to
> >> wait for TCP to time out before an error occurs. Until then I/O will
> >> hang.
> >
> > With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the
> > kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is
> > indefinite on RHEL5 and ~5min on SLES10.
> >
> > Should/can I be playing with TCP timeout values? Why was this not a
> > concern with kernel.org 2.6.15.7? I was able to "feel" the nbd
> > connection break immediately; no MD superblock update hangs, no
> > long-winded (or indefinite) TCP timeout.
>
> I don't know. I've never seen nbd immediately start returning I/O
> errors. Perhaps something was different about the configuration?
> If the other machine rebooted quickly, for instance, you'd get a
> connection reset, which would kill the nbd connection.

OK, I'll retest the 2.6.15.7 setup. As for SLES10 and RHEL5, I've
been leaving the remote server powered off. As such I'm at the full
mercy of the TCP timeout. It is odd that RHEL5 has been hanging
indefinitely but I'll dig deeper on that once I come to terms with how
kernel.org and SLES10 behave.

I'll update with my findings for completeness.

Thanks for your insight!
Mike

2007-06-15 13:21:46

by Bill Davidsen

Subject: Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

Paul Clements wrote:
> Bill Davidsen wrote:
>
>> Second, AFAIK nbd hasn't worked in a while. I haven't tried it in
>> ages, but was told it wouldn't work with smp and I kind of lost
>> interest. If Neil thinks it should work in 2.6.21 or later I'll test
>> it, since I have a machine which wants a fresh install soon, and is
>> both backed up and available.
>
> Please stop this. nbd is working perfectly fine, AFAIK. I use it every
> day, and so do 100s of our customers. What exactly is it that's not
> working? If there's a problem, please send the bug report.

Could you clarify which kernel, distribution, and mdadm versions are used,
and how often the nbd server becomes unavailable to the clients? And are
your clients SMP? By "working perfectly fine," I assume you do mean
in the same way as described in the original posting, and not just with
the client, server, and network all fully functional.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979