LinuxLists.cc - Kernel 3.4.X NFS regression

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS regression

Forwarding to Linux-NFS ML

On 06/10/2012 01:56 PM, Joerg Platte wrote:

> All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
> same NFS related problem:
>
> Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
> than 120 seconds.
> Jun 10 09:23:36 coco kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
> 2 0x00000000
> Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
> 00000000 0000000a 00000282 df465e70
> Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
> df44d2b0 e33fa6b3 00000282 de764ae0
> Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
> df465ea4 c013610c 00000000 d78bcf80
> Jun 10 09:23:36 coco kernel: Call Trace:
> Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
> Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
> Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
> Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
> Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
> Jun 10 09:23:36 coco kernel: [<e0cda678>]
> nfsd4_cld_grace_done+0x60/0x99 [nfsd]
> Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
> nfsd4_record_grace_done+0x10/0x12 [nfsd]
> Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
> [nfsd]
> Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
> Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
> Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
> nfsd4_process_open1+0x32b/0x32b [nfsd]
> Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
> Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
> Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
> Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
> Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
>
> A kworker task is stuck in D state and nfs mounts from other clients do
> not work at all. This happens only on one machine, another one with the
> same kernel (same self compiled Debian package) works. All previous 3.3
> kernels work as well.
>
> Since this machine is remote it is not that easy to bisect to find the
> bad commit. Are there any other things I can try?
>
> regards,
> Joerg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2012-06-10 15:01:03

by Myklebust, Trond

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

Cc: [email protected] + bfields and changing title to label it
as a server regression since that is what the trace appears to imply.

On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote:
> All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
> same NFS related problem:
>
> Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
> than 120 seconds.
> Jun 10 09:23:36 coco kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
> 2 0x00000000
> Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
> 00000000 0000000a 00000282 df465e70
> Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
> df44d2b0 e33fa6b3 00000282 de764ae0
> Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
> df465ea4 c013610c 00000000 d78bcf80
> Jun 10 09:23:36 coco kernel: Call Trace:
> Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
> Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
> Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
> Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
> Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
> Jun 10 09:23:36 coco kernel: [<e0cda678>]
> nfsd4_cld_grace_done+0x60/0x99 [nfsd]
> Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
> nfsd4_record_grace_done+0x10/0x12 [nfsd]
> Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
> [nfsd]
> Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
> Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
> Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
> nfsd4_process_open1+0x32b/0x32b [nfsd]
> Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
> Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
> Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
> Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
> Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
>
> A kworker task is stuck in D state and nfs mounts from other clients do
> not work at all. This happens only on one machine, another one with the
> same kernel (same self compiled Debian package) works. All previous 3.3
> kernels work as well.
>
> Since this machine is remote it is not that easy to bisect to find the
> bad commit. Are there any other things I can try?
>
> regards,
> Joerg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2012-06-11 12:16:40

by J. Bruce Fields

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Sun, Jun 10, 2012 at 03:00:42PM +0000, Myklebust, Trond wrote:
> Cc: [email protected] + bfields and changing title to label it
> as a server regression since that is what the trace appears to imply.
>
> On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote:
> > All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
> > same NFS related problem:
> >
> > Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
> > than 120 seconds.
> > Jun 10 09:23:36 coco kernel: "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
> > 2 0x00000000
> > Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
> > 00000000 0000000a 00000282 df465e70
> > Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
> > df44d2b0 e33fa6b3 00000282 de764ae0
> > Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
> > df465ea4 c013610c 00000000 d78bcf80
> > Jun 10 09:23:36 coco kernel: Call Trace:
> > Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
> > Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
> > Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
> > Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
> > Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
> > Jun 10 09:23:36 coco kernel: [<e0cda678>]
> > nfsd4_cld_grace_done+0x60/0x99 [nfsd]
> > Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
> > nfsd4_record_grace_done+0x10/0x12 [nfsd]
> > Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
> > [nfsd]
> > Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
> > Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
> > Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
> > nfsd4_process_open1+0x32b/0x32b [nfsd]
> > Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
> > Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
> > Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
> > Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
> > Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
> >
> > A kworker task is stuck in D state and nfs mounts from other clients do
> > not work at all. This happens only on one machine, another one with the
> > same kernel (same self compiled Debian package) works. All previous 3.3
> > kernels work as well.
> >
> > Since this machine is remote it is not that easy to bisect to find the
> > bad commit. Are there any other things I can try?

If you create a directory named /var/lib/nfs/v4recovery/, does the
problem go away?

My guess would be that it's trying to upcall to the new reboot-recovery
state daemon, and you don't have that running.

Before the addition of that upcall state was kept in
/var/lib/nfs/v4recovery. So we decide whether to use the old method or
the new one by checking for the existance of that path.

But I'm guessing we were wrong to assume that existing setups that
people perceived as working would have that path, because the failures
in the absence of that path were probably less obvious.

--b.

2012-06-11 12:39:50

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 08:16:34 -0400
bfields <[email protected]> wrote:

> On Sun, Jun 10, 2012 at 03:00:42PM +0000, Myklebust, Trond wrote:
> > Cc: [email protected] + bfields and changing title to label it
> > as a server regression since that is what the trace appears to imply.
> >
> > On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote:
> > > All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
> > > same NFS related problem:
> > >
> > > Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
> > > than 120 seconds.
> > > Jun 10 09:23:36 coco kernel: "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
> > > 2 0x00000000
> > > Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
> > > 00000000 0000000a 00000282 df465e70
> > > Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
> > > df44d2b0 e33fa6b3 00000282 de764ae0
> > > Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
> > > df465ea4 c013610c 00000000 d78bcf80
> > > Jun 10 09:23:36 coco kernel: Call Trace:
> > > Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
> > > Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
> > > Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
> > > Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
> > > Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
> > > Jun 10 09:23:36 coco kernel: [<e0cda678>]
> > > nfsd4_cld_grace_done+0x60/0x99 [nfsd]
> > > Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
> > > nfsd4_record_grace_done+0x10/0x12 [nfsd]
> > > Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
> > > [nfsd]
> > > Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
> > > Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
> > > Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
> > > nfsd4_process_open1+0x32b/0x32b [nfsd]
> > > Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
> > > Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
> > > Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
> > > Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
> > > Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
> > >
> > > A kworker task is stuck in D state and nfs mounts from other clients do
> > > not work at all. This happens only on one machine, another one with the
> > > same kernel (same self compiled Debian package) works. All previous 3.3
> > > kernels work as well.
> > >
> > > Since this machine is remote it is not that easy to bisect to find the
> > > bad commit. Are there any other things I can try?
>
> If you create a directory named /var/lib/nfs/v4recovery/, does the
> problem go away?
>
> My guess would be that it's trying to upcall to the new reboot-recovery
> state daemon, and you don't have that running.
>
> Before the addition of that upcall state was kept in
> /var/lib/nfs/v4recovery. So we decide whether to use the old method or
> the new one by checking for the existance of that path.
>
> But I'm guessing we were wrong to assume that existing setups that
> people perceived as working would have that path, because the failures
> in the absence of that path were probably less obvious.
>
> --b.

This sounds like the same problem that Hans reported as well. I've not
been able to reproduce that so far. Here's what I get when I start nfsd
with no v4recoverdir and nfsdcld isn't running:

[ 109.715080] NFSD: starting 90-second grace period
[ 229.984220] NFSD: Unable to end grace period: -110

What I don't quite understand is why the queue_timeout job isn't
getting run here. What should happen is that 30s after upcall,
rpc_timeout_upcall_queue should run. The message will be found to be
still sitting on the , so it should set its status to -ETIMEDOUT
and wake up the caller.

I can only assume that the queue_timeout job isn't getting run for some
reason, but I'm unclear on why that would be.

--
Jeff Layton <[email protected]>

2012-06-11 13:13:52

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 08:39:32 -0400
Jeff Layton <[email protected]> wrote:

> On Mon, 11 Jun 2012 08:16:34 -0400
> bfields <[email protected]> wrote:
>
> > On Sun, Jun 10, 2012 at 03:00:42PM +0000, Myklebust, Trond wrote:
> > > Cc: [email protected] + bfields and changing title to label it
> > > as a server regression since that is what the trace appears to imply.
> > >
> > > On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote:
> > > > All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
> > > > same NFS related problem:
> > > >
> > > > Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
> > > > than 120 seconds.
> > > > Jun 10 09:23:36 coco kernel: "echo 0 >
> > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
> > > > 2 0x00000000
> > > > Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
> > > > 00000000 0000000a 00000282 df465e70
> > > > Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
> > > > df44d2b0 e33fa6b3 00000282 de764ae0
> > > > Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
> > > > df465ea4 c013610c 00000000 d78bcf80
> > > > Jun 10 09:23:36 coco kernel: Call Trace:
> > > > Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
> > > > Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
> > > > Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
> > > > Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
> > > > Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
> > > > Jun 10 09:23:36 coco kernel: [<e0cda678>]
> > > > nfsd4_cld_grace_done+0x60/0x99 [nfsd]
> > > > Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
> > > > nfsd4_record_grace_done+0x10/0x12 [nfsd]
> > > > Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
> > > > [nfsd]
> > > > Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
> > > > Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
> > > > Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
> > > > nfsd4_process_open1+0x32b/0x32b [nfsd]
> > > > Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
> > > > Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
> > > > Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
> > > > Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
> > > > Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
> > > >
> > > > A kworker task is stuck in D state and nfs mounts from other clients do
> > > > not work at all. This happens only on one machine, another one with the
> > > > same kernel (same self compiled Debian package) works. All previous 3.3
> > > > kernels work as well.
> > > >
> > > > Since this machine is remote it is not that easy to bisect to find the
> > > > bad commit. Are there any other things I can try?
> >
> > If you create a directory named /var/lib/nfs/v4recovery/, does the
> > problem go away?
> >
> > My guess would be that it's trying to upcall to the new reboot-recovery
> > state daemon, and you don't have that running.
> >
> > Before the addition of that upcall state was kept in
> > /var/lib/nfs/v4recovery. So we decide whether to use the old method or
> > the new one by checking for the existance of that path.
> >
> > But I'm guessing we were wrong to assume that existing setups that
> > people perceived as working would have that path, because the failures
> > in the absence of that path were probably less obvious.
> >
> > --b.
>
> This sounds like the same problem that Hans reported as well. I've not
> been able to reproduce that so far. Here's what I get when I start nfsd
> with no v4recoverdir and nfsdcld isn't running:
>
> [ 109.715080] NFSD: starting 90-second grace period
> [ 229.984220] NFSD: Unable to end grace period: -110
>
> What I don't quite understand is why the queue_timeout job isn't
> getting run here. What should happen is that 30s after upcall,
> rpc_timeout_upcall_queue should run. The message will be found to be
> still sitting on the , so it should set its status to -ETIMEDOUT
> and wake up the caller.
>
> I can only assume that the queue_timeout job isn't getting run for some
> reason, but I'm unclear on why that would be.
>

Ahh, I think I see the bug. From rpc_timeout_upcall_queue:

-----------------------[snip]--------------------------
dentry = dget(pipe->dentry);
spin_unlock(&pipe->lock);
if (dentry) {
rpc_purge_list(&RPC_I(dentry->d_inode)->waitq,
&free_list, destroy_msg, -ETIMEDOUT);
dput(dentry);
}
-----------------------[snip]--------------------------

...when there is no dentry (as there wouldn't be when rpc_pipefs isn't
mounted), then the rpc_purge_list won't run. FWIW, you'd probably see
similar problems if you attempted a sec=krb5 mount without having
rpc_pipefs mounted.

I'm still looking at the code to see what the right fix is. For now,
making sure you have a /var/lib/nfs/v4recoverydir is probably the
easiest workaround.

--
Jeff Layton <[email protected]>

2012-06-11 13:25:59

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

Am 11.06.2012 um 15:13 schrieb Jeff Layton <[email protected]>:

>
> Ahh, I think I see the bug. From rpc_timeout_upcall_queue:
>
> -----------------------[snip]--------------------------
> dentry = dget(pipe->dentry);
> spin_unlock(&pipe->lock);
> if (dentry) {
> rpc_purge_list(&RPC_I(dentry->d_inode)->waitq,
> &free_list, destroy_msg, -ETIMEDOUT);
> dput(dentry);
> }
> -----------------------[snip]--------------------------
>
> ...when there is no dentry (as there wouldn't be when rpc_pipefs isn't
> mounted), then the rpc_purge_list won't run. FWIW, you'd probably see
> similar problems if you attempted a sec=krb5 mount without having
> rpc_pipefs mounted.
>
> I'm still looking at the code to see what the right fix is. For now,
> making sure you have a /var/lib/nfs/v4recoverydir is probably the
> easiest workaround.
>
> --
> Jeff Layton <[email protected]>

On my computer that shows the bug there is indeed no v4recovery directory. There is one on the other machine that does not show this bug. I'll test the latest kernel with this directory created today afternoon.

Thank you for the analysis!

Jörg -

2012-06-11 13:32:42

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 03:39 PM, Jeff Layton wrote:

> On Mon, 11 Jun 2012 08:16:34 -0400
> bfields <[email protected]> wrote:
>
>> On Sun, Jun 10, 2012 at 03:00:42PM +0000, Myklebust, Trond wrote:
>>> Cc: [email protected] + bfields and changing title to label it
>>> as a server regression since that is what the trace appears to imply.
>>>
>>> On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote:
>>>> All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
>>>> same NFS related problem:
>>>>
>>>> Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
>>>> than 120 seconds.
>>>> Jun 10 09:23:36 coco kernel: "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
>>>> 2 0x00000000
>>>> Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
>>>> 00000000 0000000a 00000282 df465e70
>>>> Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
>>>> df44d2b0 e33fa6b3 00000282 de764ae0
>>>> Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
>>>> df465ea4 c013610c 00000000 d78bcf80
>>>> Jun 10 09:23:36 coco kernel: Call Trace:
>>>> Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
>>>> Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
>>>> Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
>>>> Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
>>>> Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
>>>> Jun 10 09:23:36 coco kernel: [<e0cda678>]
>>>> nfsd4_cld_grace_done+0x60/0x99 [nfsd]
>>>> Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
>>>> nfsd4_record_grace_done+0x10/0x12 [nfsd]
>>>> Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
>>>> [nfsd]
>>>> Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
>>>> Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
>>>> Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
>>>> nfsd4_process_open1+0x32b/0x32b [nfsd]
>>>> Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
>>>> Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
>>>> Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
>>>> Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
>>>> Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
>>>>
>>>> A kworker task is stuck in D state and nfs mounts from other clients do
>>>> not work at all. This happens only on one machine, another one with the
>>>> same kernel (same self compiled Debian package) works. All previous 3.3
>>>> kernels work as well.
>>>>
>>>> Since this machine is remote it is not that easy to bisect to find the
>>>> bad commit. Are there any other things I can try?
>>
>> If you create a directory named /var/lib/nfs/v4recovery/, does the
>> problem go away?
>>
>> My guess would be that it's trying to upcall to the new reboot-recovery
>> state daemon, and you don't have that running.
>>
>> Before the addition of that upcall state was kept in
>> /var/lib/nfs/v4recovery. So we decide whether to use the old method or
>> the new one by checking for the existance of that path.
>>
>> But I'm guessing we were wrong to assume that existing setups that
>> people perceived as working would have that path, because the failures
>> in the absence of that path were probably less obvious.
>>
>> --b.
>
> This sounds like the same problem that Hans reported as well. I've not
> been able to reproduce that so far. Here's what I get when I start nfsd
> with no v4recoverdir and nfsdcld isn't running:
>
> [ 109.715080] NFSD: starting 90-second grace period
> [ 229.984220] NFSD: Unable to end grace period: -110
>
> What I don't quite understand is why the queue_timeout job isn't
> getting run here. What should happen is that 30s after upcall,
> rpc_timeout_upcall_queue should run. The message will be found to be
> still sitting on the , so it should set its status to -ETIMEDOUT
> and wake up the caller.
>
> I can only assume that the queue_timeout job isn't getting run for some
> reason, but I'm unclear on why that would be.
>

Regression fixing aside. I would consider changing the all mechanism to
a call_usermodehelper mechanism. Not only it cuts the in-kernel code
to 1/3, it also cuts user-mode code to 1/3. And specially it relives you
of any special daemon setup dependency. All you do is run an app/script
that does what it does when it does it, directly without anyone waiting
and/or any kind of handshake.

It is easy to pass any kind of parameters from Kernel to user-mode. Passing
info from user-mode to Kernel is also easy by setting up a sysfs connection
point.

And most important there are no timeouts in the new-kernel vs old user-mode.
If the script/app does not exists the call_usermodehelper returns immediately
and the old behavior can be used.

And lastly if persistent performance is an issue in the steady state. (since
calling call_usermodehelper in the hot path can be slow at times) Then I
would consider that the init script ran at startup via call_usermodehelper
then sets up a faster communication channel like a udev even and/or some
other event mechanism. In any way the old dual local-RPC channel has proved
to be a pain in the ass.

(BTW: if you attempt it you will see that so many lines of code where eliminated
you might consider it for a Regression fixing to @stable)

Thank to Steve Dickson who suggested this wonderful idea when I had the same
exact problem. I'm just repeating his suggestion, and in light of the experience
of implementing both methods.

Just my $0.017
Boaz

2012-06-11 13:44:40

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 04:32 PM, Boaz Harrosh wrote:

> On 06/11/2012 03:39 PM, Jeff Layton wrote:
>
>>>
>>> But I'm guessing we were wrong to assume that existing setups that
>>> people perceived as working would have that path, because the failures
>>> in the absence of that path were probably less obvious.
>>>

One more thing, the most important one. We have already fixed that in the
past and I was hoping the lesson was learned. Apparently it was not, and
we are doomed to do this mistake for ever!!

What ever crap fails times out and crashes, in the recovery code, we don't
give a dam. It should never affect any Server-client communication.

When the grace periods ends the clients gates opens period. *Any* error
return from state recovery code must be carefully ignored and normal
operations resumed. At most on error, we move into a mode where any
recovery request from client is accepted, since we don't have any better
data to verify it.

Please comb recovery code to make sure any catastrophe is safely ignored.
We already did that before and it used to work.

>>> --b.

Thanks
Boaz

2012-06-11 13:51:51

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 16:32:10 +0300
Boaz Harrosh <[email protected]> wrote:

> On 06/11/2012 03:39 PM, Jeff Layton wrote:
>
> > On Mon, 11 Jun 2012 08:16:34 -0400
> > bfields <[email protected]> wrote:
> >
> >> On Sun, Jun 10, 2012 at 03:00:42PM +0000, Myklebust, Trond wrote:
> >>> Cc: [email protected] + bfields and changing title to label it
> >>> as a server regression since that is what the trace appears to imply.
> >>>
> >>> On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote:
> >>>> All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the
> >>>> same NFS related problem:
> >>>>
> >>>> Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more
> >>>> than 120 seconds.
> >>>> Jun 10 09:23:36 coco kernel: "echo 0 >
> >>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>> Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8
> >>>> 2 0x00000000
> >>>> Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c
> >>>> 00000000 0000000a 00000282 df465e70
> >>>> Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84
> >>>> df44d2b0 e33fa6b3 00000282 de764ae0
> >>>> Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6
> >>>> df465ea4 c013610c 00000000 d78bcf80
> >>>> Jun 10 09:23:36 coco kernel: Call Trace:
> >>>> Jun 10 09:23:36 coco kernel: [<c012e0f6>] ? add_timer+0x11/0x17
> >>>> Jun 10 09:23:36 coco kernel: [<c013610c>] ? queue_delayed_work_on+0x74/0xf0
> >>>> Jun 10 09:23:36 coco kernel: [<c0136ba4>] ? queue_delayed_work+0x1b/0x28
> >>>> Jun 10 09:23:36 coco kernel: [<c0350f5b>] schedule+0x1d/0x4c
> >>>> Jun 10 09:23:36 coco kernel: [<e0cda5f1>] cld_pipe_upcall+0x4e/0x75 [nfsd]
> >>>> Jun 10 09:23:36 coco kernel: [<e0cda678>]
> >>>> nfsd4_cld_grace_done+0x60/0x99 [nfsd]
> >>>> Jun 10 09:23:36 coco kernel: [<e0cd9cb5>]
> >>>> nfsd4_record_grace_done+0x10/0x12 [nfsd]
> >>>> Jun 10 09:23:36 coco kernel: [<e0cd6696>] laundromat_main+0x291/0x2d8
> >>>> [nfsd]
> >>>> Jun 10 09:23:36 coco kernel: [<c0136d2f>] process_one_work+0xff/0x325
> >>>> Jun 10 09:23:36 coco kernel: [<c0134bec>] ? start_worker+0x20/0x23
> >>>> Jun 10 09:23:36 coco kernel: [<e0cd6405>] ?
> >>>> nfsd4_process_open1+0x32b/0x32b [nfsd]
> >>>> Jun 10 09:23:36 coco kernel: [<c013727a>] worker_thread+0xf4/0x39a
> >>>> Jun 10 09:23:36 coco kernel: [<c0137186>] ? rescuer_thread+0x231/0x231
> >>>> Jun 10 09:23:36 coco kernel: [<c013a556>] kthread+0x6c/0x6e
> >>>> Jun 10 09:23:36 coco kernel: [<c013a4ea>] ? kthreadd+0xe8/0xe8
> >>>> Jun 10 09:23:36 coco kernel: [<c035263e>] kernel_thread_helper+0x6/0xd
> >>>>
> >>>> A kworker task is stuck in D state and nfs mounts from other clients do
> >>>> not work at all. This happens only on one machine, another one with the
> >>>> same kernel (same self compiled Debian package) works. All previous 3.3
> >>>> kernels work as well.
> >>>>
> >>>> Since this machine is remote it is not that easy to bisect to find the
> >>>> bad commit. Are there any other things I can try?
> >>
> >> If you create a directory named /var/lib/nfs/v4recovery/, does the
> >> problem go away?
> >>
> >> My guess would be that it's trying to upcall to the new reboot-recovery
> >> state daemon, and you don't have that running.
> >>
> >> Before the addition of that upcall state was kept in
> >> /var/lib/nfs/v4recovery. So we decide whether to use the old method or
> >> the new one by checking for the existance of that path.
> >>
> >> But I'm guessing we were wrong to assume that existing setups that
> >> people perceived as working would have that path, because the failures
> >> in the absence of that path were probably less obvious.
> >>
> >> --b.
> >
> > This sounds like the same problem that Hans reported as well. I've not
> > been able to reproduce that so far. Here's what I get when I start nfsd
> > with no v4recoverdir and nfsdcld isn't running:
> >
> > [ 109.715080] NFSD: starting 90-second grace period
> > [ 229.984220] NFSD: Unable to end grace period: -110
> >
> > What I don't quite understand is why the queue_timeout job isn't
> > getting run here. What should happen is that 30s after upcall,
> > rpc_timeout_upcall_queue should run. The message will be found to be
> > still sitting on the , so it should set its status to -ETIMEDOUT
> > and wake up the caller.
> >
> > I can only assume that the queue_timeout job isn't getting run for some
> > reason, but I'm unclear on why that would be.
> >
>
>
> Regression fixing aside. I would consider changing the all mechanism to
> a call_usermodehelper mechanism. Not only it cuts the in-kernel code
> to 1/3, it also cuts user-mode code to 1/3. And specially it relives you
> of any special daemon setup dependency. All you do is run an app/script
> that does what it does when it does it, directly without anyone waiting
> and/or any kind of handshake.
>

That was considered here, but the problem with the usermode helper is
that you can't pass anything back to the kernel but a simple status
code (and that's assuming that you wait for it to exit). In the near
future, we'll need to pass back more info to the kernel for this, so
the usermode helper callout wasn't suitable.

--
Jeff Layton <[email protected]>

2012-06-11 14:03:54

[permalink] [raw]

Subject: [PATCH] rpc_pipefs: allow rpc_purge_list to take a NULL waitq pointer

In the event that we don't have a dentry for a rpc_pipefs pipe, we still
need to allow the queue_timeout job to clean out the queue. There's just
no waitq to wake up in that event.

Cc: [email protected]
Reported-by: Hans de Bruin <[email protected]>
Reported-by: Joerg Platte <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
net/sunrpc/rpc_pipe.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 0404047..21fde99 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -71,7 +71,9 @@ static void rpc_purge_list(wait_queue_head_t *waitq, struct list_head *head,
msg->errno = err;
destroy_msg(msg);
} while (!list_empty(head));
- wake_up(waitq);
+
+ if (waitq)
+ wake_up(waitq);
}

static void
@@ -91,11 +93,9 @@ rpc_timeout_upcall_queue(struct work_struct *work)
}
dentry = dget(pipe->dentry);
spin_unlock(&pipe->lock);
- if (dentry) {
- rpc_purge_list(&RPC_I(dentry->d_inode)->waitq,
- &free_list, destroy_msg, -ETIMEDOUT);
- dput(dentry);
- }
+ rpc_purge_list(dentry ? &RPC_I(dentry->d_inode)->waitq : NULL,
+ &free_list, destroy_msg, -ETIMEDOUT);
+ dput(dentry);
}

ssize_t rpc_pipe_generic_upcall(struct file *filp, struct rpc_pipe_msg *msg,
--
1.7.7.6

2012-06-11 14:05:54

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 04:51 PM, Jeff Layton wrote:

>
> That was considered here, but the problem with the usermode helper is
> that you can't pass anything back to the kernel but a simple status
> code (and that's assuming that you wait for it to exit). In the near
> future, we'll need to pass back more info to the kernel for this, so
> the usermode helper callout wasn't suitable.
>

I have answered that in my mail. Repeated here again. Well you made
a simple mistake. Because it is *easy* to pass back any number and
size of information from user-mode.

You just setup a sysfs entry points where the answers are written
back to. It's an easy trick to setup a thread safe, way with a
cookie but 90% of the time you don't have to. Say you set up
a structure of per-client (identified uniquely) then user mode
answers back per client, concurrency will not do any harm, since
you answer to the same question the same answer. ans so on. Each
problem it's own.

If you want we can talk about this, it would be easy for me to setup
a toll free conference number we can all use.

Cheers
Boaz

2012-06-11 14:11:59

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 17:05:28 +0300
Boaz Harrosh <[email protected]> wrote:

> On 06/11/2012 04:51 PM, Jeff Layton wrote:
>
> >
> > That was considered here, but the problem with the usermode helper is
> > that you can't pass anything back to the kernel but a simple status
> > code (and that's assuming that you wait for it to exit). In the near
> > future, we'll need to pass back more info to the kernel for this, so
> > the usermode helper callout wasn't suitable.
> >
>
>
> I have answered that in my mail. Repeated here again. Well you made
> a simple mistake. Because it is *easy* to pass back any number and
> size of information from user-mode.
>
> You just setup a sysfs entry points where the answers are written
> back to. It's an easy trick to setup a thread safe, way with a
> cookie but 90% of the time you don't have to. Say you set up
> a structure of per-client (identified uniquely) then user mode
> answers back per client, concurrency will not do any harm, since
> you answer to the same question the same answer. ans so on. Each
> problem it's own.
>
> If you want we can talk about this, it would be easy for me to setup
> a toll free conference number we can all use.

That helpful advice would have been welcome about 3-4 months ago when I
first proposed this in detail. At that point you're working with
multiple upcall/downcall mechanisms, which was something I was keen to
avoid.

I'm not opposed to moving in that direction, but it basically means
you're going to rip out everything I've got here so far and replace it.

If you're willing to do that work, I'll be happy to work with you on
it, but I don't have the time or inclination to do that on my own right
now.

--
Jeff Layton <[email protected]>

2012-06-11 14:20:56

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 15:25:55 +0200
Jörg Platte <[email protected]> wrote:

> Am 11.06.2012 um 15:13 schrieb Jeff Layton <[email protected]>:
>
> >
> > Ahh, I think I see the bug. From rpc_timeout_upcall_queue:
> >
> > -----------------------[snip]--------------------------
> > dentry = dget(pipe->dentry);
> > spin_unlock(&pipe->lock);
> > if (dentry) {
> > rpc_purge_list(&RPC_I(dentry->d_inode)->waitq,
> > &free_list, destroy_msg, -ETIMEDOUT);
> > dput(dentry);
> > }
> > -----------------------[snip]--------------------------
> >
> > ...when there is no dentry (as there wouldn't be when rpc_pipefs isn't
> > mounted), then the rpc_purge_list won't run. FWIW, you'd probably see
> > similar problems if you attempted a sec=krb5 mount without having
> > rpc_pipefs mounted.
> >
> > I'm still looking at the code to see what the right fix is. For now,
> > making sure you have a /var/lib/nfs/v4recoverydir is probably the
> > easiest workaround.
> >
> > --
> > Jeff Layton <[email protected]>
>
> On my computer that shows the bug there is indeed no v4recovery directory. There is one on the other machine that does not show this bug. I'll test the latest kernel with this directory created today afternoon.
>
> Thank you for the analysis!
>

No problem. One possibility for the difference is that rpc_pipefs is
mounted on the machine that doesn't show the problem, but isn't mouted
on the machine that does.

--
Jeff Layton <[email protected]>

2012-06-11 14:30:14

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 16:44:09 +0300
Boaz Harrosh <[email protected]> wrote:

> On 06/11/2012 04:32 PM, Boaz Harrosh wrote:
>
> > On 06/11/2012 03:39 PM, Jeff Layton wrote:
> >
> >>>
> >>> But I'm guessing we were wrong to assume that existing setups that
> >>> people perceived as working would have that path, because the failures
> >>> in the absence of that path were probably less obvious.
> >>>
>
>
> One more thing, the most important one. We have already fixed that in the
> past and I was hoping the lesson was learned. Apparently it was not, and
> we are doomed to do this mistake for ever!!
>
> What ever crap fails times out and crashes, in the recovery code, we don't
> give a dam. It should never affect any Server-client communication.
>
> When the grace periods ends the clients gates opens period. *Any* error
> return from state recovery code must be carefully ignored and normal
> operations resumed. At most on error, we move into a mode where any
> recovery request from client is accepted, since we don't have any better
> data to verify it.
>
> Please comb recovery code to make sure any catastrophe is safely ignored.
> We already did that before and it used to work.
>

That's not the case, and hasn't ever been AFAICT. The code has changed
a bit recently, but the existing behavior in this regard was preserved.
>From nfs4_check_open_reclaim:

return nfsd4_client_record_check(clp) ? nfserr_reclaim_bad : nfs_ok;

...if there is no client record, then the reclaim request fails. Doesn't
the RFC mandate that?

--
Jeff Layton <[email protected]>

2012-06-11 14:45:53

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 05:11 PM, Jeff Layton wrote:

> On Mon, 11 Jun 2012 17:05:28 +0300
> Boaz Harrosh <[email protected]> wrote:
>
>> On 06/11/2012 04:51 PM, Jeff Layton wrote:
>>
>>>
>>> That was considered here, but the problem with the usermode helper is
>>> that you can't pass anything back to the kernel but a simple status
>>> code (and that's assuming that you wait for it to exit). In the near
>>> future, we'll need to pass back more info to the kernel for this, so
>>> the usermode helper callout wasn't suitable.
>>>
>>
>>
>> I have answered that in my mail. Repeated here again. Well you made
>> a simple mistake. Because it is *easy* to pass back any number and
>> size of information from user-mode.
>>
>> You just setup a sysfs entry points where the answers are written
>> back to. It's an easy trick to setup a thread safe, way with a
>> cookie but 90% of the time you don't have to. Say you set up
>> a structure of per-client (identified uniquely) then user mode
>> answers back per client, concurrency will not do any harm, since
>> you answer to the same question the same answer. ans so on. Each
>> problem it's own.
>>
>> If you want we can talk about this, it would be easy for me to setup
>> a toll free conference number we can all use.
>
> That helpful advice would have been welcome about 3-4 months ago when I
> first proposed this in detail. At that point you're working with
> multiple upcall/downcall mechanisms, which was something I was keen to
> avoid.
>
> I'm not opposed to moving in that direction, but it basically means
> you're going to rip out everything I've got here so far and replace it.
>
> If you're willing to do that work, I'll be happy to work with you on
> it, but I don't have the time or inclination to do that on my own right
> now.
>

No such luck. sorry. I wish I could, but coming from a competing server
company, you can imagine the priority of that ever happening.
(Even though I use the Linux-Server everyday for my development and
am putting lots of efforts into still, mainly in pnfs)

Hopefully re-examining the code, it could all be salvaged just the
same, only lots of code thrown a way.

But mean-while please address my concern below:
Boaz Harrosh wrote:

> One more thing, the most important one. We have already fixed that in the
> past and I was hoping the lesson was learned. Apparently it was not, and
> we are doomed to do this mistake for ever!!
>
> What ever crap fails times out and crashes, in the recovery code, we don't
> give a dam. It should never affect any Server-client communication.
>
> When the grace periods ends the clients gates opens period. *Any* error
> return from state recovery code must be carefully ignored and normal
> operations resumed. At most on error, we move into a mode where any
> recovery request from client is accepted, since we don't have any better
> data to verify it.
>
> Please comb recovery code to make sure any catastrophe is safely ignored.
> We already did that before and it used to work.

We should make sure that any state recovery code does not interfere with
regular operations. and fails gracefully / shuts up.

We used to have that, apparently it re-broke. Clients should always be granted
access, after grace period. And Server should be made sure not to fail in any
situation.

I would look into it but I'm not uptodate anymore, I wish you or Bruce could.

Thanks for your work so far, sorry to be bearer of bad news
Boaz

2012-06-11 14:55:33

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, 11 Jun 2012 17:45:06 +0300
Boaz Harrosh <[email protected]> wrote:

> On 06/11/2012 05:11 PM, Jeff Layton wrote:
>
> > On Mon, 11 Jun 2012 17:05:28 +0300
> > Boaz Harrosh <[email protected]> wrote:
> >
> >> On 06/11/2012 04:51 PM, Jeff Layton wrote:
> >>
> >>>
> >>> That was considered here, but the problem with the usermode helper is
> >>> that you can't pass anything back to the kernel but a simple status
> >>> code (and that's assuming that you wait for it to exit). In the near
> >>> future, we'll need to pass back more info to the kernel for this, so
> >>> the usermode helper callout wasn't suitable.
> >>>
> >>
> >>
> >> I have answered that in my mail. Repeated here again. Well you made
> >> a simple mistake. Because it is *easy* to pass back any number and
> >> size of information from user-mode.
> >>
> >> You just setup a sysfs entry points where the answers are written
> >> back to. It's an easy trick to setup a thread safe, way with a
> >> cookie but 90% of the time you don't have to. Say you set up
> >> a structure of per-client (identified uniquely) then user mode
> >> answers back per client, concurrency will not do any harm, since
> >> you answer to the same question the same answer. ans so on. Each
> >> problem it's own.
> >>
> >> If you want we can talk about this, it would be easy for me to setup
> >> a toll free conference number we can all use.
> >
> > That helpful advice would have been welcome about 3-4 months ago when I
> > first proposed this in detail. At that point you're working with
> > multiple upcall/downcall mechanisms, which was something I was keen to
> > avoid.
> >
> > I'm not opposed to moving in that direction, but it basically means
> > you're going to rip out everything I've got here so far and replace it.
> >
> > If you're willing to do that work, I'll be happy to work with you on
> > it, but I don't have the time or inclination to do that on my own right
> > now.
> >
>
>
> No such luck. sorry. I wish I could, but coming from a competing server
> company, you can imagine the priority of that ever happening.
> (Even though I use the Linux-Server everyday for my development and
> am putting lots of efforts into still, mainly in pnfs)
>
> Hopefully re-examining the code, it could all be salvaged just the
> same, only lots of code thrown a way.
>
> But mean-while please address my concern below:
> Boaz Harrosh wrote:
>
> > One more thing, the most important one. We have already fixed that in the
> > past and I was hoping the lesson was learned. Apparently it was not, and
> > we are doomed to do this mistake for ever!!
> >
> > What ever crap fails times out and crashes, in the recovery code, we don't
> > give a dam. It should never affect any Server-client communication.
> >
> > When the grace periods ends the clients gates opens period. *Any* error
> > return from state recovery code must be carefully ignored and normal
> > operations resumed. At most on error, we move into a mode where any
> > recovery request from client is accepted, since we don't have any better
> > data to verify it.
> >
> > Please comb recovery code to make sure any catastrophe is safely ignored.
> > We already did that before and it used to work.
>
>
> We should make sure that any state recovery code does not interfere with
> regular operations. and fails gracefully / shuts up.
>
> We used to have that, apparently it re-broke. Clients should always be granted
> access, after grace period. And Server should be made sure not to fail in any
> situation.
>
> I would look into it but I'm not uptodate anymore, I wish you or Bruce could.
>
> Thanks for your work so far, sorry to be bearer of bad news
> Boaz

This problem turned out to be a fairly straightforward bug in the
rpc_pipefs queue timeout mechanism that was causing the laundromat job
to hang and hence to keep the state lock locked. I just sent a patch
that should fix it.

I guess I'm not clear on what you're saying is broken. Modulo the
original bug here, clients are allowed access after the grace period
whether the upcalls are working or not.

What we cannot allow is reclaim requests outside of the grace period,
since we can't verify whether there was conflicting state in the
interim period. That's true whether the server has a functioning client
tracking mechanism or not.

--
Jeff Layton <[email protected]>

2012-06-11 15:02:08

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 05:29 PM, Jeff Layton wrote:

> On Mon, 11 Jun 2012 16:44:09 +0300
> Boaz Harrosh <[email protected]> wrote:
>
>> On 06/11/2012 04:32 PM, Boaz Harrosh wrote:
>>
>>> On 06/11/2012 03:39 PM, Jeff Layton wrote:
>>>
>>>>>
>>>>> But I'm guessing we were wrong to assume that existing setups that
>>>>> people perceived as working would have that path, because the failures
>>>>> in the absence of that path were probably less obvious.
>>>>>
>>
>>
>> One more thing, the most important one. We have already fixed that in the
>> past and I was hoping the lesson was learned. Apparently it was not, and
>> we are doomed to do this mistake for ever!!
>>
>> What ever crap fails times out and crashes, in the recovery code, we don't
>> give a dam. It should never affect any Server-client communication.
>>
>> When the grace periods ends the clients gates opens period. *Any* error
>> return from state recovery code must be carefully ignored and normal
>> operations resumed. At most on error, we move into a mode where any
>> recovery request from client is accepted, since we don't have any better
>> data to verify it.
>>
>> Please comb recovery code to make sure any catastrophe is safely ignored.
>> We already did that before and it used to work.
>>
>
> That's not the case, and hasn't ever been AFAICT. The code has changed
> a bit recently, but the existing behavior in this regard was preserved.
> From nfs4_check_open_reclaim:
>
> return nfsd4_client_record_check(clp) ? nfserr_reclaim_bad : nfs_ok;
>
> ...if there is no client record, then the reclaim request fails. Doesn't
> the RFC mandate that?
>

Regardless of what RFC mandates and what is returned to client, (Which sounds
very unrobust to me) I'm sure the client handles nfserr_reclaim_bad just
fine.

It's the server that's getting stuck in its own feet and stops responding.
That's what I meant. We should always resume normal operations after
the grace period ends.

I did not see any reports of client getting into trouble because of
unexpected nfserr_reclaim_bad, did you?

Thanks
Boaz

2012-06-11 15:05:00

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 05:55 PM, Jeff Layton wrote:

> On Mon, 11 Jun 2012 17:45:06 +0300
> Boaz Harrosh <[email protected]> wrote:
>
>> On 06/11/2012 05:11 PM, Jeff Layton wrote:
>>
>>> On Mon, 11 Jun 2012 17:05:28 +0300
>>> Boaz Harrosh <[email protected]> wrote:
>>>
>>>> On 06/11/2012 04:51 PM, Jeff Layton wrote:
>>>>
>>>>>
>>>>> That was considered here, but the problem with the usermode helper is
>>>>> that you can't pass anything back to the kernel but a simple status
>>>>> code (and that's assuming that you wait for it to exit). In the near
>>>>> future, we'll need to pass back more info to the kernel for this, so
>>>>> the usermode helper callout wasn't suitable.
>>>>>
>>>>
>>>>
>>>> I have answered that in my mail. Repeated here again. Well you made
>>>> a simple mistake. Because it is *easy* to pass back any number and
>>>> size of information from user-mode.
>>>>
>>>> You just setup a sysfs entry points where the answers are written
>>>> back to. It's an easy trick to setup a thread safe, way with a
>>>> cookie but 90% of the time you don't have to. Say you set up
>>>> a structure of per-client (identified uniquely) then user mode
>>>> answers back per client, concurrency will not do any harm, since
>>>> you answer to the same question the same answer. ans so on. Each
>>>> problem it's own.
>>>>
>>>> If you want we can talk about this, it would be easy for me to setup
>>>> a toll free conference number we can all use.
>>>
>>> That helpful advice would have been welcome about 3-4 months ago when I
>>> first proposed this in detail. At that point you're working with
>>> multiple upcall/downcall mechanisms, which was something I was keen to
>>> avoid.
>>>
>>> I'm not opposed to moving in that direction, but it basically means
>>> you're going to rip out everything I've got here so far and replace it.
>>>
>>> If you're willing to do that work, I'll be happy to work with you on
>>> it, but I don't have the time or inclination to do that on my own right
>>> now.
>>>
>>
>>
>> No such luck. sorry. I wish I could, but coming from a competing server
>> company, you can imagine the priority of that ever happening.
>> (Even though I use the Linux-Server everyday for my development and
>> am putting lots of efforts into still, mainly in pnfs)
>>
>> Hopefully re-examining the code, it could all be salvaged just the
>> same, only lots of code thrown a way.
>>
>> But mean-while please address my concern below:
>> Boaz Harrosh wrote:
>>
>>> One more thing, the most important one. We have already fixed that in the
>>> past and I was hoping the lesson was learned. Apparently it was not, and
>>> we are doomed to do this mistake for ever!!
>>>
>>> What ever crap fails times out and crashes, in the recovery code, we don't
>>> give a dam. It should never affect any Server-client communication.
>>>
>>> When the grace periods ends the clients gates opens period. *Any* error
>>> return from state recovery code must be carefully ignored and normal
>>> operations resumed. At most on error, we move into a mode where any
>>> recovery request from client is accepted, since we don't have any better
>>> data to verify it.
>>>
>>> Please comb recovery code to make sure any catastrophe is safely ignored.
>>> We already did that before and it used to work.
>>
>>
>> We should make sure that any state recovery code does not interfere with
>> regular operations. and fails gracefully / shuts up.
>>
>> We used to have that, apparently it re-broke. Clients should always be granted
>> access, after grace period. And Server should be made sure not to fail in any
>> situation.
>>
>> I would look into it but I'm not uptodate anymore, I wish you or Bruce could.
>>
>> Thanks for your work so far, sorry to be bearer of bad news
>> Boaz
>
> This problem turned out to be a fairly straightforward bug in the
> rpc_pipefs queue timeout mechanism that was causing the laundromat job
> to hang and hence to keep the state lock locked. I just sent a patch
> that should fix it.
>
> I guess I'm not clear on what you're saying is broken. Modulo the
> original bug here, clients are allowed access after the grace period
> whether the upcalls are working or not.
>
> What we cannot allow is reclaim requests outside of the grace period,
> since we can't verify whether there was conflicting state in the
> interim period. That's true whether the server has a functioning client
> tracking mechanism or not.
>

I agree. Sorry we keep communicating on two different threads. Dis regard
the other last mail.

Sounds good then. My point, we should be very defensive with state recovery
code not getting in our way.

Thanks
Boaz

2012-06-11 15:15:21

by J. Bruce Fields

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Mon, Jun 11, 2012 at 06:01:13PM +0300, Boaz Harrosh wrote:
> On 06/11/2012 05:29 PM, Jeff Layton wrote:
>
> > On Mon, 11 Jun 2012 16:44:09 +0300
> > Boaz Harrosh <[email protected]> wrote:
> >
> >> On 06/11/2012 04:32 PM, Boaz Harrosh wrote:
> >>
> >>> On 06/11/2012 03:39 PM, Jeff Layton wrote:
> >>>
> >>>>>
> >>>>> But I'm guessing we were wrong to assume that existing setups that
> >>>>> people perceived as working would have that path, because the failures
> >>>>> in the absence of that path were probably less obvious.
> >>>>>
> >>
> >>
> >> One more thing, the most important one. We have already fixed that in the
> >> past and I was hoping the lesson was learned. Apparently it was not, and
> >> we are doomed to do this mistake for ever!!
> >>
> >> What ever crap fails times out and crashes, in the recovery code, we don't
> >> give a dam. It should never affect any Server-client communication.
> >>
> >> When the grace periods ends the clients gates opens period. *Any* error
> >> return from state recovery code must be carefully ignored and normal
> >> operations resumed. At most on error, we move into a mode where any
> >> recovery request from client is accepted, since we don't have any better
> >> data to verify it.
> >>
> >> Please comb recovery code to make sure any catastrophe is safely ignored.
> >> We already did that before and it used to work.
> >>
> >
> > That's not the case, and hasn't ever been AFAICT. The code has changed
> > a bit recently, but the existing behavior in this regard was preserved.
> > From nfs4_check_open_reclaim:
> >
> > return nfsd4_client_record_check(clp) ? nfserr_reclaim_bad : nfs_ok;
> >
> > ...if there is no client record, then the reclaim request fails. Doesn't
> > the RFC mandate that?
> >
>
>
> Regardless of what RFC mandates and what is returned to client, (Which sounds
> very unrobust to me) I'm sure the client handles nfserr_reclaim_bad just
> fine.
>
> It's the server that's getting stuck in its own feet and stops responding.
> That's what I meant. We should always resume normal operations after
> the grace period ends.
>
> I did not see any reports of client getting into trouble because of
> unexpected nfserr_reclaim_bad, did you?

We did have a few bugs in that area, and as far as I know they're fixed
(and have stayed fixed!).

The one other thing we've seen at testing events is clients not sending
reclaim_complete: not only is it mandatory (with state to reclaim or
not), it's actually mandatory for servers to fail further operations
until it's sent. However the problems were all seen with unreleased
client code that the implementors said they'd fix.

--b.

2012-06-11 16:05:12

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 11.06.2012 16:20, Jeff Layton wrote:
> No problem. One possibility for the difference is that rpc_pipefs is
> mounted on the machine that doesn't show the problem, but isn't mouted
> on the machine that does.

Just for the records, after creating the v4recovery directory the
problem is gone. Currently I don't understand who has deleted the
directory (possibly me :) ), since it is part of the already installed
nfs-server package...

I'll check your patch later because my build host is rather slow...

regards,
Joerg

2012-06-11 16:26:27

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 06:15 PM, bfields wrote:

> On Mon, Jun 11, 2012 at 06:01:13PM +0300, Boaz Harrosh wrote:
>> On 06/11/2012 05:29 PM, Jeff Layton wrote:
>>
>>> On Mon, 11 Jun 2012 16:44:09 +0300
>>> Boaz Harrosh <[email protected]> wrote:
>>>
>>>> On 06/11/2012 04:32 PM, Boaz Harrosh wrote:
>>>>
>>>>> On 06/11/2012 03:39 PM, Jeff Layton wrote:
>>>>>
>>>>>>>
>>>>>>> But I'm guessing we were wrong to assume that existing setups that
>>>>>>> people perceived as working would have that path, because the failures
>>>>>>> in the absence of that path were probably less obvious.
>>>>>>>
>>>>
>>>>
>>>> One more thing, the most important one. We have already fixed that in the
>>>> past and I was hoping the lesson was learned. Apparently it was not, and
>>>> we are doomed to do this mistake for ever!!
>>>>
>>>> What ever crap fails times out and crashes, in the recovery code, we don't
>>>> give a dam. It should never affect any Server-client communication.
>>>>
>>>> When the grace periods ends the clients gates opens period. *Any* error
>>>> return from state recovery code must be carefully ignored and normal
>>>> operations resumed. At most on error, we move into a mode where any
>>>> recovery request from client is accepted, since we don't have any better
>>>> data to verify it.
>>>>
>>>> Please comb recovery code to make sure any catastrophe is safely ignored.
>>>> We already did that before and it used to work.
>>>>
>>>
>>> That's not the case, and hasn't ever been AFAICT. The code has changed
>>> a bit recently, but the existing behavior in this regard was preserved.
>>> From nfs4_check_open_reclaim:
>>>
>>> return nfsd4_client_record_check(clp) ? nfserr_reclaim_bad : nfs_ok;
>>>
>>> ...if there is no client record, then the reclaim request fails. Doesn't
>>> the RFC mandate that?
>>>
>>
>>
>> Regardless of what RFC mandates and what is returned to client, (Which sounds
>> very unrobust to me) I'm sure the client handles nfserr_reclaim_bad just
>> fine.
>>
>> It's the server that's getting stuck in its own feet and stops responding.
>> That's what I meant. We should always resume normal operations after
>> the grace period ends.
>>
>> I did not see any reports of client getting into trouble because of
>> unexpected nfserr_reclaim_bad, did you?
>
> We did have a few bugs in that area, and as far as I know they're fixed
> (and have stayed fixed!).
>
> The one other thing we've seen at testing events is clients not sending
> reclaim_complete: not only is it mandatory (with state to reclaim or
> not), it's actually mandatory for servers to fail further operations
> until it's sent. However the problems were all seen with unreleased
> client code that the implementors said they'd fix.
>

Yep, I agree. And additionally the bug Jeff fixed (At the other part of this
thread) where the recovery code managed to tromp all over the laundromat thread
and cause the Server to freeze.

I guess it's unavoidable. Bugs will always exist with new code. We should just
re audit the state recovery code in parts where it shares resources with regular
code, that errors are handled cleanly, like in Jeff's patch above.

> --b.

Thanks
Boaz

2012-06-15 15:24:17

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

I'm replying to this bug report since I found another nfs server
regression with the same kernel version (3.4.2), so the subject matches.
This time it is a "kernel BUG at fs/nfsd/nfs4state.c:1044"

Since the machine was not responsive after the BUG and rebooted I was
only able to take a picture of the full stacktrace. You can download it
from here:

https://ferdi.naasa.net/url/jplatte/IMG_3139.JPG

The patch "[PATCH] rpc_pipefs: allow rpc_purge_list to take a NULL waitq
pointer" was not applied for this trace, but the bug is the same with
this patch applied. So there must be another problem.

The crash is reproducable just by creating some nfs traffic and I can
hit it within minutes.

regards,
Joerg

2012-06-15 16:28:13

by J. Bruce Fields

[permalink] [raw]

Subject: Re: Kernel 3.4.X NFS server regression

On Fri, Jun 15, 2012 at 05:24:10PM +0200, Joerg Platte wrote:
> I'm replying to this bug report since I found another nfs server
> regression with the same kernel version (3.4.2), so the subject
> matches. This time it is a "kernel BUG at fs/nfsd/nfs4state.c:1044"
>
> Since the machine was not responsive after the BUG and rebooted I
> was only able to take a picture of the full stacktrace. You can
> download it from here:
>
> https://ferdi.naasa.net/url/jplatte/IMG_3139.JPG
>
> The patch "[PATCH] rpc_pipefs: allow rpc_purge_list to take a NULL
> waitq pointer" was not applied for this trace, but the bug is the
> same with this patch applied. So there must be another problem.
>
> The crash is reproducable just by creating some nfs traffic and I
> can hit it within minutes.

This is fixed by a patch submitted for 3.5.

If things go normally I assume Linus will pull this when he bets back
from vacation in the next few days and then it should go to stable not
too long after.

--b.

commit bc2df47a408f2d64cf81bcfd0f6e3e14c84cb0ab
Author: J. Bruce Fields <[email protected]>
Date: Tue Jun 12 08:28:48 2012 -0400

nfsd4: BUG_ON(!is_spin_locked()) no good on UP kernels

Most frequent symptom was a BUG triggering in expire_client, with the
server locking up shortly thereafter.

Introduced by 508dc6e110c6dbdc0bbe84298ccfe22de7538486 "nfsd41:
free_session/free_client must be called under the client_lock".

Cc: [email protected]
Cc: Benny Halevy <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 8fdc9ec..94effd5 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -900,7 +900,7 @@ static void free_session(struct kref *kref)
struct nfsd4_session *ses;
int mem;

- BUG_ON(!spin_is_locked(&client_lock));
+ lockdep_assert_held(&client_lock);
ses = container_of(kref, struct nfsd4_session, se_ref);
nfsd4_del_conns(ses);
spin_lock(&nfsd_drc_lock);
@@ -1080,7 +1080,7 @@ static struct nfs4_client *alloc_client(struct xdr_netobj name)
static inline void
free_client(struct nfs4_client *clp)
{
- BUG_ON(!spin_is_locked(&client_lock));
+ lockdep_assert_held(&client_lock);
while (!list_empty(&clp->cl_sessions)) {
struct nfsd4_session *ses;
ses = list_entry(clp->cl_sessions.next, struct nfsd4_session,

2012-06-15 17:19:26