2009-06-19 22:54:39

by Simon Kirby

Subject: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

'lo!

On a few different boxes with heavy NFSv3 serving (and quite a bit of
file locking), I am seeing the following Oops regularly recurring
(2.6.28.10 and 2.6.29.4):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000032
IP: [<ffffffff8068499c>] rpcb_getport_done+0x4c/0xd0
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/dm-47/removable
CPU 6
Modules linked in: drbd cn aoe xt_MARK bnx2 zlib_inflate e1000e
Pid: 1471, comm: rpciod/6 Not tainted 2.6.29.4-hw #2 PowerEdge 1950
RIP: 0010:[<ffffffff8068499c>] [<ffffffff8068499c>] rpcb_getport_done+0x4c/0xd0
RSP: 0018:ffff88043d8ddde0 EFLAGS: 00010282
RAX: 0000000000000012 RBX: ffff880309c80780 RCX: ffff88043de65930
RDX: 0000000000009554 RSI: 0000000000009554 RDI: ffff880309c80780
RBP: ffff88043d8dde00 R08: ffff880309c80d14 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff880309c80d00 R14: ffff880167591400 R15: ffff880167591490
FS: 0000000000000000(0000) GS:ffff88043efa9500(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000032 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rpciod/6 (pid: 1471, threadinfo ffff88043d8dc000, task ffff88043d8d1db0)
Stack:
ffff880167591400 ffff880167591400 0000000000000000 0000000000000001
ffff88043d8dde20 ffffffff8067ce97 0000000000000001 ffff8801675914b0
ffff88043d8dde60 ffffffff8067d487 ffff88043d8dde40 ffff8801675914b0
Call Trace:
[<ffffffff8067ce97>] rpc_exit_task+0x27/0x60
[<ffffffff8067d487>] __rpc_execute+0x87/0x2d0
[<ffffffff8067d6d0>] ? rpc_async_schedule+0x0/0x20
[<ffffffff8067d6e0>] rpc_async_schedule+0x10/0x20
[<ffffffff80252100>] run_workqueue+0xa0/0x150
[<ffffffff80252cb3>] worker_thread+0x93/0xd0
[<ffffffff80256370>] ? autoremove_wake_function+0x0/0x40
[<ffffffff80252c20>] ? worker_thread+0x0/0xd0
[<ffffffff80255efd>] kthread+0x4d/0x80
[<ffffffff8020d61a>] child_rip+0xa/0x20
[<ffffffff80255eb0>] ? kthread+0x0/0x80
[<ffffffff8020d610>] ? child_rip+0x0/0x20
Code: 44 8b 67 30 48 8b 1e 41 83 fc fb 74 32 41 83 fc a3 74 2f 45 85 e4 78 30 8b 56 14 66 85 d2 90 74 55 48 8b 43 08 0f b7 f2 48 89 df <ff> 50 20 f0 0f ba ab e0 03 00 00 04 19 c0 45 31 e4 eb 16 90 41
RIP [<ffffffff8068499c>] rpcb_getport_done+0x4c/0xd0
RSP <ffff88043d8ddde0>
CR2: 0000000000000032
---[ end trace 209472a1a8b9753c ]---

After this occurs, a lot of "lockd: server x not responding, timed out"
messages start occurring for almost all of the clients, and locking
becomes lossy (it works, eventually), until the NFS server is rebooted.

The backtrace is always the same and always in rpcb_getport_done().

(gdb) disassemble rpcb_getport_done
Dump of assembler code for function rpcb_getport_done:
00e0 <rpcb_getport_done+0>: push %rbp
00e1 <rpcb_getport_done+1>: mov %rsp,%rbp
00e4 <rpcb_getport_done+4>: sub $0x20,%rsp
00e8 <rpcb_getport_done+8>: mov %r13,0x10(%rsp)
00ed <rpcb_getport_done+13>: mov %r14,0x18(%rsp)
00f2 <rpcb_getport_done+18>: mov %rsi,%r13
00f5 <rpcb_getport_done+21>: mov %rbx,(%rsp)
00f9 <rpcb_getport_done+25>: mov %r12,0x8(%rsp)
00fe <rpcb_getport_done+30>: mov %rdi,%r14
0101 <rpcb_getport_done+33>: mov 0x30(%rdi),%r12d
0105 <rpcb_getport_done+37>: mov (%rsi),%rbx
0108 <rpcb_getport_done+40>: cmp $0xfffffffffffffffb,%r12d
010c <rpcb_getport_done+44>: je 0x140 <rpcb_getport_done+96>
010e <rpcb_getport_done+46>: cmp $0xffffffffffffffa3,%r12d
0112 <rpcb_getport_done+50>: je 0x143 <rpcb_getport_done+99>
0114 <rpcb_getport_done+52>: test %r12d,%r12d
0117 <rpcb_getport_done+55>: js 0x149 <rpcb_getport_done+105>
0119 <rpcb_getport_done+57>: mov 0x14(%rsi),%edx
011c <rpcb_getport_done+60>: test %dx,%dx
011f <rpcb_getport_done+63>: nop
0120 <rpcb_getport_done+64>: je 0x177 <rpcb_getport_done+151>
0122 <rpcb_getport_done+66>: mov 0x8(%rbx),%rax
0126 <rpcb_getport_done+70>: movzwl %dx,%esi
0129 <rpcb_getport_done+73>: mov %rbx,%rdi
012c <rpcb_getport_done+76>: callq *0x20(%rax) <================
012f <rpcb_getport_done+79>: lock btsl $0x4,0x3e0(%rbx)
0138 <rpcb_getport_done+88>: sbb %eax,%eax
013a <rpcb_getport_done+90>: xor %r12d,%r12d
013d <rpcb_getport_done+93>: jmp 0x155 <rpcb_getport_done+117>
013f <rpcb_getport_done+95>: nop
0140 <rpcb_getport_done+96>: mov $0xa3,%r12b
0143 <rpcb_getport_done+99>: incl 0x3ec(%rbx)
0149 <rpcb_getport_done+105>: mov 0x8(%rbx),%rax
014d <rpcb_getport_done+109>: xor %esi,%esi
014f <rpcb_getport_done+111>: mov %rbx,%rdi
0152 <rpcb_getport_done+114>: callq *0x20(%rax)
...

From my crappy assembler reading, this looks like this code:

/* Succeeded */
xprt->ops->set_port(xprt, map->r_port);

RAX is 0x12, and 0x12 + 0x20 is the fault address of 0x32, so it looks
like xprt or ops is NULL. 0x20 is the offset of set_port in xprt->ops.
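
To make the pointer math concrete, here is a throwaway C sketch (struct
and function names are made up; only the 0x20 offset and the 0x12 value
come from the disassembly and register dump above):

struct fake_xprt_ops {
    char pad[0x20];                     /* whatever precedes set_port */
    void (*set_port)(void *xprt, unsigned short port); /* offset 0x20 */
};

struct fake_xprt {
    struct fake_xprt_ops *ops;
};

static void set_port_sketch(struct fake_xprt *xprt, unsigned short port)
{
    /*
     * If xprt->ops has been scribbled over with (void *)0x12, then the
     * indirect call ("callq *0x20(%rax)" above) must load the function
     * pointer from 0x12 + 0x20 == 0x32, which is exactly the faulting
     * address (CR2) in the oops.
     */
    xprt->ops->set_port(xprt, port);
}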

Looks like there is some sort of race here...? :)

Simon-


2009-06-22 21:11:24

by Simon Kirby

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Sat, Jun 20, 2009 at 10:09:41PM -0700, Simon Kirby wrote:

> Actually, we just saw another similar crash on another machine which is
> an NFS client from this server (no nfsd running). Same backtrace, but
> this time RAX was "32322e32352e3031", which is obviously ASCII
> ("22.25.01"), so memory scribbling seems to definitely be happening...

Good news: 2.6.30 seems to have fixed whatever the original scribbling
source was. I see at least a couple of suspect commits in the log, but
I'm not sure which yet.

However, with 2.6.30, it seems 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad
is causing us a large performance regression. The server's response
latency is huge compared to normal. I suspected this patch was the
culprit, so I wrote over the instruction that loads SVC_MAX_WAKING before
this comparison:

+ if (pool->sp_nwaking >= SVC_MAX_WAKING) {
+ /* too many threads are runnable and trying to wake up */
+ thread_avail = 0;
+ }

...when I raised SVC_MAX_WAKING to 40ish, the problem for us disappears.
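
For reference, this is roughly the logic that hunk adds, as I read it
(my own sketch with made-up helper names; only sp_nwaking, thread_avail,
and SVC_MAX_WAKING come from the actual patch):

#define SVC_MAX_WAKING 5                /* the upstream default */

struct svc_pool_sketch {
    unsigned int sp_nwaking;    /* threads woken, not yet back in svc_recv() */
};

static int  pool_has_idle_thread(struct svc_pool_sketch *p) { return 1; }   /* stub */
static void wake_one_thread(struct svc_pool_sketch *p) { p->sp_nwaking++; } /* stub */
static void queue_request(struct svc_pool_sketch *p)   { (void)p; }         /* stub */

/* Called when a packet arrives on a transport. */
static void enqueue_sketch(struct svc_pool_sketch *p)
{
    int thread_avail = pool_has_idle_thread(p);

    if (p->sp_nwaking >= SVC_MAX_WAKING) {
        /* too many threads are runnable and trying to wake up */
        thread_avail = 0;
    }

    if (thread_avail)
        wake_one_thread(p);     /* sp_nwaking only drops once that thread
                                 * makes it back into svc_recv() */
    else
        queue_request(p);       /* nothing retries this wakeup until the
                                 * next packet arrives */
}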

The problem is that with just 72 nfsd processes running, the NFS socket
has a ~1 MB backlog of packets on it, even though "ps" shows most of the
nfsd threads are not blocked. This is on an 8 core system, with high NFS
packet rates. More NFS threads (300) made no difference.

As soon as I raised SVC_MAX_WAKING, the load average went up again to
what it normally was before with 2.6.29, but the socket's receive backlog
went down to nearly 0 again, and the request latency is now back to
normal.

I think the issue here is that whatever calls svc_xprt_enqueue() isn't
doing it again as soon as the threads sleep again, but only when the next
packet comes in, or something...

Simon-

2009-06-20 19:57:37

by Trond Myklebust

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Fri, 2009-06-19 at 15:54 -0700, Simon Kirby wrote:
> 'lo!
>
> On a few different boxes with heavy NFSv3 serving (and quite a bit of
> file locking), I am seeing the following Oops regularly recurring
> (2.6.28.10 and 2.6.29.4):
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000032
> IP: [<ffffffff8068499c>] rpcb_getport_done+0x4c/0xd0
> PGD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/block/dm-47/removable
> CPU 6
> Modules linked in: drbd cn aoe xt_MARK bnx2 zlib_inflate e1000e
> Pid: 1471, comm: rpciod/6 Not tainted 2.6.29.4-hw #2 PowerEdge 1950
> RIP: 0010:[<ffffffff8068499c>] [<ffffffff8068499c>] rpcb_getport_done+0x4c/0xd0
> RSP: 0018:ffff88043d8ddde0 EFLAGS: 00010282
> RAX: 0000000000000012 RBX: ffff880309c80780 RCX: ffff88043de65930
> RDX: 0000000000009554 RSI: 0000000000009554 RDI: ffff880309c80780
> RBP: ffff88043d8dde00 R08: ffff880309c80d14 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff880309c80d00 R14: ffff880167591400 R15: ffff880167591490
> FS: 0000000000000000(0000) GS:ffff88043efa9500(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000032 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process rpciod/6 (pid: 1471, threadinfo ffff88043d8dc000, task ffff88043d8d1db0)
> Stack:
> ffff880167591400 ffff880167591400 0000000000000000 0000000000000001
> ffff88043d8dde20 ffffffff8067ce97 0000000000000001 ffff8801675914b0
> ffff88043d8dde60 ffffffff8067d487 ffff88043d8dde40 ffff8801675914b0
> Call Trace:
> [<ffffffff8067ce97>] rpc_exit_task+0x27/0x60
> [<ffffffff8067d487>] __rpc_execute+0x87/0x2d0
> [<ffffffff8067d6d0>] ? rpc_async_schedule+0x0/0x20
> [<ffffffff8067d6e0>] rpc_async_schedule+0x10/0x20
> [<ffffffff80252100>] run_workqueue+0xa0/0x150
> [<ffffffff80252cb3>] worker_thread+0x93/0xd0
> [<ffffffff80256370>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff80252c20>] ? worker_thread+0x0/0xd0
> [<ffffffff80255efd>] kthread+0x4d/0x80
> [<ffffffff8020d61a>] child_rip+0xa/0x20
> [<ffffffff80255eb0>] ? kthread+0x0/0x80
> [<ffffffff8020d610>] ? child_rip+0x0/0x20
> Code: 44 8b 67 30 48 8b 1e 41 83 fc fb 74 32 41 83 fc a3 74 2f 45 85 e4 78 30 8b 56 14 66 85 d2 90 74 55 48 8b 43 08 0f b7 f2 48 89 df <ff> 50 20 f0 0f ba ab e0 03 00 00 04 19 c0 45 31 e4 eb 16 90 41
> RIP [<ffffffff8068499c>] rpcb_getport_done+0x4c/0xd0
> RSP <ffff88043d8ddde0>
> CR2: 0000000000000032
> ---[ end trace 209472a1a8b9753c ]---
>
> After this occurs, a lot of "lockd: server x not responding, timed out"
> messages start occurring for almost all of the clients, and locking
> becomes lossy (it works, eventually), until the NFS server is rebooted.
>
> The backtrace is always the same and always in rpcb_getport_done().
>
> (gdb) disassemble rpcb_getport_done
> Dump of assembler code for function rpcb_getport_done:
> 00e0 <rpcb_getport_done+0>: push %rbp
> 00e1 <rpcb_getport_done+1>: mov %rsp,%rbp
> 00e4 <rpcb_getport_done+4>: sub $0x20,%rsp
> 00e8 <rpcb_getport_done+8>: mov %r13,0x10(%rsp)
> 00ed <rpcb_getport_done+13>: mov %r14,0x18(%rsp)
> 00f2 <rpcb_getport_done+18>: mov %rsi,%r13
> 00f5 <rpcb_getport_done+21>: mov %rbx,(%rsp)
> 00f9 <rpcb_getport_done+25>: mov %r12,0x8(%rsp)
> 00fe <rpcb_getport_done+30>: mov %rdi,%r14
> 0101 <rpcb_getport_done+33>: mov 0x30(%rdi),%r12d
> 0105 <rpcb_getport_done+37>: mov (%rsi),%rbx
> 0108 <rpcb_getport_done+40>: cmp $0xfffffffffffffffb,%r12d
> 010c <rpcb_getport_done+44>: je 0x140 <rpcb_getport_done+96>
> 010e <rpcb_getport_done+46>: cmp $0xffffffffffffffa3,%r12d
> 0112 <rpcb_getport_done+50>: je 0x143 <rpcb_getport_done+99>
> 0114 <rpcb_getport_done+52>: test %r12d,%r12d
> 0117 <rpcb_getport_done+55>: js 0x149 <rpcb_getport_done+105>
> 0119 <rpcb_getport_done+57>: mov 0x14(%rsi),%edx
> 011c <rpcb_getport_done+60>: test %dx,%dx
> 011f <rpcb_getport_done+63>: nop
> 0120 <rpcb_getport_done+64>: je 0x177 <rpcb_getport_done+151>
> 0122 <rpcb_getport_done+66>: mov 0x8(%rbx),%rax
> 0126 <rpcb_getport_done+70>: movzwl %dx,%esi
> 0129 <rpcb_getport_done+73>: mov %rbx,%rdi
> 012c <rpcb_getport_done+76>: callq *0x20(%rax) <================
> 012f <rpcb_getport_done+79>: lock btsl $0x4,0x3e0(%rbx)
> 0138 <rpcb_getport_done+88>: sbb %eax,%eax
> 013a <rpcb_getport_done+90>: xor %r12d,%r12d
> 013d <rpcb_getport_done+93>: jmp 0x155 <rpcb_getport_done+117>
> 013f <rpcb_getport_done+95>: nop
> 0140 <rpcb_getport_done+96>: mov $0xa3,%r12b
> 0143 <rpcb_getport_done+99>: incl 0x3ec(%rbx)
> 0149 <rpcb_getport_done+105>: mov 0x8(%rbx),%rax
> 014d <rpcb_getport_done+109>: xor %esi,%esi
> 014f <rpcb_getport_done+111>: mov %rbx,%rdi
> 0152 <rpcb_getport_done+114>: callq *0x20(%rax)
> ...
>
> From my crappy assembler reading, this looks like this code:
>
> /* Succeeded */
> xprt->ops->set_port(xprt, map->r_port);
>
> RAX is 0x12, and 0x12 + 0x20 is the fault address of 0x32, so it looks
> like xprt or ops is NULL. 0x20 is the offset of set_port in xprt->ops.

From your disassembly, it looks as if the value of xprt->ops is 0x12.

> Looks like there is some sort of race here...? :)

Looks more like some kind of memory scribble to me. The value of
xprt->ops never changes after initialisation of the transport.

Have you tried turning on the various memory debugging tools?

Trond


2009-06-21 05:09:42

by Simon Kirby

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Sat, Jun 20, 2009 at 03:57:35PM -0400, Trond Myklebust wrote:

> Looks more like some kind of memory scribble to me. The value of
> xprt->ops never changes after initialisation of the transport.
>
> Have you tried turning on the various memory debugging tools?

Hi Trond,

No, but I will try this now.

Actually, we just saw another similar crash on another machine which is
an NFS client from this server (no nfsd running). Same backtrace, but
this time RAX was "32322e32352e3031", which is obviously ASCII
("22.25.01"), so memory scribbling seems to definitely be happening...

Simon-

2009-07-09 17:27:42

by Simon Kirby

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

Hello,

It seems this email to Greg Banks is bouncing (no longer works at SGI),
and I see git commit 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad is still
in HEAD (and still causing problems for our load).

Can somebody else eyeball this, please? I don't understand enough about
this particular change to fix the request latency / queue backlogging
that this patch seems to introduce.

It would seem to me that this patch is flawed because svc_xprt_enqueue()
is edge-triggered upon the arrival of packets, but the NFS threads
themselves cannot then pull another request off of the socket queue.
This patch likely helps with the particular benchmark, but not in our
load case where there is a heavy mix of cached and uncached NFS requests.

Simon-

On Mon, Jun 22, 2009 at 02:11:26PM -0700, Simon Kirby wrote:

> On Sat, Jun 20, 2009 at 10:09:41PM -0700, Simon Kirby wrote:
>
> > Actually, we just saw another similar crash on another machine which is
> > an NFS client from this server (no nfsd running). Same backtrace, but
> > this time RAX was "32322e32352e3031", which is obviously ASCII
> > ("22.25.01"), so memory scribbling seems to definitely be happening...
>
> Good news: 2.6.30 seems to have fixed whatever the original scribbling
> source was. I see at least a couple of suspect commits in the log, but
> I'm not sure which yet.
>
> However, with 2.6.30, it seems 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad
> is causing us a large performance regression. The server's response
> latency is huge compared to normal. I suspected this patch was the
> culprit, so I wrote over the instruction that loads SVC_MAX_WAKING before
> this comparison:
>
> + if (pool->sp_nwaking >= SVC_MAX_WAKING) {
> + /* too many threads are runnable and trying to wake up */
> + thread_avail = 0;
> + }
>
> ...when I raised SVC_MAX_WAKING to 40ish, the problem for us disappears.
>
> The problem is that with just 72 nfsd processes running, the NFS socket
> has a ~1 MB backlog of packets on it, even though "ps" shows most of the
> nfsd threads are not blocked. This is on an 8 core system, with high NFS
> packet rates. More NFS threads (300) made no difference.
>
> As soon as I raised SVC_MAX_WAKING, the load average went up again to
> what it normally was before with 2.6.29, but the socket's receive backlog
> went down to nearly 0 again, and the request latency is now back to
> normal.
>
> I think the issue here is that whatever calls svc_xprt_enqueue() isn't
> doing it again as soon as the threads sleep again, but only when the next
> packet comes in, or something...
>
> Simon-

2009-07-10 22:34:09

by J. Bruce Fields

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Thu, Jul 09, 2009 at 10:27:39AM -0700, Simon Kirby wrote:
> Hello,
>
> It seems this email to Greg Banks is bouncing (no longer works at SGI),

Yes, I've cc'd his new address. (But he's on vacation.)

> and I see git commit 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad is still
> in HEAD (and still causing problems for our load).
>
> Can somebody else eyeball this, please? I don't understand enough about
> this particular change to fix the request latency / queue backlogging
> that this patch seems to introduce.
>
> It would seem to me that this patch is flawed because svc_xprt_enqueue()
> is edge-triggered upon the arrival of packets, but the NFS threads
> themselves cannot then pull another request off of the socket queue.
> This patch likely helps with the particular benchmark, but not in our
> load case where there is a heavy mix of cached and uncached NFS requests.

That sounds plausible. I'll need to take some time to look at it.

--b.

>
> Simon-
>
> On Mon, Jun 22, 2009 at 02:11:26PM -0700, Simon Kirby wrote:
>
> > On Sat, Jun 20, 2009 at 10:09:41PM -0700, Simon Kirby wrote:
> >
> > > Actually, we just saw another similar crash on another machine which is
> > > an NFS client from this server (no nfsd running). Same backtrace, but
> > > this time RAX was "32322e32352e3031", which is obviously ASCII
> > > ("22.25.01"), so memory scribbling seems to definitely be happening...
> >
> > Good news: 2.6.30 seems to have fixed whatever the original scribbling
> > source was. I see at least a couple of suspect commits in the log, but
> > I'm not sure which yet.
> >
> > However, with 2.6.30, it seems 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad
> > is causing us a large performance regression. The server's response
> > latency is huge compared to normal. I suspected this patch was the
> > culprit, so I wrote over the instruction that loads SVC_MAX_WAKING before
> > this comparison:
> >
> > + if (pool->sp_nwaking >= SVC_MAX_WAKING) {
> > + /* too many threads are runnable and trying to wake up */
> > + thread_avail = 0;
> > + }
> >
> > ...when I raised SVC_MAX_WAKING to 40ish, the problem for us disappears.
> >
> > The problem is that with just 72 nfsd processes running, the NFS socket
> > has a ~1 MB backlog of packets on it, even though "ps" shows most of the
> > nfsd threads are not blocked. This is on an 8 core system, with high NFS
> > packet rates. More NFS threads (300) made no difference.
> >
> > As soon as I raised SVC_MAX_WAKING, the load average went up again to
> > what it normally was before with 2.6.29, but the socket's receive backlog
> > went down to nearly 0 again, and the request latency is now back to
> > normal.
> >
> > I think the issue here is that whatever calls svc_xprt_enqueue() isn't
> > doing it again as soon as the threads sleep again, but only when the next
> > packet comes in, or something...
> >
> > Simon-

2009-08-10 23:55:35

by J. Bruce Fields

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Fri, Jul 10, 2009 at 06:34:08PM -0400, bfields wrote:
> On Thu, Jul 09, 2009 at 10:27:39AM -0700, Simon Kirby wrote:
> > Hello,
> >
> > It seems this email to Greg Banks is bouncing (no longer works at SGI),
>
> Yes, I've cc'd his new address. (But he's on vacation.)
>
> > and I see git commit 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad is still
> > in HEAD (and still causing problems for our load).
> >
> > Can somebody else eyeball this, please? I don't understand enough about
> > this particular change to fix the request latency / queue backlogging
> > that this patch seems to introduce.
> >
> > It would seem to me that this patch is flawed because svc_xprt_enqueue()
> > is edge-triggered upon the arrival of packets, but the NFS threads
> > themselves cannot then pull another request off of the socket queue.
> > This patch likely helps with the particular benchmark, but not in our
> > load case where there is a heavy mix of cached and uncached NFS requests.
>
> That sounds plausible. I'll need to take some time to look at it.

Looking back at that commit--I'm now confused about how it was meant to
work. In the case where the woken-up thread is waiting inside of
svc_recv(), ->nwaking doesn't get decremented at all until the request
is processed and svc_recv() is called again--effectively limiting the
number of concurrent requests to 5 per pool, so, if I read the code
correctly, likely to cause problems if your workload would benefit from
lots of requests being able to wait on io simultaneously (e.g. if you
have a large working set and more than 5 spindles per pool).
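
To spell out the lifecycle I mean (a rough sketch with placeholder
names, not the real nfsd main loop or the sunrpc helpers):

struct pool_sketch {
    unsigned int nwaking;       /* stands in for pool->sp_nwaking */
};

static void handle_request(struct pool_sketch *p)
{
    (void)p;    /* parse, do disk I/O (possibly sleeping a while), reply */
}

static void nfsd_thread_sketch(struct pool_sketch *p)
{
    for (;;) {
        /*
         * Back at the top of the receive path: only here does a thread
         * that was woken through the throttled path stop counting
         * against the limit.
         */
        if (p->nwaking)
            p->nwaking--;

        /* ... sleep until a request is queued for this thread ... */

        /*
         * All of request processing, disk I/O included, still counts
         * against SVC_MAX_WAKING, so at most five requests per pool
         * are driven through this path at once.
         */
        handle_request(p);
    }
}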

The nwaking count probably also leaks in some cases (e.g. if a thread
exits?).

I'm inclined to revert the patch and take another look at Greg's
original problem.

--b.

2009-08-11 17:17:47

by Simon Kirby

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Mon, Aug 10, 2009 at 07:55:36PM -0400, J. Bruce Fields wrote:

> Looking back at that commit--I'm now confused about how it was meant to
> work. In the case where the woken-up thread is waiting inside of
> svc_recv(), ->nwaking doesn't get decremented at all until the request
> is processed and svc_recv() is called again--effectively limiting the
> number of concurrent requests to 5 per pool, so, if I read the code
> correctly, likely to cause problems if your workload would benefit from
> lots of requests being able to wait on io simultaneously (e.g. if you
> have a large working set and more than 5 spindles per pool).

Yes, this box is serving about 50 TB of storage space, so there are more
than 5 spindles. :)

I can't believe others aren't all complaining about the same problem, but
I guess the loads are different.

> I'm inclined to revert the patch and take another look at Greg's
> original problem.

I'm inclined to be totally happy with that! :)

Simon-

2009-10-15 22:07:27

by Simon Kirby

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Tue, Aug 11, 2009 at 10:17:45AM -0700, Simon Kirby wrote:

> On Mon, Aug 10, 2009 at 07:55:36PM -0400, J. Bruce Fields wrote:
>
> > Looking back at that commit--I'm now confused about how it was meant to
> > work. In the case where the woken-up thread is waiting inside of
> > svc_recv(), ->nwaking doesn't get decremented at all until the request
> > is processed and svc_recv() is called again--effectively limiting the
> > number of concurrent requests to 5 per pool, so, if I read the code
> > correctly, likely to cause problems if your workload would benefit from
> > lots of requests being able to wait on io simultaneously (e.g. if you
> > have a large working set and more than 5 spindles per pool).
>
> Yes, this box is serving about 50 TB of storage space, so there are more
> than 5 spindles. :)
>
> I can't believe others aren't all complaining about the same problem, but
> I guess the loads are different.
>
> > I'm inclined to revert the patch and take another look at Greg's
> > original problem.
>
> I'm inclined to be totally happy with that! :)

By the way, the original commit for this,
59a252ff8c0f2fa32c896f69d56ae33e641ce7ad, still seems to be in the
kernel. Would you like me to make a patch to remove this code?

I suspect it can't just be pulled as-is, because the pool stats added
later now export "overloads-avoided" through /proc/fs/nfsd/pool_stats.
I suspect the header is there to make the format dynamic (e.g. we could
kill overloads-avoided) without breaking backwards compatibility...?
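
If parsers went by the header names rather than by fixed column
positions, a column like overloads-avoided could presumably be dropped
without breaking them. A quick userspace sketch of what I mean (the
exact header text is my assumption, not quoted from the kernel):

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/fs/nfsd/pool_stats", "r");
    char hdr[512], line[512], *tok;
    int col, wanted = -1;

    if (!f || !fgets(hdr, sizeof(hdr), f))
        return 1;

    /* Work out which column, if any, is named "overloads-avoided". */
    col = 0;
    for (tok = strtok(hdr, "# \n"); tok; tok = strtok(NULL, "# \n"), col++)
        if (strcmp(tok, "overloads-avoided") == 0)
            wanted = col;

    /* Print that field for every pool line, whatever its position. */
    while (fgets(line, sizeof(line), f)) {
        col = 0;
        for (tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n"), col++)
            if (col == wanted)
                printf("overloads-avoided: %s\n", tok);
    }
    fclose(f);
    return 0;
}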

I see overloads-avoided still rising even with SVC_MAX_WAKING set to
127 instead of 5. It seems to happen a lot when storage gets a little
stuck, but it is probably increasing the latency of other requests that
could be served from cache.

Speaking of latency, I was also looking at blktrace output for the
underlying devices, and it seems like there are cases where knfsd is
issuing _read_ requests in a way that leaves the queue plugged sometimes,
causing the unplug timer to have to clear it. I still need to try to
track this down (see recent linux-btrace thread).

Simon-

2009-10-15 22:52:22

by J. Bruce Fields

Subject: Re: kernel NULL pointer dereference in rpcb_getport_done (2.6.29.4)

On Thu, Oct 15, 2009 at 02:46:38PM -0700, Simon Kirby wrote:
> On Tue, Aug 11, 2009 at 10:17:45AM -0700, Simon Kirby wrote:
>
> > On Mon, Aug 10, 2009 at 07:55:36PM -0400, J. Bruce Fields wrote:
> >
> > > Looking back at that commit--I'm now confused about how it was meant to
> > > work. In the case where the woken-up thread is waiting inside of
> > > svc_recv(), ->nwaking doesn't get decremented at all until the request
> > > is processed and svc_recv() is called again--effectively limiting the
> > > number of concurrent requests to 5 per pool, so, if I read the code
> > > correctly, likely to cause problems if your workload would benefit from
> > > lots of requests being able to wait on io simultaneously (e.g. if you
> > > have a large working set and more than 5 spindles per pool).
> >
> > Yes, this box is serving about 50 TB of storage space, so there are more
> > than 5 spindles. :)
> >
> > I can't believe others aren't all complaining about the same problem, but
> > I guess the loads are different.
> >
> > > I'm inclined to revert the patch and take another look at Greg's
> > > original problem.
> >
> > I'm inclined to be totally happy with that! :)
>
> By the way, the original commit for this,
> 59a252ff8c0f2fa32c896f69d56ae33e641ce7ad, still seems to be in the
> kernel.

Apologies, yes, this fell through the cracks.

> Would you like me to make a patch to remove this code?

That would be helpful.

And I think Greg may be back from his travels, so perhaps he'll be able
to comment on it too.

> I suspect it can't just be pulled as-is, because the pool stats added
> later now export "overloads-avoided" through /proc/fs/nfsd/pool_stats.
> I suspect the header is there to make the format dynamic (e.g. we could
> kill overloads-avoided) without breaking backwards compatibility...?

A line-based instead of column-based format probably would have made it
more likely parsers would get this right. Hm, I thought

http://oss.sgi.com/projects/pcp/

had some code that parsed this? But a quick "git grep pool_stats" on
the git repo doesn't show me anything.

We could be conservative and just set the field to zero, but it would be
better if the format were flexible--and changing the number of columns
early on might be one way to ensure people know it will be....

Of course, best would be if Greg could explain what we're
misunderstanding about the overload avoidance--perhaps there's still
a simple bugfix that would make it work for your situation....

> I see overloads-avoided still rising even with SVC_MAX_WAKING set to
> 127 instead of 5. It seems to happen a lot when storage gets a little
> stuck, but it is probably increasing the latency of other requests that
> could be served from cache.
>
> Speaking of latency, I was also looking at blktrace output for the
> underlying devices, and it seems like there are cases where knfsd is
> issuing _read_ requests in a way that leave the queue plugged sometimes,
> causing the unplug timer to have to clear it. I still need to try to
> track this down (see recent linux-btrace thread).

A quick web search failed to turn up that thread for me.

--b.