LinuxLists.cc - 5.3.0 Regression: rpc.nfsd v4 uninterruptible sleep for 5+ minutes w/o rpc-statd/etc

2019-09-19 13:54:07

Subject: 5.3.0 Regression: rpc.nfsd v4 uninterruptible sleep for 5+ minutes w/o rpc-statd/etc

For a really long time (years?) if you forced NFS v4 only, you could
mask a lot of unnecessary services.

In /etc/nfs.conf, in "[nfsd] I've been able to set "vers3=n", and then
mask the following services:
* gssproxy
* nfs-blkmap
* rpc-statd
* rpcbind (service & socket)

Upgrading from 5.2.14 to 5.3.0, nfs-server.service (rpc.nfsd) has
exactly a 5 minute delay, and sometimes longer.

nfs-utils 2.4.1 (latest). Downgrading only linux back to 5.2.14 makes
it work again.

Unmasking the above services/socket is a temporary workaround.

I'll mention I do include "ipv6.disable=1" in my kernel arguments,
since I see in strace it attempts using IPV6 and fails right after the
hang. Don't think that's related, but just in case.

ALL DIAGNOSTICS BELOW are with everything commented in "/etc/exports",
because it reproduces the problem as well. They are also with
everything commented in "/etc/nfs.conf" except for in "[nfsd]"
"vers3=n".

In IRC, bcodding said he's been looking at a similar bug, but that
probably isn't exactly like mine. Potentially having to do with
changes of handling of ENOTCON and EAGAIN in net/sunrpc/clnt.c, not
sure if that refers to mine or his.

The default nfs.conf includes "[nfsd]... # debug=0" which sort of
suggests to get debugging information to set "debug=1". But, doing
this gave "rpc.nfsd: Invalid debug facility: 1". I also tried
"debug=0x7FFF" from debug.h for NFSDDBG_ALL, but it doesn't like that
either. "man rpc.nfsd" shows the --debug option. I disabled
nfs-server.service and rebooted to manually start it with this option,
and it gave a whopping 3 lines at start, and nothing during or after
the 5 minute delay:

=====
rpc.nfsd: knfsd is currently down
rpc.nfsd: Writing version string to kernel: -2 -3
rpc.nfsd: Created AF_INET TCP socket.
=====

The first time I ran strace, it goes into uninterruptible sleep and shows:

=====
openat(AT_FDCWD, "/proc/fs/nfsd/versions", O_RDONLY) = 3
read(3, <<the massive delay is here>>"-2 -3 +4 +4.1 +4.2\n", 128) = 19
=====

After re-running it after speaking with bcodding, strace showed:

=====
bind(4, (sa_family=AF_INET, sin_port=htons(2049),
sin_addr=inet_addr("0.0.0.0")), 16) = 0
listen(4, 64) = 0
write(3 "4\n", 2<<the massive delay is here>>) = 2
=====

Not sure why the change in the strace hang spot.

After speaking with bcodding, he gave me a list of events to set,
visible here: http://ix.io/1nTX

I ran "strace rpc.nfsd --debug &> log" (so the strace and the --debug
is mixed) which is viewable here: http://ix.io/1o2s

And, the contents of /sys/kernel/debug/tracing/trace after it finally
completed was 143MB, compared to 5.2.14's 17K! This trace including
the first and last 1000 lines, skipping the hundreds of thousands of
lines between so it's only 325K is viewable here: http://ix.io/1o2y

This trace only goes from kernel time 990.504680 to 995.938377. I
think it must have hit a maximum MB or number of lines, because it
didn't come out of that for for 5 minutes.

He also asked to see the trace of it working (which I hope meant on
5.2.14, rather than with services unmasked.)

The 5.2.14 working strace with --debug is viewable here: http://ix.io/1o8V

And the 5.2.14 working trace is viewable here: http://ix.io/1ofY

2019-09-19 13:55:04

by James Harvey

[permalink] [raw]

Subject: Re: 5.3.0 Regression: rpc.nfsd v4 uninterruptible sleep for 5+ minutes w/o rpc-statd/etc

On Thu, Sep 19, 2019 at 9:00 AM James Harvey <[email protected]> wrote:
>
> For a really long time (years?) if you forced NFS v4 only, you could
> mask a lot of unnecessary services.
>
> In /etc/nfs.conf, in "[nfsd] I've been able to set "vers3=n", and then
> mask the following services:
> * gssproxy
> * nfs-blkmap
> * rpc-statd
> * rpcbind (service & socket)
>
> Upgrading from 5.2.14 to 5.3.0, nfs-server.service (rpc.nfsd) has
> exactly a 5 minute delay, and sometimes longer.
>
> ...

P.S. During a few of the earlier boots where I was seeing strace show
the delay was reading "/proc/fs/nfsd/versions", I do have this in
journalctl, when I tried in another term cat'ing it during the
rpc.nfsd delay: (it repeats a few times during the 5 minute delay)

INFO: task cat:3400 blocked for more than 122 seconds.
Not tainted 5.3.0-arch1-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cat D 0 3400 3029 0x00000080
Call Trace:
? __schedule+0x27f/0x6d0
schedule+0x43/0xd0
schedule_preempt_disabled+0x14/0x20
__mutex_lock.isra.0+0x27d/0x530
write_versions+0x38/0x3c0 [nfsd]
? _copy_from_user+0x37/0x60
? write_ports+0x320/0x320 [nfsd]
nfsctl_transaction_write+0x45/0x70 [nfsd]
nfsctl_transaction_read+0x3b/0x60 [nfsd]
vfs_read+0x9d/0x150
ksys_read+0x67/0xe0
do_syscall_64+0x5f/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f47286fa155
Code: Bad RIP value.
RSP: 002b:00007ffcd2a74438 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f47286fa155
RDX: 0000000000020000 RSI: 00007f47282ba000 RDI: 0000000000000003
RBP: 00007f47282ba000 R08: 00007f47282b9010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 00007f47282ba000
R13: 0000000000000003 R14: 0000000000000fff R15: 0000000000020000
svc: failed to register lockdv1 RPC service (errno 110).
lockd_up: makesock failed, error=-110

2019-09-26 19:51:59

by J. Bruce Fields

[permalink] [raw]

Subject: Re: 5.3.0 Regression: rpc.nfsd v4 uninterruptible sleep for 5+ minutes w/o rpc-statd/etc

On Thu, Sep 19, 2019 at 09:17:03AM -0400, James Harvey wrote:
> On Thu, Sep 19, 2019 at 9:00 AM James Harvey <[email protected]> wrote:
> >
> > For a really long time (years?) if you forced NFS v4 only, you could
> > mask a lot of unnecessary services.
> >
> > In /etc/nfs.conf, in "[nfsd] I've been able to set "vers3=n", and then
> > mask the following services:
> > * gssproxy
> > * nfs-blkmap
> > * rpc-statd
> > * rpcbind (service & socket)
> >
> > Upgrading from 5.2.14 to 5.3.0, nfs-server.service (rpc.nfsd) has
> > exactly a 5 minute delay, and sometimes longer.
> >
> > ...
>
> P.S. During a few of the earlier boots where I was seeing strace show
> the delay was reading "/proc/fs/nfsd/versions", I do have this in
> journalctl, when I tried in another term cat'ing it during the
> rpc.nfsd delay: (it repeats a few times during the 5 minute delay)

It's waiting on the nfsd_mutex.

So somebody else is sitting around blocking for five minutes while
holding nfsd_mutex. That lock covers a lot of stuff so that may not
narrow down much who's blocking.

An alt-sysrq-t dump might help.

(So, hit alt-sysrq-t, or echo t to /proc/sysrq-trigger, then look in
dmesg for tasks that are waiting.)

There's not a whole ton of changes between 5.2 and 5.3, but I'm not
seeing an obvious culprit.

--b.

>
> INFO: task cat:3400 blocked for more than 122 seconds.
> Not tainted 5.3.0-arch1-1-ARCH #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> cat D 0 3400 3029 0x00000080
> Call Trace:
> ? __schedule+0x27f/0x6d0
> schedule+0x43/0xd0
> schedule_preempt_disabled+0x14/0x20
> __mutex_lock.isra.0+0x27d/0x530
> write_versions+0x38/0x3c0 [nfsd]
> ? _copy_from_user+0x37/0x60
> ? write_ports+0x320/0x320 [nfsd]
> nfsctl_transaction_write+0x45/0x70 [nfsd]
> nfsctl_transaction_read+0x3b/0x60 [nfsd]
> vfs_read+0x9d/0x150
> ksys_read+0x67/0xe0
> do_syscall_64+0x5f/0x1c0
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f47286fa155
> Code: Bad RIP value.
> RSP: 002b:00007ffcd2a74438 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f47286fa155
> RDX: 0000000000020000 RSI: 00007f47282ba000 RDI: 0000000000000003
> RBP: 00007f47282ba000 R08: 00007f47282b9010 R09: 0000000000000000
> R10: 0000000000000022 R11: 0000000000000246 R12: 00007f47282ba000
> R13: 0000000000000003 R14: 0000000000000fff R15: 0000000000020000
> svc: failed to register lockdv1 RPC service (errno 110).
> lockd_up: makesock failed, error=-110

2019-10-01 18:24:10

by Benjamin Coddington

[permalink] [raw]

Subject: Re: 5.3.0 Regression: rpc.nfsd v4 uninterruptible sleep for 5+ minutes w/o rpc-statd/etc

On 19 Sep 2019, at 9:00, James Harvey wrote:

> For a really long time (years?) if you forced NFS v4 only, you could
> mask a lot of unnecessary services.
>
> In /etc/nfs.conf, in "[nfsd] I've been able to set "vers3=n", and then
> mask the following services:
> * gssproxy
> * nfs-blkmap
> * rpc-statd
> * rpcbind (service & socket)
>
> Upgrading from 5.2.14 to 5.3.0, nfs-server.service (rpc.nfsd) has
> exactly a 5 minute delay, and sometimes longer.

A bisect ends on:
4f8943f80883 SUNRPC: Replace direct task wakeups from softirq context

That commit changed the way we pull the error from the socket, previously
we'd wake the task with whatever error is in sk_err from xs_error_report(),
but now we use SO_ERROR - but that's only after possibly running through
xs_wake_disconnect which forces a closure which can change sk_err.

So, I think xs_error_report sees ECONNREFUSED, but we wake tasks with
ENOTCONN, and the client machine spins us back around again to reconnect, we
do this until things time out.

I'll send a patch to revert to the previous behavior of waking tasks with
the error as it was in xs_error_report by copying it over to the sock_xprt
struct and waking the tasks with that value.

There's another subtle change here besides that race: SO_ERROR can return
the socket's soft error, not just what's in sk_err. That can be fun things
like EINVAL if routing lookups fail..

Ben