2014-10-20 17:36:59

by Christoph Hellwig

[permalink] [raw]
Subject: xfstests generic/075 failure on recent Linus' tree

Running 4.1 against a same host server backed off XFS I run into
a hang in generic/075:

generic/075 18s ...[ 408.796877] nfs: server 127.0.0.1 not responding, still trying
[ 408.799131] nfs: server 127.0.0.1 not responding, still trying
[ 408.801357] nfs: server 127.0.0.1 not responding, still trying
[ 443.676971] nfs: server 127.0.0.1 not responding, timed out
[ 623.837009] nfs: server 127.0.0.1 not responding, timed out
[ 628.716855] nfs: server 127.0.0.1 not responding, timed out
[ 803.996883] nfs: server 127.0.0.1 not responding, timed out
[ 813.783542] nfs: server 127.0.0.1 not responding, timed out
[ 984.156873] nfs: server 127.0.0.1 not responding, timed out
[ 998.876901] nfs: server 127.0.0.1 not responding, timed out



2014-10-27 19:00:41

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, 24 Oct 2014 13:08:28 -0400
Jeff Layton <[email protected]> wrote:

> On Fri, 24 Oct 2014 18:59:47 +0300
> Trond Myklebust <[email protected]> wrote:
>
> > On Fri, Oct 24, 2014 at 5:57 PM, Jeff Layton
> > <[email protected]> wrote:
> > >> @@ -1228,6 +1231,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> > >> dropit:
> > >> svc_authorise(rqstp); /* doesn't hurt to call this twice */
> > >> dprintk("svc: svc_process dropit\n");
> > >
> > > I don't think this will fix it either. I turned the above dprintk into
> > > a normal printk and it never fired during the test. As best I can tell,
> > > svc_process_common is not returning 0 when this occurs.
> >
> > OK. Is perhaps the "revisit canceled" triggering in svc_revisit()? I'm
> > having trouble understanding the call chain for that stuff, but it too
> > looks as if it can trigger some strange behaviour.
> >
>
> I don't think that's it either. I turned the dprintks in svc_revisit
> into a printks just to be sure, and they didn't fire either.
>
> Basically, I don't think we ever do anything in svc_defer for v4.1
> requests, due to this at the top of it:
>
> if (rqstp->rq_arg.page_len || !rqstp->rq_usedeferral)
> return NULL; /* if more than a page, give up FIXME */
>
> ...basically rq_usedeferral should be set in most cases for v4.1
> requests. It gets set when processing the compound and then unset
> afterward.
>
> That said, I suppose you could end up deferring the request if it
> occurs before the pc_func gets called, but I haven't seen any evidence
> of that happening so far with this test.
>
> I do concur with Christoph that I've only been able to reproduce this
> while running on the loopback interface. If I have server and client in
> different VMs, then this test runs just fine. Could this be related to
> the changes that Neil sent in recently to make loopback mounts work
> better?
>
> One idea might be reasonable to backport 2aca5b869ace67 to something
> v3.17-ish and see whether it's still reproducible?
>

Ok, I've made a bit more progress on this, mostly by adding a fair
number of tracepoints into the client and server request dispatching
code. I'll be posting patches to add those for eventually, but they
need a little cleanup first.

In any case, here's a typical request as it's supposed to work:

mount.nfs-906 [002] ...1 22711.996969: xprt_transmit: xprt=0xffff8800ce961000 xid=0xa8a34513 status=0
nfsd-678 [000] ...1 22711.997082: svc_recv: rq_xid=0xa8a34513 status=164
nfsd-678 [000] ..s8 22711.997185: xprt_lookup_rqst: xprt=0xffff8800ce961000 xid=0xa8a34513 status=0
nfsd-678 [000] ..s8 22711.997186: xprt_complete_rqst: xprt=0xffff8800ce961000 xid=0xa8a34513 status=140
nfsd-678 [000] ...1 22711.997236: svc_send: rq_xid=0xa8a34513 dropme=0 status=144
nfsd-678 [000] ...1 22711.997236: svc_process: rq_xid=0xa8a34513 dropme=0 status=144

...and here's what we see when things start hanging:

kworker/2:2-107 [002] ...1 22741.696070: xprt_transmit: xprt=0xffff8800ce961000 xid=0xc3a84513 status=0
nfsd-678 [002] .N.1 22741.696917: svc_recv: rq_xid=0xc3a84513 status=208
nfsd-678 [002] ...1 22741.699890: svc_send: rq_xid=0xc3a84513 dropme=0 status=262252
nfsd-678 [002] ...1 22741.699891: svc_process: rq_xid=0xc3a84513 dropme=0 status=262252

What's happening is that xprt_transmit is called to send the request
to the server. The server then gets it, and immediately processes it
(in less than a second) and sends the reply.

The problem is that at some point, we stop getting sk_data_ready
events on the socket, even when the server has clearly sent something
to the client. I'm unclear on why that is, but I suspect that something
is broken in the lower level socket code, possibly just in the loopback
code?

At this point, I think I'm pretty close to ready to call in the netdev
cavalry. I'll write up a synopsis of what we know so far, and send it
there (and cc it here) in the hopes that someone will have some insight.

--
Jeff Layton <[email protected]>

2014-10-24 13:42:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, Oct 24, 2014 at 01:08:28PM +0300, Trond Myklebust wrote:
> Both RFC3530 and RFC5661 impose a requirement on the server that it MUST NOT
> drop a request unless the connection is broken.

I can still reproduce the issue with this patch applied.

2014-10-24 08:14:42

by Trond Myklebust

[permalink] [raw]
Subject: Re: xfstests generic/075 failure on recent Linus' tree

On Wed, Oct 22, 2014 at 4:08 PM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Oct 22, 2014 at 03:00:27PM +0300, Trond Myklebust wrote:
>> Does the NFS client show a TCP connection to port 2049 on 127.0.0.1?
>
> From netstat -a
>
> tcp 0 262352 localhost:nfs localhost:684 ESTABLISHED
> tcp 0 0 localhost:684 localhost:nfs ESTABLISHED
>
>
> Note that about 1/4 to 1/3 of the hangs show a backtrace like:
>
> [ 480.293522] INFO: task fsx:14631 blocked for more than 120 seconds.
> [ 480.296181] Not tainted 3.18.0-rc1+ #1519
> [ 480.299073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 480.304028] fsx D ffffffff81dcbd90 0 14631 14430 0x00000004
> [ 480.307132] ffff88007a457b08 0000000000000046 ffff880072db0b50 0000000000013dc0
> [ 480.310401] ffff88007a457fd8 0000000000013dc0 ffff88007d524310 ffff880072db0b50
> [ 480.312772] 0000000000000000 0000000000000002 0000000000000001 0000000000000001
> [ 480.315200] Call Trace:
> [ 480.315946] [<ffffffff81dcbd90>] ? bit_wait_timeout+0x60/0x60
> [ 480.317358] [<ffffffff810f906a>] ? mark_held_locks+0x6a/0x90
> [ 480.318818] [<ffffffff8111ef25>] ? ktime_get+0x105/0x140
> [ 480.320167] [<ffffffff810830af>] ? kvm_clock_read+0x1f/0x30
> [ 480.321537] [<ffffffff810830c9>] ? kvm_clock_get_cycles+0x9/0x10
> [ 480.322871] [<ffffffff8111eec5>] ? ktime_get+0xa5/0x140
> [ 480.324360] [<ffffffff811405ee>] ? __delayacct_blkio_start+0x1e/0x30
> [ 480.325829] [<ffffffff81dcbd90>] ? bit_wait_timeout+0x60/0x60
> [ 480.327252] [<ffffffff81dcb7a4>] schedule+0x24/0x70
> [ 480.328471] [<ffffffff81dcb87a>] io_schedule+0x8a/0xd0
> [ 480.329683] [<ffffffff81dcbdb6>] bit_wait_io+0x26/0x40
> [ 480.330902] [<ffffffff81dcbe7e>] __wait_on_bit_lock+0x6e/0xb0
> [ 480.332189] [<ffffffff81178de2>] ? find_get_entries+0x22/0x160
> [ 480.336273] [<ffffffff8117653c>] ? find_get_entry+0x8c/0xc0
> [ 480.337719] [<ffffffff811764b0>] ? find_get_pages_contig+0x1a0/0x1a0
> [ 480.339280] [<ffffffff81176755>] __lock_page+0x95/0xa0
> [ 480.340518] [<ffffffff810ee160>] ? wake_atomic_t_function+0x30/0x30
> [ 480.342066] [<ffffffff81184e66>] truncate_inode_pages_range+0x3c6/0x710
> [ 480.343853] [<ffffffff81185230>] truncate_inode_pages+0x10/0x20
> [ 480.345306] [<ffffffff81185286>] truncate_pagecache+0x46/0x70
> [ 480.346481] [<ffffffff81345e5e>] nfs_setattr_update_inode+0x9e/0x120
> [ 480.348372] [<ffffffff813682f8>] nfs4_proc_setattr+0xb8/0x100
> [ 480.349751] [<ffffffff813477b6>] nfs_setattr+0xd6/0x1d0
> [ 480.350741] [<ffffffff811deb10>] notify_change+0x160/0x3c0
> [ 480.351748] [<ffffffff81200afb>] ? fsnotify+0x7b/0x310
> [ 480.353260] [<ffffffff811c0671>] do_truncate+0x61/0xa0
> [ 480.354829] [<ffffffff811c09f4>] do_sys_ftruncate.constprop.16+0x104/0x160
> [ 480.356754] [<ffffffff8178760e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 480.358561] [<ffffffff811c0a79>] SyS_ftruncate+0x9/0x10
> [ 480.360054] [<ffffffff81dd0fe9>] system_call_fastpath+0x12/0x17
> [ 480.361744] 2 locks held by fsx/14631:
> [ 480.362844] #0: (sb_writers#9){.+.+.+}, at: [<ffffffff811c09bf>] do_sys_ftruncate.constprop.16+0xcf/0x160
> [ 480.366248] #1: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff811c0663>] do_truncate+0x53/0xa0
>
OK. So If this is NFSv4.1, and the connection is between the client
and server is still established, then I suspect the problem is with
knfsd dropping a request. According to the rules in RFC3530 and
RFC5661, it isn't allowed to do that unless the connection is broken.

Jeff, could you please take a look?

Thanks
Trond
--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-10-24 14:58:09

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, 24 Oct 2014 17:30:55 +0300
Trond Myklebust <[email protected]> wrote:

> On Fri, 2014-10-24 at 09:34 -0400, Jeff Layton wrote:
> > On Fri, 24 Oct 2014 07:26:44 -0400
> > Jeff Layton <[email protected]> wrote:
> >
> > > I guess the theory here is that the client sent the request, and got a
> > > TCP ACK and then the server never responded because it dropped the
> > > request? That sounds plausible and we almost certainly want something
> > > like the above for v4.0.
> > >
> > > If the above patch fixes Christoph's problem with v4.1 though, then I
> > > think we'll need more investigation into why. I don't see a way to get
> > > a dropped request with v4.1. Those mainly occur due to interaction with
> > > the DRC (disabled for v4.1) or with svc_defer (disabled for all v4
> > > requests).
> > >
> > > Is it possible that one of the nfsd threads is just hung while
> > > processing an RPC and that's why you're not getting a response?
> > >
> >
> > Ok, I'm able to reproduce this too and got a network trace. I think
> > Trond is correct as to the cause:
> >
> > I see a SETATTR call go out and then see a TCP ACK for it a little
> > later, but didn't see a reply sent for it.
> >
> > I sprinkled in some printks around the "dropit" codepath above and in
> > some other areas however and none of them fired. So while Trond's patch
> > looks like it might be correct, I don't think it'll fix the problem I'm
> > seeing.
> >
> > Sniffing a little later, it seems like the server replies are being
> > delayed somehow:
>
> You can get deferrals in the authentication code which trigger drops.
> Here is a variant which tries to deal with that:
>
> 8<-----------------------------------------------------
> From a9be1292152af2cb651415c44fd3d91b3d017240 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Fri, 24 Oct 2014 12:54:47 +0300
> Subject: [PATCH v2] nfsd: Ensure that NFSv4 always drops the connection when
> dropping a request
>
> Both RFC3530 and RFC5661 impose a requirement on the server that it MUST NOT
> drop a request unless the connection is broken. This is in order to allow
> clients to assume that they can always expect an answer unless the
> connection is broken.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> Cc: [email protected]
> ---
> v2: Also deal with NFSv4 requests dropped due to authentication issues
>
> fs/nfsd/nfs4proc.c | 6 ++++++
> include/linux/sunrpc/svc.h | 1 +
> net/sunrpc/svc.c | 13 +++++++++----
> 3 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index cdeb3cfd6f32..500ac76662a8 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1960,6 +1960,11 @@ static const char *nfsd4_op_name(unsigned opnum)
> return "unknown_operation";
> }
>
> +static void nfsd4_dropme(struct svc_rqst *rqstp)
> +{
> + svc_close_xprt(rqstp->rq_xprt);
> +}
> +
> #define nfsd4_voidres nfsd4_voidargs
> struct nfsd4_voidargs { int dummy; };
>
> @@ -1989,6 +1994,7 @@ struct svc_version nfsd_version4 = {
> .vs_nproc = 2,
> .vs_proc = nfsd_procedures4,
> .vs_dispatch = nfsd_dispatch,
> + .vs_dropme = nfsd4_dropme,
> .vs_xdrsize = NFS4_SVC_XDRSIZE,
> .vs_rpcb_optnl = 1,
> };
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 21678464883a..824656da1f6d 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -400,6 +400,7 @@ struct svc_version {
> * vs_dispatch == NULL means use default dispatcher.
> */
> int (*vs_dispatch)(struct svc_rqst *, __be32 *);
> + void (*vs_dropme)(struct svc_rqst *);
> };
>
> /*
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ca8a7958f4e6..a178496ce2c1 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -1111,9 +1111,13 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> rqstp->rq_vers = vers = svc_getnl(argv); /* version number */
> rqstp->rq_proc = proc = svc_getnl(argv); /* procedure number */
>
> - for (progp = serv->sv_program; progp; progp = progp->pg_next)
> - if (prog == progp->pg_prog)
> + for (progp = serv->sv_program; progp; progp = progp->pg_next) {
> + if (prog == progp->pg_prog) {
> + if (vers < progp->pg_nvers)
> + versp = progp->pg_vers[vers];
> break;
> + }
> + }
>
> /*
> * Decode auth data, and add verifier to reply buffer.
> @@ -1148,8 +1152,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> if (progp == NULL)
> goto err_bad_prog;
>
> - if (vers >= progp->pg_nvers ||
> - !(versp = progp->pg_vers[vers]))
> + if (versp == NULL)
> goto err_bad_vers;
>
> procp = versp->vs_proc + proc;
> @@ -1228,6 +1231,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> dropit:
> svc_authorise(rqstp); /* doesn't hurt to call this twice */
> dprintk("svc: svc_process dropit\n");

I don't think this will fix it either. I turned the above dprintk into
a normal printk and it never fired during the test. As best I can tell,
svc_process_common is not returning 0 when this occurs.

> + if (versp && versp->vs_dropme)
> + versp->vs_dropme(rqstp);
> return 0;
>
> err_short_len:


--
Jeff Layton <[email protected]>

2014-10-22 13:08:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: xfstests generic/075 failure on recent Linus' tree

On Wed, Oct 22, 2014 at 03:00:27PM +0300, Trond Myklebust wrote:
> Does the NFS client show a TCP connection to port 2049 on 127.0.0.1?

>From netstat -a

tcp 0 262352 localhost:nfs localhost:684 ESTABLISHED
tcp 0 0 localhost:684 localhost:nfs ESTABLISHED


Note that about 1/4 to 1/3 of the hangs show a backtrace like:

[ 480.293522] INFO: task fsx:14631 blocked for more than 120 seconds.
[ 480.296181] Not tainted 3.18.0-rc1+ #1519
[ 480.299073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 480.304028] fsx D ffffffff81dcbd90 0 14631 14430 0x00000004
[ 480.307132] ffff88007a457b08 0000000000000046 ffff880072db0b50 0000000000013dc0
[ 480.310401] ffff88007a457fd8 0000000000013dc0 ffff88007d524310 ffff880072db0b50
[ 480.312772] 0000000000000000 0000000000000002 0000000000000001 0000000000000001
[ 480.315200] Call Trace:
[ 480.315946] [<ffffffff81dcbd90>] ? bit_wait_timeout+0x60/0x60
[ 480.317358] [<ffffffff810f906a>] ? mark_held_locks+0x6a/0x90
[ 480.318818] [<ffffffff8111ef25>] ? ktime_get+0x105/0x140
[ 480.320167] [<ffffffff810830af>] ? kvm_clock_read+0x1f/0x30
[ 480.321537] [<ffffffff810830c9>] ? kvm_clock_get_cycles+0x9/0x10
[ 480.322871] [<ffffffff8111eec5>] ? ktime_get+0xa5/0x140
[ 480.324360] [<ffffffff811405ee>] ? __delayacct_blkio_start+0x1e/0x30
[ 480.325829] [<ffffffff81dcbd90>] ? bit_wait_timeout+0x60/0x60
[ 480.327252] [<ffffffff81dcb7a4>] schedule+0x24/0x70
[ 480.328471] [<ffffffff81dcb87a>] io_schedule+0x8a/0xd0
[ 480.329683] [<ffffffff81dcbdb6>] bit_wait_io+0x26/0x40
[ 480.330902] [<ffffffff81dcbe7e>] __wait_on_bit_lock+0x6e/0xb0
[ 480.332189] [<ffffffff81178de2>] ? find_get_entries+0x22/0x160
[ 480.336273] [<ffffffff8117653c>] ? find_get_entry+0x8c/0xc0
[ 480.337719] [<ffffffff811764b0>] ? find_get_pages_contig+0x1a0/0x1a0
[ 480.339280] [<ffffffff81176755>] __lock_page+0x95/0xa0
[ 480.340518] [<ffffffff810ee160>] ? wake_atomic_t_function+0x30/0x30
[ 480.342066] [<ffffffff81184e66>] truncate_inode_pages_range+0x3c6/0x710
[ 480.343853] [<ffffffff81185230>] truncate_inode_pages+0x10/0x20
[ 480.345306] [<ffffffff81185286>] truncate_pagecache+0x46/0x70
[ 480.346481] [<ffffffff81345e5e>] nfs_setattr_update_inode+0x9e/0x120
[ 480.348372] [<ffffffff813682f8>] nfs4_proc_setattr+0xb8/0x100
[ 480.349751] [<ffffffff813477b6>] nfs_setattr+0xd6/0x1d0
[ 480.350741] [<ffffffff811deb10>] notify_change+0x160/0x3c0
[ 480.351748] [<ffffffff81200afb>] ? fsnotify+0x7b/0x310
[ 480.353260] [<ffffffff811c0671>] do_truncate+0x61/0xa0
[ 480.354829] [<ffffffff811c09f4>] do_sys_ftruncate.constprop.16+0x104/0x160
[ 480.356754] [<ffffffff8178760e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 480.358561] [<ffffffff811c0a79>] SyS_ftruncate+0x9/0x10
[ 480.360054] [<ffffffff81dd0fe9>] system_call_fastpath+0x12/0x17
[ 480.361744] 2 locks held by fsx/14631:
[ 480.362844] #0: (sb_writers#9){.+.+.+}, at: [<ffffffff811c09bf>] do_sys_ftruncate.constprop.16+0xcf/0x160
[ 480.366248] #1: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff811c0663>] do_truncate+0x53/0xa0


2014-10-24 14:31:00

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, 2014-10-24 at 09:34 -0400, Jeff Layton wrote:
> On Fri, 24 Oct 2014 07:26:44 -0400
> Jeff Layton <[email protected]> wrote:
>
> > I guess the theory here is that the client sent the request, and got a
> > TCP ACK and then the server never responded because it dropped the
> > request? That sounds plausible and we almost certainly want something
> > like the above for v4.0.
> >
> > If the above patch fixes Christoph's problem with v4.1 though, then I
> > think we'll need more investigation into why. I don't see a way to get
> > a dropped request with v4.1. Those mainly occur due to interaction with
> > the DRC (disabled for v4.1) or with svc_defer (disabled for all v4
> > requests).
> >
> > Is it possible that one of the nfsd threads is just hung while
> > processing an RPC and that's why you're not getting a response?
> >
>
> Ok, I'm able to reproduce this too and got a network trace. I think
> Trond is correct as to the cause:
>
> I see a SETATTR call go out and then see a TCP ACK for it a little
> later, but didn't see a reply sent for it.
>
> I sprinkled in some printks around the "dropit" codepath above and in
> some other areas however and none of them fired. So while Trond's patch
> looks like it might be correct, I don't think it'll fix the problem I'm
> seeing.
>
> Sniffing a little later, it seems like the server replies are being
> delayed somehow:

You can get deferrals in the authentication code which trigger drops.
Here is a variant which tries to deal with that:

8<-----------------------------------------------------
>From a9be1292152af2cb651415c44fd3d91b3d017240 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Fri, 24 Oct 2014 12:54:47 +0300
Subject: [PATCH v2] nfsd: Ensure that NFSv4 always drops the connection when
dropping a request

Both RFC3530 and RFC5661 impose a requirement on the server that it MUST NOT
drop a request unless the connection is broken. This is in order to allow
clients to assume that they can always expect an answer unless the
connection is broken.

Signed-off-by: Trond Myklebust <[email protected]>
Cc: [email protected]
---
v2: Also deal with NFSv4 requests dropped due to authentication issues

fs/nfsd/nfs4proc.c | 6 ++++++
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 13 +++++++++----
3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index cdeb3cfd6f32..500ac76662a8 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1960,6 +1960,11 @@ static const char *nfsd4_op_name(unsigned opnum)
return "unknown_operation";
}

+static void nfsd4_dropme(struct svc_rqst *rqstp)
+{
+ svc_close_xprt(rqstp->rq_xprt);
+}
+
#define nfsd4_voidres nfsd4_voidargs
struct nfsd4_voidargs { int dummy; };

@@ -1989,6 +1994,7 @@ struct svc_version nfsd_version4 = {
.vs_nproc = 2,
.vs_proc = nfsd_procedures4,
.vs_dispatch = nfsd_dispatch,
+ .vs_dropme = nfsd4_dropme,
.vs_xdrsize = NFS4_SVC_XDRSIZE,
.vs_rpcb_optnl = 1,
};
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 21678464883a..824656da1f6d 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -400,6 +400,7 @@ struct svc_version {
* vs_dispatch == NULL means use default dispatcher.
*/
int (*vs_dispatch)(struct svc_rqst *, __be32 *);
+ void (*vs_dropme)(struct svc_rqst *);
};

/*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ca8a7958f4e6..a178496ce2c1 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1111,9 +1111,13 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
rqstp->rq_vers = vers = svc_getnl(argv); /* version number */
rqstp->rq_proc = proc = svc_getnl(argv); /* procedure number */

- for (progp = serv->sv_program; progp; progp = progp->pg_next)
- if (prog == progp->pg_prog)
+ for (progp = serv->sv_program; progp; progp = progp->pg_next) {
+ if (prog == progp->pg_prog) {
+ if (vers < progp->pg_nvers)
+ versp = progp->pg_vers[vers];
break;
+ }
+ }

/*
* Decode auth data, and add verifier to reply buffer.
@@ -1148,8 +1152,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
if (progp == NULL)
goto err_bad_prog;

- if (vers >= progp->pg_nvers ||
- !(versp = progp->pg_vers[vers]))
+ if (versp == NULL)
goto err_bad_vers;

procp = versp->vs_proc + proc;
@@ -1228,6 +1231,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
dropit:
svc_authorise(rqstp); /* doesn't hurt to call this twice */
dprintk("svc: svc_process dropit\n");
+ if (versp && versp->vs_dropme)
+ versp->vs_dropme(rqstp);
return 0;

err_short_len:
--
1.9.3





2014-10-22 12:00:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: xfstests generic/075 failure on recent Linus' tree

On Wed, Oct 22, 2014 at 2:46 PM, Christoph Hellwig <[email protected]> wrote:
> On Mon, Oct 20, 2014 at 10:36:58AM -0700, Christoph Hellwig wrote:
>> Running 4.1 against a same host server backed off XFS I run into
>> a hang in generic/075:
>
> I've bisected this down to:
>
> commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e
> Author: Trond Myklebust <[email protected]>
> Date: Wed Sep 24 22:35:58 2014 -0400
>
> SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
>
>
> Note that I can only reproduce it when running locally, and only when
> not using pNFS (block layout).

Does the NFS client show a TCP connection to port 2049 on 127.0.0.1?

Cheers
Trond

2014-10-24 11:26:47

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, 24 Oct 2014 13:08:28 +0300
Trond Myklebust <[email protected]> wrote:

> Both RFC3530 and RFC5661 impose a requirement on the server that it MUST NOT
> drop a request unless the connection is broken.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfsd/nfs4proc.c | 6 ++++++
> include/linux/sunrpc/svc.h | 1 +
> net/sunrpc/svc.c | 4 ++++
> 3 files changed, 11 insertions(+)
>
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index cdeb3cfd6f32..500ac76662a8 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1960,6 +1960,11 @@ static const char *nfsd4_op_name(unsigned opnum)
> return "unknown_operation";
> }
>
> +static void nfsd4_dropme(struct svc_rqst *rqstp)
> +{
> + svc_close_xprt(rqstp->rq_xprt);
> +}
> +
> #define nfsd4_voidres nfsd4_voidargs
> struct nfsd4_voidargs { int dummy; };
>
> @@ -1989,6 +1994,7 @@ struct svc_version nfsd_version4 = {
> .vs_nproc = 2,
> .vs_proc = nfsd_procedures4,
> .vs_dispatch = nfsd_dispatch,
> + .vs_dropme = nfsd4_dropme,
> .vs_xdrsize = NFS4_SVC_XDRSIZE,
> .vs_rpcb_optnl = 1,
> };
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 21678464883a..824656da1f6d 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -400,6 +400,7 @@ struct svc_version {
> * vs_dispatch == NULL means use default dispatcher.
> */
> int (*vs_dispatch)(struct svc_rqst *, __be32 *);
> + void (*vs_dropme)(struct svc_rqst *);
> };
>
> /*
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ca8a7958f4e6..09136abfef26 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -1071,6 +1071,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> struct svc_version *versp = NULL; /* compiler food */
> struct svc_procedure *procp = NULL;
> struct svc_serv *serv = rqstp->rq_server;
> + void (*dropme)(struct svc_rqst *rqstp) = NULL;
> kxdrproc_t xdr;
> __be32 *statp;
> u32 prog, vers, proc;
> @@ -1151,6 +1152,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> if (vers >= progp->pg_nvers ||
> !(versp = progp->pg_vers[vers]))
> goto err_bad_vers;
> + dropme = versp->vs_dropme;
>
> procp = versp->vs_proc + proc;
> if (proc >= versp->vs_nproc || !procp->pc_func)
> @@ -1228,6 +1230,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> dropit:
> svc_authorise(rqstp); /* doesn't hurt to call this twice */
> dprintk("svc: svc_process dropit\n");
> + if (dropme != NULL)
> + dropme(rqstp);
> return 0;
>
> err_short_len:

(cc'ing Bruce too...)

I guess the theory here is that the client sent the request, and got a
TCP ACK and then the server never responded because it dropped the
request? That sounds plausible and we almost certainly want something
like the above for v4.0.

If the above patch fixes Christoph's problem with v4.1 though, then I
think we'll need more investigation into why. I don't see a way to get
a dropped request with v4.1. Those mainly occur due to interaction with
the DRC (disabled for v4.1) or with svc_defer (disabled for all v4
requests).

Is it possible that one of the nfsd threads is just hung while
processing an RPC and that's why you're not getting a response?

--
Jeff Layton <[email protected]>

2014-10-24 13:35:04

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, 24 Oct 2014 07:26:44 -0400
Jeff Layton <[email protected]> wrote:

> On Fri, 24 Oct 2014 13:08:28 +0300
> Trond Myklebust <[email protected]> wrote:
>
> > Both RFC3530 and RFC5661 impose a requirement on the server that it MUST NOT
> > drop a request unless the connection is broken.
> >
> > Signed-off-by: Trond Myklebust <[email protected]>
> > ---
> > fs/nfsd/nfs4proc.c | 6 ++++++
> > include/linux/sunrpc/svc.h | 1 +
> > net/sunrpc/svc.c | 4 ++++
> > 3 files changed, 11 insertions(+)
> >
> > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > index cdeb3cfd6f32..500ac76662a8 100644
> > --- a/fs/nfsd/nfs4proc.c
> > +++ b/fs/nfsd/nfs4proc.c
> > @@ -1960,6 +1960,11 @@ static const char *nfsd4_op_name(unsigned opnum)
> > return "unknown_operation";
> > }
> >
> > +static void nfsd4_dropme(struct svc_rqst *rqstp)
> > +{
> > + svc_close_xprt(rqstp->rq_xprt);
> > +}
> > +
> > #define nfsd4_voidres nfsd4_voidargs
> > struct nfsd4_voidargs { int dummy; };
> >
> > @@ -1989,6 +1994,7 @@ struct svc_version nfsd_version4 = {
> > .vs_nproc = 2,
> > .vs_proc = nfsd_procedures4,
> > .vs_dispatch = nfsd_dispatch,
> > + .vs_dropme = nfsd4_dropme,
> > .vs_xdrsize = NFS4_SVC_XDRSIZE,
> > .vs_rpcb_optnl = 1,
> > };
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 21678464883a..824656da1f6d 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -400,6 +400,7 @@ struct svc_version {
> > * vs_dispatch == NULL means use default dispatcher.
> > */
> > int (*vs_dispatch)(struct svc_rqst *, __be32 *);
> > + void (*vs_dropme)(struct svc_rqst *);
> > };
> >
> > /*
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index ca8a7958f4e6..09136abfef26 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -1071,6 +1071,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> > struct svc_version *versp = NULL; /* compiler food */
> > struct svc_procedure *procp = NULL;
> > struct svc_serv *serv = rqstp->rq_server;
> > + void (*dropme)(struct svc_rqst *rqstp) = NULL;
> > kxdrproc_t xdr;
> > __be32 *statp;
> > u32 prog, vers, proc;
> > @@ -1151,6 +1152,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> > if (vers >= progp->pg_nvers ||
> > !(versp = progp->pg_vers[vers]))
> > goto err_bad_vers;
> > + dropme = versp->vs_dropme;
> >
> > procp = versp->vs_proc + proc;
> > if (proc >= versp->vs_nproc || !procp->pc_func)
> > @@ -1228,6 +1230,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> > dropit:
> > svc_authorise(rqstp); /* doesn't hurt to call this twice */
> > dprintk("svc: svc_process dropit\n");
> > + if (dropme != NULL)
> > + dropme(rqstp);
> > return 0;
> >
> > err_short_len:
>
> (cc'ing Bruce too...)
>
> I guess the theory here is that the client sent the request, and got a
> TCP ACK and then the server never responded because it dropped the
> request? That sounds plausible and we almost certainly want something
> like the above for v4.0.
>
> If the above patch fixes Christoph's problem with v4.1 though, then I
> think we'll need more investigation into why. I don't see a way to get
> a dropped request with v4.1. Those mainly occur due to interaction with
> the DRC (disabled for v4.1) or with svc_defer (disabled for all v4
> requests).
>
> Is it possible that one of the nfsd threads is just hung while
> processing an RPC and that's why you're not getting a response?
>

Ok, I'm able to reproduce this too and got a network trace. I think
Trond is correct as to the cause:

I see a SETATTR call go out and then see a TCP ACK for it a little
later, but didn't see a reply sent for it.

I sprinkled in some printks around the "dropit" codepath above and in
some other areas however and none of them fired. So while Trond's patch
looks like it might be correct, I don't think it'll fix the problem I'm
seeing.

Sniffing a little later, it seems like the server replies are being
delayed somehow:

362 450.126141 ::1 -> ::1 NFS 234 V4 Call SEQUENCE
363 450.126211 ::1 -> ::1 NFS 170 V4 Reply (Call In 349) SEQUENCE
364 450.126253 ::1 -> ::1 TCP 98 [TCP Dup ACK 362#1] 826→2049 [ACK] Seq=5577 Ack=1 Win=23025 Len=0 TSval=546384 TSecr=4294762921 SLE=131289 SRE=135581
365 450.126302 ::1 -> ::1 TCP 65550 [TCP Out-Of-Order] 2049→826 [ACK] Seq=1 Ack=5577 Win=16001 Len=65464 TSval=546384 TSecr=546384
366 450.126320 ::1 -> ::1 TCP 266 [TCP Out-Of-Order] 2049→826 [PSH, ACK] Seq=65465 Ack=5577 Win=16001 Len=180 TSval=546384 TSecr=546384[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]
367 450.327131 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5577 Win=16001 Len=65464 TSval=546585 TSecr=546384
368 450.729997 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5577 Win=16001 Len=65464 TSval=546988 TSecr=546384
369 451.536010 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5577 Win=16001 Len=65464 TSval=547794 TSecr=546384
370 453.149981 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5577 Win=16001 Len=65464 TSval=549408 TSecr=546384
371 455.150167 ::1 -> ::1 NFS 234 V4 Call SEQUENCE
372 455.150235 ::1 -> ::1 NFS 170 V4 Reply (Call In 362) SEQUENCE
373 455.150279 ::1 -> ::1 TCP 98 [TCP Dup ACK 371#1] 826→2049 [ACK] Seq=5713 Ack=1 Win=23025 Len=0 TSval=551408 TSecr=4294762921 SLE=131289 SRE=135665
374 455.350978 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5713 Win=16001 Len=65464 TSval=551609 TSecr=551408
375 455.753946 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5713 Win=16001 Len=65464 TSval=552012 TSecr=551408
376 456.559960 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5713 Win=16001 Len=65464 TSval=552818 TSecr=551408
377 458.173955 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5713 Win=16001 Len=65464 TSval=554432 TSecr=551408
378 461.398011 ::1 -> ::1 TCP 65550 [TCP Retransmission] 2049→826 [ACK] Seq=1 Ack=5713 Win=16001 Len=65464 TSval=557656 TSecr=551408

Note that the reply in 372 is a reply to the call in 362 even though it
was sent just after the *next* SEQUENCE call came in. There are also a
bunch of TCP retransmissions. Later SEQUENCE replies follow a similar
pattern.

Looking at /proc/meminfo, it looks like I have plenty of free memory so
I don't think this is a memory exhaustion situation. All of the nfsds
seem idle (I have 64 running) and the CPUs aren't especially busy.

It "smells" sort of like something is wrong in the lower networking
layers. That said, I'm not quite ready to point fingers there yet. I'll
do a bit more debugging and see what I can find.

Stay tuned...
--
Jeff Layton <[email protected]>

2014-10-22 11:46:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: xfstests generic/075 failure on recent Linus' tree

On Mon, Oct 20, 2014 at 10:36:58AM -0700, Christoph Hellwig wrote:
> Running 4.1 against a same host server backed off XFS I run into
> a hang in generic/075:

I've bisected this down to:

commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e
Author: Trond Myklebust <[email protected]>
Date: Wed Sep 24 22:35:58 2014 -0400

SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT


Note that I can only reproduce it when running locally, and only when
not using pNFS (block layout).

2014-10-21 13:12:59

by Boaz Harrosh

[permalink] [raw]
Subject: Re: xfstests generic/075 failure on recent Linus' tree

On 10/20/2014 08:36 PM, Christoph Hellwig wrote:
> Running 4.1 against a same host server backed off XFS I run into
> a hang in generic/075:
>
> generic/075 18s ...[ 408.796877] nfs: server 127.0.0.1 not responding, still trying
> [ 408.799131] nfs: server 127.0.0.1 not responding, still trying
> [ 408.801357] nfs: server 127.0.0.1 not responding, still trying
> [ 443.676971] nfs: server 127.0.0.1 not responding, timed out
> [ 623.837009] nfs: server 127.0.0.1 not responding, timed out
> [ 628.716855] nfs: server 127.0.0.1 not responding, timed out
> [ 803.996883] nfs: server 127.0.0.1 not responding, timed out
> [ 813.783542] nfs: server 127.0.0.1 not responding, timed out
> [ 984.156873] nfs: server 127.0.0.1 not responding, timed out
> [ 998.876901] nfs: server 127.0.0.1 not responding, timed out
>
Sir Christoph

I'm not exactly sure about generic/075, I see it uses fsx. Might it be
a very heavy load of ltp/fsx ?

I'm asking because down the years I always had problems with localhost
(127.0.0.1 ?) mount, with oom conditions, specially when using a UML
or a tiny VM.

I know that for a long time it was not suppose to be tested, and
advertised as not supported. The explanation was because of the
way writeback works stacked FSs, and the upper layer FS consuming all
the allowed memory, not leaving any room for the lower layer FS to
do any writeback.

With the new writeback per BDI the situation was much much better
last I tried, but specially with NFS's lazy release of pages, I
still got a live lock in a tiny UML.
(So I'm not sure what is the official stand right now about local
mount, am curious to know?)

But surly you should know all this, just that can you reproduce
the same problem with two VMs? It might shed some light?
(Also did you try to force the server to a sync IO, I wish NFSD
could do true directIO from network buffers, but ...)

[Funny I'm having my own fights with fsx right now ;-)]

Thanks
Boaz


2014-10-24 15:59:49

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, Oct 24, 2014 at 5:57 PM, Jeff Layton
<[email protected]> wrote:
>> @@ -1228,6 +1231,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
>> dropit:
>> svc_authorise(rqstp); /* doesn't hurt to call this twice */
>> dprintk("svc: svc_process dropit\n");
>
> I don't think this will fix it either. I turned the above dprintk into
> a normal printk and it never fired during the test. As best I can tell,
> svc_process_common is not returning 0 when this occurs.

OK. Is perhaps the "revisit canceled" triggering in svc_revisit()? I'm
having trouble understanding the call chain for that stuff, but it too
looks as if it can trigger some strange behaviour.

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-10-24 17:08:31

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

On Fri, 24 Oct 2014 18:59:47 +0300
Trond Myklebust <[email protected]> wrote:

> On Fri, Oct 24, 2014 at 5:57 PM, Jeff Layton
> <[email protected]> wrote:
> >> @@ -1228,6 +1231,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
> >> dropit:
> >> svc_authorise(rqstp); /* doesn't hurt to call this twice */
> >> dprintk("svc: svc_process dropit\n");
> >
> > I don't think this will fix it either. I turned the above dprintk into
> > a normal printk and it never fired during the test. As best I can tell,
> > svc_process_common is not returning 0 when this occurs.
>
> OK. Is perhaps the "revisit canceled" triggering in svc_revisit()? I'm
> having trouble understanding the call chain for that stuff, but it too
> looks as if it can trigger some strange behaviour.
>

I don't think that's it either. I turned the dprintks in svc_revisit
into a printks just to be sure, and they didn't fire either.

Basically, I don't think we ever do anything in svc_defer for v4.1
requests, due to this at the top of it:

if (rqstp->rq_arg.page_len || !rqstp->rq_usedeferral)
return NULL; /* if more than a page, give up FIXME */

...basically rq_usedeferral should be set in most cases for v4.1
requests. It gets set when processing the compound and then unset
afterward.

That said, I suppose you could end up deferring the request if it
occurs before the pc_func gets called, but I haven't seen any evidence
of that happening so far with this test.

I do concur with Christoph that I've only been able to reproduce this
while running on the loopback interface. If I have server and client in
different VMs, then this test runs just fine. Could this be related to
the changes that Neil sent in recently to make loopback mounts work
better?

One idea might be reasonable to backport 2aca5b869ace67 to something
v3.17-ish and see whether it's still reproducible?

--
Jeff Layton <[email protected]>

2014-10-24 10:08:35

by Trond Myklebust

[permalink] [raw]
Subject: [PATCH] nfsd: Ensure that NFSv4 always drops the connection when dropping a request

Both RFC3530 and RFC5661 impose a requirement on the server that it MUST NOT
drop a request unless the connection is broken.

Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfsd/nfs4proc.c | 6 ++++++
include/linux/sunrpc/svc.h | 1 +
net/sunrpc/svc.c | 4 ++++
3 files changed, 11 insertions(+)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index cdeb3cfd6f32..500ac76662a8 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1960,6 +1960,11 @@ static const char *nfsd4_op_name(unsigned opnum)
return "unknown_operation";
}

+static void nfsd4_dropme(struct svc_rqst *rqstp)
+{
+ svc_close_xprt(rqstp->rq_xprt);
+}
+
#define nfsd4_voidres nfsd4_voidargs
struct nfsd4_voidargs { int dummy; };

@@ -1989,6 +1994,7 @@ struct svc_version nfsd_version4 = {
.vs_nproc = 2,
.vs_proc = nfsd_procedures4,
.vs_dispatch = nfsd_dispatch,
+ .vs_dropme = nfsd4_dropme,
.vs_xdrsize = NFS4_SVC_XDRSIZE,
.vs_rpcb_optnl = 1,
};
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 21678464883a..824656da1f6d 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -400,6 +400,7 @@ struct svc_version {
* vs_dispatch == NULL means use default dispatcher.
*/
int (*vs_dispatch)(struct svc_rqst *, __be32 *);
+ void (*vs_dropme)(struct svc_rqst *);
};

/*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ca8a7958f4e6..09136abfef26 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1071,6 +1071,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
struct svc_version *versp = NULL; /* compiler food */
struct svc_procedure *procp = NULL;
struct svc_serv *serv = rqstp->rq_server;
+ void (*dropme)(struct svc_rqst *rqstp) = NULL;
kxdrproc_t xdr;
__be32 *statp;
u32 prog, vers, proc;
@@ -1151,6 +1152,7 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
if (vers >= progp->pg_nvers ||
!(versp = progp->pg_vers[vers]))
goto err_bad_vers;
+ dropme = versp->vs_dropme;

procp = versp->vs_proc + proc;
if (proc >= versp->vs_nproc || !procp->pc_func)
@@ -1228,6 +1230,8 @@ svc_process_common(struct svc_rqst *rqstp, struct kvec *argv, struct kvec *resv)
dropit:
svc_authorise(rqstp); /* doesn't hurt to call this twice */
dprintk("svc: svc_process dropit\n");
+ if (dropme != NULL)
+ dropme(rqstp);
return 0;

err_short_len:
--
1.9.3


2015-01-16 13:39:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: xfstests generic/075 failure on recent Linus' tree

FYI, I'm still seeing this in current Linus' HEAD, not just on
a single machine, but also when using multiple VMs on the same host.
I can't actually test multiple physical hosts, so I wonder if anyone
else can still reproduce it, and if yes wit which setup.

On Mon, Oct 20, 2014 at 10:36:58AM -0700, Christoph Hellwig wrote:
> Running 4.1 against a same host server backed off XFS I run into
> a hang in generic/075:
>
> generic/075 18s ...[ 408.796877] nfs: server 127.0.0.1 not responding, still trying
> [ 408.799131] nfs: server 127.0.0.1 not responding, still trying
> [ 408.801357] nfs: server 127.0.0.1 not responding, still trying
> [ 443.676971] nfs: server 127.0.0.1 not responding, timed out
> [ 623.837009] nfs: server 127.0.0.1 not responding, timed out
> [ 628.716855] nfs: server 127.0.0.1 not responding, timed out
> [ 803.996883] nfs: server 127.0.0.1 not responding, timed out
> [ 813.783542] nfs: server 127.0.0.1 not responding, timed out
> [ 984.156873] nfs: server 127.0.0.1 not responding, timed out
> [ 998.876901] nfs: server 127.0.0.1 not responding, timed out
>
---end quoted text---