2015-02-10 15:43:06

by J. Bruce Fields

[permalink] [raw]
Subject: hang on xfstests generic/074

I finally got around to running xfstests as part of my regular testing
and ran across a reproduceable hang on generic/074:

[110040.300055] INFO: task fstest:22762 blocked for more than 120 seconds.
[110040.300571] Tainted: G W 3.19.0-rc4-00206-g53ea83c #16
[110040.301082] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[110040.301656] fstest D ffff88005246bc98 11320 22762 22761 0x00000000
[110040.302334] ffff88005246bc98 ffff88005246bc58 0000000000009000 ffff880057c4d7d0
[110040.303185] ffff88005246bfd8 ffff880052f84b90 ffff880057c4d7d0 0000000000000000
[110040.304041] 0000000000000000 0000000000000001 0000000000000000 ffff88005246bc58
[110040.305096] Call Trace:
[110040.305331] [<ffffffff810b0ed7>] ? prepare_to_wait+0x27/0x90
[110040.305770] [<ffffffffa000f492>] ? rpc_make_runnable+0xc2/0xd0 [sunrpc]
[110040.306274] [<ffffffff81a9550d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[110040.306799] [<ffffffff81a90910>] ? bit_wait+0x60/0x60
[110040.307258] [<ffffffff810bad1d>] ? trace_hardirqs_on_caller+0x15d/0x200
[110040.307798] [<ffffffff810badcd>] ? trace_hardirqs_on+0xd/0x10
[110040.308283] [<ffffffff81a90910>] ? bit_wait+0x60/0x60
[110040.308678] [<ffffffff81a8ffc9>] schedule+0x29/0x70
[110040.309054] [<ffffffff81a90295>] io_schedule+0x55/0x80
[110040.309510] [<ffffffff81a90944>] bit_wait_io+0x34/0x60
[110040.309909] [<ffffffff81a905b7>] __wait_on_bit+0x67/0x90
[110040.310313] [<ffffffff8114e72d>] ? find_get_pages_tag+0xd/0x210
[110040.310756] [<ffffffff8114d2c6>] wait_on_page_bit+0xb6/0xc0
[110040.311177] [<ffffffff810b1410>] ? autoremove_wake_function+0x40/0x40
[110040.311654] [<ffffffff8114d422>] filemap_fdatawait_range+0xf2/0x190
[110040.312131] [<ffffffff8114f062>] filemap_write_and_wait_range+0x42/0x70
[110040.312627] [<ffffffffa017866f>] nfs4_file_fsync+0x5f/0xb0 [nfsv4]
[110040.313090] [<ffffffffa0178905>] ? nfs4_do_check_delegation+0x5/0xc0 [nfsv4]
[110040.313616] [<ffffffff811d05a9>] vfs_fsync+0x29/0x40
[110040.313991] [<ffffffffa010fdea>] nfs_file_flush+0x7a/0xb0 [nfs]
[110040.314421] [<ffffffff8119b093>] filp_close+0x33/0x80
[110040.314801] [<ffffffff811bce42>] __close_fd+0x82/0xa0
[110040.315178] [<ffffffff8119b103>] SyS_close+0x23/0x50
[110040.315551] [<ffffffff81a960d2>] system_call_fastpath+0x12/0x17
[110040.316247] no locks held by fstest/22762.
[110063.328025] nfs: server f21-1 not responding, still trying ...

Is this anything known?

Client and server were running some version of my tree, 19-rc4 based.

--b.


2015-02-10 15:45:25

by J. Bruce Fields

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Tue, Feb 10, 2015 at 10:43:06AM -0500, bfields wrote:
> I finally got around to running xfstests as part of my regular testing
> and ran across a reproduceable hang on generic/074:

By the way, running ./check -nfs -g auto, I also see failures on:

generic/005 generic/017 generic/031 generic/032 generic/033 generic/035
generic/037 generic/053 generic/062 generic/068 generic/088 generic/089
generic/105 generic/126 generic/133 generic/184 generic/225 generic/277
generic/294 generic/306

and another 94 tests were skipped. I haven't looked into any of those
yet.

All of this is over 4.1.

--b.

2015-02-11 12:34:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Tue, Feb 10, 2015 at 10:43:06AM -0500, J. Bruce Fields wrote:
> I finally got around to running xfstests as part of my regular testing
> and ran across a reproduceable hang on generic/074:

Yes, I reported this about half a year ago. It was caused (or at least
unhidden) by commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e ("SUNRPC:
Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT"). Reverting
that commit fixes the issue for me.


2015-03-01 15:56:51

by Trond Myklebust

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Wed, Feb 11, 2015 at 7:34 AM, Christoph Hellwig <[email protected]> wrote:
> On Tue, Feb 10, 2015 at 10:43:06AM -0500, J. Bruce Fields wrote:
>> I finally got around to running xfstests as part of my regular testing
>> and ran across a reproduceable hang on generic/074:
>
> Yes, I reported this about half a year ago. It was caused (or at least
> unhidden) by commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e ("SUNRPC:
> Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT"). Reverting
> that commit fixes the issue for me.
>

Can you please recheck with the new 'devel' branch on
git://git.linux-nfs.org/projects/trondmy/linux-nfs.git ? That fixes
the test for me.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]

2015-03-01 15:58:58

by Trond Myklebust

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Sun, Mar 1, 2015 at 10:56 AM, Trond Myklebust
<[email protected]> wrote:
> On Wed, Feb 11, 2015 at 7:34 AM, Christoph Hellwig <[email protected]> wrote:
>> On Tue, Feb 10, 2015 at 10:43:06AM -0500, J. Bruce Fields wrote:
>>> I finally got around to running xfstests as part of my regular testing
>>> and ran across a reproduceable hang on generic/074:
>>
>> Yes, I reported this about half a year ago. It was caused (or at least
>> unhidden) by commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e ("SUNRPC:
>> Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT"). Reverting
>> that commit fixes the issue for me.
>>
>
> Can you please recheck with the new 'devel' branch on
> git://git.linux-nfs.org/projects/trondmy/linux-nfs.git ? That fixes
> the test for me.
>

Oops. Never mind. I was thinking of a different issue, but also with
generic/074.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]

2016-04-01 14:57:14

by Benjamin Coddington

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Wed, 11 Feb 2015, Christoph Hellwig wrote:

> On Tue, Feb 10, 2015 at 10:43:06AM -0500, J. Bruce Fields wrote:
> > I finally got around to running xfstests as part of my regular testing
> > and ran across a reproduceable hang on generic/074:
>
> Yes, I reported this about half a year ago. It was caused (or at least
> unhidden) by commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e ("SUNRPC:
> Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT"). Reverting
> that commit fixes the issue for me.

I just ran into this.

Now that we have SO_REUSEPORT, can we get rid of
RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT?

Ben

2016-04-01 15:26:31

by Trond Myklebust

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Fri, Apr 1, 2016 at 10:57 AM, Benjamin Coddington
<[email protected]> wrote:
> On Wed, 11 Feb 2015, Christoph Hellwig wrote:
>
>> On Tue, Feb 10, 2015 at 10:43:06AM -0500, J. Bruce Fields wrote:
>> > I finally got around to running xfstests as part of my regular testing
>> > and ran across a reproduceable hang on generic/074:
>>
>> Yes, I reported this about half a year ago. It was caused (or at least
>> unhidden) by commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e ("SUNRPC:
>> Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT"). Reverting
>> that commit fixes the issue for me.
>
> I just ran into this.
>
> Now that we have SO_REUSEPORT, can we get rid of
> RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT?

They are unrelated.

If you are hitting this hang, then you have borked server that is
dropping NFSv4 RPC requests. The old behaviour of having the client
break the connection is not actually sanctioned by the NFSv4 protocol.

Cheers
Trond

2016-04-08 18:24:19

by Benjamin Coddington

[permalink] [raw]
Subject: Re: hang on xfstests generic/074

On Fri, 1 Apr 2016, Trond Myklebust wrote:

> On Fri, Apr 1, 2016 at 10:57 AM, Benjamin Coddington
> <[email protected]> wrote:
> > On Wed, 11 Feb 2015, Christoph Hellwig wrote:
> >
> >> On Tue, Feb 10, 2015 at 10:43:06AM -0500, J. Bruce Fields wrote:
> >> > I finally got around to running xfstests as part of my regular testing
> >> > and ran across a reproduceable hang on generic/074:
> >>
> >> Yes, I reported this about half a year ago. It was caused (or at least
> >> unhidden) by commit 2aca5b869ace67a63aab895659e5dc14c33a4d6e ("SUNRPC:
> >> Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT"). Reverting
> >> that commit fixes the issue for me.
> >
> > I just ran into this.
> >
> > Now that we have SO_REUSEPORT, can we get rid of
> > RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT?
>
> They are unrelated.
>
> If you are hitting this hang, then you have borked server that is
> dropping NFSv4 RPC requests. The old behaviour of having the client
> break the connection is not actually sanctioned by the NFSv4 protocol.
>
> Cheers
> Trond

Ah, thanks for pointing that out. It is a server bug, I think. I'm trying
to find out more, and I'll write about it under separate cover.

Ben