2015-05-21 01:21:58

by Guillaume Morin

[permalink] [raw]
Subject: [BUG] nfs3 client stops retrying to connect

Hi,

Last week we had an NFS outage and one of our NFS servers
went down for quite a while. All clients use it for NFSv3 hard mounts
over TCP.

Most clients recovered fine when the server came back up but a dozen or
so never did. They are all running 3.14.35 or later.

All tasks trying to use the mount are in D state and have not recovered.
The last and only nfs-related message in the kernel log is
nfs: server blah not responding, still trying

I spent some time trying to debug the problem. ftrace showed that the
rpc subsystem is trying to recover:

<idle>-0 [003] dNs. 4619235.426567: sched_wakeup: comm=daemon pid=4247 prio=120 success=1 target_cpu=003
daemon-4247 [003] .... 4619235.426582: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=-110 action=call_connect_status [sunrpc]
daemon-4247 [003] .... 4619235.426583: rpc_connect_status: task:44610@4, status -110
daemon-4247 [003] .... 4619235.426588: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=0 action=call_timeout [sunrpc]
daemon-4247 [003] .... 4619235.426589: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=0 action=call_bind [sunrpc]
daemon-4247 [003] .... 4619235.426590: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=0 action=call_connect [sunrpc]
daemon-4247 [003] ..s. 4619235.426591: rpc_task_sleep: task:44610@4 flags=00a0 state=0005 status=0 timeout=18000 queue=xprt_pending

This goes on and on but despite all this a tcpdump shows that the client
never tries to connect. There is absolutely no traffic between the
nfs client and the nfs server.

I looked at the daemon in crash and examined its rpc_task; it looked
pretty legit:
struct rpc_task {
  tk_count = {
    counter = 2
  },
  tk_task = {
    next = 0xffff88010bf67008,
    prev = 0xffff88032f81e018
  },
  tk_client = 0xffff88032f81e000,
  tk_rqstp = 0xffff8803310a4c00,
  tk_msg = {
    rpc_proc = 0xffffffffa08b7b18,
    rpc_argp = 0xffff88016abe6a58,
    rpc_resp = 0xffff880165d6db00,
    rpc_cred = 0x0
  },
  tk_callback = 0xffffffffa06f19e0 <xprt_connect_status>,
  tk_action = 0xffffffffa06ee310 <call_connect_status>,
  tk_ops = 0xffffffffa070db20 <rpc_default_ops>,
  tk_calldata = 0x0,
  tk_timeout = 36000,
  tk_runstate = 6,
  tk_workqueue = 0x0,
  tk_waitqueue = 0xffff880331352a58,
  u = {
    tk_work = {
      data = {
        counter = -131927684863392
      },
      entry = {
        next = 0xffff880331352a60,
        prev = 0x0
      },
      func = 0x0
    },
    tk_wait = {
      list = {
        next = 0xffff880331352a60,
        prev = 0xffff880331352a60
      },
      links = {
        next = 0x0,
        prev = 0x0
      },
      timer_list = {
        next = 0xffff880331352b00,
        prev = 0xffff880331352b00
      },
      expires = 5680095648
    }
  },
  tk_start = {
    tv64 = 4088588208239840
  },
  tk_owner = 4247,
  tk_status = 0,
  tk_flags = 160,
  tk_timeouts = 2929,
  tk_pid = 44610,
  tk_priority = 1 '\001',
  tk_garb_retry = 2 '\002',
  tk_cred_retry = 2 '\002',
  tk_rebind_retry = 2 '\002'


When I dug into the nested rpc_xprt object, I noticed that the state was 0x15, which is
XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND.

XPRT_CONNECTING is surprising since there was definitely no traffic
going on between the server and the client (and that had been the case
for a few days).

If xprt_connect() is called while XPRT_CONNECTING is set, we do not try to
connect() again until it is cleared. It seemed to be erroneously set.
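
For context, this is roughly how I read the gate in net/sunrpc/xprt.c (a
simplified paraphrase of the 3.14 code, not the exact source):

/* Simplified paraphrase of xprt_connect() in net/sunrpc/xprt.c (v3.14). */
void xprt_connect(struct rpc_task *task)
{
	struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;

	if (xprt_connected(xprt))
		return;

	/* wait on the pending queue until the transport wakes us up */
	rpc_sleep_on(&xprt->pending, task, xprt_connect_status);

	if (test_bit(XPRT_CLOSING, &xprt->state))
		return;
	/*
	 * If XPRT_CONNECTING is already set, assume a connect is in
	 * flight and do not call ->connect() again until the bit is
	 * cleared.  With the bit stuck on, this returns forever.
	 */
	if (xprt_test_and_set_connecting(xprt))
		return;
	xprt->ops->connect(xprt, task);
}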

To test this theory, I wrote a module that, when loaded, cleared that bit
in the rpc_xprt. As soon as the rpc task timed out, the client tried
to connect and the mount recovered gracefully.
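
Something along these lines is enough (a minimal sketch; it assumes the
rpc_xprt address is found with crash and passed in as a module parameter):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sunrpc/xprt.h>

/* Address of the stuck rpc_xprt, taken from crash. */
static unsigned long xprt_addr;
module_param(xprt_addr, ulong, 0);
MODULE_PARM_DESC(xprt_addr, "address of the stuck rpc_xprt");

static int __init clear_connecting_init(void)
{
	struct rpc_xprt *xprt = (struct rpc_xprt *)xprt_addr;

	if (!xprt)
		return -EINVAL;

	/* same thing the connect worker would have done on completion */
	xprt_clear_connecting(xprt);
	pr_info("cleared XPRT_CONNECTING on xprt %p\n", xprt);
	return 0;
}

static void __exit clear_connecting_exit(void)
{
}

module_init(clear_connecting_init);
module_exit(clear_connecting_exit);
MODULE_LICENSE("GPL");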

I am not quite sure how it got into that state, though, as
xs_tcp_setup_socket() always clears it. But it definitely happened on a
few machines.

If somehow xs_close() is called before the callback runs, though, I think
it could leave XPRT_CONNECTING on forever (since xs_tcp_setup_socket() is
never called); see
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/sunrpc/xprtsock.c?id=refs/tags/v3.14.43#n887

I still have a few clients with the stuck mount, so I could gather
more information if necessary.

Thanks in advance

Guillaume.

--
Guillaume Morin <[email protected]>


2015-06-03 18:29:55

by Chuck Lever

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect


On May 20, 2015, at 9:21 PM, Guillaume Morin <[email protected]> wrote:

> Hi,
>
> Last week we had an NFS outage and one of our NFS servers
> went down for quite a while. All clients use it for NFSv3 hard mount
> using TCP
>
> Most clients recovered fine when the server came back up but a dozen or
> so never did. They are all running 3.14.35 or later.
>
> All tasks trying to use the mount are in D state and have not recovered.
> The last and only nfs-related message in the kernel log is
> nfs: server blah not responding, still trying
>
> I spent some time trying to debug the problem. ftrace showed that the
> rpc subsystems is trying to recover:
>
> <idle>-0 [003] dNs. 4619235.426567: sched_wakeup: comm=daemon pid=4247 prio=120 success=1 target_cpu=003
> daemon-4247 [003] .... 4619235.426582: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=-110 action=call_connect_status [sunrpc]
> daemon-4247 [003] .... 4619235.426583: rpc_connect_status: task:44610@4, status -110
> daemon-4247 [003] .... 4619235.426588: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=0 action=call_timeout [sunrpc]
> daemon-4247 [003] .... 4619235.426589: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=0 action=call_bind [sunrpc]
> daemon-4247 [003] .... 4619235.426590: rpc_task_run_action: task:44610@4 flags=00a0 state=0005 status=0 action=call_connect [sunrpc]
> daemon-4247 [003] ..s. 4619235.426591: rpc_task_sleep: task:44610@4 flags=00a0 state=0005 status=0 timeout=18000 queue=xprt_pending
>
> This goes on and on but despite all this a tcpdump shows that the client
> never tries to connect. There is absolutely no traffic between the
> nfs client and the nfs server.
>
> I looked at the daemon in crash and looked at its rpc_task, it looked
> pretty legit:
> struct rpc_task {
> tk_count = {
> counter = 2
> },
> tk_task = {
> next = 0xffff88010bf67008,
> prev = 0xffff88032f81e018
> },
> tk_client = 0xffff88032f81e000,
> tk_rqstp = 0xffff8803310a4c00,
> tk_msg = {
> rpc_proc = 0xffffffffa08b7b18,
> rpc_argp = 0xffff88016abe6a58,
> rpc_resp = 0xffff880165d6db00,
> rpc_cred = 0x0
> },
> tk_callback = 0xffffffffa06f19e0 <xprt_connect_status>,
> tk_action = 0xffffffffa06ee310 <call_connect_status>,
> tk_ops = 0xffffffffa070db20 <rpc_default_ops>,
> tk_calldata = 0x0,
> tk_timeout = 36000,
> tk_runstate = 6,
> tk_workqueue = 0x0,
> tk_waitqueue = 0xffff880331352a58,
> u = {
> tk_work = {
> data = {
> counter = -131927684863392
> },
> entry = {
> next = 0xffff880331352a60,
> prev = 0x0
> },
> func = 0x0
> },
> tk_wait = {
> list = {
> next = 0xffff880331352a60,
> prev = 0xffff880331352a60
> },
> links = {
> next = 0x0,
> prev = 0x0
> },
> timer_list = {
> next = 0xffff880331352b00,
> prev = 0xffff880331352b00
> },
> expires = 5680095648
> }
> },
> tk_start = {
> tv64 = 4088588208239840
> },
> tk_owner = 4247,
> tk_status = 0,
> tk_flags = 160,
> tk_timeouts = 2929,
> tk_pid = 44610,
> tk_priority = 1 '\001',
> tk_garb_retry = 2 '\002',
> tk_cred_retry = 2 '\002',
> tk_rebind_retry = 2 '\002'
>
>
> When I dug into the nested rpc_xprt object, I noticed that the state was 0x15 which is
> XPRT_LOCKED|XPRT_CONNECTING|XPRT_BOUND
>
> XPRT_CONNECTING is surprising since there was definitely no traffic
> going on between the server and the client (and it was the case for
> a few days)
>
> If xprt_connect() is called while XPRT_CONNECTING is set, we do not try to
> connect() again until it cleared. It seemed it was erroneously set.
>
> To test this theory, I wrote a module that when loaded cleared that bit
> in the xprt_task. As soon as the rpc task timed out, the client tried
> to connect and the mount recovered gracefully.
>
> I am not quite sure how it got into that state though as
> xs_tcp_setup_socket always clears it. But it definitely happened on a
> few machines.
>
> If somehow xs_close() is called before the callback
> happens, I think it could leave XPRT_CONNECTING on forever though
> (since xs_tcp_setup_socket is never called), see
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/sunrpc/xprtsock.c?id=refs/tags/v3.14.43#n887
>
> I am still have a few clients with the stuck mount so I could gather
> more information if necessary.

A series of commits was merged into the v4.0 kernel, starting with commit 4dda9c8a5e34,
which changed the TCP connect logic significantly. It would be helpful to know whether the
problem can be reproduced when your clients are running the v4.0 kernel.

--
Chuck Lever
[email protected]




2015-06-04 20:06:25

by Guillaume Morin

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect

Hi Chuck,

On 03 Jun 14:31, Chuck Lever wrote:
> > If somehow xs_close() is called before the callback
> > happens, I think it could leave XPRT_CONNECTING on forever though
> > (since xs_tcp_setup_socket is never called), see
> > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/sunrpc/xprtsock.c?id=refs/tags/v3.14.43#n887
> >
> > I am still have a few clients with the stuck mount so I could gather
> > more information if necessary.
>
> A series of commits were merged into the v4.0 kernel, starting with commit 4dda9c8a5e34,
> that changed the TCP connect logic significantly. It would be helpful to know if the
> problem can be reproduced when your clients are running the v4.0 kernel.

Understood, but I actually cannot reproduce it on 3.14 either, so I am
not hopeful I'll be able to try this.

This just happened during a kernel panic of our nfs server, which stayed
down for a while; only a dozen machines could not recover, the rest
were fine. So it is definitely not that easy to trigger.

So far all my attempts to reproduce this have failed. I tried mostly by
using iptables to randomly send RSTs back to the server and to drop SYNs
pretty often. If you have any suggestions, that'd be great.

Do you have any thoughts about my impression that there could be a race
around cancelling the callback in xs_close() that could leave
XPRT_CONNECTING on?

Thanks in advance

Guillaume.

--
Guillaume Morin <[email protected]>

2015-06-04 21:21:35

by Chuck Lever

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect


On Jun 4, 2015, at 4:06 PM, Guillaume Morin <[email protected]> wrote:

> Hi Chuck,
>
> On 03 Jun 14:31, Chuck Lever wrote:
>>> If somehow xs_close() is called before the callback
>>> happens, I think it could leave XPRT_CONNECTING on forever though
>>> (since xs_tcp_setup_socket is never called), see
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/sunrpc/xprtsock.c?id=refs/tags/v3.14.43#n887
>>>
>>> I am still have a few clients with the stuck mount so I could gather
>>> more information if necessary.
>>
>> A series of commits were merged into the v4.0 kernel, starting with commit 4dda9c8a5e34,
>> that changed the TCP connect logic significantly. It would be helpful to know if the
>> problem can be reproduced when your clients are running the v4.0 kernel.
>
> Understood but I actually cannot reproduce it on 3.14 as well so I am
> not hopeful I'll be able to try this.

I've heard other reports of similar behavior. Finding a reproducer would
be a good first step to confirming when it broke and whether it's been
addressed in recent kernels.

It would also help to watch the logic in slow motion to see where the
problem manifests.

I know it's going to be tough to find a reproducer.

> This just happened during a kernel panic of our nfs server which stayed
> down for a while, then only a dozen machines could not recover, the rest
> was fine. So it is definitely not that easy to trigger.
>
> So far all my attempts to reproduce this have failed. I tried mostly by
> setting iptables to send RSTs back to the server randomly using iptables
> and dropping syns pretty often. If you have any suggestions, that'd be
> great

Is there a workload running on that mount point? It probably shouldn't
be idle when you try your experiment.

> Do you have any thoughts about my impression that there could be race
> between cancelling the callback in xs_close() that could leave
> XPRT_CONNECTING on?

I agree that XPRT_CONNECTING is probably the source of the issue.

But xs_tcp_close() can be called directly by autoclose (not likely if
there are pending RPCs) or transport shutdown (also not likely, same
reason). I'm skeptical there's a race involving xs_close().

I'm wondering if there was a missing state change upcall, or the state
change upcall happened and xs_tcp_cancel_linger_timeout() exited
without clearing XPRT_CONNECTING.

It's rather academic, though. All this code was replaced in 4.0.

--
Chuck Lever
[email protected]




2015-06-04 22:14:07

by Guillaume Morin

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect

On 04 Jun 17:23, Chuck Lever wrote:
> > This just happened during a kernel panic of our nfs server which stayed
> > down for a while, then only a dozen machines could not recover, the rest
> > was fine. So it is definitely not that easy to trigger.
> >
> > So far all my attempts to reproduce this have failed. I tried mostly by
> > setting iptables to send RSTs back to the server randomly using iptables
> > and dropping syns pretty often. If you have any suggestions, that'd be
> > great
>
> Is there a workload running on that mount point? It probably shouldn't
> be idle when you try your experiment.

I was just running ls -l on some dir. I could do some writes with dd.
The mount point that "froze" was lightly used, and mostly for writes.

> > Do you have any thoughts about my impression that there could be race
> > between cancelling the callback in xs_close() that could leave
> > XPRT_CONNECTING on?
>
> I agree that XPRT_CONNECTING is probably the source of the issue.
>
> But xs_tcp_close() can be called directly by autoclose (not likely if
> there are pending RPCs) or transport shutdown (also not likely, same
> reason). I'm skeptical there's a race involving xs_close().
>
> I'm wondering if there was a missing state change upcall, or the state
> change upcall happened and xs_tcp_cancal_linger_timeout() exited
> without clearing XPRT_CONNECTING.

I am 100% sure that XPRT_CONNECTING is the issue because 1) the state
had the flag set, 2) there was absolutely no nfs network traffic between
the client and the server, and 3) I "unfroze" the mounts by clearing it manually.

xs_tcp_cancel_linger_timeout(), I think, is guaranteed to clear the flag:
either it cancels the callback and clears the flag itself, or the callback
will do it. I am not sure how this could leave the flag set, but I am
not familiar with this code, so I could totally be missing something
obvious.

xs_tcp_close() is the only thing I have found which cancels the callback
and does not clear the flag.

> It's rather academic, though. All this code was replaced in 4.0.

Well, it's not academic for all the users of the stable branches who
might have this bug in the kernel they're running :-)

If I can reproduce this issue, I will happily test 4.0.

Guillaume.

--
Guillaume Morin <[email protected]>

2015-06-05 02:55:26

by Chuck Lever

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect


On Jun 4, 2015, at 6:14 PM, Guillaume Morin <[email protected]> wrote:

> On 04 Jun 17:23, Chuck Lever wrote:
>>> This just happened during a kernel panic of our nfs server which stayed
>>> down for a while, then only a dozen machines could not recover, the rest
>>> was fine. So it is definitely not that easy to trigger.
>>>
>>> So far all my attempts to reproduce this have failed. I tried mostly by
>>> setting iptables to send RSTs back to the server randomly using iptables
>>> and dropping syns pretty often. If you have any suggestions, that'd be
>>> great
>>
>> Is there a workload running on that mount point? It probably shouldn't
>> be idle when you try your experiment.
>
> I was just running ls -l on some dir. I could do some writes with dd.
> The mount point that "froze" was lightly used and mostly for writes
>
>>> Do you have any thoughts about my impression that there could be race
>>> between cancelling the callback in xs_close() that could leave
>>> XPRT_CONNECTING on?
>>
>> I agree that XPRT_CONNECTING is probably the source of the issue.
>>
>> But xs_tcp_close() can be called directly by autoclose (not likely if
>> there are pending RPCs) or transport shutdown (also not likely, same
>> reason). I'm skeptical there's a race involving xs_close().
>>
>> I'm wondering if there was a missing state change upcall, or the state
>> change upcall happened and xs_tcp_cancal_linger_timeout() exited
>> without clearing XPRT_CONNECTING.
>
> I am 100% sure that XPRT_CONNECTING is the issue because 1) the state
> had the flag up 2) there was absolutley no nfs network traffic between the
> client and the server 3) I "unfroze" the mounts by clearing it manually.
>
> xs_tcp_cancel_linger_timeout, I think, is guaranteed to clear the flag.

I'm speculating based on some comments in the git log, but what if
the transport never sees TCP_CLOSE, but rather gets an error_report
callback instead?

> Either the callback is canceled and it clears the flag or the callback
> will do it. I am not sure how this could leave the flag set but I am
> not familiar with this code, so I could totally be missing something
> obvious.
>
> xs_tcp_close() is the only thing I have found which cancels the callback
> and does not clear the flag.

How would xs_tcp_close() be invoked?

>> It's rather academic, though. All this code was replaced in 4.0.
>
> Well, it's not academic for all the users of the stable branches which
> might have this bug in the kernel they're running :-)

I didn't mean to be glib. The point is, stable kernels are always fixed
by backporting an existing fix from a newer kernel.

> If I can reproduce this issue, I will happily test 4.0.

Thanks!

--
Chuck Lever
[email protected]




2015-06-08 17:10:09

by Guillaume Morin

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect

Chuck,

On 04 Jun 22:57, Chuck Lever wrote:
> > I am 100% sure that XPRT_CONNECTING is the issue because 1) the state
> > had the flag up 2) there was absolutley no nfs network traffic between the
> > client and the server 3) I "unfroze" the mounts by clearing it manually.
> >
> > xs_tcp_cancel_linger_timeout, I think, is guaranteed to clear the flag.
>
> I'm speculating based on some comments in the git log, but what if
> the transport never sees TCP_CLOSE, but rather gets an error_report
> callback instead?

I don't think that could be it, because xs_tcp_setup_socket() does the
connecting and clears the bit in all cases, so by the time you would get
a TCP_CLOSE it would have been cleared a while ago.

So that's why I thought the best explanation was finding a place where
the worker task running xs_tcp_setup_socket() is cancelled and the bit is
not cleared. This is how I found xs_tcp_close().

> > Either the callback is canceled and it clears the flag or the callback
> > will do it. I am not sure how this could leave the flag set but I am
> > not familiar with this code, so I could totally be missing something
> > obvious.
> >
> > xs_tcp_close() is the only thing I have found which cancels the callback
> > and does not clear the flag.
>
> How would xs_tcp_close() be invoked?

TBH I do not know. It's the close() method of the xprt, so I am assuming
there are a few places where it could be called. But I am not familiar
with the code base.

> >> It's rather academic, though. All this code was replaced in 4.0.
> >
> > Well, it's not academic for all the users of the stable branches which
> > might have this bug in the kernel they're running :-)
>
> I didn't mean to be glib. The point is, stable kernels are always fixed
> by backporting an existing fix from a newer kernel.

The stable kernel rules say an "equivalent" fix must exist in Linus'
tree. I think that Greg would pick up this fix unless it's too complicated.

Nevertheless, it's such an annoying bug I am pretty sure the
distributions would pick it up if Greg does not.

We had to move an nfs server on Friday and I got a few machines that had
the same issue again...

Thanks for your help, I appreciate it.

Guillaume.

--
Guillaume Morin <[email protected]>

2015-06-08 17:47:58

by Chuck Lever

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect


On Jun 8, 2015, at 1:10 PM, Guillaume Morin <[email protected]> wrote:

> Chuck,
>
> On 04 Jun 22:57, Chuck Lever wrote:
>>> I am 100% sure that XPRT_CONNECTING is the issue because 1) the state
>>> had the flag up 2) there was absolutley no nfs network traffic between the
>>> client and the server 3) I "unfroze" the mounts by clearing it manually.
>>>
>>> xs_tcp_cancel_linger_timeout, I think, is guaranteed to clear the flag.
>>
>> I'm speculating based on some comments in the git log, but what if
>> the transport never sees TCP_CLOSE, but rather gets an error_report
>> callback instead?
>
> I don't think that could be it because xs_tcp_setup_socket() does the
> connecting and is clearing the bit in all cases so at the time you would get
> a TCP_CLOSE it would have been cleared a while ago.

The linger timer is started by FIN_WAIT1 or LAST_ACK, and
xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
XPRT_CONNECTION_ABORT.
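
Roughly, paraphrasing the 3.14 code from memory (treat this as a sketch,
not the exact source):

static void xs_tcp_schedule_linger_timeout(struct rpc_xprt *xprt,
					   unsigned long timeout)
{
	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);

	/* re-arm the connect worker and mark the transport as connecting */
	queue_delayed_work(rpciod_workqueue, &transport->connect_worker,
			   timeout);
	set_bit(XPRT_CONNECTION_ABORT, &xprt->state);
	set_bit(XPRT_CONNECTING, &xprt->state);
}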

At a guess there could be a race between xs_tcp_cancel_linger_timeout
and the connect worker clearing those flags.

> So that's why I thought the best explanation was finding a place where
> the worker task running xs_tcp_setup_socket() is cancelled and the bit
> not cleared. This is how I found xs_tcp_close()
>
>>> Either the callback is canceled and it clears the flag or the callback
>>> will do it. I am not sure how this could leave the flag set but I am
>>> not familiar with this code, so I could totally be missing something
>>> obvious.
>>>
>>> xs_tcp_close() is the only thing I have found which cancels the callback
>>> and does not clear the flag.
>>
>> How would xs_tcp_close() be invoked?
>
> TBH I do not know. It's the close() method of the xprt so I am assuming
> there are a few places where it could be. But I am not familiar with
> the code base..

AFAICT ->close is invoked when the transport is being shut down, in other
words at umount time. It is also invoked when the autoclose timer fires.

Autoclose is simply a mechanism for reaping NFS sockets that are idle.
I think the timer is 5 or 6 minutes.

Autoclose won't fire if there is frequent work being done on the mount
point. If this is related to autoclose, then the workload on the client
might need to be sparse (NFS requests only every few minutes or so) to
reproduce it.

For example, autoclose fires and tries to shut down the socket after the
server is no longer responding.

> We had to move an nfs server on friday and I got a few machines that had
> the same issue again...

That suggests one requirement for your reproducer: after clients have
mounted it, the NFS server needs to be fully down for an extended period.

Since some clients recovered, I assume the server retained its IP address.
Did the network route change?

--
Chuck Lever
[email protected]




2015-06-08 18:12:12

by Guillaume Morin

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect

On 08 Jun 13:50, Chuck Lever wrote:
> The linger timer is started by FIN_WAIT1 or LAST_ACK, and
> xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
> XPRT_CONNECTION_ABORT.
>
> At a guess there could be a race between xs_tcp_cancel_linger_timeout
> and the connect worker clearing those flags.

The connect worker is xs_tcp_setup_socket(). It clears the connecting
bit in all code paths. So the only kind of race I can see here is
another function cancelling it before it runs without clearing the bit.

xs_tcp_cancel_linger_timeout() does the right thing afaict. It clears
the bit if cancel_delayed_work() returns a non-zero value.

The only other place where the worker is cancelled is xs_close(), but it
does not clear the bit. So if it cancels the worker before it has
started running, the bit will stay set.
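
To make that asymmetry concrete, the two paths look roughly like this
(paraphrased from my reading of net/sunrpc/xprtsock.c in 3.14, so the
details may be slightly off):

static void xs_tcp_cancel_linger_timeout(struct rpc_xprt *xprt)
{
	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);

	/* if we cancelled the pending connect worker before it ran, we
	 * are now the ones responsible for clearing XPRT_CONNECTING */
	if (cancel_delayed_work(&transport->connect_worker))
		xprt_clear_connecting(xprt);
}

static void xs_close(struct rpc_xprt *xprt)
{
	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);

	/* also cancels the connect worker, but never clears
	 * XPRT_CONNECTING; further down it only clears
	 * XPRT_CONNECTION_ABORT, XPRT_CLOSE_WAIT and XPRT_CLOSING */
	cancel_delayed_work_sync(&transport->connect_worker);

	xs_reset_transport(transport);
	xprt->reestablish_timeout = 0;
}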

> AFAICT ->close is invoked when the transport is being shut down, in other
> words at umount time. It is also invoked when the autoclose timer fires.
>
> Autoclose is simply a mechanism for reaping NFS sockets that are idle.
> I think the timer is 5 or 6 minutes.
>
> Autoclose won't fire if there is frequent work being done on the mount
> point. If this is related to autoclose, then the workload on the client
> might need to be sparse (NFS requests only every few minutes or so) to
> reproduce it.
>
> For example, autoclose fires and tries to shut down the socket after the
> server is no longer responding.

It does not seem that autoclose is the cause here since it has happened
only during server outages.

If autoclose and umount are the only things that can call xs_close(),
that seems unlikely to be the problem. But I see that xprt_connect()
can call it too, so that gives me some hope.

> > We had to move an nfs server on friday and I got a few machines that had
> > the same issue again...
>
> That suggests one requirement for your reproducer: after clients have
> mounted it, the NFS server needs to be fully down for an extended period.

Yes, that seems to be the case, but if it's a race this just gives it
more opportunities to happen.

> Since some clients recovered, I assume the server retained its IP address.
> Did the network route change?

No, the route did not change.

--
Guillaume Morin <[email protected]>

2015-08-25 15:16:17

by Guillaume Morin

[permalink] [raw]
Subject: Re: [BUG] nfs3 client stops retrying to connect

On 08 Jun 20:12, Guillaume Morin wrote:
>
> On 08 Jun 13:50, Chuck Lever wrote:
> > The linger timer is started by FIN_WAIT1 or LAST_ACK, and
> > xs_tcp_schedule_linger_timeout sets XPRT_CONNECTING and
> > XPRT_CONNECTION_ABORT.
> >
> > At a guess there could be a race between xs_tcp_cancel_linger_timeout
> > and the connect worker clearing those flags.
>
> The connect worker is xs_tcp_setup_socket(). It clears the connecting
> bit in all code paths. So the only kind of race I can see here is
> another function cancelling it before it runs without clearing the bit.
>
> xs_tcp_cancel_linger_timeout() does the right thing afaict. It clears
> the bit if cancel_delayed_work() returns a non-zero value.
>
> The only other place where the worker is cancelled is xs_close() but it
> does not clear the bit. So if it cancels the worker before it had
> started running, the bit will stay up.

FWIW I patched our production kernel a couple of months ago to clear the
connecting bit in xs_close(). Since then we've had a few nfs server
downtimes and the problem has never recurred, while before the change we
always had a few machines that could not reconnect. I feel fairly
confident this was the bug.

I am posting the change in case it helps someone running one of the
stable kernels.

sunrpc: call xprt_clear_connecting in xs_close

It closes the race where the CONNECTING bit in the xprt
is left on while the kernel is not trying to connect

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 41c2f9d..1b71c59 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -891,6 +891,7 @@ static void xs_close(struct rpc_xprt *xprt)
 	dprintk("RPC: xs_close xprt %p\n", xprt);
 
 	cancel_delayed_work_sync(&transport->connect_worker);
+	xprt_clear_connecting(xprt);
 
 	xs_reset_transport(transport);
 	xprt->reestablish_timeout = 0;


Another option would be to call clear_bit() a few lines later, but
clear_bit() is never used for XPRT_CONNECTING, so I went with this.
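
For reference, xprt_clear_connecting() is, as far as I can tell, just
clear_bit() wrapped in memory barriers, so the two options should be
equivalent:

/* include/linux/sunrpc/xprt.h, roughly */
static inline void xprt_clear_connecting(struct rpc_xprt *xprt)
{
	smp_mb__before_clear_bit();
	clear_bit(XPRT_CONNECTING, &xprt->state);
	smp_mb__after_clear_bit();
}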

--
Guillaume Morin <[email protected]>