We need to hear from the NFS client maintainers about
the below question and the patch.
> Begin forwarded message:
>
> From: [email protected]
> Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
> Date: March 14, 2023 at 12:19:30 PM EDT
> To: Chuck Lever III <[email protected]>
> Cc: Linux NFS Mailing List <[email protected]>, Helen Chao <[email protected]>
>
>
> On 3/8/23 11:03 AM, [email protected] wrote:
>> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>>
>>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <[email protected]> wrote:
>>>>
>>>> Currently call_bind_status places a hard limit of 3 to the number of
>>>> retries on EACCES error. This limit was done to accommodate the behavior
>>>> of a buggy server that keeps returning garbage when the NLM daemon is
>>>> killed on the NFS server. However this change causes problem for other
>>>> servers that take a little longer than 9 seconds for the port mapper to
>>>> become ready when the NFS server is restarted.
>>>>
>>>> This patch removes this hard coded limit and let the RPC handles
>>>> the retry according to whether the export is soft or hard mounted.
>>>>
>>>> To avoid the hang with buggy server, the client can use soft mount for
>>>> the export.
>>>>
>>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>>> Reported-by: Helen Chao <[email protected]>
>>>> Tested-by: Helen Chao <[email protected]>
>>>> Signed-off-by: Dai Ngo <[email protected]>
>>> Helen is the royal queen of ^C ;-)
>>>
>>> Did you try ^C on a mount while it waits for a rebind?
>>
>> She uses a test script that restarts the NFS server while NLM lock test
>> is running. The failure is random, sometimes it fails and sometimes it
>> passes depending on when the LOCK/UNLOCK requests come in so I think
>> it's hard to time it to do the ^C, but I will ask.
>
> We did the test with ^C and here is what we found.
>
> For synchronous RPC task the signal was delivered to the RPC task and
> the task exit with -ERESTARTSYS from __rpc_execute as expected.
>
> For asynchronous RPC task the process that invokes the RPC task to send
> the request detected the signal in rpc_wait_for_completion_task and exits
> with -ERESTARTSYS. However the async RPC was allowed to continue to run
> to completion. So if the async RPC task was retrying an operation and
> the NFS server was down, it will retry forever if this is a hard mount
> or until the NFS server comes back up.
>
> The question for the list is should we propagate the signal to the async
> task via rpc_signal_task to stop its execution or just leave it alone as is.
>
> -Dai
>
>>
>> Thanks,
>> -Dai
>>
>>>
>>>
>>>> ---
>>>> include/linux/sunrpc/sched.h | 3 +--
>>>> net/sunrpc/clnt.c | 3 ---
>>>> net/sunrpc/sched.c | 1 -
>>>> 3 files changed, 1 insertion(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>>>> index b8ca3ecaf8d7..8ada7dc802d3 100644
>>>> --- a/include/linux/sunrpc/sched.h
>>>> +++ b/include/linux/sunrpc/sched.h
>>>> @@ -90,8 +90,7 @@ struct rpc_task {
>>>> #endif
>>>> unsigned char tk_priority : 2,/* Task priority */
>>>> tk_garb_retry : 2,
>>>> - tk_cred_retry : 2,
>>>> - tk_rebind_retry : 2;
>>>> + tk_cred_retry : 2;
>>>> };
>>>>
>>>> typedef void (*rpc_action)(struct rpc_task *);
>>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>>>> index 0b0b9f1eed46..63b438d8564b 100644
>>>> --- a/net/sunrpc/clnt.c
>>>> +++ b/net/sunrpc/clnt.c
>>>> @@ -2050,9 +2050,6 @@ call_bind_status(struct rpc_task *task)
>>>> status = -EOPNOTSUPP;
>>>> break;
>>>> }
>>>> - if (task->tk_rebind_retry == 0)
>>>> - break;
>>>> - task->tk_rebind_retry--;
>>>> rpc_delay(task, 3*HZ);
>>>> goto retry_timeout;
>>>> case -ENOBUFS:
>>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>>> index be587a308e05..c8321de341ee 100644
>>>> --- a/net/sunrpc/sched.c
>>>> +++ b/net/sunrpc/sched.c
>>>> @@ -817,7 +817,6 @@ rpc_init_task_statistics(struct rpc_task *task)
>>>> /* Initialize retry counters */
>>>> task->tk_garb_retry = 2;
>>>> task->tk_cred_retry = 2;
>>>> - task->tk_rebind_retry = 2;
>>>>
>>>> /* starting timestamp */
>>>> task->tk_start = ktime_get();
>>>> --
>>>> 2.9.5
>>>>
>>> --
>>> Chuck Lever
--
Chuck Lever
Hi,
Just a friendly reminder for the status of this patch.
Thanks,
-Dai
On 3/16/23 6:38 AM, Chuck Lever III wrote:
> We need to hear from the NFS client maintainers about
> the below question and the patch.
>
>
>> Begin forwarded message:
>>
>> From: [email protected]
>> Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status
>> Date: March 14, 2023 at 12:19:30 PM EDT
>> To: Chuck Lever III <[email protected]>
>> Cc: Linux NFS Mailing List <[email protected]>, Helen Chao <[email protected]>
>>
>>
>> On 3/8/23 11:03 AM, [email protected] wrote:
>>> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <[email protected]> wrote:
>>>>>
>>>>> Currently call_bind_status places a hard limit of 3 to the number of
>>>>> retries on EACCES error. This limit was done to accommodate the behavior
>>>>> of a buggy server that keeps returning garbage when the NLM daemon is
>>>>> killed on the NFS server. However this change causes problem for other
>>>>> servers that take a little longer than 9 seconds for the port mapper to
>>>>> become ready when the NFS server is restarted.
>>>>>
>>>>> This patch removes this hard coded limit and let the RPC handles
>>>>> the retry according to whether the export is soft or hard mounted.
>>>>>
>>>>> To avoid the hang with buggy server, the client can use soft mount for
>>>>> the export.
>>>>>
>>>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>>>> Reported-by: Helen Chao <[email protected]>
>>>>> Tested-by: Helen Chao <[email protected]>
>>>>> Signed-off-by: Dai Ngo <[email protected]>
>>>> Helen is the royal queen of ^C ;-)
>>>>
>>>> Did you try ^C on a mount while it waits for a rebind?
>>> She uses a test script that restarts the NFS server while NLM lock test
>>> is running. The failure is random, sometimes it fails and sometimes it
>>> passes depending on when the LOCK/UNLOCK requests come in so I think
>>> it's hard to time it to do the ^C, but I will ask.
>> We did the test with ^C and here is what we found.
>>
>> For synchronous RPC task the signal was delivered to the RPC task and
>> the task exit with -ERESTARTSYS from __rpc_execute as expected.
>>
>> For asynchronous RPC task the process that invokes the RPC task to send
>> the request detected the signal in rpc_wait_for_completion_task and exits
>> with -ERESTARTSYS. However the async RPC was allowed to continue to run
>> to completion. So if the async RPC task was retrying an operation and
>> the NFS server was down, it will retry forever if this is a hard mount
>> or until the NFS server comes back up.
>>
>> The question for the list is should we propagate the signal to the async
>> task via rpc_signal_task to stop its execution or just leave it alone as is.
>>
>> -Dai
>>
>>> Thanks,
>>> -Dai
>>>
>>>>
>>>>> ---
>>>>> include/linux/sunrpc/sched.h | 3 +--
>>>>> net/sunrpc/clnt.c | 3 ---
>>>>> net/sunrpc/sched.c | 1 -
>>>>> 3 files changed, 1 insertion(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>>>>> index b8ca3ecaf8d7..8ada7dc802d3 100644
>>>>> --- a/include/linux/sunrpc/sched.h
>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>> @@ -90,8 +90,7 @@ struct rpc_task {
>>>>> #endif
>>>>> unsigned char tk_priority : 2,/* Task priority */
>>>>> tk_garb_retry : 2,
>>>>> - tk_cred_retry : 2,
>>>>> - tk_rebind_retry : 2;
>>>>> + tk_cred_retry : 2;
>>>>> };
>>>>>
>>>>> typedef void (*rpc_action)(struct rpc_task *);
>>>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>>>>> index 0b0b9f1eed46..63b438d8564b 100644
>>>>> --- a/net/sunrpc/clnt.c
>>>>> +++ b/net/sunrpc/clnt.c
>>>>> @@ -2050,9 +2050,6 @@ call_bind_status(struct rpc_task *task)
>>>>> status = -EOPNOTSUPP;
>>>>> break;
>>>>> }
>>>>> - if (task->tk_rebind_retry == 0)
>>>>> - break;
>>>>> - task->tk_rebind_retry--;
>>>>> rpc_delay(task, 3*HZ);
>>>>> goto retry_timeout;
>>>>> case -ENOBUFS:
>>>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>>>> index be587a308e05..c8321de341ee 100644
>>>>> --- a/net/sunrpc/sched.c
>>>>> +++ b/net/sunrpc/sched.c
>>>>> @@ -817,7 +817,6 @@ rpc_init_task_statistics(struct rpc_task *task)
>>>>> /* Initialize retry counters */
>>>>> task->tk_garb_retry = 2;
>>>>> task->tk_cred_retry = 2;
>>>>> - task->tk_rebind_retry = 2;
>>>>>
>>>>> /* starting timestamp */
>>>>> task->tk_start = ktime_get();
>>>>> --
>>>>> 2.9.5
>>>>>
>>>> --
>>>> Chuck Lever
> --
> Chuck Lever
>
>