This reverts commit 4b0ab51db32eba0f48b7618254742f143364a28d.
This change causes 'rpc.nfsd' to hang for a long time if rpcbind is not
available.
If
--no-nfs-version 2 --no-nfs-version 3
is given, the delay is about 6.5 minutes. When trying to register
all versions, the delay is over half an hour.
Before this commit, and after reverting it, nfsd fails (when v3 is
requested) or succeeds (when only v4 is requested) immediately.
Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/xprtsock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 111767ab124a..2a938055e95b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -795,6 +795,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
 	xs_sock_reset_connection_flags(xprt);
 	/* Mark transport as closed and wake up all pending tasks */
 	xprt_disconnect_done(xprt);
+	xprt_force_disconnect(xprt);
 }

 /**
--
2.9.2
> On Aug 3, 2016, at 19:33, NeilBrown <[email protected]> wrote:
>
>
> This reverts commit 4b0ab51db32eba0f48b7618254742f143364a28d.
>
> This change causes 'rpc.nfsd' to hang for long time if rpcbind is not
> available.
> If
> --no-nfs-version 2 --no-nfs-version 3
> is given, the delay is about 6.5 minutes. When trying to register
> all versions, the delay is over half an hour.
> Before this commit, and after reverting it, nfsd fails (when v3 is
> requested) or succeeds (when only v4 is requested) immediately.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> net/sunrpc/xprtsock.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 111767ab124a..2a938055e95b 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -795,6 +795,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
> xs_sock_reset_connection_flags(xprt);
> /* Mark transport as closed and wake up all pending tasks */
> xprt_disconnect_done(xprt);
> + xprt_force_disconnect(xprt);
> }
>
If there is an outstanding request, then _that_ is supposed to redrive the connection. Why is the xprt_disconnect_done() not functioning as per the comment above it?
On Thu, Aug 04 2016, Trond Myklebust wrote:
>> On Aug 3, 2016, at 19:33, NeilBrown <[email protected]> wrote:
>>
>>
>> This reverts commit 4b0ab51db32eba0f48b7618254742f143364a28d.
>>
>> This change causes 'rpc.nfsd' to hang for long time if rpcbind is not
>> available.
>> If
>> --no-nfs-version 2 --no-nfs-version 3
>> is given, the delay is about 6.5 minutes. When trying to register
>> all versions, the delay is over half an hour.
>> Before this commit, and after reverting it, nfsd fails (when v3 is
>> requested) or succeeds (when only v4 is requested) immediately.
>>
>> Signed-off-by: NeilBrown <[email protected]>
>> ---
>> net/sunrpc/xprtsock.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 111767ab124a..2a938055e95b 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -795,6 +795,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
>> xs_sock_reset_connection_flags(xprt);
>> /* Mark transport as closed and wake up all pending tasks */
>> xprt_disconnect_done(xprt);
>> + xprt_force_disconnect(xprt);
>> }
>>
>
> If there is an outstanding request, then _that_ is supposed to redrive the connection. Why is the xprt_disconnect_done() not functioning as per the comment above it?
No idea, sorry.  I haven't tried to follow through exactly what is
happening.
The symptom is trivial to reproduce:
rpc.nfsd 0
killall rpcbind
rpc.nfsd 8
This will either complete immediately with kernel log messages, or block
for a long time.
I might try to figure out what is happening, but I suspect there are
others who know this code a lot better than I do.
NeilBrown
On Thu, Aug 04 2016, Trond Myklebust wrote:
>> On Aug 3, 2016, at 19:33, NeilBrown <[email protected]> wrote:
>>
>>
>> This reverts commit 4b0ab51db32eba0f48b7618254742f143364a28d.
>>
>> This change causes 'rpc.nfsd' to hang for long time if rpcbind is not
>> available.
>> If
>> --no-nfs-version 2 --no-nfs-version 3
>> is given, the delay is about 6.5 minutes. When trying to register
>> all versions, the delay is over half an hour.
>> Before this commit, and after reverting it, nfsd fails (when v3 is
>> requested) or succeeds (when only v4 is requested) immediately.
>>
>> Signed-off-by: NeilBrown <[email protected]>
>> ---
>> net/sunrpc/xprtsock.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 111767ab124a..2a938055e95b 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -795,6 +795,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
>> xs_sock_reset_connection_flags(xprt);
>> /* Mark transport as closed and wake up all pending tasks */
>> xprt_disconnect_done(xprt);
>> + xprt_force_disconnect(xprt);
>> }
>>
>
> If there is an outstanding request, then _that_ is supposed to redrive the connection. Why is the xprt_disconnect_done() not functioning as per the comment above it?
I made some time to look at this and have a somewhat clearer
understanding.
The rpcbind code makes about 24 requests, including NULL pings, UNSET
requests to clear any old registrations, and SET requests.
The first (a NULL) fails immediately due to ECONNREFUSED. Others aren't
even attempted for a growing number of seconds because xs_connect()
finds that transport->sock is not NULL, so it delays before trying.
Once it tries, it fails immediately and we move on to the next request.
transport->sock is set to NULL by xs_reset_transport(), which xs_connect()
calls as part of the timeout handling, but ->reestablish_timeout is not reset.
The next time xs_connect() is called, ->sock will be non-NULL again and
transport->reestablish_timeout will be longer.
transport->sock is only set to NULL *before* the test in xs_connect() by a
call to xs_close() or ->close().  But nothing is triggering this.
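For reference, the xs_connect() back-off logic being described looks roughly
like this (a simplified sketch from memory of the code of this era; dprintk()s
dropped and the workqueue name is approximate, so treat it as an illustration
rather than verbatim upstream source):

static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task)
{
	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
	unsigned long delay = 0;

	if (transport->sock != NULL) {
		/* A socket already exists (even a dead one): tear it down
		 * and back off before the next connection attempt. */
		xs_reset_transport(transport);

		delay = xprt->reestablish_timeout;
		xprt->reestablish_timeout <<= 1;
		if (xprt->reestablish_timeout > XS_TCP_MAX_REEST_TO)
			xprt->reestablish_timeout = XS_TCP_MAX_REEST_TO;
	}
	/* else: no socket yet, so connect immediately (delay == 0) */

	queue_delayed_work(rpciod_workqueue,
			   &transport->connect_worker, delay);
}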
xs_sock_mark_closed() currently doesn't actually mark the socket as
closed (XPRT_CLOSE_WAIT), or close it directly.
static void xs_sock_mark_closed(struct rpc_xprt *xprt)
{
	xs_sock_reset_connection_flags(xprt);
	/* Mark transport as closed and wake up all pending tasks */
	xprt_disconnect_done(xprt);
}
xs_sock_reset_connection_flags() clears XPRT_CLOSE_WAIT, XPRT_CLOSING, and
XPRT_SOCK_DATA_READY, with memory barriers around the clears.
xprt_disconnect_done() clears XPRT_CONNECTED and wakes up the task.
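Roughly, those helpers amount to the following (again a from-memory sketch of
this era's code, with details approximate):

static void xs_sock_reset_state_flags(struct rpc_xprt *xprt)
{
	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);

	clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state);
}

static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt)
{
	smp_mb__before_atomic();
	clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
	clear_bit(XPRT_CLOSING, &xprt->state);
	xs_sock_reset_state_flags(xprt);
	smp_mb__after_atomic();
}

void xprt_disconnect_done(struct rpc_xprt *xprt)
{
	spin_lock_bh(&xprt->transport_lock);
	xprt_clear_connected(xprt);		/* clears XPRT_CONNECTED */
	xprt_wake_pending_tasks(xprt, -EAGAIN);	/* hence the -EAGAIN seen below */
	spin_unlock_bh(&xprt->transport_lock);
}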
So none of this will "Mark transport as closed" as the comment suggests;
it just marks the transport as disconnected.
xs_sock_reset_connection_flags() even *clears* XPRT_CLOSE_WAIT, which is the
bit we really want to be set.
The commit said:
Under all conditions, it should be quite sufficient just to mark
the socket as disconnected. It will then be closed by the
transport shutdown or reconnect code.
The "reconnect code" is presumably xprt_connect()? This will close the
socket if XPRT_CLOSE_WAIT is set., but not if the socket is marked as
disconnected. When marked as disconnected, it tries to reconnect -
which incurs a timeout.
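That is, something along these lines near the top of xprt_connect() (a sketch
only; the binding and locking checks are trimmed and the details are from
memory, so don't treat it as the exact source):

void xprt_connect(struct rpc_task *task)
{
	struct rpc_xprt *xprt = task->tk_rqstp->rq_xprt;

	/* (binding and transport write-lock checks omitted) */

	/* Only an explicit "close wait" actually closes the old socket... */
	if (test_and_clear_bit(XPRT_CLOSE_WAIT, &xprt->state))
		xprt->ops->close(xprt);

	/* ...a merely-disconnected transport instead goes through
	 * ->connect(), i.e. xs_connect() with its growing delay. */
	if (!xprt_connected(xprt)) {
		task->tk_rqstp->rq_bytes_sent = 0;
		task->tk_timeout = task->tk_rqstp->rq_timeout;
		rpc_sleep_on(&xprt->pending, task, xprt_connect_status);
		if (test_bit(XPRT_CLOSING, &xprt->state))
			return;
		if (xprt_test_and_set_connecting(xprt))
			return;
		xprt->ops->connect(xprt, task);
	}
	xprt_release_write(xprt, task);
}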
The "transport shutdown" code is presumably xprt_destroy()? Yes, that
would close the socket, but that doesn't straight away. The rpcbind
client can try 10 UNSET calls before destroying the transport.
It is possible that __svc_unregister should return an error so that
svc_unregister() can abort the loop on error. That would reduce the
timeouts to some extent, but not completely as we would still get a
timeout when trying to register.  It might make sense that if the
svc_unregister() in svc_rpcb_setup() fails due to ECONNREFUSED, then
svc_rpcb_setup() should fail too, so we never even try to register.
Plumbing that status through nfsd so that it doesn't abort in the V4-only
case when rpcbind isn't available could be done, but it seems like a lot of
work for little gain.
Also, the error that comes through to __svc_unregister is "-EAGAIN"
rather than -ECONNREFUSED.
I think that the easiest solution is to make sure reconnection attempts
for RPC_IS_SOFTCONN tasks do not trigger a delay in xs_connect().
call_connect_status() already adds a 3-second delay for other tasks, so
simply forcing a close in xs_sock_mark_closed() should be sufficient.
The revert I posted would do that, or we could just add
	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
so that the comment would be correct.  That is sufficient.
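That second option would look something like this (a sketch only; exactly
where the set_bit() lands relative to xprt_disconnect_done() is my choice,
not settled):

static void xs_sock_mark_closed(struct rpc_xprt *xprt)
{
	xs_sock_reset_connection_flags(xprt);
	/* Really mark the transport as closed, so that xprt_connect()
	 * closes the dead socket instead of delaying a reconnect. */
	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
	/* Mark transport as closed and wake up all pending tasks */
	xprt_disconnect_done(xprt);
}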
Another approach which works is:
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 7f79fb7dc6a0..f80b135b4830 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1936,8 +1936,10 @@ call_connect_status(struct rpc_task *task)
 	case -EADDRINUSE:
 	case -ENOBUFS:
 	case -EPIPE:
-		if (RPC_IS_SOFTCONN(task))
+		if (RPC_IS_SOFTCONN(task)) {
+			xprt_force_disconnect(task->tk_rqstp->rq_xprt);
 			break;
+		}
 		/* retry with existing socket, after a delay */
 		rpc_delay(task, 3*HZ);
 	case -EAGAIN:
so that we force a disconnect precisely when a connection attempt fails
on a SOFTCONN task.
Which of these solutions do you prefer?
Thanks,
NeilBrown
Ping.
This is still outstanding. Is there some way we can reach a resolution?
Thanks,
NeilBrown
On Wed, Aug 10 2016, NeilBrown wrote:
> On Thu, Aug 04 2016, Trond Myklebust wrote:
>
>>> On Aug 3, 2016, at 19:33, NeilBrown <[email protected]> wrote:
>>>
>>>
>>> This reverts commit 4b0ab51db32eba0f48b7618254742f143364a28d.
>>>
>>> This change causes 'rpc.nfsd' to hang for long time if rpcbind is not
>>> available.
>>> If
>>> --no-nfs-version 2 --no-nfs-version 3
>>> is given, the delay is about 6.5 minutes. When trying to register
>>> all versions, the delay is over half an hour.
>>> Before this commit, and after reverting it, nfsd fails (when v3 is
>>> requested) or succeeds (when only v4 is requested) immediately.
>>>
>>> Signed-off-by: NeilBrown <[email protected]>
>>> ---
>>> net/sunrpc/xprtsock.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>> index 111767ab124a..2a938055e95b 100644
>>> --- a/net/sunrpc/xprtsock.c
>>> +++ b/net/sunrpc/xprtsock.c
>>> @@ -795,6 +795,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
>>> xs_sock_reset_connection_flags(xprt);
>>> /* Mark transport as closed and wake up all pending tasks */
>>> xprt_disconnect_done(xprt);
>>> + xprt_force_disconnect(xprt);
>>> }
>>>
>>
>> If there is an outstanding request, then _that_ is supposed to redrive the connection. Why is the xprt_disconnect_done() not functioning as per the comment above it?
>
> I made some time to look at this and have a somewhat clearer
> understanding.
>
> The rpcbind code makes about 24 requests, include NULL pings, UNSET
> requests to clear any old registrations, and SET requests.
>
> The first (a NULL) fails immediately due to ECONNREFUSED. Others aren't
> even attempted for a growing number of seconds because xs_connect()
> finds that transport->sock is not NULL, so it delays before trying.
> Once it tries, it failed immediately and we move on to the next request.
>
> transport->sock is set to NULL by xs_reset_transport() which xs_connect() calls
> as part of the the timeout, but ->reestablish_timeout is not reset.
> The next time xs_connect is called, ->sock will be non-NULL again as
> transport->reestablish_timeout will be longer.
>
> transport->sock is only set to NULL *before* the test in xs_connect() by a
> call to xs_close() or ->close(). But nothing in triggering this.
>
> xs_sock_mark_closed() currently doesn't actually mark the socket as
> closed (XPRT_CLOSE_WAITING), or close it directly.
>
> static void xs_sock_mark_closed(struct rpc_xprt *xprt)
> {
> xs_sock_reset_connection_flags(xprt);
> /* Mark transport as closed and wake up all pending tasks */
> xprt_disconnect_done(xprt);
> }
>
> xs_sock_reset_connection_flags() clears XPRT_CLOSE_WAITING,
> XPRT_CLOSING, and XPRT_SOCK_DATA_READY, and adds some memory barrier.
>
> xprt_disconnect_done() clears XPRT_CONNECTED and wakes up the task.
>
> So none of this will "Mark transport as closed" like the comment
> suggests. They just mark it as disconnected.
> xs_sock_reset_connection_flags() even *clears* XPRT_CLOSE_WAITING, which
> we really want to be set.
>
> The commit said:
>
> Under all conditions, it should be quite sufficient just to mark
> the socket as disconnected. It will then be closed by the
> transport shutdown or reconnect code.
>
> The "reconnect code" is presumably xprt_connect()? This will close the
> socket if XPRT_CLOSE_WAIT is set., but not if the socket is marked as
> disconnected. When marked as disconnected, it tries to reconnect -
> which incurs a timeout.
>
> The "transport shutdown" code is presumably xprt_destroy()? Yes, that
> would close the socket, but that doesn't straight away. The rpcbind
> client can try 10 UNSET calls before destroying the transport.
>
>
> It is possible that __svc_unregister should return an error so that
> svc_unregister() can abort the loop on error. That would reduce the
> timeouts to some extent, but not completely as we would still get a
> timeout when trying to register. It might make sense that if the
> svc_unregister() in svc_rpcb_setup() fails due to ECONNREFUSED, then it
> should fail so we never even try to register. Plumbing that status
> through nfsd so that it doesn't abort in the V4-only case when rpcbind
> isn't available could be done, but seems like a lot of work for little gain.
>
> Also, the error that comes through to __svc_unregister is "-EAGAIN"
> rather then -ECONNREFUSED.
>
> I think that the easiest solution is to make sure reconnection attempts
> on an RPC_IS_SOFTCONN tasks do not trigger a delay in xs_connect().
> call_connect_status() already adds a 3second delay for other tasks, so
> simply forcing a close in xs_sock_marked_close() should be sufficient.
> The revert I posted would do that, or we could just add
>
> set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>
> so that the comment would be correct. That is sufficient.
>
> Another approach which works is:
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 7f79fb7dc6a0..f80b135b4830 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -1936,8 +1936,10 @@ call_connect_status(struct rpc_task *task)
> case -EADDRINUSE:
> case -ENOBUFS:
> case -EPIPE:
> - if (RPC_IS_SOFTCONN(task))
> + if (RPC_IS_SOFTCONN(task)) {
> + xprt_force_disconnect(task->tk_rqstp->rq_xprt);
> break;
> + }
> /* retry with existing socket, after a delay */
> rpc_delay(task, 3*HZ);
> case -EAGAIN:
>
> so that we force a disconnect precisely when a connection attempt fails
> on a SOFTCONN task.
>
> Which of these solutions do you prefer?
>
> Thanks,
> NeilBrown
> On Aug 9, 2016, at 22:05, NeilBrown <[email protected]> wrote:
>
> On Thu, Aug 04 2016, Trond Myklebust wrote:
....
>
> Which of these solutions do you prefer?

Instead of the above, could we rather look at adding a call to xprt_conditional_disconnect() to ensure that we don’t keep disconnecting multiple times? It means we’d have to set req->rq_connect_cookie in xprt_connect() before sleeping on xprt->pending, but I don’t think that is a major imposition.

I also think we might not want to make that call conditional on RPC_IS_SOFTCONN.
On Sat, Nov 19 2016, Trond Myklebust wrote:
>> On Aug 9, 2016, at 22:05, NeilBrown <[email protected]> wrote:
....
>>
>> Another approach which works is:
>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>> index 7f79fb7dc6a0..f80b135b4830 100644
>> --- a/net/sunrpc/clnt.c
>> +++ b/net/sunrpc/clnt.c
>> @@ -1936,8 +1936,10 @@ call_connect_status(struct rpc_task *task)
>> case -EADDRINUSE:
>> case -ENOBUFS:
>> case -EPIPE:
>> - if (RPC_IS_SOFTCONN(task))
>> + if (RPC_IS_SOFTCONN(task)) {
>> + xprt_force_disconnect(task->tk_rqstp->rq_xprt);
>> break;
>> + }
>> /* retry with existing socket, after a delay */
>> rpc_delay(task, 3*HZ);
>> case -EAGAIN:
>>
>> so that we force a disconnect precisely when a connection attempt fails
>> on a SOFTCONN task.
>>
>> Which of these solutions do you prefer?
>
>
> Instead of the above, could we rather look at adding a call to xprt_conditional_disconnect() to ensure that we don’t keep disconnecting multiple times? It means we’d have to set req->rq_connect_cookie in xprt_connect() before sleeping on xprt->pending, but I don’t think that is a major imposition.
>
> I also think we might not want to make that call conditional on RPC_IS_SOFTCONN.
Something like this maybe?
Just using xprt_conditional_disconnect() doesn't work as it does
nothing when !xprt_connected(xprt).
I commented out that test and then the change works. I don't know if
that is what you want though.
Thanks,
NeilBrown
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index b31ca97e754e..53acd4e3a317 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1926,6 +1926,8 @@ call_connect_status(struct rpc_task *task)
 	case -EADDRINUSE:
 	case -ENOBUFS:
 	case -EPIPE:
+		xprt_conditional_disconnect(task->tk_rqstp->rq_xprt,
+					    task->tk_rqstp->rq_connect_cookie);
 		if (RPC_IS_SOFTCONN(task))
 			break;
 		/* retry with existing socket, after a delay */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 685e6d225414..52007d018135 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -669,7 +669,7 @@ void xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie)
 	spin_lock_bh(&xprt->transport_lock);
 	if (cookie != xprt->connect_cookie)
 		goto out;
-	if (test_bit(XPRT_CLOSING, &xprt->state) || !xprt_connected(xprt))
+	if (test_bit(XPRT_CLOSING, &xprt->state) /*|| !xprt_connected(xprt)*/)
 		goto out;
 	set_bit(XPRT_CLOSE_WAIT, &xprt->state);
 	/* Try to schedule an autoclose RPC call */
@@ -772,6 +772,7 @@ void xprt_connect(struct rpc_task *task)
 	if (!xprt_connected(xprt)) {
 		task->tk_rqstp->rq_bytes_sent = 0;
 		task->tk_timeout = task->tk_rqstp->rq_timeout;
+		task->tk_rqstp->rq_connect_cookie = xprt->connect_cookie;
 		rpc_sleep_on(&xprt->pending, task, xprt_connect_status);
 		if (test_bit(XPRT_CLOSING, &xprt->state))