2018-03-30 22:22:02

by Long Li

Subject: [PATCH 1/2] cifs: smbd: avoid reconnect lockup

From: Long Li <[email protected]>

During transport reconnect, other processes may have registered memory
and be blocked on the transport. This creates a deadlock because the
transport resources can't be freed, so reconnect is blocked.

Fix this by returning to the upper layer on timeout. Before returning,
the transport status is set to reconnecting so that other processes will
release their memory registration resources.

The upper layer will retry the reconnect. This is not in the fast I/O
path, so the timeout is set to 5 seconds.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/smbdirect.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 5aa0b54..3f7883e 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1498,8 +1498,8 @@ int smbd_reconnect(struct TCP_Server_Info *server)
 	log_rdma_event(INFO, "reconnecting rdma session\n");

 	if (!server->smbd_conn) {
-		log_rdma_event(ERR, "rdma session already destroyed\n");
-		return -EINVAL;
+		log_rdma_event(INFO, "rdma session already destroyed\n");
+		goto create_conn;
 	}

 	/*
@@ -1512,15 +1512,19 @@ int smbd_reconnect(struct TCP_Server_Info *server)
 	}

 	/* wait until the transport is destroyed */
-	wait_event(server->smbd_conn->wait_destroy,
-		server->smbd_conn->transport_status == SMBD_DESTROYED);
+	if (!wait_event_timeout(server->smbd_conn->wait_destroy,
+		server->smbd_conn->transport_status == SMBD_DESTROYED, 5*HZ))
+		return -EAGAIN;

 	destroy_workqueue(server->smbd_conn->workqueue);
 	kfree(server->smbd_conn);

+create_conn:
 	log_rdma_event(INFO, "creating rdma session\n");
 	server->smbd_conn = smbd_get_connection(
 		server, (struct sockaddr *) &server->dstaddr);
+	log_rdma_event(INFO, "created rdma session info=%p\n",
+		server->smbd_conn);

 	return server->smbd_conn ? 0 : -ENOENT;
 }
--
2.7.4
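
The retry behavior described in the commit message can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `struct toy_transport`, `toy_try_reconnect()`, `toy_release_registrations()` and `toy_reconnect_with_retry()` are hypothetical names, and the expired 5-second wait is modeled simply as "registrations are still held when reconnect runs". It shows why returning -EAGAIN instead of blocking lets the other users make progress.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the transport state; not the kernel structures. */
enum toy_status { TOY_CONNECTED, TOY_RECONNECTING };

struct toy_transport {
	enum toy_status status;
	int mr_in_use;	/* memory registrations still held by other users */
};

/*
 * One reconnect attempt. The status is marked as reconnecting first so
 * that other users can see it. If registrations are still held, the
 * bounded wait is assumed to have timed out and -EAGAIN is returned
 * instead of blocking forever.
 */
static int toy_try_reconnect(struct toy_transport *t)
{
	t->status = TOY_RECONNECTING;
	if (t->mr_in_use > 0)
		return -EAGAIN;
	t->status = TOY_CONNECTED;	/* transport torn down and recreated */
	return 0;
}

/* Other users notice the reconnecting status and drop their registrations. */
static void toy_release_registrations(struct toy_transport *t)
{
	if (t->status == TOY_RECONNECTING)
		t->mr_in_use = 0;
}

/* Upper layer: retry while the transport reports -EAGAIN. */
static int toy_reconnect_with_retry(struct toy_transport *t, int max_attempts)
{
	int rc = -EAGAIN;

	for (int i = 0; i < max_attempts && rc == -EAGAIN; i++) {
		rc = toy_try_reconnect(t);
		if (rc == -EAGAIN)
			toy_release_registrations(t);
	}
	return rc;
}
```

With two registrations outstanding, the first attempt returns -EAGAIN, the holders release their registrations once they see the reconnecting status, and the second attempt succeeds.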



2018-03-30 22:22:02

by Long Li

Subject: [PATCH 2/2] cifs: smbd: disconnect transport on RDMA errors

From: Long Li <[email protected]>

On RDMA errors, the transport should disconnect the RDMA CM connection.
This notifies the upper layer, which will then attempt a transport
reconnect.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/smbdirect.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 3f7883e..5008af5 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -862,6 +862,8 @@ static int smbd_post_send_negotiate_req(struct smbd_connection *info)
 	ib_dma_unmap_single(info->id->device, request->sge[0].addr,
 		request->sge[0].length, DMA_TO_DEVICE);

+	smbd_disconnect_rdma_connection(info);
+
 dma_mapping_failed:
 	mempool_free(request, info->request_mempool);
 	return rc;
@@ -1061,6 +1063,7 @@ static int smbd_post_send(struct smbd_connection *info,
 			if (atomic_dec_and_test(&info->send_pending))
 				wake_up(&info->wait_send_pending);
 		}
+		smbd_disconnect_rdma_connection(info);
 	} else
 		/* Reset timer for idle connection after packet is sent */
 		mod_delayed_work(info->workqueue, &info->idle_timer_work,
@@ -1202,7 +1205,7 @@ static int smbd_post_recv(
 	if (rc) {
 		ib_dma_unmap_single(info->id->device, response->sge.addr,
 				    response->sge.length, DMA_FROM_DEVICE);
-
+		smbd_disconnect_rdma_connection(info);
 		log_rdma_recv(ERR, "ib_post_recv failed rc=%d\n", rc);
 	}

@@ -2546,6 +2549,8 @@ struct smbd_mr *smbd_register_mr(
 	if (atomic_dec_and_test(&info->mr_used_count))
 		wake_up(&info->wait_for_mr_cleanup);

+	smbd_disconnect_rdma_connection(info);
+
 	return NULL;
 }

--
2.7.4
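
The pattern this patch applies to each error path can be sketched in userspace as follows. It is a toy model under stated assumptions: `struct toy_conn`, `toy_disconnect()` and `toy_post_send()` are hypothetical names, and the "notification" to the upper layer is reduced to a flag, whereas the real driver relies on the RDMA CM disconnect event.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical connection state; not the kernel's struct smbd_connection. */
struct toy_conn {
	bool connected;
	bool reconnect_requested;	/* what the upper layer would observe */
};

/* Tear the connection down once and flag it for the upper layer. */
static void toy_disconnect(struct toy_conn *c)
{
	if (c->connected) {
		c->connected = false;
		c->reconnect_requested = true;
	}
}

/*
 * Post a send. hw_rc stands in for the result of the underlying
 * verbs call: 0 on success, a negative errno on failure. On failure
 * the connection is disconnected rather than left half-working.
 */
static int toy_post_send(struct toy_conn *c, int hw_rc)
{
	if (hw_rc) {
		/* local cleanup (unmap buffers, drop counters) would go here */
		toy_disconnect(c);
	}
	return hw_rc;
}
```

A successful send leaves the connection untouched, while a failed one both returns the error to the caller and marks the connection for reconnect.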


2018-03-30 22:25:40

by ronnie sahlberg

Subject: Re: [PATCH 1/2] cifs: smbd: avoid reconnect lockup

Looks good to me (both patches)

Reviewed-by: Ronnie Sahlberg <[email protected]>


2018-03-31 00:22:56

by Steve French

Subject: Re: [PATCH 1/2] cifs: smbd: avoid reconnect lockup

merged into cifs-2.6.git for-next

added cc:stable




--
Thanks,

Steve