2021-09-03 13:10:26

by Daniel Wagner

[permalink] [raw]
Subject: [RFC v1] nvme-tcp: enable linger socket option on shutdown

When the no linger is set, the networking stack sends FIN followed by
RST immediately when shutting down the socket. By enabling linger when
shutting down we have a proper shutdown sequence on the wire.

Signed-off-by: Daniel Wagner <[email protected]>
---
The current shutdown sequence on the wire is a bit harsh and
doesn't let the remote host to react. I suppose we should
introduce a short (how long?) linger pause when shutting down
the connection. Thoughs?

drivers/nvme/host/tcp.c | 1 +
include/net/sock.h | 1 +
net/core/sock.c | 8 ++++++++
3 files changed, 10 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e2ab12f3f51c..6c6dc465147a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1558,6 +1558,7 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)

static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
+ sock_reset_linger(queue->sock->sk);
kernel_sock_shutdown(queue->sock, SHUT_RDWR);
nvme_tcp_restore_sock_calls(queue);
cancel_work_sync(&queue->io_work);
diff --git a/include/net/sock.h b/include/net/sock.h
index 66a9a90f9558..313a6c8ba51c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2777,6 +2777,7 @@ int sock_set_timestamping(struct sock *sk, int optname,

void sock_enable_timestamps(struct sock *sk);
void sock_no_linger(struct sock *sk);
+void sock_reset_linger(struct sock *sk);
void sock_set_keepalive(struct sock *sk);
void sock_set_priority(struct sock *sk, u32 priority);
void sock_set_rcvbuf(struct sock *sk, int val);
diff --git a/net/core/sock.c b/net/core/sock.c
index 62627e868e03..23090a01e412 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -755,6 +755,14 @@ void sock_no_linger(struct sock *sk)
}
EXPORT_SYMBOL(sock_no_linger);

+void sock_reset_linger(struct sock *sk)
+{
+ lock_sock(sk);
+ sock_reset_flag(sk, SOCK_LINGER);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(sock_reset_linger);
+
void sock_set_priority(struct sock *sk, u32 priority)
{
lock_sock(sk);
--
2.29.2


2021-09-06 08:49:43

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC v1] nvme-tcp: enable linger socket option on shutdown

On Fri, Sep 03, 2021 at 02:17:57PM +0200, Daniel Wagner wrote:
> When the no linger is set, the networking stack sends FIN followed by
> RST immediately when shutting down the socket. By enabling linger when
> shutting down we have a proper shutdown sequence on the wire.
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
> The current shutdown sequence on the wire is a bit harsh and
> doesn't let the remote host to react. I suppose we should
> introduce a short (how long?) linger pause when shutting down
> the connection. Thoughs?

Why? I'm not really a TCP expert, but why is this different from
say iSCSI or NBD?

2021-09-14 08:49:43

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC v1] nvme-tcp: enable linger socket option on shutdown

On Mon, Sep 06, 2021 at 08:58:20AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 03, 2021 at 02:17:57PM +0200, Daniel Wagner wrote:
> > When the no linger is set, the networking stack sends FIN followed by
> > RST immediately when shutting down the socket. By enabling linger when
> > shutting down we have a proper shutdown sequence on the wire.
> >
> > Signed-off-by: Daniel Wagner <[email protected]>
> > ---
> > The current shutdown sequence on the wire is a bit harsh and
> > doesn't let the remote host to react. I suppose we should
> > introduce a short (how long?) linger pause when shutting down
> > the connection. Thoughs?
>
> Why? I'm not really a TCP expert, but why is this different from
> say iSCSI or NBD?

I am also no TCP expert. Adding netdev to Cc.

During testing the nvme-tcp subsystem by one of our partners we observed
this. Maybe this is perfectly fine. Just as I said it looks a bit weird
that a proper shutdown of the connection a RST is send out right after
the FIN.

No idea how iSCSI or NBD handles this. I'll check.

2021-09-14 14:22:53

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [RFC v1] nvme-tcp: enable linger socket option on shutdown


>>> When the no linger is set, the networking stack sends FIN followed by
>>> RST immediately when shutting down the socket. By enabling linger when
>>> shutting down we have a proper shutdown sequence on the wire.
>>>
>>> Signed-off-by: Daniel Wagner <[email protected]>
>>> ---
>>> The current shutdown sequence on the wire is a bit harsh and
>>> doesn't let the remote host to react. I suppose we should
>>> introduce a short (how long?) linger pause when shutting down
>>> the connection. Thoughs?
>>
>> Why? I'm not really a TCP expert, but why is this different from
>> say iSCSI or NBD?
>
> I am also no TCP expert. Adding netdev to Cc.
>
> During testing the nvme-tcp subsystem by one of our partners we observed
> this. Maybe this is perfectly fine. Just as I said it looks a bit weird
> that a proper shutdown of the connection a RST is send out right after
> the FIN.

The point here is that when we close the connection we may have inflight
requests that we already failed to upper layers and we don't want them
to get through as we proceed to error handling. This is why we want the
socket to go away asap.

> No idea how iSCSI or NBD handles this. I'll check.

iSCSI does the same thing in essence (with a minor variation because in
iscsi we have a logout message which we don't have in nvme).

2021-09-15 07:56:31

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC v1] nvme-tcp: enable linger socket option on shutdown

On Tue, Sep 14, 2021 at 05:20:46PM +0300, Sagi Grimberg wrote:
> > During testing the nvme-tcp subsystem by one of our partners we observed
> > this. Maybe this is perfectly fine. Just as I said it looks a bit weird
> > that a proper shutdown of the connection a RST is send out right after
> > the FIN.
>
> The point here is that when we close the connection we may have inflight
> requests that we already failed to upper layers and we don't want them
> to get through as we proceed to error handling. This is why we want the
> socket to go away asap.

Thanks for the explanation. The RST is on purpose, got it.