2022-11-21 14:24:25

by Benjamin Coddington

[permalink] [raw]
Subject: [PATCH v1 0/3] Stop corrupting socket's task_frag

The networking code uses flags in sk_allocation to determine if it can use
current->task_frag, however in-kernel users of sockets may stop setting
sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
on all rpciod/xprtiod jobs").

This will cause corruption in current->task_frag when recursing into the
network layer for those subsystems during page fault or reclaim. The
corruption is difficult to diagnose because stack traces may not contain the
offending subsystem at all. The corruption is unlikely to show up in
testing because it requires memory pressure, and so subsystems that
convert to memalloc_nofs_save/restore are likely to continue to run into
this issue.

Previous reports and proposed fixes:
https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/

Guilluame Nault has done all of the hard work tracking this problem down and
finding the best fix for this issue. I'm just taking a turn posting another
fix.

Benjamin Coddington (2):
Treewide: Stop corrupting socket's task_frag
net: simplify sk_page_frag

Guillaume Nault (1):
net: Introduce sk_use_task_frag in struct sock.

drivers/block/drbd/drbd_receiver.c | 3 +++
drivers/block/nbd.c | 1 +
drivers/nvme/host/tcp.c | 1 +
drivers/scsi/iscsi_tcp.c | 1 +
drivers/usb/usbip/usbip_common.c | 1 +
fs/afs/rxrpc.c | 1 +
fs/cifs/connect.c | 1 +
fs/dlm/lowcomms.c | 2 ++
fs/ocfs2/cluster/tcp.c | 1 +
include/net/sock.h | 10 ++++++----
net/9p/trans_fd.c | 1 +
net/ceph/messenger.c | 1 +
net/core/sock.c | 1 +
net/sunrpc/xprtsock.c | 3 +++
14 files changed, 24 insertions(+), 4 deletions(-)

--
2.31.1



2022-11-21 14:51:55

by Benjamin Coddington

[permalink] [raw]
Subject: [PATCH v1 3/3] net: simplify sk_page_frag

Now that in-kernel socket users that may recurse during reclaim have benn
converted to sk_use_task_frag = false, we can have sk_page_frag() simply
check that value.

Signed-off-by: Benjamin Coddington <[email protected]>
---
include/net/sock.h | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ffba9e95470d..fac24c6ee30d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
* Both direct reclaim and page faults can nest inside other
* socket operations and end up recursing into sk_page_frag()
* while it's already in use: explicitly avoid task page_frag
- * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags or
- * explicitly disable sk_use_task_frag.
+ * when users disable sk_use_task_frag.
*
* Return: a per task page_frag if context allows that,
* otherwise a per socket one.
*/
static inline struct page_frag *sk_page_frag(struct sock *sk)
{
- if (sk->sk_use_task_frag &&
- (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
- __GFP_FS)) ==
- (__GFP_DIRECT_RECLAIM | __GFP_FS))
+ if (sk->sk_use_task_frag)
return &current->task_frag;

return &sk->sk_frag;
--
2.31.1


2022-12-07 12:16:16

by Benjamin Coddington

[permalink] [raw]
Subject: Re: [PATCH v1 0/3] Stop corrupting socket's task_frag

Hi Dave, Eric, Jakub, Paolo,

I think it makes sense for all three of these to go together through netdev.
If you agree, would you like me to chase down individual ACKs for each
treewide touch?

What can I do from netdev's perspective to move this forward?

Ben

On 21 Nov 2022, at 8:35, Benjamin Coddington wrote:

> The networking code uses flags in sk_allocation to determine if it can use
> current->task_frag, however in-kernel users of sockets may stop setting
> sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
> as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
> on all rpciod/xprtiod jobs").
>
> This will cause corruption in current->task_frag when recursing into the
> network layer for those subsystems during page fault or reclaim. The
> corruption is difficult to diagnose because stack traces may not contain the
> offending subsystem at all. The corruption is unlikely to show up in
> testing because it requires memory pressure, and so subsystems that
> convert to memalloc_nofs_save/restore are likely to continue to run into
> this issue.
>
> Previous reports and proposed fixes:
> https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
> https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
> https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/
>
> Guilluame Nault has done all of the hard work tracking this problem down and
> finding the best fix for this issue. I'm just taking a turn posting another
> fix.
>
> Benjamin Coddington (2):
> Treewide: Stop corrupting socket's task_frag
> net: simplify sk_page_frag
>
> Guillaume Nault (1):
> net: Introduce sk_use_task_frag in struct sock.
>
> drivers/block/drbd/drbd_receiver.c | 3 +++
> drivers/block/nbd.c | 1 +
> drivers/nvme/host/tcp.c | 1 +
> drivers/scsi/iscsi_tcp.c | 1 +
> drivers/usb/usbip/usbip_common.c | 1 +
> fs/afs/rxrpc.c | 1 +
> fs/cifs/connect.c | 1 +
> fs/dlm/lowcomms.c | 2 ++
> fs/ocfs2/cluster/tcp.c | 1 +
> include/net/sock.h | 10 ++++++----
> net/9p/trans_fd.c | 1 +
> net/ceph/messenger.c | 1 +
> net/core/sock.c | 1 +
> net/sunrpc/xprtsock.c | 3 +++
> 14 files changed, 24 insertions(+), 4 deletions(-)
>
> --
> 2.31.1

2022-12-09 17:29:06

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] net: simplify sk_page_frag

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Now that in-kernel socket users that may recurse during reclaim have benn
> converted to sk_use_task_frag = false, we can have sk_page_frag() simply
> check that value.
>
> Signed-off-by: Benjamin Coddington <[email protected]>
> ---
> include/net/sock.h | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ffba9e95470d..fac24c6ee30d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
> * Both direct reclaim and page faults can nest inside other
> * socket operations and end up recursing into sk_page_frag()
> * while it's already in use: explicitly avoid task page_frag
> - * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags or
> - * explicitly disable sk_use_task_frag.
> + * when users disable sk_use_task_frag.
> *
> * Return: a per task page_frag if context allows that,
> * otherwise a per socket one.
> */
> static inline struct page_frag *sk_page_frag(struct sock *sk)
> {
> - if (sk->sk_use_task_frag &&
> - (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
> - __GFP_FS)) ==
> - (__GFP_DIRECT_RECLAIM | __GFP_FS))
> + if (sk->sk_use_task_frag)
> return &current->task_frag;
>
> return &sk->sk_frag;

To make the above as safe as possible I think we should double-check
the in-kernel users explicitly setting sk_allocation to GFP_ATOMIC, as
that has the side effect of disabling the task_frag usage, too.

Patch 2/3 already catches some of such users, and we can safely leave
alone few others, (specifically l2tp, fou and inet_ctl_sock_create()).

Even wireguard and tls looks safe IMHO.

So the only left-over should be espintcp, I suggest updating patch 2/3
clearing sk_use_task_frag even in espintcp_init_sk().

Other than that LGTM.

Cheers,

Paolo

2022-12-09 17:31:47

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] net: simplify sk_page_frag

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Now that in-kernel socket users that may recurse during reclaim have benn
> converted to sk_use_task_frag = false, we can have sk_page_frag() simply
> check that value.
>
> Signed-off-by: Benjamin Coddington <[email protected]>
> ---
> include/net/sock.h | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ffba9e95470d..fac24c6ee30d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
> * Both direct reclaim and page faults can nest inside other
> * socket operations and end up recursing into sk_page_frag()
> * while it's already in use: explicitly avoid task page_frag
> - * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags or
> - * explicitly disable sk_use_task_frag.
> + * when users disable sk_use_task_frag.
> *
> * Return: a per task page_frag if context allows that,
> * otherwise a per socket one.
> */
> static inline struct page_frag *sk_page_frag(struct sock *sk)
> {
> - if (sk->sk_use_task_frag &&
> - (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
> - __GFP_FS)) ==
> - (__GFP_DIRECT_RECLAIM | __GFP_FS))
> + if (sk->sk_use_task_frag)
> return &current->task_frag;
>
> return &sk->sk_frag;

To make the above as safe as possible I think we should double-check
the in-kernel users explicitly setting sk_allocation to GFP_ATOMIC, as
that has the side effect of disabling the task_frag usage, too.

Patch 2/3 already catches some of such users, and we can safely leave
alone few others, (specifically l2tp, fou and inet_ctl_sock_create()).

Even wireguard and tls looks safe IMHO.

So the only left-over should be espintcp, I suggest updating patch 2/3
clearing sk_use_task_frag even in espintcp_init_sk().

Other than that LGTM.

Cheers,

Paolo

2022-12-09 17:47:02

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] net: simplify sk_page_frag

On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Now that in-kernel socket users that may recurse during reclaim have benn
> converted to sk_use_task_frag = false, we can have sk_page_frag() simply
> check that value.
>
> Signed-off-by: Benjamin Coddington <[email protected]>
> ---
> include/net/sock.h | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index ffba9e95470d..fac24c6ee30d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2539,19 +2539,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
> * Both direct reclaim and page faults can nest inside other
> * socket operations and end up recursing into sk_page_frag()
> * while it's already in use: explicitly avoid task page_frag
> - * usage if the caller is potentially doing any of them.
> - * This assumes that page fault handlers use the GFP_NOFS flags or
> - * explicitly disable sk_use_task_frag.
> + * when users disable sk_use_task_frag.
> *
> * Return: a per task page_frag if context allows that,
> * otherwise a per socket one.
> */
> static inline struct page_frag *sk_page_frag(struct sock *sk)
> {
> - if (sk->sk_use_task_frag &&
> - (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
> - __GFP_FS)) ==
> - (__GFP_DIRECT_RECLAIM | __GFP_FS))
> + if (sk->sk_use_task_frag)
> return &current->task_frag;
>
> return &sk->sk_frag;

To make the above as safe as possible I think we should double-check
the in-kernel users explicitly setting sk_allocation to GFP_ATOMIC, as
that has the side effect of disabling the task_frag usage, too.

Patch 2/3 already catches some of such users, and we can safely leave
alone few others, (specifically l2tp, fou and inet_ctl_sock_create()).

Even wireguard and tls looks safe IMHO.

So the only left-over should be espintcp, I suggest updating patch 2/3
clearing sk_use_task_frag even in espintcp_init_sk().

Other than that LGTM.

Cheers,

Paolo