2024-04-24 16:57:34

by Dragos Tatulea

Subject: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

The patch referenced in the Fixes tag introduced the skb_frag_unref API.
This API still has a dark corner: when skb->pp_recycle is false and a
fragment being unreferenced is backed by a page_pool page, skb_frag_unref
can leak the page out of the page_pool, as in the following example:

BUG: Bad page state in process swapper/4 pfn:103423
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x103423000 pfn:0x103423
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffffffff()
raw: 0200000000000000 dead000000000040 ffff888106f38000 0000000000000000
raw: 0000000103423000 0000000000000041 00000000ffffffff 0000000000000000
page dumped because: page_pool leak
Modules linked in: act_mirred act_csum act_pedit act_gact cls_flower
act_ct nf_flow_table sch_ingress xt_conntrack xt_MASQUERADE
nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi
scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
ib_uverbs ib_core zram zsmalloc mlx5_core fuse CPU: 4 PID: 0 Comm:
swapper/4 Not tainted 6.9.0-rc4+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x53/0x70
bad_page+0x6f/0xf0
free_unref_page_prepare+0x271/0x420
free_unref_page+0x38/0x120
___pskb_trim+0x261/0x390
skb_segment+0xf60/0x1040
tcp_gso_segment+0xe8/0x4e0
inet_gso_segment+0x155/0x3d0
skb_mac_gso_segment+0x99/0x100
__skb_gso_segment+0xb4/0x160
? netif_skb_features+0x95/0x2f0
validate_xmit_skb+0x16c/0x330
validate_xmit_skb_list+0x4c/0x70
sch_direct_xmit+0x23e/0x350
__dev_queue_xmit+0x334/0xbc0
tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
tcf_mirred_act+0xd7/0x4b0 [act_mirred]
? tcf_pedit_act+0x6f/0x540 [act_pedit]
tcf_action_exec+0x82/0x170
fl_classify+0x1ee/0x200 [cls_flower]
? tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
? mlx5e_completion_event+0x39/0x40 [mlx5_core]
? mlx5_eq_comp_int+0x1d7/0x1f0 [mlx5_core]
tcf_classify+0x26a/0x470
tc_run+0xa2/0x120
? handle_irq_event+0x53/0x80
? kvm_clock_get_cycles+0x11/0x20
__netif_receive_skb_core.constprop.0+0x932/0xee0
__netif_receive_skb_list_core+0xfe/0x1f0
netif_receive_skb_list_internal+0x1b5/0x2b0
napi_gro_complete.constprop.0+0xee/0x120
dev_gro_receive+0x3f4/0x710
napi_gro_receive+0x7d/0x220
mlx5e_handle_rx_cqe_mpwrq+0x10d/0x1d0 [mlx5_core]
mlx5e_poll_rx_cq+0x8b/0x6f0 [mlx5_core]
mlx5e_napi_poll+0xdc/0x6c0 [mlx5_core]
__napi_poll+0x25/0x1b0
net_rx_action+0x2c1/0x330
__do_softirq+0xcb/0x28c
irq_exit_rcu+0x69/0x90
common_interrupt+0x85/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x26/0x40
RIP: 0010:default_idle+0x17/0x20
Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e
fa 8b 05 76 3f 0a 01 85 c0 7e 07 0f 00 2d 1d 74 41 00 fb f4 <fa> c3 0f
1f 80 00 00 00 00 f3 0f 1e fa 65 48 8b 35 04 b3 42 7e f0
RSP: 0018:ffff88810087bed8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8881008415c0 RCX: 000000e116d359fb
RDX: 0000000000000000 RSI: ffffffff8223e1d1 RDI: 000000000003f214
RBP: 0000000000000004 R08: 000000000003f214 R09: 000000e116d359fb
R10: 000000e116d359fb R11: 000000000005dfee R12: 0000000000000004
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
default_idle_call+0x3d/0xf0
do_idle+0x1ce/0x1e0
cpu_startup_entry+0x29/0x30
start_secondary+0x109/0x130
common_startup_64+0x129/0x138
</TASK>

How it happens:
-> skb_segment
-> clone skb into nskb
-> call __pskb_trim(nskb)
-> call pskb_expand_head(nskb) because nskb is cloned
-> call skb_release_data(nskb) because nskb is cloned
-> nskb->pp_recycle = 0
-> skb_unref(nskb->frag[i], nskb)
-> nskb->pp_recycle is false, frag is page_pool page
-> Fragment is released via put_page(frag page), hence leaking
from the page_pool.

Something tells me that this is not the only issue of this kind...

The patch itself contains a suggested fix for this specific bug: it
forces the unref in ___pskb_trim to recycle to the page_pool. If the
page is not a page_pool page, it will be released as a regular page.

The alternative would be to save the skb->pp_recycle flag before
pskb_expand_head and use it later during the unref.

Signed-off-by: Dragos Tatulea <[email protected]>
Co-developed-by: Jianbo Liu <[email protected]>
Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
Cc: Mina Almasry <[email protected]>
---
net/core/skbuff.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37c858dc11a6..ab75b4f876ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 	skb_shinfo(skb)->nr_frags = i;
 
 	for (; i < nfrags; i++)
-		skb_frag_unref(skb, i);
+		__skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
 
 	if (skb_has_frag_list(skb))
 		skb_drop_fraglist(skb);
--
2.44.0



2024-04-24 17:28:50

by Dragos Tatulea

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> The patch referenced in the fixes tag introduced the skb_frag_unref API.
This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
confusion.

> This API still has a dark corner: when skb->pp_recycled is false and a
> fragment being referenced is backed by a page_pool page, skb_frag_unref
> can leak the page out of the page_pool, like in the following example:
>
> [... stack trace snipped ...]
>
> How it happens:
> -> skb_segment
> -> clone skb into nskb
> -> call __pskb_trim(nskb)
> -> call pskb_expand_head(nskb) because nskb is cloned
> -> call skb_release_data(nskb) because nskb is cloned
> -> nskb->pp_recycle = 0
> -> skb_unref(nskb->frag[i], nskb)
> -> nskb->pp_recycle is false, frag is page_pool page
> -> Fragment is released via put_page(frag page), hence leaking
> from the page_pool.
>
> Something tells me that this is not be the only issue of this kind...
>
> The patch itself contains a suggested fix for this specific bug: it
> forces the unref in ___pskb_trim to recycle to the page_pool. If the
> page is not a page_pool page, it will be dereferenced as a regular page.
>
> The alternative would be to save the skb->pp_recycled flag before
> pskb_expand_head and use it later during the unref.
>

One more interesting point is the comment in skb_release_data [1] and its
commit message [2] from Ilias. Looking at the commit message...

[1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803

> [... tags and patch snipped ...]

2024-04-24 22:09:24

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, Apr 24, 2024 at 10:25 AM Dragos Tatulea <[email protected]> wrote:
>
> On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> > The patch referenced in the fixes tag introduced the skb_frag_unref API.
> This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
> confusion.
>
> > This API still has a dark corner: when skb->pp_recycled is false and a
> > fragment being referenced is backed by a page_pool page, skb_frag_unref
> > can leak the page out of the page_pool, like in the following example:
> >
> > [... stack trace snipped ...]
> >
> > How it happens:
> > -> skb_segment
> > -> clone skb into nskb
> > -> call __pskb_trim(nskb)
> > -> call pskb_expand_head(nskb) because nskb is cloned
> > -> call skb_release_data(nskb) because nskb is cloned
> > -> nskb->pp_recycle = 0
> > -> skb_unref(nskb->frag[i], nskb)
> > -> nskb->pp_recycle is false, frag is page_pool page
> > -> Fragment is released via put_page(frag page), hence leaking
> > from the page_pool.
> >
> > Something tells me that this is not be the only issue of this kind...
> >
> > The patch itself contains a suggested fix for this specific bug: it
> > forces the unref in ___pskb_trim to recycle to the page_pool. If the
> > page is not a page_pool page, it will be dereferenced as a regular page.
> >
> > The alternative would be to save the skb->pp_recycled flag before
> > pskb_expand_head and use it later during the unref.
> >
>
> One more interesting point is the comment in skb_release_data [1] and it's
> commit message [2] from Ilias. Looking at the commit message
>

Sorry for the bug. To be honest, though, I don't think this is the
right fix. As you mention, this is only one instance of a more general
underlying issue.

The underlying issue is that before the fixes commit, skb_frag_ref()
grabbed a regular page ref. After the fixes commit, skb_frag_ref()
grabs a page-pool ref. The unref path always dropped a regular page
ref, thanks to this commit as you point out:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803

AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
("skbuff: Fix a potential race while recycling page_pool packets").
The reason is that now that skb_frag_ref() can grab page-pool refs, we
don't need to make sure there is only 1 SKB that triggers the recycle
path anymore. The skb and all of its clones can obtain page-pool refs,
and in the unref path we drop the page-pool refs. page_pool_put_page()
detects correctly that the last page-pool ref is put and recycles the
page only then.

I'm unable to repro this issue. Do you need to do anything special? Or
is 1 flow that hits skb_segment() good enough?

Reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
recycling page_pool packets") shows no regressions for me. Is it
possible to try that out? If that doesn't work, I think I prefer
reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
rather than merging this fix to make sure we removed the underlying
cause of the issue.

> [1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
> [2]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
>
> > > [... tags and patch snipped ...]
>


--
Thanks,
Mina

2024-04-25 08:17:47

by Dragos Tatulea

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> On Wed, Apr 24, 2024 at 10:25 AM Dragos Tatulea <[email protected]> wrote:
> >
> > On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> > > The patch referenced in the fixes tag introduced the skb_frag_unref API.
> > This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
> > confusion.
> >
> > > This API still has a dark corner: when skb->pp_recycled is false and a
> > > fragment being referenced is backed by a page_pool page, skb_frag_unref
> > > can leak the page out of the page_pool, like in the following example:
> > >
> > > [... stack trace snipped ...]
> > >
> > > How it happens:
> > > -> skb_segment
> > > -> clone skb into nskb
> > > -> call __pskb_trim(nskb)
> > > -> call pskb_expand_head(nskb) because nskb is cloned
> > > -> call skb_release_data(nskb) because nskb is cloned
> > > -> nskb->pp_recycle = 0
> > > -> skb_unref(nskb->frag[i], nskb)
> > > -> nskb->pp_recycle is false, frag is page_pool page
> > > -> Fragment is released via put_page(frag page), hence leaking
> > > from the page_pool.
> > >
> > > Something tells me that this is not be the only issue of this kind...
> > >
> > > The patch itself contains a suggested fix for this specific bug: it
> > > forces the unref in ___pskb_trim to recycle to the page_pool. If the
> > > page is not a page_pool page, it will be dereferenced as a regular page.
> > >
> > > The alternative would be to save the skb->pp_recycled flag before
> > > pskb_expand_head and use it later during the unref.
> > >
> >
> > One more interesting point is the comment in skb_release_data [1] and it's
> > commit message [2] from Ilias. Looking at the commit message
> >
>
> Sorry for the bug. I don't think this is the right fix to be honest
> though. As you mention this is only 1 such instance of a general
> underlying issue.
>
Right. It was a conversation starter :).

> The underlying issue is that before the fixes commit, skb_frag_ref()
> grabbed a regular page ref. After the fixes commit, skb_frag_ref()
> grabs a page-pool ref.
>
It is still possible for skb_frag_ref() to grab a regular page ref on a
page_pool page: for example in skb_segment(), when an nskb is created (not
cloned, so pp_recycle is off) and frags with page_pool pages are added to
it. I was seeing this in my test as well.

If the intention is to always use page_pool ref/unref for page_pool pages, then
the recycle flag check doesn't belong in skb_frag_ref/unref().

> The unref path always dropped a regular page
> ref, thanks to this commit as you point out:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
>
> AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> ("skbuff: Fix a potential race while recycling page_pool packets").
> The reason is that now that skb_frag_ref() can grab page-pool refs, we
> don't need to make sure there is only 1 SKB that triggers the recycle
> path anymore. All the skb and its clones can obtain page-pool refs,
> and in the unref path we drop the page-pool refs. page_pool_put_page()
> detects correctly that the last page-pool ref is put and recycles the
> page only then.
>
I don't think this is a good way forward. For example, skb->pp_recycle is used
as a hint in skb_gro_receive() to avoid coalescing skbs with different
pp_recycle flag states. Reverting could interfere with that.

> I'm unable to repro this issue. Do you need to do anything special? Or
> is 1 flow that hits skb_segment() good enough?
>
I don't have a very easy repro. But the test is a forwarding one.

> Reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> recycling page_pool packets") shows no regressions for me. Is it
> possible to try that out?
>
I can try it out, it might work. But again, I fear that simply reverting this
change can trigger other unexpected behaviours. Also, it doesn't address the
skb_frag_ref behaviour mentioned above.

> If that doesn't work, I think I prefer
> reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> rather than merging this fix to make sure we removed the underlying
> cause of the issue.
This is the safest bet.

So, to recap, I see 2 possibilities:

1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe, but it
will probably have to come back in one way or another.
2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule of
always referencing/unreferencing pages based on their type (page_pool or
normal).

Thanks,
Dragos
>
> > [1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
> > [2]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> >
> > > [... tags and patch snipped ...]
> >
>
>

2024-04-25 19:21:23

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <[email protected]> wrote:
>
> On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> > If that doesn't work, I think I prefer
> > reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > rather than merging this fix to make sure we removed the underlying
> > cause of the issue.
> This is the safest bet.
>
> So, to recap, I see 2 possibilities:
>
> 1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe, but it
> will probably have to come back in one way or another.
> 2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule of
> always referencing/dereferencing pages based on their type (page_pool or
> normal).
>

If this works, I would be very happy. I personally think ref/unref
should be done based on the page type. For me the net stack using the
regular {get|put}_page on a pp page isn't great. It requires special
handling to make sure the ref + unref are in sync. Also if the last pp
ref is dropped while there are pending regular refs,
__page_pool_page_can_be_recycled() check will fail and the page will
not be recycled.

On the other hand, since 0a149ab78ee2 ("page_pool: transition to
reference count management after page draining") I'm not sure there is
any reason to continue to use get/put_page on pp-pages, we can use the
new pp-ref instead.

I don't see any regressions with this diff (needs cleanup), but your
test setup seems much better than mine (I think this is the second
reffing issue you've managed to repro):

diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
index 4dcdbe9fbc5f..4c72227dce1b 100644
--- a/include/linux/skbuff_ref.h
+++ b/include/linux/skbuff_ref.h
@@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page)
 static inline void skb_page_ref(struct page *page, bool recycle)
 {
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && napi_pp_get_page(page))
+	if (napi_pp_get_page(page))
 		return;
 #endif
 	get_page(page);
@@ -69,7 +69,7 @@ static inline void
 skb_page_unref(struct page *page, bool recycle)
 {
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && napi_pp_put_page(page))
+	if (napi_pp_put_page(page))
 		return;
 #endif
 	put_page(page);


--
Thanks,
Mina

2024-04-25 19:48:20

by Dragos Tatulea

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <[email protected]> wrote:
> >
> > On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> > > If that doesn't work, I think I prefer
> > > reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > > rather than merging this fix to make sure we removed the underlying
> > > cause of the issue.
> > This is the safest bet.
> >
> > So, to recap, I see 2 possibilities:
> >
> > 1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe, but it
> > will probably have to come back in one way or another.
> > 2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule of
> > always referencing/dereferencing pages based on their type (page_pool or
> > normal).
> >
>
> If this works, I would be very happy. I personally think ref/unref
> should be done based on the page type. For me the net stack using the
> regular {get|put}_page on a pp page isn't great. It requires special
> handling to make sure the ref + unref are in sync. Also if the last pp
> ref is dropped while there are pending regular refs,
> __page_pool_page_can_be_recycled() check will fail and the page will
> not be recycled.
>
> On the other hand, since 0a149ab78ee2 ("page_pool: transition to
> reference count management after page draining") I'm not sure there is
> any reason to continue to use get/put_page on pp-pages, we can use the
> new pp-ref instead.
>
> I don't see any regressions with this diff (needs cleanup), but your
> test setup seems much much better than mine (I think this is the
> second reffing issue you manage to repro):
>
> [... diff snipped ...]
>
>
This is option 2. I thought this would fix everything. But I just tested and
it's not the case: we are now reaching a negative pp_ref_count. So probably
somewhere a regular page reference is still being taken...

Thanks,
Dragos



2024-04-26 19:04:10

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Fri, Apr 26, 2024 at 7:59 AM Dragos Tatulea <[email protected]> wrote:
>
> On Thu, 2024-04-25 at 13:42 -0700, Mina Almasry wrote:
> > On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <[email protected]> wrote:
> > >
> > > On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> > > > diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
> > > > index 4dcdbe9fbc5f..4c72227dce1b 100644
> > > > --- a/include/linux/skbuff_ref.h
> > > > +++ b/include/linux/skbuff_ref.h
> > > > @@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page)
> > > > static inline void skb_page_ref(struct page *page, bool recycle)
> > > > {
> > > > #ifdef CONFIG_PAGE_POOL
> > > > - if (recycle && napi_pp_get_page(page))
> > > > + if (napi_pp_get_page(page))
> > > > return;
> > > > #endif
> > > > get_page(page);
> > > > @@ -69,7 +69,7 @@ static inline void
> > > > skb_page_unref(struct page *page, bool recycle)
> > > > {
> > > > #ifdef CONFIG_PAGE_POOL
> > > > - if (recycle && napi_pp_put_page(page))
> > > > + if (napi_pp_put_page(page))
> > > > return;
> > > > #endif
> > > > put_page(page);
> > > >
> > > >
> > > This is option 2. I thought this would fix everything. But I just tested and
> > > it's not the case: we are now reaching a negative pp_ref_count.
> > >
> I was tired and botched the revert of the code in this RFC when testing.
> Dropping the recycle flag works as expected. Do we need an RFC v2 or is this
> non-RFC material?
>
> Thanks,
> Dragos
>

IMO, it needs to be cleaned up to remove the dead bool recycle flag
code, but other than that I think it's good as a non-RFC.

To save you some time I did the said clean up and here is a patch
passing make allmodconfig build + checkpatch/kdoc checks + tests on my
end:

https://pastebin.com/bX5KcHTb

I could submit it to the list if you prefer.

FWIW, if this approach shows no regressions, I think we may be able to
deprecate skb->pp_recycle flag entirely, but I can look into that as a
separate change.

--
Thanks,
Mina

2024-04-26 23:06:07

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > The unref path always dropped a regular page
> > ref, thanks to this commit as you point out:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> >
> > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> > ("skbuff: Fix a potential race while recycling page_pool packets").
> > The reason is that now that skb_frag_ref() can grab page-pool refs, we
> > don't need to make sure there is only 1 SKB that triggers the recycle
> > path anymore. All the skb and its clones can obtain page-pool refs,
> > and in the unref path we drop the page-pool refs. page_pool_put_page()
> > detects correctly that the last page-pool ref is put and recycles the
> > page only then.
> >
> I don't think this is a good way forward. For example, skb->pp_recycle is used
> as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> flag states. This could interfere with that.

That's a bit speculative, right? The simple invariant we are trying to
hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
reference skb is holding on that frag is a pp reference, not page
reference.

skb_gro_receive() needs to maintain that invariant, if it doesn't
we need to fix it..

2024-04-26 23:07:24

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, 24 Apr 2024 15:08:42 -0700 Mina Almasry wrote:
> I'm unable to repro this issue. Do you need to do anything special? Or
> is 1 flow that hits skb_segment() good enough?

At some point we may want to start writing unit tests to make sure
we don't regress the 5 corner cases we found previously.. ;)
Should be easy to operate on skbs under kunit.

2024-04-26 23:09:11

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> - if (recycle && napi_pp_get_page(page))
> + if (napi_pp_get_page(page))

Pretty sure you can't do that. The "recycle" here is a concurrency
guarantee. A guarantee someone is holding a pp ref on that page,
a ref which will not go away while napi_pp_get_page() is executing.

2024-04-27 04:24:36

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <[email protected]> wrote:
>
> On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> > - if (recycle && napi_pp_get_page(page))
> > + if (napi_pp_get_page(page))
>
> Pretty sure you can't do that. The "recycle" here is a concurrency
> guarantee. A guarantee someone is holding a pp ref on that page,
> a ref which will not go away while napi_pp_get_page() is executing.

I don't mean to argue, but I think the get_page()/put_page() pair we
do in the page ref path is susceptible to the same issue. AFAIU it's
not safe to get_page() if another CPU can be dropping the last ref,
get_page_unless_zero() should be used instead.

Since get_page() is good in the page ref path without some guarantee,
it's not obvious to me why we need this guarantee in the pp ref path,
but I could be missing some subtlety. At any rate, if you prefer us
going down the road of reverting commit 2cc3aeb5eccc ("skbuff: Fix a
potential race while recycling page_pool packets"), I think that could
also fix the issue.

--
Thanks,
Mina

2024-04-29 07:39:28

by Dragos Tatulea

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > > The unref path always dropped a regular page
> > > ref, thanks to this commit as you point out:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> > >
> > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> > > ("skbuff: Fix a potential race while recycling page_pool packets").
> > > The reason is that now that skb_frag_ref() can grab page-pool refs, we
> > > don't need to make sure there is only 1 SKB that triggers the recycle
> > > path anymore. All the skb and its clones can obtain page-pool refs,
> > > and in the unref path we drop the page-pool refs. page_pool_put_page()
> > > detects correctly that the last page-pool ref is put and recycles the
> > > page only then.
> > >
> > I don't think this is a good way forward. For example, skb->pp_recycle is used
> > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> > flag states. This could interfere with that.
>
> That's a bit speculative, right? The simple invariant we are trying to
> hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
> reference skb is holding on that frag is a pp reference, not page
> reference.
>
Yes, it was a speculative statement. After re-reading it and the code of
skb_gro_receive() it makes less sense now.

Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race
while recycling page_pool packets") seems less scary now. I just hope we don't
bump into too many scenarios similar to the ipsec one...

> skb_gro_receive() needs to maintain that invariant, if it doesn't
> we need to fix it..
>

Thanks,
Dragos

2024-04-29 15:00:35

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Fri, 26 Apr 2024 21:24:09 -0700 Mina Almasry wrote:
> On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <[email protected]> wrote:
> >
> > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> > > - if (recycle && napi_pp_get_page(page))
> > > + if (napi_pp_get_page(page))
> >
> > Pretty sure you can't do that. The "recycle" here is a concurrency
> > guarantee. A guarantee someone is holding a pp ref on that page,
> > a ref which will not go away while napi_pp_get_page() is executing.
>
> I don't mean to argue, but I think the get_page()/put_page() pair we
> do in the page ref path is susceptible to the same issue. AFAIU it's
> not safe to get_page() if another CPU can be dropping the last ref,
> get_page_unless_zero() should be used instead.

Whoever gave us the pointer to operate on has a reference, so the page
can't disappear. get_page() is safe. The problem with pp is that we
don't know whether the caller has a pp ref or a page ref. IOW the pp
ref may not be owned by whoever called us.

I guess the situation may actually be worse: we can only pp-ref a
page if both the "source" and "destination" skbs have pp_recycle = 1 :S

> Since get_page() is good in the page ref path without some guarantee,
> it's not obvious to me why we need this guarantee in the pp ref path,
> but I could be missing some subtlety. At any rate, if you prefer us
> going down the road of reverting commit 2cc3aeb5eccc ("skbuff: Fix a
> potential race while recycling page_pool packets"), I think that could
> also fix the issue.

2024-05-01 06:20:59

by Dragos Tatulea

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote:
> On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > > > The unref path always dropped a regular page
> > > > ref, thanks to this commit as you point out:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> > > >
> > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> > > > ("skbuff: Fix a potential race while recycling page_pool packets").
> > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we
> > > > don't need to make sure there is only 1 SKB that triggers the recycle
> > > > path anymore. All the skb and its clones can obtain page-pool refs,
> > > > and in the unref path we drop the page-pool refs. page_pool_put_page()
> > > > detects correctly that the last page-pool ref is put and recycles the
> > > > page only then.
> > > >
> > > I don't think this is a good way forward. For example, skb->pp_recycle is used
> > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> > > flag states. This could interfere with that.
> >
> > That's a bit speculative, right? The simple invariant we are trying to
> > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
> > reference skb is holding on that frag is a pp reference, not page
> > reference.
> >
> Yes, it was a speculative statement. After re-reading it and the code of
> skb_gro_receive() it makes less sense now.
>
> Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race
> while recycling page_pool packets") seems less scary now. I just hope we don't
> bump into too many scenarios similar to the ipsec one...
>
> > skb_gro_receive() needs to maintain that invariant, if it doesn't
> > we need to fix it..
> >
>
Gentle ping. Not sure how to proceed with this:

1) Revert commit 2cc3aeb5eccc
("skbuff: Fix a potential race while recycling page_pool packets"). I tested
this btw and it works (for this specific scenario).

2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
for now.

Thanks,
Dragos

2024-05-01 07:49:24

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Tue, Apr 30, 2024 at 11:20 PM Dragos Tatulea <[email protected]> wrote:
>
> On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote:
> > On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> > > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > > > > The unref path always dropped a regular page
> > > > > ref, thanks to this commit as you point out:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> > > > >
> > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> > > > > ("skbuff: Fix a potential race while recycling page_pool packets").
> > > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we
> > > > > don't need to make sure there is only 1 SKB that triggers the recycle
> > > > > path anymore. All the skb and its clones can obtain page-pool refs,
> > > > > and in the unref path we drop the page-pool refs. page_pool_put_page()
> > > > > detects correctly that the last page-pool ref is put and recycles the
> > > > > page only then.
> > > > >
> > > > I don't think this is a good way forward. For example, skb->pp_recycle is used
> > > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> > > > flag states. This could interfere with that.
> > >
> > > That's a bit speculative, right? The simple invariant we are trying to
> > > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
> > > reference skb is holding on that frag is a pp reference, not page
> > > reference.
> > >
> > Yes, it was a speculative statement. After re-reading it and the code of
> > skb_gro_receive() it makes less sense now.
> >
> > Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race
> > while recycling page_pool packets") seems less scary now. I just hope we don't
> > bump into too many scenarios similar to the ipsec one...
> >
> > > skb_gro_receive() needs to maintain that invariant, if it doesn't
> > > we need to fix it..
> > >
> >
> Gentle ping. Not sure how to proceed with this:
>
> 1) Revert commit 2cc3aeb5eccc
> ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> this btw and it works (for this specific scenario).
>
> 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> for now.
>

I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
well. If we continue to run into edge cases after the revert of #1, I
think we may want to do #2 and I can look to reland it with the kunit
tests that Jakub suggested that reproduce these edge cases.

I can upload #1 in the morning if there are no objections. I don't see
any regressions with #1 but I was never able to repro this issue.

--
Thanks,
Mina

2024-05-01 07:58:20

by Dragos Tatulea

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, 2024-05-01 at 00:48 -0700, Mina Almasry wrote:
> On Tue, Apr 30, 2024 at 11:20 PM Dragos Tatulea <[email protected]> wrote:
> >
> > On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote:
> > > On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote:
> > > > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote:
> > > > > > The unref path always dropped a regular page
> > > > > > ref, thanks to this commit as you point out:
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> > > > > >
> > > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> > > > > > ("skbuff: Fix a potential race while recycling page_pool packets").
> > > > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we
> > > > > > don't need to make sure there is only 1 SKB that triggers the recycle
> > > > > > path anymore. All the skb and its clones can obtain page-pool refs,
> > > > > > and in the unref path we drop the page-pool refs. page_pool_put_page()
> > > > > > detects correctly that the last page-pool ref is put and recycles the
> > > > > > page only then.
> > > > > >
> > > > > I don't think this is a good way forward. For example, skb->pp_recycle is used
> > > > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle
> > > > > flag states. This could interfere with that.
> > > >
> > > > That's a bit speculative, right? The simple invariant we are trying to
> > > > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the
> > > > reference skb is holding on that frag is a pp reference, not page
> > > > reference.
> > > >
> > > Yes, it was a speculative statement. After re-reading it and the code of
> > > skb_gro_receive() it makes less sense now.
> > >
> > > Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race
> > > while recycling page_pool packets") seems less scary now. I just hope we don't
> > > bump into too many scenarios similar to the ipsec one...
> > >
> > > > skb_gro_receive() needs to maintain that invariant, if it doesn't
> > > > we need to fix it..
> > > >
> > >
> > Gentle ping. Not sure how to proceed with this:
> >
> > 1) Revert commit 2cc3aeb5eccc
> > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > this btw and it works (for this specific scenario).
> >
> > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > for now.
> >
>
> I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> well. If we continue to run into edge cases after the revert of #1, I
> think we may want to do #2 and I can look to reland it with the kunit
> tests that Jakub suggested that reproduce these edge cases.
>
> I can upload #1 in the morning if there are no objections. I don't see
> any regressions with #1 but I was never able to repro this issue.
>
Yes please. And once you post it we can also take it internally to check all the
other tests that were failing in the same place.

Thanks,
Dragos

2024-05-01 14:24:46

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, 1 May 2024 00:48:43 -0700 Mina Almasry wrote:
> > 1) Revert commit 2cc3aeb5eccc
> > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > this btw and it works (for this specific scenario).
> >
> > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > for now.
>
> I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> well.

I vote #2, actually :( Or #3: make the page pool ref safe to acquire
concurrently, but that plus fixing all the places where we do crazy
things may be tricky.

Even taking the ref is not as simple as using atomic_long_inc_not_zero()
sadly, partly because we try to keep the refcount at one, in an apparent
attempt to avoid dirtying the cache line twice.

So maybe a partial revert to stop the bleeding and retry after more testing
is the way to go?

I had a quick look at the code and there is also a bunch of functions
which "shift" frags from one skb to another, without checking whether
the pp_recycle state matches.

2024-05-01 14:28:51

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, 1 May 2024 07:24:34 -0700 Jakub Kicinski wrote:
> I vote #2, actually :( Or #3 make page pool ref safe to acquire
> concurrently, but that plus fixing all the places where we do crazy
> things may be tricky.
>
> Even taking the ref is not as simple as using atomic_long_inc_not_zero()
> sadly, partly because we try to keep the refcount at one, in an apparent
> attempt to avoid dirtying the cache line twice.
>
> So maybe a partial revert to stop the bleeding and retry after more testing
> is the way to go?
>
> I had a quick look at the code and there is also a bunch of functions
> which "shift" frags from one skb to another, without checking whether
> the pp_recycle state matches.

BTW these two refs seem to look at the wrong skb:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0c8b82750000..afd3336928d0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2148,7 +2148,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
}
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
- skb_frag_ref(skb, i);
+ skb_frag_ref(n, i);
}
skb_shinfo(n)->nr_frags = i;
}
@@ -5934,7 +5934,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
* since we set nr_frags to 0.
*/
for (i = 0; i < from_shinfo->nr_frags; i++)
- __skb_frag_ref(&from_shinfo->frags[i], from->pp_recycle);
+ __skb_frag_ref(&from_shinfo->frags[i], to->pp_recycle);

to->truesize += delta;
to->len += len;

2024-05-01 16:24:10

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Wed, May 1, 2024 at 7:24 AM Jakub Kicinski <[email protected]> wrote:
>
> On Wed, 1 May 2024 00:48:43 -0700 Mina Almasry wrote:
> > > 1) Revert commit 2cc3aeb5eccc
> > > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > > this btw and it works (for this specific scenario).
> > >
> > > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > > for now.
> >
> > I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> > well.
>
> I vote #2, actually :( Or #3 make page pool ref safe to acquire
> concurrently, but that plus fixing all the places where we do crazy
> things may be tricky.
>
> Even taking the ref is not as simple as using atomic_long_inc_not_zero()
> sadly, partly because we try to keep the refcount at one, in an apparent
> attempt to avoid dirtying the cache line twice.
>
> So maybe a partial revert to stop the bleeding and retry after more testing
> is the way to go?
>

OK, I will upload a revert sometime today.

> I had a quick look at the code and there is also a bunch of functions
> which "shift" frags from one skb to another, without checking whether
> the pp_recycle state matches.

You posted a diff; I will pick it up in a separate patch.

--
Thanks,
Mina

2024-05-02 20:09:54

by Mina Almasry

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Mon, Apr 29, 2024 at 8:00 AM Jakub Kicinski <[email protected]> wrote:
>
> On Fri, 26 Apr 2024 21:24:09 -0700 Mina Almasry wrote:
> > On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <[email protected]> wrote:
> > >
> > > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> > > > - if (recycle && napi_pp_get_page(page))
> > > > + if (napi_pp_get_page(page))
> > >
> > > Pretty sure you can't do that. The "recycle" here is a concurrency
> > > guarantee. A guarantee someone is holding a pp ref on that page,
> > > a ref which will not go away while napi_pp_get_page() is executing.
> >
> > I don't mean to argue, but I think the get_page()/put_page() pair we
> > do in the page ref path is susceptible to the same issue. AFAIU it's
> > not safe to get_page() if another CPU can be dropping the last ref,
> > get_page_unless_zero() should be used instead.
>

I uploaded a revert for review, but to reland I perhaps need to
understand a bit more the concern here. AFAICT that diff you're
responding to is safe and it works very well with devmem so it would
be my preferred approach to reland (but there are other options if you
are convinced it's bad). FWIW my thoughts:

> Whoever gave us the pointer to operate on has a reference, so the page
> can't disappear. get_page() is safe.

Agreed.

> The problem with pp is that we
> don't know whether the caller has a pp ref or a page ref. IOW the pp
> ref may not be owned by whoever called us.
>

OK, this is where I'm not sure anymore. The diff you're replying to
attempts to enforce the invariant: "if anyone wants a reference on an
skb_frag, skb_frag_ref will be a pp ref on pp frags
(is_pp_page==true), and page refs on non-pp frags
(is_pp_page==false)".

Additionally the page doesn't transition from pp to non-pp and vice
versa while anyone is holding a pp ref, because
page_pool_set_pp_info() is called right after the page is obtained
from the buddy allocator (before released from the page pool) and
page_pool_clear_pp_info() is called only after all the pp refs are
dropped.

So:

1. We know the caller has a ref (otherwise get_page() wouldn't be safe
in the non-pp case).
2. We know that the page has not transitioned from pp to non-pp or
vice versa since the caller obtained the ref (from code inspection, pp
info is not changed until all the refs are dropped for pp pages).
3. AFAICT, it follows that if the page is pp, then the caller has a pp
ref, and if the page is non-pp, then the caller has a page ref.
4. So, if is_pp_page==true, then the caller has a pp ref, then
napi_pp_get_page() should be concurrently safe.

AFAICT the only way my mental model is broken is if there is code
doing a raw get_page() rather than a skb_frag_ref() in core net stack.
I would like to get rid of these call sites if they exist. They would
not interact well with devmem I think (but could be made to work with
some effort).

--
Thanks,
Mina

2024-05-03 00:48:04

by Jakub Kicinski

Subject: Re: [RFC PATCH] net: Fix one page_pool page leak from skb_frag_unref

On Thu, 2 May 2024 13:08:23 -0700 Mina Almasry wrote:
> OK, this is where I'm not sure anymore. The diff you're replying to
> attempts to enforce the invariant: "if anyone wants a reference on an
> skb_frag, skb_frag_ref will be a pp ref on pp frags
> (is_pp_page==true), and page refs on non-pp frags
> (is_pp_page==false)".
>
> Additionally the page doesn't transition from pp to non-pp and vice
> versa while anyone is holding a pp ref, because
> page_pool_set_pp_info() is called right after the page is obtained
> from the buddy allocator (before released from the page pool) and
> page_pool_clear_pp_info() is called only after all the pp refs are
> dropped.
>
> So:
>
> 1. We know the caller has a ref (otherwise get_page() wouldn't be safe
> in the non-pp case).
> 2. We know that the page has not transitioned from pp to non-pp or
> vice versa since the caller obtained the ref (from code inspection, pp

How do we know that?

> info is not changed until all the refs are dropped for pp pages).
> 3. AFAICT, it follows that if the page is pp, then the caller has a pp
> ref, and if the page is non-pp, then the caller has a page ref.

If that's true we have nothing to worry about.

> 4. So, if is_pp_page==true, then the caller has a pp ref, then
> napi_pp_get_page() should be concurrently safe.
>
> AFAICT the only way my mental model is broken is if there is code
> doing a raw get_page() rather than a skb_frag_ref() in core net stack.

Not just. We also get pages fed from the outside, which may be PP pages.
Right now it'd be okay because "from outside" pages would end up in Tx.
Tx always allocates skbs with ->pp_recycle = 0, so we'll hold full refs.

> I would like to get rid of these call sites if they exist. They would
> not interact well with devmem I think (but could be made to work with
> some effort).

Maybe if we convert more code to operate on netmem_ref it will become
clearer where raw pages get fed in, and therefore where we have to be
very careful?