2023-05-23 01:00:59

by Peilin Ye

Subject: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
clsact Qdiscs and miniq_egress.
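
For reference, a condensed sketch of this setup (simplified from
include/net/sch_generic.h and net/sched/sch_ingress.c; not a verbatim
quote):

	struct mini_Qdisc_pair {
		struct mini_Qdisc miniq1;
		struct mini_Qdisc miniq2;
		struct mini_Qdisc __rcu **p_miniq;
	};

	/* In ingress_init(), with q = qdisc_priv(sch), p_miniq is aimed
	 * at the per-net_device slot:
	 */
	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);

	/* mini_qdisc_pair_swap() later publishes one of the pair's mini
	 * Qdiscs through that shared slot, so two ingress Qdiscs on the
	 * same device end up writing to the same pointer:
	 */
	rcu_assign_pointer(*miniqp->p_miniq, miniq);	/* or NULL on destroy */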

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other. In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state. Fixing without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions). Instead, make sure
@new's first call always happens after @old's last call in
qdisc_destroy() has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
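
For context, this is the existing RTNL-unlocked helper that the new
qdisc_refcount_dec_if_one() pairs with, as found in
include/net/sch_generic.h:

	static inline struct Qdisc *qdisc_refcount_inc_nz(struct Qdisc *qdisc)
	{
		if (qdisc->flags & TCQ_F_BUILTIN)
			return qdisc;
		if (refcount_inc_not_zero(&qdisc->refcnt))
			return qdisc;
		return NULL;
	}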

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0, which has 8 transmission queues:

Thread 1 creates ingress Qdisc A (containing mini Qdiscs a1 and a2), then
adds a flower filter X to A.

Thread 2 creates another ingress Qdisc B (containing mini Qdiscs b1 and
b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt  Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
           |
           |                         RTM_NEWQDISC (B, RTNL-locked)
        RCU sync                2     qdisc_graft(B)
           |                    1      notify_and_destroy(A)
           |
   tcf_block_release(A)         0    RTM_NEWTFILTER (Y, RTNL-unlocked)
    qdisc_destroy(A)                  tcf_chain0_head_change(B)
     tcf_chain0_head_change_cb_del(A)  mini_qdisc_pair_swap(B) (2nd)
     mini_qdisc_pair_swap(A) (3rd)        |
          ...                            ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().
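
For reference, sch_handle_ingress() reaches filters only through that
per-device pointer (condensed from net/core/dev.c of this era):

	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);

	if (!miniq)
		return skb;	/* pointer was NULLed: filter Y never runs */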

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers. The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: [email protected]
Closes: https://lore.kernel.org/r/[email protected]/
Cc: Hillf Danton <[email protected]>
Cc: Vlad Buslov <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
---
changes in v2:
- replay the request if the current Qdisc has any ongoing RTNL-unlocked
filter requests (Vlad)
- minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 32 ++++++++++++++++++++++++++------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..b3bafa6c1b44 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* Replay if the current ingress (or clsact) Qdisc has ongoing
+			 * RTNL-unlocked filter request(s). This is the counterpart of that
+			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+				return -EAGAIN;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
@@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
--
2.20.1



2023-05-23 04:03:38

by Pedro Tammela

Subject: Re: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On 22/05/2023 20:55, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc [...]


Hi Peilin!

With V2 patches 5 and 6 applied I was still able to trigger an oops.

Branch is 'net' + patches 5 & 6:
145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
clsact Qdiscs before grafting
1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
allocation

Kernel config is the same as in the syzbot report.
Note that this was on a _single core_ VM.
I will double check if v1 is triggering this issue (basically run the
repro for a long time). For multi-core my VM is running OOM even on a
32Gb system. I will check if we have a spare server to run the repro.

[ 695.782780][T12033]
==================================================================
[ 695.783617][T12033] BUG: KASAN: slab-use-after-free in
mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.784323][T12033] Write of size 8 at addr ffff888060cafb08 by task
repro/12033
[ 695.784996][T12033]
[ 695.785210][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted
6.4.0-rc2-00187-g145f639b9403 #1
[ 695.785981][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 695.786883][T12033] Call Trace:
[ 695.787178][T12033] <TASK>
[ 695.787444][T12033] dump_stack_lvl+0xd9/0x1b0
[ 695.787871][T12033] print_report+0xc4/0x5f0
[ 695.788283][T12033] ? __virt_addr_valid+0x5e/0x2d0
[ 695.788736][T12033] ? __phys_addr+0xc6/0x140
[ 695.789138][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] kasan_report+0xc0/0xf0
[ 695.789604][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] mini_qdisc_pair_swap+0x1c2/0x1f0
[ 695.789604][T12033] ? ingress_init+0x1c0/0x1c0
[ 695.789604][T12033] tcf_chain0_head_change.isra.0+0xb9/0x120
[ 695.789604][T12033] tc_new_tfilter+0x1ebb/0x22b0
[ 695.789604][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 695.789604][T12033] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 695.789604][T12033] ? kasan_quarantine_put+0x102/0x230
[ 695.789604][T12033] ? lockdep_hardirqs_on+0x7d/0x100
[ 695.789604][T12033] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? bpf_lsm_capable+0x9/0x10
[ 695.789604][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] ? rtnl_getlink+0xb10/0xb10
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? netdev_core_pick_tx+0x390/0x390
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] ? rtnl_getlink+0xb10/0xb10
[ 695.789604][T12033] ? netlink_ack+0x1370/0x1370
[ 695.789604][T12033] ? kasan_set_track+0x25/0x30
[ 695.789604][T12033] ? netlink_deliver_tap+0x1b1/0xd00
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] ? netlink_attachskb+0x880/0x880
[ 695.789604][T12033] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 695.789604][T12033] ? __phys_addr_symbol+0x30/0x70
[ 695.789604][T12033] ? __check_object_size+0x323/0x740
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] ? netlink_unicast+0x800/0x800
[ 695.789604][T12033] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 695.789604][T12033] ? netlink_unicast+0x800/0x800
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ? kernel_sendmsg+0x50/0x50
[ 695.789604][T12033] ? __copy_msghdr+0x460/0x460
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] ? do_recvmmsg+0x700/0x700
[ 695.789604][T12033] ? find_held_lock+0x2d/0x110
[ 695.789604][T12033] ? __might_fault+0xe5/0x190
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 695.789604][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 695.789604][T12033] ? rcu_is_watching+0x12/0xb0
[ 695.789604][T12033] ? xfd_validate_state+0x5d/0x180
[ 695.789604][T12033] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 695.789604][T12033] ? unlock_page_memcg+0x2d0/0x2d0
[ 695.789604][T12033] ? do_futex+0x350/0x350
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] ? syscall_enter_from_user_mode+0x26/0x80
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033] RIP: 0033:0x7f4aca44a89d
[ 695.789604][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 695.789604][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 695.789604][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX:
00007f4aca44a89d
[ 695.789604][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 695.789604][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09:
0000000000000000
[ 695.789604][T12033] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 695.789604][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15:
00007f4aca2cf000
[ 695.789604][T12033] </TASK>
[ 695.789604][T12033]
[ 695.789604][T12033] Allocated by task 12031:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] kasan_set_track+0x25/0x30
[ 695.789604][T12033] __kasan_kmalloc+0xa2/0xb0
[ 695.789604][T12033] __kmalloc_node+0x60/0x100
[ 695.789604][T12033] qdisc_alloc+0xb3/0xa90
[ 695.789604][T12033] qdisc_create+0xcf/0x1020
[ 695.789604][T12033] tc_modify_qdisc+0x495/0x1ab0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x439/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033]
[ 695.789604][T12033] Freed by task 15:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] kasan_set_track+0x25/0x30
[ 695.789604][T12033] kasan_save_free_info+0x2e/0x40
[ 695.789604][T12033] ____kasan_slab_free+0x15e/0x1b0
[ 695.789604][T12033] slab_free_freelist_hook+0x10b/0x1e0
[ 695.789604][T12033] __kmem_cache_free+0xaf/0x2e0
[ 695.789604][T12033] rcu_core+0x7f7/0x1ac0
[ 695.789604][T12033] __do_softirq+0x1d8/0x8fd
[ 695.789604][T12033]
[ 695.789604][T12033] Last potentially related work creation:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] __kasan_record_aux_stack+0xbf/0xd0
[ 695.789604][T12033] __call_rcu_common.constprop.0+0x9a/0x790
[ 695.789604][T12033] qdisc_put_unlocked+0x74/0x90
[ 695.789604][T12033] tcf_block_release+0x90/0xa0
[ 695.789604][T12033] tc_new_tfilter+0xa5e/0x22b0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 695.789604][T12033] __sys_sendmmsg+0x18e/0x430
[ 695.789604][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 695.789604][T12033] do_syscall_64+0x38/0xb0
[ 695.789604][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 695.789604][T12033]
[ 695.789604][T12033] Second to last potentially related work creation:
[ 695.789604][T12033] kasan_save_stack+0x20/0x40
[ 695.789604][T12033] __kasan_record_aux_stack+0xbf/0xd0
[ 695.789604][T12033] __call_rcu_common.constprop.0+0x9a/0x790
[ 695.789604][T12033] rht_deferred_worker+0x10fd/0x2010
[ 695.789604][T12033] process_one_work+0x9f9/0x15f0
[ 695.789604][T12033] worker_thread+0x687/0x1110
[ 695.789604][T12033] kthread+0x334/0x430
[ 695.789604][T12033] ret_from_fork+0x1f/0x30
[ 695.789604][T12033]
[ 695.789604][T12033] The buggy address belongs to the object at
ffff888060caf800
[ 695.789604][T12033] which belongs to the cache kmalloc-1k of size 1024
[ 695.789604][T12033] The buggy address is located 776 bytes inside of
[ 695.789604][T12033] freed 1024-byte region [ffff888060caf800,
ffff888060cafc00)
[ 695.789604][T12033]
[ 695.789604][T12033] The buggy address belongs to the physical page:
[ 695.789604][T12033] page:ffffea0001832b00 refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x60cac
[ 695.789604][T12033] head:ffffea0001832b00 order:2 entire_mapcount:0
nr_pages_mapped:0 pincount:0
[ 695.789604][T12033] flags:
0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[ 695.789604][T12033] page_type: 0xffffffff()
[ 695.789604][T12033] raw: 00fff00000010200 ffff888012441dc0
ffffea0000bac600 dead000000000002
[ 695.789604][T12033] raw: 0000000000000000 0000000000080008
00000001ffffffff 0000000000000000
[ 695.789604][T12033] page dumped because: kasan: bad access detected
[ 695.789604][T12033] page_owner tracks the page as allocated
[ 695.789604][T12033] page last allocated via order 2, migratetype
Unmovable, gfp_mask
0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC),
pid 13324, tgid 13318 (repro), ts 170262603079, free_ts 0
[ 695.789604][T12033] get_page_from_freelist+0xe71/0x2e80
[ 695.789604][T12033] __alloc_pages+0x1c8/0x4a0
[ 695.789604][T12033] alloc_pages+0x1a9/0x270
[ 695.789604][T12033] allocate_slab+0x24e/0x380
[ 695.789604][T12033] ___slab_alloc+0x89a/0x1400
[ 695.789604][T12033] __slab_alloc.constprop.0+0x56/0xa0
[ 695.789604][T12033] __kmem_cache_alloc_node+0x126/0x330
[ 695.789604][T12033] kmalloc_trace+0x25/0xe0
[ 695.789604][T12033] fl_change+0x1b3/0x51e0
[ 695.789604][T12033] tc_new_tfilter+0x992/0x22b0
[ 695.789604][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 695.789604][T12033] netlink_rcv_skb+0x166/0x440
[ 695.789604][T12033] netlink_unicast+0x530/0x800
[ 695.789604][T12033] netlink_sendmsg+0x90b/0xe10
[ 695.789604][T12033] sock_sendmsg+0xd9/0x180
[ 695.789604][T12033] ____sys_sendmsg+0x264/0x910
[ 695.789604][T12033] page_owner free stack trace missing
[ 695.789604][T12033]
[ 695.789604][T12033] Memory state around the buggy address:
[ 695.789604][T12033] ffff888060cafa00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ffff888060cafa80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] >ffff888060cafb00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ^
[ 695.789604][T12033] ffff888060cafb80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 695.789604][T12033] ffff888060cafc00: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 695.789604][T12033]
==================================================================
[ 695.996261][T12042] __nla_validate_parse: 32 callbacks suppressed
[ 695.996271][T12042] netlink: 24 bytes leftover after parsing
attributes in process `repro'.
[ 696.473670][T12046] netlink: 24 bytes leftover after parsing
attributes in process `repro'.
[ 696.660496][T12033] Kernel panic - not syncing: KASAN: panic_on_warn
set ...
[ 696.661250][T12033] CPU: 0 PID: 12033 Comm: repro Not tainted
6.4.0-rc2-00187-g145f639b9403 #1
[ 696.661947][T12033] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 696.662768][T12033] Call Trace:
[ 696.663031][T12033] <TASK>
[ 696.663268][T12033] dump_stack_lvl+0xd9/0x1b0
[ 696.663659][T12033] panic+0x689/0x730
[ 696.663977][T12033] ? panic_smp_self_stop+0xa0/0xa0
[ 696.664396][T12033] ? preempt_schedule_thunk+0x1a/0x20
[ 696.664829][T12033] ? preempt_schedule_common+0x45/0xc0
[ 696.665263][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.665698][T12033] check_panic_on_warn+0xab/0xb0
[ 696.666087][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.666519][T12033] end_report+0xe9/0x120
[ 696.666861][T12033] kasan_report+0xcd/0xf0
[ 696.667209][T12033] ? mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.667639][T12033] mini_qdisc_pair_swap+0x1c2/0x1f0
[ 696.668064][T12033] ? ingress_init+0x1c0/0x1c0
[ 696.668451][T12033] tcf_chain0_head_change.isra.0+0xb9/0x120
[ 696.668964][T12033] tc_new_tfilter+0x1ebb/0x22b0
[ 696.669391][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 696.669608][T12033] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 696.669608][T12033] ? kasan_quarantine_put+0x102/0x230
[ 696.669608][T12033] ? lockdep_hardirqs_on+0x7d/0x100
[ 696.669608][T12033] ? rtnetlink_rcv_msg+0x94a/0xd30
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? bpf_lsm_capable+0x9/0x10
[ 696.669608][T12033] ? tc_del_tfilter+0x1570/0x1570
[ 696.669608][T12033] rtnetlink_rcv_msg+0x98a/0xd30
[ 696.669608][T12033] ? rtnl_getlink+0xb10/0xb10
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? netdev_core_pick_tx+0x390/0x390
[ 696.669608][T12033] netlink_rcv_skb+0x166/0x440
[ 696.669608][T12033] ? rtnl_getlink+0xb10/0xb10
[ 696.669608][T12033] ? netlink_ack+0x1370/0x1370
[ 696.669608][T12033] ? kasan_set_track+0x25/0x30
[ 696.669608][T12033] ? netlink_deliver_tap+0x1b1/0xd00
[ 696.669608][T12033] netlink_unicast+0x530/0x800
[ 696.669608][T12033] ? netlink_attachskb+0x880/0x880
[ 696.669608][T12033] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 696.669608][T12033] ? __phys_addr_symbol+0x30/0x70
[ 696.669608][T12033] ? __check_object_size+0x323/0x740
[ 696.669608][T12033] netlink_sendmsg+0x90b/0xe10
[ 696.669608][T12033] ? netlink_unicast+0x800/0x800
[ 696.669608][T12033] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 696.669608][T12033] ? netlink_unicast+0x800/0x800
[ 696.669608][T12033] sock_sendmsg+0xd9/0x180
[ 696.669608][T12033] ____sys_sendmsg+0x264/0x910
[ 696.669608][T12033] ? kernel_sendmsg+0x50/0x50
[ 696.669608][T12033] ? __copy_msghdr+0x460/0x460
[ 696.669608][T12033] ___sys_sendmsg+0x11d/0x1b0
[ 696.669608][T12033] ? do_recvmmsg+0x700/0x700
[ 696.669608][T12033] ? find_held_lock+0x2d/0x110
[ 696.669608][T12033] ? __might_fault+0xe5/0x190
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] __sys_sendmmsg+0x18e/0x430
[ 696.669608][T12033] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 696.669608][T12033] ? reacquire_held_locks+0x4b0/0x4b0
[ 696.669608][T12033] ? rcu_is_watching+0x12/0xb0
[ 696.669608][T12033] ? xfd_validate_state+0x5d/0x180
[ 696.669608][T12033] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 696.669608][T12033] ? unlock_page_memcg+0x2d0/0x2d0
[ 696.669608][T12033] ? do_futex+0x350/0x350
[ 696.669608][T12033] __x64_sys_sendmmsg+0x9c/0x100
[ 696.669608][T12033] ? syscall_enter_from_user_mode+0x26/0x80
[ 696.669608][T12033] do_syscall_64+0x38/0xb0
[ 696.669608][T12033] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 696.669608][T12033] RIP: 0033:0x7f4aca44a89d
[ 696.669608][T12033] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 696.669608][T12033] RSP: 002b:00007f4aca2eec68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 696.669608][T12033] RAX: ffffffffffffffda RBX: 00007f4aca2efcdc RCX:
00007f4aca44a89d
[ 696.669608][T12033] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 696.669608][T12033] RBP: 00007f4aca2eede0 R08: 0000000000000000 R09:
0000000000000000
[ 696.669608][T12033] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 696.669608][T12033] R13: 000000000000006e R14: 00007ffd1a53f720 R15:
00007f4aca2cf000
[ 696.669608][T12033] </TASK>
[ 696.669608][T12033] Kernel Offset: disabled
[ 696.669608][T12033] Rebooting in 86400 seconds..





2023-05-23 05:01:02

by Peilin Ye

Subject: Re: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Hi Pedro,

On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote:
> With V2 patches 5 and 6 applied I was still able to trigger an oops.
>
> Branch is 'net' + patches 5 & 6:
> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
> clsact Qdiscs before grafting
> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
> allocation
>
> Kernel config is the same as in the syzbot report.
> Note that this was on a _single core_ VM.
> I will double check if v1 is triggering this issue (basically run the repro
> for a long time). For multi-core my VM is running OOM even on a 32Gb system.
> I will check if we have a spare server to run the repro.

Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
"!ingress" path in qdisc_graft(). I think that's why you are still seeing
the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
after [1,2/6], and I think we'll need a different reproducer for [5,6/6].
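
Schematically, after the [5/6] refactor that [6/6] depends on,
qdisc_graft() does (condensed, not verbatim; see the [6/6] diff for the
real code):

	if (!ingress) {
		/* TC_H_ROOT etc.: graft @new onto every TX queue.  The
		 * syzbot repro lands here, and [6/6] leaves this path alone.
		 */
		for (i = 0; i < num_q; i++) {
			dev_queue = netdev_get_tx_queue(dev, i);
			old = dev_graft_qdisc(dev_queue, new);
			...
		}
	} else {
		/* TC_H_INGRESS: the path that [6/6] changes */
		...
	}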

However I just noticed that for some reason my git-send-email in my new
setup didn't auto-generate From: tags with my work email, so Author: will
be my personal email (I have to send patches from personal email to avoid
"[External] " subject prefixes) ... I will fix it in v3 soon. Sorry in
advance for spamming.

Thanks,
Peilin Ye


2023-05-23 11:45:20

by Pedro Tammela

Subject: Re: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On 23/05/2023 01:40, Peilin Ye wrote:
> Hi Pedro,
>
> On Tue, May 23, 2023 at 12:51:44AM -0300, Pedro Tammela wrote:
>> With V2 patches 5 and 6 applied I was still able to trigger an oops.
>>
>> Branch is 'net' + patches 5 & 6:
>> 145f639b9403 (HEAD -> main) net/sched: qdisc_destroy() old ingress and
>> clsact Qdiscs before grafting
>> 1aac74ef9673 net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
>> 18c40a1cc1d9 (origin/main, origin/HEAD) net/handshake: Fix sock->file
>> allocation
>>
>> Kernel config is the same as in the syzbot report.
>> Note that this was on a _single core_ VM.
>> I will double check if v1 is triggering this issue (basically run the repro
>> for a long time). For multi-core my VM is running OOM even on a 32Gb system.
>> I will check if we have a spare server to run the repro.
>
> Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
> under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
> "!ingress" path in qdisc_graft(). I think that's why you are still seeing
> the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
> after [1,2/6], and I think we'll need a different reproducer for [5,6/6].
>
> However I just noticed that for some reason my git-send-email in my new
> setup didn't auto-generate From: tags with my work email, so Author: will
> be my personal email (I have to send patches from personal email to avoid
> "[External] " subject prefixes) ... I will fix it in v3 soon. Sorry in
> advance for spamming.
>
> Thanks,
> Peilin Ye
>

I see.
We need to make sure then, when the time comes, that all the required
patches are backported in the same bundle so we don't end up with a
partial fix, given that their Fixes: tags target different commits.

I was still able to trigger an oops with the full patchset:

[ 104.944353][ T6588] ------------[ cut here ]------------
[ 104.944896][ T6588] jump label: negative count!
[ 104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at
kernel/jump_label.c:263 static_key_slow_try_dec+0xf2/0x110
[ 104.946795][ T6588] Modules linked in:
[ 104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted
6.4.0-rc2-00191-g4a3f9100193d #3
[ 104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
[ 104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c
41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8
4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66
2e 0f 1f
[ 104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
[ 104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX:
0000000000000000
[ 104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI:
0000000000000001
[ 104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09:
0000000000000000
[ 104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12:
00000000ffffffff
[ 104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15:
ffffffff8e7b0680
[ 104.954746][ T6588] FS: 00007f76c65d56c0(0000)
GS:ffff8881f5a00000(0000) knlGS:0000000000000000
[ 104.955430][ T6588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4:
0000000000350ef0
[ 104.956559][ T6588] Call Trace:
[ 104.956829][ T6588] <TASK>
[ 104.957062][ T6588] ? clsact_egress_block_get+0x40/0x40
[ 104.957507][ T6588] static_key_slow_dec+0x60/0xc0
[ 104.957906][ T6588] qdisc_create+0xa45/0x1090
[ 104.958274][ T6588] ? tc_get_qdisc+0xb70/0xb70
[ 104.958646][ T6588] tc_modify_qdisc+0x491/0x1b70
[ 104.959031][ T6588] ? qdisc_create+0x1090/0x1090
[ 104.959420][ T6588] ? bpf_lsm_capable+0x9/0x10
[ 104.959797][ T6588] ? qdisc_create+0x1090/0x1090
[ 104.960175][ T6588] rtnetlink_rcv_msg+0x439/0xd30
[ 104.960625][ T6588] ? rtnl_getlink+0xb10/0xb10
[ 104.960995][ T6588] ? __x64_sys_sendmmsg+0x9c/0x100
[ 104.961397][ T6588] ? do_syscall_64+0x38/0xb0
[ 104.961767][ T6588] ? netdev_core_pick_tx+0x390/0x390
[ 104.962195][ T6588] netlink_rcv_skb+0x166/0x440
[ 104.962584][ T6588] ? rtnl_getlink+0xb10/0xb10
[ 104.962954][ T6588] ? netlink_ack+0x1370/0x1370
[ 104.963336][ T6588] ? kasan_set_track+0x25/0x30
[ 104.963731][ T6588] ? netlink_deliver_tap+0x1b1/0xd00
[ 104.964156][ T6588] netlink_unicast+0x530/0x800
[ 104.964537][ T6588] ? netlink_attachskb+0x880/0x880
[ 104.964951][ T6588] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 104.965405][ T6588] ? __phys_addr_symbol+0x30/0x70
[ 104.965793][ T6588] ? __check_object_size+0x323/0x740
[ 104.966205][ T6588] netlink_sendmsg+0x90b/0xe10
[ 104.966577][ T6588] ? netlink_unicast+0x800/0x800
[ 104.966965][ T6588] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 104.967374][ T6588] ? netlink_unicast+0x800/0x800
[ 104.967761][ T6588] sock_sendmsg+0xd9/0x180
[ 104.968109][ T6588] ____sys_sendmsg+0x264/0x910
[ 104.968490][ T6588] ? kernel_sendmsg+0x50/0x50
[ 104.968871][ T6588] ? __copy_msghdr+0x460/0x460
[ 104.969244][ T6588] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 104.969716][ T6588] ? find_held_lock+0x2d/0x110
[ 104.970088][ T6588] ___sys_sendmsg+0x11d/0x1b0
[ 104.970491][ T6588] ? do_recvmmsg+0x700/0x700
[ 104.970853][ T6588] ? __fget_files+0x260/0x420
[ 104.971221][ T6588] ? reacquire_held_locks+0x4b0/0x4b0
[ 104.971649][ T6588] ? __fget_files+0x282/0x420
[ 104.972018][ T6588] ? __fget_light+0xe6/0x270
[ 104.972384][ T6588] __sys_sendmmsg+0x18e/0x430
[ 104.972766][ T6588] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 104.973167][ T6588] ? reacquire_held_locks+0x4b0/0x4b0
[ 104.973588][ T6588] ? rcu_is_watching+0x12/0xb0
[ 104.973958][ T6588] ? xfd_validate_state+0x5d/0x180
[ 104.974355][ T6588] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 104.974796][ T6588] ? unlock_page_memcg+0x2d0/0x2d0
[ 104.975178][ T6588] ? do_futex+0x350/0x350
[ 104.975501][ T6588] __x64_sys_sendmmsg+0x9c/0x100
[ 104.975867][ T6588] ? syscall_enter_from_user_mode+0x26/0x80
[ 104.976307][ T6588] do_syscall_64+0x38/0xb0
[ 104.976649][ T6588] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 104.977088][ T6588] RIP: 0033:0x7f76c66ee89d
[ 104.977417][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 104.978835][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 104.979468][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX:
00007f76c66ee89d
[ 104.980073][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 104.980739][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09:
0000000000000000
[ 104.981352][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 104.981968][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15:
00007f76c65b5000
[ 104.982587][ T6588] </TASK>
[ 104.982833][ T6588] Kernel panic - not syncing: kernel: panic_on_warn
set ...
[ 104.983391][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted
6.4.0-rc2-00191-g4a3f9100193d #3
[ 104.984054][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 104.984854][ T6588] Call Trace:
[ 104.985121][ T6588] <TASK>
[ 104.985359][ T6588] dump_stack_lvl+0xd9/0x1b0
[ 104.985733][ T6588] panic+0x689/0x730
[ 104.986053][ T6588] ? panic_smp_self_stop+0xa0/0xa0
[ 104.986467][ T6588] ? show_trace_log_lvl+0x28e/0x400
[ 104.986888][ T6588] ? static_key_slow_try_dec+0xf2/0x110
[ 104.987329][ T6588] check_panic_on_warn+0xab/0xb0
[ 104.987729][ T6588] __warn+0xf2/0x380
[ 104.988046][ T6588] ? static_key_slow_try_dec+0xf2/0x110
[ 104.988476][ T6588] report_bug+0x3bc/0x580
[ 104.988838][ T6588] handle_bug+0x3c/0x70
[ 104.989169][ T6588] exc_invalid_op+0x17/0x40
[ 104.989529][ T6588] asm_exc_invalid_op+0x1a/0x20
[ 104.989914][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
[ 104.990402][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c
41 5d c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8
4e ce 9c ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66
2e 0f 1f
[ 104.990702][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
[ 104.990702][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX:
0000000000000000
[ 104.990702][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI:
0000000000000001
[ 104.990702][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09:
0000000000000000
[ 104.990702][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12:
00000000ffffffff
[ 104.990702][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15:
ffffffff8e7b0680
[ 104.990702][ T6588] ? __warn_printk+0x187/0x310
[ 104.990702][ T6588] ? clsact_egress_block_get+0x40/0x40
[ 104.990702][ T6588] static_key_slow_dec+0x60/0xc0
[ 104.990702][ T6588] qdisc_create+0xa45/0x1090
[ 104.990702][ T6588] ? tc_get_qdisc+0xb70/0xb70
[ 104.990702][ T6588] tc_modify_qdisc+0x491/0x1b70
[ 104.990702][ T6588] ? qdisc_create+0x1090/0x1090
[ 104.990702][ T6588] ? bpf_lsm_capable+0x9/0x10
[ 104.990702][ T6588] ? qdisc_create+0x1090/0x1090
[ 104.990702][ T6588] rtnetlink_rcv_msg+0x439/0xd30
[ 104.990702][ T6588] ? rtnl_getlink+0xb10/0xb10
[ 104.990702][ T6588] ? __x64_sys_sendmmsg+0x9c/0x100
[ 104.990702][ T6588] ? do_syscall_64+0x38/0xb0
[ 104.990702][ T6588] ? netdev_core_pick_tx+0x390/0x390
[ 104.990702][ T6588] netlink_rcv_skb+0x166/0x440
[ 104.990702][ T6588] ? rtnl_getlink+0xb10/0xb10
[ 104.990702][ T6588] ? netlink_ack+0x1370/0x1370
[ 104.990702][ T6588] ? kasan_set_track+0x25/0x30
[ 104.990702][ T6588] ? netlink_deliver_tap+0x1b1/0xd00
[ 104.990702][ T6588] netlink_unicast+0x530/0x800
[ 104.990702][ T6588] ? netlink_attachskb+0x880/0x880
[ 104.990702][ T6588] ? __sanitizer_cov_trace_switch+0x54/0x90
[ 104.990702][ T6588] ? __phys_addr_symbol+0x30/0x70
[ 104.990702][ T6588] ? __check_object_size+0x323/0x740
[ 104.990702][ T6588] netlink_sendmsg+0x90b/0xe10
[ 104.990702][ T6588] ? netlink_unicast+0x800/0x800
[ 104.990702][ T6588] ? bpf_lsm_socket_sendmsg+0x9/0x10
[ 104.990702][ T6588] ? netlink_unicast+0x800/0x800
[ 104.990702][ T6588] sock_sendmsg+0xd9/0x180
[ 104.990702][ T6588] ____sys_sendmsg+0x264/0x910
[ 104.990702][ T6588] ? kernel_sendmsg+0x50/0x50
[ 104.990702][ T6588] ? __copy_msghdr+0x460/0x460
[ 104.990702][ T6588] ? lockdep_hardirqs_on_prepare+0x410/0x410
[ 104.990702][ T6588] ? find_held_lock+0x2d/0x110
[ 104.990702][ T6588] ___sys_sendmsg+0x11d/0x1b0
[ 104.990702][ T6588] ? do_recvmmsg+0x700/0x700
[ 104.990702][ T6588] ? __fget_files+0x260/0x420
[ 104.990702][ T6588] ? reacquire_held_locks+0x4b0/0x4b0
[ 104.990702][ T6588] ? __fget_files+0x282/0x420
[ 104.990702][ T6588] ? __fget_light+0xe6/0x270
[ 104.990702][ T6588] __sys_sendmmsg+0x18e/0x430
[ 104.990702][ T6588] ? __ia32_sys_sendmsg+0xb0/0xb0
[ 104.990702][ T6588] ? reacquire_held_locks+0x4b0/0x4b0
[ 104.990702][ T6588] ? rcu_is_watching+0x12/0xb0
[ 104.990702][ T6588] ? xfd_validate_state+0x5d/0x180
[ 104.990702][ T6588] ? restore_fpregs_from_fpstate+0xc1/0x1d0
[ 104.990702][ T6588] ? unlock_page_memcg+0x2d0/0x2d0
[ 104.990702][ T6588] ? do_futex+0x350/0x350
[ 104.990702][ T6588] __x64_sys_sendmmsg+0x9c/0x100
[ 104.990702][ T6588] ? syscall_enter_from_user_mode+0x26/0x80
[ 104.990702][ T6588] do_syscall_64+0x38/0xb0
[ 104.990702][ T6588] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 104.990702][ T6588] RIP: 0033:0x7f76c66ee89d
[ 104.990702][ T6588] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3
0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c
24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64
89 01 48
[ 104.990702][ T6588] RSP: 002b:00007f76c65d4c68 EFLAGS: 00000203
ORIG_RAX: 0000000000000133
[ 104.990702][ T6588] RAX: ffffffffffffffda RBX: 00007f76c65d5cdc RCX:
00007f76c66ee89d
[ 104.990702][ T6588] RDX: 040000000000009f RSI: 00000000200002c0 RDI:
0000000000000007
[ 104.990702][ T6588] RBP: 00007f76c65d4de0 R08: 0000000100000001 R09:
0000000000000000
[ 104.990702][ T6588] R10: 0000000000000000 R11: 0000000000000203 R12:
fffffffffffffeb8
[ 104.990702][ T6588] R13: 0000000000000002 R14: 00007ffce5a5fd90 R15:
00007f76c65b5000
[ 104.990702][ T6588] </TASK>
[ 104.990702][ T6588] Kernel Offset: disabled
[ 104.990702][ T6588] Rebooting in 86400 seconds..



2023-05-23 14:20:23

by Vlad Buslov

Subject: Re: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Hi Peilin,

Thanks again for the great analysis and investing effort into fixing
this!

On Mon 22 May 2023 at 16:55, Peilin Ye <[email protected]> wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> [...]
>
> @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  	struct Qdisc *p = NULL;
>  	int err;
>
> +replay:

Perhaps also set q and p to NULL here? Even though on cursory look you
should get the same lookup result since the function is called under
rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case
something touches this.") and tc_new_tfilter() got some non-obvious bugs
after I introduced replay there without re-setting some of the required
variables.

> [...]


2023-05-23 19:00:19

by Peilin Ye

Subject: Re: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Hi Vlad,

On Tue, May 23, 2023 at 05:04:40PM +0300, Vlad Buslov wrote:
> > @@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
> > struct Qdisc *p = NULL;
> > int err;
> >
> > +replay:
>
> Perhaps also set q and p to NULL here? Even though on cursory look you
> should get the same lookup result since the function is called under
> rtnl_lock, tc_modify_qdisc() does this on replay ("Reinit, just in case
> something touches this.") and tc_new_tfilter() got some non-obvious bugs
> after I introduced replay there without re-setting some of the required
> variables.

Sure, I'll reinitialize tcm, q and p after "replay:" in next version, just
like tc_modify_qdisc(). Thanks for the suggestion!
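
Roughly like this, mirroring tc_modify_qdisc() (a sketch of what I have
in mind for v3, not the final code):

replay:
	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
				     rtm_tca_policy, extack);
	if (err < 0)
		return err;

	/* Reinit, just in case something touches this. */
	tcm = nlmsg_data(n);
	clid = tcm->tcm_parent;
	q = NULL;
	p = NULL;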

Peilin Ye


2023-05-23 20:20:01

by Peilin Ye

Subject: Re: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Tue, May 23, 2023 at 08:36:35AM -0300, Pedro Tammela wrote:
> > Thanks for testing this, but the syzbot reproducer creates ingress Qdiscs
> > under TC_H_ROOT, which isn't covered by [6/6] i.e. it exercises the
> > "!ingress" path in qdisc_graft(). I think that's why you are still seeing
> > the oops. Adding sch_{ingress,clsact} to TC_H_ROOT is no longer possible
> > after [1,2/6], and I think we'll need a different reproducer for [5,6/6].
>
> I was still able to trigger an oops with the full patchset:
>
> [ 104.944353][ T6588] ------------[ cut here ]------------
> [ 104.944896][ T6588] jump label: negative count!
> [ 104.945780][ T6588] WARNING: CPU: 0 PID: 6588 at kernel/jump_label.c:263
> static_key_slow_try_dec+0xf2/0x110
> [ 104.946795][ T6588] Modules linked in:
> [ 104.947111][ T6588] CPU: 0 PID: 6588 Comm: repro Not tainted
> 6.4.0-rc2-00191-g4a3f9100193d #3
> [ 104.947765][ T6588] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 104.948557][ T6588] RIP: 0010:static_key_slow_try_dec+0xf2/0x110
> [ 104.949064][ T6588] Code: d5 ff e8 c1 33 d5 ff 44 89 e8 5b 5d 41 5c 41 5d
> c3 44 89 e5 e9 66 ff ff ff e8 aa 33 d5 ff 48 c7 c7 00 9c 56 8a e8 4e ce 9c
> ff <0f> 0b eb ae 48 89 df e8 02 4b 28 00 e9 42 ff ff ff 66 66 2e 0f 1f
> [ 104.951134][ T6588] RSP: 0018:ffffc900079cf2c0 EFLAGS: 00010286
> [ 104.951646][ T6588] RAX: 0000000000000000 RBX: ffffffff9213a160 RCX:
> 0000000000000000
> [ 104.952269][ T6588] RDX: ffff888112f83b80 RSI: ffffffff814c7747 RDI:
> 0000000000000001
> [ 104.952901][ T6588] RBP: 00000000ffffffff R08: 0000000000000001 R09:
> 0000000000000000
> [ 104.953523][ T6588] R10: 0000000000000001 R11: 0000000000000001 R12:
> 00000000ffffffff
> [ 104.954133][ T6588] R13: ffff88816a514001 R14: 0000000000000001 R15:
> ffffffff8e7b0680
> [ 104.954746][ T6588] FS: 00007f76c65d56c0(0000) GS:ffff8881f5a00000(0000)
> knlGS:0000000000000000
> [ 104.955430][ T6588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 104.955941][ T6588] CR2: 00007f9a40357a50 CR3: 000000011461e000 CR4:
> 0000000000350ef0
> [ 104.956559][ T6588] Call Trace:
> [ 104.956829][ T6588] <TASK>
> [ 104.957062][ T6588] ? clsact_egress_block_get+0x40/0x40
> [ 104.957507][ T6588] static_key_slow_dec+0x60/0xc0
> [ 104.957906][ T6588] qdisc_create+0xa45/0x1090
> [ 104.958274][ T6588] ? tc_get_qdisc+0xb70/0xb70
> [ 104.958646][ T6588] tc_modify_qdisc+0x491/0x1b70
> [ 104.959031][ T6588] ? qdisc_create+0x1090/0x1090
> [ 104.959420][ T6588] ? bpf_lsm_capable+0x9/0x10
> [ 104.959797][ T6588] ? qdisc_create+0x1090/0x1090

Ah, qdisc_create() calls ->destroy() even "if ops->init() failed", so when
the new TC_H_INGRESS check makes ingress_init() bail out early, the
destroy callback still decrements the ingress static key that was never
incremented; hence the "jump label: negative count!" warning. We should
check sch->parent in {ingress,clsact}_destroy() too. I'll update [1,2/6]
in v5.
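
Roughly (a sketch of the guard I have in mind; the clsact variant would
be analogous):

	static void ingress_destroy(struct Qdisc *sch)
	{
		struct ingress_sched_data *q = qdisc_priv(sch);

		/* If ingress_init() rejected this Qdisc, it never set up the
		 * block or bumped the static key, so skip the teardown.
		 */
		if (sch->parent != TC_H_INGRESS)
			return;

		tcf_block_put_ext(q->block, sch, &q->block_info);
		net_dec_ingress_queue();
	}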

Thanks for reporting this! Seems like I should've run the reproducer
nevertheless. I'll run it before posting v5.

Thanks,
Peilin Ye