2023-05-24 01:28:59

by Peilin Ye

Subject: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

From: Peilin Ye <[email protected]>

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap(). Similarly for
clsact Qdiscs and miniq_egress.
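
For reference, the pieces involved look roughly like this (abridged from
include/net/sch_generic.h and net/sched/sch_ingress.c; unrelated members
omitted):

	struct mini_Qdisc_pair {
		struct mini_Qdisc miniq1;
		struct mini_Qdisc miniq2;
		struct mini_Qdisc __rcu **p_miniq;
	};

	/* In ingress_init(): aim the pair at the per-net_device slot, so
	 * every later mini_qdisc_pair_swap() on this Qdisc writes through
	 * dev->miniq_ingress.
	 */
	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);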

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks to Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1], including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other. In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state. Fixing this without changing sch_api.c turned out to be
difficult (please refer to the Closes: tag for discussions). Instead, make
sure @new's first mini_qdisc_pair_swap() call always happens after @old's
last call (in qdisc_destroy()) has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
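
For contrast, the pre-existing increment-side helper in
include/net/sch_generic.h is essentially:

	/* Taken by RTNL-unlocked filter requests via __tcf_qdisc_find();
	 * the new qdisc_refcount_dec_if_one() only succeeds once all such
	 * requests have dropped their references again.
	 */
	static inline bool qdisc_refcount_inc_nz(struct Qdisc *qdisc)
	{
		if (qdisc->flags & TCQ_F_BUILTIN)
			return true;
		return refcount_inc_not_zero(&qdisc->refcnt);
	}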

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0, which has 8 transmission queues:

Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
adds a flower filter X to A.

Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                        RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2    qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0   RTM_NEWTFILTER (Y, RTNL-unlocked)
    qdisc_destroy(A)                  tcf_chain0_head_change(B)
     tcf_chain0_head_change_cb_del(A)  mini_qdisc_pair_swap(B) (2nd)
      mini_qdisc_pair_swap(A) (3rd)        |
              ...                         ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers. The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.
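
To see both hazards in one place, here is a simplified sketch of
mini_qdisc_pair_swap() (statistics setup and RCU barriers abridged; not
the exact kernel code):

	void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
				  struct tcf_proto *tp_head)
	{
		struct mini_Qdisc *miniq_old =
			rcu_dereference_protected(*miniqp->p_miniq, 1);
		struct mini_Qdisc *miniq;

		if (!tp_head) {
			/* A's 3rd call takes this path during
			 * ingress_destroy(), NULLing the slot that B has
			 * just pointed at b1.
			 */
			RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
		} else {
			miniq = miniq_old != &miniqp->miniq1 ?
				&miniqp->miniq1 : &miniqp->miniq2;
			miniq->filter_list = tp_head;
			rcu_assign_pointer(*miniqp->p_miniq, miniq);
		}

		if (miniq_old)
			/* If *p_miniq still held one of A's mini Qdiscs
			 * here, B would be updating A's RCU state.
			 */
			call_rcu(&miniq_old->rcu, mini_qdisc_rcu_func);
	}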

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: [email protected]
Closes: https://lore.kernel.org/r/[email protected]/
Cc: Hillf Danton <[email protected]>
Cc: Vlad Buslov <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
---
changes in v5:
  - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
    just like @flags in tc_new_tfilter()

changes in v3, v4:
  - add in-body From: tag

changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..286b7c58f5b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 	if ((q && q->flags & TCQ_F_INGRESS) ||
 	    (new && new->flags & TCQ_F_INGRESS)) {
 		ingress = 1;
-		if (!dev_ingress_queue(dev)) {
+		dev_queue = dev_ingress_queue(dev);
+		if (!dev_queue) {
 			NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 			return -ENOENT;
 		}
+
+		/* Replay if the current ingress (or clsact) Qdisc has ongoing
+		 * RTNL-unlocked filter request(s). This is the counterpart of that
+		 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+		 */
+		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+			return -EAGAIN;
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 		if (new && new->ops->attach)
 			new->ops->attach(new);
-	} else {
-		notify_and_destroy(net, skb, n, classid, old, new, extack);
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct tcmsg *tcm = nlmsg_data(n);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
+	struct Qdisc *q, *p;
+	struct tcmsg *tcm;
 	u32 clid;
-	struct Qdisc *q = NULL;
-	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
+	tcm = nlmsg_data(n);
+	q = p = NULL;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return -ENODEV;
@@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
--
2.20.1



2023-05-24 16:25:50

by Pedro Tammela

Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On 23/05/2023 22:20, Peilin Ye wrote:
> From: Peilin Ye <[email protected]>
> [...]
> Signed-off-by: Peilin Ye <[email protected]>

Tested-by: Pedro Tammela <[email protected]>



2023-05-24 16:51:39

by Jamal Hadi Salim

Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Pedro,
When you have a moment - could you run tc monitor in parallel with the
reproducer and double-check that it generates the correct events...

cheers,
jamal

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <[email protected]> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <[email protected]>
> > [...]
> > Signed-off-by: Peilin Ye <[email protected]>
>
> Tested-by: Pedro Tammela <[email protected]>

2023-05-25 10:01:57

by Paolo Abeni

Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> When you have a moment - could you run tc monitor in parallel to the
> reproducer and double check it generates the correct events...

FTR, I'll wait a bit to apply this series, to allow for the above
tests. Unless someone screams very loudly very soon, it's not going
to enter today's PR. Since the addressed issue is an ancient one, that
should not be a problem, I hope.

Cheers,

Paolo


2023-05-26 12:30:18

by Jamal Hadi Salim

Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Thu, May 25, 2023 at 5:25 AM Paolo Abeni <[email protected]> wrote:
>
> On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> > When you have a moment - could you run tc monitor in parallel to the
> > reproducer and double check it generates the correct events...
>
> FTR, I'll wait a bit to apply this series, to allow for the above
> tests. Unless someone will scream very loudly very soon, it's not going
> to enter today's PR. Since the addressed issue is an ancient one, it
> should not a problem, I hope.

Sorry, I do not mean to hold this back - I think we should have applied
v1, then worried about making things better. We can worry about events
after. So Acking this.

cheers,
jamal

> Cheers,
>
> Paolo
>

2023-05-26 12:31:32

by Jamal Hadi Salim

Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <[email protected]> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <[email protected]>
> > [...]
> > Signed-off-by: Peilin Ye <[email protected]>
>
> Tested-by: Pedro Tammela <[email protected]>


Acked-by: Jamal Hadi Salim <[email protected]>


cheers,
jamal

2023-05-26 20:16:56

by Jamal Hadi Salim

Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

On Fri, May 26, 2023 at 8:20 AM Jamal Hadi Salim <[email protected]> wrote:
>
> On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <[email protected]> wrote:
> >
> > On 23/05/2023 22:20, Peilin Ye wrote:
> > > From: Peilin Ye <[email protected]>

> Acked-by: Jamal Hadi Salim <[email protected]>

I apologize, I am going to take this back; let's please hold the series
for now.

In pursuit of the effect on events, Pedro and I spent a few _hours_
chasing this - and regardless of the events, there are still
challenges with the concurrency issue. The current reproducer
unfortunately can't cause damage after patch 2, so patch 6 was really
not being tested. We hacked the repro to hit the codepath patch 6
fixes. We are not sure what the root cause is - but it is certainly due
to the series. Peilin: Pedro will post the new repro.

cheers,
jamal