Date: Wed, 24 May 2023 12:39:52 -0300
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
To: Peilin Ye, "David S.
 Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov, Hillf Danton, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Cong Wang
References: <429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com>
From: Pedro Tammela
In-Reply-To: <429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 23/05/2023 22:20, Peilin Ye wrote:
> From: Peilin Ye
>
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> requests (thanks Hillf Danton for the hint), when replacing ingress or
> clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> causing race conditions [1] including a use-after-free bug in
> mini_qdisc_pair_swap() reported by syzbot:
>
>   BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>   Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
>   ...
>   Call Trace:
>
>    __dump_stack lib/dump_stack.c:88 [inline]
>    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
>    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
>    print_report mm/kasan/report.c:430 [inline]
>    kasan_report+0x11c/0x130 mm/kasan/report.c:536
>    mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>    tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
>    tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
>    tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
>    tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
>    tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
>   ...
>
> @old and @new should not affect each other. In other words, @old should
> never modify miniq_{in,e}gress after @new, and @new should not update
> @old's RCU state. Fixing without changing sch_api.c turned out to be
> difficult (please refer to Closes: for discussions). Instead, make sure
> @new's first call always happen after @old's last call, in
> qdisc_destroy(), has finished:
>
> In qdisc_graft(), return -EAGAIN and tell the caller to replay
> (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> requests, and call qdisc_destroy() for @old before grafting @new.
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> Qdiscs".
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
>   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>   adds a flower filter X to A.
>
>   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>   b2) to replace A, then adds a flower filter Y to B.
>
>  Thread 1                        A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)                        1
>    qdisc_graft(A)                         9
>
>   RTM_NEWTFILTER (X, RTNL-unlocked)
>    __tcf_qdisc_find(A)                   10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                                  RTM_NEWQDISC (B, RTNL-locked)
>          RCU sync                         2     qdisc_graft(B)
>             |                             1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)                   0    RTM_NEWTFILTER (Y, RTNL-unlocked)
>    qdisc_destroy(A)                             tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)             mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                         |
>            ...                                          ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is only one of the possible consequences of concurrently accessing
> miniq_{in,e}gress pointers. The point is clear though: again, A should
> never modify those per-net_device pointers after B, and B should not
> update A's RCU state.
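
For readers following the interleaving above, here is a small userspace sketch of it. It is only a model under loud assumptions: every model_* name is hypothetical, the swap is reduced to "publish the inactive mini Qdisc, or NULL when the chain is empty", and there is no RCU or real concurrency. It just replays the three swaps sequentially, once in the racy order and once in the order the patch enforces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: names mirror the commit message, not kernel code. */
struct model_miniq {
	const char *filter;			/* head of the filter chain, or NULL */
};

struct model_dev {
	struct model_miniq *miniq_ingress;	/* per-device pointer read on RX */
};

struct model_pair {
	struct model_miniq miniq1, miniq2;
	struct model_miniq **p_miniq;		/* points at dev->miniq_ingress */
};

static void model_pair_init(struct model_pair *p, struct model_dev *dev)
{
	assert(p && dev);
	p->miniq1.filter = NULL;
	p->miniq2.filter = NULL;
	p->p_miniq = &dev->miniq_ingress;
}

/* Simplified swap: a NULL head unpublishes, otherwise publish the inactive miniq. */
static void model_pair_swap(struct model_pair *p, const char *head)
{
	struct model_miniq *cur = *p->p_miniq;
	struct model_miniq *next;

	if (!head) {
		*p->p_miniq = NULL;
		return;
	}
	next = (cur == &p->miniq1) ? &p->miniq2 : &p->miniq1;
	next->filter = head;
	*p->p_miniq = next;
}

/* Returns 1 if the RX path would find the named filter through the device pointer. */
static int model_sees(const struct model_dev *dev, const char *name)
{
	return dev->miniq_ingress && dev->miniq_ingress->filter &&
	       strcmp(dev->miniq_ingress->filter, name) == 0;
}

/* Order from the table: A's destroy-time swap (3rd) runs after B's swap (2nd). */
static int model_racy_order(void)
{
	struct model_dev eth0 = { NULL };
	struct model_pair a, b;

	model_pair_init(&a, &eth0);
	model_pair_init(&b, &eth0);
	model_pair_swap(&a, "X");	/* 1st: filter X added to A */
	model_pair_swap(&b, "Y");	/* 2nd: filter Y added to B */
	model_pair_swap(&a, NULL);	/* 3rd: A destroyed late, clears the pointer */
	return model_sees(&eth0, "Y");
}

/* Order the patch enforces: A's last swap finishes before B's first swap. */
static int model_fixed_order(void)
{
	struct model_dev eth0 = { NULL };
	struct model_pair a, b;

	model_pair_init(&a, &eth0);
	model_pair_init(&b, &eth0);
	model_pair_swap(&a, "X");
	model_pair_swap(&a, NULL);	/* A destroyed before B is grafted */
	model_pair_swap(&b, "Y");
	return model_sees(&eth0, "Y");
}
```

In this model, model_racy_order() reports that filter Y is not reachable, while model_fixed_order() reports that it is. The use-after-free from the KASAN report is deliberately not modeled here.
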
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Cc: Hillf Danton
> Cc: Vlad Buslov
> Signed-off-by: Peilin Ye

Tested-by: Pedro Tammela

> ---
> change in v5:
>   - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
>     just like @flags in tc_new_tfilter()
>
> change in v3, v4:
>   - add in-body From: tag
>
> changes in v2:
>   - replay the request if the current Qdisc has any ongoing RTNL-unlocked
>     filter requests (Vlad)
>   - minor changes in code comments and commit log
>
>  include/net/sch_generic.h |  8 ++++++++
>  net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
>  net/sched/sch_generic.c   | 14 +++++++++++---
>  3 files changed, 50 insertions(+), 12 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>  	refcount_inc(&qdisc->refcnt);
>  }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>  /* Intended to be used by unlocked users, when concurrent qdisc release is
>   * possible.
>   */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>  struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>  			      struct Qdisc *qdisc);
>  void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>  void qdisc_put(struct Qdisc *qdisc);
>  void qdisc_put_unlocked(struct Qdisc *qdisc);
>  void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..286b7c58f5b9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		if ((q && q->flags & TCQ_F_INGRESS) ||
>  		    (new && new->flags & TCQ_F_INGRESS)) {
>  			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>  				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>  				return -ENOENT;
>  			}
> +
> +			/* Replay if the current ingress (or clsact) Qdisc has ongoing
> +			 * RTNL-unlocked filter request(s). This is the counterpart of that
> +			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +				return -EAGAIN;
>  		}
>
>  		if (dev->flags & IFF_UP)
> @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  				qdisc_put(old);
>  			}
>  		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>  		}
>
>  skip:
> @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
>  			if (new && new->ops->attach)
>  				new->ops->attach(new);
> -		} else {
> -			notify_and_destroy(net, skb, n, classid, old, new, extack);
>  		}
>
>  		if (dev->flags & IFF_UP)
> @@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  			struct netlink_ext_ack *extack)
>  {
>  	struct net *net = sock_net(skb->sk);
> -	struct tcmsg *tcm = nlmsg_data(n);
>  	struct nlattr *tca[TCA_MAX + 1];
>  	struct net_device *dev;
> +	struct Qdisc *q, *p;
> +	struct tcmsg *tcm;
>  	u32 clid;
> -	struct Qdisc *q = NULL;
> -	struct Qdisc *p = NULL;
>  	int err;
>
> +replay:
>  	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
>  				     rtm_tca_policy, extack);
>  	if (err < 0)
>  		return err;
>
> +	tcm = nlmsg_data(n);
> +	q = p = NULL;
> +
>  	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
>  	if (!dev)
>  		return -ENODEV;
> @@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  			return -ENOENT;
>  		}
>  		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
> -		if (err != 0)
> +		if (err != 0) {
> +			if (err == -EAGAIN)
> +				goto replay;
>  			return err;
> +		}
>  	} else {
>  		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
>  	}
> @@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>  	if (err) {
>  		if (q)
>  			qdisc_put(q);
> +		if (err == -EAGAIN)
> +			goto replay;
>  		return err;
>  	}
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>  	qdisc_free(q);
>  }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>  {
>  	const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>  	call_rcu(&qdisc->rcu, qdisc_free_cb);
>  }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>  void qdisc_put(struct Qdisc *qdisc)
>  {
>  	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>  	    !refcount_dec_and_test(&qdisc->refcnt))
>  		return;
>
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  }
>  EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>  	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>  		return;
>
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>  	rtnl_unlock();
>  }
>  EXPORT_SYMBOL(qdisc_put_unlocked);
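
As an aside for anyone reading the diff without the kernel tree at hand, the dec-if-one / replay handshake can be sketched in userspace with C11 atomics. This is a rough model, not the kernel implementation: the model_* names are hypothetical, atomic_int stands in for refcount_t (no saturation or memory-ordering tuning), and the "in-flight filter request" is simulated by dropping one reference per failed attempt instead of running a second thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the Qdisc refcount; not the kernel's refcount_t. */
struct model_qdisc {
	atomic_int refcnt;
};

/* Take a reference unless the count is already zero
 * (models the role of qdisc_refcount_inc_nz() for unlocked filter requests).
 */
static bool model_inc_not_zero(struct model_qdisc *q)
{
	int old = atomic_load(&q->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&q->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop the last reference only if the caller is the sole holder
 * (models the role of qdisc_refcount_dec_if_one()).
 */
static bool model_dec_if_one(struct model_qdisc *q)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&q->refcnt, &expected, 0);
}

/* Graft attempt: back off with -EAGAIN while anyone else holds a reference. */
static int model_graft(struct model_qdisc *old)
{
	if (!model_dec_if_one(old))
		return -EAGAIN;
	return 0;	/* @old can now be destroyed before @new is grafted */
}

/* Replay loop. Each failed attempt lets one simulated in-flight request
 * finish and drop its reference. Returns the number of replays needed.
 */
static int model_graft_with_replay(struct model_qdisc *old)
{
	int replays = 0;

	assert(atomic_load(&old->refcnt) >= 1);
	while (model_graft(old) == -EAGAIN) {
		atomic_fetch_sub(&old->refcnt, 1);
		replays++;
	}
	return replays;
}
```

With a count of 2 (the graft reference plus one filter request), the first attempt returns -EAGAIN and the replay succeeds once the request has dropped its reference. After that the count is 0 and model_inc_not_zero() refuses new references, which is the property the patch relies on to keep late filter requests away from @old.
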