From: Peilin Ye
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jamal Hadi Salim , Cong Wang , Jiri Pirko Cc: Peilin Ye , Daniel Borkmann , John Fastabend , Vlad Buslov , Pedro Tammela , Hillf Danton , netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Cong Wang , Peilin Ye Subject: [PATCH v2 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Date: Mon, 22 May 2023 16:55:36 -0700 Message-Id: <8e3383d0bacd084f0e33d9158d24bd411f1bf6ba.1684796705.git.peilin.ye@bytedance.com> X-Mailer: git-send-email 2.30.1 (Apple Git-130) In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for clsact Qdiscs and miniq_egress. Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER requests (thanks Hillf Danton for the hint), when replacing ingress or clsact Qdiscs, for example, the old Qdisc ("@old") could access the same miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"), causing race conditions [1] including a use-after-free bug in mini_qdisc_pair_swap() reported by syzbot: BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901 ... Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319 print_report mm/kasan/report.c:430 [inline] kasan_report+0x11c/0x130 mm/kasan/report.c:536 mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573 tcf_chain_head_change_item net/sched/cls_api.c:495 [inline] tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509 tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline] tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline] tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266 ... @old and @new should not affect each other. In other words, @old should never modify miniq_{in,e}gress after @new, and @new should not update @old's RCU state. Fixing without changing sch_api.c turned out to be difficult (please refer to Closes: for discussions). Instead, make sure @new's first call always happen after @old's last call, in qdisc_destroy(), has finished: In qdisc_graft(), return -EAGAIN and tell the caller to replay (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter requests, and call qdisc_destroy() for @old before grafting @new. Introduce qdisc_refcount_dec_if_one() as the counterpart of qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests. Introduce a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check, just like qdisc_put() etc. Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs". 
[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2),
  then adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                        RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2    qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
    tcf_block_release(A)         0    RTM_NEWTFILTER (Y, RTNL-unlocked)
     qdisc_destroy(A)                  tcf_chain0_head_change(B)
      tcf_chain0_head_change_cb_del(A)  mini_qdisc_pair_swap(B) (2nd)
       mini_qdisc_pair_swap(A) (3rd)        |
            ...                             ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton
Cc: Vlad Buslov
Signed-off-by: Peilin Ye
---
changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 32 ++++++++++++++++++++++++++------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..b3bafa6c1b44 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 	if ((q && q->flags & TCQ_F_INGRESS) ||
 	    (new && new->flags & TCQ_F_INGRESS)) {
 		ingress = 1;
-		if (!dev_ingress_queue(dev)) {
+		dev_queue = dev_ingress_queue(dev);
+		if (!dev_queue) {
 			NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 			return -ENOENT;
 		}
+
+		/* Replay if the current ingress (or clsact) Qdisc has ongoing
+		 * RTNL-unlocked filter request(s).  This is the counterpart of that
+		 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+		 */
+		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+			return -EAGAIN;
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 		}
 	} else {
-		dev_queue = dev_ingress_queue(dev);
-		old = dev_graft_qdisc(dev_queue, new);
+		old = dev_graft_qdisc(dev_queue, NULL);
+
+		/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+		 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+		 * pointer(s) in mini_qdisc_pair_swap().
+		 */
+		qdisc_notify(net, skb, n, classid, old, new, extack);
+		qdisc_destroy(old);
+
+		dev_graft_qdisc(dev_queue, new);
 	}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 		if (new && new->ops->attach)
 			new->ops->attach(new);
-	} else {
-		notify_and_destroy(net, skb, n, classid, old, new, extack);
 	}
 
 	if (dev->flags & IFF_UP)
@@ -1458,6 +1472,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
@@ -1515,8 +1530,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1722,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
-- 
2.20.1