From: Vinicius Costa Gomes
To: Vladimir Oltean, netdev@vger.kernel.org
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, "David S. Miller",
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Vladimir Oltean,
	Kurt Kanzenbach, linux-kernel@vger.kernel.org,
	Muhammad Husaini Zulkifli
Subject: Re: [PATCH net] Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"
In-Reply-To: <20221004220100.1650558-1-vladimir.oltean@nxp.com>
References: <20221004220100.1650558-1-vladimir.oltean@nxp.com>
Date: Wed, 05 Oct 2022 17:11:30 -0700
Message-ID: <87zge9olot.fsf@intel.com>

Vladimir Oltean writes:

> taprio_attach() has this logic at the end, which should have been
> removed with the blamed patch (which is now being reverted):
>
>	/* access to the child qdiscs is not needed in offload mode */
>	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
>		kfree(q->qdiscs);
>		q->qdiscs = NULL;
>	}
>
> because otherwise, we make use of q->qdiscs[] even after this array was
> deallocated, namely in taprio_leaf(). Therefore, whenever one would try
> to attach a valid child qdisc to a fully offloaded taprio root, one
> would immediately dereference a NULL pointer.
>
> $ tc qdisc replace dev eno0 handle 8001: parent root taprio \
>	num_tc 8 \
>	map 0 1 2 3 4 5 6 7 \
>	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
>	max-sdu 0 0 0 0 0 200 0 0 \
>	base-time 200 \
>	sched-entry S 80 20000 \
>	sched-entry S a0 20000 \
>	sched-entry S 5f 60000 \
>	flags 2
> $ max_frame_size=1500
> $ data_rate_kbps=20000
> $ port_transmit_rate_kbps=1000000
> $ idleslope=$data_rate_kbps
> $ sendslope=$(($idleslope - $port_transmit_rate_kbps))
> $ locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
> $ hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
> $ tc qdisc replace dev eno0 parent 8001:7 cbs \
>	idleslope $idleslope \
>	sendslope $sendslope \
>	hicredit $hicredit \
>	locredit $locredit \
>	offload 0
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030
> pc : taprio_leaf+0x28/0x40
> lr : qdisc_leaf+0x3c/0x60
> Call trace:
>  taprio_leaf+0x28/0x40
>  tc_modify_qdisc+0xf0/0x72c
>  rtnetlink_rcv_msg+0x12c/0x390
>  netlink_rcv_skb+0x5c/0x130
>  rtnetlink_rcv+0x1c/0x2c
>
> The solution is not as obvious as the problem. The code which deallocates
> q->qdiscs[] is in fact copied and pasted from mqprio, which also
> deallocates the array in mqprio_attach() and never uses it afterwards.
>
> Therefore, the identical cleanup logic of priv->qdiscs[] that
> mqprio_destroy() has is deceptive: it never runs at qdisc_destroy()
> time, only at raw ops->destroy() time (in other words, priv->qdiscs[]
> does not last for the entire lifetime of the mqprio root). It is just
> the twisted way in which the Qdisc API expects error path cleanup to
> be done (Qdisc_ops :: destroy() is called even when Qdisc_ops :: init()
> never succeeded).
>
> Side note, in fact this is also what the comment in mqprio_init() says:
>
>	/* pre-allocate qdisc, attachment can't fail */
>
> Or reworded, mqprio's priv->qdiscs[] scheme is only meant to serve as
> data passing between Qdisc_ops :: init() and Qdisc_ops :: attach().
>
> [ this comment was also copied and pasted into the initial taprio
>   commit, even though taprio_attach() came way later ]
>
> The problem is that taprio also makes extensive use of the q->qdiscs[]
> array in the software fast path (taprio_enqueue() and taprio_dequeue()),
> but it does not keep a reference of its own on q->qdiscs[i] (you'd think
> that since it creates these Qdiscs, it holds the reference, but nope,
> this is not completely true).
>
> To understand the difference between taprio_destroy() and
> mqprio_destroy(), one must look at the code before commit 13511704f8d7
> ("net: taprio offload: enforce qdisc to netdev queue mapping"), because
> that commit just muddied the waters.
>
> In the "original" taprio design, taprio always attached itself (the root
> Qdisc) to all netdev TX queues, so that dev_qdisc_enqueue() would go
> through taprio_enqueue().
>
> It also called qdisc_refcount_inc() on itself as many times as there
> were netdev TX queues, in order to counter-balance what tc_get_qdisc()
> does when destroying a Qdisc (simplified for brevity below):
>
>	if (n->nlmsg_type == RTM_DELQDISC)
>		err = qdisc_graft(dev, parent=NULL, new=NULL, q, extack);
>
> qdisc_graft() (where "new" is NULL, so this deletes the Qdisc):
>
>	for (i = 0; i < num_q; i++) {
>		struct netdev_queue *dev_queue;
>
>		dev_queue = netdev_get_tx_queue(dev, i);
>
>		old = dev_graft_qdisc(dev_queue, new);
>		if (new && i > 0)
>			qdisc_refcount_inc(new);
>
>		qdisc_put(old);
>		~~~~~~~~~~~~~~
>		this decrements taprio's refcount once for each TX queue
>	}
>
>	notify_and_destroy(net, skb, n, classid,
>			   rtnl_dereference(dev->qdisc), new);
>	~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>	and this finally decrements it to zero,
>	making qdisc_put() call qdisc_destroy()
>
> The q->qdiscs[] created using qdisc_create_dflt() (or their
> replacements, if taprio_graft() were ever to be called) were then
> privately freed by taprio_destroy().
>
> This is still what happens after commit 13511704f8d7 ("net: taprio
> offload: enforce qdisc to netdev queue mapping"), but only in software
> mode.
>
> In full offload mode, the per-txq "qdisc_put(old)" calls from
> qdisc_graft() now deallocate the child Qdiscs rather than decrement
> taprio's refcount. So when notify_and_destroy(taprio) finally calls
> taprio_destroy(), the difference is that the child Qdiscs have already
> been deallocated.
>
> And this is exactly why the taprio_attach() comment "access to the child
> qdiscs is not needed in offload mode" is deceptive too. Not only is the
> q->qdiscs[] array not needed, it is also necessary to get rid of it as
> soon as possible, because otherwise we would also call qdisc_put() on
> the child Qdiscs in qdisc_destroy() -> taprio_destroy(), and that would
> cause a nasty use-after-free/refcount-saturation/whatever.
>
> In short, the problem is that since the blamed commit, taprio_leaf()
> needs q->qdiscs[] to not be freed by taprio_attach(), while
> qdisc_destroy() -> taprio_destroy() needs q->qdiscs[] to be freed by
> taprio_attach() for full offload. Fixing one problem triggers the other.
>
> All of this could be solved by making taprio keep its q->qdiscs[i] with
> a refcount elevated to 2 (in offloaded mode, where they are attached to
> the netdev TX queues), both in taprio_attach() and in taprio_graft().
> The generic qdisc_graft() would then just decrement the child qdiscs'
> refcounts to 1, and taprio_destroy() would give them the final coup de
> grace.
>
> However, the rabbit hole of changes is getting quite deep, and the
> complexity keeps increasing. The blamed commit was supposed to be a bug
> fix in the first place, and the bug it addressed is not significant
> enough to justify further rework in stable trees. So I'd rather just
> revert it. I don't know enough about multi-queue Qdisc design to make a
> proper judgement right now regarding what is/isn't idiomatic use of
> Qdisc concepts in taprio. I will try to study the problem further and
> come up with a different solution in net-next.
>
> Fixes: 1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
> Reported-by: Muhammad Husaini Zulkifli
> Reported-by: Vinicius Costa Gomes
> Signed-off-by: Vladimir Oltean
> ---

Reviewed-by: Vinicius Costa Gomes

Cheers,
-- 
Vinicius