From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Vincent Ray,
    Eric Dumazet, Jakub Kicinski, Sasha Levin
Subject: [PATCH 5.4 323/411] net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog
Date: Mon, 13 Jun 2022 12:09:56 +0200
Message-Id: <20220613094938.418003432@linuxfoundation.org>
In-Reply-To: <20220613094928.482772422@linuxfoundation.org>
References: <20220613094928.482772422@linuxfoundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Vincent Ray

[ Upstream commit a54ce3703613e41fe1d98060b62ec09a3984dc28 ]

In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit()
does not provide any ordering guarantee, as test_bit() is not an
atomic operation.
This, combined with the fact that the spin_trylock() call at the
beginning of qdisc_run_begin() does not guarantee acquire semantics
when it fails to grab the lock, makes it possible for the following
statement:

	if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))

to be executed before an enqueue operation called before
qdisc_run_begin().

As a result, the following race can happen:

           CPU 1                             CPU 2

      qdisc_run_begin()              qdisc_run_begin() /* true */
        set(MISSED)                            .
      /* returns false */                      .
          .                           /* sees MISSED = 1 */
          .                           /* so qdisc not empty */
          .                           __qdisc_run()
          .                                    .
          .                           pfifo_fast_dequeue()
 ----> /* may be done here */                  .
|         .                           clear(MISSED)
|         .                                    .
|         .                           smp_mb__after_atomic();
|         .                                    .
|         .                           /* recheck the queue */
|         .                           /* nothing => exit   */
|     enqueue(skb1)
|         .
|     qdisc_run_begin()
|         .
|     spin_trylock() /* fail */
|         .
|     smp_mb__before_atomic() /* not enough */
|         .
 ---- if (test_bit(MISSED)) return false; /* exit */

In the above scenario, CPU 1 and CPU 2 both try to grab the
qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
bypass code path, where it emits its skb then calls __qdisc_run().

CPU 1 fails, sets MISSED and goes down the traditional enqueue() +
dequeue() code path. But when executing qdisc_run_begin() for the
second time, after enqueuing its skbuff, it sees the MISSED bit still
set (by itself) and consequently chooses to exit early without setting
it again or trying to grab the spinlock again.

Meanwhile CPU 2 has seen MISSED = 1, cleared it, checked the queue and
found it empty, so it returned.

At the end of the sequence, we end up with skb1 enqueued in the
backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit clear, and
no call to __netif_schedule() made. skb1 will now linger in the qdisc
until somebody later performs a full __qdisc_run(). Combined with the
bypass capability of the qdisc and the ability of the TCP layer to
avoid resending packets which it knows are still in the qdisc, this
can lead to serious traffic "holes" in a TCP connection.

We fix this by replacing the smp_mb__before_atomic() / test_bit() /
set_bit() / smp_mb__after_atomic() sequence inside qdisc_run_begin()
with a single test_and_set_bit() call, which is more concise and
enforces the needed memory barriers.

Fixes: 89837eb4b246 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")
Signed-off-by: Vincent Ray
Signed-off-by: Eric Dumazet
Link: https://lore.kernel.org/r/20220526001746.2437669-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski
Signed-off-by: Sasha Levin
---
 include/net/sch_generic.h | 36 ++++++++----------------------------
 1 file changed, 8 insertions(+), 28 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ae69059ba76d..90fb413d9fd7 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -160,37 +160,17 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 	if (spin_trylock(&qdisc->seqlock))
 		goto nolock_empty;
 
-	/* Paired with smp_mb__after_atomic() to make sure
-	 * STATE_MISSED checking is synchronized with clearing
-	 * in pfifo_fast_dequeue().
+	/* No need to insist if the MISSED flag was already set.
+	 * Note that test_and_set_bit() also gives us memory ordering
+	 * guarantees wrt potential earlier enqueue() and below
+	 * spin_trylock(), both of which are necessary to prevent races
 	 */
-	smp_mb__before_atomic();
-
-	/* If the MISSED flag is set, it means other thread has
-	 * set the MISSED flag before second spin_trylock(), so
-	 * we can return false here to avoid multi cpus doing
-	 * the set_bit() and second spin_trylock() concurrently.
-	 */
-	if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+	if (test_and_set_bit(__QDISC_STATE_MISSED, &qdisc->state))
 		return false;
 
-	/* Set the MISSED flag before the second spin_trylock(),
-	 * if the second spin_trylock() return false, it means
-	 * other cpu holding the lock will do dequeuing for us
-	 * or it will see the MISSED flag set after releasing
-	 * lock and reschedule the net_tx_action() to do the
-	 * dequeuing.
-	 */
-	set_bit(__QDISC_STATE_MISSED, &qdisc->state);
-
-	/* spin_trylock() only has load-acquire semantic, so use
-	 * smp_mb__after_atomic() to ensure STATE_MISSED is set
-	 * before doing the second spin_trylock().
-	 */
-	smp_mb__after_atomic();
-
-	/* Retry again in case other CPU may not see the new flag
-	 * after it releases the lock at the end of qdisc_run_end().
+	/* Try to take the lock again to make sure that we will either
+	 * grab it or the CPU that still has it will see MISSED set
+	 * when testing it in qdisc_run_end()
 	 */
 	if (!spin_trylock(&qdisc->seqlock))
 		return false;
-- 
2.35.1
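
For readers who want to experiment with the pattern outside the kernel
tree, below is a minimal userspace C11 sketch of the fixed fast path.
It is an analogue, not the kernel code: atomic_fetch_or() stands in
for test_and_set_bit(), an atomic_flag stands in for qdisc->seqlock,
and every name in it (run_begin_sketch(), state, seqlock) is invented
for illustration. Its seq_cst ordering is simply the C11 default,
mirroring the full barrier that a value-returning test_and_set_bit()
implies. The point it demonstrates is the heart of the fix: one atomic
read-modify-write both observes and sets the MISSED flag, so no
separate fence around a plain load is needed.

	/* sketch.c -- illustrative userspace analogue, not the kernel code */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define QDISC_STATE_MISSED (1UL << 0)

	static atomic_ulong state;                      /* stand-in for qdisc->state   */
	static atomic_flag  seqlock = ATOMIC_FLAG_INIT; /* stand-in for qdisc->seqlock */

	/* Analogue of the fixed fast path in qdisc_run_begin(). */
	static bool run_begin_sketch(void)
	{
		/* First trylock: the empty-qdisc bypass attempt. */
		if (!atomic_flag_test_and_set(&seqlock))
			return true;            /* got the lock */

		/* One atomic read-modify-write replaces the old
		 * smp_mb__before_atomic(); test_bit(); set_bit();
		 * smp_mb__after_atomic(); sequence: it tests and sets
		 * MISSED in a single step, and its ordering (seq_cst
		 * here, a full barrier in the kernel) covers both the
		 * earlier enqueue and the retry below.
		 */
		if (atomic_fetch_or(&state, QDISC_STATE_MISSED) & QDISC_STATE_MISSED)
			return false;           /* miss already flagged */

		/* Retry: either we take the lock, or its holder is
		 * guaranteed to observe MISSED when it releases it.
		 */
		if (atomic_flag_test_and_set(&seqlock))
			return false;

		return true;
	}

	int main(void)
	{
		printf("%d\n", run_begin_sketch()); /* 1: lock was free      */
		printf("%d\n", run_begin_sketch()); /* 0: sets MISSED, fails */
		printf("%d\n", run_begin_sketch()); /* 0: MISSED already set */
		return 0;
	}

Built with e.g. "cc -std=c11 sketch.c", the three calls print 1, 0, 0:
the first takes the free lock, the second fails the trylock and is the
one that sets MISSED, and the third exits early because the flag is
already set, exactly the short-circuit the single test_and_set_bit()
call provides in the patched qdisc_run_begin().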