From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Jakub Kicinski, Juergen Gross, Yunsheng Lin, "David S. Miller", Sasha Levin
Subject: [PATCH 5.12 242/296] net: sched: fix packet stuck problem for lockless qdisc
Date: Mon, 31 May 2021 15:14:57 +0200
Message-Id: <20210531130711.926935292@linuxfoundation.org>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20210531130703.762129381@linuxfoundation.org>
References: <20210531130703.762129381@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

From: Yunsheng Lin

[ Upstream commit a90c57f2cedd52a511f739fb55e6244e22e1a2fb ]

Lockless qdisc has the following concurrency problem:

    cpu0                 cpu1
     .                     .
q->enqueue                 .
     .                     .
qdisc_run_begin()          .
     .                     .
dequeue_skb()              .
     .                     .
sch_direct_xmit()          .
     .                     .
     .                q->enqueue
     .             qdisc_run_begin()
     .            return and do nothing
     .                     .
qdisc_run_end()            .

cpu1 enqueues a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().

Lockless qdisc has another concurrency problem when tx_action
is involved:

cpu0(serving tx_action)     cpu1             cpu2
          .                   .                .
          .              q->enqueue            .
          .            qdisc_run_begin()       .
          .              dequeue_skb()         .
          .                   .            q->enqueue
          .                   .                .
          .             sch_direct_xmit()      .
          .                   .         qdisc_run_begin()
          .                   .       return and do nothing
          .                   .                .
 clear __QDISC_STATE_SCHED    .                .
 qdisc_run_begin()            .                .
 return and do nothing        .                .
          .                   .                .
          .            qdisc_run_end()         .

This patch fixes the above data races by:
1. If the first spin_trylock() returns false and STATE_MISSED is
   not set, set STATE_MISSED and retry spin_trylock() once more,
   in case the other CPU does not see STATE_MISSED after it
   releases the lock.
2. Reschedule if STATE_MISSED is set after the lock is released
   at the end of qdisc_run_end().

For the tx_action case, STATE_MISSED is also set when cpu1 is at
the end of qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.

Clear STATE_MISSED before retrying a dequeue when dequeuing
returns NULL, in order to reduce the overhead of the second
spin_trylock() and the __netif_schedule() call.

Also clear STATE_MISSED before calling __netif_schedule() at the
end of qdisc_run_end() to avoid doing another round of dequeuing
in pfifo_fast_dequeue().

The performance impact of this patch, tested using pktgen and a
dummy netdev with the pfifo_fast qdisc attached:

 threads  without+this_patch   with+this_patch      delta
    1        2.61Mpps            2.60Mpps           -0.3%
    2        3.97Mpps            3.82Mpps           -3.7%
    4        5.62Mpps            5.59Mpps           -0.5%
    8        2.78Mpps            2.77Mpps           -0.3%
   16        2.22Mpps            2.22Mpps           -0.0%

Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski
Tested-by: Juergen Gross
Signed-off-by: Yunsheng Lin
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
---
 include/net/sch_generic.h | 35 ++++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 19 +++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2d6eb60c58c8..2c4f3527cc09 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_MISSED,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,33 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the MISSED flag is set, it means other thread has
+		 * set the MISSED flag before second spin_trylock(), so
+		 * we can return false here to avoid multi cpus doing
+		 * the set_bit() and second spin_trylock() concurrently.
+		 */
+		if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+			return false;
+
+		/* Set the MISSED flag before the second spin_trylock(),
+		 * if the second spin_trylock() return false, it means
+		 * other cpu holding the lock will do dequeuing for us
+		 * or it will see the MISSED flag set after releasing
+		 * lock and reschedule the net_tx_action() to do the
+		 * dequeuing.
+		 */
+		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +202,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state))) {
+			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+			__netif_schedule(qdisc);
+		}
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 49eae93d1489..8c6b97cc5e41 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,23 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+		/* Delay clearing the STATE_MISSED here to reduce
+		 * the overhead of the second spin_trylock() in
+		 * qdisc_run_begin() and __netif_schedule() calling
+		 * in qdisc_run_end().
+		 */
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Make sure dequeuing happens after clearing
+		 * STATE_MISSED.
+		 */
+		smp_mb__after_atomic();
+
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}
-- 
2.30.2
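
For readers who want to poke at the handshake outside the kernel, the
following is a rough user-space model of the qdisc_run_begin() /
qdisc_run_end() changes in this patch. It is only a sketch and not
kernel code: struct toy_qdisc, toy_trylock(), toy_run_begin(),
toy_run_end() and toy_replay_first_race() are hypothetical names, C11
atomics stand in for spin_trylock()/set_bit()/test_bit(), and a counter
stands in for __netif_schedule(). The retry added to
pfifo_fast_dequeue() is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct Qdisc; only the fields the patch touches. */
struct toy_qdisc {
	atomic_flag lock;    /* models qdisc->seqlock */
	atomic_bool missed;  /* models __QDISC_STATE_MISSED */
	atomic_int resched;  /* counts would-be __netif_schedule() calls */
};

/* Models spin_trylock(): true if the lock was free and is now held. */
static bool toy_trylock(struct toy_qdisc *q)
{
	return !atomic_flag_test_and_set(&q->lock);
}

/* Mirrors the TCQ_F_NOLOCK branch of qdisc_run_begin() after the patch. */
static bool toy_run_begin(struct toy_qdisc *q)
{
	if (toy_trylock(q))
		return true;

	/* Someone already flagged the miss; avoid a redundant second trylock. */
	if (atomic_load(&q->missed))
		return false;

	/* Flag the miss first, then retry in case the holder just unlocked. */
	atomic_store(&q->missed, true);
	return toy_trylock(q);
}

/* Mirrors qdisc_run_end(): unlock, then reschedule if a miss was flagged. */
static void toy_run_end(struct toy_qdisc *q)
{
	atomic_flag_clear(&q->lock);

	if (atomic_load(&q->missed)) {
		atomic_store(&q->missed, false);
		atomic_fetch_add(&q->resched, 1);
	}
}

/* Replays the first timeline from the changelog on a single thread and
 * returns how many reschedules the lock holder requested.
 */
static int toy_replay_first_race(void)
{
	struct toy_qdisc q = { .lock = ATOMIC_FLAG_INIT };

	bool cpu0 = toy_run_begin(&q);  /* cpu0 takes the lock */
	bool cpu1 = toy_run_begin(&q);  /* cpu1 fails both trylocks, sets missed */

	assert(cpu0 && !cpu1);
	toy_run_end(&q);                /* cpu0 unlocks and sees the miss */

	return atomic_load(&q.resched);
}
```

In this replay the skb enqueued by cpu1 is no longer stranded: cpu0
observes the flag after unlocking and asks for one more run, which is
the role __netif_schedule() plays in the real code. Because the replay
is single-threaded it says nothing about memory ordering between real
CPUs; it only illustrates the control flow.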