From: Cong Wang
Date: Fri, 19 Mar 2021 12:40:53 -0700
Subject: Re: [PATCH net] net: sched: fix packet stuck problem for lockless qdisc
To: Yunsheng Lin
Cc: David Miller, Jakub Kicinski, Vladimir Oltean, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang, Cong Wang,
 Taehee Yoo, Linux Kernel Network Developers, LKML, linuxarm@openeuler.org,
 Marc Kleine-Budde, linux-can@vger.kernel.org, Jamal Hadi Salim, Jiri Pirko,
 Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, kpsingh@kernel.org,
 bpf, Jonas Bonn, Paolo Abeni, Michael Zhivich, Josh Hunt, Jike Song,
 Kehuan Feng
References: <1616050402-37023-1-git-send-email-linyunsheng@huawei.com>
In-Reply-To: <1616050402-37023-1-git-send-email-linyunsheng@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin wrote:
>
> Lockless qdisc has the following concurrency problem:
>
>  cpu0                  cpu1
>    .                     .
>  q->enqueue              .
>    .                     .
>  qdisc_run_begin()       .
>    .                     .
>  dequeue_skb()           .
>    .                     .
>  sch_direct_xmit()       .
>    .                     .
>    .                  q->enqueue
>    .                  qdisc_run_begin()
>    .                  return and do nothing
>    .                     .
>  qdisc_run_end()         .
>
> cpu1 enqueues an skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() returns false
> for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
> enqueued by cpu1 when calling dequeue_skb(), because cpu1 may
> enqueue the skb after cpu0 calls dequeue_skb() and before
> cpu0 calls qdisc_run_end().
>
> Lockless qdisc has another concurrency problem when tx_action
> is involved:
>
> cpu0(serving tx_action)       cpu1              cpu2
>           .                     .                 .
>           .                q->enqueue             .
>           .                qdisc_run_begin()      .
>           .                dequeue_skb()          .
>           .                     .            q->enqueue
>           .                     .                 .
>           .                sch_direct_xmit()      .
>           .                     .            qdisc_run_begin()
>           .                     .            return and do nothing
>           .                     .                 .
> clear __QDISC_STATE_SCHED       .                 .
> qdisc_run_begin()               .                 .
> return and do nothing           .                 .
>           .                     .                 .
>           .                qdisc_run_end()        .
>
> This patch fixes the above data races by:
> 1. Setting a flag after spin_trylock() returns false.
> 2. Retrying spin_trylock() in case the other CPU may not have seen
>    the new flag before it released the lock.
> 3. Rescheduling if the flag is set after the lock is released
>    at the end of qdisc_run_end().
>
> For the tx_action case, the flag is also set when cpu1 is at the
> end of qdisc_run_begin(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
>
> Also, clear the flag before dequeuing in order to reduce the
> overhead of the above process and avoid doing the heavy
> test_and_clear_bit() at the end of qdisc_run_end().
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin
> ---
> For those who have not been following the qdisc scheduling
> discussion: there is a packet stuck problem for lockless qdisc,
> see [1], and I have done some cleanup and added some enhanced
> features too, see [2] [3].
> While I was doing the optimization for lockless qdisc, it
> occurred to me that these optimizations are useless if there is
> still a basic bug in lockless qdisc, even if the bug is not easily
> reproducible.
> So looking through [1] again, I found the data race
> for tx_action mentioned by Michael, thought deeply about
> it, and came up with this patch trying to fix it.
>
> So I would really appreciate it if someone who still has the
> reproducer could try this patch and report back.
>
> 1. https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd8a2@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
> 2. https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsheng@huawei.com/
> 3. https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/
> ---
>  include/net/sch_generic.h | 23 ++++++++++++++++++++---
>  net/sched/sch_generic.c   |  1 +
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f7a6e14..4220eab 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
>         __QDISC_STATE_SCHED,
>         __QDISC_STATE_DEACTIVATED,
> +       __QDISC_STATE_NEED_RESCHEDULE,
>  };
>
>  struct qdisc_size_table {
> @@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>         if (qdisc->flags & TCQ_F_NOLOCK) {
> -               if (!spin_trylock(&qdisc->seqlock))
> -                       return false;
> +               if (!spin_trylock(&qdisc->seqlock)) {
> +                       set_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                               &qdisc->state);

Why do we need another bit? I mean, why not just call __netif_schedule()?

> +
> +                       /* Retry again in case other CPU may not see the
> +                        * new flags after it releases the lock at the
> +                        * end of qdisc_run_end().
> +                        */
> +                       if (!spin_trylock(&qdisc->seqlock))
> +                               return false;
> +               }
>                 WRITE_ONCE(qdisc->empty, false);
>         } else if (qdisc_is_running(qdisc)) {
>                 return false;
> @@ -176,8 +186,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>         write_seqcount_end(&qdisc->running);
> -       if (qdisc->flags & TCQ_F_NOLOCK)
> +       if (qdisc->flags & TCQ_F_NOLOCK) {
>                 spin_unlock(&qdisc->seqlock);
> +
> +               if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                                     &qdisc->state) &&
> +                            !test_bit(__QDISC_STATE_DEACTIVATED,
> +                                      &qdisc->state)))

Testing two bits one by one is not atomic...

> +                       __netif_schedule(qdisc);
> +       }
>  }
>
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 44991ea..25d75d8 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -205,6 +205,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>         const struct netdev_queue *txq = q->dev_queue;
>         struct sk_buff *skb = NULL;
>
> +       clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
>         *packets = 1;
>         if (unlikely(!skb_queue_empty(&q->gso_skb))) {
>                 spinlock_t *lock = NULL;
> --
> 2.7.4