Received: by 2002:a05:7412:da14:b0:e2:908c:2ebd with SMTP id fe20csp394033rdb; Fri, 6 Oct 2023 06:49:21 -0700 (PDT) X-Google-Smtp-Source: AGHT+IE4h/S4umWlV1lghDx/WZjvtSMN8oo6mjPmDrHefGC/myZHklfxoN61v3+NHNtoKX0EG28x X-Received: by 2002:a05:6870:f720:b0:1d6:2476:be2e with SMTP id ej32-20020a056870f72000b001d62476be2emr8183114oab.35.1696600161061; Fri, 06 Oct 2023 06:49:21 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1696600161; cv=none; d=google.com; s=arc-20160816; b=SNaOVAIIcn9axYkBmxAUn24oBRmCc32+WwC8f/1RnSTpz1IPKiOYCNa8HQAqmh392F AX9g5ay0/ZeQosVFXMypWVks7tTu1whx4CX4bJAbjvGRjU7V+ptKg7DeHobd0dZYuT1p IlD3P9DF9iCzkVhRhC5nCYbiUHfqcw6diOb3gkvxTJIRDyHFwf3rPZ0CrcfrTeTFf4iB uSgMnszMP472hR3fEjDM/bD/AikRanVKRrRkiyeG7EhFf1+7PDHqHkMYeuYN1b2fZ/OU z+6yZtEYCELXpSWdd9kQwds5r4WR+Gv/qDzyqEJJJhRPW6uG+GWLbUOD0zUqFOQnw5Jk ZpYQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=xfUMpCb972x02fuc9xITgOLc26MysozXmGisOPR9cQQ=; fh=+JMZjVu4vq1Xm26saJS1dKVBRgyX7rVd+ilNvBhMmLk=; b=d1GeqBiUPCAn/zWelnQ8Qt8/wjQF+losFMblRthS8hL6rvvo6uBcXrNP9Hj5EKTdPO rFS0eDVdJDMT71ZVGwxKI3sdF1vpaScPNVh1sDFyk23U9spTXhmxv7Q9VvrEr+RhLndu 36dDbzyQDpMT9/BCrPl1t0xXsv4cv+W1UB2GK3wndG2PBGKbQ/GBOxZZg4x7i4KWtQs6 aT5z7vhZpSaSQokcydqqd9r9CGWm+uFf/ZuGeX8Im4HgNCXUvZHREP2Wa3J+jt5OU2WE q2Pt+Amb9RfeJZ6PXNyWpCEYo+ma/I1j6hgqI6GpV4VoEhgH4KLs4X6JRYt70eODCWGs s4zA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=GTDwGYfY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from pete.vger.email (pete.vger.email. [23.128.96.36]) by mx.google.com with ESMTPS id f3-20020a056a00228300b00690fec2f3bbsi1566931pfe.85.2023.10.06.06.49.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 06 Oct 2023 06:49:21 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=GTDwGYfY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id DA83D82A9FA3; Fri, 6 Oct 2023 06:49:11 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232705AbjJFNsZ (ORCPT + 99 others); Fri, 6 Oct 2023 09:48:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59066 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232482AbjJFNsH (ORCPT ); Fri, 6 Oct 2023 09:48:07 -0400 Received: from mail-oo1-xc36.google.com (mail-oo1-xc36.google.com [IPv6:2607:f8b0:4864:20::c36]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BFE1119B2 for ; Fri, 6 Oct 2023 06:46:56 -0700 (PDT) Received: by mail-oo1-xc36.google.com with SMTP id 006d021491bc7-57b5f0d658dso1203271eaf.0 for ; Fri, 06 Oct 2023 06:46:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1696600016; x=1697204816; darn=vger.kernel.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=xfUMpCb972x02fuc9xITgOLc26MysozXmGisOPR9cQQ=; b=GTDwGYfY6sE+7hKfpqP1cc85CFr2DXc3t2yoO4buRe36IHA/iRC8aS8Lv1LhiW6vXN 1Ek2w7s31svRS0cqMj4+UkdF2abKc5y9lXYks6cqIwjVjVIc0GPTxdw9afSzFn0pwTqt UnyXBb2zb6hwzOLeUTfW8157YAY3C4fxYdSS/sbU1cNBN1Sf514rxgw1m0UuZLxdVUZ9 Z0DfPQKDcjOFRvvtK73irf+9idgi1DCIgT0PO59KLutm6rhuwUtglj1B1uTEQBLJMEkh Sr4AuVsny9OjFhQK+Tt9Qnhc5fS8WcB0chGkvaZnH359tekVeRCunx82OZZk4AfY8Fj9 2VKA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696600016; x=1697204816; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=xfUMpCb972x02fuc9xITgOLc26MysozXmGisOPR9cQQ=; b=sx5DiZYNVdTvyMZ4v7xQXU4Npy3X+BtUeU279BGtiruanqta5QEQ8lnHGyA2xUZruA CmDakTDHghpD/NUHdQxxXQqdrgK3ooBMmEj+GIZf7SEu6MOqvFSOcnHffcypHmibzuGY UYyiJSOZSAZ7tlO6qc6JM/tz1KToPW4ha1aJpKyJy772M+Ssyyt17zP+Tdz1c2OSuG7B KflbmMfCoUCbEZPF/JPtEXG3Ed8Q4wEEdbYLHBu/G7fPmDQyd3Opm/ASR9QdsIGfwPBl v7xvErGd9hEzIN6tQSIXR3c2GKeMuuuRP7vYYbwUHNyqZIjIE9e80L6m7Qr7c4KtFP7m l1eQ== X-Gm-Message-State: AOJu0YwnjdLqIl4XrjegPZD0HltfvlCybce6EmDNjf4THGsvtosT3sdj l4QOLBW0bj2IbKzX2BEQmhnZcr4CBGXVaMBrnctZDw== X-Received: by 2002:a05:6358:3407:b0:134:e3d2:1e50 with SMTP id h7-20020a056358340700b00134e3d21e50mr7278311rwd.18.1696600015719; Fri, 06 Oct 2023 06:46:55 -0700 (PDT) MIME-Version: 1.0 References: <20231005161727.1855004-1-joel@joelfernandes.org> In-Reply-To: <20231005161727.1855004-1-joel@joelfernandes.org> From: Vincent Guittot Date: Fri, 6 Oct 2023 15:46:44 +0200 Message-ID: Subject: Re: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB To: "Joel Fernandes (Google)" Cc: linux-kernel@vger.kernel.org, Ingo Molnar , Peter Zijlstra , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Valentin Schneider , Vineeth Pillai , Suleiman Souhlal , Hsin Yi , Frederic Weisbecker , "Paul E . McKenney" Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=2.7 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, RCVD_IN_SBL_CSS,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Fri, 06 Oct 2023 06:49:12 -0700 (PDT) X-Spam-Level: ** On Thu, 5 Oct 2023 at 18:17, Joel Fernandes (Google) wrote: > > From: Vineeth Pillai > > Whenever a CPU stops its tick, it now requires another idle CPU to handle the > balancing for it because it can't perform its own periodic load balancing. > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if > the upcoming nohz-idle load balancing is too distant in the future. This update > process is done by triggering an ILB, as the general ILB handler > (_nohz_idle_balance) that manages regular nohz balancing also refreshes > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs > and selecting the smallest value. > > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This > primarily results in the ILB handler updating 'nohz.next_balance' while > possibly not doing any load balancing at all. However, sending an IPI merely to > refresh 'nohz.next_balance' seems excessive, and there ought to be a more > efficient method to update 'nohz.next_balance' from the local CPU. > > Fortunately, there already exists a mechanism to directly invoke the ILB > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle" > balancing and solely exists to update a CPU's blocked load if it couldn't pull > more tasks during regular "newly idle balancing" - and it does so without > having to send any IPIs. Once the flag is set, the ILB handler is called > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update > the blocked load without an IPI, in our situation, we aim to refresh > 'nohz.next_balance' without an IPI but we can piggy back on this. > > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to > indicate nohz.next_balance needs an update via this direct call shortcut. Note > that we set this flag without knowledge that the tick is about to be stopped, > because at the point we do it, we have no way of knowing that. However we do > know that the CPU is about to enter idle. In our testing, the reduction in IPIs > is well worth updating nohz.next_balance a few more times. > > Also just to note, without this patch we observe the following pattern: > > 1. A CPU is about to stop its tick. > 2. It sets nohz.needs_update to 1. > 3. It then stops its tick and goes idle. > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed. > 5. The ILB CPU ends up being the one that just stopped its tick! > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up > and disturbing it! > > Testing shows a considerable reduction in IPIs when doing this: > > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM > the IPI call count profiled over 10s period is as follows: > without fix: ~10500 > with fix: ~1000 > > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle") > > [ Joel: wrote commit messages, collaborated on fix, helped reproduce issue etc. ] > > Cc: Suleiman Souhlal > Cc: Steven Rostedt > Cc: Hsin Yi > Cc: Frederic Weisbecker > Cc: Paul E. McKenney > Signed-off-by: Vineeth Pillai > Co-developed-by: Joel Fernandes (Google) > Signed-off-by: Joel Fernandes (Google) > --- > kernel/sched/fair.c | 21 ++++++++++++++------- > 1 file changed, 14 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index cb225921bbca..2ece55f32782 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -11786,13 +11786,12 @@ void nohz_balance_enter_idle(int cpu) > /* > * Ensures that if nohz_idle_balance() fails to observe our > * @idle_cpus_mask store, it must observe the @has_blocked > - * and @needs_update stores. > + * stores. > */ > smp_mb__after_atomic(); > > set_cpu_sd_state_idle(cpu); > > - WRITE_ONCE(nohz.needs_update, 1); the set of needs_updat here is the main way to set nohz.needs_update and trigger an update of next_balance so it would be good to explain why we still need to keep it instead r removing it entirely > out: > /* > * Each time a cpu enter idle, we assume that it has blocked load and > @@ -11945,21 +11944,25 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) > } > > /* > - * Check if we need to run the ILB for updating blocked load before entering > - * idle state. > + * Check if we need to run the ILB for updating blocked load and/or updating > + * nohz.next_balance before entering idle state. > */ > void nohz_run_idle_balance(int cpu) > { > unsigned int flags; > > - flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu)); > + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK | NOHZ_NEXT_KICK, nohz_flags(cpu)); > + > + if (!flags) > + return; > > /* > * Update the blocked load only if no SCHED_SOFTIRQ is about to happen > * (ie NOHZ_STATS_KICK set) and will do the same. > */ > - if ((flags == NOHZ_NEWILB_KICK) && !need_resched()) > - _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK); > + if ((flags == (flags & (NOHZ_NEXT_KICK | NOHZ_NEWILB_KICK))) && > + !need_resched()) > + _nohz_idle_balance(cpu_rq(cpu), flags); This breaks the update of the blocked load. nohz_newidle_balance() only sets NOHZ_NEWILB_KICK (and not NOHZ_STATS_KICK) when it wants to update blocked load before going idle but it then sets NOHZ_STATS_KICK when calling_nohz_idle_balance() . We only clear NOHZ_NEWILB_KICK when fetching flags but we test if other bits have been set by a pending softirq which will do the same. That's why we use NOHZ_NEWILB_KICK and not NOHZ_STATS_KICK directly. Similarly you can't directly use NOHZ_NEXT_KICK. Also note that _nohz_idle_balance() is not robust against parallel run > } > > static void nohz_newidle_balance(struct rq *this_rq) > @@ -11977,6 +11980,10 @@ static void nohz_newidle_balance(struct rq *this_rq) > if (this_rq->avg_idle < sysctl_sched_migration_cost) > return; > > + /* If rq->next_balance before nohz.next_balance, trigger ILB */ > + if (time_before(this_rq->next_balance, READ_ONCE(nohz.next_balance))) > + atomic_or(NOHZ_NEXT_KICK, nohz_flags(this_cpu)); > + > /* Don't need to update blocked load of idle CPUs*/ > if (!READ_ONCE(nohz.has_blocked) || > time_before(jiffies, READ_ONCE(nohz.next_blocked))) > -- > 2.42.0.609.gbb76f46606-goog >