Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance
Date: Thu, 14 Sep 2023 16:53:12 +0200
From: Pierre Gondois
To: "Zhang, Rui", "Lu, Aaron"
Cc: peterz@infradead.org, mingo@redhat.com, "Pandruvada, Srinivas", vincent.guittot@linaro.org, linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com, tj@kernel.org
References: <20230804090858.7605-1-rui.zhang@intel.com>
 <20230814031432.GA574314@ziqianlu-dell>
 <20230911114218.GA334545@ziqianlu-dell>
 <0d1c440becca151db3f3b05b3fb2c63fe69c2904.camel@intel.com>

On 9/14/23 11:23, Zhang, Rui wrote:
> Hi, Pierre,
>
>>
>> Yes right indeed,
>> This happens when putting a CPU offline (as you mentioned earlier,
>> putting a CPU offline clears the CPU in the idle_cpus_mask).
>>
>> The load balancing related variables
>
> including?

I meant the nohz idle variables used in load balancing, so I was referring to:
(struct sched_domain_shared).nr_busy_cpus
(struct sched_domain).nohz_idle
nohz.idle_cpus_mask
nohz.nr_cpus
(struct rq).nohz_tick_stopped
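For context, these live roughly as follows (abridged from a v6.5-era
tree; the inline comments are mine, so double-check against your tree):

/* include/linux/sched/topology.h (abridged) */
struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;		/* non-idle CPUs sharing this LLC */
	/* other fields elided */
};

struct sched_domain {
	/* ... */
	int		nohz_idle;		/* NOHZ IDLE status */
	/* ... */
};

/* kernel/sched/fair.c (abridged) */
static struct {
	cpumask_var_t	idle_cpus_mask;		/* CPUs currently NOHZ-idle */
	atomic_t	nr_cpus;		/* weight of idle_cpus_mask */
	/* ... */
} nohz ____cacheline_aligned;

/* (struct rq).nohz_tick_stopped is an unsigned int in kernel/sched/sched.h */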
>
>> are unused if a CPU has a NULL
>> rq as it cannot pull any task. Ideally we should clear them once,
>> when attaching a NULL sd to the CPU.
>
> This sounds good to me. But TBH, I don't have enough confidence to do
> so because I'm not crystal clear about how these variables are used.
>
> Some questions about the code below.
>>
>> The following snippet should do that and solve the issue you
>> mentioned:
>> --- snip ---
>> --- a/include/linux/sched/nohz.h
>> +++ b/include/linux/sched/nohz.h
>> @@ -9,8 +9,10 @@
>>  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>>  extern void nohz_balance_enter_idle(int cpu);
>>  extern int get_nohz_timer_target(void);
>> +extern void nohz_clean_sd_state(int cpu);
>>  #else
>>  static inline void nohz_balance_enter_idle(int cpu) { }
>> +static inline void nohz_clean_sd_state(int cpu) { }
>>  #endif
>>
>>  #ifdef CONFIG_NO_HZ_COMMON
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b3e25be58e2b..6fcabe5d08f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
>>  {
>>         SCHED_WARN_ON(rq != this_rq());
>>
>> +       if (on_null_domain(rq))
>> +               return;
>> +
>>         if (likely(!rq->nohz_tick_stopped))
>>                 return;
>>
>
> If we force clearing rq->nohz_tick_stopped when detaching the domain,
> why bother adding the first check?

Yes, you're right. I added this check for safety, but it is not mandatory.

>
>>
>> @@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
>>         rcu_read_unlock();
>>  }
>>
>> +void nohz_clean_sd_state(int cpu) {
>> +       struct rq *rq = cpu_rq(cpu);
>> +
>> +       rq->nohz_tick_stopped = 0;
>> +       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
>> +               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
>> +               atomic_dec(&nohz.nr_cpus);
>> +       }
>> +       set_cpu_sd_state_idle(cpu);
>> +}
>> +
>
> detach_destroy_domains
>   cpu_attach_domain
>     update_top_cache_domain
>
> As we clear per_cpu(sd_llc, cpu) for the isolated cpu in
> cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
> no?

Yes, you're right: the cpu_attach_domain() and nohz_clean_sd_state()
calls have to be inverted to avoid what you just described.

It also seems that the current kernel doesn't decrease nr_busy_cpus when
putting CPUs in an isolated partition. Indeed, if a CPU is counted in
nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger
any call to set_cpu_sd_state_idle(). So it might be an additional
argument.

Thanks for reading the patch,
Regards,
Pierre

>
> thanks,
> rui

>>  /*
>>   * This routine will record that the CPU is going idle with tick stopped.
>>   * This info will be used in performing idle load balancing in the future.
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d3a3b2646ec4..d31137b5f0ce 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
>>
>>         static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
>>
>>         rcu_read_lock();
>> -       for_each_cpu(i, cpu_map)
>> +       for_each_cpu(i, cpu_map) {
>>                 cpu_attach_domain(NULL, &def_root_domain, i);
>> +               nohz_clean_sd_state(i);
>> +       }
>>         rcu_read_unlock();
>>  }
>>
>> --- snip ---
>>
>> Regards,
>> Pierre
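Regarding the sd_llc ordering above: set_cpu_sd_state_idle() is also
where nr_busy_cpus gets decremented on this path, and it bails out once
sd_llc is cleared. For reference, its body is roughly the following
(quoted from a v6.5-era kernel/sched/fair.c; the comment on the early
exit is mine, double-check against your tree):

static void set_cpu_sd_state_idle(int cpu)
{
	struct sched_domain *sd;

	rcu_read_lock();
	sd = rcu_dereference(per_cpu(sd_llc, cpu));

	/* After cpu_attach_domain(NULL, ...), sd is NULL and we bail out: */
	if (!sd || sd->nohz_idle)
		goto unlock;
	sd->nohz_idle = 1;

	atomic_dec(&sd->shared->nr_busy_cpus);
unlock:
	rcu_read_unlock();
}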
>>
>>>
>>>>
>>>>> +       }
>>>>> +
>>>>>           /*
>>>>>            * The tick is still stopped but load could have been added in the
>>>>>            * meantime. We set the nohz.has_blocked flag to trig a check of the
>>>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>>>>           if (rq->nohz_tick_stopped)
>>>>>                   goto out;
>>>>> -       /* If we're a completely isolated CPU, we don't play: */
>>>>> -       if (on_null_domain(rq))
>>>>> -               return;
>>>>> -
>>>>>           rq->nohz_tick_stopped = 1;
>>>>>           cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>>>
>>>>> Otherwise I could reproduce the issue and the patch was solving it, so:
>>>>> Tested-by: Pierre Gondois
>>>
>>> Thanks for testing, really appreciated!
>>>>>
>>>>> Also, your patch doesn't aim to solve that, but I think there is an
>>>>> issue when updating cpuset.cpus when an isolated partition was
>>>>> already created:
>>>>>
>>>>> // Create an isolated partition containing CPU0
>>>>> # mkdir cgroup
>>>>> # mount -t cgroup2 none cgroup/
>>>>> # mkdir cgroup/Testing
>>>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>>>
>>>>> // CPU0's sched domain is detached:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0  domain1
>>>>>
>>>>> // Change the isolated partition to be CPU1
>>>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>>>
>>>>> // CPU[0-1] sched domains are not updated:
>>>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>>>> domain0  domain1
>>>>>
>>> Interesting. Let me check and get back to you later on this. :)
>>>
>>> thanks,
>>> rui