Date: Tue, 12 Sep 2023 16:32:59 +0200
Subject: Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance
From: Pierre Gondois
To: "Zhang, Rui", "Lu, Aaron"
Cc: peterz@infradead.org, mingo@redhat.com, vincent.guittot@linaro.org, linux-kernel@vger.kernel.org, tj@kernel.org, dietmar.eggemann@arm.com, "Pandruvada, Srinivas"
References: <20230804090858.7605-1-rui.zhang@intel.com> <20230814031432.GA574314@ziqianlu-dell> <20230911114218.GA334545@ziqianlu-dell> <0d1c440becca151db3f3b05b3fb2c63fe69c2904.camel@intel.com>
In-Reply-To: <0d1c440becca151db3f3b05b3fb2c63fe69c2904.camel@intel.com>

Hello Rui and Aaron,

On 9/11/23 18:23, Zhang, Rui wrote:
> On Mon, 2023-09-11 at 19:42 +0800, Aaron Lu wrote:
>> Hi Pierre,
>>
>> On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
>>> Hello Rui,
>>>
>>> On 8/14/23 10:30, Zhang, Rui wrote:
>>>> On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
>>>>> Hi Rui,
>>>>>
>>>>> On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
>>>>>> Problem statement
>>>>>> -----------------
>>>>>> When using a cgroup isolated partition to isolate cpus including
>>>>>> cpu0, it is observed that cpu0 is woken up frequently but does
>>>>>> nothing. This is not good for power efficiency.
>>>>>>
>>>>>> <idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
>>>>>> <idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
>>>>>> <idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
>>>>>> <idle>-0     [000]   616.491618: rcu_utilization:      End context switch
>>>>>> <idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
>>>>>> <idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
>>>>>> <idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
>>>>>>
>>>>>> The above pattern repeats every one or multiple ticks, resulting
>>>>>> in 2000+ wakeups in total on cpu0 in 60 seconds, when running a
>>>>>> workload on the cpus that are not in the isolated partition.
>>>>>>
>>>>>> Rootcause
>>>>>> ---------
>>>>>> In NOHZ mode, an active cpu either sends an IPI or touches the
>>>>>> idle cpu's polling flag to wake it up, so that the idle cpu can
>>>>>> pull tasks from the busy cpu. The logic for selecting the target
>>>>>> cpu is to use the first idle cpu that is present in both
>>>>>> nohz.idle_cpus_mask and the housekeeping cpumask.
>>>>>>
>>>>>> In the above scenario, when cpu0 is in the cgroup isolated
>>>>>> partition, its sched domain is detached, but it is still
>>>>>> available in both of the above cpumasks. As a result, cpu0
>>>>>
>>>>> I saw that in nohz_balance_enter_idle(), if a cpu is isolated, it
>>>>> will not set itself in nohz.idle_cpus_mask and thus should not be
>>>>> chosen as ilb_cpu. I wonder what's stopping this from working?
>>>>
>>>> One thing I forgot to mention is that the problem is gone if we
>>>> offline and re-online those cpus. In that case, the isolated cpus
>>>> are removed from the nohz.idle_cpus_mask in sched_cpu_deactivate()
>>>> and are never added back.
>>>>
>>>> At runtime, the cpus can be removed from the nohz.idle_cpus_mask
>>>> in the below case only:
>>>>         trigger_load_balance()
>>>>                 if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
>>>>                         return;
>>>>                 nohz_balancer_kick(rq);
>>>>                         nohz_balance_exit_idle()
>>>>
>>>> My understanding is that if a cpu is in nohz.idle_cpus_mask when
>>>> it is isolated, there is no chance to remove it from that mask
>>>> later, so the check in nohz_balance_enter_idle() does not help.
>>>
>>> As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped
>>> is never updated. Instead of avoiding isolated CPUs at each
>>> find_new_ilb() call as this patch does, maybe it would be better to
>>> permanently remove these CPUs from nohz.idle_cpus_mask. This would
>>> lower the number of checks done.
>>
>> I agree.
>
> Sounds reasonable to me.
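For reference, the ilb selection logic described in the "Rootcause"
section above lives in find_new_ilb(). A simplified sketch (close to,
but not necessarily identical to, the code of this kernel version):

--- snip ---
static inline int find_new_ilb(void)
{
	const struct cpumask *hk_mask = housekeeping_cpumask(HK_TYPE_MISC);
	int ilb;

	/* Walk the idle CPUs that are also housekeeping CPUs. */
	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
		if (ilb == smp_processor_id())
			continue;

		/*
		 * First idle CPU wins: an isolated CPU that is never
		 * cleared from nohz.idle_cpus_mask keeps being picked
		 * and kicked here.
		 */
		if (idle_cpu(ilb))
			return ilb;
	}

	return nr_cpu_ids;
}
--- snip ---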
>
>>
>>> This could be done with something like:
>>> @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
>>>           */
>>>          rq->has_blocked_load = 1;
>>> +       /* If we're a completely isolated CPU, we don't play: */
>>> +       if (on_null_domain(rq)) {
>>> +               if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
>>> +                       cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
>>> +                       atomic_dec(&nohz.nr_cpus);
>>> +               }
>>> +
>>
>>
>>> +               /*
>>> +                * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
>>> +                * value that was stored when rebuild_sched_domains() was called.
>>> +                */
>>> +               rq->nohz_tick_stopped = 1;
>>
>> It's not immediately clear to me what the above comment and code
>> mean, maybe that's because I know very little about sched domain
>> related code. Other than this, your change looks good to me.
>>
>
> If we set rq->nohz_tick_stopped for the isolated cpu, next time when
> we invoke nohz_balance_exit_idle(), we clear rq->nohz_tick_stopped
> and decrease nohz.nr_cpus (for the second time). This seems like a
> problem.
>
> In the current code, when rq->nohz_tick_stopped is set, it means the
> cpu is in the nohz.idle_cpus_mask, and the code above breaks this
> assumption.

Yes, right indeed. This happens when putting a CPU offline (as you
mentioned earlier, putting a CPU offline clears the CPU in the
idle_cpus_mask).

The load balancing related variables are unused if a CPU has a NULL
sched domain attached, as it cannot pull any task. Ideally we should
clear them once, when attaching a NULL sd to the CPU. (For reference,
a sketch of the current nohz_balance_exit_idle() path is appended at
the end of this mail.) The following snippet should do that and solve
the issue you mentioned:

--- snip ---
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,8 +9,10 @@
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern int get_nohz_timer_target(void);
+extern void nohz_clean_sd_state(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
+static inline void nohz_clean_sd_state(int cpu) { }
 #endif
 
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..6fcabe5d08f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
 {
 	SCHED_WARN_ON(rq != this_rq());
 
+	if (on_null_domain(rq))
+		return;
+
 	if (likely(!rq->nohz_tick_stopped))
 		return;
 
@@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
 	rcu_read_unlock();
 }
 
+void nohz_clean_sd_state(int cpu) {
+	struct rq *rq = cpu_rq(cpu);
+
+	rq->nohz_tick_stopped = 0;
+	if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
+		cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+		atomic_dec(&nohz.nr_cpus);
+	}
+	set_cpu_sd_state_idle(cpu);
+}
+
 /*
  * This routine will record that the CPU is going idle with tick stopped.
  * This info will be used in performing idle load balancing in the future.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..d31137b5f0ce 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
 		static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
 
 	rcu_read_lock();
-	for_each_cpu(i, cpu_map)
+	for_each_cpu(i, cpu_map) {
 		cpu_attach_domain(NULL, &def_root_domain, i);
+		nohz_clean_sd_state(i);
+	}
 	rcu_read_unlock();
 }
--- snip ---

Regards,
Pierre

>
>>
>>> +       }
>>> +
>>>          /*
>>>           * The tick is still stopped but load could have been added in the
>>>           * meantime. We set the nohz.has_blocked flag to trig a check of the
>>> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
>>>          if (rq->nohz_tick_stopped)
>>>                  goto out;
>>> -       /* If we're a completely isolated CPU, we don't play: */
>>> -       if (on_null_domain(rq))
>>> -               return;
>>> -
>>>          rq->nohz_tick_stopped = 1;
>>>          cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>
>>> Otherwise I could reproduce the issue and the patch was solving it, so:
>>> Tested-by: Pierre Gondois
>
> Thanks for testing, really appreciated!

>>>
>>> Also, your patch doesn't aim to solve that, but I think there is an
>>> issue when updating cpuset.cpus when an isolated partition was
>>> already created:
>>>
>>> // Create an isolated partition containing CPU0
>>> # mkdir cgroup
>>> # mount -t cgroup2 none cgroup/
>>> # mkdir cgroup/Testing
>>> # echo "+cpuset" > cgroup/cgroup.subtree_control
>>> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
>>> # echo 0 > cgroup/Testing/cpuset.cpus
>>> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>>>
>>> // CPU0's sched domain is detached:
>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>> domain0  domain1
>>>
>>> // Change the isolated partition to be CPU1
>>> # echo 1 > cgroup/Testing/cpuset.cpus
>>>
>>> // CPU[0-1] sched domains are not updated:
>>> # ls /sys/kernel/debug/sched/domains/cpu0/
>>> # ls /sys/kernel/debug/sched/domains/cpu1/
>>> domain0  domain1
>>>
> Interesting. Let me check and get back to you later on this. :)
>
> thanks,
> rui
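(For reference, the nohz_balance_exit_idle() sketch mentioned above.
Roughly as in kernel/sched/fair.c of this era, possibly differing in
detail; it shows the pairing Rui pointed at: rq->nohz_tick_stopped
guards the clearing of the mask bit and the nohz.nr_cpus decrement, so
setting the flag on a CPU that is not in nohz.idle_cpus_mask would let
the counter be decremented twice:)

--- snip ---
void nohz_balance_exit_idle(struct rq *rq)
{
	SCHED_WARN_ON(rq != this_rq());

	/* The flag is only set once the CPU is accounted in the mask... */
	if (likely(!rq->nohz_tick_stopped))
		return;

	/* ...so flag, mask bit and nr_cpus are cleared together. */
	rq->nohz_tick_stopped = 0;
	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
	atomic_dec(&nohz.nr_cpus);

	set_cpu_sd_state_idle(rq->cpu);
}
--- snip ---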