From: Vincent Guittot
Date: Wed, 23 Sep 2020 10:50:32 +0200
Subject: Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain
To: "Li, Aubrey"
Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Valentin Schneider, Tim Chen, linux-kernel,
    Qais Yousef, Jiang Biao
In-Reply-To: <8a86b085-b445-b1c2-9b46-6346d923abf0@linux.intel.com>
References: <20200916043103.606132-1-aubrey.li@linux.intel.com>
 <20200916110039.GG3117@suse.de>
 <78d608f2-b974-e940-da32-b37777bc405a@linux.intel.com>
 <8a86b085-b445-b1c2-9b46-6346d923abf0@linux.intel.com>

On Wed, 23 Sep 2020 at 04:59, Li, Aubrey wrote:
>
> Hi Vincent,
>
> On 2020/9/22 15:14, Vincent Guittot wrote:
> > On Tue, 22 Sep 2020 at 05:33, Li, Aubrey wrote:
> >>
> >> On 2020/9/21 23:21, Vincent Guittot wrote:
> >>> On Mon, 21 Sep 2020 at 17:14, Vincent Guittot
> >>> wrote:
> >>>>
> >>>> On Thu, 17 Sep 2020 at 11:21, Li, Aubrey wrote:
> >>>>>
> >>>>> On 2020/9/16 19:00, Mel Gorman wrote:
> >>>>>> On Wed, Sep 16, 2020 at 12:31:03PM +0800, Aubrey Li wrote:
> >>>>>>> Added idle cpumask to track idle cpus in sched domain. When a CPU
> >>>>>>> enters idle, its corresponding bit in the idle cpumask will be set,
> >>>>>>> and when the CPU exits idle, its bit will be cleared.
> >>>>>>>
> >>>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask
> >>>>>>> has low cost than scanning all the cpus in last level cache domain,
> >>>>>>> especially when the system is heavily loaded.
> >>>>>>>
> >>>>>>> The following benchmarks were tested on a x86 4 socket system with
> >>>>>>> 24 cores per socket and 2 hyperthreads per core, total 192 CPUs:
> >>>>>>>
> >>>>>>
> >>>>>> This still appears to be tied to turning the tick off. An idle CPU
> >>>>>> available for computation does not necessarily have the tick turned off
> >>>>>> if it's for short periods of time. When nohz is disabled or a machine is
> >>>>>> active enough that CPUs are not disabling the tick, select_idle_cpu may
> >>>>>> fail to select an idle CPU and instead stack tasks on the old CPU.
> >>>>>>
> >>>>>> The other subtlety is that select_idle_sibling() currently allows a
> >>>>>> SCHED_IDLE cpu to be used as a wakeup target. The CPU is not really
> >>>>>> idle as such, it's simply running a low priority task that is suitable
> >>>>>> for preemption. I suspect this patch breaks that.
> >>>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> I shall post a v3 with performance data, I made a quick uperf testing and
> >>>>> found the benefit is still there. So I posted the patch here and looking
> >>>>> forward to your comments before I start the benchmarks.
> >>>>>
> >>>>> Thanks,
> >>>>> -Aubrey
> >>>>>
> >>>>> -----------------------------------------------------------------------
> >>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >>>>> index fb11091129b3..43a641d26154 100644
> >>>>> --- a/include/linux/sched/topology.h
> >>>>> +++ b/include/linux/sched/topology.h
> >>>>> @@ -65,8 +65,21 @@ struct sched_domain_shared {
> >>>>>         atomic_t        ref;
> >>>>>         atomic_t        nr_busy_cpus;
> >>>>>         int             has_idle_cores;
> >>>>> +       /*
> >>>>> +        * Span of all idle CPUs in this domain.
> >>>>> +        *
> >>>>> +        * NOTE: this field is variable length. (Allocated dynamically
> >>>>> +        * by attaching extra space to the end of the structure,
> >>>>> +        * depending on how many CPUs the kernel has booted up with)
> >>>>> +        */
> >>>>> +       unsigned long   idle_cpus_span[];
> >>>>>  };
> >>>>>
> >>>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds)
> >>>>> +{
> >>>>> +       return to_cpumask(sds->idle_cpus_span);
> >>>>> +}
> >>>>> +
> >>>>>  struct sched_domain {
> >>>>>         /* These fields must be setup */
> >>>>>         struct sched_domain __rcu *parent;      /* top domain must be null terminated */
> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>> index 6b3b59cc51d6..9a3c82645472 100644
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -6023,6 +6023,26 @@ void __update_idle_core(struct rq *rq)
> >>>>>         rcu_read_unlock();
> >>>>>  }
> >>>>>
> >>>>> +/*
> >>>>> + * Update cpu idle state and record this information
> >>>>> + * in sd_llc_shared->idle_cpus_span.
> >>>>> + */
> >>>>> +void update_idle_cpumask(struct rq *rq)
> >>>>> +{
> >>>>> +       struct sched_domain *sd;
> >>>>> +       int cpu = cpu_of(rq);
> >>>>> +
> >>>>> +       rcu_read_lock();
> >>>>> +       sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >>>>> +       if (!sd || !sd->shared)
> >>>>> +               goto unlock;
> >>>>> +       if (!available_idle_cpu(cpu) || !sched_idle_cpu(cpu))
> >>>>> +               goto unlock;
> >>>>> +       cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared));
> >>>>> +unlock:
> >>>>> +       rcu_read_unlock();
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Scan the entire LLC domain for idle cores; this dynamically switches off if
> >>>>>   * there are no idle cores left in the system; tracked through
> >>>>> @@ -6136,7 +6156,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> >>>>>
> >>>>>         time = cpu_clock(this);
> >>>>>
> >>>>> -       cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >>>>> +       /*
> >>>>> +        * sched_domain_shared is set only at shared cache level,
> >>>>> +        * this works only because select_idle_cpu is called with
> >>>>> +        * sd_llc.
> >>>>> +        */
> >>>>> +       cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);
> >>>>>
> >>>>>         for_each_cpu_wrap(cpu, cpus, target) {
> >>>>>                 if (!--nr)
> >>>>> @@ -6712,6 +6737,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>>>>
> >>>>>                 if (want_affine)
> >>>>>                         current->recent_used_cpu = cpu;
> >>>>> +
> >>>>> +               sd = rcu_dereference(per_cpu(sd_llc, new_cpu));
> >>>>> +               if (sd && sd->shared)
> >>>>> +                       cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
> >>>>
> >>>> Why are you clearing the bit only for the fast path ? the slow path
> >>>> can also select an idle CPU
> >>
> >> Right, I saw idle core searching is turned off in the fast path only too,
> >> because next wakeup we'll check if the CPU is idle, this only affects the
> >> idle cpu searching span.
> >>
> >>>>
> >>>> Then, I'm afraid that updating a cpumask at each and every task wakeup
> >>>> will be far too expensive. That's why we are ot updating
> >>>
> >>> That's why we are not updating
> >>
> >> AFAIK, uperf/netperf is the workload with bunches of short idles, do you
> >> have any other workloads in your mind? I can measure to verify this.
> >>>
> >>>> nohz.idle_cpus_mask at each and every enter/exit idle but only once
> >>>> per tick.
> >> Yes, agreed, need more think about this, especially if the data is really
> >> bad.
> >>
> >>>>
> >>>> And a quick test with hackbench on my octo cores arm64 gives for 12
> >>>> iterations of: hackbench -l 2560 -g 1
> >>>> tip/sched/core  : 1.324(+/- 1.26%)
> >>>> with this patch : 2.419(+/- 12.31%) -82% regression
> >>
> >> Can you please clarify this, is this running 2560 loops and 1 group?
> >
> > yes it's 2560 loops and 1 group and I run 12 times the bench:
> > $ hackbench -l 2560 -g 1
> > Running in process mode with 1 groups using 40 file descriptors each
> > (== 40 tasks)
> > Each sender will pass 2560 messages of 100 bytes
> > Time: 2.953
> >
> > you can also have a look at perf sched pipe
> > tip/sched/core
> > $ perf bench sched pipe -T -l 50000
> > # Running 'sched/pipe' benchmark:
> > # Executed 50000 pipe operations between two threads
> >
> >      Total time: 0.980 [sec]
> >
> >       19.609160 usecs/op
> >           50996 ops/sec
> >
> > With your patch :
> > $ perf bench sched pipe -T -l 50000
> > # Running 'sched/pipe' benchmark:
> > # Executed 50000 pipe operations between two threads
> >
> >      Total time: 1.283 [sec]
> >
> >       25.674200 usecs/op
> >           38949 ops/sec
> >
> > which is a 23% regression
>
> This is interesting, 10 iterations data on my side:
>
> $ cat upstream.log | grep Total
> Total time: 0.799 [sec]
> Total time: 0.513 [sec]
> Total time: 0.658 [sec]
> Total time: 0.499 [sec]
> Total time: 0.644 [sec]
> Total time: 0.758 [sec]
> Total time: 0.576 [sec]
> Total time: 0.864 [sec]
> Total time: 0.566 [sec]
> Total time: 0.694 [sec]
> Total time: 0.609 [sec]
>
> $ cat idle_cpumask.log | grep Total
> Total time: 0.595 [sec]
> Total time: 0.519 [sec]
> Total time: 0.592 [sec]
> Total time: 0.504 [sec]
> Total time: 0.440 [sec]
> Total time: 0.676 [sec]
> Total time: 0.683 [sec]
> Total time: 0.437 [sec]
> Total time: 0.460 [sec]
> Total time: 0.473 [sec]
> Total time: 0.478 [sec]
>
> That is, there is a 18.53% improvement.
> ========================================
> cases           avg             std
> 5.8.10          0.653           17.958
> idle_cpumask    0.532           16.916
>
> Is this caused by the architecture specific atomic operations?
> I also noticed I have shorter total time of both hackbench and perf sched
> pipe.

TBH, I'm surprised that perf sched pipe is shorter with your patch.
It adds more code to run but doesn't take advantage of a shorter cpumask in
select_idle_cpu(): either select_idle_cpu() is skipped because the previous
cpu is idle, which should be the case in this lightly loaded benchmark, or
the cpumask is much the same as the sched_domain span.

The sequence is:
  select_task_rq_fair()
    select prev which should be idle
    + cpumask_clear_cpu(new_cpu, sds_idle_cpus(sd->shared));
  pick_next_task_fair()
    set_next_task_fair()
      + unlikely update_idle_cpumask(rq);
  wake up other side
  go back to sleep
  put_prev_task_fair
  pick_next_task_idle
    set_next_task_idle
      + update_idle_cpumask(rq)

>
> Would you mind share uperf(netperf load) result on your side? That's the
> workload I have seen the most benefit this patch contributed under heavy
> load level.

With uperf, I've got the same kind of result as sched pipe:

tip/sched/core:  Throughput 24.83Mb/s (+/- 0.09%)
with this patch: Throughput 19.02Mb/s (+/- 0.71%)

which is a 23% regression, as for sched pipe.

>
> Maybe I should look for the architecture specific implementation? Any clue?
>
> >
> >> 10 iterations "./hackbench 1 process 2560" on my side are:
> >>
> >> 5.8.10: 0.14(+/- 12.01%)
> >> =========================
> >> [0.089, 0.148, 0.147, 0.141, 0.143, 0.143, 0.143, 0.146, 0.143, 0.142]
> >> Score: avg - 0.1385, std - 12.01%
> >>
> >> With this patch
> >> ================
> >> [0.095, 0.142, 0.143, 0.142, 0.15, 0.146, 0.144, 0.145, 0.143, 0.145]
> >> Score: avg - 0.1395, std - 10.88%
> >>
> >> I didn't see such big regression.
> >>
> >>>>
> >>>>>         }
> >>>>>         rcu_read_unlock();
> >>>>>
> >>>>> @@ -10871,6 +10900,9 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> >>>>>                 /* ensure bandwidth has been allocated on our new cfs_rq */
> >>>>>                 account_cfs_rq_runtime(cfs_rq, 0);
> >>>>>         }
> >>>>> +       /* Update idle cpumask if task has idle policy */
> >>>>> +       if (unlikely(task_has_idle_policy(p)))
> >>>>> +               update_idle_cpumask(rq);
> >>>>
> >>>> it's wrong because a sched_idle task will run for time to time even
> >>>> when some cfs tasks are runnable
> >>>>
> >> Sorry I didn't get your point. The intention here is to add a SCHED_IDLE cpu to the idle cpumask,
> >> so that this cpu can be used as a target for wakeup preemption.
> >
> > a cpu with sched_idle tasks can be considered idle iff there is only
> > sched_idle tasks runnable. Look at sched_idle_cpu()
> >
> You are right, thanks to point this out.
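
For reference, sched_idle_cpu() in kernel/sched/fair.c around this time is
roughly the following (paraphrased from memory, please double check against
your tree):

/* CPU only has SCHED_IDLE tasks enqueued */
static int sched_idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

i.e. a CPU counts as sched-idle only when it has runnable tasks and every one
of them has the SCHED_IDLE policy, which is a different condition from "this
CPU is currently running a sched_idle task".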
>
> Thanks,
> -Aubrey
> >>
> >>>>>         }
> >>>>>
> >>>>>  void init_cfs_rq(struct cfs_rq *cfs_rq)
> >>>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >>>>> index 1ae95b9150d3..876dfdfe35bb 100644
> >>>>> --- a/kernel/sched/idle.c
> >>>>> +++ b/kernel/sched/idle.c
> >>>>> @@ -405,6 +405,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> >>>>>  static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> >>>>>  {
> >>>>>         update_idle_core(rq);
> >>>>> +       update_idle_cpumask(rq);
> >>>>>         schedstat_inc(rq->sched_goidle);
> >>>>>  }
> >>>>>
> >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>>> index c82857e2e288..7a3355f61bcf 100644
> >>>>> --- a/kernel/sched/sched.h
> >>>>> +++ b/kernel/sched/sched.h
> >>>>> @@ -1069,6 +1069,7 @@ static inline void update_idle_core(struct rq *rq)
> >>>>>  #else
> >>>>>  static inline void update_idle_core(struct rq *rq) { }
> >>>>>  #endif
> >>>>> +void update_idle_cpumask(struct rq *rq);
> >>>>>
> >>>>>  DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >>>>>
> >>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >>>>> index 9079d865a935..f14a6ef4de57 100644
> >>>>> --- a/kernel/sched/topology.c
> >>>>> +++ b/kernel/sched/topology.c
> >>>>> @@ -1407,6 +1407,7 @@ sd_init(struct sched_domain_topology_level *tl,
> >>>>>                 sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> >>>>>                 atomic_inc(&sd->shared->ref);
> >>>>>                 atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> >>>>> +               cpumask_copy(sds_idle_cpus(sd->shared), sched_domain_span(sd));
> >>>>>         }
> >>>>>
> >>>>>         sd->private = sdd;
> >>>>> @@ -1769,7 +1770,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> >>>>>
> >>>>>                 *per_cpu_ptr(sdd->sd, j) = sd;
> >>>>>
> >>>>> -               sds = kzalloc_node(sizeof(struct sched_domain_shared),
> >>>>> +               sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
> >>>>>                                    GFP_KERNEL, cpu_to_node(j));
> >>>>>                 if (!sds)
> >>>>>                         return -ENOMEM;
> >>>>>
> >>
>
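
As an aside, the sds allocation above follows the usual pattern for embedding
a variable-length cpumask at the end of a structure: a flexible array member
plus cpumask_size() extra bytes at allocation time. A minimal standalone
sketch of that pattern, with made-up names and plain calloc() instead of
kzalloc_node(), just to illustrate the shape:

#include <stdlib.h>

#define NR_CPUS		256
#define BITS_PER_LONG	(8 * sizeof(unsigned long))
#define MASK_LONGS	((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct shared_state {
	int refcount;
	unsigned long idle_span[];	/* storage appended at allocation */
};

static unsigned long *shared_idle_span(struct shared_state *s)
{
	return s->idle_span;		/* same role as sds_idle_cpus() */
}

static struct shared_state *alloc_shared_state(void)
{
	/* sizeof(*s) plus space for the mask, zero-initialised */
	return calloc(1, sizeof(struct shared_state) +
			 MASK_LONGS * sizeof(unsigned long));
}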