Date: Fri, 26 Mar 2021 15:19:32 -0400
From: Rik van Riel
To: Mel Gorman
Cc: linux-kernel@vger.kernel.org, kernel-team@fb.com,
    "Peter Zijlstra (Intel)", Ingo Molnar, Vincent Guittot,
    Valentin Schneider
Subject: [PATCH v3] sched/fair: bring back select_idle_smt, but differently
Message-ID: <20210326151932.2c187840@imladris.surriel.com>
In-Reply-To: <20210322110306.GE3697@techsingularity.net>
References: <20210321150358.71ef52b1@imladris.surriel.com>
	<20210322110306.GE3697@techsingularity.net>

On Mon, 22 Mar 2021 11:03:06 +0000
Mel Gorman wrote:

> Second, select_idle_smt() does not use the cpus mask so consider moving
> the cpus initialisation after select_idle_smt() has been called.
> Specifically this initialisation
>
> 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> Alternatively, clear the bits in the SMT sibling scan to avoid checking
> the siblings twice. It's a tradeoff because initialising and clearing
> bits is not free and the cost is wasted if a sibling is free.

I tried a number of different variations on moving the CPU mask
initialization, and clearing CPUs from the mask, and failed to get any
clear results from those in testing, even in workloads with lots of
context switches.
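Purely as an illustration of the "clear the siblings from the candidate
mask" idea quoted above, such a variation could look roughly like the
fragment below inside select_idle_cpu(); the placement and the use of
"prev" here are assumptions of this sketch, and it is not part of the
patch further down:

	/*
	 * Hypothetical variant (not in this patch): check prev's SMT
	 * siblings first, then drop them from the candidate mask so the
	 * for_each_cpu_wrap() scan does not test them a second time.
	 */
	i = select_idle_smt(p, sd, prev);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;
	cpumask_andnot(cpus, cpus, cpu_smt_mask(prev));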
Below is a simple version that seems to perform identically to more
complicated versions :)

---8<---

sched,fair: bring back select_idle_smt, but differently

Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.

The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.

This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.

On our web serving workload, that effect is usually negligible.

It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.

The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.

This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.

With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.

Signed-off-by: Rik van Riel
---
 kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 52 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..69680158963f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6098,6 +6098,28 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	return -1;
 }
 
+/*
+ * Scan the local SMT mask for idle CPUs.
+ */
+static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	int cpu;
+
+	if (!static_branch_likely(&sched_smt_present))
+		return -1;
+
+	for_each_cpu(cpu, cpu_smt_mask(target)) {
+		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
+		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
+			continue;
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+			return cpu;
+	}
+
+	return -1;
+}
+
 #else /* CONFIG_SCHED_SMT */
 
 static inline void set_idle_cores(int cpu, int val)
@@ -6114,6 +6136,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
 	return __select_idle_cpu(core);
 }
 
+static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	return -1;
+}
+
 #endif /* CONFIG_SCHED_SMT */
 
 /*
@@ -6121,7 +6148,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -6136,23 +6163,32 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
-	if (sched_feat(SIS_PROP) && !smt) {
-		u64 avg_cost, avg_idle, span_avg;
+	if (!smt) {
+		if (cpus_share_cache(prev, target)) {
+			/* No idle core. Check if prev has an idle sibling. */
+			i = select_idle_smt(p, sd, prev);
+			if ((unsigned int)i < nr_cpumask_bits)
+				return i;
+		}
 
-		/*
-		 * Due to large variance we need a large fuzz factor;
-		 * hackbench in particularly is sensitive here.
-		 */
-		avg_idle = this_rq()->avg_idle / 512;
-		avg_cost = this_sd->avg_scan_cost + 1;
+		if (sched_feat(SIS_PROP)) {
+			u64 avg_cost, avg_idle, span_avg;
 
-		span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
-			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			/*
+			 * Due to large variance we need a large fuzz factor;
+			 * hackbench in particularly is sensitive here.
+			 */
+			avg_idle = this_rq()->avg_idle / 512;
+			avg_cost = this_sd->avg_scan_cost + 1;
 
-		time = cpu_clock(this);
+			span_avg = sd->span_weight * avg_idle;
+			if (span_avg > 4*avg_cost)
+				nr = div_u64(span_avg, avg_cost);
+			else
+				nr = 4;
+
+			time = cpu_clock(this);
+		}
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
@@ -6307,7 +6343,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_cpu(p, sd, prev, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-- 
2.25.4
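For readers skimming the diff, the resulting !smt path of
select_idle_cpu() after this patch reads roughly as follows; this is a
condensed restatement of the two hunks above for readability, not a
verbatim copy of the patched file:

	if (!smt) {
		/* No idle core was found (or SMT is not active). */
		if (cpus_share_cache(prev, target)) {
			/* Cheap first try: is prev's SMT sibling idle? */
			i = select_idle_smt(p, sd, prev);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;
		}

		/* Otherwise, do the SIS_PROP-limited scan as before. */
		if (sched_feat(SIS_PROP)) {
			/* nr is derived from avg_idle / avg_scan_cost, as before. */
		}
	}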