From: Mel Gorman <mgorman@techsingularity.net>
To: Ingo Molnar
Cc: Peter Zijlstra, Vincent Guittot, Valentin Schneider, Phil Auld,
    LKML, Mel Gorman
Subject: [PATCH 4/4] sched/fair: Track possibly overloaded domains and abort a scan if necessary
Date: Fri, 20 Mar 2020 15:12:45 +0000
Message-Id: <20200320151245.21152-5-mgorman@techsingularity.net>
In-Reply-To: <20200320151245.21152-1-mgorman@techsingularity.net>
References: <20200320151245.21152-1-mgorman@techsingularity.net>
X-Mailing-List: linux-kernel@vger.kernel.org

Once a domain is overloaded, it is very unlikely that a free CPU will be
found in the
short term, but there is still potentially a lot of scanning. This patch
tracks whether a domain may be overloaded due to an excessive number of
running tasks relative to the available CPUs. In the event a domain is
overloaded, a search is aborted.

This has a variable impact on performance for hackbench, which often
overloads the test machines used. There was a mix of performance gains
and losses, but there is a substantial impact on search efficiency.

On a 2-socket Broadwell machine with 80 cores in total, tbench showed
small gains and some losses

Hmean     1      431.51 (   0.00%)      426.53 *  -1.15%*
Hmean     2      842.69 (   0.00%)      839.00 *  -0.44%*
Hmean     4     1631.09 (   0.00%)     1634.81 *   0.23%*
Hmean     8     3001.08 (   0.00%)     3020.85 *   0.66%*
Hmean     16    5631.75 (   0.00%)     5655.04 *   0.41%*
Hmean     32    9736.22 (   0.00%)     9645.68 *  -0.93%*
Hmean     64   13978.54 (   0.00%)    15215.65 *   8.85%*
Hmean     128  20093.06 (   0.00%)    19389.45 *  -3.50%*
Hmean     256  17491.34 (   0.00%)    18616.32 *   6.43%*
Hmean     320  17423.67 (   0.00%)    17793.38 *   2.12%*

However, the "SIS Domain Search Efficiency" went from 6.03% to 19.61%,
indicating that far fewer CPUs were scanned. The impact of the patch is
more noticeable when sockets have multiple L3 caches. While true for
2nd-generation EPYC, it is particularly noticeable on 1st-generation EPYC

Hmean     1      325.30 (   0.00%)      324.92 *  -0.12%*
Hmean     2      630.77 (   0.00%)      621.35 *  -1.49%*
Hmean     4     1211.41 (   0.00%)     1148.51 *  -5.19%*
Hmean     8     2017.29 (   0.00%)     1953.57 *  -3.16%*
Hmean     16    4068.81 (   0.00%)     3514.06 * -13.63%*
Hmean     32    5588.20 (   0.00%)     6583.58 *  17.81%*
Hmean     64    8470.14 (   0.00%)    10117.26 *  19.45%*
Hmean     128  11462.06 (   0.00%)    17207.68 *  50.13%*
Hmean     256  11433.74 (   0.00%)    13446.93 *  17.61%*
Hmean     512  12576.88 (   0.00%)    13630.08 *   8.37%*

On this machine, search efficiency goes from 21.04% to 32.66%. There is
a noticeable problem at 16 clients, when there are enough clients for an
LLC domain to spill over.

With hackbench, the overload problem is a bit more obvious.
On the 2-socket Broadwell machine, using processes and pipes, we see

Amean     1       0.3023 (   0.00%)      0.2893 (   4.30%)
Amean     4       0.6823 (   0.00%)      0.6930 (  -1.56%)
Amean     7       1.0293 (   0.00%)      1.0380 (  -0.84%)
Amean     12      1.6913 (   0.00%)      1.7027 (  -0.67%)
Amean     21      2.9307 (   0.00%)      2.9297 (   0.03%)
Amean     30      4.0040 (   0.00%)      4.0270 (  -0.57%)
Amean     48      6.0703 (   0.00%)      6.1067 (  -0.60%)
Amean     79      9.0630 (   0.00%)      9.1223 *  -0.65%*
Amean     110    12.1917 (   0.00%)     12.1693 (   0.18%)
Amean     141    15.7150 (   0.00%)     15.4187 (   1.89%)
Amean     172    19.5327 (   0.00%)     18.9937 (   2.76%)
Amean     203    23.3093 (   0.00%)     22.2497 *   4.55%*
Amean     234    27.8657 (   0.00%)     25.9627 *   6.83%*
Amean     265    32.9783 (   0.00%)     29.5240 *  10.47%*
Amean     296    35.6727 (   0.00%)     32.8260 *   7.98%*

More of the SIS stats are worth looking at in this case

Ops SIS Domain Search       10390526707.00      9822163508.00
Ops SIS Scanned            223173467577.00     48330226094.00
Ops SIS Domain Scanned     222820381314.00     47964114165.00
Ops SIS Failures            10183794873.00      9639912418.00
Ops SIS Recent Used Hit         22194515.00        22517194.00
Ops SIS Recent Used Miss      5733847634.00      5500415074.00
Ops SIS Recent Attempts       5756042149.00      5522932268.00
Ops SIS Search Efficiency             4.81             21.08

Search efficiency goes from 4.66% to 20.48%, but SIS Domain Scanned
shows the sheer volume of searching SIS does when the prev, target and
recent CPUs are unavailable. This could be much more aggressive by also
cutting off a search for idle cores. However, making that work properly
requires a much more intrusive series that is likely to be controversial.
This seemed like a reasonable tradeoff to tackle the most obvious problem
with select_idle_cpu.
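As an aside, the core heuristic is small enough to sketch in plain C
outside the kernel: a domain is marked overloaded when a completed scan
observed more than two running tasks per CPU scanned, and the mark is
cleared (allowing scans again) once the target CPU, or a cache-sharing
prev CPU, drops to at most one running task. The `toy_` names below are
illustrative stand-ins for the kernel structures, not the patch itself:

```c
#include <stdbool.h>

/* Stand-in for the shared per-LLC state (sched_domain_shared). */
struct toy_domain {
	int is_overloaded;
};

/*
 * Called when a scan completed without finding an idle CPU: mark the
 * domain overloaded if more than 2 tasks were running per CPU scanned,
 * i.e. nr_running > nr_scanned << 1.
 */
void toy_scan_finished(struct toy_domain *sd, int nr_scanned, int nr_running)
{
	if (nr_scanned && nr_running > (nr_scanned << 1))
		sd->is_overloaded = 1;
}

/*
 * Should a new scan be aborted? If the target (or a prev CPU sharing
 * the same LLC) is down to 0 or 1 running tasks, the domain is likely
 * no longer overloaded, so clear the flag and let the scan proceed.
 */
bool toy_abort_scan(struct toy_domain *sd, int target_nr_running,
		    int prev_nr_running, bool prev_shares_cache)
{
	if (!sd->is_overloaded)
		return false;

	if (target_nr_running <= 1 ||
	    (prev_shares_cache && prev_nr_running <= 1)) {
		sd->is_overloaded = 0;
		return false;
	}

	return true;
}
```

For example, a scan of 8 CPUs that saw 20 running tasks would set the
flag and subsequent scans would be skipped until target or prev becomes
nearly idle; a scan that saw exactly 16 tasks (2 per CPU) would not,
because the threshold is strictly greater than twice the CPUs scanned.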
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 65 +++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h        |  3 ++
 3 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index af9319e4cfb9..76ec7a54f57b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -66,6 +66,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		is_overloaded;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41913fac68de..31e011e627db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5924,6 +5924,38 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
+static inline void
+set_sd_overloaded(struct sched_domain_shared *sds, int val)
+{
+	if (!sds)
+		return;
+
+	WRITE_ONCE(sds->is_overloaded, val);
+}
+
+static inline bool test_sd_overloaded(struct sched_domain_shared *sds)
+{
+	return READ_ONCE(sds->is_overloaded);
+}
+
+/* Returns true if a previously overloaded domain is likely still overloaded. */
+static inline bool
+abort_sd_overloaded(struct sched_domain_shared *sds, int prev, int target)
+{
+	if (!sds || !test_sd_overloaded(sds))
+		return false;
+
+	/* Is either target or a suitable prev running 1 or 0 tasks? */
+	if (cpu_rq(target)->nr_running <= 1 ||
+	    (prev != target && cpus_share_cache(prev, target) &&
+	     cpu_rq(prev)->nr_running <= 1)) {
+		set_sd_overloaded(sds, 0);
+		return false;
+	}
+
+	return true;
+}
+
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 EXPORT_SYMBOL_GPL(sched_smt_present);
@@ -6060,15 +6092,18 @@ static inline int select_idle_smt(struct task_struct *p, int target)
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
+			   int prev, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 	struct sched_domain *this_sd;
+	struct sched_domain_shared *sds;
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
 	int this = smp_processor_id();
 	int cpu, nr = INT_MAX;
+	int nr_scanned = 0, nr_running = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6092,18 +6127,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			nr = 4;
 	}
 
+	sds = rcu_dereference(per_cpu(sd_llc_shared, target));
+	if (sched_feat(SIS_OVERLOAD)) {
+		if (abort_sd_overloaded(sds, prev, target))
+			return -1;
+	}
+
 	time = cpu_clock(this);
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
 		schedstat_inc(this_rq()->sis_scanned);
-		if (!--nr)
-			return -1;
+		if (!--nr) {
+			cpu = -1;
+			break;
+		}
 		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
 			break;
+		if (sched_feat(SIS_OVERLOAD)) {
+			nr_scanned++;
+			nr_running += cpu_rq(cpu)->nr_running;
+		}
 	}
 
+	/* Check if domain should be marked overloaded if no cpu was found. */
+	if (sched_feat(SIS_OVERLOAD) && (signed)cpu >= nr_cpumask_bits &&
+	    nr_scanned && nr_running > (nr_scanned << 1)) {
+		set_sd_overloaded(sds, 1);
+	}
+
+	/* Scan cost not accounted for if scan is throttled */
+	if (!nr)
+		return -1;
+
 	time = cpu_clock(this) - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
@@ -6236,7 +6293,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
-	i = select_idle_cpu(p, sd, target);
+	i = select_idle_cpu(p, sd, prev, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7481cd96f391..c36ae01910e2 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,9 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
+/* Limit scans if the domain is likely overloaded */
+SCHED_FEAT(SIS_OVERLOAD, true)
+
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.16.4