Subject: Re: sysbench throughput degradation in 4.13+
From: Eric Farman
To: Peter Zijlstra
Cc: Rik van Riel, 王金浦, LKML, Ingo Molnar, Christian Borntraeger,
 "KVM-ML (kvm@vger.kernel.org)", vcaputo@pengaru.com, Matthew Rosato
Date: Wed, 27 Sep 2017 12:27:50 -0400
References: <95edafb1-5e9d-8461-db73-bcb002b7ebef@linux.vnet.ibm.com>
 <50a279d3-84eb-3403-f2f0-854934778037@linux.vnet.ibm.com>
 <20170922155348.zujigkn3o5eylctn@hirez.programming.kicks-ass.net>
 <754f5a9f-5332-148d-2631-918fc7a7cfe9@linux.vnet.ibm.com>
 <20170927093530.s3sgdz2vamc5ka4w@hirez.programming.kicks-ass.net>
In-Reply-To: <20170927093530.s3sgdz2vamc5ka4w@hirez.programming.kicks-ass.net>

On 09/27/2017 05:35 AM, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>
>> MySQL.  We've tried a few different configs with both test=oltp and
>> test=threads, but both show the same behavior.  What I have settled on for
>> my repro is the following:
>>
>
> Right, didn't even need to run it in a guest to observe a regression.
>
> So the below cures native sysbench and NAS bench for me, does it also
> work for your virt thingy?
>

Ran a quick test this morning with 4.13.0 + 90001d67be2f + a731ebe6f17b
and then with/without this patch.  An oltp sysbench run shows that guest
cpu migrations decreased significantly, from ~27K to ~2K over 5 seconds.

So, we applied this patch to linux-next (next-20170926) and ran it
against a couple of sysbench tests:

--test=oltp
  Baseline:  5655.26 transactions/second
  Patched:   9618.13 transactions/second

--test=threads
  Baseline:  25482.9 events/sec
  Patched:   29577.9 events/sec

That's good!  With that...

Tested-by: Eric Farman

Thanks!

 - Eric

>
> PRE (current tip/master):
>
> ivb-ex sysbench:
>
>  2: [30 secs] transactions: 64110  (2136.94 per sec.)
>  5: [30 secs] transactions: 143644 (4787.99 per sec.)
> 10: [30 secs] transactions: 274298 (9142.93 per sec.)
> 20: [30 secs] transactions: 418683 (13955.45 per sec.)
> 40: [30 secs] transactions: 320731 (10690.15 per sec.)
> 80: [30 secs] transactions: 355096 (11834.28 per sec.)
>
> hsw-ex NAS:
>
> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
> lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
> lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
> lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
>
>
> POST (+patch):
>
> ivb-ex sysbench:
>
>  2: [30 secs] transactions: 64494  (2149.75 per sec.)
>  5: [30 secs] transactions: 145114 (4836.99 per sec.)
> 10: [30 secs] transactions: 278311 (9276.69 per sec.)
> 20: [30 secs] transactions: 437169 (14571.60 per sec.)
> 40: [30 secs] transactions: 669837 (22326.73 per sec.)
> 80: [30 secs] transactions: 631739 (21055.88 per sec.)
>
> hsw-ex NAS:
>
> lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
> lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
> lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
>
>
> This patch takes out all the shiny wake_affine stuff and goes back to
> utter basics.  Rik, was there another NUMA benchmark that wanted your
> fancy stuff?  Because NAS isn't it.
>
> (the previous, slightly fancier wake_affine was basically a !idle
> extension of this, in that it would pick the 'shortest' of the 2 queues
> and thereby run quickest, in approximation)
>
> I'll try and run a number of other benchmarks I have around to see if
> there's anything that shows a difference between the below trivial
> wake_affine and the old 2-cpu-load one.
>
> ---
>  include/linux/sched/topology.h |   8 ---
>  kernel/sched/fair.c            | 125 ++---------------------------------------
>  2 files changed, 6 insertions(+), 127 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index d7b6dab956ec..7d065abc7a47 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -71,14 +71,6 @@ struct sched_domain_shared {
>          atomic_t ref;
>          atomic_t nr_busy_cpus;
>          int has_idle_cores;
> -
> -        /*
> -         * Some variables from the most recent sd_lb_stats for this domain,
> -         * used by wake_affine().
> -         */
> -        unsigned long nr_running;
> -        unsigned long load;
> -        unsigned long capacity;
>  };
>
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 70ba32e08a23..66930ce338af 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5356,115 +5356,19 @@ static int wake_wide(struct task_struct *p)
>          return 1;
>  }
>
> -struct llc_stats {
> -        unsigned long nr_running;
> -        unsigned long load;
> -        unsigned long capacity;
> -        int has_capacity;
> -};
> -
> -static bool get_llc_stats(struct llc_stats *stats, int cpu)
> -{
> -        struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> -
> -        if (!sds)
> -                return false;
> -
> -        stats->nr_running = READ_ONCE(sds->nr_running);
> -        stats->load = READ_ONCE(sds->load);
> -        stats->capacity = READ_ONCE(sds->capacity);
> -        stats->has_capacity = stats->nr_running < per_cpu(sd_llc_size, cpu);
> -
> -        return true;
> -}
> -
> -/*
> - * Can a task be moved from prev_cpu to this_cpu without causing a load
> - * imbalance that would trigger the load balancer?
> - *
> - * Since we're running on 'stale' values, we might in fact create an imbalance
> - * but recomputing these values is expensive, as that'd mean iteration 2 cache
> - * domains worth of CPUs.
> - */
> -static bool
> -wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
> -                int this_cpu, int prev_cpu, int sync)
> -{
> -        struct llc_stats prev_stats, this_stats;
> -        s64 this_eff_load, prev_eff_load;
> -        unsigned long task_load;
> -
> -        if (!get_llc_stats(&prev_stats, prev_cpu) ||
> -            !get_llc_stats(&this_stats, this_cpu))
> -                return false;
> -
> -        /*
> -         * If sync wakeup then subtract the (maximum possible)
> -         * effect of the currently running task from the load
> -         * of the current LLC.
> -         */
> -        if (sync) {
> -                unsigned long current_load = task_h_load(current);
> -
> -                /* in this case load hits 0 and this LLC is considered 'idle' */
> -                if (current_load > this_stats.load)
> -                        return true;
> -
> -                this_stats.load -= current_load;
> -        }
> -
> -        /*
> -         * The has_capacity stuff is not SMT aware, but by trying to balance
> -         * the nr_running on both ends we try and fill the domain at equal
> -         * rates, thereby first consuming cores before siblings.
> -         */
> -
> -        /* if the old cache has capacity, stay there */
> -        if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
> -                return false;
> -
> -        /* if this cache has capacity, come here */
> -        if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -                return true;
> -
> -        /*
> -         * Check to see if we can move the load without causing too much
> -         * imbalance.
> -         */
> -        task_load = task_h_load(p);
> -
> -        this_eff_load = 100;
> -        this_eff_load *= prev_stats.capacity;
> -
> -        prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
> -        prev_eff_load *= this_stats.capacity;
> -
> -        this_eff_load *= this_stats.load + task_load;
> -        prev_eff_load *= prev_stats.load - task_load;
> -
> -        return this_eff_load <= prev_eff_load;
> -}
> -
>  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>                         int prev_cpu, int sync)
>  {
>          int this_cpu = smp_processor_id();
> -        bool affine;
> -
> -        /*
> -         * Default to no affine wakeups; wake_affine() should not effect a task
> -         * placement the load-balancer feels inclined to undo. The conservative
> -         * option is therefore to not move tasks when they wake up.
> -         */
> -        affine = false;
> +        bool affine = false;
>
>          /*
> -         * If the wakeup is across cache domains, try to evaluate if movement
> -         * makes sense, otherwise rely on select_idle_siblings() to do
> -         * placement inside the cache domain.
> +         * If we can run _now_ on the waking CPU, go there, otherwise meh.
>           */
> -        if (!cpus_share_cache(prev_cpu, this_cpu))
> -                affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
> +        if (idle_cpu(this_cpu))
> +                affine = true;
> +        else if (sync && cpu_rq(this_cpu)->nr_running == 1)
> +                affine = true;
>
>          schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
>          if (affine) {
> @@ -7600,7 +7504,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
>   */
>  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
>  {
> -        struct sched_domain_shared *shared = env->sd->shared;
>          struct sched_domain *child = env->sd->child;
>          struct sched_group *sg = env->sd->groups;
>          struct sg_lb_stats *local = &sds->local_stat;
> @@ -7672,22 +7575,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>                  if (env->dst_rq->rd->overload != overload)
>                          env->dst_rq->rd->overload = overload;
>          }
> -
> -        if (!shared)
> -                return;
> -
> -        /*
> -         * Since these are sums over groups they can contain some CPUs
> -         * multiple times for the NUMA domains.
> -         *
> -         * Currently only wake_affine_llc() and find_busiest_group()
> -         * uses these numbers, only the last is affected by this problem.
> -         *
> -         * XXX fix that.
> -         */
> -        WRITE_ONCE(shared->nr_running, sds->total_running);
> -        WRITE_ONCE(shared->load, sds->total_load);
> -        WRITE_ONCE(shared->capacity, sds->total_capacity);
>  }
>
>  /**
>
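
For readers skimming the patch: the replacement wake_affine() reduces to a
two-condition check.  The short userspace C sketch below models that decision
outside the kernel; it is only an illustration with made-up parameter names
(cpu_is_idle, sync_wakeup and nr_running stand in for idle_cpu(this_cpu), the
sync flag and cpu_rq(this_cpu)->nr_running), not the kernel implementation
itself.

/*
 * Standalone model of the simplified wake_affine() heuristic discussed
 * above.  Illustrative userspace code only; the names are stand-ins for
 * the kernel-side quantities and nothing here touches real run queues.
 */
#include <stdbool.h>
#include <stdio.h>

/* Pull the wakee to the waking CPU only if it can run there right away. */
static bool wake_affine_model(bool cpu_is_idle, bool sync_wakeup,
                              unsigned int nr_running)
{
        if (cpu_is_idle)
                return true;    /* waking CPU has nothing to do: go there */

        /*
         * Sync wakeup: the waker is about to sleep, so if it is the only
         * task queued on the waking CPU, the wakee can take its place.
         */
        if (sync_wakeup && nr_running == 1)
                return true;

        return false;           /* otherwise leave placement alone ("meh") */
}

int main(void)
{
        printf("idle cpu           -> %d\n", wake_affine_model(true,  false, 3));
        printf("sync, 1 running    -> %d\n", wake_affine_model(false, true,  1));
        printf("busy, async wakeup -> %d\n", wake_affine_model(false, false, 2));
        return 0;
}

Built with any C compiler (e.g. gcc -o wake_affine_model wake_affine_model.c),
the three sample calls print 1, 1 and 0: pull when the waking CPU is idle,
pull on a sync wakeup whose waker is the only queued task, and otherwise
leave placement to the regular path.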