Date: Sun, 07 Apr 2013 11:09:09 +0800
From: Michael Wang
To: Alex Shi
Subject: Re: [patch v3 0/8] sched: use runnable avg in load balance

On 04/03/2013 04:46 PM, Alex Shi wrote:
> On 04/02/2013 03:23 PM, Michael Wang wrote:
>> | 15 GB | 12      | 45393 |         | 43986 |
>> | 15 GB | 16      | 45110 |         | 45719 |
>> | 15 GB | 24      | 41415 |         | 36813 | -11.11%
>> | 15 GB | 32      | 35988 |         | 34025 |
>>
>> The reason may be caused by wake_affine()'s higher overhead, and
>> pgbench is really sensitive to this stuff...
>
> Michael:
> I changed the threshold to 0.1ms and it has the same effect on aim7.
> So could you try the following on pgbench?

Here are the results for different thresholds; there is too much data, so
I will just list the 22 MB / 32 clients item:

	threshold(us)	tps
	base		43420
	500		40694
	250		40591
	120		41468	-4.50%
	60		47785	+10.05%
	30		51389
	15		52844
	6		54539
	3		52674
	1.5		52885

Since the inflection point seems to lie between 120us and 60us, I made
more tests in this range:

	threshold(us)	tps
	base		43420
	110		41772
	100		42246
	90		43471	0%
	80		44920
	70		46341

According to these data, 90us == 90000ns is the inflection point on my box
for the 22 MB / 32 clients item; other test items show the inflection
floating a little, so 80~90us is the conclusion.

Now the concern is how to deal with this issue: the results may change on
different deployments, so a static value is not acceptable. Do we need yet
another new knob here?

I'm not sure whether you have taken a look at the wake-affine throttle
patch I sent a few weeks ago; its purpose is to throttle wake-affine so
that it does not run too frequently.

And since the aim7 problem is caused by the imbalance which is the
side effect of wake-affine succeeding too frequently, maybe the throttle
patch could help to address that issue too; if so, then we only need to
add one new knob.
BTW, the benefit your patch set brings does not conflict with the
wake-affine throttle patch: with your patch set applied, the 1ms throttle
now also shows a 25% improvement (it used to be <5%), and it also
increases the maximum benefit from 40% to 45% on my box ;-)

Regards,
Michael Wang

>
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index bf8086b..a3c3d43 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -53,6 +53,7 @@ extern unsigned int sysctl_numa_balancing_settle_count;
>
>  #ifdef CONFIG_SCHED_DEBUG
>  extern unsigned int sysctl_sched_migration_cost;
> +extern unsigned int sysctl_sched_burst_threshold;
>  extern unsigned int sysctl_sched_nr_migrate;
>  extern unsigned int sysctl_sched_time_avg;
>  extern unsigned int sysctl_timer_migration;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dbaa8ca..dd5a324 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -91,6 +91,7 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
>  unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
>
>  const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
> +const_debug unsigned int sysctl_sched_burst_threshold = 100000UL;
>
>  /*
>   * The exponential sliding window over which load is averaged for shares
> @@ -3103,12 +3104,24 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	unsigned long weight;
>  	int balanced;
>  	int runnable_avg;
> +	int burst = 0;
>
>  	idx = sd->wake_idx;
>  	this_cpu = smp_processor_id();
>  	prev_cpu = task_cpu(p);
> -	load = source_load(prev_cpu, idx);
> -	this_load = target_load(this_cpu, idx);
> +
> +	if (cpu_rq(this_cpu)->avg_idle < sysctl_sched_burst_threshold ||
> +	    cpu_rq(prev_cpu)->avg_idle < sysctl_sched_burst_threshold)
> +		burst = 1;
> +
> +	/* use instant load for bursty waking up */
> +	if (!burst) {
> +		load = source_load(prev_cpu, idx);
> +		this_load = target_load(this_cpu, idx);
> +	} else {
> +		load = cpu_rq(prev_cpu)->load.weight;
> +		this_load = cpu_rq(this_cpu)->load.weight;
> +	}
>
>  	/*
>  	 * If sync wakeup then subtract the (maximum possible)
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afc1dc6..1f23457 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -327,6 +327,13 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.procname	= "sched_burst_threshold_ns",
> +		.data		= &sysctl_sched_burst_threshold,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
>  		.procname	= "sched_nr_migrate",
>  		.data		= &sysctl_sched_nr_migrate,
>  		.maxlen		= sizeof(unsigned int),
> --
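For reference, on a kernel with the patch above applied (and CONFIG_SCHED_DEBUG enabled, since the sysctl lives there), the new knob could be tuned at runtime like the other sched_* sysctls; the 90000 value is just the 90us inflection point measured above:

```shell
# Tune the (patched-in) burst threshold at runtime; requires root and
# a kernel carrying the patch, otherwise the sysctl does not exist.
sysctl -w kernel.sched_burst_threshold_ns=90000

# Equivalent via procfs:
echo 90000 > /proc/sys/kernel/sched_burst_threshold_ns

# Read back the current value:
cat /proc/sys/kernel/sched_burst_threshold_ns
```

Since the inflection point moves with the workload and the box, exposing it this way at least lets each deployment find its own value instead of hard-coding one.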