Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753443Ab3EMC2C (ORCPT ); Sun, 12 May 2013 22:28:02 -0400 Received: from e28smtp05.in.ibm.com ([122.248.162.5]:36721 "EHLO e28smtp05.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753078Ab3EMC2A (ORCPT ); Sun, 12 May 2013 22:28:00 -0400 Message-ID: <51904FA6.50108@linux.vnet.ibm.com> Date: Mon, 13 May 2013 10:27:50 +0800 From: Michael Wang User-Agent: Mozilla/5.0 (X11; Linux i686; rv:16.0) Gecko/20121011 Thunderbird/16.0.1 MIME-Version: 1.0 To: LKML , Ingo Molnar , Peter Zijlstra , Peter Zijlstra CC: Mike Galbraith , Alex Shi , Namhyung Kim , Paul Turner , Andrew Morton , "Nikunj A. Dadhania" , Ram Pai Subject: Re: [PATCH] sched: wake-affine throttle References: <5164DCE7.8080906@linux.vnet.ibm.com> <51833302.6090208@linux.vnet.ibm.com> <51886AFF.4090500@linux.vnet.ibm.com> In-Reply-To: <51886AFF.4090500@linux.vnet.ibm.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-TM-AS-MML: No X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13051302-8256-0000-0000-0000077853B2 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8668 Lines: 224 On 05/07/2013 10:46 AM, Michael Wang wrote: > On 05/03/2013 11:46 AM, Michael Wang wrote: >> On 04/10/2013 11:30 AM, Michael Wang wrote: >> >> Now long time has been passed since the first version, I'd like to >> do some summary about current states: >> >> On a 12 cpu box with tip 3.9.0-rc7, test show that: >> >> 1. remove wake-affine stuff cause regression on hackbench (could be 15%). >> 2. reserve wake-affine stuff cause regression on pgbench (could be 45%). >> >> And since neither remove or reserve is acceptable, this patch introduced a >> knob to throttle wake-affine stuff. >> >> By turning the knob, we could benefit both hackbench and pgbench, or sacrifice >> one to significantly benefit another. >> >> All similar workload which prefer or hate wake-affine behaviour could >> been cared. >> >> If this approach caused any concerns, please let me know ;-) > > Is there any more concerns? or should we take some action now? Seems like folks has no more concerns :) Peter, would you like to pick the patch? or please give some comments may be? Regards, Michael Wang > > Regards, > Michael Wang > >> >> Regards, >> Michael Wang >> >>> >>> Recently testing show that wake-affine stuff cause regression on pgbench, the >>> hiding rat was finally catched out. >>> >>> wake-affine stuff is always trying to pull wakee close to waker, by theory, >>> this will benefit us if waker's cpu cached hot data for wakee, or the extreme >>> ping-pong case. >>> >>> However, the whole stuff is somewhat blindly, load balance is the only factor >>> to be guaranteed, and since the stuff itself is time-consuming, some workload >>> suffered, pgbench is just the one who has been found. >>> >>> Thus, throttle the wake-affine stuff for such workload is necessary. >>> >>> This patch introduced a new knob 'sysctl_sched_wake_affine_interval' with the >>> default value 1ms (default minimum balance interval), which means wake-affine >>> will keep silent for 1ms after it returned false. >>> >>> By turning the new knob, those workload who suffered will have the chance to >>> stop the regression. >>> >>> Test: >>> Test with 12 cpu X86 server and tip 3.9.0-rc2. >>> >>> default >>> base 1ms interval 10ms interval 100ms interval >>> | db_size | clients | tps | | tps | | tps | | tps | >>> +---------+---------+-------+- +-------+ +-------+ +-------+ >>> | 21 MB | 1 | 10572 | | 10804 | | 10802 | | 10801 | >>> | 21 MB | 2 | 21275 | | 21533 | | 21400 | | 21498 | >>> | 21 MB | 4 | 41866 | | 42158 | | 42410 | | 42306 | >>> | 21 MB | 8 | 53931 | | 55796 | | 58608 | +8.67% | 59916 | +11.10% >>> | 21 MB | 12 | 50956 | | 52266 | | 54586 | +7.12% | 55982 | +9.86% >>> | 21 MB | 16 | 49911 | | 52862 | +5.91% | 55668 | +11.53% | 57255 | +14.71% >>> | 21 MB | 24 | 46046 | | 48820 | +6.02% | 54269 | +17.86% | 58113 | +26.21% >>> | 21 MB | 32 | 43405 | | 46635 | +7.44% | 53690 | +23.70% | 57729 | +33.00% >>> | 7483 MB | 1 | 7734 | | 8013 | | 8046 | | 7879 | >>> | 7483 MB | 2 | 19375 | | 19459 | | 19448 | | 19421 | >>> | 7483 MB | 4 | 37408 | | 37780 | | 37937 | | 37819 | >>> | 7483 MB | 8 | 49033 | | 50389 | | 51636 | +5.31% | 52294 | +6.65% >>> | 7483 MB | 12 | 45525 | | 47794 | +4.98% | 49828 | +9.45% | 50571 | +11.08% >>> | 7483 MB | 16 | 45731 | | 47921 | +4.79% | 50203 | +9.78% | 52033 | +13.78% >>> | 7483 MB | 24 | 41533 | | 44301 | +6.67% | 49697 | +19.66% | 53833 | +29.62% >>> | 7483 MB | 32 | 36370 | | 38301 | +5.31% | 48146 | +32.38% | 52795 | +45.16% >>> | 15 GB | 1 | 7576 | | 7926 | | 7722 | | 7969 | >>> | 15 GB | 2 | 19157 | | 19284 | | 19294 | | 19304 | >>> | 15 GB | 4 | 37285 | | 37539 | | 37281 | | 37508 | >>> | 15 GB | 8 | 48718 | | 49176 | | 50836 | +4.35% | 51239 | +5.17% >>> | 15 GB | 12 | 45167 | | 47180 | +4.45% | 49206 | +8.94% | 50126 | +10.98% >>> | 15 GB | 16 | 45270 | | 47293 | +4.47% | 49638 | +9.65% | 51748 | +14.31% >>> | 15 GB | 24 | 40984 | | 43366 | +5.81% | 49356 | +20.43% | 53157 | +29.70% >>> | 15 GB | 32 | 35918 | | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45% >>> >>> Suggested-by: Peter Zijlstra >>> Signed-off-by: Michael Wang >>> --- >>> include/linux/sched.h | 5 +++++ >>> kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++ >>> kernel/sysctl.c | 10 ++++++++++ >>> 3 files changed, 46 insertions(+), 0 deletions(-) >>> >>> diff --git a/include/linux/sched.h b/include/linux/sched.h >>> index d35d2b6..e9efd3a 100644 >>> --- a/include/linux/sched.h >>> +++ b/include/linux/sched.h >>> @@ -1197,6 +1197,10 @@ enum perf_event_task_context { >>> perf_nr_task_contexts, >>> }; >>> >>> +#ifdef CONFIG_SMP >>> +extern unsigned int sysctl_sched_wake_affine_interval; >>> +#endif >>> + >>> struct task_struct { >>> volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ >>> void *stack; >>> @@ -1207,6 +1211,7 @@ struct task_struct { >>> #ifdef CONFIG_SMP >>> struct llist_node wake_entry; >>> int on_cpu; >>> + unsigned long next_wake_affine; >>> #endif >>> int on_rq; >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 7a33e59..68eedd7 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -3087,6 +3087,22 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu, >>> >>> #endif >>> >>> +/* >>> + * Default is 1ms, to prevent the wake_affine() stuff working too frequently. >>> + */ >>> +unsigned int sysctl_sched_wake_affine_interval = 1U; >>> + >>> +static inline int wake_affine_throttled(struct task_struct *p) >>> +{ >>> + return time_before(jiffies, p->next_wake_affine); >>> +} >>> + >>> +static inline void wake_affine_throttle(struct task_struct *p) >>> +{ >>> + p->next_wake_affine = jiffies + >>> + msecs_to_jiffies(sysctl_sched_wake_affine_interval); >>> +} >>> + >>> static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) >>> { >>> s64 this_load, load; >>> @@ -3096,6 +3112,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) >>> unsigned long weight; >>> int balanced; >>> >>> + if (wake_affine_throttled(p)) >>> + return 0; >>> + >>> idx = sd->wake_idx; >>> this_cpu = smp_processor_id(); >>> prev_cpu = task_cpu(p); >>> @@ -3167,6 +3186,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) >>> >>> return 1; >>> } >>> + >>> + /* >>> + * wake_affine() stuff try to pull wakee to the cpu >>> + * around waker, this will benefit us if the data >>> + * cached on waker cpu is hot for wakee, or the extreme >>> + * ping-pong case. >>> + * >>> + * However, do such blindly work too frequently will >>> + * cause regression to some workload, thus, each time >>> + * when wake_affine() failed, throttle it for a while. >>> + */ >>> + wake_affine_throttle(p); >>> return 0; >>> } >>> >>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c >>> index afc1dc6..6ebfc18 100644 >>> --- a/kernel/sysctl.c >>> +++ b/kernel/sysctl.c >>> @@ -437,6 +437,16 @@ static struct ctl_table kern_table[] = { >>> .extra1 = &one, >>> }, >>> #endif >>> +#ifdef CONFIG_SMP >>> + { >>> + .procname = "sched_wake_affine_interval", >>> + .data = &sysctl_sched_wake_affine_interval, >>> + .maxlen = sizeof(unsigned int), >>> + .mode = 0644, >>> + .proc_handler = proc_dointvec_minmax, >>> + .extra1 = &zero, >>> + }, >>> +#endif >>> #ifdef CONFIG_PROVE_LOCKING >>> { >>> .procname = "prove_locking", >>> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/