Date: Mon, 30 Nov 2015 11:22:40 +0100
From: Peter Zijlstra
To: Waiman Long
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Scott J Norton,
    Douglas Hatch, Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for
    readers and writers of load_avg

Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick, so there are usually
> more readers than writers of load_avg. Still, on a large system, the
> cacheline contention can cause a significant slowdown and impact
> performance.
>
> This patch separates the load_avg readers (update_cfs_shares) and
> writers (task_tick_fair) onto different cachelines. Writers of
> load_avg now accumulate the load delta into load_avg_delta, which
> sits in a different cacheline. If load_avg_delta grows sufficiently
> large (> load_avg/64), it is then folded back into load_avg.
>
> Running a Java benchmark on a 16-socket IvyBridge-EX system (240
> cores, 480 threads), the perf profile before the patch was:
>
>   9.44%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>   8.74%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>   7.83%  0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>   7.74%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>   7.27%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>   5.94%  1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>   4.15%  3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
>
> After the patch, it became:
>
>   2.94%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>   2.52%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>   2.25%  0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
>   2.21%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>   1.70%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>   0.96%  0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
>   0.61%  0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares

This begs the question though: why are you running a global load in a
cgroup, and do we really need to update this for the root cgroup?

It seems to me we don't need calc_tg_weight() for the root cgroup; it
doesn't need to normalize its weight numbers.

That is, isn't this simply a problem we should avoid?

> The benchmark results before and after the patch were as follows:
>
>   Before patch - Max-jOPs: 916011   Critical-jOps: 142366
>   After patch  - Max-jOPs: 939130   Critical-jOps: 211937
>
> There was a significant improvement in Critical-jOps, which is
> latency-sensitive.
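For readers following along, the batching scheme described above boils
down to roughly the following. This is a stand-alone user-space mock-up
with C11 atomics, not the kernel code; all names are made up and the
64-byte cacheline size is an assumption:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Keep the read-mostly and write-mostly counters on separate
     * (assumed 64-byte) cachelines, mimicking ____cacheline_aligned.
     */
    static _Alignas(64) atomic_long load_avg;       /* read-mostly  */
    static _Alignas(64) atomic_long load_avg_delta; /* write-mostly */

    static void add_load_delta(long delta)
    {
            long tot_delta, avg;

            /* Writers only bounce the load_avg_delta cacheline. */
            tot_delta = atomic_fetch_add(&load_avg_delta, delta) + delta;
            avg = atomic_load(&load_avg);

            /* Fold the batch back only once it exceeds load_avg/64. */
            if (labs(tot_delta) > avg / 64) {
                    tot_delta = atomic_exchange(&load_avg_delta, 0);
                    if (tot_delta)
                            atomic_fetch_add(&load_avg, tot_delta);
            }
    }

    int main(void)
    {
            atomic_store(&load_avg, 6400);
            for (int i = 0; i < 1000; i++)
                    add_load_delta(1);      /* small deltas get batched */
            printf("load_avg=%ld delta=%ld\n",
                   (long)atomic_load(&load_avg),
                   (long)atomic_load(&load_avg_delta));
            return 0;
    }

With load_avg at 6400, readers of load_avg only see an update roughly
once per 100 small deltas; the other 99 writes touch just the
load_avg_delta line.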
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on 4-socket or larger systems,
> which can get the most benefit from it.

So I'm not particularly charmed by this; it rather makes a mess of
things.

Also, this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

> Signed-off-by: Waiman Long
> ---
>  kernel/sched/core.c  |    9 +++++++++
>  kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |    8 ++++++++
>  3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>  		ptr += nr_cpu_ids * sizeof(void **);
>
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Use load_avg_delta if not 2P or less
> +		 */
> +		root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>  	if (!alloc_rt_sched_group(tg, parent))
>  		goto err;
>
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> +	tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
>  	return tg;
>
>  err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>   * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead, to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read-mostly whereas
> + * load_avg_delta will be write-mostly.
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
>  	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> -		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		struct task_group *tg = cfs_rq->tg;
> +		long load_avg, tot_delta;
> +
> +		if (!tg->use_la_delta) {
> +			/*
> +			 * If use_la_delta isn't set, just add the
> +			 * delta directly into load_avg.
> +			 */
> +			atomic_long_add(delta, &tg->load_avg);
> +			goto set_contrib;
> +		}
> +
> +		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> +		load_avg = atomic_long_read(&tg->load_avg);
> +		if (abs(tot_delta) > load_avg / 64) {
> +			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> +			if (tot_delta)
> +				atomic_long_add(tot_delta, &tg->load_avg);
> +		}
> +set_contrib:
>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>  	}
>  }

I'm thinking that it's now far too big to retain the inline qualifier.
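FWIW, pulling the batching out of line would be trivial; a rough,
untested sketch of that (the __update_tg_load_avg_delta name is made
up, and this is a restructuring of the above, not something the patch
contains):

    /* Out-of-line slow path: batch the delta, fold back when large. */
    static void __update_tg_load_avg_delta(struct task_group *tg, long delta)
    {
            long tot_delta, load_avg;

            tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
            load_avg = atomic_long_read(&tg->load_avg);
            if (abs(tot_delta) > load_avg / 64) {
                    tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
                    if (tot_delta)
                            atomic_long_add(tot_delta, &tg->load_avg);
            }
    }

    /* The wrapper stays small enough to remain inline. */
    static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
    {
            long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

            if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                    struct task_group *tg = cfs_rq->tg;

                    if (!tg->use_la_delta)
                            atomic_long_add(delta, &tg->load_avg);
                    else
                            __update_tg_load_avg_delta(tg, delta);
                    cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
            }
    }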
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
>  	 * load_avg can be heavily contended at clock tick time, so put
>  	 * it in its own cacheline separated from the fields above which
>  	 * will also be accessed at each tick.
> +	 *
> +	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
> +	 * to accumulate the delta and only change load_avg when the delta
> +	 * is big enough. This reduces the cacheline contention on load_avg.
> +	 * This flag will be set at allocation time depending on the system
> +	 * configuration.
>  	 */
> +	int use_la_delta;
>  	atomic_long_t load_avg ____cacheline_aligned;
> +	atomic_long_t load_avg_delta ____cacheline_aligned;

This would only work if the structure itself is allocated with
cacheline alignment, and looking at sched_create_group(), we use a
plain kzalloc() for this, which doesn't guarantee any sort of
alignment beyond machine word size IIRC.

Also, you unconditionally grow the structure by a whole cacheline.

>  #endif
>  #endif
>
> --
> 1.7.1
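To make the alignment point above concrete, a dedicated slab cache
would be one way to get cacheline-aligned task_group allocations.
Purely illustrative and untested; task_group_cache and the init hook
are made-up names, not part of this patch:

    static struct kmem_cache *task_group_cache;

    void __init task_group_cache_init(void)
    {
            /*
             * Request explicit cacheline alignment; passing
             * SLAB_HWCACHE_ALIGN in the flags would be an alternative.
             */
            task_group_cache = kmem_cache_create("task_group",
                                                 sizeof(struct task_group),
                                                 cache_line_size(), 0, NULL);
    }

    /* ... and in sched_create_group(), instead of the plain kzalloc(): */
    tg = kmem_cache_zalloc(task_group_cache, GFP_KERNEL);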