Date: Mon, 30 Nov 2015 14:13:32 -0500
From: Waiman Long
To: Peter Zijlstra
CC: Ingo Molnar, linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch,
    Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
Message-ID: <565C9FDC.9020603@hpe.com>
References: <1448478580-26467-1-git-send-email-Waiman.Long@hpe.com>
 <1448478580-26467-4-git-send-email-Waiman.Long@hpe.com>
 <20151130102240.GH17308@twins.programming.kicks-ass.net>
In-Reply-To: <20151130102240.GH17308@twins.programming.kicks-ass.net>

On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang

Sorry for that. Their names didn't show up when I did get_maintainer.pl.

> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system,
>> the cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulate the
>> load delta into load_avg_delta, which sits in a different cacheline.
>> If load_avg_delta is sufficiently large (> load_avg/64), it will then
>> be added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>>   9.44%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>   8.74%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>   7.83%  0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>   7.74%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>   7.27%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>   5.94%  1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>>   4.15%  3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>>   2.94%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>   2.52%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>   2.25%  0.02%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>   2.21%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>   1.70%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>   0.96%  0.34%  java  [kernel.vmlinux]  [k] task_tick_fair
>>   0.61%  0.48%  java  [kernel.vmlinux]  [k] update_cfs_shares
>
> This begs the question though; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?

I didn't use any cgroup in my test setup. Autogroup was enabled, though.
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
times.
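If the concern is only the root group, I suppose the tick-time update
could simply be skipped for it. An untested sketch of that idea (not
part of this patch) would be something like:

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

        /*
         * The root task group doesn't normalize its weight through
         * calc_tg_weight(), so its load_avg wouldn't need to be kept
         * up to date at all.
         */
        if (cfs_rq->tg == &root_task_group)
                return;

        if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                atomic_long_add(delta, &cfs_rq->tg->load_avg);
                cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
        }
}

That would take care of the autogroup/root case, but the cacheline
contention would still be there for workloads that do put tasks into
non-root task groups.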
>> The benchmark results before and after the patch were as follows:
>>
>>   Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>>   After patch  - Max-jOPs: 939130    Critical-jOps: 211937
>>
>> There was significant improvement in Critical-jOps which was latency
>> sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on 4-socket or larger systems
>> which can get the most benefit from it.
>
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.

I will be glad to run any additional tests, if necessary. I do need
pointers to those tests, though.

>> Signed-off-by: Waiman Long
>> ---
>>  kernel/sched/core.c  |    9 +++++++++
>>  kernel/sched/fair.c  |   30 ++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |    8 ++++++++
>>  3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>>                  root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>                  ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> +                /*
>> +                 * Use load_avg_delta if not 2P or less
>> +                 */
>> +                root_task_group.use_la_delta = (num_possible_nodes() > 2);
>> +#endif /* CONFIG_SMP */
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>  #ifdef CONFIG_RT_GROUP_SCHED
>>                  root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>>          if (!alloc_rt_sched_group(tg, parent))
>>                  goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
>> +        tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>>          return tg;
>>
>>  err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>>   * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead to reduce cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read mostly whereas
>> + * load_avg_delta will be write mostly.
>>   */
>>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>>  {
>>          long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>>          if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> -                atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> +                struct task_group *tg = cfs_rq->tg;
>> +                long load_avg, tot_delta;
>> +
>> +                if (!tg->use_la_delta) {
>> +                        /*
>> +                         * If the use_la_delta isn't set, just add the
>> +                         * delta directly into load_avg.
>> +                         */
>> +                        atomic_long_add(delta, &tg->load_avg);
>> +                        goto set_contrib;
>> +                }
>> +
>> +                tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
>> +                load_avg = atomic_long_read(&tg->load_avg);
>> +                if (abs(tot_delta) > load_avg / 64) {
>> +                        tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> +                        if (tot_delta)
>> +                                atomic_long_add(tot_delta, &tg->load_avg);
>> +                }
>> +set_contrib:
>>                  cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>>          }
>>  }
>
> I'm thinking that it's now far too big to retain the inline qualifier.

I can take the inline keyword out.

>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>>           * load_avg can be heavily contended at clock tick time, so put
>>           * it in its own cacheline separated from the fields above which
>>           * will also be accessed at each tick.
>> +         *
>> +         * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> +         * to accumulate the delta and only change load_avg when the delta
>> +         * is big enough. This reduces the cacheline contention on load_avg.
>> +         * This flag will be set at allocation time depending on the system
>> +         * configuration.
>>           */
>> +        int use_la_delta;
>>          atomic_long_t load_avg ____cacheline_aligned;
>> +        atomic_long_t load_avg_delta ____cacheline_aligned;
>
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.

With a RHEL 6 derived .config file, the size of the task_group structure
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
increased the size to 512 bytes, so it did make the structure a multiple
of the cacheline size. With both slub and slab, the task group pointers
returned by kzalloc() in sched_create_group() were all multiples of
0x200, so they were properly aligned for the ____cacheline_aligned tag
to work.
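If relying on the allocator happening to hand back 512-byte aligned
objects is considered too fragile, the alignment could also be made
explicit by giving task_group its own slab cache. An untested sketch
(not part of this patch):

/*
 * Dedicated cache so the allocator honours __alignof__(struct task_group),
 * which becomes a full cacheline once a ____cacheline_aligned member is
 * present.
 */
static struct kmem_cache *task_group_cache __read_mostly;

/* In sched_init(), before any task group can be created: */
task_group_cache = KMEM_CACHE(task_group, 0);

/* In sched_create_group(), replacing kzalloc(sizeof(*tg), GFP_KERNEL): */
tg = kmem_cache_zalloc(task_group_cache, GFP_KERNEL);

KMEM_CACHE() passes __alignof__(struct task_group) to kmem_cache_create(),
so the cacheline alignment would then be guaranteed rather than being a
side effect of the object size.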
> Also, you unconditionally grow the structure by a whole cacheline.

I know that is a drawback of using the ____cacheline_aligned tag.
However, we probably won't create too many task groups in normal use,
so the increase in memory consumption shouldn't be noticeable.

Cheers,
Longman