Date: Thu, 03 Dec 2015 14:26:05 -0500
From: Waiman Long
To: bsegall@google.com
CC: Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org, Yuyang Du,
    Paul Turner, Morten Rasmussen, Scott J Norton, Douglas Hatch
Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Message-ID: <5660974D.7050006@hpe.com>

On 12/02/2015 03:02 PM, bsegall@google.com wrote:
> Waiman Long writes:
>
>> If a system with a large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>>   10.52%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    9.66%   0.05%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>    8.65%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>    8.56%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>    8.07%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>    6.91%   1.78%  java  [kernel.vmlinux]  [k] task_tick_fair
>>    5.24%   5.04%  java  [kernel.vmlinux]  [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contain variables like shares, cfs_rq & se which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline-aligned kmem_cache for task_group to make sure
>> that all the allocated task_groups are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>>    9.44%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    8.74%   0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>>    7.83%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>>    7.74%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>>    7.27%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>>    5.94%   1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>>    4.15%   3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before.
>> The benchmark results before and after the patch were as follows:
>>
>>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long
>> ---
>>  kernel/sched/core.c  | 36 ++++++++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |  7 ++++++-
>>  2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>>   */
>>  struct task_group root_task_group;
>>  LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>>  #endif
>>
>>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>  		ptr += nr_cpu_ids * sizeof(void **);
>>
>> +		task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
>
> The KMEM_CACHE macro suggests instead adding
> ____cacheline_aligned_in_smp to the struct definition.

The main goal is to have load_avg placed in a new cacheline, separated
from the read-only fields above it. That is why I put ____cacheline_aligned
on the load_avg field itself. I omitted the _in_smp part because the field
is already inside a CONFIG_SMP block. Putting ____cacheline_aligned_in_smp
on the struct definition won't guarantee the alignment of any particular
field within the structure. I have done some tests, and having
____cacheline_aligned inside the structure also has the effect of forcing
the whole structure onto a cacheline-aligned boundary.
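To make the difference concrete, here is a minimal userspace sketch (my
own illustration, not the actual sched.h code). It assumes 64-byte
cachelines, which is what ____cacheline_aligned expands to on x86-64:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE	64	/* assumed L1 cacheline size */

/* Annotating the struct only aligns where the whole struct starts. */
struct tg_struct_aligned {
	unsigned long shares;	/* read-mostly, touched every tick */
	long load_avg;		/* hot, still shares a line with shares */
} __attribute__((aligned(CACHELINE)));

/* Annotating the member pushes load_avg onto its own cacheline. */
struct tg_field_aligned {
	unsigned long shares;
	long load_avg __attribute__((aligned(CACHELINE)));
};

int main(void)
{
	printf("struct-aligned: load_avg at offset %zu, sizeof %zu\n",
	       offsetof(struct tg_struct_aligned, load_avg),
	       sizeof(struct tg_struct_aligned));	/* 8, 64 */
	printf("field-aligned:  load_avg at offset %zu, sizeof %zu\n",
	       offsetof(struct tg_field_aligned, load_avg),
	       sizeof(struct tg_field_aligned));	/* 64, 128 */
	return 0;
}

Of course, the member attribute only fixes the offset within the
structure; the allocation itself still has to start on a cacheline
boundary, which is what the SLAB_HWCACHE_ALIGN kmem_cache above is for.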
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>  #ifdef CONFIG_RT_GROUP_SCHED
>>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>>  /* task_group_lock serializes the addition/removal of task groups */
>>  static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>  static void free_sched_group(struct task_group *tg)
>>  {
>>  	free_fair_sched_group(tg);
>>  	free_rt_sched_group(tg);
>>  	autogroup_free(tg);
>> -	kfree(tg);
>> +	free_task_group(tg);
>>  }
>>
>>  /* allocate runqueue etc for a new task group */
>> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>>  {
>>  	struct task_group *tg;
>>
>> -	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>> +	tg = alloc_task_group();
>>  	if (!tg)
>>  		return ERR_PTR(-ENOMEM);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index efd3bfc..e679895 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -248,7 +248,12 @@ struct task_group {
>>  	unsigned long shares;
>>
>>  #ifdef	CONFIG_SMP
>> -	atomic_long_t load_avg;
>> +	/*
>> +	 * load_avg can be heavily contended at clock tick time, so put
>> +	 * it in its own cacheline separated from the fields above which
>> +	 * will also be accessed at each tick.
>> +	 */
>> +	atomic_long_t load_avg ____cacheline_aligned;
>>  #endif
>>  #endif
>
> I suppose the question is whether it would be better to just move this to
> wind up on a separate cacheline without the extra empty space, though it
> would likely be more fragile and unclear.

I have been thinking about that too. The problem is that anything which
ends up in the same cacheline as load_avg and has to be accessed at clock
tick time will cause the same contention problem. In the current layout,
the fields after load_avg are the RT stuff as well as some list_head
structures and pointers. The RT stuff should be more or less mutually
exclusive with the CFS load_avg in terms of usage, and the list_head
structures and pointers don't seem to be accessed that frequently. So
load_avg is the right place to start a new cacheline boundary (rough
sketch of the layout at the end of this mail).

Cheers,
Longman
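For reference, the layout I am describing above looks roughly like this
(trimmed and paraphrased from kernel/sched/sched.h with the patch applied;
not the verbatim definition, just the fields that matter for the argument):

struct task_group {
	struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity **se;	/* read-mostly, used every tick */
	struct cfs_rq **cfs_rq;		/* read-mostly, used every tick */
	unsigned long shares;		/* read-mostly, used every tick */

#ifdef CONFIG_SMP
	/* hot counter: starts a fresh cacheline away from the fields above */
	atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
	struct sched_rt_entity **rt_se;	/* RT-only; not used together with */
	struct rt_rq **rt_rq;		/* the CFS load_avg in practice    */
	struct rt_bandwidth rt_bandwidth;
#endif

	struct rcu_head rcu;		/* rarely touched */
	struct list_head list;		/* rarely touched */
	struct task_group *parent;
	struct list_head siblings;
	struct list_head children;

	/* ... autogroup, cfs_bandwidth, etc. ... */
};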