Date: Thu, 03 Dec 2015 14:56:37 -0500
From: Waiman Long
To: Peter Zijlstra
CC: Ingo Molnar, linux-kernel@vger.kernel.org, Yuyang Du, Paul Turner,
    Ben Segall, Morten Rasmussen, Scott J Norton, Douglas Hatch
Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

On 12/03/2015 06:12 AM, Peter Zijlstra wrote:
>
> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
> If a system with a large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>   10.52%   0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    9.66%   0.05%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>    8.65%   0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>    8.56%   0.00%  java  [kernel.vmlinux]  [k] update_process_times
>    8.07%   0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>    6.91%   1.78%  java  [kernel.vmlinux]  [k] task_tick_fair
>    5.24%   5.04%  java  [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline-aligned kmem_cache for task_group to make sure
> that all the allocated task_groups are cacheline aligned.
>
> By doing so, the perf profile became:
>
>   9.44%  0.00%  java  [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>   8.74%  0.01%  java  [kernel.vmlinux]  [k] hrtimer_interrupt
>   7.83%  0.03%  java  [kernel.vmlinux]  [k] tick_sched_timer
>   7.74%  0.00%  java  [kernel.vmlinux]  [k] update_process_times
>   7.27%  0.03%  java  [kernel.vmlinux]  [k] scheduler_tick
>   5.94%  1.74%  java  [kernel.vmlinux]  [k] task_tick_fair
>   4.15%  3.92%  java  [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
>   Before patch - Max-jOPs: 907533   Critical-jOps: 134877
>   After patch  - Max-jOPs: 916011   Critical-jOps: 142366
>
> Cc: Scott J Norton
> Cc: Douglas Hatch
> Cc: Ingo Molnar
> Cc: Yuyang Du
> Cc: Paul Turner
> Cc: Ben Segall
> Cc: Morten Rasmussen
> Signed-off-by: Waiman Long
> Signed-off-by: Peter Zijlstra (Intel)
> Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
> ---
>  kernel/sched/core.c  | 10 +++++++---
>  kernel/sched/sched.h |  7 ++++++-
>  2 files changed, 13 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
>   */
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
>  #endif
>
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
>  #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +

Thanks for making that change. Do we need to add the SLAB_HWCACHE_ALIGN
flag? Or we could make a helper flag that defines SLAB_HWCACHE_ALIGN
only when CONFIG_FAIR_GROUP_SCHED is defined, along the lines of the
sketch below.
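A rough, untested sketch of what I mean -- TG_SLAB_FLAGS is just an
illustrative name here, not an existing macro:

#ifdef CONFIG_FAIR_GROUP_SCHED
/* load_avg is split onto its own cacheline; align the allocation too */
#define TG_SLAB_FLAGS	SLAB_HWCACHE_ALIGN
#else
/* no load_avg cacheline split, so no need for hardware alignment */
#define TG_SLAB_FLAGS	0
#endif

	/* in sched_init(), under CONFIG_CGROUP_SCHED */
	task_group_cache = KMEM_CACHE(task_group, TG_SLAB_FLAGS);

That way kernels built without CONFIG_FAIR_GROUP_SCHED would not pay
for an alignment they cannot benefit from.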
Other than that, I am fine with the change.

Cheers,
Longman