2009-07-16 08:11:56

by Bharata B Rao

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Tue, May 12, 2009 at 07:44:37PM +0900, KOSAKI Motohiro wrote:
> ------------------------------------
> Subject: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
>
> percpu counters used to accumulate statistics in cpuacct controller use
> the default batch value [max(2*nr_cpus, 32)] which can be too small for
> archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
> cputime updates in the range of thousands. As a result, cpuacct_update_stats()
> would end up acquiring the percpu counter spinlock on every tick which
> is not good for performance.
>
> Let those architectures to have a bigger batch threshold so that percpu counter
> spinlock isn't taken on every tick. This change doesn't affect the archs which
> don't define VIRT_CPU_ACCOUNTING and they continue to have the default
> percpu counter batch value.
>
> v7:
> - fix typo and changelog
>
> v6:
> - fix build error when UP
>
> v5:
> - move cpuacct_batch initialization into sched_init()
>
> v4:
> - rewrite patch description (thanks Bharata!)
> - append read_mostly to cpuacct_batch
> - cpuacct_batch is initialized by sched_init_debug()
>
> v3:
> - revert using percpu_counter_sum()
>
> v2:
> - use percpu_counter_sum() instead percpu_counter_read()
>
> Cc: Balaji Rao <[email protected]>
> Cc: Dhaval Giani <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> kernel/sched.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c 2009-05-12 13:12:59.000000000 +0900
> +++ b/kernel/sched.c 2009-05-12 19:04:49.000000000 +0900
> @@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
> */
> int sysctl_sched_rt_runtime = 950000;
>
> +static __read_mostly s32 cpuacct_batch;
> +
> static inline u64 global_rt_period(void)
> {
> return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> @@ -9284,6 +9286,10 @@ void __init sched_init(void)
>
> perf_counter_init();
>
> +#ifdef CONFIG_SMP
> + cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> +#endif

On ppc64, calling jiffies_to_cputime() from sched_init() is too early because
jiffies_to_cputime() needs tb_ticks_per_sec which gets initialized only
later in time_init(). Because of this I see that cpuacct_batch will always
be zero effectively negating what this patch is trying to do.

As explained by you earlier, we too are finding the default batch value to
be too low for ppc64 with VIRT_CPU_ACCOUNTING turned on. Hence I guess
if this patch is taken in (ofcourse with the above issue fixed), it will
benefit ppc64 also.

Regards,
Bharata.


2009-07-16 08:39:57

by Anton Blanchard

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters


Hi,

> On ppc64, calling jiffies_to_cputime() from sched_init() is too early because
> jiffies_to_cputime() needs tb_ticks_per_sec which gets initialized only
> later in time_init(). Because of this I see that cpuacct_batch will always
> be zero effectively negating what this patch is trying to do.
>
> As explained by you earlier, we too are finding the default batch value to
> be too low for ppc64 with VIRT_CPU_ACCOUNTING turned on. Hence I guess
> if this patch is taken in (ofcourse with the above issue fixed), it will
> benefit ppc64 also.

I created this patch earlier today when I hit the problem. Thoughts?

Anton
--

When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
with values much larger than percpu_counter_batch. This means the
call to percpu_counter_add will always add to the global count which is
protected by a spinlock.

Since reading of the CPU accounting cgroup counters is not performance
critical, we can use a maximum size batch of INT_MAX and use
percpu_counter_sum on the read side which will add all the percpu
counters.

With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
397k/sec to 3.9M/sec, a 10x improvement.

Signed-off-by: Anton Blanchard <[email protected]>
---

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c 2009-07-16 10:11:02.000000000 +1000
+++ linux.trees.git/kernel/sched.c 2009-07-16 10:16:41.000000000 +1000
@@ -10551,7 +10551,7 @@
int i;

for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
- s64 val = percpu_counter_read(&ca->cpustat[i]);
+ s64 val = percpu_counter_sum(&ca->cpustat[i]);
val = cputime64_to_clock_t(val);
cb->fill(cb, cpuacct_stat_desc[i], val);
}
@@ -10621,7 +10621,7 @@
ca = task_ca(tsk);

do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
ca = ca->parent;
} while (ca);
rcu_read_unlock();