2010-01-18 04:44:06

by Anton Blanchard

Subject: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters


Hi,

Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
and CONFIG_CGROUP_CPUACCT enabled. Thoughts?

--

When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.
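
To see why, here is roughly what the SMP __percpu_counter_add of that era
does (a simplified sketch of lib/percpu_counter.c from memory, not a
verbatim copy): the per-cpu delta is folded into the shared count, under a
spinlock, whenever it reaches the batch.

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	s32 *pcount;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= batch || count <= -batch) {
		/* slow path: fold into the shared count under the lock */
		spin_lock(&fbc->lock);
		fbc->count += count;
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		/* fast path: stays per-cpu, no shared state touched */
		*pcount = count;
	}
	put_cpu();
}

With CONFIG_VIRT_CPU_ACCOUNTING, val arrives in fine-grained cputime units
(timebase ticks on PowerPC), so a single update can exceed the default batch
and every call takes the slow path.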

Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.

This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.

For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.
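
To make that folding concrete, the !CONFIG_VIRT_CPU_ACCOUNTING case reduces
as follows (an illustration of the constant folding, not code from the
patch):

batch = min_t(long, percpu_counter_batch * 1, INT_MAX);
/* == min_t(long, percpu_counter_batch, INT_MAX)
 * == percpu_counter_batch, since an int can never exceed INT_MAX */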

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:

CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec

Tested with:

wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1

Signed-off-by: Anton Blanchard <[email protected]>
---

Note: cc'ing ia64 and s390, who have not yet added code to statically
initialise cputime_one_jiffy at boot.
See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
jiffies_to_cputime(1)) for details. Adding this would help optimise not only
this patch but many other areas of the scheduler when
CONFIG_VIRT_CPU_ACCOUNTING is enabled.

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
+++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
@@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch;

if (unlikely(!cpuacct_subsys.active))
return;
@@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
rcu_read_lock();
ca = task_ca(tsk);

+ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();


2010-01-18 04:55:41

by KOSAKI Motohiro

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

>
> Hi,
>
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
>
> --
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would
> if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to
> optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
> at least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
> and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
>
> Tested with:
>
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
>
> Signed-off-by: Anton Blanchard <[email protected]>

Looks fine to me. Thanks Anton!
Reviewed-by: KOSAKI Motohiro <[email protected]>


> ---
>
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1) for details). Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
>
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();


2010-01-18 07:51:51

by Peter Zijlstra

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Mon, 2010-01-18 at 15:41 +1100, Anton Blanchard wrote:
> Hi,
>
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?

Seems like a good idea, but isn't that batch number rather static? If
so, computing it in some init path would save that multiply on the
actual accounting path.

>
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();

2010-01-18 08:26:16

by Anton Blanchard

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters


Hi Peter,

> Seems like a good idea, but isn't that batch number rather static? If
> so, computing it in some init path would save that multiply on the
> actual accounting path.

It is mostly static, but percpu_counter_batch does change with hotplug
operations. Adding a hotplug notifier felt like a lot of work, but I was
worried we would have issues if we didn't scale with hotplug add operations.

Anton
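
For context, the hotplug scaling referred to above lives in
lib/percpu_counter.c; from memory, that era's logic is roughly this sketch:

int percpu_counter_batch __read_mostly = 32;
EXPORT_SYMBOL(percpu_counter_batch);

static void compute_batch_value(void)
{
	int nr = num_online_cpus();

	/* bigger machines get a bigger batch to keep lock contention down */
	percpu_counter_batch = max(32, nr * 2);
}

/*
 * compute_batch_value() is re-run from a cpu hotplug notifier on every
 * online/offline event (the real callback also folds a dead cpu's
 * pending counts back into fbc->count).
 */

Any value derived from percpu_counter_batch at init time therefore goes
stale across hotplug, which is what recomputing it on each call avoids.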

2010-01-18 08:35:13

by Peter Zijlstra

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Mon, 2010-01-18 at 19:21 +1100, Anton Blanchard wrote:
> Hi Peter,
>
> > Seems like a good idea, but isn't that batch number rather static? If
> > so, computing it in some init path would save that multiply on the
> > actual accounting path.
>
> It is mostly static, but percpu_counter_batch does change with hotplug
> operations. Adding a hotplug notifier felt like a lot of work but I was
> worried we have issues if we didn't scale with hotplug add operations.

OK, trading a notifier for a mult might be worth it, we'll see if
someone complains ;-)

Let's go with this for now.

2010-01-18 08:36:15

by Balbir Singh

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Monday 18 January 2010 10:11 AM, Anton Blanchard wrote:
>
> Hi,
>
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
>
> --
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would
> if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to
> optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
> at least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
> and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
>
> Tested with:
>
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
>
> Signed-off-by: Anton Blanchard <[email protected]>
> ---
>
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1) for details). Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
>
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();

Looks good to me, but I'll test it as well and report back. I think we
might need to look at the call site where we do the percpu_counter_read().

Acked-by: Balbir Singh <[email protected]>

Balbir

2010-01-18 09:42:23

by Martin Schwidefsky

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

Hi Anton,

On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <[email protected]> wrote:

> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1) for details). Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.

For s390, jiffies_to_cputime is a compile-time constant. No need to
initialize it at runtime, no?

> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();

The patch itself trades some accuracy (larger cpu accounting values are
stored per-cpu) against runtime overhead (the spinlock taken to transfer the
value to the global counter in __percpu_counter_add). Did you
calculate how big the loss in accuracy is?

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2010-01-18 09:57:25

by Anton Blanchard

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters


Hi Martin,

> For s390 the jiffies_to_cputime is a compile time constant. No need to
> initialize it at runtime, no?

Indeed it is; I didn't look closely enough. Same with ia64, so there is no
work to do on either arch :)

> The patch itself trades some accuracy (larger cpu accounting value that
> are stored per-cpu) against runtime overhead (spinlock to transfer the
> value to the global variable in __percpu_counter_add). Did you
> calculate how big the loss in accuracy is?

The idea is we are already batching percpu_counter_batch jiffies, so
with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
cputime.

Anton
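
As a back-of-envelope answer to Martin's question, using assumed figures for
the 64-thread box above (percpu_counter_batch = max(32, 2 * 64) = 128 under
the default scaling):

  per-cpu drift < batch = percpu_counter_batch * cputime_one_jiffy
                        = 128 jiffies' worth of cputime per counter
  reader lag    < nr_cpus * batch
                        = 64 * 128 = 8192 jiffies of cpu time in total,
                          i.e. roughly 1.3 seconds per cpu at HZ=100

That is the same number of jiffies a reader could already be behind with
CONFIG_VIRT_CPU_ACCOUNTING disabled, just expressed in cputime units.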

2010-01-18 11:01:05

by Balbir Singh

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Monday 18 January 2010 03:25 PM, Anton Blanchard wrote:
>
> Hi Martin,
>
>> For s390 the jiffies_to_cputime is a compile time constant. No need to
>> initialize it at runtime, no?
>
> Indeed it is, I didn't look closely enough. Same with ia64 so no work to
> do on either arch :)
>
>> The patch itself trades some accuracy (larger cpu accounting value that
>> are stored per-cpu) against runtime overhead (spinlock to transfer the
>> value to the global variable in __percpu_counter_add). Did you
>> calculate how big the loss in accuracy is?
>
> The idea is we are already batching percpu_counter_batch jiffies, so
> with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
> cputime.
>

This patch worked well for me.

Acked-by: Balbir Singh <[email protected]>
Tested-by: Balbir Singh <[email protected]>

Balbir Singh.

2010-01-21 15:25:48

by Balbir Singh

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Monday 18 January 2010 03:25 PM, Anton Blanchard wrote:
>
> Hi Martin,
>
>> For s390 the jiffies_to_cputime is a compile time constant. No need to
>> initialize it at runtime, no?
>
> Indeed it is, I didn't look closely enough. Same with ia64 so no work to
> do on either arch :)
>
>> The patch itself trades some accuracy (larger cpu accounting value that
>> are stored per-cpu) against runtime overhead (spinlock to transfer the
>> value to the global variable in __percpu_counter_add). Did you
>> calculate how big the loss in accuracy is?
>
> The idea is we are already batching percpu_counter_batch jiffies, so
> with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
> cputime.
>

Hi, Peter, Ingo

Could we please pick up the patch for -tip?

--
Three Cheers,
Balbir Singh

2010-01-21 15:36:51

by Peter Zijlstra

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Thu, 2010-01-21 at 20:55 +0530, Balbir Singh wrote:

> Could we please pick up the patch for -tip?

OK, will queue it up if Ingo doesn't beat me to it.

2010-01-25 23:14:54

by Andrew Morton

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <[email protected]> wrote:

> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.

When one looks at the end result:

: static void cpuacct_update_stats(struct task_struct *tsk,
: enum cpuacct_stat_index idx, cputime_t val)
: {
: struct cpuacct *ca;
: int batch;
:
: if (unlikely(!cpuacct_subsys.active))
: return;
:
: rcu_read_lock();
: ca = task_ca(tsk);
:
: batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
: do {
: __percpu_counter_add(&ca->cpustat[idx], val, batch);
: ca = ca->parent;
: } while (ca);
: rcu_read_unlock();
: }

the code (which used to be quite obvious) becomes pretty unobvious. In
fact it looks quite wrong.

Shouldn't there be a comment there explaining wtf is going on?

2010-01-25 23:44:43

by KOSAKI Motohiro

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

> On Mon, 18 Jan 2010 15:41:42 +1100
> Anton Blanchard <[email protected]> wrote:
>
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> > call cpuacct_update_stats with values much larger than percpu_counter_batch.
> > This means the call to percpu_counter_add will always add to the global count
> > which is protected by a spinlock and we end up with a global spinlock in
> > the scheduler.
>
> When one looks at the end result:

We have about a 32-jiffy batch both with and without
CONFIG_VIRT_CPU_ACCOUNTING, so the end user may simply see the counter
values show up a few jiffies later.



2010-01-26 06:17:28

by Balbir Singh

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Tuesday 26 January 2010 04:44 AM, Andrew Morton wrote:
> On Mon, 18 Jan 2010 15:41:42 +1100
> Anton Blanchard <[email protected]> wrote:
>
>> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
>> call cpuacct_update_stats with values much larger than percpu_counter_batch.
>> This means the call to percpu_counter_add will always add to the global count
>> which is protected by a spinlock and we end up with a global spinlock in
>> the scheduler.
>
> When one looks at the end result:
>
> : static void cpuacct_update_stats(struct task_struct *tsk,
> : enum cpuacct_stat_index idx, cputime_t val)
> : {
> : struct cpuacct *ca;
> : int batch;
> :
> : if (unlikely(!cpuacct_subsys.active))
> : return;
> :
> : rcu_read_lock();
> : ca = task_ca(tsk);
> :
> : batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> : do {
> : __percpu_counter_add(&ca->cpustat[idx], val, batch);
> : ca = ca->parent;
> : } while (ca);
> : rcu_read_unlock();
> : }
>
> the code (which used to be quite obvious) becomes pretty unobvious. In
> fact it looks quite wrong.
>
> Shouldn't there be a comment there explaining wtf is going on?

Andrew,

I guess a lot of the changelog and comments are in the email history,
but your point on the comment is valid. Why does it look quite wrong to you?

cputime_one_jiffy tells us how many cputime_t units we accumulate in one
jiffy: quite a large number when virtual accounting is enabled, and 1 when
it is not. So the batch works out to 32 on systems without virtual
accounting, and to 32*cputime_one_jiffy on systems that support it, which
means the per-cpu counters sync up with the global count roughly once every
32 jiffies, assuming cpuacct_update_stats is called once per jiffy.

If the above comment pleases you, I'll polish it and send it across.
Anton, could you please confirm that what I've said above is correct?



--
Three Cheers,
Balbir Singh

2010-01-26 06:36:10

by Andrew Morton

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Tue, 26 Jan 2010 11:47:15 +0530 Balbir Singh <[email protected]> wrote:

> On Tuesday 26 January 2010 04:44 AM, Andrew Morton wrote:
> > On Mon, 18 Jan 2010 15:41:42 +1100
> > Anton Blanchard <[email protected]> wrote:
> >
> >> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> >> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> >> This means the call to percpu_counter_add will always add to the global count
> >> which is protected by a spinlock and we end up with a global spinlock in
> >> the scheduler.
> >
> > When one looks at the end result:
> >
> > : static void cpuacct_update_stats(struct task_struct *tsk,
> > : enum cpuacct_stat_index idx, cputime_t val)
> > : {
> > : struct cpuacct *ca;
> > : int batch;
> > :
> > : if (unlikely(!cpuacct_subsys.active))
> > : return;
> > :
> > : rcu_read_lock();
> > : ca = task_ca(tsk);
> > :
> > : batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> > : do {
> > : __percpu_counter_add(&ca->cpustat[idx], val, batch);
> > : ca = ca->parent;
> > : } while (ca);
> > : rcu_read_unlock();
> > : }
> >
> > the code (which used to be quite obvious) becomes pretty unobvious. In
> > fact it looks quite wrong.
> >
> > Shouldn't there be a comment there explaining wtf is going on?
>
> Andrew,
>
> I guess a lot of the changelog and comments are in the email history,

Not a very useful location for it!

> Why does it look quite wrong to you?

Because it computes the correct value and then if it's larger than
INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!


Does that code actually work, btw? percpu_counter_batch has type `int'
and cputime_one_jiffy has type `int' so their product has type `int'.
So by the time min_t performs its comparison, the upper 32 bits of the
product are already lost.

2010-01-26 08:11:01

by Anton Blanchard

Subject: Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters


Hi Andrew,

> > I guess a lot of the changelog and comments are in the email history,
>
> Not a very useful location for it!

Good point, I'll work on a useful comment.

> > Why does it look quite wrong to you?
>
> Because it computes the correct value and then if it's larger than
> INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!
>
> Does that code actually work, btw? percpu_counter_batch has type `int'
> and cputime_one_jiffy has type `int' so their product has type `int'.
> So by the time min_t performs its comparison, the upper 32 bits of the
> product are already lost.

On ppc64, s390 and ia64, cputime_one_jiffy is 64-bit and I want to prevent
us from creating too large a batch value:

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)

If we pass in something bigger than INT_MAX we could end up with 0 after
truncation which will turn the percpu counter into a single spinlock global
counter.

Anton
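
A small userspace sketch of that failure mode, with deliberately hypothetical
numbers chosen so the 64-bit product is an exact multiple of 2^32 (assumes a
64-bit long, as on ppc64):

#include <stdio.h>
#include <limits.h>

/* simplified stand-in for the kernel's min_t macro */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	long percpu_counter_batch = 32;     /* hypothetical value */
	long cputime_one_jiffy = 1L << 27;  /* hypothetical value */

	long product = percpu_counter_batch * cputime_one_jiffy; /* 2^32 */

	int capped = min_t(long, product, INT_MAX); /* what the patch passes */
	int truncated = (int)product;  /* what a bare s32 argument would see */

	/* prints: capped=2147483647 truncated=0 -- and a batch of 0 sends
	   every __percpu_counter_add through the global spinlock */
	printf("capped=%d truncated=%d\n", capped, truncated);
	return 0;
}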

2010-01-27 13:16:30

by Anton Blanchard

Subject: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter batch values for stats counters

Commit-ID: 43f85eab1411905afe5db510fbf9841b516e7e6a
Gitweb: http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
Author: Anton Blanchard <[email protected]>
AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 27 Jan 2010 08:34:38 +0100

sched: cpuacct: Use bigger percpu counter batch values for stats counters

When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
can call cpuacct_update_stats with values much larger than
percpu_counter_batch. This means the call to percpu_counter_add will
always add to the global count which is protected by a spinlock and we
end up with a global spinlock in the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would if
CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu
add/remove.

This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.

For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC. So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:

CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec

Tested with:

wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1

Signed-off-by: Anton Blanchard <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <20100118044142.GS12666@kryten>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..8f94138 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch;

if (unlikely(!cpuacct_subsys.active))
return;
@@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
rcu_read_lock();
ca = task_ca(tsk);

+ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();

2010-01-27 13:34:58

by Balbir Singh

Subject: Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
<[email protected]> wrote:
> Commit-ID: 43f85eab1411905afe5db510fbf9841b516e7e6a
> Gitweb: http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> Author: Anton Blanchard <[email protected]>
> AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
>
> sched: cpuacct: Use bigger percpu counter batch values for stats counters
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> can call cpuacct_update_stats with values much larger than
> percpu_counter_batch. This means the call to percpu_counter_add will
> always add to the global count which is protected by a spinlock and we
> end up with a global spinlock in the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would if
> CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu
> add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
>
> Tested with:
>
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
>
> Signed-off-by: Anton Blanchard <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> LKML-Reference: <20100118044142.GS12666@kryten>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/sched.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..8f94138 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();

IIRC, Andrew picked up this patch as well and applied some checkpatch
fixes too..

Balbir

2010-01-27 21:23:22

by Andrew Morton

Subject: Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Wed, 27 Jan 2010 19:04:55 +0530
Balbir Singh <[email protected]> wrote:

> On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
> <[email protected]> wrote:
> > Commit-ID: __43f85eb1411905afe5db510fbf9841b516e7e6a
> > Gitweb: __ __ http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> > Author: __ __ Anton Blanchard <[email protected]>
> > AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> > Committer: __Ingo Molnar <[email protected]>
> > CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
> >
> > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> >
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> > can call cpuacct_update_stats with values much larger than
> > percpu_counter_batch. This means the call to percpu_counter_add will
> > always add to the global count which is protected by a spinlock and we
> > end up with a global spinlock in the scheduler.
> >
> > Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> > cputime_one_jiffy such that we have the same batch limit as we would if
> > CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> > but that initialisation happened too early on PowerPC (before time_init)
> > and it was never updated at runtime as a result of a hotplug cpu
> > add/remove.
> >
> > This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> > runtime, which keeps the batch correct even after cpu hotplug operations.
> > We cap it at INT_MAX in case of overflow.
> >
> > For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> > cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> > min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> > least on x86 and PowerPC. So there is no need to add an #ifdef.
> >
> > On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> > faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
> >
> > CONFIG_CGROUP_CPUACCT disabled: __ __ __ __ 16906698 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT enabled: __ __ __ __ __ __ 61720 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT + patch: __ __ __ __ __16663217 ctx switches/sec
> >
> > Tested with:
> >
> > __wget http://ozlabs.org/~anton/junkcode/context_switch.c
> > __make context_switch
> > __for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> > __vmstat 1
> >
> > Signed-off-by: Anton Blanchard <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > LKML-Reference: <20100118044142.GS12666@kryten>
> > Signed-off-by: Ingo Molnar <[email protected]>
> > ---
> > __kernel/sched.c | __ __4 +++-
> > __1 files changed, 3 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..8f94138 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> > __ __ __ __ __ __ __ __enum cpuacct_stat_index idx, cputime_t val)
> > __{
> > __ __ __ __struct cpuacct *ca;
> > + __ __ __ int batch;
> >
> > __ __ __ __if (unlikely(!cpuacct_subsys.active))
> > __ __ __ __ __ __ __ __return;
> > @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> > __ __ __ __rcu_read_lock();
> > __ __ __ __ca = task_ca(tsk);
> >
> > + __ __ __ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> > __ __ __ __do {
> > - __ __ __ __ __ __ __ percpu_counter_add(&ca->cpustat[idx], val);
> > + __ __ __ __ __ __ __ __percpu_counter_add(&ca->cpustat[idx], val, batch);
> > __ __ __ __ __ __ __ __ca = ca->parent;
> > __ __ __ __} while (ca);
> > __ __ __ __rcu_read_unlock();

^^ your email client inexplicably fills emails with 0xa0

> IIRC, Andrew picked up this patch as well and applied some checkpatch
> fixes too..

No, I have no changes.

Last I heard, Anton was "working on a useful comment" and will be
redoing the patch.

2010-01-27 21:43:53

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter batch values for stats counters

On Wed, 2010-01-27 at 13:22 -0800, Andrew Morton wrote:

> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.

If Anton is working on a new version, would Anton then also care to post
one that compiles on UP? :-)

2010-01-28 05:21:23

by Balbir Singh

Subject: Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter batch values for stats counters

>> > __ __ __ __ __ __ __ __ca = ca->parent;
>> > __ __ __ __} while (ca);
>> > __ __ __ __rcu_read_unlock();
>
> ^^ your email client inexplicably fills emails with 0xa0
>

I am going to be changing my email client soon, back to the good old
text client.

>> IIRC, Andrew picked up this patch as well and applied some checkpatch
>> fixes too..
>
> No, I have no changes.
>

Sorry, I was confused, the changes you had were for getdelays.c

> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.

Yep, let's wait on him.

Three Cheers,
Balbir

2010-01-28 12:50:52

by Anton Blanchard

Subject: [PATCH 1/2] percpu_counter: Make __percpu_counter_add an inline function on UP


Even though batch isn't used on UP, we may want to pass one in to keep the
SMP and UP code paths similar. Convert __percpu_counter_add to an inline
function so we won't get unused variable warnings if we do.

Signed-off-by: Anton Blanchard <[email protected]>
---

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index a7684a5..794662b 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -98,9 +98,6 @@ static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
fbc->count = amount;
}

-#define __percpu_counter_add(fbc, amount, batch) \
- percpu_counter_add(fbc, amount)
-
static inline void
percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
@@ -109,6 +106,12 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
preempt_enable();
}

+static inline void
+__percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+ percpu_counter_add(fbc, amount);
+}
+
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
return fbc->count;
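
To illustrate the warning the changelog alludes to, here is a sketch of the
caller's view on a UP build (compute_batch() is a hypothetical stand-in for
whatever computes the value):

/* UP build, with the old macro: */
int batch = compute_batch();	/* hypothetical helper */
__percpu_counter_add(&ca->cpustat[idx], val, batch);
/*
 * The macro expands the call to percpu_counter_add(&ca->cpustat[idx], val),
 * so `batch` is assigned but never read and gcc emits an unused variable
 * warning. With the inline function, batch is a real (ignored) parameter,
 * the caller's variable is genuinely used, and the warning goes away.
 */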

2010-01-28 12:56:06

by Anton Blanchard

Subject: [PATCH 2/2] sched: cpuacct: Use bigger percpu counter batch values for stats counters


When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.

This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.

For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:

CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec

Tested with:

wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1

Signed-off-by: Anton Blanchard <[email protected]>
---

v2: Added a comment and fixed the UP build.

Index: linux-cpumask/kernel/sched.c
===================================================================
--- linux-cpumask.orig/kernel/sched.c 2010-01-22 18:23:43.377212514 +1100
+++ linux-cpumask/kernel/sched.c 2010-01-28 23:24:02.677233753 +1100
@@ -10885,12 +10885,30 @@ static void cpuacct_charge(struct task_s
}

/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
+ * in cputime_t units. As a result, cpuacct_update_stats calls
+ * percpu_counter_add with values large enough to always overflow the
+ * per cpu batch limit causing bad SMP scalability.
+ *
+ * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
+ * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
+ * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
+ */
+#ifdef CONFIG_SMP
+#define CPUACCT_BATCH \
+ min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
+#else
+#define CPUACCT_BATCH 0
+#endif
+
+/*
* Charge the system/user time to the task's accounting group.
*/
static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch = CPUACCT_BATCH;

if (unlikely(!cpuacct_subsys.active))
return;
@@ -10899,7 +10917,7 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);

do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();