Hi,
Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
--
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.
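For context, a rough sketch of the batching mechanism, paraphrased from
lib/percpu_counter.c of that era (simplified, not the exact source): each CPU
accumulates deltas in a local s32 and only folds them into the
spinlock-protected global count once a batch's worth has built up, which is
why a single add that already exceeds the batch takes the lock every time.

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	s32 *pcount;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= batch || count <= -batch) {
		/* local delta reached the batch: fold into the global count */
		spin_lock(&fbc->lock);
		fbc->count += count;
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		/* stay lock free: keep the delta in this CPU's counter */
		*pcount = count;
	}
	put_cpu();
}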
Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.
This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.
For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
Tested with:
wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1
Signed-off-by: Anton Blanchard <[email protected]>
---
Note: ccing ia64 and s390 who have not yet added code to statically
initialise cputime_one_jiffy at boot.
See commit a42548a18866e87092db93b771e6c5b060d78401 ("cputime: Optimize
jiffies_to_cputime(1)") for details. Adding this would help optimise not only
this patch but many other areas of the scheduler when
CONFIG_VIRT_CPU_ACCOUNTING is enabled.
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
+++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
@@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch;
if (unlikely(!cpuacct_subsys.active))
return;
@@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
rcu_read_lock();
ca = task_ca(tsk);
+ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();
>
> Hi,
>
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
>
> --
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would
> if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to
> optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
> at least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
> and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
>
> Tested with:
>
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
>
> Signed-off-by: Anton Blanchard <[email protected]>
Looks fine to me. Thanks Anton!
Reviewed-by: KOSAKI Motohiro <[email protected]>
> ---
>
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See commit a42548a18866e87092db93b771e6c5b060d78401 ("cputime: Optimize
> jiffies_to_cputime(1)") for details. Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
>
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();
On Mon, 2010-01-18 at 15:41 +1100, Anton Blanchard wrote:
> Hi,
>
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
Seems like a good idea, but isn't that batch number rather static? If
so, computing it in some init path would save that multiply on the
actual accounting path.
>
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();
Hi Peter,
> Seems like a good idea, but isn't that batch number rather static? If
> so, computing it in some init path would save that multiply on the
> actual accounting path.
It is mostly static, but percpu_counter_batch does change with hotplug
operations. Adding a hotplug notifier felt like a lot of work, but I was
worried we'd have issues if we didn't scale with hotplug add operations.
Anton
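For reference, percpu_counter_batch itself is recomputed from the online CPU
count via a hotplug notifier; a simplified sketch of the relevant helper in
lib/percpu_counter.c (details may differ):

static void compute_batch_value(void)
{
	int nr = num_online_cpus();

	/* grow the batch with the machine so per-CPU adds stay cheap */
	percpu_counter_batch = max(32, nr * 2);
}

The notifier reruns this on every cpu online/offline event, which is the
behaviour the scaled cpuacct batch needs to track.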
On Mon, 2010-01-18 at 19:21 +1100, Anton Blanchard wrote:
> Hi Peter,
>
> > Seems like a good idea, but isn't that batch number rather static? If
> > so, computing it in some init path would save that multiply on the
> > actual accounting path.
>
> It is mostly static, but percpu_counter_batch does change with hotplug
> operations. Adding a hotplug notifier felt like a lot of work, but I was
> worried we'd have issues if we didn't scale with hotplug add operations.
OK, trading a notifier for a mult might be worth it, we'll see if
someone complains ;-)
Let's go with this for now.
On Monday 18 January 2010 10:11 AM, Anton Blanchard wrote:
>
> Hi,
>
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
>
> --
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would
> if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to
> optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
> at least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
> and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
>
> Tested with:
>
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
>
> Signed-off-by: Anton Blanchard <[email protected]>
> ---
>
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See commit a42548a18866e87092db93b771e6c5b060d78401 ("cputime: Optimize
> jiffies_to_cputime(1)") for details. Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
>
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();
Looks good to me, but I'll test it as well and report back. I think we
might need to look at the call site where we do percpu_counter_read().
Acked-by: Balbir Singh <[email protected]>
Balbir
Hi Anton,
On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <[email protected]> wrote:
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot.
> See commit a42548a18866e87092db93b771e6c5b060d78401 ("cputime: Optimize
> jiffies_to_cputime(1)") for details. Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
For s390 the jiffies_to_cputime is a compile time constant. No need to
initialize it at runtime, no?
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c 2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c 2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct
> enum cpuacct_stat_index idx, cputime_t val)
> {
> struct cpuacct *ca;
> + int batch;
>
> if (unlikely(!cpuacct_subsys.active))
> return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct
> rcu_read_lock();
> ca = task_ca(tsk);
>
> + batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> do {
> - percpu_counter_add(&ca->cpustat[idx], val);
> + __percpu_counter_add(&ca->cpustat[idx], val, batch);
> ca = ca->parent;
> } while (ca);
> rcu_read_unlock();
The patch itself trades some accuracy (larger cpu accounting values
stored per cpu) against runtime overhead (the spinlock taken to transfer the
value to the global variable in __percpu_counter_add). Did you
calculate how big the loss in accuracy is?
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
Hi Martin,
> For s390 the jiffies_to_cputime is a compile time constant. No need to
> initialize it at runtime, no?
Indeed it is, I didn't look closely enough. Same with ia64 so no work to
do on either arch :)
> The patch itself trades some accuracy (larger cpu accounting values
> stored per cpu) against runtime overhead (the spinlock taken to transfer the
> value to the global variable in __percpu_counter_add). Did you
> calculate how big the loss in accuracy is?
The idea is we are already batching percpu_counter_batch jiffies, so
with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
cputime.
Anton
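A back-of-the-envelope bound on that accuracy question: each CPU can hold
just under one batch locally before folding into the global count, so a
reader of the global counter lags by at most num_online_cpus() * batch. A
standalone toy calculation, with made-up numbers:

#include <stdio.h>

int main(void)
{
	long nr_cpus = 64;
	long batch_jiffies = 128;	/* hypothetical: max(32, 2 * nr_cpus) */
	long hz = 100;

	/* worst case: every CPU sits just below its batch limit */
	long max_lag = nr_cpus * batch_jiffies;

	printf("max global read lag: %ld jiffies (~%.1f CPU-seconds at HZ=%ld)\n",
	       max_lag, (double)max_lag / hz, hz);
	return 0;
}

The same bound, expressed in jiffies, already applies with
CONFIG_VIRT_CPU_ACCOUNTING disabled, which is the point of scaling the batch.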
On Monday 18 January 2010 03:25 PM, Anton Blanchard wrote:
>
> Hi Martin,
>
>> For s390 the jiffies_to_cputime is a compile time constant. No need to
>> initialize it at runtime, no?
>
> Indeed it is, I didn't look closely enough. Same with ia64 so no work to
> do on either arch :)
>
>> The patch itself trades some accuracy (larger cpu accounting values
>> stored per cpu) against runtime overhead (the spinlock taken to transfer the
>> value to the global variable in __percpu_counter_add). Did you
>> calculate how big the loss in accuracy is?
>
> The idea is we are already batching percpu_counter_batch jiffies, so
> with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
> cputime.
>
This patch worked well for me.
Acked-by: Balbir Singh <[email protected]>
Tested-by: Balbir Singh <[email protected]>
Balbir Singh.
On Monday 18 January 2010 03:25 PM, Anton Blanchard wrote:
>
> Hi Martin,
>
>> For s390 the jiffies_to_cputime is a compile time constant. No need to
>> initialize it at runtime, no?
>
> Indeed it is, I didn't look closely enough. Same with ia64 so no work to
> do on either arch :)
>
>> The patch itself trades some accuracy (larger cpu accounting values
>> stored per cpu) against runtime overhead (the spinlock taken to transfer the
>> value to the global variable in __percpu_counter_add). Did you
>> calculate how big the loss in accuracy is?
>
> The idea is we are already batching percpu_counter_batch jiffies, so
> with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
> cputime.
>
Hi, Peter, Ingo
Could we please pick up the patch for -tip?
--
Three Cheers,
Balbir Singh
On Thu, 2010-01-21 at 20:55 +0530, Balbir Singh wrote:
> Could we please pick up the patch for -tip?
OK, will queue it up if Ingo doesn't beat me to it.
On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <[email protected]> wrote:
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
When one looks at the end result:
: static void cpuacct_update_stats(struct task_struct *tsk,
: enum cpuacct_stat_index idx, cputime_t val)
: {
: struct cpuacct *ca;
: int batch;
:
: if (unlikely(!cpuacct_subsys.active))
: return;
:
: rcu_read_lock();
: ca = task_ca(tsk);
:
: batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
: do {
: __percpu_counter_add(&ca->cpustat[idx], val, batch);
: ca = ca->parent;
: } while (ca);
: rcu_read_unlock();
: }
the code (which used to be quite obvious) becomes pretty unobvious. In
fact it looks quite wrong.
Shouldn't there be a comment there explaining wtf is going on?
> On Mon, 18 Jan 2010 15:41:42 +1100
> Anton Blanchard <[email protected]> wrote:
>
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> > call cpuacct_update_stats with values much larger than percpu_counter_batch.
> > This means the call to percpu_counter_add will always add to the global count
> > which is protected by a spinlock and we end up with a global spinlock in
> > the scheduler.
>
> When one looks at the end result:
We have about a 32 jiffy batch both with and without
CONFIG_VIRT_CPU_ACCOUNTING, so the end user may see the counters lag by a
few jiffies.
On Tuesday 26 January 2010 04:44 AM, Andrew Morton wrote:
> On Mon, 18 Jan 2010 15:41:42 +1100
> Anton Blanchard <[email protected]> wrote:
>
>> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
>> call cpuacct_update_stats with values much larger than percpu_counter_batch.
>> This means the call to percpu_counter_add will always add to the global count
>> which is protected by a spinlock and we end up with a global spinlock in
>> the scheduler.
>
> When one looks at the end result:
>
> : static void cpuacct_update_stats(struct task_struct *tsk,
> : enum cpuacct_stat_index idx, cputime_t val)
> : {
> : struct cpuacct *ca;
> : int batch;
> :
> : if (unlikely(!cpuacct_subsys.active))
> : return;
> :
> : rcu_read_lock();
> : ca = task_ca(tsk);
> :
> : batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> : do {
> : __percpu_counter_add(&ca->cpustat[idx], val, batch);
> : ca = ca->parent;
> : } while (ca);
> : rcu_read_unlock();
> : }
>
> the code (which used to be quite obvious) becomes pretty unobvious. In
> fact it looks quite wrong.
>
> Shouldn't there be a comment there explaining wtf is going on?
Andrew,
I guess a lot of the changelog and comments are in the email history,
but your point on the comment is valid. Why does it look quite wrong to you?
cputime_one_jiffy tells us how many cputime_t units make up one jiffy.
If virtual accounting is enabled this number is quite large; if it is
not, the number is 1. So on systems without virtual accounting the batch
works out to 32, and on systems with virtual accounting it is set to
32*cputime_one_jiffy, so the per cpu counter syncs up roughly once every
32 jiffies either way, assuming cpuacct_update_stats is called once per
jiffy.
If the above comment pleases you, I'll polish it and send it across.
Anton, could you please confirm that what I've said above is indeed correct?
--
Three Cheers,
Balbir Singh
On Tue, 26 Jan 2010 11:47:15 +0530 Balbir Singh <[email protected]> wrote:
> On Tuesday 26 January 2010 04:44 AM, Andrew Morton wrote:
> > On Mon, 18 Jan 2010 15:41:42 +1100
> > Anton Blanchard <[email protected]> wrote:
> >
> >> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> >> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> >> This means the call to percpu_counter_add will always add to the global count
> >> which is protected by a spinlock and we end up with a global spinlock in
> >> the scheduler.
> >
> > When one looks at the end result:
> >
> > : static void cpuacct_update_stats(struct task_struct *tsk,
> > : enum cpuacct_stat_index idx, cputime_t val)
> > : {
> > : struct cpuacct *ca;
> > : int batch;
> > :
> > : if (unlikely(!cpuacct_subsys.active))
> > : return;
> > :
> > : rcu_read_lock();
> > : ca = task_ca(tsk);
> > :
> > : batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> > : do {
> > : __percpu_counter_add(&ca->cpustat[idx], val, batch);
> > : ca = ca->parent;
> > : } while (ca);
> > : rcu_read_unlock();
> > : }
> >
> > the code (which used to be quite obvious) becomes pretty unobvious. In
> > fact it looks quite wrong.
> >
> > Shouldn't there be a comment there explaining wtf is going on?
>
> Andrew,
>
> I guess a lot of the changelog and comments are in the email history,
Not a very useful location for it!
> Why does it look quite wrong to you?
Because it computes the correct value and then if it's larger than
INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!
Does that code actually work, btw? percpu_counter_batch has type `int'
and cputime_one_jiffy has type `int' so their product has type `int'.
So by the time min_t performs its comparison, the upper 32 bits of the
product are already lost.
Hi Andrew,
> > I guess a lot of the changelog and comments are in the email history,
>
> Not a very useful location for it!
Good point, I'll work on a useful comment.
> > Why does it look quite wrong to you?
>
> Because it computes the correct value and then if it's larger than
> INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!
>
> Does that code actually work, btw? percpu_counter_batch has type `int'
> and cputime_one_jiffy has type `int' so their product has type `int'.
> So by the time min_t performs its comparison, the upper 32 bits of the
> product are already lost.
On ppc64, s390 and ia64 cputime_one_jiffy is 64-bit and I want to prevent us
creating too large a batch value:
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
If we pass in something bigger than INT_MAX we could end up with 0 after
truncation, which would turn the percpu counter into a single
spinlock-protected global counter.
Anton
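A standalone model of that cap, assuming a 64-bit long and purely
hypothetical values; because cputime_one_jiffy is 64-bit on these
architectures, the multiply promotes to 64 bits and nothing is lost before
the min_t comparison:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	int percpu_counter_batch = 1024;	/* hypothetical large box */
	long cputime_one_jiffy = 4194304;	/* hypothetical 64-bit cputime */

	/* int * long promotes to long, so the full product survives */
	long product = percpu_counter_batch * cputime_one_jiffy;
	int batch = product < INT_MAX ? (int)product : INT_MAX;

	/* with these values the product is exactly 2^32, so truncating it
	 * to s32 gives 0 on typical machines: a spinlock-only counter */
	printf("product=%ld capped=%d truncated=%d\n",
	       product, batch, (int)product);
	return 0;
}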
Commit-ID: 43f85eab1411905afe5db510fbf9841b516e7e6a
Gitweb: http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
Author: Anton Blanchard <[email protected]>
AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
sched: cpuacct: Use bigger percpu counter batch values for stats counters
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
can call cpuacct_update_stats with values much larger than
percpu_counter_batch. This means the call to percpu_counter_add will
always add to the global count which is protected by a spinlock and we
end up with a global spinlock in the scheduler.
Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would if
CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu
add/remove.
This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.
For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
Tested with:
wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1
Signed-off-by: Anton Blanchard <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <20100118044142.GS12666@kryten>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..8f94138 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch;
if (unlikely(!cpuacct_subsys.active))
return;
@@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
rcu_read_lock();
ca = task_ca(tsk);
+ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();
On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
<[email protected]> wrote:
> Commit-ID:  43f85eab1411905afe5db510fbf9841b516e7e6a
> Gitweb:     http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> Author:     Anton Blanchard <[email protected]>
> AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> Committer:  Ingo Molnar <[email protected]>
> CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
>
> sched: cpuacct: Use bigger percpu counter batch values for stats counters
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> can call cpuacct_update_stats with values much larger than
> percpu_counter_batch. This means the call to percpu_counter_add will
> always add to the global count which is protected by a spinlock and we
> end up with a global spinlock in the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would if
> CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu
> add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled:         16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled:             61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch:          16663217 ctx switches/sec
>
> Tested with:
>
>  wget http://ozlabs.org/~anton/junkcode/context_switch.c
>  make context_switch
>  for i in `seq 0 63`; do taskset -c $i ./context_switch & done
>  vmstat 1
>
> Signed-off-by: Anton Blanchard <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> LKML-Reference: <20100118044142.GS12666@kryten>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>  kernel/sched.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..8f94138 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
>                 enum cpuacct_stat_index idx, cputime_t val)
>  {
>         struct cpuacct *ca;
> +       int batch;
>
>         if (unlikely(!cpuacct_subsys.active))
>                 return;
> @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
>         rcu_read_lock();
>         ca = task_ca(tsk);
>
> +       batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
>         do {
> -               percpu_counter_add(&ca->cpustat[idx], val);
> +               __percpu_counter_add(&ca->cpustat[idx], val, batch);
>                 ca = ca->parent;
>         } while (ca);
>         rcu_read_unlock();
IIRC, Andrew picked up this patch as well and applied some checkpatch
fixes too..
Balbir
On Wed, 27 Jan 2010 19:04:55 +0530
Balbir Singh <[email protected]> wrote:
> On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
> <[email protected]> wrote:
> > Commit-ID:  43f85eab1411905afe5db510fbf9841b516e7e6a
> > Gitweb:     http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> > Author:     Anton Blanchard <[email protected]>
> > AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> > Committer:  Ingo Molnar <[email protected]>
> > CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
> >
> > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> >
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> > can call cpuacct_update_stats with values much larger than
> > percpu_counter_batch. This means the call to percpu_counter_add will
> > always add to the global count which is protected by a spinlock and we
> > end up with a global spinlock in the scheduler.
> >
> > Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> > cputime_one_jiffy such that we have the same batch limit as we would if
> > CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> > but that initialisation happened too early on PowerPC (before time_init)
> > and it was never updated at runtime as a result of a hotplug cpu
> > add/remove.
> >
> > This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> > runtime, which keeps the batch correct even after cpu hotplug operations.
> > We cap it at INT_MAX in case of overflow.
> >
> > For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> > cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> > min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> > least on x86 and PowerPC. So there is no need to add an #ifdef.
> >
> > On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> > faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
> >
> > CONFIG_CGROUP_CPUACCT disabled:         16906698 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT enabled:             61720 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT + patch:          16663217 ctx switches/sec
> >
> > Tested with:
> >
> >  wget http://ozlabs.org/~anton/junkcode/context_switch.c
> >  make context_switch
> >  for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> >  vmstat 1
> >
> > Signed-off-by: Anton Blanchard <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > LKML-Reference: <20100118044142.GS12666@kryten>
> > Signed-off-by: Ingo Molnar <[email protected]>
> > ---
> >  kernel/sched.c |    4 +++-
> >  1 files changed, 3 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..8f94138 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> >                 enum cpuacct_stat_index idx, cputime_t val)
> >  {
> >         struct cpuacct *ca;
> > +       int batch;
> >
> >         if (unlikely(!cpuacct_subsys.active))
> >                 return;
> > @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> >         rcu_read_lock();
> >         ca = task_ca(tsk);
> >
> > +       batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> >         do {
> > -               percpu_counter_add(&ca->cpustat[idx], val);
> > +               __percpu_counter_add(&ca->cpustat[idx], val, batch);
> >                 ca = ca->parent;
> >         } while (ca);
> >         rcu_read_unlock();
^^ your email client inexplicably fills emails with 0xa0
> IIRC, Andrew picked up this patch as well and applied some checkpatch
> fixes too..
No, I have no changes.
Last I heard, Anton was "working on a useful comment" and will be
redoing the patch.
On Wed, 2010-01-27 at 13:22 -0800, Andrew Morton wrote:
> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.
If Anton is working on a new version, would Anton then also care to post
one that compiles on UP? :-)
>> >                 ca = ca->parent;
>> >         } while (ca);
>> >         rcu_read_unlock();
>
> ^^ your email client inexplicably fills emails with 0xa0
>
I am going to be changing my email client soon, back to the good old
text client.
>> IIRC, Andrew picked up this patch as well and applied some checkpatch
>> fixes too..
>
> No, I have no changes.
>
Sorry, I was confused; the changes you had were for getdelays.c
> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.
Yep, let's wait on him.
Three Cheers,
Balbir
Even though batch isn't used on UP, we may want to pass one in to keep the
SMP and UP code paths similar. Convert __percpu_counter_add to an inline
function so we won't get unused-variable warnings if we do.
Signed-off-by: Anton Blanchard <[email protected]>
---
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index a7684a5..794662b 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -98,9 +98,6 @@ static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
fbc->count = amount;
}
-#define __percpu_counter_add(fbc, amount, batch) \
- percpu_counter_add(fbc, amount)
-
static inline void
percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
@@ -109,6 +106,12 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
preempt_enable();
}
+static inline void
+__percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+ percpu_counter_add(fbc, amount);
+}
+
static inline s64 percpu_counter_read(struct percpu_counter *fbc)
{
return fbc->count;
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.
Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.
This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.
For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.
On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec
Tested with:
wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1
Signed-off-by: Anton Blanchard <[email protected]>
---
v2: Added a comment and fixed the UP build.
Index: linux-cpumask/kernel/sched.c
===================================================================
--- linux-cpumask.orig/kernel/sched.c 2010-01-22 18:23:43.377212514 +1100
+++ linux-cpumask/kernel/sched.c 2010-01-28 23:24:02.677233753 +1100
@@ -10885,12 +10885,30 @@ static void cpuacct_charge(struct task_s
}
/*
+ * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
+ * in cputime_t units. As a result, cpuacct_update_stats calls
+ * percpu_counter_add with values large enough to always overflow the
+ * per cpu batch limit causing bad SMP scalability.
+ *
+ * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
+ * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
+ * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
+ */
+#ifdef CONFIG_SMP
+#define CPUACCT_BATCH \
+ min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
+#else
+#define CPUACCT_BATCH 0
+#endif
+
+/*
* Charge the system/user time to the task's accounting group.
*/
static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val)
{
struct cpuacct *ca;
+ int batch = CPUACCT_BATCH;
if (unlikely(!cpuacct_subsys.active))
return;
@@ -10899,7 +10917,7 @@ static void cpuacct_update_stats(struct
ca = task_ca(tsk);
do {
- percpu_counter_add(&ca->cpustat[idx], val);
+ __percpu_counter_add(&ca->cpustat[idx], val, batch);
ca = ca->parent;
} while (ca);
rcu_read_unlock();