2009-03-02 14:56:38

by Peter Zijlstra

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

On Fri, 2009-02-27 at 12:22 +0900, KAMEZAWA Hiroyuki wrote:

Comments below..

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> cgroup/cpuacct subsystem counts cpu usage by 64bit coutnter in
> per-cpu object. In read-side (via cpuacct.usage file), for reading 64bit
> value in safe manner, it takes rq->lock of (other) cpus.
>
> In general, taking rq->lock of other cpus from codes not for scheduler
> is not good. This patch tries to remove rq->lock used in read-side.
>
> To read 64bit value in safe, this patch uses seqcounter.
>
> Pros.
> - rq->lock is not necessary.
> Cons.
> - When updating counter, sequence number must be updated.
> (I hope this per-cpu sequence number is on cache...)
> - not simple.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> kernel/sched.c | 141 ++++++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 105 insertions(+), 36 deletions(-)
>
> Index: mmotm-2.6.29-Feb24/kernel/sched.c
> ===================================================================
> --- mmotm-2.6.29-Feb24.orig/kernel/sched.c
> +++ mmotm-2.6.29-Feb24/kernel/sched.c
> @@ -9581,6 +9581,67 @@ struct cgroup_subsys cpu_cgroup_subsys =
>
> #ifdef CONFIG_CGROUP_CPUACCT
>
> +#ifndef CONFIG_64BIT
> +DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
> +
> +static inline void cpuacct_start_counter_update(void)
> +{
> + /* This is called under rq->lock and IRQ is off */
> + struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> +
> + write_seqcount_begin(s);
> + put_cpu_var(cpuacct_cgroup_seq);
> +}
> +
> +static inline void cpuacct_end_counter_update(void)
> +{
> + struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> +
> + write_seqcount_end(s);
> + put_cpu_var(cpuacct_cgroup_seq);
> +}

It seems odd we disable/enable preemption in both, I would expect for
start to disable preemption, and have end enable it again, or use
__get_cpu_var() and assume preemption is already disabled (callsites are
under rq->lock, right?)
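
Concretely, the __get_cpu_var() variant would be just the following (a
sketch assuming the callers already hold rq->lock with IRQs off, which is
what the v2 patch later in this thread ends up doing):

static inline void cpuacct_start_counter_update(void)
{
	/* caller holds rq->lock, IRQs off, so preemption is already disabled */
	struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);

	write_seqcount_begin(s);
}

static inline void cpuacct_end_counter_update(void)
{
	struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);

	write_seqcount_end(s);
}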

> +static inline u64
> +cpuacct_read_counter(u64 *val, int cpu)
> +{
> + struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> + unsigned int seq;
> + u64 data;
> +
> + do {
> + seq = read_seqcount_begin(s);
> + data = *val;
> + } while (read_seqcount_retry(s, seq));
> + return data;
> +}
> +/* This is a special funtion called against "offline" cpus. */
> +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> +{
> + struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> +
> + preempt_disable();
> + write_seqcount_begin(s);
> + *val = 0;
> + write_seqcount_end(s);
> + preempt_enable();
> +}

And here you double disable preemption, quite useless if you take a
remote cpu's per-cpu data.

> +#else
> +static inline void cpuacct_start_counter_update(void)
> +{
> +}
> +static inline void cpuacct_end_counter_update(void)
> +{
> +}
> +static inline u64 cpuacct_read_counter(u64 *val, int cpu)
> +{
> + return *val;
> +}
> +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> +{
> + *val = 0;
> +}
> +#endif
> +
> /*
> * CPU accounting code for task groups.
> *
> @@ -9596,6 +9657,11 @@ struct cpuacct {
> struct cpuacct *parent;
> };
>
> +struct cpuacct_work {
> + struct work_struct work;
> + struct cpuacct *cpuacct;
> +};
> +
> struct cgroup_subsys cpuacct_subsys;
>
> /* return cpu accounting group corresponding to this container */
> @@ -9643,39 +9709,29 @@ cpuacct_destroy(struct cgroup_subsys *ss
> kfree(ca);
> }
>
> +/* In 32bit enviroment, seqcounter is used for reading 64bit in safe way */
> static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
> {
> u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> u64 data;
>
> -#ifndef CONFIG_64BIT
> - /*
> - * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> - */
> - spin_lock_irq(&cpu_rq(cpu)->lock);
> - data = *cpuusage;
> - spin_unlock_irq(&cpu_rq(cpu)->lock);
> -#else
> - data = *cpuusage;
> -#endif
> + data = cpuacct_read_counter(cpuusage, cpu);
>
> return data;
> }
>
> -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
> +/* called by per-cpu workqueue */
> +static void cpuacct_cpuusage_reset_cpu(struct work_struct *work)
> {
> + struct cpuacct_work *cw = container_of(work, struct cpuacct_work, work);
> + struct cpuacct *ca = cw->cpuacct;
> + int cpu = get_cpu();
> u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
>
> -#ifndef CONFIG_64BIT
> - /*
> - * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> - */
> - spin_lock_irq(&cpu_rq(cpu)->lock);
> - *cpuusage = val;
> - spin_unlock_irq(&cpu_rq(cpu)->lock);
> -#else
> - *cpuusage = val;
> -#endif
> + cpuacct_start_counter_update();
> + *cpuusage = 0;
> + cpuacct_end_counter_update();
> + put_cpu();
> }
>
> /* return total cpu usage (in nanoseconds) of a group */
> @@ -9691,23 +9747,34 @@ static u64 cpuusage_read(struct cgroup *
> return totalcpuusage;
> }
>
> -static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> - u64 reset)
> +static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
> {
> struct cpuacct *ca = cgroup_ca(cgrp);
> - int err = 0;
> - int i;
> -
> - if (reset) {
> - err = -EINVAL;
> - goto out;
> + int cpu;
> + /*
> + * Reset All counters....doesn't need to be fast.
> + * "ca" will be stable while doing this. We are in write() syscall.
> + */
> + get_online_cpus();
> + /*
> + * Because we use alloc_percpu() for allocating counter, we have
> + * a counter per a possible cpu. Reset all online's by workqueue and
> + * reset offline cpu's directly.
> + */
> + for_each_possible_cpu(cpu) {
> + if (cpu_online(cpu)) {
> + struct cpuacct_work cw;
> + INIT_WORK(&cw.work, cpuacct_cpuusage_reset_cpu);
> + cw.cpuacct = ca;
> + schedule_work_on(cpu, &cw.work);
> + flush_work(&cw.work);
> + } else {
> + u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> + cpuacct_reset_offline_counter(cpuusage, cpu);
> + }

I'm not particularly convinced this is the right way, schedule_work_on()
sounds way expensive for setting a variable to 0.

Furthermore, if you want something like schedule_work_on() for each cpu,
there's schedule_on_each_cpu().
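
For reference, its declaration (roughly, from include/linux/workqueue.h)
takes only the work function, with no way to pass private data:

/* runs func on each online cpu via a per-cpu work item and waits for completion */
int schedule_on_each_cpu(work_func_t func);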

> }
> -
> - for_each_present_cpu(i)
> - cpuacct_cpuusage_write(ca, i, 0);
> -
> -out:
> - return err;
> + put_online_cpus();
> + return 0;
> }
>
> static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
> @@ -9729,7 +9796,7 @@ static struct cftype files[] = {
> {
> .name = "usage",
> .read_u64 = cpuusage_read,
> - .write_u64 = cpuusage_write,
> + .trigger = cpuacct_cpuusage_reset,
> },
> {
> .name = "usage_percpu",
> @@ -9756,10 +9823,12 @@ static void cpuacct_charge(struct task_s
> cpu = task_cpu(tsk);
> ca = task_ca(tsk);
>
> + cpuacct_start_counter_update();
> for (; ca; ca = ca->parent) {
> u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> *cpuusage += cputime;
> }
> + cpuacct_end_counter_update();
> }
>
> struct cgroup_subsys cpuacct_subsys = {
>


2009-03-02 23:43:48

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

On Mon, 02 Mar 2009 15:56:10 +0100
Peter Zijlstra <[email protected]> wrote:

> On Fri, 2009-02-27 at 12:22 +0900, KAMEZAWA Hiroyuki wrote:
>
> Comments below..
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > cgroup/cpuacct subsystem counts cpu usage by 64bit coutnter in
> > per-cpu object. In read-side (via cpuacct.usage file), for reading 64bit
> > value in safe manner, it takes rq->lock of (other) cpus.
> >
> > In general, taking rq->lock of other cpus from codes not for scheduler
> > is not good. This patch tries to remove rq->lock used in read-side.
> >
> > To read 64bit value in safe, this patch uses seqcounter.
> >
> > Pros.
> > - rq->lock is not necessary.
> > Cons.
> > - When updating counter, sequence number must be updated.
> > (I hope this per-cpu sequence number is on cache...)
> > - not simple.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > kernel/sched.c | 141 ++++++++++++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 105 insertions(+), 36 deletions(-)
> >
> > Index: mmotm-2.6.29-Feb24/kernel/sched.c
> > ===================================================================
> > --- mmotm-2.6.29-Feb24.orig/kernel/sched.c
> > +++ mmotm-2.6.29-Feb24/kernel/sched.c
> > @@ -9581,6 +9581,67 @@ struct cgroup_subsys cpu_cgroup_subsys =
> >
> > #ifdef CONFIG_CGROUP_CPUACCT
> >
> > +#ifndef CONFIG_64BIT
> > +DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
> > +
> > +static inline void cpuacct_start_counter_update(void)
> > +{
> > + /* This is called under rq->lock and IRQ is off */
> > + struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> > +
> > + write_seqcount_begin(s);
> > + put_cpu_var(cpuacct_cgroup_seq);
> > +}
> > +
> > +static inline void cpuacct_end_counter_update(void)
> > +{
> > + struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> > +
> > + write_seqcount_end(s);
> > + put_cpu_var(cpuacct_cgroup_seq);
> > +}
>
> It seems odd we disable/enable preemption in both, I would expect for
> start to disable preemption, and have end enable it again, or use
> __get_cpu_var() and assume preemption is already disabled (callsites are
> under rq->lock, right?)
>
Yes, the callers are under rq->lock... OK, you're right.
I'll remove the preemption-disabling code.


> > +static inline u64
> > +cpuacct_read_counter(u64 *val, int cpu)
> > +{
> > + struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> > + unsigned int seq;
> > + u64 data;
> > +
> > + do {
> > + seq = read_seqcount_begin(s);
> > + data = *val;
> > + } while (read_seqcount_retry(s, seq));
> > + return data;
> > +}
> > +/* This is a special funtion called against "offline" cpus. */
> > +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> > +{
> > + struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> > +
> > + preempt_disable();
> > + write_seqcount_begin(s);
> > + *val = 0;
> > + write_seqcount_end(s);
> > + preempt_enable();
> > +}
>
> And here you double disable preemption, quite useless if you take a
> remote cpu's per-cpu data.
>
Ok, maybe this "reset" from a user should be under rq->lock.
(But reading will not be under rq->lock.)


> > +#else
> > +static inline void cpuacct_start_counter_update(void)
> > +{
> > +}
> > +static inline void cpuacct_end_counter_update(void)
> > +{
> > +}
> > +static inline u64 cpuacct_read_counter(u64 *val, int cpu)
> > +{
> > + return *val;
> > +}
> > +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> > +{
> > + *val = 0;
> > +}
> > +#endif
> > +
> > /*
> > * CPU accounting code for task groups.
> > *
> > @@ -9596,6 +9657,11 @@ struct cpuacct {
> > struct cpuacct *parent;
> > };
> >
> > +struct cpuacct_work {
> > + struct work_struct work;
> > + struct cpuacct *cpuacct;
> > +};
> > +
> > struct cgroup_subsys cpuacct_subsys;
> >
> > /* return cpu accounting group corresponding to this container */
> > @@ -9643,39 +9709,29 @@ cpuacct_destroy(struct cgroup_subsys *ss
> > kfree(ca);
> > }
> >
> > +/* In 32bit enviroment, seqcounter is used for reading 64bit in safe way */
> > static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
> > {
> > u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> > u64 data;
> >
> > -#ifndef CONFIG_64BIT
> > - /*
> > - * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> > - */
> > - spin_lock_irq(&cpu_rq(cpu)->lock);
> > - data = *cpuusage;
> > - spin_unlock_irq(&cpu_rq(cpu)->lock);
> > -#else
> > - data = *cpuusage;
> > -#endif
> > + data = cpuacct_read_counter(cpuusage, cpu);
> >
> > return data;
> > }
> >
> > -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
> > +/* called by per-cpu workqueue */
> > +static void cpuacct_cpuusage_reset_cpu(struct work_struct *work)
> > {
> > + struct cpuacct_work *cw = container_of(work, struct cpuacct_work, work);
> > + struct cpuacct *ca = cw->cpuacct;
> > + int cpu = get_cpu();
> > u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> >
> > -#ifndef CONFIG_64BIT
> > - /*
> > - * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> > - */
> > - spin_lock_irq(&cpu_rq(cpu)->lock);
> > - *cpuusage = val;
> > - spin_unlock_irq(&cpu_rq(cpu)->lock);
> > -#else
> > - *cpuusage = val;
> > -#endif
> > + cpuacct_start_counter_update();
> > + *cpuusage = 0;
> > + cpuacct_end_counter_update();
> > + put_cpu();
> > }
> >
> > /* return total cpu usage (in nanoseconds) of a group */
> > @@ -9691,23 +9747,34 @@ static u64 cpuusage_read(struct cgroup *
> > return totalcpuusage;
> > }
> >
> > -static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> > - u64 reset)
> > +static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
> > {
> > struct cpuacct *ca = cgroup_ca(cgrp);
> > - int err = 0;
> > - int i;
> > -
> > - if (reset) {
> > - err = -EINVAL;
> > - goto out;
> > + int cpu;
> > + /*
> > + * Reset All counters....doesn't need to be fast.
> > + * "ca" will be stable while doing this. We are in write() syscall.
> > + */
> > + get_online_cpus();
> > + /*
> > + * Because we use alloc_percpu() for allocating counter, we have
> > + * a counter per a possible cpu. Reset all online's by workqueue and
> > + * reset offline cpu's directly.
> > + */
> > + for_each_possible_cpu(cpu) {
> > + if (cpu_online(cpu)) {
> > + struct cpuacct_work cw;
> > + INIT_WORK(&cw.work, cpuacct_cpuusage_reset_cpu);
> > + cw.cpuacct = ca;
> > + schedule_work_on(cpu, &cw.work);
> > + flush_work(&cw.work);
> > + } else {
> > + u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> > + cpuacct_reset_offline_counter(cpuusage, cpu);
> > + }
>
> I'm not particularly convinced this is the right way, schedule_work_on()
> sounds way expensive for setting a variable to 0.
>
yes, yes..

> Furthermore, if you want something like schedule_work_on() for each cpu,
> there's schedule_on_each_cpu().
>
It can't pass arguments...Maybe I should use rq->lock here to reset
other cpu's value.

Thank you for review.
-Kame

2009-03-03 07:52:04

by Peter Zijlstra

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:

> > Furthermore, if you want something like schedule_work_on() for each cpu,
> > there's schedule_on_each_cpu().
> >
> It can't pass arguments...Maybe I should use rq->lock here to reset
> other cpu's value.

Why bother with serializing the reset code at all?

But there's always smp_call_function() I suppose.
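
A rough sketch of that cross-call approach (assuming a per-cpu reset
handler along the lines of what the v2 patch below introduces;
on_each_cpu() is smp_call_function() plus the local cpu):

static void cpuacct_cpuusage_reset_cpu(void *data)
{
	struct cpuacct *ca = data;
	u64 *cpuusage = percpu_ptr(ca->cpuusage, smp_processor_id());

	*cpuusage = 0;
}

	/* reset every online cpu's counter, waiting for completion */
	on_each_cpu(cpuacct_cpuusage_reset_cpu, ca, 1);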

2009-03-03 09:05:07

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

Peter Zijlstra wrote:
> On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
>
>> > Furthermore, if you want something like schedule_work_on() for each
>> cpu,
>> > there's schedule_on_each_cpu().
>> >
>> It can't pass arguments...Maybe I should use rq->lock here to reset
>> other cpu's value.
>
> Why bother with serializing the reset code at all?
>
I don't think reset v.s. read is problem but reset v.s. increment
(read-modify-write) can't be ?

> But there's always smp_call_function() I suppose.
>

Hmm, I'll find some simpler way.

Thanks,
-Kame


2009-03-03 09:41:19

by Peter Zijlstra

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
> >
> >> > Furthermore, if you want something like schedule_work_on() for each
> >> cpu,
> >> > there's schedule_on_each_cpu().
> >> >
> >> It can't pass arguments...Maybe I should use rq->lock here to reset
> >> other cpu's value.
> >
> > Why bother with serializing the reset code at all?
> >
> I don't think reset v.s. read is problem but reset v.s. increment
> (read-modify-write) can't be ?

Sure, can be, do we care?

2009-03-03 10:42:42

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

Peter Zijlstra wrote:
> On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
>> Peter Zijlstra wrote:
>> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
>> >
>> >> > Furthermore, if you want something like schedule_work_on() for each
>> >> cpu,
>> >> > there's schedule_on_each_cpu().
>> >> >
>> >> It can't pass arguments...Maybe I should use rq->lock here to reset
>> >> other cpu's value.
>> >
>> > Why bother with serializing the reset code at all?
>> >
>> I don't think reset v.s. read is problem but reset v.s. increment
>> (read-modify-write) can't be ?
>
> Sure, can be, do we care?
>
If small/easy code allows us to declare "there are any racy case!
and you don't have to check whether you successfully reseted",
it's worth to do I think.

Thanks,
-Kame

2009-03-03 10:44:21

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
>> On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
>>> Peter Zijlstra wrote:
>>> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
>>> >
>>> >> > Furthermore, if you want something like schedule_work_on() for
>>> each
>>> >> cpu,
>>> >> > there's schedule_on_each_cpu().
>>> >> >
>>> >> It can't pass arguments...Maybe I should use rq->lock here to reset
>>> >> other cpu's value.
>>> >
>>> > Why bother with serializing the reset code at all?
>>> >
>>> I don't think reset v.s. read is problem but reset v.s. increment
>>> (read-modify-write) can't be ?
>>
>> Sure, can be, do we care?
>>
> If small/easy code allows us to declare "there are any racy case!
isn't...
Sorry.
-Kame

2009-03-03 11:55:26

by Peter Zijlstra

Subject: Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction

On Tue, 2009-03-03 at 19:42 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
> >> Peter Zijlstra wrote:
> >> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
> >> >
> >> >> > Furthermore, if you want something like schedule_work_on() for each
> >> >> cpu,
> >> >> > there's schedule_on_each_cpu().
> >> >> >
> >> >> It can't pass arguments...Maybe I should use rq->lock here to reset
> >> >> other cpu's value.
> >> >
> >> > Why bother with serializing the reset code at all?
> >> >
> >> I don't think reset v.s. read is problem but reset v.s. increment
> >> (read-modify-write) can't be ?
> >
> > Sure, can be, do we care?
> >
> If small/easy code allows us to declare "there are any racy case!
> and you don't have to check whether you successfully reseted",
> it's worth to do I think.

smp_call_function() it is...

2009-03-04 06:34:28

by Kamezawa Hiroyuki

Subject: [PATCH] remove rq->lock from cpuacct cgroup v2

From: KAMEZAWA Hiroyuki <[email protected]>

The cgroup/cpuacct subsystem counts cpu usage with a 64bit counter in a
per-cpu object. On the read side (via the cpuacct.usage file), it takes
the rq->lock of (other) cpus in order to read the 64bit value safely.

In general, taking the rq->lock of other cpus from code outside the
scheduler is not good. This patch removes the rq->lock from the read side.

To read the 64bit value atomically, this patch uses a seqcount.

Pros.
- rq->lock is not necessary.
Cons.
- When updating the counter, the sequence number must be updated.
  (I hope this per-cpu sequence number stays in cache...)

Changelog: v1->v2
- checked the calling context of all calls and avoided unnecessary
  preempt_disable calls.
- use on_each_cpu() instead of a workqueue for reset.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Index: mmotm-2.6.29-Mar3/kernel/sched.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/kernel/sched.c
+++ mmotm-2.6.29-Mar3/kernel/sched.c
@@ -9581,6 +9581,71 @@ struct cgroup_subsys cpu_cgroup_subsys =

#ifdef CONFIG_CGROUP_CPUACCT

+#ifndef CONFIG_64BIT
+/* seqcount to handle the 64bit counters on 32bit systems */
+DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
+
+/*
+ * Counter updates happen while rq->lock is held, so we don't need to
+ * disable preemption explicitly.
+ */
+static inline void cpuacct_start_counter_update(void)
+{
+ /* This is called under rq->lock and IRQ is off */
+ struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);
+
+ write_seqcount_begin(s);
+}
+
+static inline void cpuacct_end_counter_update(void)
+{
+ struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);
+
+ write_seqcount_end(s);
+}
+
+static inline u64
+cpuacct_read_counter(u64 *val, int cpu)
+{
+ struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
+ unsigned int seq;
+ u64 data;
+
+ do {
+ seq = read_seqcount_begin(s);
+ data = *val;
+ } while (read_seqcount_retry(s, seq));
+ return data;
+}
+/* This is a special function, called only for "offline" cpus. */
+static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
+{
+ struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
+
+ write_seqcount_begin(s);
+ *val = 0;
+ write_seqcount_end(s);
+}
+#else
+static inline void cpuacct_start_counter_update(void)
+{
+}
+
+static inline void cpuacct_end_counter_update(void)
+{
+}
+
+static inline u64 cpuacct_read_counter(u64 *val, int cpu)
+{
+ return *val;
+}
+
+static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
+{
+ *val = 0;
+}
+#endif
+
/*
* CPU accounting code for task groups.
*
@@ -9643,39 +9708,27 @@ cpuacct_destroy(struct cgroup_subsys *ss
kfree(ca);
}

+/* In a 32bit environment, a seqcount is used to read the 64bit value safely */
static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
{
u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
u64 data;

-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
- */
- spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
- spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- data = *cpuusage;
-#endif
+ data = cpuacct_read_counter(cpuusage, cpu);

return data;
}

-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+/* called by per-cpu smp call function (in non-preemptable context) */
+static void cpuacct_cpuusage_reset_cpu(void *data)
{
+ int cpu = smp_processor_id();
+ struct cpuacct *ca = data;
u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);

-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
- */
- spin_lock_irq(&cpu_rq(cpu)->lock);
- *cpuusage = val;
- spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
-#endif
+ cpuacct_start_counter_update();
+ *cpuusage = 0;
+ cpuacct_end_counter_update();
}

/* return total cpu usage (in nanoseconds) of a group */
@@ -9691,23 +9744,30 @@ static u64 cpuusage_read(struct cgroup *
return totalcpuusage;
}

-static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
- u64 reset)
+static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
{
struct cpuacct *ca = cgroup_ca(cgrp);
- int err = 0;
- int i;
+ int cpu;
+ /*
+ * We prevent cpu hotplug while we do reset.
+ */
+ get_online_cpus();
+ /*
+ * Clear all online cpus' counters (including the local one).
+ * This reset uses a nowait smp call, so the counters are cleared
+ * asynchronously.
+ */
+ on_each_cpu(cpuacct_cpuusage_reset_cpu, ca, 0);

- if (reset) {
- err = -EINVAL;
- goto out;
+ /* clear counters of possible but currently offline cpus */
+ for_each_possible_cpu(cpu) {
+ if (!cpu_online(cpu)) {
+ u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
+ cpuacct_reset_offline_counter(cpuusage, cpu);
+ }
}
-
- for_each_present_cpu(i)
- cpuacct_cpuusage_write(ca, i, 0);
-
-out:
- return err;
+ put_online_cpus();
+ return 0;
}

static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
@@ -9729,7 +9789,7 @@ static struct cftype files[] = {
{
.name = "usage",
.read_u64 = cpuusage_read,
- .write_u64 = cpuusage_write,
+ .trigger = cpuacct_cpuusage_reset,
},
{
.name = "usage_percpu",
@@ -9759,10 +9819,12 @@ static void cpuacct_charge(struct task_s
cpu = task_cpu(tsk);
ca = task_ca(tsk);

+ cpuacct_start_counter_update();
for (; ca; ca = ca->parent) {
u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
*cpuusage += cputime;
}
+ cpuacct_end_counter_update();
}

struct cgroup_subsys cpuacct_subsys = {

2009-03-04 07:54:58

by Bharata B Rao

Subject: Re: [PATCH] remove rq->lock from cpuacct cgroup v2

On Wed, Mar 4, 2009 at 12:02 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> cgroup/cpuacct subsystem counts cpu usage by 64bit coutnter in
> per-cpu object. In read-side (via cpuacct.usage file), for reading 64bit
> value in safe manner, it takes rq->lock of (other) cpus.
>
> In general, taking rq->lock of other cpus from codes not for scheduler
> is not good. This patch tries to remove rq->lock in read-side.
>
> To read 64bit value in atomic, this patch uses seqcounter.
>
> Pros.
> >  - rq->lock is not necessary.
> > Cons.
> >  - When updating counter, sequence number must be updated.
> >    (I hope this per-cpu sequence number is on cache...)
> >
> > Changelog: v1->v2
> >  - checking calling context of all calls and avoid unnecessary
> >    preempt_disable calls.
> >  - use on_each_cpu() instead of workqueue, at reset
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

So cpuacct->cpuusage is a 64 bit percpu counter and I see that cpuacct
subsystem itself handles the following special cases:
- 32 vs 64 bit update issues
- resetting percpu counters on online and offline cpus

Tomorrow if I add other counters to cpuacct subsystem (like stime and
utime), I need to do what you have done in this patch all over again
for the additional counters.

Instead of subsystems handling all these percpu counter problems
themselves, shouldn't we be using the percpu_counter subsystem and let
it handle all the issues transparently for us? I am not sure whether all
these problems have been addressed in percpu_counter, but I would like
to know why we are not using percpu_counter for these kinds of things,
and to enhance percpu_counter if it can't handle some of the issues
that we are solving here specifically for the cpuacct subsystem.
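
For comparison, a hypothetical cpuacct built on percpu_counter would look
roughly like this (just a sketch of the existing API, not code from the
patch):

#include <linux/percpu_counter.h>

static struct percpu_counter usage;	/* one per cpuacct group */

static int __init percpu_counter_sketch(void)
{
	percpu_counter_init(&usage, 0);		/* at cgroup creation */
	percpu_counter_add(&usage, 1000);	/* what cpuacct_charge() would do */
	/* accurate (slower) sum across cpus, e.g. for cpuacct.usage reads */
	printk(KERN_INFO "usage=%lld\n",
	       (long long)percpu_counter_sum(&usage));
	percpu_counter_set(&usage, 0);		/* what a reset would do */
	percpu_counter_destroy(&usage);
	return 0;
}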

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2009-03-04 08:21:37

by Kamezawa Hiroyuki

Subject: Re: [PATCH] remove rq->lock from cpuacct cgroup v2

On Wed, 4 Mar 2009 13:24:43 +0530
Bharata B Rao <[email protected]> wrote:

> On Wed, Mar 4, 2009 at 12:02 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > cgroup/cpuacct subsystem counts cpu usage by 64bit coutnter in
> > per-cpu object. In read-side (via cpuacct.usage file), for reading 64bit
> > value in safe manner, it takes rq->lock of (other) cpus.
> >
> > In general, taking rq->lock of other cpus from codes not for scheduler
> > is not good. This patch tries to remove rq->lock in read-side.
> >
> > To read 64bit value in atomic, this patch uses seqcounter.
> >
> > Pros.
> >  - rq->lock is not necessary.
> > Cons.
> >  - When updating counter, sequence number must be updated.
> >    (I hope this per-cpu sequence number is on cache...)
> >
> > Changelog: v1->v2
> >  - checking calling context of all calls and avoid unnecessary
> >   preempt_disable calls.
> >  - use on_each_cpu() instead of workqueue, at reset
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> So cpuacct->cpuusage is a 64 bit percpu counter and I see that cpuacct
> subsystem itself handles the following special cases:
> - 32 vs 64 bit update issues
> - resetting percpu counters on online and offline cpus
>
> Tomorrow if I add other counters to cpuacct subsystem (like stime and
> utime), I need to do what you have done in this patch all over again
> for the additional counters.
>
I'm not sure whether the answer is yes or no. Do they need to be per-cpu?
IIUC, stime/utime updates happen at most once per tick and are not as
frequent as cpuacct updates.

> Instead of subsystems handling all these percpu counter problems
> themselves, shouldn't we be using percpu_counter subsytem and let it
> handle all the issues transparently for us ? I am not sure if all
> these problems have been addressed in percpu_counter, but would like
> to know why we are not using percpu_counter for these kinds of things
> and enhance percpu_counter if it can't handle some of the issues which
> we are solving here specifically for cpuacct subsystem ?
>
At first glance, a generic per-cpu counter sounds interesting, but to be
honest, some special handling is used for cpuacct based on its characteristics.

- Writer works under non-preemptable context.
- There is only one writer.

Otherwise, using a generic atomic64_t... or res_counter would be good... maybe.
(res_counter updates its 64bit value under a spinlock.)

Thanks,
-Kame




> Regards,
> Bharata.
> --
> http://bharata.sulekha.com/blog/posts.htm
>

2009-03-04 08:48:07

by Kamezawa Hiroyuki

Subject: Re: [PATCH] remove rq->lock from cpuacct cgroup v2

On Wed, 4 Mar 2009 17:20:05 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 4 Mar 2009 13:24:43 +0530
> Bharata B Rao <[email protected]> wrote:
> At first, generic per-cpu counter sounds interesting but to be honest,
> some special handling is used for cpuacct based on its characteristic.
>
> - Writer works under non-preemptable context.
> - There is only one writer.
>
If utime/stime updates work in the above context, using the same code would be good.

I don't use anything specific to the cpuacct structure in these routines...
If you want me to rewrite it, I will; please tell me what you want.

Thanks,
-Kame

2009-03-04 10:35:29

by Bharata B Rao

Subject: Re: [PATCH] remove rq->lock from cpuacct cgroup v2

On Wed, Mar 4, 2009 at 2:16 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 4 Mar 2009 17:20:05 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
>> On Wed, 4 Mar 2009 13:24:43 +0530
>> Bharata B Rao <[email protected]> wrote:
>> At first, generic per-cpu counter sounds interesting but to be honest,
>> some special handling is used for cpuacct based on its characteristic.
>>
>>  - Writer works under non-preemptable context.
>>  - There is only one writer.
>>
> If utime/stime updates works on above context, using the same code will be good.

IIUC, utime/stime updates also work under the above conditions.

>
> I don't use any cpuacct structure specific in routines...
> If you want me to rewrite it, I'll do. please request what you want.

After looking deeper into your patch, I think I could use the same seq
counter you introduced to update stime/utime as well.
I guess I could use most of your code, except there is a slight
difference wrt the preemption-disabled assumption in the write path:
cpuusage updates happen under rq->lock but stime/utime updates don't.
So I probably can't use cpuacct_start/end_counter_update as is.
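
A variant that handles preemption itself, along the lines Peter suggested
earlier (start disables preemption, end re-enables it), might look like
this (just a sketch; the _preemptible names are made up):

static inline void cpuacct_start_counter_update_preemptible(void)
{
	/* get_cpu_var() disables preemption until the matching put_cpu_var() */
	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);

	write_seqcount_begin(s);
}

static inline void cpuacct_end_counter_update_preemptible(void)
{
	struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);

	write_seqcount_end(s);
	put_cpu_var(cpuacct_cgroup_seq);
}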

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2009-03-04 12:11:25

by Bharata B Rao

Subject: Re: [PATCH] remove rq->lock from cpuacct cgroup v2

On Wed, Mar 4, 2009 at 1:50 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 4 Mar 2009 13:24:43 +0530
> Bharata B Rao <[email protected]> wrote:
>
>> Instead of subsystems handling all these percpu counter problems
>> themselves, shouldn't we be using percpu_counter subsytem and let it
>> handle all the issues transparently for us ? I am not sure if all
>> these problems have been addressed in percpu_counter, but would like
>> to know why we are not using percpu_counter for these kinds of things
>> and enhance percpu_counter if it can't handle some of the issues which
>> we are solving here specifically for cpuacct subsystem ?
>>
> At first, generic per-cpu counter sounds interesting but to be honest,
> some special handling is used for cpuacct based on its characteristic.
>

Just trying to understand this clearly ...

>  - Writer works under non-preemptable context.

Which means the writer is running with preemption disabled.
percpu_counter writes (via percpu_counter_add) don't assume anything
and disable preemption themselves. Is this the problem or is there a
bigger issue here why percpu_counter can't be used ?

>  - There is only one writer.

Not sure how you have optimized for this case in cpuacct.
percpu_counters use spinlocks to serialize writers. Are you saying
using spinlocks for this 1 writer case is too much ? Also note that
they update 32 bit percpu counters without any lock and take spinlocks
only when they do a batch update to the 64bit counter.
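
Paraphrasing lib/percpu_counter.c from memory, the add path is roughly the
following, with the spinlock taken only when the per-cpu delta is folded
into the shared 64bit count:

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	s32 *pcount;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= batch || count <= -batch) {
		/* fold the per-cpu delta into the shared 64bit count */
		spin_lock(&fbc->lock);
		fbc->count += count;
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		*pcount = count;
	}
	put_cpu();
}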

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm

2009-03-04 14:17:19

by Kamezawa Hiroyuki

Subject: Re: [PATCH] remove rq->lock from cpuacct cgroup v2

Bharata B Rao wrote:
> On Wed, Mar 4, 2009 at 1:50 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> On Wed, 4 Mar 2009 13:24:43 +0530
>> Bharata B Rao <[email protected]> wrote:
>>
>>> Instead of subsystems handling all these percpu counter problems
>>> themselves, shouldn't we be using percpu_counter subsytem and let it
>>> handle all the issues transparently for us ? I am not sure if all
>>> these problems have been addressed in percpu_counter, but would like
>>> to know why we are not using percpu_counter for these kinds of things
>>> and enhance percpu_counter if it can't handle some of the issues which
>>> we are solving here specifically for cpuacct subsystem ?
>>>
>> At first, generic per-cpu counter sounds interesting but to be honest,
>> some special handling is used for cpuacct based on its characteristic.
>>
>
> Just trying to understand this clearly ...
>
>  - Writer works under non-preemptable context.
>
> Which means the writer is running with preemption disabled.
> percpu_counter writes (via percpu_counter_add) don't assume anything
> and disable preemption themselves. Is this the problem or is there a
> bigger issue here why percpu_counter can't be used ?
>
Ah, I just mean that it doesn't call preempt_disable() explicitly, and that
it uses __get_cpu_var() (not get_cpu_var()) and smp_processor_id() (not get_cpu()).


>  - There is only one writer.
>
> Not sure how you have optimized for this case in cpuacct.
Because there is only one writer, I used a seqcount, not a seqlock.
(There are no atomic ops this way.)
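
(For context, roughly from include/linux/seqlock.h: a seqlock_t bundles a
spinlock to serialize writers, while a bare seqcount_t does not, which is
why a single writer can get away with the lighter seqcount.)

typedef struct {
	unsigned sequence;
	spinlock_t lock;
} seqlock_t;

typedef struct seqcount {
	unsigned sequence;
} seqcount_t;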

> percpu_counters use spinlocks to serialize writers. Are you saying
> using spinlocks for this 1 writer case is too much ?
I'm sorry, I've only now noticed that lib/percpu_counter exists...
I think lib/percpu_counter does the proper job.

> Also note that they update 32 bit percpu counters without any lock
> and take spinlocks only when they do a batch update to the 64bit counter.
>
Hmm, cpuacct does a hierarchical update of several counters at once.
So some private code like mine is not so bad, I think.

What I think now is that it's OK to take your patch first and do this
kind of lock-avoidance update later if necessary. This patch is a
micro-optimization and is not in a hurry.

Thanks,
-Kame