cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
rcupreempt is used.
cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
This can race with the task's movement between cgroups. This race
can cause an access to a freed ca pointer in cpuacct_charge(). This will not
happen with classic RCU or tree RCU, as cpuacct_charge() is called with
preemption disabled. However, if rcupreempt is used, the race still exists.
Thanks to Li Zefan for explaining this.
Fix this race by explicitly protecting ca and the hierarchy walk with
rcu_read_lock().
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9891,6 +9891,13 @@ static void cpuacct_charge(struct task_s
return;
cpu = task_cpu(tsk);
+
+ /*
+ * preemption is already disabled here, but to be safe with
+ * rcupreempt, take rcu_read_lock(). This protects ca and
+ * hence the hierarchy walk.
+ */
+ rcu_read_lock();
ca = task_ca(tsk);
do {
@@ -9898,6 +9905,7 @@ static void cpuacct_charge(struct task_s
*cpuusage += cputime;
ca = ca->parent;
} while (ca);
+ rcu_read_unlock();
}
struct cgroup_subsys cpuacct_subsys = {
Bharata B Rao wrote:
> cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> rcupreempt is used.
>
> cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> This can race with the task's movement between cgroups. This race
> can cause an access to a freed ca pointer in cpuacct_charge(). This will not
Actually it can also end up accessing an invalid tsk->cgroups. ;)
get tsk->cgroups (cg)
(move tsk to another cgroup) or (tsk exiting)
-> kfree(tsk->cgroups)
get cg->subsys[..]
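In code terms, the window looks roughly like this (a paraphrased sketch
of what task_ca() expands to, not verbatim kernel source):

  struct css_set *cg = tsk->cgroups;    /* step 1: get tsk->cgroups (cg) */

  /*
   * step 2: tsk is moved to another cgroup, or exits, and the old
   * css_set is kfree()d after a grace period. With rcupreempt,
   * disabled preemption alone does not hold that grace period off.
   */

  struct cgroup_subsys_state *css = cg->subsys[cpuacct_subsys_id];
                                        /* step 3: use-after-free */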
> happen with classic RCU or tree RCU, as cpuacct_charge() is called with
> preemption disabled. However, if rcupreempt is used, the race still exists.
> Thanks to Li Zefan for explaining this.
>
> Fix this race by explicitly protecting ca and the hierarchy walk with
> rcu_read_lock().
>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> kernel/sched.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9891,6 +9891,13 @@ static void cpuacct_charge(struct task_s
> return;
>
> cpu = task_cpu(tsk);
> +
> + /*
> + * preemption is already disabled here, but to be safe with
> + * rcupreempt, take rcu_read_lock(). This protects ca and
> + * hence the hierarchy walk.
> + */
> + rcu_read_lock();
> ca = task_ca(tsk);
>
> do {
> @@ -9898,6 +9905,7 @@ static void cpuacct_charge(struct task_s
> *cpuusage += cputime;
> ca = ca->parent;
> } while (ca);
> + rcu_read_unlock();
> }
>
> struct cgroup_subsys cpuacct_subsys = {
>
>
On Tue, Mar 17, 2009 at 02:28:11PM +0800, Li Zefan wrote:
> Bharata B Rao wrote:
> > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > rcupreempt is used.
> >
> > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > This can race with the task's movement between cgroups. This race
> > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
>
> Actually it can also end up accessing an invalid tsk->cgroups. ;)
>
> get tsk->cgroups (cg)
> (move tsk to another cgroup) or (tsk exiting)
> -> kfree(tsk->cgroups)
> get cg->subsys[..]
Ok :) Here is the patch again with updated description.
cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
rcupreempt is used.
cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
This can race with the task's movement between cgroups. This race
can cause an access to a freed ca pointer in cpuacct_charge(), or an access
to an invalid cgroups pointer of the task. This will not happen with classic
RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
explaining this.
Fix this race by explicitly protecting ca and the hierarchy walk with
rcu_read_lock().
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9891,6 +9891,13 @@ static void cpuacct_charge(struct task_s
return;
cpu = task_cpu(tsk);
+
+ /*
+ * preemption is already disabled here, but to be safe with
+ * rcupreempt, take rcu_read_lock(). This protects ca and
+ * hence the hierarchy walk.
+ */
+ rcu_read_lock();
ca = task_ca(tsk);
do {
@@ -9898,6 +9905,7 @@ static void cpuacct_charge(struct task_s
*cpuusage += cputime;
ca = ca->parent;
} while (ca);
+ rcu_read_unlock();
}
struct cgroup_subsys cpuacct_subsys = {
* Li Zefan <[email protected]> [2009-03-17 14:28:11]:
> Bharata B Rao wrote:
> > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > rcupreempt is used.
> >
> > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > This can race with the task's movement between cgroups. This race
> > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
>
> Actually it can also end up accessing an invalid tsk->cgroups. ;)
>
> get tsk->cgroups (cg)
> (move tsk to another cgroup) or (tsk exiting)
> -> kfree(tsk->cgroups)
> get cg->subsys[..]
>
That problem should only occur if we dereference tsk->cgroups
separately and then use that to dereference cg->subsys. Since we use
task_subsys_state(), which is RCU safe, we should be OK.
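For reference, task_subsys_state() at that time is essentially this
(paraphrased sketch of the include/linux/cgroup.h helper):

  static inline struct cgroup_subsys_state *task_subsys_state(
          struct task_struct *task, int subsys_id)
  {
          return rcu_dereference(task->cgroups->subsys[subsys_id]);
  }

i.e. a single rcu_dereference() of the subsys pointer, with no separate
dereference of a cached tsk->cgroups.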
--
Balbir
* Bharata B Rao <[email protected]> [2009-03-17 13:06:49]:
> On Tue, Mar 17, 2009 at 02:28:11PM +0800, Li Zefan wrote:
> > Bharata B Rao wrote:
> > > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > > rcupreempt is used.
> > >
> > > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > > This can race with the task's movement between cgroups. This race
> > > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
> >
> > Actually it can also end up accessing an invalid tsk->cgroups. ;)
> >
> > get tsk->cgroups (cg)
> > (move tsk to another cgroup) or (tsk exiting)
> > -> kfree(tsk->cgroups)
> > get cg->subsys[..]
>
> Ok :) Here is the patch again with updated description.
>
> cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> rcupreempt is used.
>
> cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> This can race with the task's movement between cgroups. This race
> can cause an access to a freed ca pointer in cpuacct_charge(), or an access
> to an invalid cgroups pointer of the task. This will not happen with classic
> RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
> However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
> explaining this.
>
> Fix this race by explicitly protecting ca and the hierarchy walk with
> rcu_read_lock().
>
Looks good and works very well (except for the batch issue that you
pointed out: it can take up to the batch value before updates are seen).
I'd like to get the patches in -tip and see the results. I would
recommend using percpu_counter_sum() while reading the data as an
enhancement to this patch. If user space does not overwhelm us with a lot
of reads, sum would work out better.
Tested-by: Balbir Singh <[email protected]>
Acked-by: Balbir Singh <[email protected]>
--
Balbir
On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
> I'd like to get the patches in -tip and see the results. I would
> recommend using percpu_counter_sum() while reading the data as an
> enhancement to this patch. If user space does not overwhelm us with a lot
> of reads, sum would work out better.
You trust userspace? I'd rather not.
* Peter Zijlstra <[email protected]> [2009-03-17 14:26:01]:
> On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
>
> > I'd like to get the patches in -tip and see the results. I would
> > recommend using percpu_counter_sum() while reading the data as an
> > enhancement to this patch. If user space does not overwhelm us with a lot
> > of reads, sum would work out better.
>
> You trust userspace? I'd rather not.
>
Fair enough. A badly written application monitor can frequently read
this data and cause horrible performance issues. On the other hand, a
large number of CPUs can make the lag even worse. Is it time yet for
percpu_counter batch numbers? I've tested this patch and the results
were not badly off the mark.
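For context, the batching in question; a paraphrased sketch of the
2.6.29-era __percpu_counter_add() from lib/percpu_counter.c (details
from memory, so treat it as approximate):

  void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
  {
          s64 count;
          s32 *pcount;
          int cpu = get_cpu();

          pcount = per_cpu_ptr(fbc->counters, cpu);
          count = *pcount + amount;
          if (count >= batch || count <= -batch) {
                  /* fold the local delta into the global count */
                  spin_lock(&fbc->lock);
                  fbc->count += count;
                  *pcount = 0;
                  spin_unlock(&fbc->lock);
          } else {
                  *pcount = count;
          }
          put_cpu();
  }

Per-cpu deltas reach fbc->count only once they cross +/-batch, so a
plain percpu_counter_read() can lag by up to roughly batch * nr_cpus.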
--
Balbir
On Tue, 2009-03-17 at 19:29 +0530, Balbir Singh wrote:
> * Peter Zijlstra <[email protected]> [2009-03-17 14:26:01]:
>
> > On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
> >
> > > I'd like to get the patches in -tip and see the results. I would
> > > recommend using percpu_counter_sum() while reading the data as an
> > > enhancement to this patch. If user space does not overwhelm us with a lot
> > > of reads, sum would work out better.
> >
> > You trust userspace? I'd rather not.
> >
>
> Fair enough. A badly written application monitor can frequently read
> this data and cause horrible performance issues. On the other hand, a
> large number of CPUs can make the lag even worse. Is it time yet for
> percpu_counter batch numbers? I've tested this patch and the results
> were not badly off the mark.
I'd rather err on the side of caution here; you might get some crazy
folks depending on it and then expecting us to maintain it.
On Tue, 17 Mar 2009 14:26:01 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
>
> > I'd like to get the patches in -tip and see the results. I would
> > recommend using percpu_counter_sum() while reading the data as an
> > enhancement to this patch. If user space does not overwhelm us with a lot
> > of reads, sum would work out better.
>
> You trust userspace? I'd rather not.
>
hehe, I don't trust it, either.
BTW, does anyone want the "remove rq->lock" patch?
If yes, I'll dig the patch out of my graveyard.
Thanks,
-Kame
Balbir Singh wrote:
> * Li Zefan <[email protected]> [2009-03-17 14:28:11]:
>
>> Bharata B Rao wrote:
>>> cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
>>> rcupreempt is used.
>>>
>>> cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
>>> This can race with the task's movement between cgroups. This race
>>> can cause an access to a freed ca pointer in cpuacct_charge(). This will not
>> Actually it can also end up accessing an invalid tsk->cgroups. ;)
>>
>> get tsk->cgroups (cg)
>> (move tsk to another cgroup) or (tsk exiting)
>> -> kfree(tsk->cgroups)
>> get cg->subsys[..]
>>
>
> That problem should only occur if we dereference tsk->cgroups
> separately and then use that to dereference cg->subsys. Since we use
Do you mean tsk->cgroups->subsys is safe, and
cg = tsk->cgroups; ...; cg->subsys is unsafe?
This is wrong.
> task_subsys_state(), which is RCU safe, we should be OK.
>
Yes, it's RCU safe, which means it's unsafe without rcu_read_lock/unlock.
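In other words, the required pattern around the helper is (sketch):

  struct cpuacct *ca;

  rcu_read_lock();        /* this is what makes the rcu_dereference() safe */
  ca = task_ca(tsk);      /* i.e. task_subsys_state(tsk, cpuacct_subsys_id) */
  /* ... walk ca, ca->parent, ... */
  rcu_read_unlock();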
* Li Zefan <[email protected]> [2009-03-18 09:40:44]:
> Balbir Singh wrote:
> > * Li Zefan <[email protected]> [2009-03-17 14:28:11]:
> >
> >> Bharata B Rao wrote:
> >>> cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> >>> rcupreempt is used.
> >>>
> >>> cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> >>> This can race with the task's movement between cgroups. This race
> >>> can cause an access to a freed ca pointer in cpuacct_charge(). This will not
> >> Actually it can also end up accessing an invalid tsk->cgroups. ;)
> >>
> >> get tsk->cgroups (cg)
> >> (move tsk to another cgroup) or (tsk exiting)
> >> -> kfree(tsk->cgroups)
> >> get cg->subsys[..]
> >>
> >
> > That problem should only occur if we dereference tsk->cgroups
> > separately and then use that to dereference cg->subsys. Since we use
>
> Do you mean tsk->cgroups->subsys is safe, and
> cg = tsk->cgroups; ...; cg->subsys is unsafe?
> This is wrong.
Without rcu_read_lock/unlock they are unsafe; even with the lock, we
need to use rcu_dereference() to make sure we get consistent values.
>
> > task_subsys_state(), which is RCU safe, we should be OK.
> >
>
> Yes, it's RCU safe, which means it's unsafe without rcu_read_lock/unlock.
>
Yes, I meant under rcu_read_lock/unlock.
--
Balbir
On Tue, Mar 17, 2009 at 06:42:51PM +0530, Balbir Singh wrote:
> * Bharata B Rao <[email protected]> [2009-03-17 13:06:49]:
>
> > On Tue, Mar 17, 2009 at 02:28:11PM +0800, Li Zefan wrote:
> > > Bharata B Rao wrote:
> > > > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > > > rcupreempt is used.
> > > >
> > > > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > > > This can race with the task's movement between cgroups. This race
> > > > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
> > >
> > > Actually it can also end up accessing an invalid tsk->cgroups. ;)
> > >
> > > get tsk->cgroups (cg)
> > > (move tsk to another cgroup) or (tsk exiting)
> > > -> kfree(tsk->cgroups)
> > > get cg->subsys[..]
> >
> > Ok :) Here is the patch again with updated description.
> >
> > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > rcupreempt is used.
> >
> > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > This can race with the task's movement between cgroups. This race
> > can cause an access to a freed ca pointer in cpuacct_charge(), or an access
> > to an invalid cgroups pointer of the task. This will not happen with classic
> > RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
> > However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
> > explaining this.
> >
> > Fix this race by explicitly protecting ca and the hierarchy walk with
> > rcu_read_lock().
> >
>
> Looks good and works very well (except for the batch issue that you
> pointed out: it can take up to the batch value before updates are seen).
>
> I'd like to get the patches in -tip and see the results. I would
> recommend using percpu_counter_sum() while reading the data as an
> enhancement to this patch. If user space does not overwhelm us with a lot
> of reads, sum would work out better.
>
>
> Tested-by: Balbir Singh <[email protected]>
> Acked-by: Balbir Singh <[email protected]>
So I guess this ack is not for this patch but for the per-cgroup
stime/utime cpuacct controller statistics patch.
Regards,
Bharata.
On Tue, Mar 17, 2009 at 03:04:46PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-03-17 at 19:29 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <[email protected]> [2009-03-17 14:26:01]:
> >
> > > On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
> > >
> > > > I'd like to get the patches in -tip and see the results. I would
> > > > recommend using percpu_counter_sum() while reading the data as an
> > > > enhancement to this patch. If user space does not overwhelm us with a lot
> > > > of reads, sum would work out better.
> > >
> > > You trust userspace? I'd rather not.
> > >
> >
> > Fair enough. A badly written application monitor can frequently read
> > this data and cause horrible performance issues. On the other hand, a
> > large number of CPUs can make the lag even worse. Is it time yet for
> > percpu_counter batch numbers? I've tested this patch and the results
> > were not badly off the mark.
>
> I'd rather err on the side of caution here; you might get some crazy
> folks depending on it and then expecting us to maintain it.
So if we want to be cautious, we could use percpu_counter_sum() as
Balbir suggested. That would address both the issues with percpu_counter
that I pointed out earlier:
- Readers are serialized with writers, so we get consistent/correct
values during reads.
- It negates the effect of batching, so reads always get updated/current
values.
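A sketch of the difference between the two readers (percpu_counter API
as of 2.6.29; the cpustat array is from the pending cpuacct stime/utime
patch, so treat that part as an assumption):

  /* fast but stale: returns fbc->count without folding in per-cpu
   * deltas that are still below the batch threshold */
  s64 approx = percpu_counter_read(&ca->cpustat[CPUACCT_STAT_USER]);

  /* slow but current: takes fbc->lock and adds in every CPU's delta,
   * serializing the reader against writers; O(nr_cpus) per call */
  s64 exact = percpu_counter_sum(&ca->cpustat[CPUACCT_STAT_USER]);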
Regards,
Bharata.
On Wed, 18 Mar 2009 08:55:58 +0530
Bharata B Rao <[email protected]> wrote:
> On Tue, Mar 17, 2009 at 03:04:46PM +0100, Peter Zijlstra wrote:
> > On Tue, 2009-03-17 at 19:29 +0530, Balbir Singh wrote:
> > > * Peter Zijlstra <[email protected]> [2009-03-17 14:26:01]:
> > >
> > > > On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
> > > >
> > > > > I'd like to get the patches in -tip and see the results. I would
> > > > > recommend using percpu_counter_sum() while reading the data as an
> > > > > enhancement to this patch. If user space does not overwhelm us with a lot
> > > > > of reads, sum would work out better.
> > > >
> > > > You trust userspace? I'd rather not.
> > > >
> > >
> > > Fair enough. A badly written application monitor can frequently read
> > > this data and cause horrible performance issues. On the other hand, a
> > > large number of CPUs can make the lag even worse. Is it time yet for
> > > percpu_counter batch numbers? I've tested this patch and the results
> > > were not badly off the mark.
> >
> > I'd rather err on the side of caution here; you might get some crazy
> > folks depending on it and then expecting us to maintain it.
>
> So if we want to be cautious, we could use percpu_counter_sum() as
> Balbir suggested. That would address both the issues with percpu_counter
> that I pointed out earlier:
>
> - Readers are serialized with writers, so we get consistent/correct
> values during reads.
> - It negates the effect of batching, so reads always get updated/current
> values.
>
Is this wrong?
==
-- CONFIG_32BIT
static inline s64 percpu_counter_read_slow(struct percpu_counter *fbc)
{
        s64 val;
retry:
        val = fbc->count;
        smp_mb();
        /* wait_spin_unlock() is a hypothetical helper: spin until
         * fbc->lock is released */
        wait_spin_unlock(&fbc->lock);
        if (fbc->count < val)   /* raced with a writer; re-read */
                goto retry;
        return val;
}
==
Thanks,
-Kame
> Regards,
> Bharata.
On Wed, Mar 18, 2009 at 12:54:34PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Mar 2009 08:55:58 +0530
> Bharata B Rao <[email protected]> wrote:
>
> > On Tue, Mar 17, 2009 at 03:04:46PM +0100, Peter Zijlstra wrote:
> > > On Tue, 2009-03-17 at 19:29 +0530, Balbir Singh wrote:
> > > > * Peter Zijlstra <[email protected]> [2009-03-17 14:26:01]:
> > > >
> > > > > On Tue, 2009-03-17 at 18:42 +0530, Balbir Singh wrote:
> > > > >
> > > > > > I'd like to get the patches in -tip and see the results. I would
> > > > > > recommend using percpu_counter_sum() while reading the data as an
> > > > > > enhancement to this patch. If user space does not overwhelm us with a lot
> > > > > > of reads, sum would work out better.
> > > > >
> > > > > You trust userspace? I'd rather not.
> > > > >
> > > >
> > > > Fair enough. A badly written application monitor can frequently read
> > > > this data and cause horrible performance issues. On the other hand, a
> > > > large number of CPUs can make the lag even worse. Is it time yet for
> > > > percpu_counter batch numbers? I've tested this patch and the results
> > > > were not badly off the mark.
> > >
> > > I'd rather err on the side of caution here; you might get some crazy
> > > folks depending on it and then expecting us to maintain it.
> >
> > So if we want to be cautious, we could use percpu_counter_sum() as
> > Balbir suggested. That would address both the issues with percpu_counter
> > that I pointed out earlier:
> >
> > - Readers are serialized with writers, so we get consistent/correct
> > values during reads.
> > - It negates the effect of batching, so reads always get updated/current
> > values.
> >
>
> Is this wrong?
> ==
> -- CONFIG_32BIT
> static inline s64 percpu_counter_read_slow(struct percpu_counter *fbc)
> {
>         s64 val;
> retry:
>         val = fbc->count;
>         smp_mb();
>         /* wait_spin_unlock() is a hypothetical helper: spin until
>          * fbc->lock is released */
>         wait_spin_unlock(&fbc->lock);
>         if (fbc->count < val)   /* raced with a writer; re-read */
>                 goto retry;
>         return val;
> }
> ==
Looks ok to me, but will wait for experts' comments.
However, I did a quick measurement of read times with percpu_counter_read()
(no readside lock) and percpu_counter_sum() (readside lock) and I don't
see a major slowdown with percpu_counter_sum().
Time taken for 100 reads of cpuacct.stat with a 1s delay between reads:
percpu_counter_read() - 9845 us
percpu_counter_sum() - 9974 us
This is on an 8-CPU system.
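(A hypothetical reconstruction of the measurement loop; the mount point,
buffer size and exact methodology are assumptions, not the code actually
used:)

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/time.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[256];
          struct timeval t0, t1;
          long total_us = 0;
          int i;

          for (i = 0; i < 100; i++) {
                  int fd = open("/cgroup/cpuacct.stat", O_RDONLY);
                  if (fd < 0)
                          return 1;
                  gettimeofday(&t0, NULL);
                  read(fd, buf, sizeof(buf));     /* only the read is timed */
                  gettimeofday(&t1, NULL);
                  close(fd);
                  total_us += (t1.tv_sec - t0.tv_sec) * 1000000L +
                              (t1.tv_usec - t0.tv_usec);
                  sleep(1);
          }
          printf("total read time: %ld us\n", total_us);
          return 0;
  }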
Regards,
Bharata.
On Wed, 18 Mar 2009 10:18:01 +0530
Bharata B Rao <[email protected]> wrote:
> Looks ok to me, but will wait for experts' comments.
>
> However, I did a quick measurement of read times with percpu_counter_read()
> (no readside lock) and percpu_counter_sum() (readside lock) and I don't
> see a major slowdown with percpu_counter_sum().
>
> Time taken for 100 reads of cpuacct.stat with a 1s delay between reads:
> percpu_counter_read() - 9845 us
> percpu_counter_sum() - 9974 us
>
Then, almost 1 us of overhead per read()... Hmm, seems big for a counter.
Thanks,
-Kame
On Wed, Mar 18, 2009 at 04:08:27PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Mar 2009 10:18:01 +0530
> Bharata B Rao <[email protected]> wrote:
> > Looks ok to me, but will wait for experts' comments.
> >
> > However, I did a quick measurement of read times with percpu_counter_read()
> > (no readside lock) and percpu_counter_sum() (readside lock) and I don't
> > see a major slowdown with percpu_counter_sum().
> >
> > Time taken for 100 reads of cpuacct.stat with a 1s delay between reads:
> > percpu_counter_read() - 9845 us
> > percpu_counter_sum() - 9974 us
> >
> Then, almost 1 us of overhead per read()... Hmm, seems big for a counter.
Well, some cost for a correct and accurate counter :)
BTW, I did a few more iterations and I don't see consistent numbers.
The results from 7 runs look like this:
percpu_counter_read() - 11325, 11549, 5939, 9999, 7129, 7758, 11385 us
percpu_counter_sum() - 8655, 9201, 8705, 11766, 10619, 9186, 8890 us
Regards,
Bharata.
* Bharata B Rao <[email protected]> [2009-03-18 08:48:32]:
> On Tue, Mar 17, 2009 at 06:42:51PM +0530, Balbir Singh wrote:
> > * Bharata B Rao <[email protected]> [2009-03-17 13:06:49]:
> >
> > > On Tue, Mar 17, 2009 at 02:28:11PM +0800, Li Zefan wrote:
> > > > Bharata B Rao wrote:
> > > > > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > > > > rcupreempt is used.
> > > > >
> > > > > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > > > > This can race with the task's movement between cgroups. This race
> > > > > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
> > > >
> > > > Actually it can also end up accessing an invalid tsk->cgroups. ;)
> > > >
> > > > get tsk->cgroups (cg)
> > > > (move tsk to another cgroup) or (tsk exiting)
> > > > -> kfree(tsk->cgroups)
> > > > get cg->subsys[..]
> > >
> > > Ok :) Here is the patch again with updated description.
> > >
> > > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > > rcupreempt is used.
> > >
> > > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > > This can race with the task's movement between cgroups. This race
> > > can cause an access to a freed ca pointer in cpuacct_charge(), or an access
> > > to an invalid cgroups pointer of the task. This will not happen with classic
> > > RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
> > > However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
> > > explaining this.
> > >
> > > Fix this race by explicitly protecting ca and the hierarchy walk with
> > > rcu_read_lock().
> > >
> >
> > Looks good and works very well (except for the batch issue that you
> > pointed out: it can take up to the batch value before updates are seen).
> >
> > I'd like to get the patches in -tip and see the results. I would
> > recommend using percpu_counter_sum() while reading the data as an
> > enhancement to this patch. If user space does not overwhelm us with a lot
> > of reads, sum would work out better.
> >
> >
> > Tested-by: Balbir Singh <[email protected]>
> > Acked-by: Balbir Singh <[email protected]>
>
> So I guess this ack is not for this patch but for the per-cgroup
> stime/utime cpuacct controller statistics patch.
>
Yes, for both these patches actually. Thanks for pointing it out
though.
--
Balbir
On Tue, 2009-03-17 at 13:06 +0530, Bharata B Rao wrote:
> On Tue, Mar 17, 2009 at 02:28:11PM +0800, Li Zefan wrote:
> > Bharata B Rao wrote:
> > > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > > rcupreempt is used.
> > >
> > > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > > This can race with the task's movement between cgroups. This race
> > > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
> >
> > Actually it can also end up accessing an invalid tsk->cgroups. ;)
> >
> > get tsk->cgroups (cg)
> > (move tsk to another cgroup) or (tsk exiting)
> > -> kfree(tsk->cgroups)
> > get cg->subsys[..]
>
> Ok :) Here is the patch again with updated description.
>
> cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> rcupreempt is used.
>
> cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> This can race with the task's movement between cgroups. This race
> can cause an access to a freed ca pointer in cpuacct_charge(), or an access
> to an invalid cgroups pointer of the task. This will not happen with classic
> RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
> However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
> explaining this.
>
> Fix this race by explicitly protecting ca and the hierarchy walk with
> rcu_read_lock().
>
> Signed-off-by: Bharata B Rao <[email protected]>
I would ditch the comment; it doesn't add anything.
The simple rule is: if you want RCU-safe, use rcu_read_lock().
preempt/irq disable isn't sufficient -- hasn't been for a long long
while.
After that,
Acked-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9891,6 +9891,13 @@ static void cpuacct_charge(struct task_s
> return;
>
> cpu = task_cpu(tsk);
> +
> + /*
> + * preemption is already disabled here, but to be safe with
> + * rcupreempt, take rcu_read_lock(). This protects ca and
> + * hence the hierarchy walk.
> + */
> + rcu_read_lock();
> ca = task_ca(tsk);
>
> do {
> @@ -9898,6 +9905,7 @@ static void cpuacct_charge(struct task_s
> *cpuusage += cputime;
> ca = ca->parent;
> } while (ca);
> + rcu_read_unlock();
> }
>
> struct cgroup_subsys cpuacct_subsys = {
On Thu, Mar 19, 2009 at 10:20:21AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-03-17 at 13:06 +0530, Bharata B Rao wrote:
> > On Tue, Mar 17, 2009 at 02:28:11PM +0800, Li Zefan wrote:
> > > Bharata B Rao wrote:
> > > > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > > > rcupreempt is used.
> > > >
> > > > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > > > This can race with the task's movement between cgroups. This race
> > > > can cause an access to a freed ca pointer in cpuacct_charge(). This will not
> > >
> > > Actually it can also end up accessing an invalid tsk->cgroups. ;)
> > >
> > > get tsk->cgroups (cg)
> > > (move tsk to another cgroup) or (tsk exiting)
> > > -> kfree(tsk->cgroups)
> > > get cg->subsys[..]
> >
> > Ok :) Here is the patch again with updated description.
> >
> > cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
> > rcupreempt is used.
> >
> > cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
> > This can race with the task's movement between cgroups. This race
> > can cause an access to a freed ca pointer in cpuacct_charge(), or an access
> > to an invalid cgroups pointer of the task. This will not happen with classic
> > RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
> > However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
> > explaining this.
> >
> > Fix this race by explicitly protecting ca and the hierarchy walk with
> > rcu_read_lock().
> >
> > Signed-off-by: Bharata B Rao <[email protected]>
>
> I would ditch the comment; it doesn't add anything.
>
> The simple rule is: if you want RCU-safe, use rcu_read_lock().
> preempt/irq disable isn't sufficient -- hasn't been for a long long
> while.
>
> After that,
>
> Acked-by: Peter Zijlstra <[email protected]>
>
Ok. Removed the comment. Here is the updated patch.
cpuacct: Make cpuacct hierarchy walk in cpuacct_charge() safe when
rcupreempt is used.
cpuacct_charge() obtains the task's ca and does a hierarchy walk upwards.
This can race with the task's movement between cgroups. This race
can cause an access to a freed ca pointer in cpuacct_charge(), or an access
to an invalid cgroups pointer of the task. This will not happen with classic
RCU or tree RCU, as cpuacct_charge() is called with preemption disabled.
However, if rcupreempt is used, the race is seen. Thanks to Li Zefan for
explaining this.
Fix this race by explicitly protecting ca and the hierarchy walk with
rcu_read_lock().
Signed-off-by: Bharata B Rao <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Tested-by: Balbir Singh <[email protected]>
---
kernel/sched.c | 3 +++
1 file changed, 3 insertions(+)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9894,6 +9894,8 @@ static void cpuacct_charge(struct task_s
return;
cpu = task_cpu(tsk);
+
+ rcu_read_lock();
ca = task_ca(tsk);
do {
@@ -9901,6 +9903,7 @@ static void cpuacct_charge(struct task_s
*cpuusage += cputime;
ca = ca->parent;
} while (ca);
+ rcu_read_unlock();
}
struct cgroup_subsys cpuacct_subsys = {
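For reference, the resulting function reconstructed from the hunks above
(the active-subsys guard and the cpuusage line are elided by the diff
context and are assumptions based on the 2.6.29 source):

  static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
  {
          struct cpuacct *ca;
          int cpu;

          if (unlikely(!cpuacct_subsys.active))
                  return;

          cpu = task_cpu(tsk);

          rcu_read_lock();
          ca = task_ca(tsk);

          do {
                  u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
                  *cpuusage += cputime;
                  ca = ca->parent;
          } while (ca);
          rcu_read_unlock();
  }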