2012-10-02 11:42:57

by Jiri Olsa

[permalink] [raw]
Subject: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

forgot to CC Stephane
jirka


---
hi,
I'm getting following warning when running basic cgroup perf stuff:
(perf stat -a -e faults -G krava -- sleep 10)

WARNING: at kernel/events/core.c:397 perf_cgroup_switch+0x1c6/0x1e0()
Hardware name: Montevina platform
Modules linked in:
Pid: 1173, comm: bash Not tainted 3.6.0+ #129
Call Trace:
[<ffffffff8103736f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff810dc025>] ? perf_ctx_lock+0x15/0x30
[<ffffffff810373ca>] warn_slowpath_null+0x1a/0x20
[<ffffffff810e0b96>] perf_cgroup_switch+0x1c6/0x1e0
[<ffffffff810e09d0>] ? perf_event_context_sched_in+0xc0/0xc0
[<ffffffff810e0fa9>] __perf_event_task_sched_in+0xa9/0x200
[<ffffffff81089c67>] ? lock_release_holdtime.part.3+0xc7/0x160
[<ffffffff810689f8>] finish_task_switch+0xb8/0xf0
[<ffffffff814fab2b>] __schedule+0x2eb/0x930
[<ffffffff814fca10>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff8108fd75>] ? trace_hardirqs_on_caller+0x105/0x190
[<ffffffff8108fe0d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff814fca1b>] ? _raw_spin_unlock_irq+0x3b/0x60
[<ffffffff81055238>] ? start_flush_work+0x108/0x180
[<ffffffff814fb289>] schedule+0x29/0x70
[<ffffffff814f8725>] schedule_timeout+0x1c5/0x210
[<ffffffff8105cecd>] ? add_wait_queue+0x4d/0x60
[<ffffffff814fc9c5>] ? _raw_spin_unlock_irqrestore+0x65/0x80
[<ffffffff8108fd75>] ? trace_hardirqs_on_caller+0x105/0x190
[<ffffffff8108fe0d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff814fc9a2>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[<ffffffff812b5151>] n_tty_read+0x461/0x8b0
[<ffffffff8106d040>] ? try_to_wake_up+0x310/0x310
[<ffffffff812af2b9>] tty_read+0x99/0xf0
[<ffffffff81128fff>] vfs_read+0xaf/0x180
[<ffffffff8112911d>] sys_read+0x4d/0x90
[<ffffffff81504412>] system_call_fastpath+0x16/0x1b


Looking at the code the cpuctx->cgrp condition seems legal,
and need just some adjustment.

However, I don't fully understand that code and could be
wrong.. just want to get rid of probably wrong warning.. ;)

thanks,
jirka


---
Changing WARN_ON_ONCE condition for PERF_CGROUP_SWIN leg.

It's legal to have cpuctx->cgrp already defined, but it's
not legal not to have events active at the same time.

Cc: Peter Zijlstra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
CC: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
kernel/events/core.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7b9df35..733f794 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -394,7 +394,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
}

if (mode & PERF_CGROUP_SWIN) {
- WARN_ON_ONCE(cpuctx->cgrp);
+ WARN_ON_ONCE(cpuctx->cgrp && !cpuctx->ctx.is_active);
+
/* set cgrp before ctxsw in to
* allow event_filter_match() to not
* have to pass task around
--
1.7.7.6


2012-10-02 11:53:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, 2012-10-02 at 13:42 +0200, Jiri Olsa wrote:
> +++ b/kernel/events/core.c
> @@ -394,7 +394,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
> }
>
> if (mode & PERF_CGROUP_SWIN) {
> - WARN_ON_ONCE(cpuctx->cgrp);
> + WARN_ON_ONCE(cpuctx->cgrp && !cpuctx->ctx.is_active);
> +
> /* set cgrp before ctxsw in to
> * allow event_filter_match() to not
> * have to pass task around

OK, like you mentioned this is the result of multiple PMU being able to
share a cpuctx, shouldn't we in that case avoid the second loop over the
cpuctx as a whole?

Would something like the below do? IIRC I introduced that active_pmu for
exactly such reasons..

---
kernel/events/core.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7b9df35..e98f014 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -372,6 +372,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)

list_for_each_entry_rcu(pmu, &pmus, entry) {
cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ if (cpuctx->active_pmu != pmu)
+ continue;

/*
* perf_cgroup_events says at least one

2012-10-02 12:01:03

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, Oct 2, 2012 at 1:53 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, 2012-10-02 at 13:42 +0200, Jiri Olsa wrote:
> > +++ b/kernel/events/core.c
> > @@ -394,7 +394,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
> > }
> >
> > if (mode & PERF_CGROUP_SWIN) {
> > - WARN_ON_ONCE(cpuctx->cgrp);
> > + WARN_ON_ONCE(cpuctx->cgrp && !cpuctx->ctx.is_active);
> > +

Wait a minute, I thought we had fixed that problem which was caused by
the AMD IBS
pmu not having the task_ctx_nr set properly or something like that. We
were wrongfully
sharing the cpuctx between PMUs.



>
> > /* set cgrp before ctxsw in to
> > * allow event_filter_match() to not
> > * have to pass task around
>
> OK, like you mentioned this is the result of multiple PMU being able to
> share a cpuctx, shouldn't we in that case avoid the second loop over the
> cpuctx as a whole?
>
> Would something like the below do? IIRC I introduced that active_pmu for
> exactly such reasons..
>
> ---
> kernel/events/core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7b9df35..e98f014 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -372,6 +372,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
>
> list_for_each_entry_rcu(pmu, &pmus, entry) {
> cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> + if (cpuctx->active_pmu != pmu)
> + continue;
>
> /*
> * perf_cgroup_events says at least one
>

2012-10-02 12:39:54

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, Oct 02, 2012 at 01:53:01PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-10-02 at 13:42 +0200, Jiri Olsa wrote:
> > +++ b/kernel/events/core.c
> > @@ -394,7 +394,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
> > }
> >
> > if (mode & PERF_CGROUP_SWIN) {
> > - WARN_ON_ONCE(cpuctx->cgrp);
> > + WARN_ON_ONCE(cpuctx->cgrp && !cpuctx->ctx.is_active);
> > +
> > /* set cgrp before ctxsw in to
> > * allow event_filter_match() to not
> > * have to pass task around
>
> OK, like you mentioned this is the result of multiple PMU being able to
> share a cpuctx, shouldn't we in that case avoid the second loop over the
> cpuctx as a whole?
>
> Would something like the below do? IIRC I introduced that active_pmu for
> exactly such reasons..
>
> ---
> kernel/events/core.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7b9df35..e98f014 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -372,6 +372,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
>
> list_for_each_entry_rcu(pmu, &pmus, entry) {
> cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> + if (cpuctx->active_pmu != pmu)
> + continue;
>
> /*
> * perf_cgroup_events says at least one
>

this passed my test

jirka

2012-10-02 12:48:34

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, Oct 2, 2012 at 2:39 PM, Jiri Olsa <[email protected]> wrote:
> On Tue, Oct 02, 2012 at 01:53:01PM +0200, Peter Zijlstra wrote:
>> On Tue, 2012-10-02 at 13:42 +0200, Jiri Olsa wrote:
>> > +++ b/kernel/events/core.c
>> > @@ -394,7 +394,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
>> > }
>> >
>> > if (mode & PERF_CGROUP_SWIN) {
>> > - WARN_ON_ONCE(cpuctx->cgrp);
>> > + WARN_ON_ONCE(cpuctx->cgrp && !cpuctx->ctx.is_active);
>> > +
>> > /* set cgrp before ctxsw in to
>> > * allow event_filter_match() to not
>> > * have to pass task around
>>
>> OK, like you mentioned this is the result of multiple PMU being able to
>> share a cpuctx, shouldn't we in that case avoid the second loop over the
>> cpuctx as a whole?
>>
Not sure, I understand what active_pmu represents.

>> Would something like the below do? IIRC I introduced that active_pmu for
>> exactly such reasons..
>>
>> ---
>> kernel/events/core.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 7b9df35..e98f014 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -372,6 +372,8 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
>>
>> list_for_each_entry_rcu(pmu, &pmus, entry) {
>> cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
>> + if (cpuctx->active_pmu != pmu)
>> + continue;
>>
>> /*
>> * perf_cgroup_events says at least one
>>
>
> this passed my test
>
> jirka

2012-10-02 13:11:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, 2012-10-02 at 14:48 +0200, Stephane Eranian wrote:
> Not sure, I understand what active_pmu represents.

Its a 'random' pmu of those that share the cpuctx, exactly so you can
limit pmu iterations to those with unique cpuctx instances.

Its assigned when we create a cpuctx to the pmu creating it, its
re-assigned on pmu destruction (if that ever were to happen).

I realize the name isn't really helping but at the time I couldn't come
up with anything better :/

If you've got a good suggestion I'd be glad to rename it.

2012-10-02 13:34:45

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, Oct 2, 2012 at 3:10 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2012-10-02 at 14:48 +0200, Stephane Eranian wrote:
>> Not sure, I understand what active_pmu represents.
>
> Its a 'random' pmu of those that share the cpuctx, exactly so you can
> limit pmu iterations to those with unique cpuctx instances.
>
> Its assigned when we create a cpuctx to the pmu creating it, its
> re-assigned on pmu destruction (if that ever were to happen).
>
Yeah, I saw that. active_pmu point to whatever was the last PMU
sharing the cpuctx.

But I guess what is confusing is the name. It has nothing to do
with active vs. inactive. They are all active.

In cgroup_switch(), we must go over all syswide events from all PMUs to
sched out all the ones monitoring ONLY the current cgroup. You must do
this only once per switch. So given that the events are linked off of cpuctx,
what matters is that we go over all unique cpuctx once. Hence, I think
your patch solves the problem, though this is kinda obscure why.

> I realize the name isn't really helping but at the time I couldn't come
> up with anything better :/
>
> If you've got a good suggestion I'd be glad to rename it.

how about unique_pmu? And adding a comment in cgroup_switch()

+ /* ensure we process each cpuctx only once */
+ if (cpuctx->active_pmu != pmu)
+ continue;

2012-10-02 13:46:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, 2012-10-02 at 15:34 +0200, Stephane Eranian wrote:
> > If you've got a good suggestion I'd be glad to rename it.
>
> how about unique_pmu?

Done!

---
Subject: perf: Clarify perf_cpu_context::active_pmu by renaming it
From: Peter Zijlstra <[email protected]>
Date: Tue Oct 02 15:38:52 CEST 2012

Stephane thought the perf_cpu_context::active_pmu name confusing and
suggested using unique_pmu instead.

This pointer is a pointer to a 'random' pmu sharing the cpuctx
instance, therefore limiting a for_each_pmu loop to those where
cpuctx->unique_pmu matches the pmu we get a loop over unique cpuctx
instances.

Suggested-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/perf_event.h | 2 +-
kernel/events/core.c | 12 ++++++------
2 files changed, 7 insertions(+), 7 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1110,7 +1110,7 @@ struct perf_cpu_context {
int exclusive;
struct list_head rotation_list;
int jiffies_interval;
- struct pmu *active_pmu;
+ struct pmu *unique_pmu;
struct perf_cgroup *cgrp;
};

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4419,7 +4419,7 @@ static void perf_event_task_event(struct
rcu_read_lock();
list_for_each_entry_rcu(pmu, &pmus, entry) {
cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
- if (cpuctx->active_pmu != pmu)
+ if (cpuctx->unique_pmu != pmu)
goto next;
perf_event_task_ctx(&cpuctx->ctx, task_event);

@@ -4565,7 +4565,7 @@ static void perf_event_comm_event(struct
rcu_read_lock();
list_for_each_entry_rcu(pmu, &pmus, entry) {
cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
- if (cpuctx->active_pmu != pmu)
+ if (cpuctx->unique_pmu != pmu)
goto next;
perf_event_comm_ctx(&cpuctx->ctx, comm_event);

@@ -4761,7 +4761,7 @@ static void perf_event_mmap_event(struct
rcu_read_lock();
list_for_each_entry_rcu(pmu, &pmus, entry) {
cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
- if (cpuctx->active_pmu != pmu)
+ if (cpuctx->unique_pmu != pmu)
goto next;
perf_event_mmap_ctx(&cpuctx->ctx, mmap_event,
vma->vm_flags & VM_EXEC);
@@ -5862,8 +5862,8 @@ static void update_pmu_context(struct pm

cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);

- if (cpuctx->active_pmu == old_pmu)
- cpuctx->active_pmu = pmu;
+ if (cpuctx->unique_pmu == old_pmu)
+ cpuctx->unique_pmu = pmu;
}
}

@@ -5998,7 +5998,7 @@ int perf_pmu_register(struct pmu *pmu, c
cpuctx->ctx.pmu = pmu;
cpuctx->jiffies_interval = 1;
INIT_LIST_HEAD(&cpuctx->rotation_list);
- cpuctx->active_pmu = pmu;
+ cpuctx->unique_pmu = pmu;
}

got_cpu_context:

2012-10-02 13:47:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

On Tue, 2012-10-02 at 15:34 +0200, Stephane Eranian wrote:
> And adding a comment in cgroup_switch()
>
> + /* ensure we process each cpuctx only once */
> + if (cpuctx->active_pmu != pmu)
> + continue;
>
---
Subject: perf: Fix perf_cgroup_switch for sw-events
From: Peter Zijlstra <[email protected]>
Date: Tue Oct 02 15:41:23 CEST 2012

Jiri reported that he could trigger the WARN_ON_ONCE() in
perf_cgroup_switch() using sw-events. This is because sw-events share
a cpuctx with multiple PMUs.

Use the ->unique_pmu pointer to limit the pmu iteration to unique
cpuctx instances.

Reported-and-Tested-by: Jiri Olsa <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/events/core.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -372,6 +372,8 @@ void perf_cgroup_switch(struct task_stru

list_for_each_entry_rcu(pmu, &pmus, entry) {
cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ if (cpuctx->unique_pmu != pmu)
+ continue; /* ensure we process each cpuctx once */

/*
* perf_cgroup_events says at least one
@@ -395,9 +397,10 @@ void perf_cgroup_switch(struct task_stru

if (mode & PERF_CGROUP_SWIN) {
WARN_ON_ONCE(cpuctx->cgrp);
- /* set cgrp before ctxsw in to
- * allow event_filter_match() to not
- * have to pass task around
+ /*
+ * set cgrp before ctxsw in to allow
+ * event_filter_match() to not have to pass
+ * task around
*/
cpuctx->cgrp = perf_cgroup_from_task(task);
cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);

2012-10-02 14:18:26

by Stephane Eranian

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Fix perf_cgroup_switch schedule in warning

Cool. Thanks.

On Tue, Oct 2, 2012 at 3:46 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2012-10-02 at 15:34 +0200, Stephane Eranian wrote:
>> And adding a comment in cgroup_switch()
>>
>> + /* ensure we process each cpuctx only once */
>> + if (cpuctx->active_pmu != pmu)
>> + continue;
>>
> ---
> Subject: perf: Fix perf_cgroup_switch for sw-events
> From: Peter Zijlstra <[email protected]>
> Date: Tue Oct 02 15:41:23 CEST 2012
>
> Jiri reported that he could trigger the WARN_ON_ONCE() in
> perf_cgroup_switch() using sw-events. This is because sw-events share
> a cpuctx with multiple PMUs.
>
> Use the ->unique_pmu pointer to limit the pmu iteration to unique
> cpuctx instances.
>
> Reported-and-Tested-by: Jiri Olsa <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/events/core.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -372,6 +372,8 @@ void perf_cgroup_switch(struct task_stru
>
> list_for_each_entry_rcu(pmu, &pmus, entry) {
> cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> + if (cpuctx->unique_pmu != pmu)
> + continue; /* ensure we process each cpuctx once */
>
> /*
> * perf_cgroup_events says at least one
> @@ -395,9 +397,10 @@ void perf_cgroup_switch(struct task_stru
>
> if (mode & PERF_CGROUP_SWIN) {
> WARN_ON_ONCE(cpuctx->cgrp);
> - /* set cgrp before ctxsw in to
> - * allow event_filter_match() to not
> - * have to pass task around
> + /*
> + * set cgrp before ctxsw in to allow
> + * event_filter_match() to not have to pass
> + * task around
> */
> cpuctx->cgrp = perf_cgroup_from_task(task);
> cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>