Hi,
I ran into a problem on my AMD box whereby I would hit the
WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch().
It took me a while to track this down. It turns out that the
list_for_each_entry_rcu() loop had multiple iterations. That's
normal, we have CPU PMU and IBS PMU. But what caused
the warning to fire is that both the core and IBS PMU were
pointing to the same cpuctx struct. Thus, the cpuctx->cgrp
was already set in the second iteration.
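To make the failure mode concrete, here is a minimal userspace sketch of the
situation, not the kernel code itself. The types model_cpuctx, model_pmu and
model_cgroup and both functions are hypothetical stand-ins. The loop mimics
the per-PMU walk in perf_cgroup_switch() and counts how often the context
already has a cgroup installed, which is the condition the WARN_ON_ONCE()
checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, heavily simplified. */
struct model_cgroup { int id; };

struct model_cpuctx {
	struct model_cgroup *cgrp;	/* cgroup currently switched in, or NULL */
};

struct model_pmu {
	const char *name;
	struct model_cpuctx *cpuctx;	/* may be shared between PMUs */
};

/*
 * Model of the switch-in pass: visit every PMU and install the cgroup on
 * its cpuctx. Returns how many times the cpuctx already had a cgroup set,
 * i.e. how many times a WARN_ON(cpuctx->cgrp) would have triggered.
 */
int model_switch_in(struct model_pmu *pmus, size_t n, struct model_cgroup *next)
{
	int already_set = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_cpuctx *cpuctx = pmus[i].cpuctx;

		if (cpuctx->cgrp)
			already_set++;
		cpuctx->cgrp = next;
	}
	return already_set;
}

/* Run the pass for two PMUs that either share one cpuctx or own one each. */
int model_count_warnings(int shared)
{
	struct model_cgroup cg = { 1 };
	struct model_cpuctx a = { NULL }, b = { NULL };
	struct model_pmu pmus[2] = {
		{ "cpu", &a },
		{ "ibs", shared ? &a : &b },
	};

	return model_switch_in(pmus, 2, &cg);
}
```

With a shared context the second iteration finds cgrp already set; with
separate contexts it never does.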
Is the warning a false positive?
In perf_pmu_register(), there is a search for a matching
pmu->task_ctx_nr. Given that the field is pointing to
perf_hw_context for both cpu and IBS PMU, there is
a match and therefore the cpuctx are shared.
The question is: why do we have to share the cpuctx?
Note that the same issue probably exists with the Intel
uncore PMU.
If we need to share, then the perf_cgroup_switch() code
needs to change because, as it stands, it is doing the
switching twice in this case.
Either way something looks wrong here.
Any idea?
On 08/09/2012 08:51 AM, Stephane Eranian wrote:
> Hi,
>
> I ran into a problem on my AMD box whereby I would hit the
> WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch().
>
> It took me a while to track this down. It turns out that the
> list_for_each_entry_rcu() loop had multiple iterations. That's
> normal, we have CPU PMU and IBS PMU. But what caused
> the warning to fire is that both the core and IBS PMU were
> pointing to the same cpuctx struct. Thus, the cpuctx->cgrp
> was already set in the second iteration.
>
> Is the warning a false positive?
I think it's a false positive, but I'm not sure.
>
> In perf_pmu_register(), there is a search for a matching
> pmu->task_ctx_nr. Given that the field is pointing to
> perf_hw_context for both cpu and IBS PMU, there is
> a match and therefore the cpuctx are shared.
>
> The question is: why do we have to share the cpuctx?
>
> Note that the same issue probably exists with the Intel
> uncore PMU.
The uncore PMU does not have this issue because uncore_pmu->task_ctx_nr
is 'perf_invalid_context'. find_pmu_context() always returns NULL in
that case.
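As a rough illustration of that lookup, here is a simplified userspace model,
not the actual kernel source. All model_* names are hypothetical, and the
fixed-size array stands in for the kernel's PMU list. It shows that a
negative context number never matches, so such a PMU always gets its own
cpuctx, while two PMUs with the same valid number end up sharing one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the context numbering discussed in this thread. */
enum model_task_context {
	model_invalid_context = -1,
	model_hw_context = 0,
	model_sw_context,
};

struct model_cpuctx { int unused; };

struct model_pmu {
	int task_ctx_nr;
	struct model_cpuctx *cpuctx;	/* context in use, possibly shared */
	struct model_cpuctx own;	/* private storage if nothing is shared */
};

#define MODEL_MAX_PMUS 8
struct model_pmu *model_pmus[MODEL_MAX_PMUS];
size_t model_nr_pmus;

/* Return the cpuctx of an already registered PMU with the same ctxn. */
struct model_cpuctx *model_find_context(int ctxn)
{
	if (ctxn < 0)		/* invalid context: never share */
		return NULL;

	for (size_t i = 0; i < model_nr_pmus; i++) {
		if (model_pmus[i]->task_ctx_nr == ctxn)
			return model_pmus[i]->cpuctx;
	}
	return NULL;
}

/* Register a PMU: reuse a matching cpuctx, otherwise use a private one. */
int model_register(struct model_pmu *pmu)
{
	if (model_nr_pmus == MODEL_MAX_PMUS)
		return -1;

	pmu->cpuctx = model_find_context(pmu->task_ctx_nr);
	if (!pmu->cpuctx)
		pmu->cpuctx = &pmu->own;

	model_pmus[model_nr_pmus++] = pmu;
	return 0;
}
```

Registering two PMUs with model_hw_context leaves both pointing at the first
one's cpuctx; registering any number with model_invalid_context gives each
its own.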
Regards
Yan, Zheng.
>
> If we need to share, then the perf_cgroup_switch() code
> needs to change because, as it stands, it is doing the
> switching twice in this case.
>
> Either way something looks wrong here.
>
> Any idea?
>
On Thu, Aug 9, 2012 at 8:55 AM, Yan, Zheng <[email protected]> wrote:
> On 08/09/2012 08:51 AM, Stephane Eranian wrote:
>> Hi,
>>
>> I ran into a problem on my AMD box whereby I would hit the
>> WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch().
>>
>> It took me a while to track this down. It turns out that the
>> list_for_each_entry_rcu() loop had multiple iterations. That's
>> normal, we have CPU PMU and IBS PMU. But what caused
>> the warning to fire is that both the core and IBS PMU were
>> pointing to the same cpuctx struct. Thus, the cpuctx->cgrp
>> was already set in the second iteration.
>>
>> Is the warning a false positive?
>
> I think it's a false positive, but I'm not sure.
>
Well, but then you're doing the same work twice.
>>
>> In perf_pmu_register(), there is a search for a matching
>> pmu->task_ctx_nr. Given that the field is pointing to
>> perf_hw_context for both cpu and IBS PMU, there is
>> a match and therefore the cpuctx are shared.
>>
>> The question is: why do we have to share the cpuctx?
>>
>> Note that the same issue probably exists with the Intel
>> uncore PMU.
>
> uncore PMU does not have this issue because uncore_pmu->task_ctx_nr
> is 'perf_invalid_context'. find_pmu_context() always returns NULL in
> that case.
>
Yes, I think IBS should do the same and that should fix the problem
there too. Will try that.
On Thu, 2012-08-09 at 16:05 +0200, Stephane Eranian wrote:
> > uncore PMU does not have this issue because uncore_pmu->task_ctx_nr
> > is 'perf_invalid_context'. find_pmu_context() always returns NULL in
> > that case.
> >
> Yes, I think IBS should do the same and that should fix the problem
> there too. Will try that.
I'm afraid not, per-task profiling with uncore doesn't really make that
much sense. For IBS it does.
We can't share a context with different PMUs, that'll totally mess up
the event scheduling.
We'll have to grow perf_event_task_context with an extra context and
have IBS use that.
On Thu, Aug 9, 2012 at 8:08 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-08-09 at 16:05 +0200, Stephane Eranian wrote:
>> > uncore PMU does not have this issue because uncore_pmu->task_ctx_nr
>> > is 'perf_invalid_context'. find_pmu_context() always returns NULL in
>> > that case.
>> >
>> Yes, I think IBS should do the same and that should fix the problem
>> there too. Will try that.
>
> I'm afraid not, per-task profiling with uncore doesn't really make that
> much sense. For IBS it does.
>
> We can't share a context with different PMUs, that'll totally mess up
> the event scheduling.
>
> We'll have to grow perf_event_task_context with an extra context and
> have IBS use that.
Ok, I am fine with that. Don't know what to call it though.
OK,.. so the AMD IBS PMUs actually have perf_invalid_context.
Lemme have a proper look...
Weirdness.. perf_pmu_register() will allocate a pmu->pmu_cpu_context for
each PMU. find_pmu_context() even special cases the perf_invalid_context
to return NULL to force the allocation instead of sharing it.
So both IBS PMUs should have their own cpuctx.
In any case, I was talking about something like the below.. I hate
growing the per-task ctx array with two entries, esp. since we'll mostly
add two NULL pointer checks on every perf operation for everybody not
using IBS.
---
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -436,7 +436,7 @@ static void perf_ibs_read(struct perf_ev

 static struct perf_ibs perf_ibs_fetch = {
 	.pmu = {
-		.task_ctx_nr	= perf_invalid_context,
+		.task_ctx_nr	= perf_hw2_context,

 		.event_init	= perf_ibs_init,
 		.add		= perf_ibs_add,
@@ -459,7 +459,7 @@ static struct perf_ibs perf_ibs_fetch =

 static struct perf_ibs perf_ibs_op = {
 	.pmu = {
-		.task_ctx_nr	= perf_invalid_context,
+		.task_ctx_nr	= perf_hw3_context,

 		.event_init	= perf_ibs_init,
 		.add		= perf_ibs_add,
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1237,6 +1237,8 @@ enum perf_event_task_context {
 	perf_invalid_context = -1,
 	perf_hw_context = 0,
 	perf_sw_context,
+	perf_hw2_context, /* AMD IBS (fetch) */
+	perf_hw3_context, /* AMD IBS (ops) */
 	perf_nr_task_contexts,
 };
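
To show the cost mentioned above, here is a small userspace sketch of what
the longer enum does to a per-task context array. It is only a model: the
model_* names are hypothetical, and the loop merely stands in for any perf
operation that walks every per-task context slot. A task that never uses IBS
still has the two new slots, and each walk has to check them and find NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the enum after the proposed change. */
enum model_task_context_nr {
	model_ctx_invalid = -1,
	model_ctx_hw = 0,
	model_ctx_sw,
	model_ctx_hw2,		/* IBS fetch in the proposal */
	model_ctx_hw3,		/* IBS op in the proposal */
	model_ctx_nr,
};

struct model_task_ctx { int nr_events; };

/* Per-task state: one context pointer per enum slot, NULL when unused. */
struct model_task {
	struct model_task_ctx *ctxp[model_ctx_nr];
};

/*
 * Walk all slots the way a per-task operation would and count how many
 * of them are empty, i.e. how many NULL checks find nothing to do.
 */
int model_count_empty_slots(const struct model_task *t)
{
	int empty = 0;

	for (int ctxn = 0; ctxn < model_ctx_nr; ctxn++) {
		if (!t->ctxp[ctxn])
			empty++;
	}
	return empty;
}
```

For a task that only has hardware and software contexts, the walk visits four
slots and two of them are empty.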
Peter,
Ok, that should fix the problem that IBS would not work correctly in
per-thread mode.
I realized I was looking at an older kernel which did not have the
split between IBS op and fetch. And there, the .task_ctx_nr field was
not initialized at all.
Your proposal solves the problem, though it is not that pretty because
you're exposing IBS stuff in sched.h, aside from the 2 new iterations.
But I don't see another way around this at this point.
Thanks.
On Mon, Aug 13, 2012 at 12:28 PM, Peter Zijlstra <[email protected]> wrote:
>
> OK,.. so the AMD IBS PMUs actually have perf_invalid_context.
>
> Lemme have a proper look...
>
>
> Weirdness.. perf_pmu_register() will allocate a pmu->pmu_cpu_context for
> each PMU. find_pmu_context() even special cases the perf_invalid_context
> to return NULL to force the allocation instead of sharing it.
>
> So both IBS PMUs should have their own cpuctx.
>
>
>
> In any case, I was talking about something like the below.. I hate
> growing the per-task ctx array with two entries, esp. since we'll mostly
> add two NULL pointer checks on every perf operation for everybody not
> using IBS.
>
> ---
> --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
> +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
> @@ -436,7 +436,7 @@ static void perf_ibs_read(struct perf_ev
>
>  static struct perf_ibs perf_ibs_fetch = {
>  	.pmu = {
> -		.task_ctx_nr	= perf_invalid_context,
> +		.task_ctx_nr	= perf_hw2_context,
>
>  		.event_init	= perf_ibs_init,
>  		.add		= perf_ibs_add,
> @@ -459,7 +459,7 @@ static struct perf_ibs perf_ibs_fetch =
>
>  static struct perf_ibs perf_ibs_op = {
>  	.pmu = {
> -		.task_ctx_nr	= perf_invalid_context,
> +		.task_ctx_nr	= perf_hw3_context,
>
>  		.event_init	= perf_ibs_init,
>  		.add		= perf_ibs_add,
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1237,6 +1237,8 @@ enum perf_event_task_context {
>  	perf_invalid_context = -1,
>  	perf_hw_context = 0,
>  	perf_sw_context,
> +	perf_hw2_context, /* AMD IBS (fetch) */
> +	perf_hw3_context, /* AMD IBS (ops) */
>  	perf_nr_task_contexts,
>  };
>
>
On Mon, 2012-08-13 at 14:53 +0200, Stephane Eranian wrote:
>
> Your proposal solves the problem, though it is not that pretty because
> you're exposing IBS stuff in sched.h, aside from the 2 new iterations.
> But I don't see another way around this at this point.
Yeah, I don't particularly like it either.. but I'm having a similar
problem of not seeing anything better.
I was going to see if I could make the two IBS PMUs share something, but
that's going to be a pain whichever way you turn it.
On Mon, Aug 13, 2012 at 5:32 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-08-13 at 14:53 +0200, Stephane Eranian wrote:
>>
>> Your proposal solves the problem, though it is not that pretty because
>> you're exposing IBS stuff in sched.h, aside from the 2 new iterations.
>> But I don't see another way around this at this point.
>
> Yeah, I don't particularly like it either.. but I'm having a similar
> problem of not seeing anything better.
>
> I was going to see if I could make the two IBS PMUs share something, but
> that's going to be a pain whichever way you turn it.
Let's use your patch then.
Acked-by: Stephane Eranian <[email protected]>