2019-06-01 08:29:06

by Ian Rogers

Subject: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

Currently perf_rotate_context assumes that if the context's nr_events !=
nr_active a rotation is necessary for perf event multiplexing. With
cgroups, nr_events is the total count of events for all cgroups and
nr_active will not include events in a cgroup other than the current
task's. This makes rotation appear necessary for cgroups when it is not.
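
For example, with toy numbers (an illustration only, not kernel code; all
names below are made up):

#include <stdio.h>

int main(void)
{
	int counters = 4;		/* hardware counters on the PMU     */
	int cgrp_a_events = 2;		/* cgroup of the task now on-CPU    */
	int cgrp_b_events = 3;		/* some other cgroup, not running   */

	int nr_events = cgrp_a_events + cgrp_b_events;	/* 5: all cgroups   */
	int nr_active = cgrp_a_events;			/* 2: only cgroup A */

	/* The old heuristic claims multiplexing is needed... */
	printf("rotate = %d\n", nr_events != nr_active);	/* prints 1 */

	/* ...yet every event that is allowed to run fits on the PMU. */
	printf("free counters = %d\n", counters - nr_active);	/* prints 2 */
	return 0;
}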

Add a perf_event_context flag that is set when rotation is necessary.
Clear the flag during sched_out and set it when a flexible sched_in
fails due to resources.

Signed-off-by: Ian Rogers <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 42 +++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 15a82ff0aefe..7ab6c251aa3d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -747,6 +747,11 @@ struct perf_event_context {
int nr_stat;
int nr_freq;
int rotate_disable;
+ /*
+ * Set when nr_events != nr_active, except tolerant to events not
+ * needing to be active due to scheduling constraints, such as cgroups.
+ */
+ int rotate_necessary;
refcount_t refcount;
struct task_struct *task;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..41ae424b9f1d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
if (!ctx->nr_active || !(is_active & EVENT_ALL))
return;

+ /*
+ * If we had been multiplexing, no rotations are necessary now no events
+ * are active.
+ */
+ ctx->rotate_necessary = 0;
+
perf_pmu_disable(ctx->pmu);
if (is_active & EVENT_PINNED) {
list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
@@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
sid->can_add_hw = 0;
}

+ /*
+ * If the group wasn't scheduled then set that multiplexing is necessary
+ * for the context. Note, this won't be set if the event wasn't
+ * scheduled due to event_filter_match failing due to the earlier
+ * return.
+ */
+ if (event->state == PERF_EVENT_STATE_INACTIVE)
+ sid->ctx->rotate_necessary = 1;
+
return 0;
}

@@ -3690,24 +3705,17 @@ ctx_first_active(struct perf_event_context *ctx)
static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
{
struct perf_event *cpu_event = NULL, *task_event = NULL;
- bool cpu_rotate = false, task_rotate = false;
- struct perf_event_context *ctx = NULL;
+ struct perf_event_context *task_ctx = NULL;
+ int cpu_rotate, task_rotate;

/*
* Since we run this from IRQ context, nobody can install new
* events, thus the event count values are stable.
*/

- if (cpuctx->ctx.nr_events) {
- if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
- cpu_rotate = true;
- }
-
- ctx = cpuctx->task_ctx;
- if (ctx && ctx->nr_events) {
- if (ctx->nr_events != ctx->nr_active)
- task_rotate = true;
- }
+ cpu_rotate = cpuctx->ctx.rotate_necessary;
+ task_ctx = cpuctx->task_ctx;
+ task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;

if (!(cpu_rotate || task_rotate))
return false;
@@ -3716,7 +3724,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
perf_pmu_disable(cpuctx->ctx.pmu);

if (task_rotate)
- task_event = ctx_first_active(ctx);
+ task_event = ctx_first_active(task_ctx);
if (cpu_rotate)
cpu_event = ctx_first_active(&cpuctx->ctx);

@@ -3724,17 +3732,17 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
* As per the order given at ctx_resched() first 'pop' task flexible
* and then, if needed CPU flexible.
*/
- if (task_event || (ctx && cpu_event))
- ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+ if (task_event || (task_ctx && cpu_event))
+ ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
if (cpu_event)
cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);

if (task_event)
- rotate_ctx(ctx, task_event);
+ rotate_ctx(task_ctx, task_event);
if (cpu_event)
rotate_ctx(&cpuctx->ctx, cpu_event);

- perf_event_sched_in(cpuctx, ctx, current);
+ perf_event_sched_in(cpuctx, task_ctx, current);

perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
--
2.22.0.rc1.257.g3120a18244-goog


2019-06-13 16:14:56

by Liang, Kan

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily



On 6/1/2019 4:27 AM, Ian Rogers wrote:
> Currently perf_rotate_context assumes that if the context's nr_events !=
> nr_active a rotation is necessary for perf event multiplexing. With
> cgroups, nr_events is the total count of events for all cgroups and
> nr_active will not include events in a cgroup other than the current
> task's. This makes rotation appear necessary for cgroups when it is not.
>
> Add a perf_event_context flag that is set when rotation is necessary.
> Clear the flag during sched_out and set it when a flexible sched_in
> fails due to resources.
>
> Signed-off-by: Ian Rogers <[email protected]>
> ---
> include/linux/perf_event.h | 5 +++++
> kernel/events/core.c | 42 +++++++++++++++++++++++---------------
> 2 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 15a82ff0aefe..7ab6c251aa3d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -747,6 +747,11 @@ struct perf_event_context {
> int nr_stat;
> int nr_freq;
> int rotate_disable;
> + /*
> + * Set when nr_events != nr_active, except tolerant to events not
> + * needing to be active due to scheduling constraints, such as cgroups.
> + */
> + int rotate_necessary;

It looks like the rotate_necessary is only useful for cgroup and cpuctx.
Why not move it to struct perf_cpu_context and under #ifdef
CONFIG_CGROUP_PERF?
And rename it cgrp_rotate_necessary?
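
Roughly something like the below (untested sketch, other fields elided):

struct perf_cpu_context {
	struct perf_event_context	ctx;
	/* ... */
#ifdef CONFIG_CGROUP_PERF
	/* set when a cgroup event could not go on the PMU */
	int				cgrp_rotate_necessary;
#endif
	/* ... */
};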

Thanks,
Kan

> refcount_t refcount;
> struct task_struct *task;
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abbd4b3b96c2..41ae424b9f1d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
> if (!ctx->nr_active || !(is_active & EVENT_ALL))
> return;
>
> + /*
> + * If we had been multiplexing, no rotations are necessary now no events
> + * are active.
> + */
> + ctx->rotate_necessary = 0;
> +
> perf_pmu_disable(ctx->pmu);
> if (is_active & EVENT_PINNED) {
> list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
> @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> sid->can_add_hw = 0;
> }
>
> + /*
> + * If the group wasn't scheduled then set that multiplexing is necessary
> + * for the context. Note, this won't be set if the event wasn't
> + * scheduled due to event_filter_match failing due to the earlier
> + * return.
> + */
> + if (event->state == PERF_EVENT_STATE_INACTIVE)
> + sid->ctx->rotate_necessary = 1;
> +
> return 0;
> }
>
> @@ -3690,24 +3705,17 @@ ctx_first_active(struct perf_event_context *ctx)
> static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> {
> struct perf_event *cpu_event = NULL, *task_event = NULL;
> - bool cpu_rotate = false, task_rotate = false;
> - struct perf_event_context *ctx = NULL;
> + struct perf_event_context *task_ctx = NULL;
> + int cpu_rotate, task_rotate;
>
> /*
> * Since we run this from IRQ context, nobody can install new
> * events, thus the event count values are stable.
> */
>
> - if (cpuctx->ctx.nr_events) {
> - if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
> - cpu_rotate = true;
> - }
> -
> - ctx = cpuctx->task_ctx;
> - if (ctx && ctx->nr_events) {
> - if (ctx->nr_events != ctx->nr_active)
> - task_rotate = true;
> - }
> + cpu_rotate = cpuctx->ctx.rotate_necessary;
> + task_ctx = cpuctx->task_ctx;
> + task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
>
> if (!(cpu_rotate || task_rotate))
> return false;
> @@ -3716,7 +3724,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> perf_pmu_disable(cpuctx->ctx.pmu);
>
> if (task_rotate)
> - task_event = ctx_first_active(ctx);
> + task_event = ctx_first_active(task_ctx);
> if (cpu_rotate)
> cpu_event = ctx_first_active(&cpuctx->ctx);
>
> @@ -3724,17 +3732,17 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> * As per the order given at ctx_resched() first 'pop' task flexible
> * and then, if needed CPU flexible.
> */
> - if (task_event || (ctx && cpu_event))
> - ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
> + if (task_event || (task_ctx && cpu_event))
> + ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
> if (cpu_event)
> cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>
> if (task_event)
> - rotate_ctx(ctx, task_event);
> + rotate_ctx(task_ctx, task_event);
> if (cpu_event)
> rotate_ctx(&cpuctx->ctx, cpu_event);
>
> - perf_event_sched_in(cpuctx, ctx, current);
> + perf_event_sched_in(cpuctx, task_ctx, current);
>
> perf_pmu_enable(cpuctx->ctx.pmu);
> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>

2019-06-14 19:11:14

by Stephane Eranian

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Thu, Jun 13, 2019 at 9:13 AM Liang, Kan <[email protected]> wrote:
>
>
>
> On 6/1/2019 4:27 AM, Ian Rogers wrote:
> > Currently perf_rotate_context assumes that if the context's nr_events !=
> > nr_active a rotation is necessary for perf event multiplexing. With
> > cgroups, nr_events is the total count of events for all cgroups and
> > nr_active will not include events in a cgroup other than the current
> > task's. This makes rotation appear necessary for cgroups when it is not.
> >
> > Add a perf_event_context flag that is set when rotation is necessary.
> > Clear the flag during sched_out and set it when a flexible sched_in
> > fails due to resources.
> >
> > Signed-off-by: Ian Rogers <[email protected]>
> > ---
> > include/linux/perf_event.h | 5 +++++
> > kernel/events/core.c | 42 +++++++++++++++++++++++---------------
> > 2 files changed, 30 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 15a82ff0aefe..7ab6c251aa3d 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -747,6 +747,11 @@ struct perf_event_context {
> > int nr_stat;
> > int nr_freq;
> > int rotate_disable;
> > + /*
> > + * Set when nr_events != nr_active, except tolerant to events not
> > + * needing to be active due to scheduling constraints, such as cgroups.
> > + */
> > + int rotate_necessary;
>
> It looks like the rotate_necessary is only useful for cgroup and cpuctx.
> Why not move it to struct perf_cpu_context and under #ifdef
> CONFIG_CGROUP_PERF?
> And rename it cgrp_rotate_necessary?
>
I am not sure I see the point here. What I'd like to see is a uniform
signal for rotation needed in per-task, per-cpu or per-cgroup modes.
Ian's patch does that. It does make it a lot more efficient in cgroup
mode, by avoiding unnecessary rotations, and does not alter/improve
on any of the other two modes.

> Thanks,
> Kan
>
> > refcount_t refcount;
> > struct task_struct *task;
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index abbd4b3b96c2..41ae424b9f1d 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
> > if (!ctx->nr_active || !(is_active & EVENT_ALL))
> > return;
> >
> > + /*
> > + * If we had been multiplexing, no rotations are necessary now no events
> > + * are active.
> > + */
> > + ctx->rotate_necessary = 0;
> > +
> > perf_pmu_disable(ctx->pmu);
> > if (is_active & EVENT_PINNED) {
> > list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
> > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> > sid->can_add_hw = 0;
> > }
> >
> > + /*
> > + * If the group wasn't scheduled then set that multiplexing is necessary
> > + * for the context. Note, this won't be set if the event wasn't
> > + * scheduled due to event_filter_match failing due to the earlier
> > + * return.
> > + */
> > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > + sid->ctx->rotate_necessary = 1;
> > +
> > return 0;
> > }
> >
> > @@ -3690,24 +3705,17 @@ ctx_first_active(struct perf_event_context *ctx)
> > static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> > {
> > struct perf_event *cpu_event = NULL, *task_event = NULL;
> > - bool cpu_rotate = false, task_rotate = false;
> > - struct perf_event_context *ctx = NULL;
> > + struct perf_event_context *task_ctx = NULL;
> > + int cpu_rotate, task_rotate;
> >
> > /*
> > * Since we run this from IRQ context, nobody can install new
> > * events, thus the event count values are stable.
> > */
> >
> > - if (cpuctx->ctx.nr_events) {
> > - if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
> > - cpu_rotate = true;
> > - }
> > -
> > - ctx = cpuctx->task_ctx;
> > - if (ctx && ctx->nr_events) {
> > - if (ctx->nr_events != ctx->nr_active)
> > - task_rotate = true;
> > - }
> > + cpu_rotate = cpuctx->ctx.rotate_necessary;
> > + task_ctx = cpuctx->task_ctx;
> > + task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
> >
> > if (!(cpu_rotate || task_rotate))
> > return false;
> > @@ -3716,7 +3724,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> > perf_pmu_disable(cpuctx->ctx.pmu);
> >
> > if (task_rotate)
> > - task_event = ctx_first_active(ctx);
> > + task_event = ctx_first_active(task_ctx);
> > if (cpu_rotate)
> > cpu_event = ctx_first_active(&cpuctx->ctx);
> >
> > @@ -3724,17 +3732,17 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> > * As per the order given at ctx_resched() first 'pop' task flexible
> > * and then, if needed CPU flexible.
> > */
> > - if (task_event || (ctx && cpu_event))
> > - ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
> > + if (task_event || (task_ctx && cpu_event))
> > + ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
> > if (cpu_event)
> > cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> >
> > if (task_event)
> > - rotate_ctx(ctx, task_event);
> > + rotate_ctx(task_ctx, task_event);
> > if (cpu_event)
> > rotate_ctx(&cpuctx->ctx, cpu_event);
> >
> > - perf_event_sched_in(cpuctx, ctx, current);
> > + perf_event_sched_in(cpuctx, task_ctx, current);
> >
> > perf_pmu_enable(cpuctx->ctx.pmu);
> > perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> >

2019-06-14 21:45:32

by Liang, Kan

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily



On 6/14/2019 3:10 PM, Stephane Eranian wrote:
> On Thu, Jun 13, 2019 at 9:13 AM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 6/1/2019 4:27 AM, Ian Rogers wrote:
>>> Currently perf_rotate_context assumes that if the context's nr_events !=
>>> nr_active a rotation is necessary for perf event multiplexing. With
>>> cgroups, nr_events is the total count of events for all cgroups and
>>> nr_active will not include events in a cgroup other than the current
>>> task's. This makes rotation appear necessary for cgroups when it is not.
>>>
>>> Add a perf_event_context flag that is set when rotation is necessary.
>>> Clear the flag during sched_out and set it when a flexible sched_in
>>> fails due to resources.
>>>
>>> Signed-off-by: Ian Rogers <[email protected]>
>>> ---
>>> include/linux/perf_event.h | 5 +++++
>>> kernel/events/core.c | 42 +++++++++++++++++++++++---------------
>>> 2 files changed, 30 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index 15a82ff0aefe..7ab6c251aa3d 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -747,6 +747,11 @@ struct perf_event_context {
>>> int nr_stat;
>>> int nr_freq;
>>> int rotate_disable;
>>> + /*
>>> + * Set when nr_events != nr_active, except tolerant to events not
>>> + * needing to be active due to scheduling constraints, such as cgroups.
>>> + */
>>> + int rotate_necessary;
>>
>> It looks like the rotate_necessary is only useful for cgroup and cpuctx.
>> Why not move it to struct perf_cpu_context and under #ifdef
>> CONFIG_CGROUP_PERF?
>> And rename it cgrp_rotate_necessary?
>>
> I am not sure I see the point here. What I'd like to see is a uniform
> signal for rotation needed in per-task, per-cpu or per-cgroup modes > Ian's patch does that. It does make it a lot more efficient in cgroup
> mode, by avoiding unnecessary rotations, and does not alter/improve
> on any of the other two modes.

I just thought if it is only used by cgroup, it may be better to move it
under #ifdef CONFIG_CGROUP_PERF. For users who don't care about cgroup,
it may save some space.
But if it's designed as a uniform signal, this is OK.

Thanks,
Kan

>
>> Thanks,
>> Kan
>>
>>> refcount_t refcount;
>>> struct task_struct *task;
>>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index abbd4b3b96c2..41ae424b9f1d 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>>> if (!ctx->nr_active || !(is_active & EVENT_ALL))
>>> return;
>>>
>>> + /*
>>> + * If we had been multiplexing, no rotations are necessary now no events
>>> + * are active.
>>> + */
>>> + ctx->rotate_necessary = 0;
>>> +
>>> perf_pmu_disable(ctx->pmu);
>>> if (is_active & EVENT_PINNED) {
>>> list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
>>> @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
>>> sid->can_add_hw = 0;
>>> }
>>>
>>> + /*
>>> + * If the group wasn't scheduled then set that multiplexing is necessary
>>> + * for the context. Note, this won't be set if the event wasn't
>>> + * scheduled due to event_filter_match failing due to the earlier
>>> + * return.
>>> + */
>>> + if (event->state == PERF_EVENT_STATE_INACTIVE)
>>> + sid->ctx->rotate_necessary = 1;
>>> +
>>> return 0;
>>> }
>>>
>>> @@ -3690,24 +3705,17 @@ ctx_first_active(struct perf_event_context *ctx)
>>> static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
>>> {
>>> struct perf_event *cpu_event = NULL, *task_event = NULL;
>>> - bool cpu_rotate = false, task_rotate = false;
>>> - struct perf_event_context *ctx = NULL;
>>> + struct perf_event_context *task_ctx = NULL;
>>> + int cpu_rotate, task_rotate;
>>>
>>> /*
>>> * Since we run this from IRQ context, nobody can install new
>>> * events, thus the event count values are stable.
>>> */
>>>
>>> - if (cpuctx->ctx.nr_events) {
>>> - if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>>> - cpu_rotate = true;
>>> - }
>>> -
>>> - ctx = cpuctx->task_ctx;
>>> - if (ctx && ctx->nr_events) {
>>> - if (ctx->nr_events != ctx->nr_active)
>>> - task_rotate = true;
>>> - }
>>> + cpu_rotate = cpuctx->ctx.rotate_necessary;
>>> + task_ctx = cpuctx->task_ctx;
>>> + task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
>>>
>>> if (!(cpu_rotate || task_rotate))
>>> return false;
>>> @@ -3716,7 +3724,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
>>> perf_pmu_disable(cpuctx->ctx.pmu);
>>>
>>> if (task_rotate)
>>> - task_event = ctx_first_active(ctx);
>>> + task_event = ctx_first_active(task_ctx);
>>> if (cpu_rotate)
>>> cpu_event = ctx_first_active(&cpuctx->ctx);
>>>
>>> @@ -3724,17 +3732,17 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
>>> * As per the order given at ctx_resched() first 'pop' task flexible
>>> * and then, if needed CPU flexible.
>>> */
>>> - if (task_event || (ctx && cpu_event))
>>> - ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>>> + if (task_event || (task_ctx && cpu_event))
>>> + ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
>>> if (cpu_event)
>>> cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>>>
>>> if (task_event)
>>> - rotate_ctx(ctx, task_event);
>>> + rotate_ctx(task_ctx, task_event);
>>> if (cpu_event)
>>> rotate_ctx(&cpuctx->ctx, cpu_event);
>>>
>>> - perf_event_sched_in(cpuctx, ctx, current);
>>> + perf_event_sched_in(cpuctx, task_ctx, current);
>>>
>>> perf_pmu_enable(cpuctx->ctx.pmu);
>>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>>

2019-06-20 23:09:40

by Ian Rogers

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Fri, Jun 14, 2019 at 2:43 PM Liang, Kan <[email protected]> wrote:
>
>
>
> On 6/14/2019 3:10 PM, Stephane Eranian wrote:
> > On Thu, Jun 13, 2019 at 9:13 AM Liang, Kan <[email protected]> wrote:
> >>
> >>
> >>
> >> On 6/1/2019 4:27 AM, Ian Rogers wrote:
> >>> Currently perf_rotate_context assumes that if the context's nr_events !=
> >>> nr_active a rotation is necessary for perf event multiplexing. With
> >>> cgroups, nr_events is the total count of events for all cgroups and
> >>> nr_active will not include events in a cgroup other than the current
> >>> task's. This makes rotation appear necessary for cgroups when it is not.
> >>>
> >>> Add a perf_event_context flag that is set when rotation is necessary.
> >>> Clear the flag during sched_out and set it when a flexible sched_in
> >>> fails due to resources.
> >>>
> >>> Signed-off-by: Ian Rogers <[email protected]>
> >>> ---
> >>> include/linux/perf_event.h | 5 +++++
> >>> kernel/events/core.c | 42 +++++++++++++++++++++++---------------
> >>> 2 files changed, 30 insertions(+), 17 deletions(-)
> >>>
> >>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >>> index 15a82ff0aefe..7ab6c251aa3d 100644
> >>> --- a/include/linux/perf_event.h
> >>> +++ b/include/linux/perf_event.h
> >>> @@ -747,6 +747,11 @@ struct perf_event_context {
> >>> int nr_stat;
> >>> int nr_freq;
> >>> int rotate_disable;
> >>> + /*
> >>> + * Set when nr_events != nr_active, except tolerant to events not
> >>> + * needing to be active due to scheduling constraints, such as cgroups.
> >>> + */
> >>> + int rotate_necessary;
> >>
> >> It looks like the rotate_necessary is only useful for cgroup and cpuctx.
> >> Why not move it to struct perf_cpu_context and under #ifdef
> >> CONFIG_CGROUP_PERF?
> >> And rename it cgrp_rotate_necessary?
> >>
> > I am not sure I see the point here. What I'd like to see is a uniform
> > signal for rotation needed in per-task, per-cpu or per-cgroup modes > Ian's patch does that. It does make it a lot more efficient in cgroup
> > mode, by avoiding unnecessary rotations, and does not alter/improve
> > on any of the other two modes.
>
> I just thought if it is only used by cgroup, it may be better to move it
> under #ifdef CONFIG_CGROUP_PERF. For users who don't care about cgroup,
> it may save some space.
> But if it's designed as a uniform signal, this is OK.
>
> Thanks,
> Kan

Thanks Kan,
I think you have a wider valid point that by specializing to the CPU
or task context the code could avoid work. Similarly, adding the state
to perf_event_context is limited overhead for non-cgroup builds that
could be optimized away by going back to "nr_active == nr_events". I
agree with Stephane that we should try to keep the code uniform for
the task and CPU case for consistency - and this is the state in the
patch.
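
For instance (just a sketch of that idea, not something proposed here),
perf_rotate_context could do:

#ifdef CONFIG_CGROUP_PERF
	cpu_rotate = cpuctx->ctx.rotate_necessary;
#else
	cpu_rotate = cpuctx->ctx.nr_events &&
		     cpuctx->ctx.nr_events != cpuctx->ctx.nr_active;
#endif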

Are there any blockers to getting this merged?

Thanks,
Ian


> >
> >> Thanks,
> >> Kan
> >>
> >>> refcount_t refcount;
> >>> struct task_struct *task;
> >>>
> >>> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >>> index abbd4b3b96c2..41ae424b9f1d 100644
> >>> --- a/kernel/events/core.c
> >>> +++ b/kernel/events/core.c
> >>> @@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
> >>> if (!ctx->nr_active || !(is_active & EVENT_ALL))
> >>> return;
> >>>
> >>> + /*
> >>> + * If we had been multiplexing, no rotations are necessary now no events
> >>> + * are active.
> >>> + */
> >>> + ctx->rotate_necessary = 0;
> >>> +
> >>> perf_pmu_disable(ctx->pmu);
> >>> if (is_active & EVENT_PINNED) {
> >>> list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
> >>> @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> >>> sid->can_add_hw = 0;
> >>> }
> >>>
> >>> + /*
> >>> + * If the group wasn't scheduled then set that multiplexing is necessary
> >>> + * for the context. Note, this won't be set if the event wasn't
> >>> + * scheduled due to event_filter_match failing due to the earlier
> >>> + * return.
> >>> + */
> >>> + if (event->state == PERF_EVENT_STATE_INACTIVE)
> >>> + sid->ctx->rotate_necessary = 1;
> >>> +
> >>> return 0;
> >>> }
> >>>
> >>> @@ -3690,24 +3705,17 @@ ctx_first_active(struct perf_event_context *ctx)
> >>> static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> >>> {
> >>> struct perf_event *cpu_event = NULL, *task_event = NULL;
> >>> - bool cpu_rotate = false, task_rotate = false;
> >>> - struct perf_event_context *ctx = NULL;
> >>> + struct perf_event_context *task_ctx = NULL;
> >>> + int cpu_rotate, task_rotate;
> >>>
> >>> /*
> >>> * Since we run this from IRQ context, nobody can install new
> >>> * events, thus the event count values are stable.
> >>> */
> >>>
> >>> - if (cpuctx->ctx.nr_events) {
> >>> - if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
> >>> - cpu_rotate = true;
> >>> - }
> >>> -
> >>> - ctx = cpuctx->task_ctx;
> >>> - if (ctx && ctx->nr_events) {
> >>> - if (ctx->nr_events != ctx->nr_active)
> >>> - task_rotate = true;
> >>> - }
> >>> + cpu_rotate = cpuctx->ctx.rotate_necessary;
> >>> + task_ctx = cpuctx->task_ctx;
> >>> + task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
> >>>
> >>> if (!(cpu_rotate || task_rotate))
> >>> return false;
> >>> @@ -3716,7 +3724,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> >>> perf_pmu_disable(cpuctx->ctx.pmu);
> >>>
> >>> if (task_rotate)
> >>> - task_event = ctx_first_active(ctx);
> >>> + task_event = ctx_first_active(task_ctx);
> >>> if (cpu_rotate)
> >>> cpu_event = ctx_first_active(&cpuctx->ctx);
> >>>
> >>> @@ -3724,17 +3732,17 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
> >>> * As per the order given at ctx_resched() first 'pop' task flexible
> >>> * and then, if needed CPU flexible.
> >>> */
> >>> - if (task_event || (ctx && cpu_event))
> >>> - ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
> >>> + if (task_event || (task_ctx && cpu_event))
> >>> + ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
> >>> if (cpu_event)
> >>> cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> >>>
> >>> if (task_event)
> >>> - rotate_ctx(ctx, task_event);
> >>> + rotate_ctx(task_ctx, task_event);
> >>> if (cpu_event)
> >>> rotate_ctx(&cpuctx->ctx, cpu_event);
> >>>
> >>> - perf_event_sched_in(cpuctx, ctx, current);
> >>> + perf_event_sched_in(cpuctx, task_ctx, current);
> >>>
> >>> perf_pmu_enable(cpuctx->ctx.pmu);
> >>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> >>>

2019-06-21 08:25:06

by Peter Zijlstra

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Sat, Jun 01, 2019 at 01:27:22AM -0700, Ian Rogers wrote:
> @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> sid->can_add_hw = 0;
> }
>
> + /*
> + * If the group wasn't scheduled then set that multiplexing is necessary
> + * for the context. Note, this won't be set if the event wasn't
> + * scheduled due to event_filter_match failing due to the earlier
> + * return.
> + */
> + if (event->state == PERF_EVENT_STATE_INACTIVE)
> + sid->ctx->rotate_necessary = 1;
> +
> return 0;
> }

That looked odd; which had me look harder at this function which
resulted in the below. Should we not terminate the context iteration
the moment one flexible thingy fails to schedule?

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2314,12 +2314,8 @@ group_sched_in(struct perf_event *group_
return 0;

pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
-
- if (event_sched_in(group_event, cpuctx, ctx)) {
- pmu->cancel_txn(pmu);
- perf_mux_hrtimer_restart(cpuctx);
- return -EAGAIN;
- }
+ if (event_sched_in(group_event, cpuctx, ctx))
+ goto cancel;

/*
* Schedule in siblings as one group (if any):
@@ -2348,10 +2344,9 @@ group_sched_in(struct perf_event *group_
}
event_sched_out(group_event, cpuctx, ctx);

+cancel:
pmu->cancel_txn(pmu);
-
perf_mux_hrtimer_restart(cpuctx);
-
return -EAGAIN;
}

@@ -3317,6 +3312,7 @@ static int pinned_sched_in(struct perf_e
static int flexible_sched_in(struct perf_event *event, void *data)
{
struct sched_in_data *sid = data;
+ int ret;

if (event->state <= PERF_EVENT_STATE_OFF)
return 0;
@@ -3325,21 +3321,15 @@ static int flexible_sched_in(struct perf
return 0;

if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
- if (!group_sched_in(event, sid->cpuctx, sid->ctx))
- list_add_tail(&event->active_list, &sid->ctx->flexible_active);
- else
+ ret = group_sched_in(event, sid->cpuctx, sid->ctx);
+ if (ret) {
sid->can_add_hw = 0;
+ sid->ctx->rotate_necessary = 1;
+ return ret;
+ }
+ list_add_tail(&event->active_list, &sid->ctx->flexible_active);
}

- /*
- * If the group wasn't scheduled then set that multiplexing is necessary
- * for the context. Note, this won't be set if the event wasn't
- * scheduled due to event_filter_match failing due to the earlier
- * return.
- */
- if (event->state == PERF_EVENT_STATE_INACTIVE)
- sid->ctx->rotate_necessary = 1;
-
return 0;
}

2019-06-21 18:03:18

by Ian Rogers

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Fri, Jun 21, 2019 at 1:24 AM Peter Zijlstra <[email protected]> wrote:
>
> On Sat, Jun 01, 2019 at 01:27:22AM -0700, Ian Rogers wrote:
> > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> > sid->can_add_hw = 0;
> > }
> >
> > + /*
> > + * If the group wasn't scheduled then set that multiplexing is necessary
> > + * for the context. Note, this won't be set if the event wasn't
> > + * scheduled due to event_filter_match failing due to the earlier
> > + * return.
> > + */
> > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > + sid->ctx->rotate_necessary = 1;
> > +
> > return 0;
> > }
>
> That looked odd; which had me look harder at this function which
> resulted in the below. Should we not terminate the context iteration
> the moment one flexible thingy fails to schedule?

If we knew all the events were hardware events then this would be
true, but as there may be software events, which always schedule, the
continued iteration is necessary.
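
As a toy illustration of why the walk keeps going (made-up types and
values, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

struct toy_event { const char *name; bool is_hw; };

int main(void)
{
	struct toy_event flexible[] = {
		{ "hw-group-A", true  },	/* takes the last free counter */
		{ "hw-group-B", true  },	/* fails: no counter left      */
		{ "sw-event-C", false },	/* software: always schedules  */
	};
	int free_hw_counters = 1;
	bool can_add_hw = true, rotate_necessary = false;

	for (unsigned int i = 0; i < sizeof(flexible) / sizeof(flexible[0]); i++) {
		struct toy_event *e = &flexible[i];

		if (!e->is_hw) {			/* software always goes on */
			printf("%s: scheduled\n", e->name);
			continue;
		}
		if (can_add_hw && free_hw_counters > 0) {
			free_hw_counters--;
			printf("%s: scheduled\n", e->name);
		} else {
			can_add_hw = false;		/* stop trying hw groups  */
			rotate_necessary = true;	/* multiplexing is needed */
			printf("%s: inactive\n", e->name);
		}
	}
	/* sw-event-C is still scheduled even after hw-group-B fails. */
	printf("rotate_necessary = %d\n", rotate_necessary);
	return 0;
}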

Thanks,
Ian

> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2314,12 +2314,8 @@ group_sched_in(struct perf_event *group_
> return 0;
>
> pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
> -
> - if (event_sched_in(group_event, cpuctx, ctx)) {
> - pmu->cancel_txn(pmu);
> - perf_mux_hrtimer_restart(cpuctx);
> - return -EAGAIN;
> - }
> + if (event_sched_in(group_event, cpuctx, ctx))
> + goto cancel;
>
> /*
> * Schedule in siblings as one group (if any):
> @@ -2348,10 +2344,9 @@ group_sched_in(struct perf_event *group_
> }
> event_sched_out(group_event, cpuctx, ctx);
>
> +cancel:
> pmu->cancel_txn(pmu);
> -
> perf_mux_hrtimer_restart(cpuctx);
> -
> return -EAGAIN;
> }
>
> @@ -3317,6 +3312,7 @@ static int pinned_sched_in(struct perf_e
> static int flexible_sched_in(struct perf_event *event, void *data)
> {
> struct sched_in_data *sid = data;
> + int ret;
>
> if (event->state <= PERF_EVENT_STATE_OFF)
> return 0;
> @@ -3325,21 +3321,15 @@ static int flexible_sched_in(struct perf
> return 0;
>
> if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
> - if (!group_sched_in(event, sid->cpuctx, sid->ctx))
> - list_add_tail(&event->active_list, &sid->ctx->flexible_active);
> - else
> + ret = group_sched_in(event, sid->cpuctx, sid->ctx);
> + if (ret) {
> sid->can_add_hw = 0;
> + sid->ctx->rotate_necessary = 1;
> + return ret;
> + }
> + list_add_tail(&event->active_list, &sid->ctx->flexible_active);
> }
>
> - /*
> - * If the group wasn't scheduled then set that multiplexing is necessary
> - * for the context. Note, this won't be set if the event wasn't
> - * scheduled due to event_filter_match failing due to the earlier
> - * return.
> - */
> - if (event->state == PERF_EVENT_STATE_INACTIVE)
> - sid->ctx->rotate_necessary = 1;
> -
> return 0;
> }
>

2019-06-24 08:28:07

by Peter Zijlstra

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Fri, Jun 21, 2019 at 11:01:29AM -0700, Ian Rogers wrote:
> On Fri, Jun 21, 2019 at 1:24 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Sat, Jun 01, 2019 at 01:27:22AM -0700, Ian Rogers wrote:
> > > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> > > sid->can_add_hw = 0;
> > > }
> > >
> > > + /*
> > > + * If the group wasn't scheduled then set that multiplexing is necessary
> > > + * for the context. Note, this won't be set if the event wasn't
> > > + * scheduled due to event_filter_match failing due to the earlier
> > > + * return.
> > > + */
> > > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > > + sid->ctx->rotate_necessary = 1;
> > > +
> > > return 0;
> > > }
> >
> > That looked odd; which had me look harder at this function which
> > resulted in the below. Should we not terminate the context iteration
> > the moment one flexible thingy fails to schedule?
>
> If we knew all the events were hardware events then this would be
> true, as there may be software events that always schedule then the
> continued iteration is necessary.

But this is the 'old' code, where this is guaranteed by the context.
That is, if this is a hardware context, there will only be software
events due to them being in a group with hardware events.

If this is a software group, then we'll never fail to schedule and we'll
not get in this branch to begin with.

Or am I now confused for having been staring at two different code-bases
at the same time?

Subject: [tip:perf/core] perf/cgroups: Don't rotate events for cgroups unnecessarily

Commit-ID: fd7d55172d1e2e501e6da0a5c1de25f06612dc2e
Gitweb: https://git.kernel.org/tip/fd7d55172d1e2e501e6da0a5c1de25f06612dc2e
Author: Ian Rogers <[email protected]>
AuthorDate: Sat, 1 Jun 2019 01:27:22 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 24 Jun 2019 19:30:04 +0200

perf/cgroups: Don't rotate events for cgroups unnecessarily

Currently perf_rotate_context assumes that if the context's nr_events !=
nr_active a rotation is necessary for perf event multiplexing. With
cgroups, nr_events is the total count of events for all cgroups and
nr_active will not include events in a cgroup other than the current
task's. This makes rotation appear necessary for cgroups when it is not.

Add a perf_event_context flag that is set when rotation is necessary.
Clear the flag during sched_out and set it when a flexible sched_in
fails due to resources.

Signed-off-by: Ian Rogers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 42 ++++++++++++++++++++++--------------------
2 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3dc01cf98e16..2ddae518dce6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -749,6 +749,11 @@ struct perf_event_context {
int nr_stat;
int nr_freq;
int rotate_disable;
+ /*
+ * Set when nr_events != nr_active, except tolerant to events not
+ * necessary to be active due to scheduling constraints, such as cgroups.
+ */
+ int rotate_necessary;
refcount_t refcount;
struct task_struct *task;

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 118ad1aef6af..23efe6792abc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2952,6 +2952,12 @@ static void ctx_sched_out(struct perf_event_context *ctx,
if (!ctx->nr_active || !(is_active & EVENT_ALL))
return;

+ /*
+ * If we had been multiplexing, no rotations are necessary, now no events
+ * are active.
+ */
+ ctx->rotate_necessary = 0;
+
perf_pmu_disable(ctx->pmu);
if (is_active & EVENT_PINNED) {
list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
@@ -3319,10 +3325,13 @@ static int flexible_sched_in(struct perf_event *event, void *data)
return 0;

if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) {
- if (!group_sched_in(event, sid->cpuctx, sid->ctx))
- list_add_tail(&event->active_list, &sid->ctx->flexible_active);
- else
+ int ret = group_sched_in(event, sid->cpuctx, sid->ctx);
+ if (ret) {
sid->can_add_hw = 0;
+ sid->ctx->rotate_necessary = 1;
+ return 0;
+ }
+ list_add_tail(&event->active_list, &sid->ctx->flexible_active);
}

return 0;
@@ -3690,24 +3699,17 @@ ctx_first_active(struct perf_event_context *ctx)
static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
{
struct perf_event *cpu_event = NULL, *task_event = NULL;
- bool cpu_rotate = false, task_rotate = false;
- struct perf_event_context *ctx = NULL;
+ struct perf_event_context *task_ctx = NULL;
+ int cpu_rotate, task_rotate;

/*
* Since we run this from IRQ context, nobody can install new
* events, thus the event count values are stable.
*/

- if (cpuctx->ctx.nr_events) {
- if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
- cpu_rotate = true;
- }
-
- ctx = cpuctx->task_ctx;
- if (ctx && ctx->nr_events) {
- if (ctx->nr_events != ctx->nr_active)
- task_rotate = true;
- }
+ cpu_rotate = cpuctx->ctx.rotate_necessary;
+ task_ctx = cpuctx->task_ctx;
+ task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;

if (!(cpu_rotate || task_rotate))
return false;
@@ -3716,7 +3718,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
perf_pmu_disable(cpuctx->ctx.pmu);

if (task_rotate)
- task_event = ctx_first_active(ctx);
+ task_event = ctx_first_active(task_ctx);
if (cpu_rotate)
cpu_event = ctx_first_active(&cpuctx->ctx);

@@ -3724,17 +3726,17 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
* As per the order given at ctx_resched() first 'pop' task flexible
* and then, if needed CPU flexible.
*/
- if (task_event || (ctx && cpu_event))
- ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+ if (task_event || (task_ctx && cpu_event))
+ ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
if (cpu_event)
cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);

if (task_event)
- rotate_ctx(ctx, task_event);
+ rotate_ctx(task_ctx, task_event);
if (cpu_event)
rotate_ctx(&cpuctx->ctx, cpu_event);

- perf_event_sched_in(cpuctx, ctx, current);
+ perf_event_sched_in(cpuctx, task_ctx, current);

perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);

2019-06-27 21:48:16

by Ian Rogers

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Mon, Jun 24, 2019 at 12:55 AM Peter Zijlstra
<[email protected]> wrote:
>
> On Fri, Jun 21, 2019 at 11:01:29AM -0700, Ian Rogers wrote:
> > On Fri, Jun 21, 2019 at 1:24 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Sat, Jun 01, 2019 at 01:27:22AM -0700, Ian Rogers wrote:
> > > > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> > > > sid->can_add_hw = 0;
> > > > }
> > > >
> > > > + /*
> > > > + * If the group wasn't scheduled then set that multiplexing is necessary
> > > > + * for the context. Note, this won't be set if the event wasn't
> > > > + * scheduled due to event_filter_match failing due to the earlier
> > > > + * return.
> > > > + */
> > > > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > > > + sid->ctx->rotate_necessary = 1;
> > > > +
> > > > return 0;
> > > > }
> > >
> > > That looked odd; which had me look harder at this function which
> > > resulted in the below. Should we not terminate the context iteration
> > > the moment one flexible thingy fails to schedule?
> >
> > If we knew all the events were hardware events then this would be
> > true, as there may be software events that always schedule then the
> > continued iteration is necessary.
>
> But this is the 'old' code, where this is guaranteed by the context.
> That is, if this is a hardware context, there will only be software
> events due to them being in a group with hardware events.
>
> If this is a software group, then we'll never fail to schedule and we'll
> not get in this branch to begin with.
>
> Or am I now confused for having been staring at two different code-bases
> at the same time?

I believe you're right and I'd overlooked this. I think there is a
more efficient version of this code possible that I can follow up
with. There are 3 perf_event_contexts, potentially a sw and hw context
within the task_struct and one per-CPU in perf_cpu_context. With this
change I'm focussed on improving rotation of cgroup events that appear
system wide within the per-CPU context. Without cgroup events the
system wide events don't need to be iterated over during scheduling
in. The branch that can set rotate_necessary is only needed for the
task hardware events. For system wide events, when cgroup events are
being scheduled in, the branch is needed because rotation may be
necessary. It'd be possible to have variants of flexible_sched_in that
behave differently in the task software event context, and the system
wide and task hardware contexts.

I have a series of patches related to Kan Liang's cgroup
perf_event_groups improvements. I'll mail these out and see if I can
avoid the branch in the task software event context case.

Thanks,
Ian

2019-08-23 23:04:27

by Ganapatrao Kulkarni

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

Hi,

We are seeing a regression with our uncore perf driver (Marvell's
ThunderX2, ARM64 server platform) on 5.3-rc1.
After bisecting, it turned out to be this patch causing the issue.

Test case:
Load the module and run perf for more than 4 events (we have 4 counters,
so event multiplexing takes place for more than 4 events), then unload
the module.
With this sequence of testing, the system hangs (soft lockup) after 2
or 3 iterations. The same test runs for hours on 5.2.

while [ 1 ]
do
rmmod thunderx2_pmu
modprobe thunderx2_pmu
perf stat -a -e \
uncore_dmc_0/cnt_cycles/,\
uncore_dmc_0/data_transfers/,\
uncore_dmc_0/read_txns/,\
uncore_dmc_0/config=0xE/,\
uncore_dmc_0/write_txns/ sleep 1
sleep 2
done

Is this patch tested adequately?

On Fri, Jun 28, 2019 at 3:18 AM Ian Rogers <[email protected]> wrote:
>
> On Mon, Jun 24, 2019 at 12:55 AM Peter Zijlstra
> <[email protected]> wrote:
> >
> > On Fri, Jun 21, 2019 at 11:01:29AM -0700, Ian Rogers wrote:
> > > On Fri, Jun 21, 2019 at 1:24 AM Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > On Sat, Jun 01, 2019 at 01:27:22AM -0700, Ian Rogers wrote:
> > > > > @@ -3325,6 +3331,15 @@ static int flexible_sched_in(struct perf_event *event, void *data)
> > > > > sid->can_add_hw = 0;
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * If the group wasn't scheduled then set that multiplexing is necessary
> > > > > + * for the context. Note, this won't be set if the event wasn't
> > > > > + * scheduled due to event_filter_match failing due to the earlier
> > > > > + * return.
> > > > > + */
> > > > > + if (event->state == PERF_EVENT_STATE_INACTIVE)
> > > > > + sid->ctx->rotate_necessary = 1;
> > > > > +
> > > > > return 0;
> > > > > }
> > > >
> > > > That looked odd; which had me look harder at this function which
> > > > resulted in the below. Should we not terminate the context iteration
> > > > the moment one flexible thingy fails to schedule?
> > >
> > > If we knew all the events were hardware events then this would be
> > > true, as there may be software events that always schedule then the
> > > continued iteration is necessary.
> >
> > But this is the 'old' code, where this is guaranteed by the context.
> > That is, if this is a hardware context, there will only be software
> > events due to them being in a group with hardware events.
> >
> > If this is a software group, then we'll never fail to schedule and we'll
> > not get in this branch to begin with.
> >
> > Or am I now confused for having been staring at two different code-bases
> > at the same time?
>
> I believe you're right and I'd overlooked this. I think there is a
> more efficient version of this code possible that I can follow up
> with. There are 3 perf_event_contexts, potentially a sw and hw context
> within the task_struct and one per-CPU in perf_cpu_context. With this
> change I'm focussed on improving rotation of cgroup events that appear
> system wide within the per-CPU context. Without cgroup events the
> system wide events don't need to be iterated over during scheduling
> in. The branch that can set rotate_necessary will only be necessary
> for the task hardware events. For system wide events, considered with
> cgroup mode scheduling in, the branch is necessary as rotation may be
> necessary. It'd be possible to have variants of flexible_sched_in that
> behave differently in the task software event context, and the system
> wide and task hardware contexts.
>
> I have a series of patches related to Kan Liang's cgroup
> perf_event_groups improvements. I'll mail these out and see if I can
> avoid the branch in the task software event context case.
>
> Thanks,
> Ian

Thanks,
Ganapat

2019-08-23 23:11:40

by Peter Zijlstra

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:

> We are seeing regression with our uncore perf driver(Marvell's
> ThunderX2, ARM64 server platform) on 5.3-Rc1.
> After bisecting, it turned out to be this patch causing the issue.

Funnily enough; the email you replied to didn't contain a patch.

> Test case:
> Load module and run perf for more than 4 events( we have 4 counters,
> event multiplexing takes place for more than 4 events), then unload
> module.
> With this sequence of testing, the system hangs(soft lockup) after 2
> or 3 iterations. Same test runs for hours on 5.2.
>
> while [ 1 ]
> do
> rmmod thunderx2_pmu
> modprobe thunderx2_pmu
> perf stat -a -e \
> uncore_dmc_0/cnt_cycles/,\
> uncore_dmc_0/data_transfers/,\
> uncore_dmc_0/read_txns/,\
> uncore_dmc_0/config=0xE/,\
> uncore_dmc_0/write_txns/ sleep 1
> sleep 2
> done

Can you reproduce without the module load+unload? I don't think people
routinely unload modules.


2019-08-23 23:20:22

by Ganapatrao Kulkarni

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

Hi Peter,

On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra <[email protected]> wrote:
>
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?

Sorry for the top-posting.
>
> On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
>
> > We are seeing regression with our uncore perf driver(Marvell's
> > ThunderX2, ARM64 server platform) on 5.3-Rc1.
> > After bisecting, it turned out to be this patch causing the issue.
>
> Funnily enough; the email you replied to didn't contain a patch.

Hmm sorry, not sure why the patch is clipped-off, I see it in my inbox.
>
> > Test case:
> > Load module and run perf for more than 4 events( we have 4 counters,
> > event multiplexing takes place for more than 4 events), then unload
> > module.
> > With this sequence of testing, the system hangs(soft lockup) after 2
> > or 3 iterations. Same test runs for hours on 5.2.
> >
> > while [ 1 ]
> > do
> > rmmod thunderx2_pmu
> > modprobe thunderx2_pmu
> > perf stat -a -e \
> > uncore_dmc_0/cnt_cycles/,\
> > uncore_dmc_0/data_transfers/,\
> > uncore_dmc_0/read_txns/,\
> > uncore_dmc_0/config=0xE/,\
> > uncore_dmc_0/write_txns/ sleep 1
> > sleep 2
> > done
>
> Can you reproduce without the module load+unload? I don't think people
> routinely unload modules.

The issue won't happen if the module is not unloaded/reloaded.
IMHO, this could be a potential bug!

>
>

Thanks,
Ganapat

2019-08-23 23:21:15

by Peter Zijlstra

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Fri, Aug 23, 2019 at 06:26:34PM +0530, Ganapatrao Kulkarni wrote:
> On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra <[email protected]> wrote:
> > On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
> >
> > > We are seeing regression with our uncore perf driver(Marvell's
> > > ThunderX2, ARM64 server platform) on 5.3-Rc1.
> > > After bisecting, it turned out to be this patch causing the issue.
> >
> > Funnily enough; the email you replied to didn't contain a patch.
>
> Hmm sorry, not sure why the patch is clipped-off, I see it in my inbox.

Your email is in a random spot of the discussion for me. At least it was
fairly easy to find the related patch.

> > > Test case:
> > > Load module and run perf for more than 4 events( we have 4 counters,
> > > event multiplexing takes place for more than 4 events), then unload
> > > module.
> > > With this sequence of testing, the system hangs(soft lockup) after 2
> > > or 3 iterations. Same test runs for hours on 5.2.
> > >
> > > while [ 1 ]
> > > do
> > > rmmod thunderx2_pmu
> > > modprobe thunderx2_pmu
> > > perf stat -a -e \
> > > uncore_dmc_0/cnt_cycles/,\
> > > uncore_dmc_0/data_transfers/,\
> > > uncore_dmc_0/read_txns/,\
> > > uncore_dmc_0/config=0xE/,\
> > > uncore_dmc_0/write_txns/ sleep 1
> > > sleep 2
> > > done
> >
> > Can you reproduce without the module load+unload? I don't think people
> > routinely unload modules.
>
> The issue wont happen, if module is not unloaded/reloaded.
> IMHO, this could be potential bug!

Does the softlockup give a useful stacktrace? I don't have a thunderx2
so I cannot reproduce.


2019-09-18 09:08:07

by Ganapatrao Kulkarni

Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

On Fri, Aug 23, 2019 at 6:33 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Aug 23, 2019 at 06:26:34PM +0530, Ganapatrao Kulkarni wrote:
> > On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra <[email protected]> wrote:
> > > On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
> > >
> > > > We are seeing regression with our uncore perf driver(Marvell's
> > > > ThunderX2, ARM64 server platform) on 5.3-Rc1.
> > > > After bisecting, it turned out to be this patch causing the issue.
> > >
> > > Funnily enough; the email you replied to didn't contain a patch.
> >
> > Hmm sorry, not sure why the patch is clipped-off, I see it in my inbox.
>
> Your email is in a random spot of the discussion for me. At least it was
> fairly easy to find the related patch.
>
> > > > Test case:
> > > > Load module and run perf for more than 4 events( we have 4 counters,
> > > > event multiplexing takes place for more than 4 events), then unload
> > > > module.
> > > > With this sequence of testing, the system hangs(soft lockup) after 2
> > > > or 3 iterations. Same test runs for hours on 5.2.
> > > >
> > > > while [ 1 ]
> > > > do
> > > > rmmod thunderx2_pmu
> > > > modprobe thunderx2_pmu
> > > > perf stat -a -e \
> > > > uncore_dmc_0/cnt_cycles/,\
> > > > uncore_dmc_0/data_transfers/,\
> > > > uncore_dmc_0/read_txns/,\
> > > > uncore_dmc_0/config=0xE/,\
> > > > uncore_dmc_0/write_txns/ sleep 1
> > > > sleep 2
> > > > done
> > >
> > > Can you reproduce without the module load+unload? I don't think people
> > > routinely unload modules.
> >
> > The issue wont happen, if module is not unloaded/reloaded.
> > IMHO, this could be potential bug!
>
> Does the softlockup give a useful stacktrace? I don't have a thunderx2
> so I cannot reproduce.

Sorry for the late reply. Below is the dump that I am getting when I
hit the soft lockup.
Any suggestions to debug this further?

Sequence of commands which leads to this lockup:
insmod thunderx2_pmu.ko
perf stat -e \
uncore_dmc_0/cnt_cycles/,\
uncore_dmc_0/data_transfers/,\
uncore_dmc_0/read_txns/,\
uncore_dmc_0/config=0xE/,\
uncore_dmc_0/write_txns/\
rmmod thunderx2_pmu
insmod thunderx2_pmu.ko

root@SBR-26>~>> [ 1065.946772] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [perf:5206]
[ 1065.953722] Modules linked in: thunderx2_pmu(OE) nls_iso8859_1 joydev input_leds bridge ipmi_ssif ipmi_devintf stp llc
ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx xor xor_neon raid6_pq raid1 raid0 multipath linear aes_ce_blk hid_generic aes_ce_cipher usbhid uas
usb_storage hid ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
i40e i2c_smbus bnx2x crct10dif_ce ghash_ce e1000e sha2_ce mpt3sas nvme sha256_arm64 ptp ahci raid_class sha1_ce
nvme_core scsi_transport_sas mdio libahci pps_core libcrc32c gpio_xlp i2c_xlp9xx aes_neon_bs aes_neon_blk crypto_simd
cryptd aes_arm64 [last unloaded: thunderx2_pmu]
[ 1066.029640] CPU: 117 PID: 5206 Comm: perf Tainted: G OE 5.3.0+ #160
[ 1066.037109] Hardware name: Cavium Inc. Saber/Saber, BIOS TX2-FW-Release-7.2-build_08-0-g14f8c5bf8a 12/18/2018
[ 1066.047009] pstate: 20400009 (nzCv daif +PAN -UAO)
[ 1066.051799] pc : smp_call_function_single+0x198/0x1b0
[ 1066.056838] lr : smp_call_function_single+0x16c/0x1b0
[ 1066.061875] sp : fffffc00434cfc50
[ 1066.065177] x29: fffffc00434cfc50 x28: 0000000000000000
[ 1066.070475] x27: fffffebed4d2d952 x26: ffffffffd4d2d800
[ 1066.075774] x25: fffffddf5da1e240 x24: 0000000000000001
[ 1066.081073] x23: fffffc00434cfd38 x22: fffffc001026adb8
[ 1066.086371] x21: 0000000000000000 x20: fffffc0011843000
[ 1066.091669] x19: fffffc00434cfca0 x18: 0000000000000000
[ 1066.096968] x17: 0000000000000000 x16: 0000000000000000
[ 1066.102266] x15: 0000000000000000 x14: 0000000000000000
[ 1066.107564] x13: 0000000000000000 x12: 0000000000000020
[ 1066.112862] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[ 1066.118161] x9 : 0000000000000000 x8 : fffffebed44ec5e8
[ 1066.123459] x7 : 0000000000000000 x6 : fffffc00434cfca0
[ 1066.128757] x5 : fffffc00434cfca0 x4 : 0000000000000001
[ 1066.134055] x3 : fffffc00434cfcb8 x2 : 0000000000000000
[ 1066.139353] x1 : 0000000000000003 x0 : 0000000000000000
[ 1066.144652] Call trace:
[ 1066.147088] smp_call_function_single+0x198/0x1b0
[ 1066.151784] perf_install_in_context+0x1b4/0x1d8
[ 1066.156394] __se_sys_perf_event_open+0x634/0xa68
[ 1066.161089] __arm64_sys_perf_event_open+0x1c/0x28
[ 1066.165881] el0_svc_common.constprop.0+0x78/0x168
[ 1066.170660] el0_svc_handler+0x34/0x90
[ 1066.174399] el0_svc+0x8/0xc
[ 1100.985033] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1100.990944] rcu: 0-...0: (4 ticks this GP) idle=bb6/0/0x1
softirq=2178/2178 fqs=7365
[ 1100.998864] (detected by 154, t=15005 jiffies, g=14325, q=7133)
[ 1101.004857] Task dump for CPU 0:
[ 1101.008074] swapper/0 R running task 0 0 0 0x0000002a
[ 1101.015110] Call trace:
[ 1101.017553] __switch_to+0xbc/0x220
[ 1244.366396] INFO: task kworker/116:1:1364 blocked for more than 120 seconds.
[ 1244.373443] Tainted: G OEL 5.3.0+ #160
[ 1244.378847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1244.386673] kworker/116:1 D 0 1364 2 0x00000028
[ 1244.392166] Workqueue: events do_free_init
[ 1244.396260] Call trace:
[ 1244.398705] __switch_to+0x194/0x220
[ 1244.402282] __schedule+0x30c/0x758
[ 1244.405758] schedule+0x38/0xa8
[ 1244.408896] schedule_timeout+0x220/0x3a8
[ 1244.412902] wait_for_common+0x110/0x220
[ 1244.416819] wait_for_completion+0x28/0x38
[ 1244.420914] __wait_rcu_gp+0x170/0x1a8
[ 1244.424658] synchronize_rcu+0x70/0xa0
[ 1244.428403] do_free_init+0x3c/0x70
[ 1244.431889] process_one_work+0x1f8/0x438
[ 1244.435896] worker_thread+0x50/0x4b8
[ 1244.439554] kthread+0x130/0x138
[ 1244.442778] ret_from_fork+0x10/0x18
[ 1280.996503] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1281.002415] rcu: 0-...0: (4 ticks this GP) idle=bb6/0/0x1
softirq=2178/2178 fqs=29040
[ 1281.010420] (detected by 186, t=60010 jiffies, g=14325, q=22339)
[ 1281.016500] Task dump for CPU 0:
[ 1281.019715] swapper/0 R running task 0 0 0 0x0000002a
[ 1281.026751] Call trace:
[ 1281.029189] __switch_to+0xbc/0x220
[ 1365.192710] INFO: task kworker/116:1:1364 blocked for more than 241 seconds.
[ 1365.199756] Tainted: G OEL 5.3.0+ #160
[ 1365.205158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1365.212984] kworker/116:1 D 0 1364 2 0x00000028
[ 1365.218475] Workqueue: events do_free_init
[ 1365.222569] Call trace:
[ 1365.225014] __switch_to+0x194/0x220
[ 1365.228589] __schedule+0x30c/0x758
[ 1365.232065] schedule+0x38/0xa8
[ 1365.235203] schedule_timeout+0x220/0x3a8
[ 1365.239209] wait_for_common+0x110/0x220
[ 1365.243127] wait_for_completion+0x28/0x38
[ 1365.247222] __wait_rcu_gp+0x170/0x1a8
[ 1365.250966] synchronize_rcu+0x70/0xa0
[ 1365.254711] do_free_init+0x3c/0x70
[ 1365.258198] process_one_work+0x1f8/0x438
[ 1365.262205] worker_thread+0x50/0x4b8
[ 1365.265863] kthread+0x130/0x138
[ 1365.269087] ret_from_fork+0x10/0x18
[ 1461.008017] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1461.013929] rcu: 0-...0: (4 ticks this GP) idle=bb6/0/0x1
softirq=2178/2178 fqs=51540
[ 1461.021930] (detected by 123, t=105015 jiffies, g=14325, q=37487)
[ 1461.028096] Task dump for CPU 0:
[ 1461.031311] swapper/0 R running task 0 0 0 0x0000002a
[ 1461.038347] Call trace:
[ 1461.040785] __switch_to+0xbc/0x220
[ 1486.023005] INFO: task kworker/116:1:1364 blocked for more than 362 seconds.
[ 1486.030049] Tainted: G OEL 5.3.0+ #160
[ 1486.035447] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1486.043272] kworker/116:1 D 0 1364 2 0x00000028
[ 1486.048762] Workqueue: events do_free_init
[ 1486.052856] Call trace:
[ 1486.055300] __switch_to+0x194/0x220
[ 1486.058875] __schedule+0x30c/0x758
[ 1486.062351] schedule+0x38/0xa8
[ 1486.065487] schedule_timeout+0x220/0x3a8
[ 1486.069493] wait_for_common+0x110/0x220
[ 1486.073411] wait_for_completion+0x28/0x38
[ 1486.077504] __wait_rcu_gp+0x170/0x1a8
[ 1486.081248] synchronize_rcu+0x70/0xa0
[ 1486.084994] do_free_init+0x3c/0x70
[ 1486.088479] process_one_work+0x1f8/0x438
[ 1486.092486] worker_thread+0x50/0x4b8
[ 1486.096143] kthread+0x130/0x138
[ 1486.099367] ret_from_fork+0x10/0x18
[ 1606.849300] INFO: task kworker/116:1:1364 blocked for more than 483 seconds.
[ 1606.856345] Tainted: G OEL 5.3.0+ #160


Thanks,
Ganapat

2019-10-15 05:39:06

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [PATCH] perf cgroups: Don't rotate events for cgroups unnecessarily

Hi Peter,

On Wed, Sep 18, 2019 at 12:51 PM Ganapatrao Kulkarni <[email protected]> wrote:
>
> On Fri, Aug 23, 2019 at 6:33 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Aug 23, 2019 at 06:26:34PM +0530, Ganapatrao Kulkarni wrote:
> > > On Fri, Aug 23, 2019 at 5:29 PM Peter Zijlstra <[email protected]> wrote:
> > > > On Fri, Aug 23, 2019 at 04:13:46PM +0530, Ganapatrao Kulkarni wrote:
> > > >
> > > > > We are seeing a regression with our uncore perf driver (Marvell's
> > > > > ThunderX2, an ARM64 server platform) on 5.3-rc1.
> > > > > After bisecting, it turned out to be this patch causing the issue.
> > > >
> > > > Funnily enough, the email you replied to didn't contain a patch.
> > >
> > > Hmm, sorry, not sure why the patch is clipped off; I see it in my inbox.
> >
> > Your email is in a random spot of the discussion for me. At least it was
> > fairly easy to find the related patch.
> >
> > > > > Test case:
> > > > > Load the module and run perf for more than 4 events (we have 4 counters,
> > > > > so event multiplexing takes place for more than 4 events), then unload
> > > > > the module.
> > > > > With this sequence of testing, the system hangs (soft lockup) after 2
> > > > > or 3 iterations. The same test runs for hours on 5.2.
> > > > >
> > > > > while [ 1 ]
> > > > > do
> > > > > rmmod thunderx2_pmu
> > > > > modprobe thunderx2_pmu
> > > > > perf stat -a -e \
> > > > > uncore_dmc_0/cnt_cycles/,\
> > > > > uncore_dmc_0/data_transfers/,\
> > > > > uncore_dmc_0/read_txns/,\
> > > > > uncore_dmc_0/config=0xE/,\
> > > > > uncore_dmc_0/write_txns/ sleep 1
> > > > > sleep 2
> > > > > done
> > > >
> > > > Can you reproduce without the module load+unload? I don't think people
> > > > routinely unload modules.
> > >
> > > The issue won't happen if the module is not unloaded/reloaded.
> > > IMHO, this could be a potential bug!
> >
> > Does the soft lockup give a useful stack trace? I don't have a ThunderX2
> > so I cannot reproduce.
>
> Sorry for the late reply; below is the dump that I am getting when I
> hit the soft lockup.
> Any suggestions on how to debug this further?
>
> The sequence of commands which leads to this lockup:
> insmod thunderx2_pmu.ko
> perf stat -e \
> uncore_dmc_0/cnt_cycles/,\
> uncore_dmc_0/data_transfers/,\
> uncore_dmc_0/read_txns/,\
> uncore_dmc_0/config=0xE/,\
> uncore_dmc_0/write_txns/\
> rmmod thunderx2_pmu
> insmod thunderx2_pmu.ko
>
> [... soft-lockup splat (watchdog: CPU#117 stuck in perf_install_in_context),
> RCU stall and hung kworker/116:1 reports snipped -- identical to the dump
> shown in full earlier in the thread ...]
>
>

I am able to see the same system lockup on an Intel Xeon machine as
well, if I run the same sequence of commands.
If I run with 4 events, which does not kick in event multiplexing, no
issues are seen.
If I try with more than 4 events, which leads to event multiplexing,
the soft lockup happens.
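
t1.sh is essentially just this modprobe / perf stat / rmmod sequence (a
minimal sketch based on the sh -x trace below; the actual script may
differ slightly):

#!/bin/sh
# Load the intel_uncore module, count five uncore_imc_0 events (more than
# 4, so event multiplexing kicks in as described above), then unload it.
sudo modprobe intel_uncore
sudo perf stat -a -e \
uncore_imc_0/cas_count_read/,\
uncore_imc_0/cas_count_write/,\
uncore_imc_0/clockticks/,\
uncore_imc_0/cas_count_read/,\
uncore_imc_0/cas_count_write/ \
sleep 1
sudo rmmod intel_uncore

The first run of the script completes (with roughly 80% multiplexing);
the soft lockup hits on the second run, right after the second modprobe,
as shown below.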

[gkulkarni@xeon-numa ~]$ uname -a
Linux xeon-numa 5.3.0 #3 SMP Mon Oct 14 15:47:06 IST 2019 x86_64
x86_64 x86_64 GNU/Linux
[gkulkarni@xeon-numa ~]$ cat /proc/cpuinfo | more
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping : 2
microcode : 0x29
cpu MHz : 1198.775
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca
sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm
abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid
ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm
xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
l1tf mds swapgs
bogomips : 4794.37
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

[gkulkarni@xeon-numa ~]$ sh -x t1.sh
+ sudo modprobe intel_uncore
+ sudo perf stat -a -e
uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/,uncore_imc_0/clockticks/,uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/
sleep 1

Performance counter stats for 'system wide':

          8.85 MiB  uncore_imc_0/cas_count_read/      (79.88%)
          3.10 MiB  uncore_imc_0/cas_count_write/     (80.10%)
        1781830530  uncore_imc_0/clockticks/          (79.98%)
          9.10 MiB  uncore_imc_0/cas_count_read/      (80.04%)
          3.08 MiB  uncore_imc_0/cas_count_write/     (80.00%)

1.004236760 seconds time elapsed

+ sudo rmmod intel_uncore
[gkulkarni@xeon-numa ~]$ sh -x t1.sh
+ sudo modprobe intel_uncore
Message from syslogd@localhost at Oct 15 10:10:22 ...
kernel:[ 625.240222] watchdog: BUG: soft lockup - CPU#18 stuck for
23s! [systemd:3973]

Message from syslogd@localhost at Oct 15 10:10:22 ...
kernel:watchdog: BUG: soft lockup - CPU#18 stuck for 23s! [systemd:3973]


Thanks,
Ganapat

> Thanks,
> Ganapat