2017-04-25 17:28:07

by Liang, Kan

Subject: RE: [RFC 0/6] optimize ctx switch with rb-tree

Hi David,

Is there any update on the patch series?

We recently encountered another performance issue on KNL. I think the rb-tree solution would also help with it.

Thanks,
Kan

> Subject: [RFC 0/6] optimize ctx switch with rb-tree
>
> Following the discussion in:
> https://patchwork.kernel.org/patch/9420035/
>
> This is an early version of a series of perf context switch optimizations.
>
> The main idea is to create and maintain a list of inactive events sorted by
> timestamp, and an rb-tree index into it. The rb-tree's keys are
> {cpu,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for CPU
> contexts.
>
> The rb-tree provides functions to find intervals in the inactive event list so
> that ctx_sched_in only has to visit the events that can potentially be
> scheduled (i.e. it avoids iterating over events bound to CPUs or cgroups that
> are not current).
>
> Since the inactive list is sorted by timestamp, rotation can be done by simply
> scheduling the events out and in. This implies that on each timer interrupt,
> the events rotate by q positions (where q is the number of hardware counters).
> This changes the current behavior of rotation.
> Feedback welcome!
>
> I haven't profiled the new approach. I am only assuming it will be superior
> when the number of per-cpu or distinct cgroup events is large.
>
> The last patch shows how perf_iterate_ctx can use the new rb-tree index to
> reduce the number of visited events. I haven't checked carefully whether
> locking and other things are correct.
>
> If these changes are in the right direction, a next version could remove some
> existing code; specifically, the lists ctx->pinned_groups and
> ctx->flexible_groups could be removed. Also, event_filter_match could be
> simplified when called on event groups filtered using the rb-tree, since both
> perform similar checks.
>
> David Carrillo-Cisneros (6):
> perf/core: create active and inactive event groups
> perf/core: add a rb-tree index to inactive_groups
> perf/core: use rb-tree to sched in event groups
> perf/core: avoid rb-tree traversal when no inactive events
> perf/core: rotation no longer neccesary. Behavior has changed. Beware
> perf/core: use rb-tree index to optimize filtered perf_iterate_ctx
>
> include/linux/perf_event.h | 13 ++
> kernel/events/core.c | 466
> +++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 426 insertions(+), 53 deletions(-)
>
> --
> 2.11.0.390.gc69c2f50cf-goog


2017-04-25 17:49:25

by David Carrillo-Cisneros

Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

Hi Kan,

It's still on my list, but I won't have time to work on it for at
least another month.

What issues did you encounter?

Thanks,
David

On Tue, Apr 25, 2017 at 10:27 AM, Liang, Kan <[email protected]> wrote:
> Hi David,
>
> Is there any update on the patch series?
>
> We recently encountered another performance issue on KNL. I think the rb-tree solution would also help with it.
>
> Thanks,
> Kan
>
>> Subject: [RFC 0/6] optimize ctx switch with rb-tree
>>
>> Following the discussion in:
>> https://patchwork.kernel.org/patch/9420035/
>>
>> This is an early version of a series of perf context switch optimizations.
>>
>> The main idea is to create and maintain a list of inactive events sorted by
>> timestamp, and an rb-tree index into it. The rb-tree's keys are
>> {cpu,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for CPU
>> contexts.
>>
>> The rb-tree provides functions to find intervals in the inactive event list so
>> that ctx_sched_in only has to visit the events that can potentially be
>> scheduled (i.e. it avoids iterating over events bound to CPUs or cgroups that
>> are not current).
>>
>> Since the inactive list is sorted by timestamp, rotation can be done by simply
>> scheduling the events out and in. This implies that on each timer interrupt,
>> the events rotate by q positions (where q is the number of hardware counters).
>> This changes the current behavior of rotation.
>> Feedback welcome!
>>
>> I haven't profiled the new approach. I am only assuming it will be superior
>> when the number of per-cpu or distinct cgroup events is large.
>>
>> The last patch shows how perf_iterate_ctx can use the new rb-tree index to
>> reduce the number of visited events. I haven't checked carefully whether
>> locking and other things are correct.
>>
>> If these changes are in the right direction, a next version could remove some
>> existing code; specifically, the lists ctx->pinned_groups and
>> ctx->flexible_groups could be removed. Also, event_filter_match could be
>> simplified when called on event groups filtered using the rb-tree, since both
>> perform similar checks.
>>
>> David Carrillo-Cisneros (6):
>> perf/core: create active and inactive event groups
>> perf/core: add a rb-tree index to inactive_groups
>> perf/core: use rb-tree to sched in event groups
>> perf/core: avoid rb-tree traversal when no inactive events
>> perf/core: rotation no longer neccesary. Behavior has changed. Beware
>> perf/core: use rb-tree index to optimize filtered perf_iterate_ctx
>>
>> include/linux/perf_event.h | 13 ++
>> kernel/events/core.c | 466
>> +++++++++++++++++++++++++++++++++++++++------
>> 2 files changed, 426 insertions(+), 53 deletions(-)
>>
>> --
>> 2.11.0.390.gc69c2f50cf-goog
>

2017-04-25 18:12:24

by Budankov, Alexey

Subject: RE: [RFC 0/6] optimize ctx switch with rb-tree

Hi David,

I profiled the single-threaded STREAM benchmark pinned to core #1 on a KNL machine with kernel 3.10.0-327.13.1.el7.x86_64 (from the MPSS distribution). I ran system-wide and per-process profiling for the same list of events and got a ~5x slowdown in the per-process scenario.
Simultaneously I collected ftrace logs for both cases and found a huge slowdown in the handling of the multiplexing timer interrupt in the per-process case compared to the system-wide one:

begin=>stream-22736 [001] d.h. 5409.272489: perf_cpu_hrtimer_handler <-__hrtimer_run_queues
stream-22736 [001] d.h. 5409.272490: perf_pmu_disable <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.272491: intel_pmu_disable_all <-x86_pmu_disable
stream-22736 [001] d.h. 5409.272492: intel_pmu_pebs_disable_all <-intel_pmu_disable_all
stream-22736 [001] d.h. 5409.272492: intel_pmu_lbr_disable_all <-intel_pmu_disable_all
stream-22736 [001] d.h. 5409.272494: perf_pmu_disable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.272494: perf_pmu_enable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.272495: perf_pmu_disable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.272506: perf_pmu_disable <-event_sched_out.isra.72
stream-22736 [001] d.h. 5409.272507: intel_pmu_disable_event <-x86_pmu_stop
stream-22736 [001] d.h. 5409.272509: intel_pmu_pebs_disable <-intel_pmu_disable_event
stream-22736 [001] d.h. 5409.272511: perf_event_update_userpage <-x86_pmu_del
=>stream-22736 [001] d.h. 5409.272511: perf_pmu_enable <-event_sched_out.isra.72
+2ms=>stream-22736 [001] d.h. 5409.274445: perf_pmu_enable <-ctx_sched_out_2.constprop.85
stream-22736 [001] d.h. 5409.274446: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.274447: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.274454: perf_pmu_disable <-x86_pmu_start_txn
stream-22736 [001] d.h. 5409.274455: perf_pmu_disable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274457: perf_pmu_enable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274459: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274460: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274461: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
stream-22736 [001] d.h. 5409.274463: perf_pmu_enable <-x86_pmu_commit_txn
stream-22736 [001] d.h. 5409.274491: perf_pmu_disable <-x86_pmu_start_txn
stream-22736 [001] d.h. 5409.274492: perf_pmu_disable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274492: perf_pmu_enable <-event_sched_in.isra.74
stream-22736 [001] d.h. 5409.274493: perf_pmu_enable <-x86_pmu_cancel_txn
=>stream-22736 [001] d.h. 5409.274493: perf_cpu_hrtimer_restart <-group_sched_in
+600us=>stream-22736 [001] d.h. 5409.275146: perf_pmu_enable <-perf_cpu_hrtimer_handler
stream-22736 [001] d.h. 5409.275148: perf_event_update_userpage <-x86_perf_event_set_period
stream-22736 [001] d.h. 5409.275148: intel_pmu_enable_event <-x86_pmu_start
stream-22736 [001] d.h. 5409.275149: intel_pmu_pebs_enable <-intel_pmu_enable_event
stream-22736 [001] d.h. 5409.275150: perf_event_update_userpage <-x86_pmu_start
stream-22736 [001] d.h. 5409.275151: intel_pmu_enable_all <-x86_pmu_enable
stream-22736 [001] d.h. 5409.275152: intel_pmu_pebs_enable_all <-intel_pmu_enable_all
end=>stream-22736 [001] d.h. 5409.275152: intel_pmu_lbr_enable_all <-intel_pmu_enable_all
stream-22736 [001] d.h. 5409.275176: perf_event_task_tick <-scheduler_tick
stream-22736 [001] .... 5409.275217: perf_event_mmap_output <-perf_event_aux_ctx
stream-22736 [001] d.h. 5409.275233: intel_pmu_handle_irq <-perf_event_nmi_handler
stream-22736 [001] d.h. 5409.275234: intel_pmu_disable_all <-intel_pmu_handle_irq

I looked at the source code and saw that in the per-process case the whole list of allocated events (29 x 272) is traversed twice on every timer interrupt, which accounts for the 2ms and 600us intervals above. In the system-wide case, however, only the 29 per-cpu events are iterated:

kernel/events/core.c:2799:

static void ctx_sched_out(struct perf_event_context *ctx,
                          struct perf_cpu_context *cpuctx,
                          enum event_type_t event_type)
{
#ifdef PERF___ENABLE_CTX_ROT
        struct perf_event *event;
        int is_active = ctx->is_active;
#endif

        ctx->is_active &= ~event_type;
        if (likely(!ctx->nr_events))
                return;

        update_context_time(ctx);
        update_cgrp_time_from_cpuctx(cpuctx);
        if (!ctx->nr_active)
                return;

        perf_pmu_disable(ctx->pmu);
#ifdef PERF___ENABLE_CTX_ROT
        if ((is_active & EVENT_PINNED) && (event_type & EVENT_PINNED)) {
                list_for_each_entry(event, &ctx->pinned_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }

        if ((is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE)) {
                list_for_each_entry(event, &ctx->flexible_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }
#endif
        perf_pmu_enable(ctx->pmu);
}

kernel/events/core.c:2384:

static void
ctx_sched_in(struct perf_event_context *ctx,
             struct perf_cpu_context *cpuctx,
             enum event_type_t event_type,
             struct task_struct *task)
{
        u64 now;
#ifdef PERF___ENABLE_CTX_ROT
        int is_active = ctx->is_active;

        ctx->is_active |= event_type;
#endif
        if (likely(!ctx->nr_events))
                return;

        now = perf_clock();
        ctx->timestamp = now;
        perf_cgroup_set_timestamp(task, ctx);
        /*
         * First go through the list and put on any pinned groups
         * in order to give them the best chance of going on.
         */
#ifdef PERF___ENABLE_CTX_ROT
        if (!(is_active & EVENT_PINNED) && (event_type & EVENT_PINNED))
                ctx_pinned_sched_in(ctx, cpuctx);

        /* Then walk through the lower prio flexible groups */
        if (!(is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE))
                ctx_flexible_sched_in(ctx, cpuctx);
#endif
}

If I disable the traversal in the per-process case, the overhead disappears.

In the system-wide case the ctx->pinned_groups and ctx->flexible_groups lists are part of the per-cpu perf_cpu_context object and the iteration count is small (#events == 29).
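
To put rough numbers on that (taking the 272 above to be the number of logical CPUs on this KNL box), the per-process case walks 29 * 272 = 7,888 group entries twice on every multiplexing hrtimer tick, versus only 29 entries in the system-wide per-cpu context, i.e. more than two orders of magnitude more list work per interrupt.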

Thanks,
Alexey

-----Original Message-----
From: David Carrillo-Cisneros [mailto:[email protected]]
Sent: Tuesday, April 25, 2017 8:49 PM
To: Liang, Kan <[email protected]>
Cc: [email protected]; [email protected]; Ingo Molnar <[email protected]>; Thomas Gleixner <[email protected]>; Andi Kleen <[email protected]>; Peter Zijlstra <[email protected]>; Borislav Petkov <[email protected]>; Srinivas Pandruvada <[email protected]>; Dave Hansen <[email protected]>; Vikas Shivappa <[email protected]>; Mark Rutland <[email protected]>; Arnaldo Carvalho de Melo <[email protected]>; Vince Weaver <[email protected]>; Paul Turner <[email protected]>; Stephane Eranian <[email protected]>; Budankov, Alexey <[email protected]>
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

Hi Kan,

It's still on my list, but I won't have time to work on it for at least another month.

What issues did you encounter?

Thanks,
David

On Tue, Apr 25, 2017 at 10:27 AM, Liang, Kan <[email protected]> wrote:
> Hi David,
>
> Is there any update on the patch series?
>
> We recently encountered another performance issue on KNL. I think the rb-tree solution would also help with it.
>
> Thanks,
> Kan
>
>> Subject: [RFC 0/6] optimize ctx switch with rb-tree
>>
>> Following the discussion in:
>> https://patchwork.kernel.org/patch/9420035/
>>
>> This is an early version of a series of perf context switch optimizations.
>>
>> The main idea is to create and maintain a list of inactive events
>> sorted by timestamp, and an rb-tree index into it. The rb-tree's
>> keys are {cpu,flexible,stamp} for task contexts and
>> {cgroup,flexible,stamp} for CPU contexts.
>>
>> The rb-tree provides functions to find intervals in the inactive
>> event list so that ctx_sched_in only has to visit the events that
>> can potentially be scheduled (i.e. it avoids iterating over events
>> bound to CPUs or cgroups that are not current).
>>
>> Since the inactive list is sorted by timestamp, rotation can be done by
>> simply scheduling the events out and in. This implies that on each timer
>> interrupt, the events rotate by q positions (where q is the number of hardware counters).
>> This changes the current behavior of rotation.
>> Feedback welcome!
>>
>> I haven't profiled the new approach. I am only assuming it will be
>> superior when the number of per-cpu or distinct cgroup events is large.
>>
>> The last patch shows how perf_iterate_ctx can use the new rb-tree
>> index to reduce the number of visited events. I haven't checked
>> carefully whether locking and other things are correct.
>>
>> If these changes are in the right direction, a next version could
>> remove some existing code; specifically, the lists ctx->pinned_groups
>> and ctx->flexible_groups could be removed. Also, event_filter_match
>> could be simplified when called on event groups filtered using the
>> rb-tree, since both perform similar checks.
>>
>> David Carrillo-Cisneros (6):
>> perf/core: create active and inactive event groups
>> perf/core: add a rb-tree index to inactive_groups
>> perf/core: use rb-tree to sched in event groups
>> perf/core: avoid rb-tree traversal when no inactive events
>> perf/core: rotation no longer neccesary. Behavior has changed. Beware
>> perf/core: use rb-tree index to optimize filtered perf_iterate_ctx
>>
>> include/linux/perf_event.h | 13 ++
>> kernel/events/core.c | 466
>> +++++++++++++++++++++++++++++++++++++++------
>> 2 files changed, 426 insertions(+), 53 deletions(-)
>>
>> --
>> 2.11.0.390.gc69c2f50cf-goog
>


2017-04-25 18:54:47

by David Carrillo-Cisneros

Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

>
> If I disable the traversal in the per-process case, the overhead disappears.
>
> In the system-wide case the ctx->pinned_groups and ctx->flexible_groups lists are part of the per-cpu perf_cpu_context object and the iteration count is small (#events == 29).


Yes, seems like it would benefit from the rb-tree optimization.

Something that is wrong in my RFC (as Mark notes in the "enjoyment"
section of https://lkml.org/lkml/2017/1/12/254) is that care must be
taken to disable the right PMU when dealing with contexts that have
events from more than one PMU. A way to do it is to have the pmu as
part of the rb-tree key (as Peter initially suggested) and use that to
iterate over events of the same PMU together.
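
As a rough sketch of that idea (the struct and field names below are
invented for illustration, they are not taken from the RFC), ordering
the key by pmu first makes all inactive events of one PMU a contiguous
range in the tree:

/* Hypothetical rb-tree key; names are illustrative only. */
struct inactive_key {
        struct pmu      *pmu;           /* groups events of one PMU together */
        int             cpu;            /* -1 for "any CPU" task events */
        int             flexible;       /* pinned groups sort before flexible */
        u64             stamp;          /* last time the group was scheduled */
};

static int inactive_key_cmp(const struct inactive_key *a,
                            const struct inactive_key *b)
{
        if (a->pmu != b->pmu)
                return a->pmu < b->pmu ? -1 : 1;        /* order by address */
        if (a->cpu != b->cpu)
                return a->cpu < b->cpu ? -1 : 1;
        if (a->flexible != b->flexible)
                return a->flexible < b->flexible ? -1 : 1;
        if (a->stamp != b->stamp)
                return a->stamp < b->stamp ? -1 : 1;
        return 0;
}

With that ordering, a sched-in pass can locate the subtree for the
current PMU (and CPU/cgroup) and never touch the ranges that cannot be
scheduled here.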

There's still the open question of what to do when pmu->add fails.
Currently, it stops scheduling events, but that's not right when
dealing with events in a "software context" that are not software
events (I am looking at you, CQM) or in hardware contexts with more
than one PMU (ARM big.LITTLE). Ideally a change in the event scheduler
should address that, but it requires more work. Here is a discussion
with Peter about that (https://lkml.org/lkml/2017/1/25/365).
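
A minimal sketch of how sched-in could keep going past a pmu->add
failure with the pmu-ordered tree above (the iteration helpers are
invented placeholders, not existing kernel functions; the real logic
around group_sched_in is more involved):

/*
 * first_pmu_in_tree()/next_pmu_in_tree()/inactive_first_for_pmu()/
 * inactive_next_same_pmu() are hypothetical helpers over the
 * pmu-ordered inactive tree.
 */
static void ctx_sched_in_by_pmu(struct perf_event_context *ctx,
                                struct perf_cpu_context *cpuctx)
{
        struct pmu *pmu = first_pmu_in_tree(ctx);

        while (pmu) {
                struct perf_event *event;

                for (event = inactive_first_for_pmu(ctx, pmu); event;
                     event = inactive_next_same_pmu(ctx, event)) {
                        if (group_sched_in(event, cpuctx, ctx))
                                break;  /* this PMU is full, not the others */
                }
                /* Jump to the next PMU's range instead of giving up. */
                pmu = next_pmu_in_tree(ctx, pmu);
        }
}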

If you guys want to work on it, I'll be happy to help review.
Otherwise, I'll get to it as soon as I have a chance (1-2 months).

Thanks,
David

2017-04-26 10:34:59

by Budankov, Alexey

Subject: RE: [RFC 0/6] optimize ctx switch with rb-tree

Hi David,

I would like to take over development of the patches, relying on your help with reviews.

Could you provide me with the cumulative patch set to expedite the ramp up?

Thanks,
Alexey

-----Original Message-----
From: David Carrillo-Cisneros [mailto:[email protected]]
Sent: Tuesday, April 25, 2017 9:55 PM
To: Budankov, Alexey <[email protected]>
Cc: Liang, Kan <[email protected]>; [email protected]; [email protected]; Ingo Molnar <[email protected]>; Thomas Gleixner <[email protected]>; Andi Kleen <[email protected]>; Peter Zijlstra <[email protected]>; Borislav Petkov <[email protected]>; Srinivas Pandruvada <[email protected]>; Dave Hansen <[email protected]>; Vikas Shivappa <[email protected]>; Mark Rutland <[email protected]>; Arnaldo Carvalho de Melo <[email protected]>; Vince Weaver <[email protected]>; Paul Turner <[email protected]>; Stephane Eranian <[email protected]>
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

>
> If I disable the traversal in the per-process case, the overhead disappears.
>
> In the system-wide case the ctx->pinned_groups and ctx->flexible_groups lists are part of the per-cpu perf_cpu_context object and the iteration count is small (#events == 29).


Yes, seems like it would benefit from the rb-tree optimization.

Something that is wrong in my RFC (as Mark notes in the "enjoyment"
section of https://lkml.org/lkml/2017/1/12/254) is that care must be taken to disable the right PMU when dealing with contexts that have events from more than one PMU. A way to do it is to have the pmu as part of the rb-tree key (as Peter initially suggested) and use that to iterate over events of the same PMU together.

There's still the open question of what to do when pmu->add fails.
Currently, it stops scheduling events, but that's not right when dealing with events in a "software context" that are not software events (I am looking at you, CQM) or in hardware contexts with more than one PMU (ARM big.LITTLE). Ideally a change in the event scheduler should address that, but it requires more work. Here is a discussion with Peter about that (https://lkml.org/lkml/2017/1/25/365).

If you guys want to work on it, I'll be happy to help review.
Otherwise, I'll get to it as soon as I have a chance (1-2 months).

Thanks,
David


2017-04-26 10:53:23

by Mark Rutland

Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

On Tue, Apr 25, 2017 at 11:54:36AM -0700, David Carrillo-Cisneros wrote:

> Yes, seems like it would benefit from the rb-tree optimization.

> If you guys want to work on it, I'll be happy to help review.

Likewise.

I'd very much appreciate being Cc'd on changes in this area.

Thanks,
Mark.

2017-04-26 19:40:16

by David Carrillo-Cisneros

Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

On Wed, Apr 26, 2017 at 3:34 AM, Budankov, Alexey
<[email protected]> wrote:
> Hi David,
>
> I would like to take over development of the patches, relying on your help with reviews.

Sounds good.

> Could you provide me with the cumulative patch set to expedite the ramp up?

This RFC is my latest version. I did not have a good solution for how
to handle failures of PMUs that share contexts, or for
activating/inactivating them.

Some things to keep in mind when dealing with task contexts are:
1. The number of PMUs is large and growing; iterating over all PMUs
may be expensive (see https://lkml.org/lkml/2017/1/18/859).
2. event_filter_match in this RFC is only used because I did not
find a better way to filter out events with the rb-tree. It would be
nice if we didn't have to check event->cpu != -1 && event->cpu ==
smp_processor_id() and the cgroup stuff for every event in task contexts.
3. I used the inactive events list in this RFC as a cheaper
alternative to threading the rb-tree, but it has the problem that
events removed due to a conflict would be placed at the end of
the list even if they didn't run. I cannot recall if that ever happens.
Using this list also causes problem (2); maybe threading the tree is a
better alternative?
4. Making the key for task events {PMU,CPU,last_time_scheduled}
(as opposed to {CPU,last_time_scheduled} in the RFC) may simplify
sched-in by helping to iterate over all events of the same PMU at once,
simplifying the activation/inactivation of the PMU and making it
simple to move to the next PMU on pmu::add errors. The problem with
this approach is finding only the PMUs with inactive events without
traversing a list of all PMUs. Maybe a per-context list of active PMUs
would help (see 1. and the sketch after this list).
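
A minimal sketch of that per-context bookkeeping (everything here,
including the inactive_pmus field and both helpers, is invented for
illustration and is not part of the RFC): each context keeps a short
list of PMUs that currently own inactive events, so sched-in only
visits those instead of walking every registered PMU:

/* Hypothetical per-context record of a PMU that has inactive events. */
struct ctx_pmu_entry {
        struct pmu              *pmu;
        unsigned int            nr_inactive;    /* inactive events of this PMU */
        struct list_head        ctx_entry;      /* on ctx->inactive_pmus (invented field) */
};

/* Called when an event of e->pmu becomes inactive in @ctx. */
static void ctx_pmu_note_inactive(struct perf_event_context *ctx,
                                  struct ctx_pmu_entry *e)
{
        if (!e->nr_inactive++)
                list_add_tail(&e->ctx_entry, &ctx->inactive_pmus);
}

/* Called when such an event is scheduled in or removed from @ctx. */
static void ctx_pmu_forget_inactive(struct perf_event_context *ctx,
                                    struct ctx_pmu_entry *e)
{
        if (!--e->nr_inactive)
                list_del_init(&e->ctx_entry);
}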

CPU contexts are much simpler and I think they work well with what the
RFC does (they are per-pmu already).

This thread has Peter and Mark's original discussion of the rb-tree
(https://patchwork.kernel.org/patch/9176121/).

Thanks,
David