From: "Budankov, Alexey"
To: David Carrillo-Cisneros, "Liang, Kan"
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Ingo Molnar, Thomas Gleixner, Andi Kleen, Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada, Dave Hansen, Vikas Shivappa, Mark Rutland, Arnaldo Carvalho de Melo, Vince Weaver, Paul Turner, Stephane Eranian, "Budankov, Alexey"
Subject: RE: [RFC 0/6] optimize ctx switch with rb-tree
Date: Tue, 25 Apr 2017 18:11:33 +0000

Hi David,

I profiled the single-threaded STREAM benchmark pinned to core #1 on a KNL machine running kernel 3.10.0-327.13.1.el7.x86_64 (from the MPSS distribution). I ran system-wide and per-process profiling with the same list of events and got a ~5x slowdown in the per-process scenario.

At the same time I collected ftrace logs for both cases and found a large slowdown in the handling of the multiplexing timer interrupt in the per-process case compared to the system-wide one:

begin=>stream-22736 [001] d.h. 5409.272489: perf_cpu_hrtimer_handler <-__hrtimer_run_queues
       stream-22736 [001] d.h. 5409.272490: perf_pmu_disable <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.272491: intel_pmu_disable_all <-x86_pmu_disable
       stream-22736 [001] d.h. 5409.272492: intel_pmu_pebs_disable_all <-intel_pmu_disable_all
       stream-22736 [001] d.h. 5409.272492: intel_pmu_lbr_disable_all <-intel_pmu_disable_all
       stream-22736 [001] d.h. 5409.272494: perf_pmu_disable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.272494: perf_pmu_enable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.272495: perf_pmu_disable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.272506: perf_pmu_disable <-event_sched_out.isra.72
       stream-22736 [001] d.h. 5409.272507: intel_pmu_disable_event <-x86_pmu_stop
       stream-22736 [001] d.h. 5409.272509: intel_pmu_pebs_disable <-intel_pmu_disable_event
       stream-22736 [001] d.h. 5409.272511: perf_event_update_userpage <-x86_pmu_del
     =>stream-22736 [001] d.h. 5409.272511: perf_pmu_enable <-event_sched_out.isra.72
 +2ms=>stream-22736 [001] d.h. 5409.274445: perf_pmu_enable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.274446: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.274447: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.274454: perf_pmu_disable <-x86_pmu_start_txn
       stream-22736 [001] d.h. 5409.274455: perf_pmu_disable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274457: perf_pmu_enable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274459: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274460: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274461: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274463: perf_pmu_enable <-x86_pmu_commit_txn
       stream-22736 [001] d.h. 5409.274491: perf_pmu_disable <-x86_pmu_start_txn
       stream-22736 [001] d.h. 5409.274492: perf_pmu_disable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274492: perf_pmu_enable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274493: perf_pmu_enable <-x86_pmu_cancel_txn
     =>stream-22736 [001] d.h. 5409.274493: perf_cpu_hrtimer_restart <-group_sched_in
+600us=>stream-22736 [001] d.h. 5409.275146: perf_pmu_enable <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.275148: perf_event_update_userpage <-x86_perf_event_set_period
       stream-22736 [001] d.h. 5409.275148: intel_pmu_enable_event <-x86_pmu_start
       stream-22736 [001] d.h. 5409.275149: intel_pmu_pebs_enable <-intel_pmu_enable_event
       stream-22736 [001] d.h. 5409.275150: perf_event_update_userpage <-x86_pmu_start
       stream-22736 [001] d.h. 5409.275151: intel_pmu_enable_all <-x86_pmu_enable
       stream-22736 [001] d.h. 5409.275152: intel_pmu_pebs_enable_all <-intel_pmu_enable_all
  end=>stream-22736 [001] d.h. 5409.275152: intel_pmu_lbr_enable_all <-intel_pmu_enable_all
       stream-22736 [001] d.h. 5409.275176: perf_event_task_tick <-scheduler_tick
       stream-22736 [001] .... 5409.275217: perf_event_mmap_output <-perf_event_aux_ctx
       stream-22736 [001] d.h. 5409.275233: intel_pmu_handle_irq <-perf_event_nmi_handler
       stream-22736 [001] d.h. 5409.275234: intel_pmu_disable_all <-intel_pmu_handle_irq
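Reading off the timestamps above: the handler enters at 5409.272489 and the PMU is not fully re-enabled until 5409.275152, i.e. roughly 2.7 ms spent inside perf_cpu_hrtimer_handler for a single multiplexing tick. Almost all of it sits in the two marked gaps: about 1.9 ms between the end of the sched-out pass at 5409.272511 and 5409.274445 (the "+2ms" marker), and about 0.65 ms between the group_sched_in path at 5409.274493 and perf_pmu_enable at 5409.275146 (the "+600us" marker).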
Looking at the source code, I see that in the per-process case the whole list of allocated events (29 x 272, presumably 29 events for each of the machine's 272 logical CPUs, i.e. roughly 7,900 entries) is traversed twice on every timer interrupt, which is where those 2 ms and 600 us intervals go.
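Most of those entries are visited only to be filtered out again, since they are bound to CPUs other than the one the task is running on. A minimal sketch of that pattern, paraphrasing the flexible-group walk in kernel/events/core.c (ctx_flexible_sched_in() plus the CPU check from event_filter_match()); illustrative only, not the exact code of this kernel version:

/*
 * Illustrative sketch, paraphrased from the flexible-group walk in
 * kernel/events/core.c; the exact code differs between kernel versions.
 */
static void sketch_flexible_sched_in(struct perf_event_context *ctx,
                                     struct perf_cpu_context *cpuctx)
{
        struct perf_event *event;

        /* O(all events in the context), even if only a handful can run here */
        list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
                if (event->state <= PERF_EVENT_STATE_OFF)
                        continue;
                /*
                 * The per-CPU filter: in the per-process case roughly
                 * 29 * 271 of the entries are skipped right here on core #1.
                 */
                if (event->cpu != -1 && event->cpu != smp_processor_id())
                        continue;
                /* return value handling omitted in this sketch */
                group_sched_in(event, cpuctx, ctx);
        }
}

The cost is in reaching the matching entries, not in scheduling them: the list walk itself is what scales with 29 x 272.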
In the system-wide case, however, only the 29 per-CPU events are iterated:

kernel/events/core.c:2799:

static void ctx_sched_out(struct perf_event_context *ctx,
                          struct perf_cpu_context *cpuctx,
                          enum event_type_t event_type)
{
#ifdef PERF___ENABLE_CTX_ROT
        struct perf_event *event;
        int is_active = ctx->is_active;
#endif
        ctx->is_active &= ~event_type;
        if (likely(!ctx->nr_events))
                return;

        update_context_time(ctx);
        update_cgrp_time_from_cpuctx(cpuctx);
        if (!ctx->nr_active)
                return;

        perf_pmu_disable(ctx->pmu);
#ifdef PERF___ENABLE_CTX_ROT
        if ((is_active & EVENT_PINNED) && (event_type & EVENT_PINNED)) {
                list_for_each_entry(event, &ctx->pinned_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }

        if ((is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE)) {
                list_for_each_entry(event, &ctx->flexible_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }
#endif
        perf_pmu_enable(ctx->pmu);
}

kernel/events/core.c:2384:

static void ctx_sched_in(struct perf_event_context *ctx,
                         struct perf_cpu_context *cpuctx,
                         enum event_type_t event_type,
                         struct task_struct *task)
{
        u64 now;
#ifdef PERF___ENABLE_CTX_ROT
        int is_active = ctx->is_active;

        ctx->is_active |= event_type;
#endif
        if (likely(!ctx->nr_events))
                return;

        now = perf_clock();
        ctx->timestamp = now;
        perf_cgroup_set_timestamp(task, ctx);
        /*
         * First go through the list and put on any pinned groups
         * in order to give them the best chance of going on.
         */
#ifdef PERF___ENABLE_CTX_ROT
        if (!(is_active & EVENT_PINNED) && (event_type & EVENT_PINNED))
                ctx_pinned_sched_in(ctx, cpuctx);

        /* Then walk through the lower prio flexible groups */
        if (!(is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE))
                ctx_flexible_sched_in(ctx, cpuctx);
#endif
}

If I disable the traversal in the per-process case (the PERF___ENABLE_CTX_ROT blocks above are my switch for that), the overhead disappears. In the system-wide case the ctx->pinned_groups and ctx->flexible_groups lists are part of the per-CPU perf_cpu_context object, so the number of iterations stays small (#events == 29).
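The structural difference behind those numbers, roughly: a system-wide session attaches its events to a per-CPU perf_cpu_context, so each CPU's pinned/flexible lists hold only that CPU's events, whereas all events of a per-process session hang off the task's single perf_event_context regardless of which CPU they are bound to. A schematic sketch with abridged, simplified fields (not the exact kernel definitions):

/*
 * Schematic only: fields are abridged and names simplified; see
 * include/linux/perf_event.h for the real structures.
 */
struct sketch_event_context {
        struct list_head pinned_groups;         /* walked by ctx_sched_out/in */
        struct list_head flexible_groups;       /* walked by ctx_sched_out/in */
        int nr_events;
};

/* System-wide: one of these per CPU, so each pair of lists holds ~29 events. */
struct sketch_cpu_context {
        struct sketch_event_context ctx;        /* this CPU's events only */
        struct hrtimer hrtimer;                 /* multiplexing timer */
};

/*
 * Per-process: one context per task (and PMU), so the same two lists
 * accumulate the events for every CPU, 29 * 272 entries here, and the
 * hrtimer handler walks all of them on whichever CPU the task runs.
 */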
Thanks,
Alexey

-----Original Message-----
From: David Carrillo-Cisneros [mailto:davidcc@google.com]
Sent: Tuesday, April 25, 2017 8:49 PM
To: Liang, Kan
Cc: linux-kernel@vger.kernel.org; x86@kernel.org; Ingo Molnar; Thomas Gleixner; Andi Kleen; Peter Zijlstra; Borislav Petkov; Srinivas Pandruvada; Dave Hansen; Vikas Shivappa; Mark Rutland; Arnaldo Carvalho de Melo; Vince Weaver; Paul Turner; Stephane Eranian; Budankov, Alexey
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

Hi Kan,

It's still on my list, but I won't have time to work on it for at least another month.

What issues did you encounter?

Thanks,
David

On Tue, Apr 25, 2017 at 10:27 AM, Liang, Kan wrote:
> Hi David,
>
> Is there any update on the patch series?
>
> We recently encountered another performance issue on KNL. I think the
> RB-tree solution also has benefits for it.
>
> Thanks,
> Kan
>
>> Subject: [RFC 0/6] optimize ctx switch with rb-tree
>>
>> Following the discussion in:
>> https://patchwork.kernel.org/patch/9420035/
>>
>> This is an early version of a series of perf context switch optimizations.
>>
>> The main idea is to create and maintain a list of inactive events
>> sorted by timestamp, plus an rb-tree to index it. The rb-tree's keys
>> are {cpu,flexible,stamp} for task contexts and {cgroup,flexible,stamp}
>> for CPU contexts.
>>
>> The rb-tree provides functions to find intervals in the inactive event
>> list, so that ctx_sched_in only has to visit the events that can
>> potentially be scheduled (i.e. it avoids iterating over events bound to
>> CPUs or cgroups that are not current).
>>
>> Since the inactive list is sorted by timestamp, rotation can be done by
>> simply scheduling events out and back in. This implies that on each
>> timer interrupt the events rotate by q positions (where q is the number
>> of hardware counters). This changes the current rotation behavior.
>> Feedback welcome!
>>
>> I haven't profiled the new approach. I am only assuming it will be
>> superior when the number of per-cpu or distinct cgroup events is large.
>>
>> The last patch shows how perf_iterate_ctx can use the new rb-tree index
>> to reduce the number of visited events. I haven't looked carefully at
>> whether locking and other things are correct.
>>
>> If these changes are in the right direction, a next version could
>> remove some existing code; specifically, the lists ctx->pinned_groups
>> and ctx->flexible_groups could go away. Also, event_filter_match could
>> be simplified when called on event groups filtered through the rb-tree,
>> since both perform similar checks.
>>
>> David Carrillo-Cisneros (6):
>>   perf/core: create active and inactive event groups
>>   perf/core: add a rb-tree index to inactive_groups
>>   perf/core: use rb-tree to sched in event groups
>>   perf/core: avoid rb-tree traversal when no inactive events
>>   perf/core: rotation no longer neccesary. Behavior has changed. Beware
>>   perf/core: use rb-tree index to optimize filtered perf_iterate_ctx
>>
>>  include/linux/perf_event.h |  13 ++
>>  kernel/events/core.c       | 466 +++++++++++++++++++++++++++++++++++++++------
>>  2 files changed, 426 insertions(+), 53 deletions(-)
>>
>> --
>> 2.11.0.390.gc69c2f50cf-goog
>
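To make the cover letter's indexing idea concrete: the task-context key it describes is {cpu, flexible, stamp}, so scheduling in on CPU c only needs the contiguous key ranges for cpu == -1 and cpu == c, oldest stamp first. A minimal sketch of such a key and comparator; the names are illustrative and do not come from the actual patches:

/*
 * Illustrative only: field and function names are made up for this sketch
 * and do not match the RFC patches. Ordering inactive events by
 * {cpu, flexible, stamp} lets the scheduler walk just the subranges for
 * (cpu == -1) and (cpu == current CPU), oldest first.
 */
struct sketch_inactive_key {
        int cpu;        /* -1 means "any CPU" */
        int flexible;   /* 0 = pinned group, 1 = flexible group */
        u64 stamp;      /* timestamp of last schedule-out */
};

static int sketch_key_cmp(const struct sketch_inactive_key *a,
                          const struct sketch_inactive_key *b)
{
        if (a->cpu != b->cpu)
                return a->cpu < b->cpu ? -1 : 1;
        if (a->flexible != b->flexible)
                return a->flexible < b->flexible ? -1 : 1;
        if (a->stamp != b->stamp)
                return a->stamp < b->stamp ? -1 : 1;
        return 0;
}

With that ordering, a sched-in on core #1 visits two key ranges instead of the full 29 x 272 list, and rotation presumably falls out of the timestamp ordering (the longest-inactive groups are picked up first), which is the behavior change the cover letter notes.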