From: "Budankov, Alexey"
To: David Carrillo-Cisneros, "Liang, Kan"
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Ingo Molnar, Thomas Gleixner, Andi Kleen, Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada, Dave Hansen, Vikas Shivappa, Mark Rutland, Arnaldo Carvalho de Melo, Vince Weaver, Paul Turner, Stephane Eranian, "Budankov, Alexey"
Subject: RE: [RFC 0/6] optimize ctx switch with rb-tree
Date: Tue, 25 Apr 2017 18:11:33 +0000

Hi David,

I profiled the single-threaded STREAM benchmark pinned to core #1 on a KNL machine running kernel 3.10.0-327.13.1.el7.x86_64 (from the MPSS distribution). I ran system-wide and per-process profiling with the same list of events and got a ~5x slowdown in the per-process scenario.

At the same time I collected ftrace logs for both cases and found a large slowdown in the handling of the multiplexing timer interrupt in the per-process case compared to the system-wide one:

begin=>stream-22736 [001] d.h. 5409.272489: perf_cpu_hrtimer_handler <-__hrtimer_run_queues
       stream-22736 [001] d.h. 5409.272490: perf_pmu_disable <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.272491: intel_pmu_disable_all <-x86_pmu_disable
       stream-22736 [001] d.h. 5409.272492: intel_pmu_pebs_disable_all <-intel_pmu_disable_all
       stream-22736 [001] d.h. 5409.272492: intel_pmu_lbr_disable_all <-intel_pmu_disable_all
       stream-22736 [001] d.h. 5409.272494: perf_pmu_disable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.272494: perf_pmu_enable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.272495: perf_pmu_disable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.272506: perf_pmu_disable <-event_sched_out.isra.72
       stream-22736 [001] d.h. 5409.272507: intel_pmu_disable_event <-x86_pmu_stop
       stream-22736 [001] d.h. 5409.272509: intel_pmu_pebs_disable <-intel_pmu_disable_event
       stream-22736 [001] d.h. 5409.272511: perf_event_update_userpage <-x86_pmu_del
     =>stream-22736 [001] d.h. 5409.272511: perf_pmu_enable <-event_sched_out.isra.72
 +2ms=>stream-22736 [001] d.h. 5409.274445: perf_pmu_enable <-ctx_sched_out_2.constprop.85
       stream-22736 [001] d.h. 5409.274446: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.274447: rotate_ctx.part.41 <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.274454: perf_pmu_disable <-x86_pmu_start_txn
       stream-22736 [001] d.h. 5409.274455: perf_pmu_disable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274457: perf_pmu_enable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274459: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274460: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274461: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274462: intel_pmu_event_map <-intel_get_event_constraints
       stream-22736 [001] d.h. 5409.274463: perf_pmu_enable <-x86_pmu_commit_txn
       stream-22736 [001] d.h. 5409.274491: perf_pmu_disable <-x86_pmu_start_txn
       stream-22736 [001] d.h. 5409.274492: perf_pmu_disable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274492: perf_pmu_enable <-event_sched_in.isra.74
       stream-22736 [001] d.h. 5409.274493: perf_pmu_enable <-x86_pmu_cancel_txn
     =>stream-22736 [001] d.h. 5409.274493: perf_cpu_hrtimer_restart <-group_sched_in
+600us=>stream-22736 [001] d.h. 5409.275146: perf_pmu_enable <-perf_cpu_hrtimer_handler
       stream-22736 [001] d.h. 5409.275148: perf_event_update_userpage <-x86_perf_event_set_period
       stream-22736 [001] d.h. 5409.275148: intel_pmu_enable_event <-x86_pmu_start
       stream-22736 [001] d.h. 5409.275149: intel_pmu_pebs_enable <-intel_pmu_enable_event
       stream-22736 [001] d.h. 5409.275150: perf_event_update_userpage <-x86_pmu_start
       stream-22736 [001] d.h. 5409.275151: intel_pmu_enable_all <-x86_pmu_enable
       stream-22736 [001] d.h. 5409.275152: intel_pmu_pebs_enable_all <-intel_pmu_enable_all
  end=>stream-22736 [001] d.h. 5409.275152: intel_pmu_lbr_enable_all <-intel_pmu_enable_all
       stream-22736 [001] d.h. 5409.275176: perf_event_task_tick <-scheduler_tick
       stream-22736 [001] .... 5409.275217: perf_event_mmap_output <-perf_event_aux_ctx
       stream-22736 [001] d.h. 5409.275233: intel_pmu_handle_irq <-perf_event_nmi_handler
       stream-22736 [001] d.h. 5409.275234: intel_pmu_disable_all <-intel_pmu_handle_irq
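Reading off the timestamps above: the handler enters at 5409.272489 and the PMU is not fully re-enabled until 5409.275152, i.e. roughly 2.7 ms spent inside perf_cpu_hrtimer_handler for a single multiplexing tick. Almost all of it sits in the two marked gaps: about 1.9 ms between the end of the sched-out pass at 5409.272511 and 5409.274445 (the "+2ms" marker), and about 0.65 ms between the group_sched_in path at 5409.274493 and perf_pmu_enable at 5409.275146 (the "+600us" marker).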
Looking at the source code, I see that in the per-process case the whole list of allocated events (29 x 272, presumably 29 events for each of the machine's 272 logical CPUs, i.e. roughly 7,900 entries) is traversed twice on every timer interrupt, which is where those 2 ms and 600 us intervals go.
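Most of those entries are visited only to be filtered out again, since they are bound to CPUs other than the one the task is running on. A minimal sketch of that pattern, paraphrasing the flexible-group walk in kernel/events/core.c (ctx_flexible_sched_in() plus the CPU check from event_filter_match()); illustrative only, not the exact code of this kernel version:

/*
 * Illustrative sketch, paraphrased from the flexible-group walk in
 * kernel/events/core.c; the exact code differs between kernel versions.
 */
static void sketch_flexible_sched_in(struct perf_event_context *ctx,
                                     struct perf_cpu_context *cpuctx)
{
        struct perf_event *event;

        /* O(all events in the context), even if only a handful can run here */
        list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
                if (event->state <= PERF_EVENT_STATE_OFF)
                        continue;
                /*
                 * The per-CPU filter: in the per-process case roughly
                 * 29 * 271 of the entries are skipped right here on core #1.
                 */
                if (event->cpu != -1 && event->cpu != smp_processor_id())
                        continue;
                /* return value handling omitted in this sketch */
                group_sched_in(event, cpuctx, ctx);
        }
}

The cost is in reaching the matching entries, not in scheduling them: the list walk itself is what scales with 29 x 272.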
In the system-wide case, however, only the 29 per-CPU events are iterated:

kernel/events/core.c:2799:

static void ctx_sched_out(struct perf_event_context *ctx,
                          struct perf_cpu_context *cpuctx,
                          enum event_type_t event_type)
{
#ifdef PERF___ENABLE_CTX_ROT
        struct perf_event *event;
        int is_active = ctx->is_active;
#endif
        ctx->is_active &= ~event_type;
        if (likely(!ctx->nr_events))
                return;

        update_context_time(ctx);
        update_cgrp_time_from_cpuctx(cpuctx);
        if (!ctx->nr_active)
                return;

        perf_pmu_disable(ctx->pmu);
#ifdef PERF___ENABLE_CTX_ROT
        if ((is_active & EVENT_PINNED) && (event_type & EVENT_PINNED)) {
                list_for_each_entry(event, &ctx->pinned_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }

        if ((is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE)) {
                list_for_each_entry(event, &ctx->flexible_groups, group_entry)
                        group_sched_out(event, cpuctx, ctx);
        }
#endif
        perf_pmu_enable(ctx->pmu);
}

kernel/events/core.c:2384:

static void ctx_sched_in(struct perf_event_context *ctx,
                         struct perf_cpu_context *cpuctx,
                         enum event_type_t event_type,
                         struct task_struct *task)
{
        u64 now;
#ifdef PERF___ENABLE_CTX_ROT
        int is_active = ctx->is_active;

        ctx->is_active |= event_type;
#endif
        if (likely(!ctx->nr_events))
                return;

        now = perf_clock();
        ctx->timestamp = now;
        perf_cgroup_set_timestamp(task, ctx);
        /*
         * First go through the list and put on any pinned groups
         * in order to give them the best chance of going on.
         */
#ifdef PERF___ENABLE_CTX_ROT
        if (!(is_active & EVENT_PINNED) && (event_type & EVENT_PINNED))
                ctx_pinned_sched_in(ctx, cpuctx);

        /* Then walk through the lower prio flexible groups */
        if (!(is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE))
                ctx_flexible_sched_in(ctx, cpuctx);
#endif
}

If I disable the traversal in the per-process case (the PERF___ENABLE_CTX_ROT blocks above are my switch for that), the overhead disappears. In the system-wide case the ctx->pinned_groups and ctx->flexible_groups lists are part of the per-CPU perf_cpu_context object, so the number of iterations stays small (#events == 29).
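The structural difference behind those numbers, roughly: a system-wide session attaches its events to a per-CPU perf_cpu_context, so each CPU's pinned/flexible lists hold only that CPU's events, whereas all events of a per-process session hang off the task's single perf_event_context regardless of which CPU they are bound to. A schematic sketch with abridged, simplified fields (not the exact kernel definitions):

/*
 * Schematic only: fields are abridged and names simplified; see
 * include/linux/perf_event.h for the real structures.
 */
struct sketch_event_context {
        struct list_head pinned_groups;         /* walked by ctx_sched_out/in */
        struct list_head flexible_groups;       /* walked by ctx_sched_out/in */
        int nr_events;
};

/* System-wide: one of these per CPU, so each pair of lists holds ~29 events. */
struct sketch_cpu_context {
        struct sketch_event_context ctx;        /* this CPU's events only */
        struct hrtimer hrtimer;                 /* multiplexing timer */
};

/*
 * Per-process: one context per task (and PMU), so the same two lists
 * accumulate the events for every CPU, 29 * 272 entries here, and the
 * hrtimer handler walks all of them on whichever CPU the task runs.
 */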
Thanks,
Alexey

-----Original Message-----
From: David Carrillo-Cisneros [mailto:davidcc@google.com]
Sent: Tuesday, April 25, 2017 8:49 PM
To: Liang, Kan
Cc: linux-kernel@vger.kernel.org; x86@kernel.org; Ingo Molnar; Thomas Gleixner; Andi Kleen; Peter Zijlstra; Borislav Petkov; Srinivas Pandruvada; Dave Hansen; Vikas Shivappa; Mark Rutland; Arnaldo Carvalho de Melo; Vince Weaver; Paul Turner; Stephane Eranian; Budankov, Alexey
Subject: Re: [RFC 0/6] optimize ctx switch with rb-tree

Hi Kan,

It's still on my list, but I won't have time to work on it for at least another month.

What issues did you encounter?

Thanks,
David

On Tue, Apr 25, 2017 at 10:27 AM, Liang, Kan wrote:
> Hi David,
>
> Is there any update on the patch series?
>
> We recently encountered another performance issue on KNL. I think the
> RB-tree solution also has benefits for it.
>
> Thanks,
> Kan
>
>> Subject: [RFC 0/6] optimize ctx switch with rb-tree
>>
>> Following the discussion in:
>> https://patchwork.kernel.org/patch/9420035/
>>
>> This is an early version of a series of perf context switch optimizations.
>>
>> The main idea is to create and maintain a list of inactive events
>> sorted by timestamp, plus an rb-tree to index it. The rb-tree's keys
>> are {cpu,flexible,stamp} for task contexts and {cgroup,flexible,stamp}
>> for CPU contexts.
>>
>> The rb-tree provides functions to find intervals in the inactive event
>> list, so that ctx_sched_in only has to visit the events that can
>> potentially be scheduled (i.e. it avoids iterating over events bound to
>> CPUs or cgroups that are not current).
>>
>> Since the inactive list is sorted by timestamp, rotation can be done by
>> simply scheduling events out and back in. This implies that on each
>> timer interrupt the events rotate by q positions (where q is the number
>> of hardware counters). This changes the current rotation behavior.
>> Feedback welcome!
>>
>> I haven't profiled the new approach. I am only assuming it will be
>> superior when the number of per-cpu or distinct cgroup events is large.
>>
>> The last patch shows how perf_iterate_ctx can use the new rb-tree index
>> to reduce the number of visited events. I haven't looked carefully at
>> whether locking and other things are correct.
>>
>> If these changes are in the right direction, a next version could
>> remove some existing code; specifically, the lists ctx->pinned_groups
>> and ctx->flexible_groups could go away. Also, event_filter_match could
>> be simplified when called on event groups filtered through the rb-tree,
>> since both perform similar checks.
>>
>> David Carrillo-Cisneros (6):
>>   perf/core: create active and inactive event groups
>>   perf/core: add a rb-tree index to inactive_groups
>>   perf/core: use rb-tree to sched in event groups
>>   perf/core: avoid rb-tree traversal when no inactive events
>>   perf/core: rotation no longer neccesary. Behavior has changed. Beware
>>   perf/core: use rb-tree index to optimize filtered perf_iterate_ctx
>>
>>  include/linux/perf_event.h |  13 ++
>>  kernel/events/core.c       | 466 +++++++++++++++++++++++++++++++++++++++------
>>  2 files changed, 426 insertions(+), 53 deletions(-)
>>
>> --
>> 2.11.0.390.gc69c2f50cf-goog
>
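To make the cover letter's indexing idea concrete: the task-context key it describes is {cpu, flexible, stamp}, so scheduling in on CPU c only needs the contiguous key ranges for cpu == -1 and cpu == c, oldest stamp first. A minimal sketch of such a key and comparator; the names are illustrative and do not come from the actual patches:

/*
 * Illustrative only: field and function names are made up for this sketch
 * and do not match the RFC patches. Ordering inactive events by
 * {cpu, flexible, stamp} lets the scheduler walk just the subranges for
 * (cpu == -1) and (cpu == current CPU), oldest first.
 */
struct sketch_inactive_key {
        int cpu;        /* -1 means "any CPU" */
        int flexible;   /* 0 = pinned group, 1 = flexible group */
        u64 stamp;      /* timestamp of last schedule-out */
};

static int sketch_key_cmp(const struct sketch_inactive_key *a,
                          const struct sketch_inactive_key *b)
{
        if (a->cpu != b->cpu)
                return a->cpu < b->cpu ? -1 : 1;
        if (a->flexible != b->flexible)
                return a->flexible < b->flexible ? -1 : 1;
        if (a->stamp != b->stamp)
                return a->stamp < b->stamp ? -1 : 1;
        return 0;
}

With that ordering, a sched-in on core #1 visits two key ranges instead of the full 29 x 272 list, and rotation presumably falls out of the timestamp ordering (the longest-inactive groups are picked up first), which is the behavior change the cover letter notes.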