From: Alexey Budankov
Organization: Intel Corp.
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin
Cc: Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland, Stephane Eranian, David Carrillo-Cisneros, linux-kernel
Subject: [PATCH v6 0/3] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
Message-ID: <96c7776f-1f17-a39e-23e9-658596216d6b@linux.intel.com>
Date: Wed, 2 Aug 2017 11:11:23 +0300

Hi,

By default, the userspace perf tool opens per-CPU task-bound events when sampling. For N logical events requested by the user, the tool therefore opens N * NR_CPUS events.

In the kernel, we multiplex (mux) events with an hrtimer. On each tick it rotates the flexible group list and tries to schedule each group in turn, skipping groups whose CPU filter doesn't match the current CPU. In the unlucky case, we walk N * (NR_CPUS - 1) groups pointlessly on every hrtimer invocation.

This has been observed to cause significant overhead when running the STREAM benchmark on 272-core Xeon Phi systems.
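To make the cost concrete, here is a small C sketch that models one rotation over an unsorted flexible list. It is not kernel code. `struct group`, `open_per_cpu_groups()` and `mux_walk_unsorted()` are hypothetical names, and each group is reduced to its CPU filter.

```c
#include <stdlib.h>

/* Hypothetical, simplified event group: only the CPU filter matters here. */
struct group {
    int cpu;
};

/* Build the list the tool ends up with: for each of n_events logical events,
 * one CPU-bound group per CPU, in the order they were opened. */
struct group *open_per_cpu_groups(long n_events, int nr_cpus)
{
    struct group *groups = malloc((size_t)n_events * nr_cpus * sizeof(*groups));

    if (!groups)
        return NULL;
    for (long e = 0; e < n_events; e++)
        for (int cpu = 0; cpu < nr_cpus; cpu++)
            groups[e * nr_cpus + cpu].cpu = cpu;
    return groups;
}

/* One mux pass over an unsorted list: every group is visited, and groups whose
 * CPU filter does not match this_cpu are skipped. Returns the skip count and
 * stores the number of groups that could actually be scheduled. */
long mux_walk_unsorted(const struct group *groups, long count, int this_cpu,
                       long *schedulable)
{
    long skipped = 0;

    *schedulable = 0;
    for (long i = 0; i < count; i++) {
        if (groups[i].cpu != this_cpu) {
            skipped++;          /* wasted work on this hrtimer tick */
            continue;
        }
        (*schedulable)++;       /* stand-in for attempting to schedule the group */
    }
    return skipped;
}
```

With N = 4 and NR_CPUS = 272, the list holds 1088 groups. A pass on CPU 5 finds 4 schedulable groups and skips the other 1084, which is exactly N * (NR_CPUS - 1).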
One way to avoid this is to place our events into an rb tree sorted by CPU. The hrtimer can then skip straight to the current CPU's list and ignore everything else.

This patch set moves event groups into rb trees and implements skipping to the current CPU's list on hrtimer interrupt.

The patch set was tested on Xeon Phi using perf_fuzzer and the tests from:

  https://github.com/deater/perf_event_tests

Apply the patches one after another, in the order listed below. The change is split into three logical parts to simplify review.

Thanks,
Alexey

---
Alexey Budankov (3):
  perf/core: use rb trees for pinned/flexible groups
  perf/core: use context tstamp_data for skipped events on mux interrupt
  perf/core: add mux switch to skip to the current CPU's events list on mux interrupt

 include/linux/perf_event.h |  54 +++--
 kernel/events/core.c       | 584 +++++++++++++++++++++++++++++++++------------
 2 files changed, 473 insertions(+), 165 deletions(-)
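The CPU-sorted lookup described in this cover letter can also be sketched in C. This is a simplified model, not the code from the patches. The patch set uses kernel rb trees. This sketch swaps in a qsort()-ed array with a binary search, which gives the same ordered-by-CPU jump for a list that does not change. `struct group`, `sort_groups_by_cpu()` and `mux_walk_sorted()` are hypothetical names, and groups not bound to a specific CPU are left out of the model.

```c
#include <stdlib.h>

/* Hypothetical simplified group: CPU key plus an id recording open order. */
struct group {
    int cpu;
    long id;
};

/* Order by CPU first, then by id, so groups on the same CPU keep their open
 * order (the patch set uses an rb tree; a sorted array stands in for it here). */
static int group_cmp(const void *a, const void *b)
{
    const struct group *x = a, *y = b;

    if (x->cpu != y->cpu)
        return x->cpu < y->cpu ? -1 : 1;
    if (x->id != y->id)
        return x->id < y->id ? -1 : 1;
    return 0;
}

void sort_groups_by_cpu(struct group *groups, size_t count)
{
    qsort(groups, count, sizeof(*groups), group_cmp);
}

/* Binary search for the first group whose cpu >= this_cpu (lower bound). */
static size_t first_on_cpu(const struct group *groups, size_t count, int this_cpu)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (groups[mid].cpu < this_cpu)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* One mux pass over the sorted list: jump straight to this_cpu's run of
 * groups and visit only those. Returns how many groups were visited. */
long mux_walk_sorted(const struct group *groups, size_t count, int this_cpu)
{
    long visited = 0;

    for (size_t i = first_on_cpu(groups, count, this_cpu);
         i < count && groups[i].cpu == this_cpu; i++)
        visited++;              /* stand-in for attempting to schedule the group */
    return visited;
}
```

On the example list from the first sketch, a pass on CPU 5 now visits only that CPU's 4 groups, after an O(log n) search. A sorted array must be re-sorted whenever groups are added or removed. An rb tree keeps insertion and removal at O(log n) as events are opened and closed, which is presumably why the patches use one.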