From: kan.liang@linux.intel.com
To: peterz@infradead.org, acme@redhat.com, mingo@kernel.org, linux-kernel@vger.kernel.org
Cc: tglx@linutronix.de, jolsa@kernel.org, eranian@google.com, alexander.shishkin@linux.intel.com, ak@linux.intel.com
Subject: [PATCH V5 RESEND 14/14] perf, tools: Add documentation for topdown metrics
Date: Mon, 6 Jan 2020 12:29:19 -0800
Message-Id: <20200106202919.2943-15-kan.liang@linux.intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20200106202919.2943-1-kan.liang@linux.intel.com>
References: <20200106202919.2943-1-kan.liang@linux.intel.com>
Sender:
linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Andi Kleen

Add some documentation on how to use the topdown metrics in ring 3.

Signed-off-by: Andi Kleen
---

Changes since V4
- Update example for Ice Lake

 tools/perf/Documentation/topdown.txt | 235 +++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 tools/perf/Documentation/topdown.txt

diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
new file mode 100644
index 000000000000..e724d2af3b8d
--- /dev/null
+++ b/tools/perf/Documentation/topdown.txt
@@ -0,0 +1,235 @@
+Using TopDown metrics in user space
+-----------------------------------
+
+Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
+methodology to break down CPU pipeline execution into 4 bottlenecks:
+frontend bound, backend bound, bad speculation, retiring.
+
+For more details on TopDown see [1][5]
+
+Traditionally this was implemented by events in generic counters
+and specific formulas to compute the bottlenecks.
+
+perf stat --topdown implements this.
+
+Full TopDown includes more levels that can break down the
+bottlenecks further. This is not directly implemented in perf,
+but is available in other tools that can run on top of perf,
+such as toplev[2] or vtune[3]
+
+New TopDown features in Ice Lake
+================================
+
+With Ice Lake CPUs the TopDown metrics are directly available as
+fixed counters and do not require generic counters. This allows
+TopDown to always be collected in addition to other events.
+
+% perf stat -a --topdown -I1000
+#           time             counts unit events
+     1.000854735     20,097,158,100      slots
+     1.000854735         79,327,616      topdown-retiring          #      0.4% retiring
+     1.000854735        157,932,715      topdown-bad-spec          #      0.8% bad speculation
+     1.000854735         81,610,855      topdown-fe-bound          #      0.4% frontend bound
+     1.000854735     19,778,286,903      topdown-be-bound          #     98.4% backend bound
+     2.003623823     20,010,908,365      slots
+     2.003623823         79,905,340      topdown-retiring          #      0.4% retiring
+     2.003623823        158,405,024      topdown-bad-spec          #      0.8% bad speculation
+     2.003623823         87,980,097      topdown-fe-bound          #      0.4% frontend bound
+     2.003623823     19,684,617,888      topdown-be-bound          #     98.4% backend bound
+     3.005828889     20,062,101,220      slots
+     3.005828889         80,077,032      topdown-retiring          #      0.4% retiring
+     3.005828889        158,682,921      topdown-bad-spec          #      0.8% bad speculation
+     3.005828889         86,579,604      topdown-fe-bound          #      0.4% frontend bound
+     3.005828889     19,736,761,649      topdown-be-bound          #     98.4% backend bound
+...
+
+This also enables measuring TopDown per thread/process instead
+of only per core.
+
+Using TopDown through RDPMC in applications on Ice Lake
+=======================================================
+
+For more fine-grained measurements it can be useful to
+access the new metrics directly from user space. This is more
+complicated, but drastically lowers overhead.
+
+On Ice Lake, there is a new fixed counter 3: SLOTS, which reports
+"pipeline SLOTS" (cycles multiplied by core issue width) and a
+metric register that reports slots ratios for the different bottleneck
+categories.
+
+The metrics counter is CPU model specific and is not available
+on older CPUs.
+
+Example code
+============
+
+Library functions that implement the functionality described below
+are also available in libjevents [4]
+
+The application opens a perf_event file descriptor
+and sets up fixed counter 3 (SLOTS) to start and
+allow user programs to read the performance counters.
+
+Fixed counter 3 is mapped to a pseudo event event=0x00, umask=0x04,
+so the perf_event_attr structure should be initialized with
+{ .config = 0x0400, .type = PERF_TYPE_RAW }
+
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Provide own perf_event_open stub because glibc doesn't */
+__attribute__((weak))
+int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+		    int cpu, int group_fd, unsigned long flags)
+{
+	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+/* open slots counter file descriptor for current task */
+struct perf_event_attr slots = {
+	.type = PERF_TYPE_RAW,
+	.size = sizeof(struct perf_event_attr),
+	.config = 0x400,
+	.exclude_kernel = 1,
+};
+
+int fd = perf_event_open(&slots, 0, -1, -1, 0);
+if (fd < 0)
+	... error ...
+
+The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
+to read slots and the topdown metrics at different points of the program:
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
+#define RDPMC_METRIC	(1 << 29)	/* return metric counters */
+
+#define FIXED_COUNTER_SLOTS		3
+#define METRIC_COUNTER_TOPDOWN_L1	0
+
+static inline uint64_t read_slots(void)
+{
+	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
+}
+
+static inline uint64_t read_metrics(void)
+{
+	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+}
+
+Then the program can be instrumented to read these metrics at different
+points.
+
+It's not a good idea to do this with code regions that are too short,
+as the parallelism and overlap in the CPU program execution will
+cause too much measurement inaccuracy. For example, instrumenting
+individual basic blocks is definitely too fine-grained.
+
+Decoding metrics values
+=======================
+
+The value reported by read_metrics() contains four 8-bit fields,
+each a scaled ratio that represents one Level 1 bottleneck.
+All four fields add up to 0xff (= 100%)
+
+The binary ratios in the metric value can be converted to float ratios:
+
+#define GET_METRIC(m, i) (((m) >> ((i) * 8)) & 0xff)
+
+#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
+#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
+#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
+#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)
+
+and then converted to percent for printing.
+
+The ratios in the metric accumulate for the time when the counter
+is enabled. For measuring programs it is often useful to measure
+specific sections. This requires computing deltas on the metrics.
+
+This can be done by scaling the metrics with the slots counter
+read at the same time.
+
+Then it's possible to take deltas of these slots counts
+measured at different points, and determine the metrics
+for that time period.
+
+	slots_a = read_slots();
+	metric_a = read_metrics();
+
+	... larger code region ...
+
+	slots_b = read_slots();
+	metric_b = read_metrics();
+
+	/* compute scaled metrics for measurement a (units of 1/0xff slot) */
+	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a;
+	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a;
+	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a;
+	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a;
+
+	/* compute delta scaled metrics between b and a */
+	retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a;
+	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a;
+	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a;
+	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a;
+
+Later the individual ratios for the measurement period can be recreated
+from these counts.
+
+	/* the scaled counts carry a factor of 0xff from the 8-bit fields */
+	slots_delta = slots_b - slots_a;
+	retiring_ratio = (float)retiring_slots / 0xff / slots_delta;
+	bad_spec_ratio = (float)bad_spec_slots / 0xff / slots_delta;
+	fe_bound_ratio = (float)fe_bound_slots / 0xff / slots_delta;
+	be_bound_ratio = (float)be_bound_slots / 0xff / slots_delta;
+
+	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
+		retiring_ratio * 100.,
+		bad_spec_ratio * 100.,
+		fe_bound_ratio * 100.,
+		be_bound_ratio * 100.);
+
+Resetting metrics counters
+==========================
+
+Since the individual metrics are only 8 bits wide, they lose precision
+for short regions over time: as the counters accumulate, each step of
+the 8-bit fraction represents more slots. So the counters need to be
+reset regularly.
+
+When using the kernel perf API the kernel resets on every read.
+So as long as the reading is at reasonable intervals (every few
+seconds) the precision is good.
+
+When using perf stat it is recommended to always use the -I option,
+with an interval no longer than a few seconds
+
+	perf stat -I 1000 --topdown ...
+
+For user programs using RDPMC directly the counter can
+be reset explicitly using ioctl:
+
+	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
+
+This "opens" a new measurement period.
+
+A program using RDPMC for TopDown should schedule such a reset
+regularly, for example every few seconds.
+
+Limits on Ice Lake
+==================
+
+All the TopDown events must be in a group with the SLOTS event.
+
+There is no sampling support for TopDown events.
+Sample read (the :S group modifier) of SLOTS and TopDown events
+is forbidden. For example, this is not allowed:
+
+	perf record -e '{slots, topdown-retiring}:S'
+
+[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
+[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
+[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
+[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
+[5] https://sites.google.com/site/analysismethods/yasin-pubs
-- 
2.17.1
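
A worked sketch of the delta arithmetic from "Decoding metrics values",
written as plain C helpers so the numbers can be checked without Ice Lake
hardware. The names topdown_sample, topdown_ratios and topdown_interval are
hypothetical and are not part of perf or libjevents; in a real program the
two samples would be filled from read_slots() and read_metrics(). The sketch
assumes that the scaled counts must be divided by 0xff, since each 8-bit
field is a fraction of 0xff, and that slots stays below 2^56 so the scaled
product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* One measurement point: SLOTS plus the raw metrics register value. */
struct topdown_sample {
	uint64_t slots;
	uint64_t metrics;
};

/* Level 1 ratios for one interval, each in the range 0.0 to 1.0. */
struct topdown_ratios {
	double retiring;
	double bad_spec;
	double fe_bound;
	double be_bound;
};

/* Extract the 8-bit field i (0 = retiring ... 3 = backend bound). */
static inline uint64_t get_metric(uint64_t m, unsigned int i)
{
	return (m >> (i * 8)) & 0xff;
}

/* Field i scaled by the slots count, in units of 1/0xff slot. */
static inline double scaled(const struct topdown_sample *s, unsigned int i)
{
	return (double)(get_metric(s->metrics, i) * s->slots);
}

/*
 * Ratios for the interval between sample a and the later sample b.
 * The subtraction is done in double so a small negative delta caused by
 * 8-bit rounding does not wrap around as it would with unsigned integers.
 */
static inline struct topdown_ratios
topdown_interval(const struct topdown_sample *a, const struct topdown_sample *b)
{
	assert(b->slots > a->slots);	/* b must be taken after a */

	/* 0xff per slot, because each field is a fraction of 0xff */
	double denom = 255.0 * (double)(b->slots - a->slots);
	struct topdown_ratios r = {
		.retiring = (scaled(b, 0) - scaled(a, 0)) / denom,
		.bad_spec = (scaled(b, 1) - scaled(a, 1)) / denom,
		.fe_bound = (scaled(b, 2) - scaled(a, 2)) / denom,
		.be_bound = (scaled(b, 3) - scaled(a, 3)) / denom,
	};
	return r;
}
```

For instance, with a = { 1000, 0x99330033 } and b = { 3000, 0x66330066 },
sample a decodes to 20% retiring, 20% frontend bound and 60% backend bound
over 1000 slots, and sample b to 40%, 20% and 40% over 3000 slots. The 2000
slots in between then work out to 50% retiring, 0% bad speculation, 20%
frontend bound and 30% backend bound.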