Date: Wed, 11 Nov 2020 22:35:42 -0800
From: Andi Kleen
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
    Alexander Shishkin, Jiri Olsa, Namhyung Kim, LKML, Jin Yao, John Garry,
    Paul Clarke, kajoljain, Stephane Eranian, Sandeep Dasgupta, linux-perf-users
Subject: Re: [RFC PATCH 00/12] Topdown parser
Message-ID: <20201112063542.GD894261@tassilo.jf.intel.com>
References: <20201110100346.2527031-1-irogers@google.com>
 <20201111214635.GA894261@tassilo.jf.intel.com>
 <20201112031049.GC894261@tassilo.jf.intel.com>

On Wed, Nov 11, 2020 at 08:09:49PM -0800, Ian Rogers wrote:
> > >    to the optimization manual the group Topdown_Group_TopDownL1 provides
> > >    the metrics Topdown_Metric_Frontend_Bound, Topdown_Metric_Backend_Bound,
> > >    Topdown_Metric_Bad_Speculation and Topdown_Metric_Retiring. The hope is
> > >    the events here will all be scheduled without multiplexing.
> >
> > That's not necessarily true. Some of the newer expressions are quite
> > complex (e.g. due to workarounds or because the events are complex, like
> > the FLOPS events). There are also some problems with the scheduling of
> > the fixed metrics on Icelake+ that need special handling.
>
> For FLOPS I see:
> ( #FP_Arith_Scalar + #FP_Arith_Vector ) / ( 2 * CORE_CLKS )
> Is the concern about multiplexing? Could we create metrics/groups aware of
> the limitations?

If you expand it you'll end up with a lot of events, so it has to be split
into groups. But you still need to understand the rules, otherwise the tool
ends up with non-schedulable groups.
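To give a rough idea of the expansion (the exact definitions are per model
and come from the TMA spreadsheet, so take this only as an illustration),
the two helpers boil down to something like:

  #FP_Arith_Scalar ~ FP_ARITH_INST_RETIRED.SCALAR_SINGLE
                   + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE
  #FP_Arith_Vector ~ FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE
                   + FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE
                   + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE
                   + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE
                   + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE
                   + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE

so that single ratio already wants eight programmable counters, plus
CORE_CLKS and whatever else ends up in the same group.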
For example, here's a group schedule generated by toplev for level 3 on
Icelake (on pre-Icelake, with only 4 counters, it's more complicated):

Microcode_Sequencer[3] Heavy_Operations[1] Memory_Bound[1] Branch_Mispredicts[1]
Fetch_Bandwidth[1] Other[4] Heavy_Operations[3] Frontend_Bound[1] FP_Arith[1]
Backend_Bound[1] Light_Operations[3] Microcode_Sequencer[1] FP_Arith[4]
Fetch_Bandwidth[2] Light_Operations[1] Retiring[1] Bad_Speculation[1]
Machine_Clears[1] Other[1] Core_Bound[1] Ports_Utilization[1]:
  perf_metrics.frontend_bound[f] perf_metrics.bad_speculation[f] topdown.slots[f]
  perf_metrics.backend_bound[f] perf_metrics.retiring[f] [0 counters]

Machine_Clears[2] Branch_Mispredicts[2] ITLB_Misses[3] Branch_Mispredicts[1]
Fetch_Bandwidth[1] Machine_Clears[1] Frontend_Bound[1] Backend_Bound[1]
ICache_Misses[3] Fetch_Latency[2] Fetch_Bandwidth[2] Bad_Speculation[1]
Memory_Bound[1] DSB_Switches[3]:
  inst_retired.any[f] machine_clears.count cpu_clk_unhalted.thread[f]
  int_misc.recovery_cycles:c1:e1 br_misp_retired.all_branches
  idq_uops_not_delivered.cycles_0_uops_deliv.core topdown.slots[f]
  dsb2mite_switches.penalty_cycles icache_64b.iftag_stall int_misc.uop_dropping
  icache_16b.ifdata_stall [8 counters]

Core_Bound[1] Core_Bound[2] Heavy_Operations[3] Memory_Bound[2] Light_Operations[3]:
  cycle_activity.stalls_mem_any idq.ms_uops exe_activity.1_ports_util
  exe_activity.exe_bound_0_ports exe_activity.2_ports_util
  exe_activity.bound_on_stores int_misc.recovery_cycles:c1:e1 uops_issued.any
  [8 counters]

Microcode_Sequencer[3] Store_Bound[3] Branch_Resteers[3] MS_Switches[3] Divider[3] LCP[3]:
  int_misc.clear_resteer_cycles arith.divider_active cpu_clk_unhalted.thread[f]
  idq.ms_switches ild_stall.lcp baclears.any exe_activity.bound_on_stores
  idq.ms_uops uops_issued.any [8 counters]

L1_Bound[3] L3_Bound[3]:
  cycle_activity.stalls_l1d_miss cycle_activity.stalls_mem_any
  cycle_activity.stalls_l2_miss cpu_clk_unhalted.thread[f]
  cycle_activity.stalls_l3_miss [4 counters]

DSB[3] MITE[3]:
  idq.mite_cycles_ok idq.mite_cycles_any cpu_clk_unhalted.distributed
  idq.dsb_cycles_any idq.dsb_cycles_ok [5 counters]

LSD[3] L2_Bound[3]:
  cycle_activity.stalls_l1d_miss cpu_clk_unhalted.thread[f]
  cpu_clk_unhalted.distributed lsd.cycles_ok lsd.cycles_active
  cycle_activity.stalls_l2_miss [5 counters]

Other[4] FP_Arith[4] L2_Bound[3] DRAM_Bound[3]:
  mem_load_retired.fb_hit mem_load_retired.l2_hit
  fp_arith_inst_retired.512b_packed_single mem_load_retired.l1_miss
  fp_arith_inst_retired.512b_packed_double l1d_pend_miss.fb_full_periods
  [6 counters]

Ports_Utilization[3] DRAM_Bound[3]:
  cycle_activity.stalls_l1d_miss arith.divider_active mem_load_retired.l2_hit
  exe_activity.1_ports_util cpu_clk_unhalted.thread[f]
  cycle_activity.stalls_l3_miss exe_activity.2_ports_util
  cycle_activity.stalls_l2_miss exe_activity.exe_bound_0_ports [8 counters]

Other[4] FP_Arith[4]:
  fp_arith_inst_retired.128b_packed_single uops_executed.thread
  uops_executed.x87 fp_arith_inst_retired.scalar_double
  fp_arith_inst_retired.256b_packed_single fp_arith_inst_retired.scalar_single
  fp_arith_inst_retired.128b_packed_double fp_arith_inst_retired.256b_packed_double
  [8 counters]

> Ok, so we can read the threshold from the spreadsheet and create an extra
> metric for whether the metric is above the threshold?

Yes, it can all be derived from the spreadsheet. You need a much more
complicated evaluation algorithm though; a single pass is not enough.
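As a very rough sketch of what such an evaluation could look like (this is
not toplev's actual code; the node names, formulas and thresholds below are
made up purely for illustration), the idea is that a node is only reported
when its own value and all of its ancestors cross their spreadsheet
thresholds, which is why a single pass over the rows is not enough:

from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Node:
    name: str
    parent: Optional[str]                         # None for level-1 nodes
    compute: Callable[[Dict[str, float]], float]  # formula over raw event counts
    threshold: float                              # threshold from the spreadsheet

def evaluate(nodes: Dict[str, Node], events: Dict[str, float]) -> Dict[str, bool]:
    """Return which nodes should be reported as bottlenecks."""
    values = {name: n.compute(events) for name, n in nodes.items()}
    # First pass: compare every node against its own threshold.
    fired = {name: values[name] > n.threshold for name, n in nodes.items()}
    # Further passes: suppress nodes whose parent did not fire, and keep
    # iterating until nothing changes, since clearing a parent can clear
    # children that were still set in an earlier pass.
    changed = True
    while changed:
        changed = False
        for name, n in nodes.items():
            if fired[name] and n.parent is not None and not fired[n.parent]:
                fired[name] = False
                changed = True
    return fired

# Toy example with made-up event names, values and thresholds:
nodes = {
    "Frontend_Bound": Node("Frontend_Bound", None,
                           lambda e: e["fe_slots"] / e["slots"], 0.15),
    "Fetch_Latency": Node("Fetch_Latency", "Frontend_Bound",
                          lambda e: e["fetch_lat_slots"] / e["slots"], 0.10),
}
events = {"slots": 1000.0, "fe_slots": 90.0, "fetch_lat_slots": 120.0}
# Fetch_Latency is over its own threshold, but is suppressed because its
# parent Frontend_Bound is not.
print(evaluate(nodes, events))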
> > Also in other cases it's probably better to not drill down, but collect
> > everything upfront, e.g. when someone else is doing the collection for
> > you. In this case the thresholding has to be figured out from existing
> > data.
>
> This sounds like perf record support for metrics, which I think is a good
> idea but not currently a use-case I'm trying to solve.

I meant just a single collection with perf stat, but with all the events for
all metrics/nodes collected. The tool then automatically figures out the
bottleneck and only shows the relevant information instead of everything.
That's a good use case when it's not feasible to iterate (e.g. you ask
someone else to measure).

> > The other biggie which is currently not in metrics is per core mode,
> > which is needed for many metrics on CPUs older than Icelake. This really
> > has to be supported in some way, otherwise the results on pre-Icelake SMT
> > are not good (Icelake fixes this problem).
>
> I think this can be added in the style of event modifiers. We may want to
> distinguish metrics from user or kernel code. We may want one cpu, etc.

There is a per core qualifier now, but you would also need to teach the
perf stat output to show the shared nodes. That's likely significant surgery.

> Sure, but I'm not understanding how a user without vtune or toplev is
> expected to follow top down?

It's fairly difficult. Some experts can do it, but it's not a common skill.

> Giving the metrics automates a lot of the process as well as alleviating
> differences across different models. Previously a lot of metrics had just
> broken expressions.

These problems should all be fixed now. We had some testing challenges, but
they have been addressed.

> Were C code used instead of strings in json, these would have resulted in
> compilation errors. Even with the metrics being broken in perf I didn't see
> users complaining. I suspect metrics are outside of most people's use, but
> that doesn't mean we shouldn't make them convenient to use if they want
> them.

Yes, there are issues, but that's also true for most events.

> I think ideally we have the TMA_Metrics.csv (and related tools) contributed
> to Linux; this would be processed at build time to build all events and
> metrics (what this patch series does). Making the metrics work, complexity,
> multiplexing, etc. are all issues, but ones worth fixing.

My feeling before was that some selected nodes and the non-TMA metrics make
sense, and that's in fact what the current metrics are. Perhaps it wasn't
perfect, but at least it was something, and they already work quite well with
the current perf infrastructure.

It probably also makes sense to add a few more nodes (say level 2 or maybe
level 3). But if you add much more you'll need all this extra machinery I
described. Without it, adding all the lower-level nodes blindly is not a good
idea, because the result will not be very usable, or worse, even misleading.

Even for level 2/3 we should probably have some simple thresholding at least,
but maybe we could get away with not having it yet. While it's of course
possible to add all of this too, it would be significant surgery. I actually
considered doing it all in perf at some point, but it was quite complicated.
I personally found it simpler to handle those higher-level algorithms in a
separate tool.

-Andi