Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ABAEEC05027 for ; Sun, 19 Feb 2023 09:34:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229658AbjBSJev (ORCPT ); Sun, 19 Feb 2023 04:34:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57140 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230004AbjBSJem (ORCPT ); Sun, 19 Feb 2023 04:34:42 -0500 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3EE593FB for ; Sun, 19 Feb 2023 01:33:26 -0800 (PST) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-5368974a1d7so16316217b3.20 for ; Sun, 19 Feb 2023 01:33:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc:subject:date :message-id:reply-to; bh=0WNtF2PwUjqgHiS8vMpkTdXaQVv2moJc+3rqhuSsfuE=; b=dLmrnaUGG3X1ERysDp0IRq/7zE8C6S2VBu+Y8Ifmmpg2eJgYsyXl7bisBwPs1DiTGl nMkXCAYn45Q51IOgYicn5/fRQnhwZ2AgrKX+gWHEDx/d9/PBofNNUcsNFmRKfZpJHljn fuXw2+ebxY1lKf1nliswtJRcwcEECKRiIhJT7Z4fap/sKtQKTuR+8tUN9NYyOH3/tzlf Vqr55zqW6Icy9VUi2nJ/WJT9wMuVBXSuanWyzxOzS5cpQJXuYVSGFZ7vSeG6uNArjFPu 2FhgxiTS4WX9/iZ8CTax+HQ7kb2837/YMdGze5ZXRJmdU8A7C7jaIkRn06uOxk2YHkSq U+Dg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=0WNtF2PwUjqgHiS8vMpkTdXaQVv2moJc+3rqhuSsfuE=; b=UX4VDXv3f/M6MQvyzBuwUuYQ/wyFvH21nOs6WqFPaPJSPCjk2fuoKeeLEhInM/xSsm nH9FxbmowTVIh3uBTMalhTpkJcGm2PvX09eXUcngQrbvZzB6Cayyt6ZCwUs90PwUU9v/ 7kOwE/Agv7jQu0ZLXSqco//ByT602KE70LAdy1cBCV6PR2nU+ed5ApE7HPMRqDuqNKno fWSjRHXb3wscM0dnfFvAZia823d45aDIs1FbWUJgRqX1fUPz85peSAlJyh2fCJPnLH2x cY7aNrHwyBENx8WCmQ468Ih4SG5Xn3h1PbruDGQ8/Wyi7amjlWeoa0KIzg3CanQJR9X9 RqQg== X-Gm-Message-State: AO0yUKWtpowKPjrPe7XTB+CdSlmaQign2U3mYAu1ETvGDrvVZLnbORKh 3EOTExJv9D8ZWCc1sAhgspKQlY2KRyz6 X-Google-Smtp-Source: AK7set/GdbPogPBZxZbNnY3wcJgxc37uIDZu6R6SX7+vTAYx3FE+NZ5ujYuHIHTW3hHQ2eG1atV2Aj21iy+b X-Received: from irogers.svl.corp.google.com ([2620:15c:2d4:203:cde9:3fbc:e1f1:6e3b]) (user=irogers job=sendgmr) by 2002:a25:9b85:0:b0:901:73ee:7044 with SMTP id v5-20020a259b85000000b0090173ee7044mr1642835ybo.458.1676799156002; Sun, 19 Feb 2023 01:32:36 -0800 (PST) Date: Sun, 19 Feb 2023 01:28:23 -0800 In-Reply-To: <20230219092848.639226-1-irogers@google.com> Message-Id: <20230219092848.639226-27-irogers@google.com> Mime-Version: 1.0 References: <20230219092848.639226-1-irogers@google.com> X-Mailer: git-send-email 2.39.2.637.g21b0678d19-goog Subject: [PATCH v1 26/51] perf vendor events intel: Refresh sapphirerapids events From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Mark Rutland , Alexander Shishkin , Jiri Olsa , Namhyung Kim , Maxime Coquelin , Alexandre Torgue , Zhengjun Xing , Sandipan Das , James Clark , Kajol Jain , John Garry , Kan Liang , Adrian Hunter , Andrii Nakryiko , Eduard Zingerman , Suzuki Poulouse , Leo Yan , Florian Fischer , Ravi Bangoria , Jing Zhang , Sean Christopherson , Athira Rajeev , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com, linux-arm-kernel@lists.infradead.org, Perry Taylor , Caleb Biggers Cc: Stephane Eranian , Ian Rogers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Update the sapphirerapids events from 1.09 to 1.11. Generation was done using https://github.com/intel/perfmon. Notable changes are new events and event descriptions, TMA metrics are updated to version 4.5, TMA info metrics are renamed from their node name to be lower case and prefixed by tma_info_, MetricThreshold expressions are added, smi_cost and transaction metric groups are added replicating existing hard coded metrics in stat-shadow. Signed-off-by: Ian Rogers --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/sapphirerapids/cache.json | 24 +- .../x86/sapphirerapids/floating-point.json | 32 + .../arch/x86/sapphirerapids/frontend.json | 8 + .../arch/x86/sapphirerapids/pipeline.json | 19 +- .../arch/x86/sapphirerapids/spr-metrics.json | 2283 +++++++++-------- .../arch/x86/sapphirerapids/uncore-other.json | 60 + 7 files changed, 1304 insertions(+), 1124 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 02715cbe4fd8..1f611a7dbdda 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -22,7 +22,7 @@ GenuineIntel-6-A[AC],v1.00,meteorlake,core GenuineIntel-6-1[AEF],v3,nehalemep,core GenuineIntel-6-2E,v3,nehalemex,core GenuineIntel-6-2A,v18,sandybridge,core -GenuineIntel-6-(8F|CF),v1.09,sapphirerapids,core +GenuineIntel-6-(8F|CF),v1.11,sapphirerapids,core GenuineIntel-6-(37|4A|4C|4D|5A),v14,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v53,skylake,core GenuineIntel-6-55-[01234],v1.28,skylakex,core diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json b/too= ls/perf/pmu-events/arch/x86/sapphirerapids/cache.json index 92a605ecac6e..9606e76b98d6 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json @@ -97,18 +97,18 @@ "UMask": "0x4" }, { - "BriefDescription": "All accesses to L2 cache[This event is alias = to L2_RQSTS.REFERENCES]", + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES]", "EventCode": "0x24", "EventName": "L2_REQUEST.ALL", - "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses.[This event is alias to L2_RQSTS.REFERENCES]", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES]", "SampleAfterValue": "200003", "UMask": "0xff" }, { - "BriefDescription": "Read requests with true-miss in L2 cache.[Thi= s event is alias to L2_RQSTS.MISS]", + "BriefDescription": "Read requests with true-miss in L2 cache. [Th= is event is alias to L2_RQSTS.MISS]", "EventCode": "0x24", "EventName": "L2_REQUEST.MISS", - "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses.[This event is alias to L2_RQSTS.MISS]", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_RQSTS.MISS]", "SampleAfterValue": "200003", "UMask": "0x3f" }, @@ -199,18 +199,18 @@ "UMask": "0x30" }, { - "BriefDescription": "Read requests with true-miss in L2 cache.[Thi= s event is alias to L2_REQUEST.MISS]", + "BriefDescription": "Read requests with true-miss in L2 cache. [Th= is event is alias to L2_REQUEST.MISS]", "EventCode": "0x24", "EventName": "L2_RQSTS.MISS", - "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses.[This event is alias to L2_REQUEST.MISS]", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_REQUEST.MISS]", "SampleAfterValue": "200003", "UMask": "0x3f" }, { - "BriefDescription": "All accesses to L2 cache[This event is alias = to L2_REQUEST.ALL]", + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_REQUEST.ALL]", "EventCode": "0x24", "EventName": "L2_RQSTS.REFERENCES", - "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses.[This event is alias to L2_REQUEST.ALL]", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_REQUEST.ALL]", "SampleAfterValue": "200003", "UMask": "0xff" }, @@ -863,6 +863,14 @@ "SampleAfterValue": "1000003", "UMask": "0x1" }, + { + "BriefDescription": "Counts bus locks, accounts for cache line spl= it locks and UC locks.", + "EventCode": "0x2c", + "EventName": "SQ_MISC.BUS_LOCK", + "PublicDescription": "Counts the more expensive bus lock needed to= enforce cache coherency for certain memory accesses that need to be done a= tomically. Can be created by issuing an atomic instruction (via the LOCK p= refix) which causes a cache line split or accesses uncacheable memory.", + "SampleAfterValue": "100003", + "UMask": "0x10" + }, { "BriefDescription": "Number of PREFETCHNTA instructions executed."= , "EventCode": "0x40", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.j= son b/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json index 01baea3df562..4a9d211e9d4f 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/floating-point.json @@ -75,6 +75,14 @@ "SampleAfterValue": "100003", "UMask": "0x20" }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packe= d single and 256-bit packed double precision FP instructions retired; some = instructions will count twice as noted below. Each count represents 2 or/a= nd 4 computation operations, 1 for each element. Applies to SSE* and AVX* = packed single precision and packed double precision FP instructions: ADD SU= B HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DP= P and FM(N)ADD/SUB count twice as they perform 2 calculations per element."= , + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed single precision and 256-bit packed double precision floating-point ins= tructions retired; some instructions will count twice as noted below. Each= count represents 2 or/and 4 computation operations, one for each element. = Applies to SSE* and AVX* packed single precision floating-point and packed= double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL= DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB ins= tructions count twice as they perform 2 calculations per element. The DAZ a= nd FTZ flags in the MXCSR register need to be set when using these events."= , + "SampleAfterValue": "100003", + "UMask": "0x18" + }, { "BriefDescription": "Counts number of SSE/AVX computational 512-bi= t packed double precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 8 computation= operations, one for each element. Applies to SSE* and AVX* packed double = precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT14= RCP14 FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform = 2 calculations per element.", "EventCode": "0xc7", @@ -91,6 +99,22 @@ "SampleAfterValue": "100003", "UMask": "0x80" }, + { + "BriefDescription": "Number of SSE/AVX computational 256-bit packe= d single precision and 512-bit packed double precision FP instructions ret= ired; some instructions will count twice as noted below. Each count repres= ents 8 computation operations, 1 for each element. Applies to SSE* and AVX= * packed single precision and double precision FP instructions: ADD SUB HAD= D HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14 RCP RCP14 DPP FM(N)ADD/SUB= . DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per elem= ent.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.8_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 256-bit pack= ed single precision and 512-bit packed double precision floating-point ins= tructions retired; some instructions will count twice as noted below. Each= count represents 8 computation operations, one for each element. Applies = to SSE* and AVX* packed single precision and double precision floating-poin= t instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RSQRT14= RCP RCP14 DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice= as they perform 2 calculations per element. The DAZ and FTZ flags in the M= XCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x60" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floati= ng-point instructions retired; some instructions will count twice as noted = below. Applies to SSE* and AVX* scalar, double and single precision floati= ng-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB= . DPP and FM(N)ADD/SUB instructions count twice as they perform multiple c= alculations per element.", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar singl= e precision and double precision floating-point instructions retired; some = instructions will count twice as noted below. Each count represents 1 comp= utational operation. Applies to SSE* and AVX* scalar single precision float= ing-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB= . FM(N)ADD/SUB instructions count twice as they perform 2 calculations per= element. The DAZ and FTZ flags in the MXCSR register need to be set when u= sing these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, { "BriefDescription": "Counts number of SSE/AVX computational scalar= double precision floating-point instructions retired; some instructions wi= ll count twice as noted below. Each count represents 1 computational opera= tion. Applies to SSE* and AVX* scalar double precision floating-point instr= uctions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructi= ons count twice as they perform 2 calculations per element.", "EventCode": "0xc7", @@ -107,6 +131,14 @@ "SampleAfterValue": "100003", "UMask": "0x2" }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic in= structions", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "PublicDescription": "Number of any Vector retired FP arithmetic i= nstructions. The DAZ and FTZ flags in the MXCSR register need to be set wh= en using these events.", + "SampleAfterValue": "1000003", + "UMask": "0xfc" + }, { "BriefDescription": "FP_ARITH_INST_RETIRED2.128B_PACKED_HALF", "EventCode": "0xcf", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json b/= tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json index 2c7c617f27ed..860a415e5e79 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json @@ -1,4 +1,12 @@ [ + { + "BriefDescription": "Clears due to Unknown Branches.", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered w= hen it finds a branch instruction in a fetch line. This is called Unknown B= ranch which occurs for the first time a branch instruction is fetched or wh= en the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Stalls caused by changing prefix length of th= e instruction.", "EventCode": "0x87", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json b/= tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json index ceb14181ebc8..40e52357ade1 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json @@ -1,15 +1,17 @@ [ { - "BriefDescription": "AMX_OPS_RETIRED.BF16", + "BriefDescription": "AMX retired arithmetic BF16 operations.", "EventCode": "0xce", "EventName": "AMX_OPS_RETIRED.BF16", + "PublicDescription": "Number of AMX-based retired arithmetic bfloa= t16 (BF16) floating-point operations. Counts TDPBF16PS FP instructions. SW = to use operation multiplier of 4", "SampleAfterValue": "1000003", "UMask": "0x2" }, { - "BriefDescription": "AMX_OPS_RETIRED.INT8", + "BriefDescription": "AMX retired arithmetic integer 8-bit operatio= ns.", "EventCode": "0xce", "EventName": "AMX_OPS_RETIRED.INT8", + "PublicDescription": "Number of AMX-based retired arithmetic integ= er operations of 8-bit width source operands. Counts TDPB[SS,UU,US,SU]D ins= tructions. SW should use operation multiplier of 8.", "SampleAfterValue": "1000003", "UMask": "0x1" }, @@ -461,12 +463,23 @@ "UMask": "0x1" }, { - "BriefDescription": "INST_RETIRED.REP_ITERATION", + "BriefDescription": "Iterations of Repeat string retired instructi= ons.", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", + "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. Note the number of ite= rations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" }, + { + "BriefDescription": "Clears speculative count", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due = to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1" + }, { "BriefDescription": "Counts cycles after recovery from a branch mi= sprediction or machine clear till the first uop is issued from the resteere= d path.", "EventCode": "0xad", diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json= b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json index ce18fc458e37..149cc4c07fb5 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json @@ -1,1616 +1,1675 @@ [ { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency *= tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_i= cache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "Mispredictions" - }, - { - "BriefDescription": "Total pipeline cost of (external) Memory Band= width related bottlenecks", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) = + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_acces= ses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_b= ound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk))", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "Memory_Bandwidth" + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of Memory Latency related= bottlenecks (external memory and off-core caches)", - "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + = tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound= + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_= accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_b= ound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_p= mm_bound + tma_store_bound))", - "MetricGroup": "Mem;MemoryLat;Offcore", - "MetricName": "Memory_Latency" + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_me= mory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + t= ma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_dt= lb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_= blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + t= ma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtl= b_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_st= reaming_stores)))", - "MetricGroup": "Mem;MemoryTLB;Offcore", - "MetricName": "Memory_Data_TLBs" + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of branch related instruc= tions (used for program control-flow including function calls)", - "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.= NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * = BR_INST_RETIRED.NEAR_CALL)) / SLOTS)", - "MetricGroup": "Ret", - "MetricName": "Branching_Overhead" + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "Big_Code" + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / = 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks", - "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma= _mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icach= e_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code", - "MetricGroup": "Fed;FetchBW;Frontend", - "MetricName": "Instruction_Fetch_BW" + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", - "MetricExpr": "INST_RETIRED.ANY / CLKS", - "MetricGroup": "Ret;Summary", - "MetricName": "IPC" + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * SLOTS / INST_RETIRED.ANY", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "UPI" + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_cl= ks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.6", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "tma_retiring * SLOTS / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW", - "MetricName": "UpTB" + "BriefDescription": "This metric estimates fraction of cycles wher= e the Advanced Matrix Extensions (AMX) execution engine was busy with tile = (arithmetic) operations", + "MetricExpr": "EXE.AMX_BUSY / tma_info_core_clks", + "MetricGroup": "Compute;HPC;Server;TopdownL5;tma_L5_group;tma_port= s_utilized_0_group", + "MetricName": "tma_amx_busy", + "MetricThreshold": "tma_amx_busy > 0.5 & (tma_ports_utilized_0 > 0= .2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bo= und > 0.2)))", + "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", - "MetricExpr": "1 / IPC", - "MetricGroup": "Mem;Pipeline", - "MetricName": "CPI" + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "100 * cpu@ASSISTS.ANY\\,umask\\=3D0x1B@ / tma_info_= slots", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "Pipeline", - "MetricName": "CLKS" + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", - "MetricGroup": "tma_L1_group", - "MetricName": "SLOTS" + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_inf= o_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(SLOTS / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)", - "MetricGroup": "SMT;tma_L1_group", - "MetricName": "Slots_Utilization" + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + t= ma_retiring), 0)", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "ScaleUnit": "100%" }, { - "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", - "MetricGroup": "Cor;Pipeline", - "MetricName": "Execute_per_Issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tm= a_info_slots", + "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group= ;tma_bad_speculation_group;tma_issueBM", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredict= s_resteers", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", - "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS", - "MetricGroup": "Ret;SMT;tma_L1_group", - "MetricName": "CoreIPC" + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma= _unknown_branches", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "ScaleUnit": "100%" }, { - "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARIT= H_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_H= ALF) + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_A= RITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DO= UBLE) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIR= ED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 *= AMX_OPS_RETIRED.BF16) / CORE_CLKS", - "MetricGroup": "Flops;Ret", - "MetricName": "FLOPc" + "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * CORE_CLKS)", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "FP_Arith_Utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per-core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", - "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", - "MetricName": "ILP" + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(76 * tma_info_average_frequency * (MEM_LOAD_L3_HIT= _RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DAT= A_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + = 75.5 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1= + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks"= , + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", - "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma= _core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 els= e 0)", - "MetricGroup": "Cor;SMT", - "MetricName": "Core_Bound_Likely" + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "SMT", - "MetricName": "CORE_CLKS" + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "75.5 * tma_info_average_frequency * (MEM_LOAD_L3_HI= T_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSync= xn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", - "MetricGroup": "InsType", - "MetricName": "IpLoad" + "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (= tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 6 > = 0.35))", + "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", - "MetricGroup": "InsType", - "MetricName": "IpStore" + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Branches;Fed;InsType", - "MetricName": "IpBranch" + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_clks - tm= a_pmm_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "IpCall" + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info= _core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & = tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instruction per taken branch", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", - "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO", - "MetricName": "IpTB" + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma= _info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch instructions per taken branch. ", - "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", - "MetricGroup": "Branches;Fed;PGO", - "MetricName": "BpTkBranch" + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMOR= Y_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_= l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SI= NGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR_= HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRE= D2.COMPLEX_SCALAR_HALF) + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + F= P_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED2.128B= _PACKED_HALF + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RET= IRED.512B_PACKED_DOUBLE) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + = FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512= B_PACKED_HALF + 4 * AMX_OPS_RETIRED.BF16)", - "MetricGroup": "Flops;InsType", - "MetricName": "IpFLOP" + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_= store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SI= NGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR = + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PA= CKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETI= RED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.VECTOR))", - "MetricGroup": "Flops;InsType", - "MetricName": "IpArith", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). May undercount due to FMA doubl= e counting. Approximated prior to BDW." + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "80 * tma_info_average_frequency * OCR.DEMAND_RFO.L3= _HIT.SNOOP_HITM / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_SP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). May und= ercount due to FMA double counting." + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_is= sueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_dram_bw_use, tma_info_memory_b= andwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_= stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", - "MetricGroup": "Flops;FpScalar;InsType", - "MetricName": "IpArith_Scalar_DP", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). May und= ercount due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound= > 0.15 & tma_info_ipc / 6 > 0.35", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp"= , + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.128B_PACKED_HALF)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX128", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). May undercount = due to FMA double counting." + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.256B_PACKED_HALF)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX256", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). May undercount due= to FMA double counting." + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.512B_PACKED_HALF)", - "MetricGroup": "Flops;FpVector;InsType", - "MetricName": "IpArith_AVX512", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). May undercount due = to FMA double counting." + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) matrix uops fraction the CPU has retired (aggregated across all = supported FP datatypes in AMX engine)", + "MetricExpr": "cpu@AMX_OPS_RETIRED.BF16\\,cmask\\=3D1@ / (tma_reti= ring * tma_info_slots)", + "MetricGroup": "Compute;Flops;HPC;Pipeline;Server;TopdownL4;tma_L4= _group;tma_fp_arith_group", + "MetricName": "tma_fp_amx", + "MetricThreshold": "tma_fp_amx > 0.1 & (tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) matrix uops fraction the CPU has retired (aggregated across all= supported FP datatypes in AMX engine). Refer to AMX_Busy and GFLOPs metric= s for actual AMX utilization and FP performance, resp.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per FP Arithmetic AMX operation = (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.BF16", - "MetricGroup": "Flops;FpVector;InsType;Server", - "MetricName": "IpArith_AMX_F16", - "PublicDescription": "Instructions per FP Arithmetic AMX operation= (lower number means higher occurrence rate). Operations factored per matri= ces' sizes of the AMX instructions." + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector + tma_f= p_amx", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Integer Arithmetic AMX opera= tion (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.INT8", - "MetricGroup": "InsType;IntVector;Server", - "MetricName": "IpArith_AMX_Int8", - "PublicDescription": "Instructions per Integer Arithmetic AMX oper= ation (lower number means higher occurrence rate). Operations factored per = matrices' sizes of the AMX instructions." + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists"= , + "MetricExpr": "30 * ASSISTS.FP / tma_info_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", - "MetricGroup": "Prefetches", - "MetricName": "IpSWPF" + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total number of retired Instructions Sample w= ith: INST_RETIRED.PREC_DIST", - "MetricExpr": "INST_RETIRED.ANY", - "MetricGroup": "Summary;tma_L1_group", - "MetricName": "Instructions" + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,uma= sk\\=3D0x3c@ + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_sl= ots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * SLOTS / cpu@UOPS_RETIRED.SLOTS\\,cma= sk\\=3D1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Retire" + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF= ) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", - "MetricGroup": "Pipeline;Ret", - "MetricName": "Strings_Cycles" + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF= ) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per a microcode Assist invocatio= n. See Assists tree node for details (lower number means higher occurrence = rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@ASSISTS.ANY\\,umask\\=3D0x1B= @", - "MetricGroup": "Pipeline;Ret;Retire", - "MetricName": "IpAssist" + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF= ) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "ScaleUnit": "100%" }, { - "BriefDescription": "", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", - "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", - "MetricName": "Execute" + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", - "MetricGroup": "Fed;FetchBW", - "MetricName": "Fetch_UpC" + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JC= C are commonly used examples.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.= MS_UOPS)", - "MetricGroup": "DSB;Fed;FetchBW", - "MetricName": "DSB_Coverage" + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEA= VY", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", - "MetricGroup": "DSBmiss", - "MetricName": "DSB_Switch_Cost" + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma= _fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Total penalty related to DSB (uop cache) miss= es - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "DSB_Misses" + "BriefDescription": "Measured Average Frequency for unhalted proce= ssors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_t= ime", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Number of Instructions per non-speculative DS= B miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC"= , + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch rel= ated bottlenecks by large code footprint programs (i-side cache; TLB and BT= B misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_m= ispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_= misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RE= TIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" - }, - { - "BriefDescription": "Fraction of branches that are non-taken condi= tionals", - "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" - }, - { - "BriefDescription": "Fraction of branches that are taken condition= als", - "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" - }, - { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" - }, - { - "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" - }, - { - "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", - "MetricExpr": "1 - (Cond_NT + Cond_TK + CallRet + Jump)", - "MetricGroup": "Bad;Branches", - "MetricName": "Other_Branches" - }, - { - "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" - }, - { - "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss. Per-Logical Proces= sor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" - }, - { - "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" - }, - { - "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" - }, - { - "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" - }, - { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" - }, - { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" - }, - { - "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" - }, - { - "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_m= ispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_= misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / B= R_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredi= cts_resteers" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" - }, - { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" - }, - { - "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * CORE_CLKS)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" - }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "Total pipeline cost of branch related instruc= tions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.COND + 3 * BR_INST_RETIRED.= NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * = BR_INST_RETIRED.NEAR_CALL)) / tma_info_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instru= ctions (used for program control-flow including function calls). Related me= trics: tma_info_big_code" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)"= , + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "Rate of silent evictions from the L2 cache pe= r Kilo instruction where the evicted lines are dropped (no writeback to L3 = or memory)", - "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_Silent_PKI" + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "Rate of non silent evictions from the L2 cach= e per Kilo instruction", - "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_NonSilent_PKI" + "BriefDescription": "Fraction of branches that are taken condition= als", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization = if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utiliz= ation > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "CPU_CLK_UNHALTED.DISTRIBUTED", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { "BriefDescription": "Average CPU Utilization", "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" - }, - { - "BriefDescription": "Measured Average Frequency for unhalted proce= ssors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" - }, - { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARIT= H_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_H= ALF) + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_A= RITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DO= UBLE) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIR= ED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF + 4 *= AMX_OPS_RETIRED.BF16) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width and AMX engine." + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Tera Integer (matrix) Operations Per Second", - "MetricExpr": "8 * AMX_OPS_RETIRED.INT8 / 1e12 / duration_time", - "MetricGroup": "Cor;HPC;IntVector;Server", - "MetricName": "TIOPS" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_= REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_ba= ndwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", - "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 6= > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_l= cp" }, { - "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k"= , - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches= ", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." }, { - "BriefDescription": "Average number of parallel data read requests= to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2= data-read prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=3D0x1@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_PMM_Read_Latency" + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-= read prefetches", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_DRAM_Read_Latency" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ + FP_ARITH_INST_RETIRED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.= 128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * cpu@= FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=3D0x18@ + 8 * (FP_ARITH_= INST_RETIRED2.128B_PACKED_HALF + cpu@FP_ARITH_INST_RETIRED.256B_PACKED_SING= LE\\,umask\\=3D0x60@) + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_= ARITH_INST_RETIRED.512B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_P= ACKED_HALF + 4 * AMX_OPS_RETIRED.BF16", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_flopc" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Read_BW" + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Write_BW" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "tma_info_flopc / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width and AMX engine." }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Write_BW" + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "uncore_cha_0@event\\=3D0x1@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D1\\,edge@", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma= _mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icach= e_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_cod= e", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "Percentage of time spent in the active CPU po= wer state C0", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "cpu_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC = * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_write_bw" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALA= R_SINGLE\\,umask\\=3D0x03@ + FP_ARITH_INST_RETIRED2.SCALAR + (cpu@FP_ARITH_= INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0x3c@ + FP_ARITH_INST_RETIRED2.= VECTOR))", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). May undercount due to FMA doubl= e counting. Approximated prior to BDW." }, { - "BriefDescription": "The ratio of number of completed memory load = instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AMX operation = (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.BF16", + "MetricGroup": "Flops;FpVector;InsType;Server", + "MetricName": "tma_info_iparith_amx_f16", + "MetricThreshold": "tma_info_iparith_amx_f16 < 10", + "PublicDescription": "Instructions per FP Arithmetic AMX operation= (lower number means higher occurrence rate). Operations factored per matri= ces' sizes of the AMX instructions." }, { - "BriefDescription": "The ratio of number of completed memory store= instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Integer Arithmetic AMX opera= tion (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / AMX_OPS_RETIRED.INT8", + "MetricGroup": "InsType;IntVector;Server", + "MetricName": "tma_info_iparith_amx_int8", + "MetricThreshold": "tma_info_iparith_amx_int8 < 10", + "PublicDescription": "Instructions per Integer Arithmetic AMX oper= ation (lower number means higher occurrence rate). Operations factored per = matrices' sizes of the AMX instructions." }, { - "BriefDescription": "Ratio of number of requests missing L1 data c= ache (includes data+rfo w/ prefetches) to the total number of completed ins= tructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.128B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). May undercount = due to FMA double counting." }, { - "BriefDescription": "Ratio of number of demand load requests hitti= ng in L1 data cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.256B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). May undercount due= to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read requests missing= in L1 instruction cache (includes prefetches) to the total number of compl= eted instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.512B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). May undercount due = to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed demand load requ= ests hitting in L2 cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). May und= ercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of requests missing L2 cache = (includes code+data+rfo w/ prefetches) to the total number of completed ins= tructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). May und= ercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of completed data read reques= t missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / cpu@ASSISTS.ANY\\,umask\\=3D0x1B= @", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_ipassist", + "MetricThreshold": "tma_info_ipassist < 100e3", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Ratio of number of code read request missing = L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Ratio of number of data read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_= TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETI= RED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Ratio of number of code read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD / INST_RETIRED.ANY"= , - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_= OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to local memory in nano= seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL / UN= C_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL) / (UNC_CHA_CLOCKTICKS / (source_count(= UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL) * #num_packages)) * duration_time"= , - "MetricName": "llc_demand_data_read_miss_latency_for_local_request= s", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to remote memory in nan= o seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE / U= NC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE) / (UNC_CHA_CLOCKTICKS / (source_coun= t(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE) * #num_packages)) * duration_ti= me", - "MetricName": "llc_demand_data_read_miss_latency_for_remote_reques= ts", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_flopc", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to Intel(R) Optane(TM) = Persistent Memory(PMEM) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_pmem_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to DRAM in nano seconds= ", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_dram_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_cond_taken", + "MetricThreshold": "tma_info_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", - "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY"= , - "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_ret", + "MetricThreshold": "tma_info_ipmisp_ret < 500" }, { - "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", - "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", - "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 13", + "PublicDescription": "Instruction per taken branch. Related metric= s: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_d= sb_misses, tma_lcp" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTIC= KS) * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data t= ransmit bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e= 6 / duration_time", - "MetricName": "upi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k"= , + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e= 6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 6= 4 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / durati= on_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Uops delivered from decoded instruction cache= (decoded stream buffer or DSB) as a percent of total uops delivered to Ins= truction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.= MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Rate of non silent evictions from the L2 cach= e per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_instructio= ns", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_nonsilent_pki" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (M= icro-instruction Translation Engine or MITE) as a percent of total uops del= ivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ= .MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline"= , - "ScaleUnit": "100%" + "BriefDescription": "Rate of silent evictions from the L2 cache pe= r Kilo instruction where the evicted lines are dropped (no writeback to L3 = or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_silent_pki" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) = as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.M= S_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", - "MetricName": "llc_miss_remote_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", - "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / slots", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Machine_Width uops every c= ycle to the Backend. Frontend Bound denotes unutilized issue-slots when the= re is no Backend stall; i.e. bubbles where Frontend delivered no uops while= Backend could have accepted them. For example; stalls due to instruction-c= ache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / slots", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_f= rontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", - "MetricExpr": "ICACHE_DATA.STALLS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma= _fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "ICACHE_TAG.STALLS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;= tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.TH= READ + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage. ", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) / max(1 = - (tma_frontend_bound + topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + topdown\\-r= etiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring= + topdown\\-be\\-bound)), 0) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNH= ALTED.THREAD", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears. ", - "MetricExpr": "(1 - topdown\\-br\\-mispredict / (topdown\\-fe\\-bo= und + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) / m= ax(1 - (tma_frontend_bound + topdown\\-be\\-bound / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + topdow= n\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-ret= iring + topdown\\-be\\-bound)), 0)) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_C= LK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / CPU_CLK_UNHALTED.T= HREAD", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch= _resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (First fetch or hitt= ing BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED= .THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "DECODE.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l3_miss_latency" }, { - "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_gro= up;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / CPU_CL= K_UNHALTED.DISTRIBUTED / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests= to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", - "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / CPU_CLK_= UNHALTED.DISTRIBUTED / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=3D0x1@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_pmm_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L= 2 data-read prefetches" }, { - "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", - "MetricExpr": "max(1 - (tma_frontend_bound + topdown\\-be\\-bound = / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdo= wn\\-be\\-bound) + topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-b= ad\\-spec + topdown\\-retiring + topdown\\-be\\-bound)), 0)", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L= 2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of (external) Memory Band= width related bottlenecks", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) = + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_acces= ses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_b= ound + tma_store_bound) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma= _lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Ban= dwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_dram_bw_= use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", - "MetricExpr": "max(0, tma_bad_speculation - topdown\\-br\\-mispred= ict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + t= opdown\\-be\\-bound))", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L= 2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_me= mory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + t= ma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_dt= lb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_= blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + t= ma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtl= b_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_st= reaming_stores)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": "tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" }, { - "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Latency related= bottlenecks (external memory and off-core caches)", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + = tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound= + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_= accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_b= ound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_p= mm_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": "tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency relate= d bottlenecks (external memory and off-core caches). Related metrics: tma_l= 3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots"= , - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_ba= ckend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency *= tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_i= cache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bran= ch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", - "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEM= ORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_grou= p", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_cond_nt + tma_info_cond_tk + tma_info= _callret + tma_info_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_other_branches" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", - "MetricName": "tma_load_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.THR= EAD", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", - "MetricName": "tma_load_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_read_bw" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", - "MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED= .THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_write_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "min((16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_R= QSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES *= (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUT= STANDING.CYCLES_WITH_DEMAND_RFO))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group"= , - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "tma_retiring * tma_info_slots / cpu@UOPS_RETIRED.SL= OTS\\,cmask\\=3D1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_= CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "TOPDOWN.SLOTS", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", - "MetricExpr": "L1D_PEND_MISS.FB_FULL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group= ", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", + "MetricExpr": "(tma_info_slots / (TOPDOWN.SLOTS / 2) if #SMT_on el= se 1)", + "MetricGroup": "SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_slots_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "uncore_cha_0@event\\=3D0x1@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "min(((28 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHAL= TED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (= CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9= / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR= .DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM += OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (27 * (CPU_CLK_UNHALTED.= THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time= * 1e3 / 1e3)) - 3 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * = #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_HIT_RE= TIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS = / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "min((27 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALT= ED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (C= PU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 = / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM = / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_H= IT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / = 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_= group", - "MetricName": "tma_data_sharing", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance.", - "ScaleUnit": "100%" + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_strings_cycles", + "MetricThreshold": "tma_info_strings_cycles > 0.1" }, { - "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited)", - "MetricExpr": "min((12 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALT= ED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3 * (C= PU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 = / (duration_time * 1e3 / 1e3))) * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_R= ETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)= ", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_grou= p", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L3 cache under unloaded scenarios (pos= sibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L= 3 hits) will improve the latency; reduce contention with sibling physical c= ores and increase performance. Note the value of this node may overlap wit= h its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "Tera Integer (matrix) Operations Per Second", + "MetricExpr": "8 * AMX_OPS_RETIRED.INT8 / 1e12 / duration_time", + "MetricGroup": "Cor;HPC;IntVector;Server", + "MetricName": "tma_info_tiops" }, { - "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", - "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / CPU_CL= K_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bou= nd_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). The Super Queue is used for = requests to access the L2 cache or to go out to the Uncore.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min(MEMORY_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALT= ED.THREAD - tma_pmm_bound, 1)", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_slots / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_b= ound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory (DRAM). The underlying heuristic assumes that a simi= lar off-core traffic is generated by all IA cores. This metric does not agg= regate non-data-read requests by this logical processor; requests from othe= r IA Logical Processors/Physical Cores/sockets; or other non-IA devices lik= e GPU; hence the maximum external memory bandwidth limits may or may not be= approached when this metric is flagged (see Uncore counters for that).", - "ScaleUnit": "100%" + "BriefDescription": "Cross-socket Ultra Path Interconnect (UPI) da= ta transmit bandwidth for data only [MB / sec]", + "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 64 / 9 / 1e6", + "MetricGroup": "Server;SoC", + "MetricName": "tma_info_upi_data_transmit_bw" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", - "MetricExpr": "INT_MISC.MBA_STALLS / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", - "MetricName": "tma_mba_stalls", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "tma_retiring * tma_info_slots / BR_INST_RETIRED.NEA= R_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 9" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory (DRAM= )", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_= bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory (DRA= M). This metric does not aggregate requests from other Logical Processors/= Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric approximates arithmetic Integer (= Int) matrix uops fraction the CPU has retired (aggregated across all suppor= ted Int datatypes in AMX engine)", + "MetricExpr": "cpu@AMX_OPS_RETIRED.INT8\\,cmask\\=3D1@ / (tma_reti= ring * tma_info_slots)", + "MetricGroup": "Compute;HPC;IntVector;Pipeline;Server;TopdownL4;tm= a_L4_group;tma_int_operations_group", + "MetricName": "tma_int_amx", + "MetricThreshold": "tma_int_amx > 0.1 & (tma_int_operations > 0.1 = & tma_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic Integer = (Int) matrix uops fraction the CPU has retired (aggregated across all suppo= rted Int datatypes in AMX engine). Refer to AMX_Busy and TIOPs metrics for = actual AMX utilization and Int performance, resp.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "min((66.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHA= LTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 *= (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1= e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * = (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHA= LTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance.", + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b + tma_shu= ffles + tma_int_amx", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "min((131 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHAL= TED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 * = (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e= 9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHA= LTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "min(((120 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHA= LTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 *= (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1= e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM += (120 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_F= REQ / 1e9 / (duration_time * 1e3 / 1e3)) - 12 * (CPU_CLK_UNHALTED.THREAD / = CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / = 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HI= T / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_me= m_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", - "MetricExpr": "min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DR= AM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_= LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_R= ETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIR= ED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HIT= M * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM= _LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_L= OAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.R= EMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_= LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_= RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIR= ED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REM= OTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (M= EMORY_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD) if 1e6 * (MEM_LOAD= _L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRE= D.L1_MISS else 0), 1)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module. ", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;= tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CPU_CLK_UNHALTED.THR= EAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", - "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store= _bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full)", + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "min(28 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTE= D.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * OCR.DEM= AND_RFO.L3_HIT.SNOOP_HITM / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. ", + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CPU_CLK_UNHALTED.DI= STRIBUTED", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= .", + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited)", + "MetricExpr": "33 * tma_info_average_frequency * MEM_LOAD_RETIRED.= L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_= l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L3 cache under unloaded scenarios (pos= sibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L= 3 hits) will improve the latency; reduce contention with sibling physical c= ores and increase performance. Note the value of this node may overlap wit= h its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: t= ma_info_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", - "MetricExpr": "min(9 * OCR.STREAMING_WR.ANY_RESPONSE / CPU_CLK_UNH= ALTED.THREAD, 1)", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_store_= bound_group", - "MetricName": "tma_streaming_stores", - "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck.", + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CPU_CLK_UNHALTED.DISTRIBUTED, 1)= ", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_g= roup", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized software = running on Intel Core/Xeon products. While this often indicates efficient X= 86 instructions were executed; high value does not necessarily mean better = performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", - "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_cl= ks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. Sample with: = UOPS_DISPATCHED.PORT_2_3_10", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.DI= STRIBUTED", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", - "MetricName": "tma_store_stlb_miss", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", - "MetricExpr": "max(0, topdown\\-be\\-bound / (topdown\\-fe\\-bound= + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topd= own\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\= \-retiring + topdown\\-be\\-bound)) + 0 * slots", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_grou= p;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", + "MetricExpr": "71 * tma_info_average_frequency * MEM_LOAD_L3_MISS_= RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ += (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bound + = topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * cpu@EXE= _ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@)) / CPU_CLK_UNHALTED.THREAD if ARIT= H.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOAD= S else (EXE_ACTIVITY.1_PORTS_UTIL + topdown\\-retiring / (topdown\\-fe\\-bo= und + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * c= pu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / CPU_CLK_UNHALTED.THREAD) += 0 * slots", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / C= PU_CLK_UNHALTED.THREAD + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS= _TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group= ;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / CPU_CLK_UNHALTED.THREA= D", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group"= , - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance.", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", + "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", + "MetricName": "tma_mba_stalls", + "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions.", - "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_g= roup", - "MetricName": "tma_slow_pause", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_b= ound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory (DRAM). The underlying heuristic assumes that a simi= lar off-core traffic is generated by all IA cores. This metric does not agg= regate non-data-read requests by this logical processor; requests from othe= r IA Logical Processors/Physical Cores/sockets; or other non-IA devices lik= e GPU; hence the maximum external memory bandwidth limits may or may not be= approached when this metric is flagged (see Uncore counters for that). Rel= ated metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth,= tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", - "MetricExpr": "min(13 * MISC2_RETIRED.LFENCE / CPU_CLK_UNHALTED.TH= READ, 1)", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_g= roup", - "MetricName": "tma_memory_fence", + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory (DRAM= )", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_= bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory (DRA= M). This metric does not aggregate requests from other Logical Processors/= Physical Cores/sockets (see Uncore counters for that). Related metrics: tma= _info_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentag= e of injected blend uops out of all uops issued", - "MetricExpr": "min(160 * ASSISTS.SSE_AVX_MIX / CPU_CLK_UNHALTED.TH= READ, 1)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group"= , - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percenta= ge of injected blend uops out of all uops issued. Usually a Mixing_Vectors = over 5% is worth investigating. Read more in Appendix B1 of the Optimizatio= ns Guide for this topic.", + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the Advanced Matrix Extensions (AMX) execution engine was busy with tile = (arithmetic) operations", - "MetricExpr": "EXE.AMX_BUSY / CPU_CLK_UNHALTED.DISTRIBUTED", - "MetricGroup": "Compute;HPC;Server;TopdownL5;tma_L5_group;tma_port= s_utilized_0_group", - "MetricName": "tma_amx_busy", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_g= roup", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 = & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / CPU_CLK_UNHALTED.THREAD= ", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / CPU_CLK_UNHALTED.THREAD= ", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_reste= ers, tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / CPU_CLK_UNHALTED.THREAD= ", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_inf= o_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * CPU_CLK_UNHALTED= .DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & = tma_frontend_bound > 0.15 & tma_info_ipc / 6 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", - "MetricExpr": "UOPS_DISPATCHED.PORT_0 / CPU_CLK_UNHALTED.DISTRIBUT= ED", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group", - "MetricName": "tma_port_0", + "BriefDescription": "The Mixing_Vectors metric gives the percentag= e of injected blend uops out of all uops issued", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percenta= ge of injected blend uops out of all uops issued. Usually a Mixing_Vectors = over 5% is worth investigating. Read more in Appendix B1 of the Optimizatio= ns Guide for this topic. Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_1 / CPU_CLK_UNHALTED.DISTRIBUT= ED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p", - "MetricName": "tma_port_1", + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ / (tm= a_retiring * tma_info_slots / UOPS_ISSUED.ANY) / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequence= r, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple = ALU)", - "MetricExpr": "UOPS_DISPATCHED.PORT_6 / CPU_CLK_UNHALTED.DISTRIBUT= ED", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p", - "MetricName": "tma_port_6", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", - "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * CPU_CLK_UNHALTED= .DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", - "MetricName": "tma_load_op_utilization", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operati= ons > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", - "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_= 8) / (4 * CPU_CLK_UNHALTED.DISTRIBUTED)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", - "MetricName": "tma_store_op_utilization", + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_i= nt_operations + tma_memory_operations + tma_fused_instructions + tma_non_fu= sed_branches + tma_nop_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. ", + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_slots", + "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_page_faults", + "MetricThreshold": "tma_page_faults > 0.05", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdow= n\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-= retiring + topdown\\-be\\-bound)) + 0 * slots", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_ret= iring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UPI metri= c) ratio of 1 or less should be expected for decently optimized software ru= nning on Intel Core/Xeon products. While this often indicates efficient X86= instructions were executed; high value does not necessarily mean better pe= rformance cannot be achieved.", + "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", + "MetricExpr": "((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD= _L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIR= ED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.F= B_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * = (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOA= D_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOT= E_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD= _L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.F= B_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_= PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (MEMOR= Y_ACTIVITY.STALLS_L3_MISS / tma_info_clks) if 1e6 * (MEM_LOAD_L3_MISS_RETIR= ED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else= 0)", + "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_pmm_bound", + "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * UOPS_EXECUTED= .X87 / UOPS_EXECUTED.THREAD + tma_fp_scalar + tma_fp_vector + tma_fp_amx", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * UOPS_EXECUTED= .X87 / UOPS_EXECUTED.THREAD + 0 * slots", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group"= , - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE + FP_ARITH_INST_RETIRED2.SCALAR) / (topdown\\-retir= ing / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + t= opdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple = ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple= ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port= _5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_D= OUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.51= 2B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST= _RETIRED2.VECTOR) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\= -bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ += tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BO= UND_ON_LOADS) + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVIT= Y.2_PORTS_UTIL\\,umask\\=3D0xc@)) / tma_info_clks if ARITH.DIV_ACTIVE < CYC= LE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1= _PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@= ) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_= HALF) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec = + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_clks + tma_serializing_operation * (CYCLE_ACTIVITY.STALLS_TOTAL - E= XE_ACTIVITY.BOUND_ON_LOADS) / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_= HALF) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec = + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_= HALF) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec = + topdown\\-retiring + topdown\\-be\\-bound) * slots), 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-= point (FP) matrix uops fraction the CPU has retired (aggregated across all = supported FP datatypes in AMX engine)", - "MetricExpr": "cpu@AMX_OPS_RETIRED.BF16\\,cmask\\=3D0x1@ / (topdow= n\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-ret= iring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;Flops;HPC;Pipeline;Server;TopdownL4;tma_L4= _group;tma_fp_arith_group", - "MetricName": "tma_fp_amx", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) matrix uops fraction the CPU has retired (aggregated across all= supported FP datatypes in AMX engine). Refer to AMX_Busy and GFLOPs metric= s for actual AMX utilization and FP performance, resp.", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b + tma_shu= ffles", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_int_operations", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", + "MetricExpr": "(135.5 * tma_info_average_frequency * MEM_LOAD_L3_M= ISS_RETIRED.REMOTE_HITM + 135.5 * tma_info_average_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired.", - "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + to= pdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group", - "MetricName": "tma_int_vector_128b", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", + "MetricExpr": "149 * tma_info_average_frequency * MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_M= ISS / 2) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1= & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired.", - "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (topdown\\-retiring / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)= ", - "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group", - "MetricName": "tma_int_vector_256b", + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic Integer (= Int) matrix uops fraction the CPU has retired (aggregated across all suppor= ted Int datatypes in AMX engine)", - "MetricExpr": "cpu@AMX_OPS_RETIRED.INT8\\,cmask\\=3D0x1@ / (topdow= n\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-ret= iring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Compute;HPC;IntVector;Pipeline;Server;TopdownL4;tm= a_L4_group;tma_int_operations_group", - "MetricName": "tma_int_amx", - "PublicDescription": "This metric approximates arithmetic Integer = (Int) matrix uops fraction the CPU has retired (aggregated across all suppo= rted Int datatypes in AMX engine). Refer to AMX_Busy and TIOPs metrics for = actual AMX utilization and Int performance, resp.", + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_p= orts_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_u= tilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Shuffle (cross \"vecto= r lane\" data transfers) uops fraction the CPU has retired.", - "MetricExpr": "INT_VEC_RETIRED.SHUFFLES / (topdown\\-retiring / (t= opdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\= -be\\-bound) * slots)", + "MetricExpr": "INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info= _slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_int_operat= ions_group", "MetricName": "tma_shuffles", + "MetricThreshold": "tma_shuffles > 0.1 & (tma_int_operations > 0.1= & tma_light_operations > 0.6)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdow= n\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-= retiring + topdown\\-be\\-bound)) * MEM_UOP_RETIRED.ANY / (topdown\\-retiri= ng / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + to= pdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_g= roup", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & = (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdow= n\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-= retiring + topdown\\-be\\-bound)) * INST_RETIRED.MACRO_FUSED / (topdown\\-r= etiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring= + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JC= C are commonly used examples.", + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR /= tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdow= n\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-= retiring + topdown\\-be\\-bound)) * (BR_INST_RETIRED.ALL_BRANCHES - INST_RE= TIRED.MACRO_FUSED) / (topdown\\-retiring / (topdown\\-fe\\-bound + topdown\= \-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks"= , + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", - "MetricExpr": "max(0, topdown\\-retiring / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - topdow= n\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-= retiring + topdown\\-be\\-bound)) * INST_RETIRED.NOP / (topdown\\-retiring = / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdo= wn\\-be\\-bound) * slots)", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody.", + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueB= W;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, max(0, topdown\\-retiring / (topdown\\-fe\\-= bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) -= topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + top= down\\-retiring + topdown\\-be\\-bound)) - (tma_fp_arith + tma_int_operatio= ns + tma_memory_operations + tma_fused_instructions + tma_non_fused_branche= s + tma_nop_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or microcoded sequences", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots"= , - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_ret= iring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or microcoded sequences. This highly-correlates with the= uop length of these instructions/sequences.", + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - tma_microc= ode_sequencer", - "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group"= , - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions.", + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issue= RFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.MS / slots", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_= 8) / (4 * tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", - "MetricExpr": "min(100 * cpu@ASSISTS.ANY\\,umask\\=3D0x1B@ / slots= , 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases.", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * ASSISTS.PAGE_FAULT / slots", - "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", - "MetricName": "tma_page_faults", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks"= , + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists"= , - "MetricExpr": "30 * ASSISTS.FP / slots", - "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", - "MetricName": "tma_fp_assists", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called denormals).", + "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists. ", - "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / slots", - "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", - "MetricName": "tma_avx_assists", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch= _resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit). Sample with: FRONTEND_RE= TIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group"= , + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { - "BriefDescription": "C1 residency percent per core", - "MetricExpr": "cstate_core@c1\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C1_Core_Residency", + "BriefDescription": "Percentage of cycles in aborted transactions.= ", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles= ", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided= by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided= by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", + "BriefDescription": "Percentage of cycles within a transaction reg= ion.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.jso= n b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json index fd253e3276df..11c037e8291d 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-other.json @@ -2832,6 +2832,16 @@ "UMask": "0x80", "Unit": "IIO" }, + { + "BriefDescription": "Read request for 4 bytes made by IIO Part0-7 = to Memory", + "EventCode": "0x83", + "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS", + "FCMask": "0x07", + "PerPkg": "1", + "PortMask": "0x00ff", + "UMask": "0x4", + "Unit": "IIO" + }, { "BriefDescription": "Read request for 4 bytes made by IIO Part0 to= Memory", "EventCode": "0x83", @@ -2920,6 +2930,16 @@ "UMask": "0x4", "Unit": "IIO" }, + { + "BriefDescription": "Write request of 4 bytes made by IIO Part0-7 = to Memory", + "EventCode": "0x83", + "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS", + "FCMask": "0x07", + "PerPkg": "1", + "PortMask": "0x00ff", + "UMask": "0x1", + "Unit": "IIO" + }, { "BriefDescription": "Write request of 4 bytes made by IIO Part0 to= Memory", "EventCode": "0x83", @@ -4038,6 +4058,46 @@ "PublicDescription": "FlowQ Generated Prefetch : Count cases where= FlowQ causes spawn of Prefetch to iMC/SMI3 target", "Unit": "M3UPI" }, + { + "BriefDescription": "All CAS commands issued", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.ALL", + "PerPkg": "1", + "UMask": "0xff", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Read CAS commands issued (regular and underfi= ll)", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.RD", + "PerPkg": "1", + "UMask": "0xcf", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Regular read CAS commands issued (does not in= clude underfills)", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.RD_REG", + "PerPkg": "1", + "UMask": "0xc1", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Underfill read CAS commands issued", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.RD_UNDERFILL", + "PerPkg": "1", + "UMask": "0xc4", + "Unit": "MCHBM" + }, + { + "BriefDescription": "Write CAS commands issued", + "EventCode": "0x05", + "EventName": "UNC_MCHBM_CAS_COUNT.WR", + "PerPkg": "1", + "UMask": "0xf0", + "Unit": "MCHBM" + }, { "BriefDescription": "UPI Clockticks", "EventCode": "0x01", --=20 2.39.2.637.g21b0678d19-goog