Date: Sun, 19 Feb 2023 01:28:12 -0800
In-Reply-To: <20230219092848.639226-1-irogers@google.com>
Message-Id: <20230219092848.639226-16-irogers@google.com>
References: <20230219092848.639226-1-irogers@google.com>
Subject: [PATCH v1 15/51] perf vendor events intel: Refresh cascadelakex events
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
    Alexander Shishkin, Jiri Olsa, Namhyung Kim, Maxime Coquelin,
    Alexandre Torgue, Zhengjun Xing, Sandipan Das, James Clark, Kajol Jain,
    John Garry, Kan Liang, Adrian Hunter, Andrii Nakryiko, Eduard Zingerman,
    Suzuki Poulouse, Leo Yan, Florian Fischer, Ravi Bangoria, Jing Zhang,
    Sean Christopherson, Athira Rajeev, linux-kernel@vger.kernel.org,
    linux-perf-users@vger.kernel.org,
    linux-stm32@st-md-mailman.stormreply.com,
    linux-arm-kernel@lists.infradead.org, Perry Taylor, Caleb Biggers
Cc: Stephane Eranian, Ian Rogers
X-Mailing-List: linux-kernel@vger.kernel.org

Update the cascadelakex events from 1.16 to 1.17.
Generation was done using https://github.com/intel/perfmon.

Notable changes: new events and event descriptions; TMA metrics are
updated to version 4.5; TMA info metrics are renamed from their node
name to be lower case and prefixed by tma_info_; MetricThreshold
expressions are added; "Sample with" documentation is added to many
TMA metrics; smi_cost and transaction metric groups are added,
replicating existing hard-coded metrics in stat-shadow.

Signed-off-by: Ian Rogers
---
 .../arch/x86/cascadelakex/cache.json          |   24 +-
 .../arch/x86/cascadelakex/clx-metrics.json    | 2198 +++++++++--------
 .../arch/x86/cascadelakex/frontend.json       |    8 +-
 .../arch/x86/cascadelakex/pipeline.json       |   16 +
 .../arch/x86/cascadelakex/uncore-memory.json  |   18 +-
 .../arch/x86/cascadelakex/uncore-other.json   |  120 +-
 .../arch/x86/cascadelakex/uncore-power.json   |    8 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 8 files changed, 1236 insertions(+), 1158 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json b/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json
index 1070ad317ec9..a842f05cb60d 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/cache.json
@@ -234,20 +234,22 @@
         "UMask": "0x4f"
     },
     {
-        "BriefDescription": "All retired load instructions.",
+        "BriefDescription": "Retired load instructions.",
         "Data_LA": "1",
         "EventCode": "0xD0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
         "PEBS": "1",
+        "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.",
         "SampleAfterValue": "2000003",
         "UMask": "0x81"
     },
     {
-        "BriefDescription": "All retired store instructions.",
+        "BriefDescription": "Retired store instructions.",
         "Data_LA": "1",
         "EventCode": "0xD0",
         "EventName": "MEM_INST_RETIRED.ALL_STORES",
         "PEBS": "1",
+        "PublicDescription": "Counts all retired store instructions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x82"
     },
@@ -388,12 +390,12 @@
         "UMask": "0x4"
     },
     {
-        "BriefDescription": "Retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches. Precise event.",
+        "BriefDescription": "Retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches.",
         "Data_LA": "1",
         "EventCode": "0xD3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM",
         "PEBS": "1",
-        "PublicDescription": "Counts retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode). Precise event",
+        "PublicDescription": "Counts retired load instructions with remote Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).",
         "SampleAfterValue": "100007",
         "UMask": "0x10"
     },
@@ -477,12 +479,12 @@
         "UMask": "0x20"
     },
     {
-        "BriefDescription": "Retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches. Precise event.",
+        "BriefDescription": "Retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source where the data request missed all caches.",
         "Data_LA": "1",
         "EventCode": "0xD1",
         "EventName": "MEM_LOAD_RETIRED.LOCAL_PMM",
         "PEBS": "1",
-        "PublicDescription": "Counts retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode). Precise event",
+        "PublicDescription": "Counts retired load instructions with local Intel(R) Optane(TM) DC persistent memory as the data source and the data request missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
     },
@@ -5039,7 +5041,7 @@
         "UMask": "0x80"
     },
     {
-        "BriefDescription": "Cacheable and noncachaeble code read requests",
+        "BriefDescription": "Cacheable and non-cacheable code read requests",
         "EventCode": "0xB0",
         "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD",
         "PublicDescription": "Counts both cacheable and non-cacheable code read requests.",
@@ -5146,14 +5148,6 @@
         "SampleAfterValue": "2000003",
         "UMask": "0x4"
     },
-    {
-        "BriefDescription": "Offcore response can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefine mask bit value in a dedicated MSR to specify attributes of the offcore transaction",
-        "EventCode": "0xB7, 0xBB",
-        "EventName": "OFFCORE_RESPONSE",
-        "PublicDescription": "Offcore response can be programmed only with a specific pair of event select and counter MSR, and with specific event codes and predefine mask bit value in a dedicated MSR to specify attributes of the offcore transaction.",
-        "SampleAfterValue": "100003",
-        "UMask": "0x1"
-    },
     {
         "BriefDescription": "This event is deprecated. Refer to new event OCR.ALL_DATA_RD.ANY_RESPONSE",
         "Deprecated": "1",
diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
index 356cf6603b69..4e993a3220e3 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json
@@ -1,1548 +1,1608 @@
 [
     {
-        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
-        "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts",
-        "MetricName": "Mispredictions"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of (external) Memory Bandwidth related bottlenecks",
-        "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))",
-        "MetricGroup": "Mem;MemoryBW;Offcore",
-        "MetricName": "Memory_Bandwidth"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Latency related bottlenecks (external memory and off-core caches)",
-        "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound))",
-        "MetricGroup": "Mem;MemoryLat;Offcore",
-        "MetricName": "Memory_Latency"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))",
-        "MetricGroup": "Mem;MemoryTLB;Offcore",
-        "MetricName": "Memory_Data_TLBs"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of branch related instructions (used for program control-flow including function calls)",
-        "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_RETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / SLOTS)",
-        "MetricGroup": "Ret",
-        "MetricName": "Branching_Overhead"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
-        "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB",
-        "MetricName": "Big_Code"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks",
-        "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - Big_Code",
-        "MetricGroup": "Fed;FetchBW;Frontend",
-        "MetricName": "Instruction_Fetch_BW"
-    },
-    {
-        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
-        "MetricExpr": "INST_RETIRED.ANY / CLKS",
-        "MetricGroup": "Ret;Summary",
-        "MetricName": "IPC"
-    },
-    {
-        "BriefDescription": "Uops Per Instruction",
-        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY",
-        "MetricGroup": "Pipeline;Ret;Retire",
-        "MetricName": "UPI"
-    },
-    {
-        "BriefDescription": "Instruction per taken branch",
-        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
-        "MetricGroup": "Branches;Fed;FetchBW",
-        "MetricName": "UpTB"
-    },
-    {
-        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
-        "MetricExpr": "1 / IPC",
-        "MetricGroup": "Mem;Pipeline",
-        "MetricName": "CPI"
-    },
-    {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
-        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
-        "MetricGroup": "Pipeline",
-        "MetricName": "CLKS"
-    },
-    {
-        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
-        "MetricExpr": "4 * CORE_CLKS",
-        "MetricGroup": "tma_L1_group",
-        "MetricName": "SLOTS"
-    },
-    {
-        "BriefDescription": "The ratio of Executed- by Issued-Uops",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
-        "MetricGroup": "Cor;Pipeline",
-        "MetricName": "Execute_per_Issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
-    },
-    {
-        "BriefDescription": "Instructions Per Cycle across hyper-threads (per physical core)",
-        "MetricExpr": "INST_RETIRED.ANY / CORE_CLKS",
-        "MetricGroup": "Ret;SMT;tma_L1_group",
-        "MetricName": "CoreIPC"
-    },
-    {
-        "BriefDescription": "Floating Point Operations Per Cycle",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / CORE_CLKS",
-        "MetricGroup": "Flops;Ret",
-        "MetricName": "FLOPc"
-    },
-    {
-        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)) / (2 * CORE_CLKS)",
-        "MetricGroup": "Cor;Flops;HPC",
-        "MetricName": "FP_Arith_Utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
-    },
-    {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)",
-        "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
-        "MetricName": "ILP"
-    },
-    {
-        "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
-        "MetricExpr": "((1 - tma_core_bound / tma_ports_utilization if tma_core_bound < tma_ports_utilization else 1) if SMT_2T_Utilization > 0.5 else 0)",
-        "MetricGroup": "Cor;SMT",
-        "MetricName": "Core_Bound_Likely"
-    },
-    {
-        "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
-        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else CLKS))",
-        "MetricGroup": "SMT",
-        "MetricName": "CORE_CLKS"
-    },
-    {
-        "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS",
-        "MetricGroup": "InsType",
-        "MetricName": "IpLoad"
-    },
-    {
-        "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES",
-        "MetricGroup": "InsType",
-        "MetricName": "IpStore"
-    },
-    {
-        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Branches;Fed;InsType",
-        "MetricName": "IpBranch"
+        "BriefDescription": "C2 residency percent per package",
+        "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C2_Pkg_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
-        "MetricGroup": "Branches;Fed;PGO",
-        "MetricName": "IpCall"
+        "BriefDescription": "C3 residency percent per core",
+        "MetricExpr": "cstate_core@c3\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C3_Core_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instruction per taken branch",
-        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
-        "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO",
-        "MetricName": "IpTB"
+        "BriefDescription": "C3 residency percent per package",
+        "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C3_Pkg_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch. ",
-        "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
-        "MetricGroup": "Branches;Fed;PGO",
-        "MetricName": "BpTkBranch"
+        "BriefDescription": "C6 residency percent per core",
+        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Core_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)",
-        "MetricGroup": "Flops;InsType",
-        "MetricName": "IpFLOP"
+        "BriefDescription": "C6 residency percent per package",
+        "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Pkg_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE))",
-        "MetricGroup": "Flops;InsType",
-        "MetricName": "IpArith",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). May undercount due to FMA double counting. Approximated prior to BDW."
+        "BriefDescription": "C7 residency percent per core",
+        "MetricExpr": "cstate_core@c7\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C7_Core_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
-        "MetricGroup": "Flops;FpScalar;InsType",
-        "MetricName": "IpArith_Scalar_SP",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting."
+        "BriefDescription": "C7 residency percent per package",
+        "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C7_Pkg_Residency",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
-        "MetricGroup": "Flops;FpScalar;InsType",
-        "MetricName": "IpArith_Scalar_DP",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). May undercount due to FMA double counting."
+        "BriefDescription": "Uncore frequency per die [GHZ]",
+        "MetricExpr": "tma_info_socket_clks / #num_dies / duration_time / 1e9",
+        "MetricGroup": "SoC",
+        "MetricName": "UNCORE_FREQ"
     },
     {
-        "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)",
-        "MetricGroup": "Flops;FpVector;InsType",
-        "MetricName": "IpArith_AVX128",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting."
+        "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
+        "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
+        "MetricGroup": "smi",
+        "MetricName": "smi_cycles",
+        "MetricThreshold": "smi_cycles > 0.1",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
-        "MetricGroup": "Flops;FpVector;InsType",
-        "MetricName": "IpArith_AVX256",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting."
+        "BriefDescription": "Number of SMI interrupts.",
+        "MetricExpr": "msr@smi@",
+        "MetricGroup": "smi",
+        "MetricName": "smi_num",
+        "ScaleUnit": "1SMI#"
     },
     {
-        "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)",
-        "MetricGroup": "Flops;FpVector;InsType",
-        "MetricName": "IpArith_AVX512",
-        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). May undercount due to FMA double counting."
+        "BriefDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset",
+        "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_4k_aliasing",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
-        "MetricGroup": "Prefetches",
-        "MetricName": "IpSWPF"
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_slots",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_alu_op_utilization",
+        "MetricThreshold": "tma_alu_op_utilization > 0.6",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Total number of retired Instructions Sample with: INST_RETIRED.PREC_DIST",
-        "MetricExpr": "INST_RETIRED.ANY",
-        "MetricGroup": "Summary;tma_L1_group",
-        "MetricName": "Instructions"
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
+        "MetricExpr": "100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_slots",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_assists",
+        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
-        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
-        "MetricGroup": "Pipeline;Ret",
-        "MetricName": "Retire"
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_backend_bound",
+        "MetricThreshold": "tma_backend_bound > 0.2",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
-        "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
-        "MetricName": "Execute"
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_slots",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@",
-        "MetricGroup": "Fed;FetchBW",
-        "MetricName": "Fetch_UpC"
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricConstraint": "NO_GROUP_EVENTS",
+        "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation",
+        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_branch_misprediction_cost, tma_info_mispredictions, tma_mispredicts_resteers",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)",
-        "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)",
-        "MetricGroup": "DSB;Fed;FetchBW",
-        "MetricName": "DSB_Coverage"
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
-        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT",
-        "MetricGroup": "DSBmiss",
-        "MetricName": "DSB_Switch_Cost"
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Total penalty related to DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))",
-        "MetricGroup": "DSBmiss;Fed",
-        "MetricName": "DSB_Misses"
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES.
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative DS= B miss (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", - "MetricGroup": "DSBmiss;Fed", - "MetricName": "IpDSB_Miss_Ret" + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT= _RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMA= ND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_COR= E_FWD))) + 44 * tma_info_average_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. 
Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BadSpec;BrMispredicts", - "MetricName": "IpMispredict" + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_backend_bound - tma_memory_bound", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations).", + "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_m= ispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_= misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * SLOTS / BR_MISP_RE= TIRED.ALL_BRANCHES", - "MetricGroup": "Bad;BrMispredicts", - "MetricName": "Branch_Misprediction_Cost" + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "44 * tma_info_average_frequency * (MEM_LOAD_L3_HIT_= RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA= _RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + O= CR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_H= IT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSync= xn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. 
Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are non-taken condi= tionals", - "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRA= NCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_NT" + "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & (= tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_ipc / 4 > = 0.35))", + "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are taken condition= als", - "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_= TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches;CodeGen;PGO", - "MetricName": "Cond_TK" + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "CallRet" + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CY= CLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_cl= ks - tma_l2_bound - tma_pmm_bound", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", - "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.COND= ITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_= INST_RETIRED.ALL_BRANCHES", - "MetricGroup": "Bad;Branches", - "MetricName": "Jump" + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4= _UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & (tma_fetch_bandwidth > 0.1 & = tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", - "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", - "MetricGroup": "Mem;MemoryBound;MemoryLat", - "MetricName": "Load_Miss_Real_Latency" + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_clks", + "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_dsb_coverage, tma= _info_dsb_misses, tma_info_iptb, tma_lcp", + "ScaleUnit": "100%" }, { - "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss. 
Per-Logical Proces= sor)", - "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", - "MetricGroup": "Mem;MemoryBW;MemoryBound", - "MetricName": "MLP" + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_= l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. 
Related metrics: t= ma_dtlb_store, tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI" + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_clks", + "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_= store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. 
Related metrics: tma_dtlb_load, = tma_info_memory_data_tlbs", + "ScaleUnit": "100%" }, { - "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L1MPKI_Load" + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(110 * tma_info_average_frequency * (OCR.DEMAND_RFO= .L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info= _average_frequency * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OCR.PF_L2_RFO= .L3_HIT.HITM_OTHER_CORE)) / tma_info_clks", + "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. 
Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricGroup": "Backend;CacheMisses;Mem", - "MetricName": "L2MPKI" + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * cpu@L1D_PEND_MISS= .FB_FULL\\,cmask\\=3D1@ / tma_info_clks", + "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_is= sueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_dram_bw_use, tma_info_memory_b= andwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_= stores", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem;Offcore", - "MetricName": "L2MPKI_All" + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "tma_frontend_bound - tma_fetch_latency", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound= > 0.15 & tma_info_ipc / 4 > 0.35", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. 
Related metrics: tma_dsb_sw= itches, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb, tma_lcp"= , + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2MPKI_Load" + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE= / tma_info_slots", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", - "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_All" + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "ScaleUnit": "100%" }, { - "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", - "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L2HPKI_Load" + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "ScaleUnit": "100%" }, { - "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "L3MPKI" + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", - "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", - "MetricGroup": "CacheMisses;Mem", - "MetricName": "FB_HPKI" + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umas= k\\=3D0xfc@ / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", 
+ "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricConstraint": "NO_NMI_WATCHDOG", - "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * CORE_CLK= S)", - "MetricGroup": "Mem;MemoryTLB", - "MetricName": "Page_Walks_Utilization" + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW" + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW" + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_slots", + "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW" + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. 
The instruction pairs of CMP+JCC or DEC+JC= C are commonly used examples.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Rate of silent evictions from the L2 cache pe= r Kilo instruction where the evicted lines are dropped (no writeback to L3 = or memory)", - "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_Silent_PKI" + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / tma_info_slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. 
This highly-correlates with th= e uop length of these instructions/sequences.", + "ScaleUnit": "100%" }, { - "BriefDescription": "Rate of non silent evictions from the L2 cach= e per Kilo instruction", - "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / Instructions", - "MetricGroup": "L2Evicts;Mem;Server", - "MetricName": "L2_Evictions_NonSilent_PKI" + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma= _fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "ScaleUnit": "100%" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "L1D_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L1D_Cache_Fill_BW_1T" + "BriefDescription": "Measured Average Frequency for unhalted proce= ssors [GHz]", + "MetricExpr": "tma_info_turbo_utilization * TSC / 1e9 / duration_t= ime", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_average_frequency" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "L2_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L2_Cache_Fill_BW_1T" + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + 
tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", + "MetricGroup": "BigFoot;Fed;Frontend;IcMiss;MemoryTLB;tma_issueBC"= , + "MetricName": "tma_info_big_code", + "MetricThreshold": "tma_info_big_code > 20", + "PublicDescription": "Total pipeline cost of instruction fetch rel= ated bottlenecks by large code footprint programs (i-side cache; TLB and BT= B misses). Related metrics: tma_info_branching_overhead" }, { - "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Fill_BW", - "MetricGroup": "Mem;MemoryBW", - "MetricName": "L3_Cache_Fill_BW_1T" + "BriefDescription": "Branch instructions per taken branch.", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_bptkbranch" }, { - "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "L3_Cache_Access_BW", - "MetricGroup": "Mem;MemoryBW;Offcore", - "MetricName": "L3_Cache_Access_BW_1T" + "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "MetricExpr": "(tma_branch_mispredicts + tma_fetch_latency * tma_m= ispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_= misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) * tma_info_slots / B= R_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). 
Rel= ated metrics: tma_branch_mispredicts, tma_info_mispredictions, tma_mispredi= cts_resteers" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricGroup": "HPC;Summary", - "MetricName": "CPU_Utilization" + "BriefDescription": "Total pipeline cost of branch related instruc= tions (used for program control-flow including function calls)", + "MetricExpr": "100 * ((BR_INST_RETIRED.CONDITIONAL + 3 * BR_INST_R= ETIRED.NEAR_CALL + (BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.CONDITION= AL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL)) / tma_inf= o_slots)", + "MetricGroup": "Ret;tma_issueBC", + "MetricName": "tma_info_branching_overhead", + "MetricThreshold": "tma_info_branching_overhead > 10", + "PublicDescription": "Total pipeline cost of branch related instru= ctions (used for program control-flow including function calls). Related me= trics: tma_info_big_code" }, { - "BriefDescription": "Measured Average Frequency for unhalted proce= ssors [GHz]", - "MetricExpr": "Turbo_Utilization * TSC / 1e9 / duration_time", - "MetricGroup": "Power;Summary", - "MetricName": "Average_Frequency" + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_callret" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 = * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PA= CKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INS= T_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SING= LE) / 1e9 / duration_time", - "MetricGroup": "Cor;Flops;HPC", - "MetricName": "GFLOPs", - "PublicDescription": "Giga Floating Point 
Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width and AMX engine." + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_clks" }, { - "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "CLKS / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", - "MetricName": "Turbo_Utilization" + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)"= , + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_code_stlb_mpki" }, { - "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", - "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / CORE_CLKS if #= SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License0_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." 
+ "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.NOT_TAKEN / BR_INST_RETIRED.ALL_BRA= NCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_nt" }, { - "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", - "MetricExpr": "(CORE_POWER.LVL1_TURBO_LICENSE / 2 / CORE_CLKS if #= SMT_on else CORE_POWER.LVL1_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License1_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "BriefDescription": "Fraction of branches that are taken condition= als", + "MetricExpr": "(BR_INST_RETIRED.CONDITIONAL - BR_INST_RETIRED.NOT_= TAKEN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_cond_tk" }, { - "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", - "MetricExpr": "(CORE_POWER.LVL2_TURBO_LICENSE / 2 / CORE_CLKS if #= SMT_on else CORE_POWER.LVL2_TURBO_LICENSE / CORE_CLKS)", - "MetricGroup": "Power", - "MetricName": "Power_License2_Utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." 
+ "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization = if tma_core_bound < tma_ports_utilization else 1) if tma_info_smt_2t_utiliz= ation > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_core_bound_likely", + "MetricThreshold": "tma_info_core_bound_likely > 0.5" }, { - "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", - "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_clks))", "MetricGroup": "SMT", - "MetricName": "SMT_2T_Utilization" + "MetricName": "tma_info_core_clks" }, { - "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", - "MetricGroup": "OS", - "MetricName": "Kernel_Utilization" + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_coreipc" }, { - "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k"= , - "MetricGroup": "OS", - "MetricName": "Kernel_CPI" + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_cpi" }, { - "BriefDescription": "Average external Memory Bandwidth Use for 
rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", - "MetricGroup": "HPC;Mem;MemoryBW;SoC", - "MetricName": "DRAM_BW_Use" + "BriefDescription": "Average CPU Utilization", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_cpu_utilization" }, { - "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches= ", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (Socket_CLKS / duration_time)", - "MetricGroup": "Mem;MemoryLat;SoC", - "MetricName": "MEM_Read_Latency" + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_= REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_data_l2_mlp" }, { - "BriefDescription": "Average number of parallel data read requests= to external memory. Accounts for demand loads and L1/L2 prefetches", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", - "MetricGroup": "Mem;MemoryBW;SoC", - "MetricName": "MEM_Parallel_Reads" + "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricGroup": "HPC;Mem;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_memory_ba= ndwidth, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]. 
Accounts for demand loads and L1/L2= data-read prefetches", - "MetricExpr": "1e9 * (UNC_M_PMM_RPQ_OCCUPANCY.ALL / UNC_M_PMM_RPQ_= INSERTS) / imc_0@event\\=3D0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_PMM_Read_Latency" + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.= MS_UOPS)", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_dsb_coverage", + "MetricThreshold": "tma_info_dsb_coverage < 0.7 & tma_info_ipc / 4= > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_dsb_misses, tma_info_iptb, tma_lcp" }, { - "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-= read prefetches", - "MetricExpr": "1e9 * (UNC_M_RPQ_OCCUPANCY / UNC_M_RPQ_INSERTS) / i= mc_0@event\\=3D0x0@", - "MetricGroup": "Mem;MemoryLat;Server;SoC", - "MetricName": "MEM_DRAM_Read_Latency" + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_dsb_misses", + "MetricThreshold": "tma_info_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_iptb, tma_l= cp" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Read_BW" + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_dsb_switch_cost" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", - "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", - "MetricGroup": "Mem;MemoryBW;Server;SoC", - "MetricName": "PMM_Write_BW" + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per-thread", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_execute" }, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Write_BW" + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." 
}, { - "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", - "MetricGroup": "IoBW;Mem;Server;SoC", - "MetricName": "IO_Read_BW" + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_fb_hpki" }, { - "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "cha_0@event\\=3D0x0@", - "MetricGroup": "SoC", - "MetricName": "Socket_CLKS" + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_fetch_upc" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", - "MetricGroup": "Branches;OS", - "MetricName": "IpFarBranch" + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 = * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PA= CKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INS= T_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SING= LE) / tma_info_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": 
"tma_info_flopc" }, { - "BriefDescription": "Uncore frequency per die [GHZ]", - "MetricExpr": "Socket_CLKS / #num_dies / duration_time / 1e9", - "MetricGroup": "SoC", - "MetricName": "UNCORE_FREQ" + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@) = / (2 * tma_info_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." }, { - "BriefDescription": "Percentage of time spent in the active CPU po= wer state C0", - "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", - "MetricName": "cpu_utilization", - "ScaleUnit": "100%" + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 = * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PA= CKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INS= T_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SING= LE) / 1e9 / duration_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width and AMX engine." 
}, { - "BriefDescription": "CPU operating frequency (in GHz)", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC = * #SYSTEM_TSC_FREQ / 1e9 / duration_time", - "MetricName": "cpu_operating_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_ic_misses", + "MetricThreshold": "tma_info_ic_misses > 5", + "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", - "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", - "MetricName": "cpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_icache_miss_latency" }, { - "BriefDescription": "The ratio of number of completed memory load = instructions to the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", - "MetricName": "loads_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per-core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else UOPS_EXECUTED.CORE_CYCLES_GE_1)", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_ilp" }, { - "BriefDescription": "The ratio of number of completed memory store= instructions to 
the total number completed instructions", - "MetricExpr": "MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", - "MetricName": "stores_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_frontend_bound - tma_fetch_latency * tma= _mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icach= e_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_big_cod= e", + "MetricGroup": "Fed;FetchBW;Frontend", + "MetricName": "tma_info_instruction_fetch_bw", + "MetricThreshold": "tma_info_instruction_fetch_bw > 20" }, { - "BriefDescription": "Ratio of number of requests missing L1 data c= ache (includes data+rfo w/ prefetches) to the total number of completed ins= tructions", - "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", - "MetricName": "l1d_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_instructions", + "PublicDescription": "Total number of retired Instructions. 
Sample= with: INST_RETIRED.PREC_DIST" }, { - "BriefDescription": "Ratio of number of demand load requests hitti= ng in L1 data cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", - "MetricName": "l1d_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_read_bw" }, { - "BriefDescription": "Ratio of number of code read requests missing= in L1 instruction cache (includes prefetches) to the total number of compl= eted instructions", - "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", - "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", + "MetricGroup": "IoBW;Mem;Server;SoC", + "MetricName": "tma_info_io_write_bw" }, { - "BriefDescription": "Ratio of number of completed demand load requ= ests hitting in L2 cache to the total number of completed instructions ", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_hits_per_instr", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALA= R_SINGLE\\,umask\\=3D0x03@ + 
cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\= ,umask\\=3D0xfc@)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_iparith", + "MetricThreshold": "tma_info_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). May undercount due to FMA doubl= e counting. Approximated prior to BDW." }, { - "BriefDescription": "Ratio of number of requests missing L2 cache = (includes code+data+rfo w/ prefetches) to the total number of completed ins= tructions", - "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", - "MetricName": "l2_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx128", + "MetricThreshold": "tma_info_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). May undercount = due to FMA double counting." 
}, { - "BriefDescription": "Ratio of number of completed data read reques= t missing L2 cache to the total number of completed instructions", - "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_data_read_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx256", + "MetricThreshold": "tma_info_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). May undercount due= to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read request missing = L2 cache to the total number of completed instructions", - "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", - "MetricName": "l2_demand_code_mpi", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_iparith_avx512", + "MetricThreshold": "tma_info_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). May undercount due = to FMA double counting." 
}, { - "BriefDescription": "Ratio of number of data read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x12D= 4043300000000@ / INST_RETIRED.ANY", - "MetricName": "llc_data_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_dp", + "MetricThreshold": "tma_info_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). May und= ercount due to FMA double counting." }, { - "BriefDescription": "Ratio of number of code read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x12C= C023300000000@ / INST_RETIRED.ANY", - "MetricName": "llc_code_read_mpi_demand_plus_prefetch", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_iparith_scalar_sp", + "MetricThreshold": "tma_info_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). May und= ercount due to FMA double counting." 
}, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand and prefetch data read miss (read memory access) in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\= \=3D0x4043300000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x4043= 300000000@) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #nu= m_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_ipbranch", + "MetricThreshold": "tma_info_ipbranch < 8" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand and prefetch data read miss (read memory access) addressed to local m= emory in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\= \=3D0x4043200000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x4043= 200000000@) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #nu= m_packages)) * duration_time", - "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for= _local_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_ipc" }, { - "BriefDescription": "Average latency of a last level cache (LLC) d= emand and prefetch data read miss (read memory access) addressed to remote = memory in nano seconds", - "MetricExpr": "1e9 * (cha@unc_cha_tor_occupancy.ia_miss\\,config1\= \=3D0x4043100000000@ / cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x4043= 100000000@) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTICKS) * #nu= m_packages)) * duration_time", - "MetricName": 
"llc_data_read_demand_plus_prefetch_miss_latency_for= _remote_requests", - "ScaleUnit": "1ns" + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_ipcall", + "MetricThreshold": "tma_info_ipcall < 200" }, { - "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", - "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_ipdsb_miss_ret", + "MetricThreshold": "tma_info_ipdsb_miss_ret < 50" }, { - "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", - "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", - "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. 
This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_ipfarbranch", + "MetricThreshold": "tma_info_ipfarbranch < 1e6" }, { - "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY"= , - "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR_SI= NGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE + 2 * FP_ARITH_INST_RETIRED.128B= _PACKED_DOUBLE + 4 * (FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_I= NST_RETIRED.256B_PACKED_DOUBLE) + 8 * (FP_ARITH_INST_RETIRED.256B_PACKED_SI= NGLE + FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE) + 16 * FP_ARITH_INST_RETIR= ED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_ipflop", + "MetricThreshold": "tma_info_ipflop < 10" }, { - "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", - 
"MetricName": "dtlb_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_ipload", + "MetricThreshold": "tma_info_ipload < 3" }, { - "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", - "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", - "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. 
This implies it missed in the DTLB and further levels of = TLB.", - "ScaleUnit": "1per_instr" + "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "MetricExpr": "tma_info_instructions / (UOPS_RETIRED.RETIRE_SLOTS = / UOPS_ISSUED.ANY * cpu@BR_MISP_EXEC.ALL_BRANCHES\\,umask\\=3D0xE4@)", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_ipmisp_indirect", + "MetricThreshold": "tma_info_ipmisp_indirect < 1e3" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x404= 3200000000@ / (cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x404320000000= 0@ + cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x4043100000000@)", - "MetricName": "numa_reads_addressed_to_local_dram", - "ScaleUnit": "100%" + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_ipmispredict", + "MetricThreshold": "tma_info_ipmispredict < 200" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", - "MetricExpr": "cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x404= 3100000000@ / (cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x404320000000= 0@ + cha@unc_cha_tor_inserts.ia_miss\\,config1\\=3D0x4043100000000@)", - "MetricName": "numa_reads_addressed_to_remote_dram", - "ScaleUnit": "100%" + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + 
"MetricGroup": "InsType", + "MetricName": "tma_info_ipstore", + "MetricThreshold": "tma_info_ipstore < 8" }, { - "BriefDescription": "Uncore operating frequency in GHz", - "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTIC= KS) * #num_packages) / 1e9 / duration_time", - "MetricName": "uncore_frequency", - "ScaleUnit": "1GHz" + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_ipswpf", + "MetricThreshold": "tma_info_ipswpf < 100" }, { - "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data t= ransmit bandwidth (MB/sec)", - "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e= 6 / duration_time", - "MetricName": "upi_data_transmit_bw", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_iptb", + "MetricThreshold": "tma_info_iptb < 9", + "PublicDescription": "Instruction per taken branch. 
Related metric= s: tma_dsb_switches, tma_fetch_bandwidth, tma_info_dsb_coverage, tma_info_d= sb_misses, tma_lcp" }, { - "BriefDescription": "DDR memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.RD * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_ipunknown_branch" }, { - "BriefDescription": "DDR memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_CAS_COUNT.WR * 64 / 1e6 / duration_time", - "MetricName": "memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - (BR_INST_RETIRED.COND= ITIONAL - BR_INST_RETIRED.NOT_TAKEN) - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_= INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_jump" }, { - "BriefDescription": "DDR memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) * 64 / 1e= 6 / duration_time", - "MetricName": "memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k"= , + "MetricGroup": "OS", + "MetricName": "tma_info_kernel_cpi" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + 
"MetricGroup": "OS", + "MetricName": "tma_info_kernel_utilization", + "MetricThreshold": "tma_info_kernel_utilization > 0.05" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw" }, { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) m= emory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 6= 4 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "tma_info_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l1d_cache_fill_bw_1t" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_writes", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", - "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO= 
_PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART= 2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time", - "MetricName": "io_bandwidth_disk_or_network_reads", - "ScaleUnit": "1MB/s" + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l1mpki_load" }, { - "BriefDescription": "Uops delivered from decoded instruction cache= (decoded stream buffer or DSB) as a percent of total uops delivered to Ins= truction Decode Queue", - "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.= MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_decoded_icache", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw" }, { - "BriefDescription": "Uops delivered from legacy decode pipeline (M= icro-instruction Translation Engine or MITE) as a percent of total uops del= ivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ= .MS_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline"= , - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "tma_info_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l2_cache_fill_bw_1t" }, { - "BriefDescription": "Uops delivered from microcode sequencer (MS) = as a percent of total uops delivered to Instruction Decode Queue", - "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.M= S_UOPS + LSD.UOPS)", - "MetricName": "percent_uops_delivered_from_microcode_sequencer", - "ScaleUnit": "100%" + 
"BriefDescription": "Rate of non silent evictions from the L2 cach= e per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_instructio= ns", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_nonsilent_pki" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "Rate of silent evictions from the L2 cache pe= r Kilo instruction where the evicted lines are dropped (no writeback to L3 = or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_instructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_l2_evictions_silent_pki" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_all" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2hpki_load" }, { - "BriefDescription": "This 
category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", - "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS", - "MetricGroup": "PGO;TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_frontend_bound", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Machine_Width uops every c= ycle to the Backend. Frontend Bound denotes unutilized issue-slots when the= re is no Backend stall; i.e. bubbles where Frontend delivered no uops while= Backend could have accepted them. For example; stalls due to instruction-c= ache misses would be categorized under Frontend Bound.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheMisses;Mem", + "MetricName": "tma_info_l2mpki" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE= / SLOTS", - "MetricGroup": "Frontend;TopdownL2;tma_L2_group;tma_L2_group;tma_f= rontend_bound_group", - "MetricName": "tma_fetch_latency", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. 
In such cases; the Frontend eventually = delivers no uops for some period.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem;Offcore", + "MetricName": "tma_info_l2mpki_all" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma= _fetch_latency_group", - "MetricName": "tma_icache_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses.", - "MetricExpr": "ICACHE_64B.IFTAG_STALL / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;= tma_fetch_latency_group", - "MetricName": "tma_itlb_misses", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_l2mpki_code_all" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", - "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK_UNHALTED.TH= READ + tma_unknown_branches", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", - "MetricName": "tma_branch_resteers", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled 
due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l2mpki_load" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage. ", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / CPU_CLK= _UNHALTED.THREAD", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group", - "MetricName": "tma_mispredicts_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears. 
", - "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / C= PU_CLK_UNHALTED.THREAD", - "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group", - "MetricName": "tma_clears_resteers", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_l3_cache_access_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", - "MetricExpr": "9 * BACLEARS.ANY / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch= _resteers_group", - "MetricName": "tma_unknown_branches", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (First fetch or hitt= ing BPU capacity limit).", - "ScaleUnit": "100%" + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED= .THREAD", - "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group", - "MetricName": "tma_dsb_switches", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty.", - "ScaleUnit": "100%" + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_l3_cache_fill_bw_1t" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "ILD_STALL.LCP / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", - "MetricName": "tma_lcp", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs.", - "ScaleUnit": "100%" + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Mem", + "MetricName": "tma_info_l3mpki" }, { - "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "2 * IDQ.MS_SWITCHES / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group", - "MetricName": "tma_ms_switches", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals.", - "ScaleUnit": "100%" + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_load_l2_miss_latency" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", - "MetricExpr": "tma_frontend_bound - tma_fetch_latency", - "MetricGroup": "FetchBW;Frontend;TopdownL2;tma_L2_group;tma_L2_gro= up;tma_frontend_bound_group", - "MetricName": "tma_fetch_bandwidth", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. 
In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend.", - "ScaleUnit": "100%" + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_load_l2_mlp" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES= _4_UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", - "MetricName": "tma_mite", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", - "ScaleUnit": "100%" + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_load_miss_real_latency" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / CORE_CLKS", - "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", - "MetricName": "tma_decoder0_alone", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + 
"MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_load_stlb_mpki" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", - "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4= _UOPS) / CORE_CLKS / 2", - "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", - "MetricName": "tma_dsb", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_M_RPQ_OCCUPANCY / UNC_M_RPQ_INSERTS) / i= mc_0@event\\=3D0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" }, { - "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", - "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * = (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)= ) / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_bad_speculation", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. 
For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of parallel data read requests= to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation", - "MetricGroup": "BadSpec;BrMispredicts;TopdownL2;tma_L2_group;tma_L= 2_group;tma_bad_speculation_group", - "MetricName": "tma_branch_mispredicts", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_M_PMM_RPQ_OCCUPANCY.ALL / UNC_M_PMM_RPQ_= INSERTS) / imc_0@event\\=3D0x0@", + "MetricGroup": "Mem;MemoryLat;Server;SoC", + "MetricName": "tma_info_mem_pmm_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. 
Accounts for demand loads and L1/L= 2 data-read prefetches" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", - "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", - "MetricGroup": "BadSpec;MachineClears;TopdownL2;tma_L2_group;tma_L= 2_group;tma_bad_speculation_group", - "MetricName": "tma_machine_clears", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes.", - "ScaleUnit": "100%" + "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_socket_clks / duration_time)", + "MetricGroup": "Mem;MemoryLat;SoC", + "MetricName": "tma_info_mem_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, { - "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", - "MetricExpr": "1 - tma_frontend_bound - (UOPS_ISSUED.ANY + 4 * (IN= T_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) /= SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_backend_bound", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of (external) Memory Band= width related bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) = + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_pmm_bound + tma_store_bound) * (tma_sq_full / (tma_contested_acces= ses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full))) + tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_b= ound + tma_store_bound) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load += tma_fb_full + tma_lock_latency + tma_split_loads + tma_store_fwd_blk))", + "MetricGroup": "Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_info_memory_bandwidth", + "MetricThreshold": "tma_info_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of (external) Memory Ban= dwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_dram_bw_= use, tma_mem_bandwidth, tma_sq_full" }, { - "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND= _ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + U= OPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVIT= Y.BOUND_ON_STORES) * tma_backend_bound", - "MetricGroup": "Backend;TopdownL2;tma_L2_group;tma_L2_group;tma_ba= ckend_bound_group", - "MetricName": "tma_memory_bound", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_l1_bound / max(tma_me= mory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + t= ma_pmm_bound + tma_store_bound) * (tma_dtlb_load / max(tma_l1_bound, tma_4k= _aliasing + tma_dtlb_load + tma_fb_full + tma_lock_latency + tma_split_load= s + tma_store_fwd_blk)) + tma_store_bound / (tma_dram_bound + tma_l1_bound = + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_dtl= b_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_stor= e_latency)))", + "MetricGroup": "Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_info_memory_data_tlbs", + "MetricThreshold": 
"tma_info_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", - "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / CPU_CLK_UNHALTED.THREAD, 0)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", - "MetricName": "tma_l1_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Memory Latency related= bottlenecks (external memory and off-core caches)", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * tma_memory_bound * (tma_dram_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + = tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound= + tma_pmm_bound + tma_store_bound) * (tma_l3_hit_latency / (tma_contested_= accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_l2_b= ound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_p= mm_bound + tma_store_bound))", + "MetricGroup": "Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_info_memory_latency", + "MetricThreshold": 
"tma_info_memory_latency > 20", + "PublicDescription": "Total pipeline cost of Memory Latency relate= d bottlenecks (external memory and off-core caches). Related metrics: tma_l= 3_hit_latency, tma_mem_latency" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_l1_bound_grou= p", - "MetricName": "tma_dtlb_load", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss.", - "ScaleUnit": "100%" + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "100 * (tma_branch_mispredicts + tma_fetch_latency *= tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_i= cache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_mispredictions", + "MetricThreshold": "tma_info_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bran= ch_misprediction_cost, tma_mispredicts_resteers" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", - "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", - "MetricName": "tma_load_stlb_hit", - "ScaleUnit": "100%" + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, { - "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", - "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / CPU_CLK_UNHALTED.THR= EAD", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", - "MetricName": "tma_load_stlb_miss", - "ScaleUnit": "100%" + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING + EPT.WALK_PENDING) / (2 * tma_info= _core_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_page_walks_utilization", + "MetricThreshold": "tma_info_page_walks_utilization > 0.5" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", - 
"MetricExpr": "min(13 * LD_BLOCKS.STORE_FORWARD / CPU_CLK_UNHALTED= .THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_store_fwd_blk", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_read_bw" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "min((12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_R= QSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES *= (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUT= STANDING.CYCLES_WITH_DEMAND_RFO))) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_l1_bound_group"= , - "MetricName": "tma_lock_latency", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. 
Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them.", - "ScaleUnit": "100%" + "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", + "MetricExpr": "64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time", + "MetricGroup": "Mem;MemoryBW;Server;SoC", + "MetricName": "tma_info_pmm_write_bw" }, { - "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry. ", - "MetricExpr": "min(Load_Miss_Real_Latency * LD_BLOCKS.NO_SR / CPU_= CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_split_loads", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", + "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_= clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license0_utilization", + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." }, { - "BriefDescription": "This metric estimates how often memory load a= ccesses were aliased by preceding stores (in program order) with a 4K addre= ss offset", - "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / CPU_CLK_UNHALTED.= THREAD", - "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", - "MetricName": "tma_4k_aliasing", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . 
However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", + "MetricExpr": "(CORE_POWER.LVL1_TURBO_LICENSE / 2 / tma_info_core_= clks if #SMT_on else CORE_POWER.LVL1_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license1_utilization", + "MetricThreshold": "tma_info_power_license1_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." }, { - "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", - "MetricExpr": "Load_Miss_Real_Latency * cpu@L1D_PEND_MISS.FB_FULL\= \,cmask\\=3D0x1@ / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;TopdownL4;tma_L4_group;tma_l1_bound_group= ", - "MetricName": "tma_fb_full", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). 
Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory).", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", + "MetricExpr": "(CORE_POWER.LVL2_TURBO_LICENSE / 2 / tma_info_core_= clks if #SMT_on else CORE_POWER.LVL2_TURBO_LICENSE / tma_info_core_clks)", + "MetricGroup": "Power", + "MetricName": "tma_info_power_license2_utilization", + "MetricThreshold": "tma_info_power_license2_utilization > 0.5", + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / CPU_CLK_UNHALTED.THREAD)", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", - "MetricName": "tma_l2_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. 
L1 = misses/L2 hits) can improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_retire" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", - "MetricName": "tma_l3_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "4 * tma_info_core_clks", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_slots" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "min(((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNH= ALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5= * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ /= 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM *= (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HI= TM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + (47.5 * (= CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9= / (duration_time * 1e3 / 1e3)) - 3.5 * 
(CPU_CLK_UNHALTED.THREAD / CPU_CLK_= UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3))) *= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LO= AD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_l3_bound_group", - "MetricName": "tma_contested_accesses", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing.", - "ScaleUnit": "100%" + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_smt_2t_utilization" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "min((47.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHA= LTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 = * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / = 1e9 / (duration_time * 1e3 / 1e3))) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + M= EM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHE= R_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3= _HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Snoop;TopdownL4;tma_L4_group;tma_l3_bound_= group", - "MetricName": "tma_data_sharing", - 
"PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance.", - "ScaleUnit": "100%" + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "cha_0@event\\=3D0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_socket_clks" }, { - "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited)", - "MetricExpr": "min((20.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHA= LTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 3.5 = * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / = 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LO= AD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD= , 1)", - "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l3_bound_grou= p", - "MetricName": "tma_l3_hit_latency", - "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L3 cache under unloaded scenarios (pos= sibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L= 3 hits) will improve the latency; reduce contention with sibling physical c= ores and increase performance. 
Note the value of this node may overlap wit= h its siblings.", - "ScaleUnit": "100%" + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_store_stlb_mpki" }, { - "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", - "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / CORE_CLKS", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_l3_bou= nd_group", - "MetricName": "tma_sq_full", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
The Super Queue is used for = requests to access the L2 cache or to go out to the Uncore.", - "ScaleUnit": "100%" + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_turbo_utilization" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", - "MetricExpr": "min(CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTE= D.THREAD + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS)= / CPU_CLK_UNHALTED.THREAD - tma_l2_bound - min(((1 - (19 * (MEM_LOAD_L3_MI= SS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED= .FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MIS= S_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIR= ED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOC= AL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LO= AD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD= _L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIR= ED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD = + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CL= K_UNHALTED.THREAD - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE= _PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1), = 1)", - 
"MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", - "MetricName": "tma_dram_bound", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance.", - "ScaleUnit": "100%" + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_uoppi", + "MetricThreshold": "tma_info_uoppi > 1.05" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory (DRAM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_b= ound_group", - "MetricName": "tma_mem_bandwidth", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory (DRAM). The underlying heuristic assumes that a simi= lar off-core traffic is generated by all IA cores. 
This metric does not agg= regate non-data-read requests by this logical processor; requests from othe= r IA Logical Processors/Physical Cores/sockets; or other non-IA devices lik= e GPU; hence the maximum external memory bandwidth limits may or may not be= approached when this metric is flagged (see Uncore counters for that).", - "ScaleUnit": "100%" + "BriefDescription": "Instruction per taken branch", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_uptb", + "MetricThreshold": "tma_info_uptb < 6" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory (DRAM= )", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / CPU_CLK_UNHALTED.THREAD - tma_mem_bandwidth", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_= bound_group", - "MetricName": "tma_mem_latency", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory (DRA= M). This metric does not aggregate requests from other Logical Processors/= Physical Cores/sockets (see Uncore counters for that).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_64B.IFTAG_STALL / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;= tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "min((80 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALT= ED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 *= (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1= e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * = (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHA= LTED.THREAD, 1)", - "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", - "MetricName": "tma_local_dram", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance.", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_clks, 0)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. 
Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "min((147.5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNH= ALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.= 5 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ = / 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRA= M * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_= UNHALTED.THREAD, 1)", - "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", - "MetricName": "tma_remote_dram", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article", + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_clks)", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. 
L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "min(((110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHA= LTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5= * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ /= 1e9 / (duration_time * 1e3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM= + (110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC= _FREQ / 1e9 / (duration_time * 1e3 / 1e3)) - 20.5 * (CPU_CLK_UNHALTED.THREA= D / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e= 3 / 1e3))) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.F= B_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_me= m_latency_group", - "MetricName": "tma_remote_cache", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. 
#link to NUMA article", + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_clks", + "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TopdownL3;tma_L3_= group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", - "MetricExpr": "min(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DR= AM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_= LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_R= ETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIR= ED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HIT= M * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM= _LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_L= OAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.R= EMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_= LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_= RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIR= ED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REM= OTE_PMM * (1 + 
MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (C= YCLE_ACTIVITY.STALLS_L3_MISS / CPU_CLK_UNHALTED.THREAD + (CYCLE_ACTIVITY.ST= ALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / CPU_CLK_UNHALTED.THREAD - = tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETI= RED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0), 1)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module. ", + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited)", + "MetricExpr": "17 * tma_info_average_frequency * MEM_LOAD_RETIRED.= L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_= l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L3 cache under unloaded scenarios (pos= sibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L= 3 hits) will improve the latency; reduce contention with sibling physical c= ores and increase performance. Note the value of this node may overlap wit= h its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: t= ma_info_memory_latency, tma_mem_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", - "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / CPU_CLK_UNHALTED.THR= EAD", - "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", - "MetricName": "tma_store_bound", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck.", + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "ILD_STALL.LCP / tma_info_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_dsb_coverage, tma_info_dsb_misses, tma_info_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", - "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_store= _bound_group", - "MetricName": "tma_store_latency", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "MetricExpr": "tma_retiring - tma_heavy_operations", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized software = running on Intel Core/Xeon products. While this often indicates efficient X= 86 instructions were executed; high value does not necessarily mean better = performance cannot be achieved. 
Sample with: INST_RETIRED.PREC_DIST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "min((110 * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHAL= TED.REF_TSC * #SYSTEM_TSC_FREQ / 1e9 / (duration_time * 1e3 / 1e3)) * (OCR.= DEMAND_RFO.L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 = * (CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ / = 1e9 / (duration_time * 1e3 / 1e3)) * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE= + OCR.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / CPU_CLK_UNHALTED.THREAD, 1)", - "MetricGroup": "DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;t= ma_store_bound_group", - "MetricName": "tma_false_sharing", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. ", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT= .PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 *= tma_info_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.PORT_2_3", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / CORE_CLKS", - "MetricGroup": "TopdownL4;tma_L4_group;tma_store_bound_group", - "MetricName": "tma_split_stores", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= .", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "min((9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / CORE_CLKS, 1)", - "MetricGroup": "MemoryTLB;TopdownL4;tma_L4_group;tma_store_bound_g= roup", - "MetricName": "tma_dtlb_store", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. 
Try = using larger page sizes for large amounts of frequently-used data.", + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", - "MetricExpr": "tma_dtlb_store - DTLB_STORE_MISSES.WALK_ACTIVE / CO= RE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", - "MetricName": "tma_store_stlb_hit", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", + "MetricExpr": "59.5 * tma_info_average_frequency * MEM_LOAD_L3_MIS= S_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_M= ISS / 2) / tma_info_clks", + "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", + "MetricName": "tma_local_dram", + "MetricThreshold": "tma_local_dram > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", - "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / CORE_CLKS", - "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", - "MetricName": "tma_store_stlb_miss", + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_clks", + "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", - "MetricExpr": "tma_backend_bound - tma_memory_bound", - "MetricGroup": "Backend;Compute;TopdownL2;tma_L2_group;tma_L2_grou= p;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts", + "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group= ;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", - "MetricExpr": "ARITH.DIVIDER_ACTIVE / CPU_CLK_UNHALTED.THREAD", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", - "MetricName": "tma_divider", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication.", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory (DRAM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_b= ound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory (DRAM). The underlying heuristic assumes that a simi= lar off-core traffic is generated by all IA cores. This metric does not agg= regate non-data-read requests by this logical processor; requests from othe= r IA Logical Processors/Physical Cores/sockets; or other non-IA devices lik= e GPU; hence the maximum external memory bandwidth limits may or may not be= approached when this metric is flagged (see Uncore counters for that). 
Rel= ated metrics: tma_fb_full, tma_info_dram_bw_use, tma_info_memory_bandwidth,= tma_sq_full", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_= PORTS_UTIL + UOPS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL)= ) / CPU_CLK_UNHALTED.THREAD if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS= _TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + UO= PS_RETIRED.RETIRE_SLOTS / SLOTS * EXE_ACTIVITY.2_PORTS_UTIL) / CPU_CLK_UNHA= LTED.THREAD)", - "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", - "MetricName": "tma_ports_utilization", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations.", + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory (DRAM= )", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_clks - tma_mem_bandwidth", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_= bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory (DRA= M). This metric does not aggregate requests from other Logical Processors/= Physical Cores/sockets (see Uncore counters for that). Related metrics: tma= _info_memory_latency, tma_l3_hit_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else= CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_0", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric.", + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND= _ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + t= ma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES) * = tma_backend_bound", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / CPU_CLK_UNHALTED.TH= READ", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group"= , - "MetricName": "tma_serializing_operation", - "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. 
Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions.", - "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / CPU_CLK_UNHALTED.= THREAD", - "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_g= roup", - "MetricName": "tma_slow_pause", + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.M= S_UOPS / tma_info_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. 
Related metrics: tma_clears_resteers,= tma_l1_bound, tma_machine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "The Mixing_Vectors metric gives the percentag= e of injected blend uops out of all uops issued", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD * UOPS_ISSUED.VECTOR_WI= DTH_MISMATCH / UOPS_ISSUED.ANY, 1)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_0_group"= , - "MetricName": "tma_mixing_vectors", - "PublicDescription": "The Mixing_Vectors metric gives the percenta= ge of injected blend uops out of all uops issued. Usually a Mixing_Vectors = over 5% is worth investigating. Read more in Appendix B1 of the Optimizatio= ns Guide for this topic.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_clks", + "MetricGroup": "BadSpec;BrMispredicts;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_branch_mispredicts, tma_info_branch_misprediction_cost, tma_inf= o_mispredictions", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / CORE_CLKS"= , - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_1", - "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful.", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES= _4_UOPS) / tma_info_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & (tma_fetch_bandwidth > 0.1 & = tma_frontend_bound > 0.15 & tma_info_ipc / 4 > 0.35)", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. 
For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / CORE_CLKS"= , - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_2", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop.", + "BriefDescription": "The Mixing_Vectors metric gives the percentag= e of injected blend uops out of all uops issued", + "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "The Mixing_Vectors metric gives the percenta= ge of injected blend uops out of all uops issued. Usually a Mixing_Vectors = over 5% is worth investigating. Read more in Appendix B1 of the Optimizatio= ns Guide for this topic. 
Related metrics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / CORE_CLKS", - "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", - "MetricName": "tma_ports_utilized_3m", + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. 
Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / SLOT= S", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", - "MetricName": "tma_alu_op_utilization", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. 
Can be used to exa= mine fusible conditional jumps that were not fused.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / CORE_CLKS", - "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group", - "MetricName": "tma_port_0", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_light_operati= ons > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p", - "MetricName": "tma_port_1", + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_m= emory_operations + tma_fused_instructions + tma_non_fused_branches + tma_no= p_instructions))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
May undercount due to FMA double counting", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] = ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p", - "MetricName": "tma_port_5", + "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", + "MetricConstraint": "NO_GROUP_EVENTS", + "MetricExpr": "((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD= _L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIR= ED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.F= B_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * = (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOA= D_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOT= E_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD= _L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.F= B_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_= PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE= _ACTIVITY.STALLS_L3_MISS / tma_info_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS = - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_clks - tma_l2_bound) if 1e6 * (= MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOA= D_RETIRED.L1_MISS else 0)", + "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= 
;tma_memory_bound_group", + "MetricName": "tma_pmm_bound", + "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", + "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple = ALU)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p", - "MetricName": "tma_port_6", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_clks", + "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. 
Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", - "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT= .PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 *= CORE_CLKS)", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", - "MetricName": "tma_load_op_utilization", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; [= ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_gro= up", "MetricName": "tma_port_2", + "MetricThreshold": "tma_port_2 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 2 ([SNB+]Loads and Store-address; = [ICL+] Loads). 
Sample with: UOPS_DISPATCHED_PORT.PORT_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; [= ICL+] Loads)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / CORE_CLKS", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_gro= up", "MetricName": "tma_port_3", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / CORE_CLKS", - "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", - "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_port_3 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 3 ([SNB+]Loads and Store-address; = [ICL+] Loads). Sample with: UOPS_DISPATCHED_PORT.PORT_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 4 (Store-data)", "MetricExpr": "tma_store_op_utilization", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_gr= oup", + "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_= utilization_group", "MetricName": "tma_port_4", + "MetricThreshold": "tma_port_4 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 4 (Store-data). Sample with: UOPS_= DISPATCHED_PORT.PORT_4. 
Related metrics: tma_split_stores", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", - "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / CORE_CLKS", - "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_gr= oup", - "MetricName": "tma_port_7", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] = ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_5", + "MetricThreshold": "tma_port_5 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / SLOTS", - "MetricGroup": "TopdownL1;tma_L1_group;tma_L1_group", - "MetricName": "tma_retiring", - "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. 
They often indicate suboptimal p= erformance and can often be optimized or avoided. ", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple = ALU)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+]Primary Branch and simple= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_= 512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", - "MetricExpr": "tma_retiring - (UOPS_RETIRED.RETIRE_SLOTS + UOPS_RE= TIRED.MACRO_FUSED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_ret= iring_group", - "MetricName": "tma_light_operations", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UPI metri= c) ratio of 1 or less should be expected for decently optimized software ru= nning on Intel Core/Xeon products. 
While this often indicates efficient X86= instructions were executed; high value does not necessarily mean better pe= rformance cannot be achieved.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 7 ([HSW+]simple Store-address)", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_gr= oup", + "MetricName": "tma_port_7", + "MetricThreshold": "tma_port_7 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 7 ([HSW+]simple Store-address). Sa= mple with: UOPS_DISPATCHED_PORT.PORT_7", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ + tma_fp_scalar + tma_fp_vector", - "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", - "MetricName": "tma_fp_arith", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_= PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_clks if = ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_= MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_POR= TS_UTIL) / tma_info_clks)", + "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", - "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", - "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group"= , - "MetricName": "tma_x87_use", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_NONE / 2 if #SMT_on else= CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_co= re_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INS= T_RETIRED.SCALAR_DOUBLE) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group", - "MetricName": "tma_fp_scalar", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. 
May overcount due to = FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_D= OUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.51= 2B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED= .RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group", - "MetricName": "tma_fp_vector", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group", - "MetricName": "tma_fp_vector_128b", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting.", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_clks", + "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.7 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group", - "MetricName": "tma_fp_vector_256b", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. 
May overcount= due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(89.5 * tma_info_average_frequency * MEM_LOAD_L3_MI= SS_RETIRED.REMOTE_HITM + 89.5 * tma_info_average_frequency * MEM_LOAD_L3_MI= SS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 512-bit wide vectors", - "MetricExpr": "min((FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_= ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS, 1)", - "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group", - "MetricName": "tma_fp_vector_512b", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. 
May overcount= due to FMA double counting.", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", + "MetricExpr": "127 * tma_info_average_frequency * MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_M= ISS / 2) / tma_info_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", + "MetricName": "tma_remote_dram", + "MetricThreshold": "tma_remote_dram > 0.1 & (tma_mem_latency > 0.1= & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", - "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_memory_operations", + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_slots", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. 
Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .RETIRE_SLOTS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", - "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_fused_instructions", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. The instruction pairs of CMP+JCC or DEC+JC= C are commonly used examples.", + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_clks", + "MetricGroup": "PortsUtil;TopdownL5;tma_L5_group;tma_issueSO;tma_p= orts_utilized_0_group", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_ports_u= tilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. 
Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - UOPS_RETIRED.MACRO_FUSED) / UOPS_RETIRED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_non_fused_branches", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / tma_info_clks", + "MetricGroup": "TopdownL6;tma_L6_group;tma_serializing_operation_g= roup", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & = (tma_core_bound > 0.1 & tma_backend_bound > 0.2))))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_nop_instructions", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. 
start address of a function or loop b= ody.", + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "tma_info_load_miss_real_latency * LD_BLOCKS.NO_SR /= tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes. May undercount due to FMA double counting", - "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_m= emory_operations + tma_fused_instructions + tma_non_fused_branches + tma_no= p_instructions))", - "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", - "MetricName": "tma_other_light_ops", + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_clks"= , + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. 
Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or microcoded sequences", - "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / SLOTS", - "MetricGroup": "Retire;TopdownL2;tma_L2_group;tma_L2_group;tma_ret= iring_group", - "MetricName": "tma_heavy_operations", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or microcoded sequences. This highly-correlates with the= uop length of these instructions/sequences.", + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueB= W;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_dram_bw_use, tma_info_memory_bandwidth, tma_mem_bandwidth", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", - "MetricExpr": "tma_heavy_operations - UOPS_RETIRED.RETIRE_SLOTS / = UOPS_ISSUED.ANY * IDQ.MS_UOPS / SLOTS", - "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group"= , - "MetricName": "tma_few_uops_instructions", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions.", + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_clks", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.M= S_UOPS / SLOTS", - "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group", - "MetricName": "tma_microcode_sequencer", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided.", + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", - "MetricExpr": "min(100 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / SLO= TS, 1)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", - "MetricName": "tma_assists", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. 
Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases.", + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_clks", + "MetricGroup": "MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issue= RFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", - "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", - "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", - "MetricName": "tma_cisc", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per core", - "MetricExpr": "cstate_core@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Core_Residency", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_clks"= , + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", "ScaleUnit": "100%" }, { - "BriefDescription": "C7 residency percent per core", - "MetricExpr": "cstate_core@c7\\-residency@ / 
TSC", - "MetricGroup": "Power", - "MetricName": "C7_Core_Residency", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "9 * BACLEARS.ANY / tma_info_clks", + "MetricGroup": "BigFoot;FetchLat;TopdownL4;tma_L4_group;tma_branch= _resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit). Sample with: BACLEARS.AN= Y", "ScaleUnit": "100%" }, { - "BriefDescription": "C2 residency percent per package", - "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C2_Pkg_Residency", + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group"= , + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint.", "ScaleUnit": "100%" }, { - "BriefDescription": "C3 residency percent per package", - "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C3_Pkg_Residency", + "BriefDescription": "Percentage of cycles in aborted transactions.= ", + "MetricExpr": "max(cpu@cycles\\-t@ - cpu@cycles\\-ct@, 0) / cycles= ", + "MetricGroup": "transaction", + "MetricName": "tsx_aborted_cycles", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per package", - "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C6_Pkg_Residency", - "ScaleUnit": "100%" + "BriefDescription": "Number of cycles within a transaction divided= by the number of elisions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@el\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_elision", + "ScaleUnit": "1cycles / elision" }, { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", + "BriefDescription": "Number of cycles within a transaction divided= by the number of transactions.", + "MetricExpr": "cpu@cycles\\-t@ / cpu@tx\\-start@", + "MetricGroup": "transaction", + "MetricName": "tsx_cycles_per_transaction", + "ScaleUnit": "1cycles / transaction" + }, + { + "BriefDescription": "Percentage of cycles within a transaction reg= ion.", + "MetricExpr": "cpu@cycles\\-t@ / cycles", + "MetricGroup": "transaction", + "MetricName": "tsx_transactional_cycles", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json b/to= ols/perf/pmu-events/arch/x86/cascadelakex/frontend.json index 13ccf50db43d..04f08e4d2402 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/frontend.json @@ -322,7 +322,7 @@ "UMask": "0x4" }, { - "BriefDescription": "Cycles when uops are being 
delivered to Instr= uction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops are being delivered to Instr= uction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_CYCLES", @@ -331,7 +331,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Cycles when uops initiated by Decode Stream B= uffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Mic= rocode Sequenser (MS) is busy", + "BriefDescription": "Cycles when uops initiated by Decode Stream B= uffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Mic= rocode Sequencer (MS) is busy", "CounterMask": "1", "EventCode": "0x79", "EventName": "IDQ.MS_DSB_CYCLES", @@ -340,7 +340,7 @@ "UMask": "0x10" }, { - "BriefDescription": "Uops initiated by MITE and delivered to Instr= uction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops initiated by MITE and delivered to Instr= uction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_MITE_UOPS", "PublicDescription": "Counts the number of uops initiated by MITE = and delivered to Instruction Decode Queue (IDQ) while the Microcode Sequenc= er (MS) is busy. Counting includes uops that may 'bypass' the IDQ.", @@ -358,7 +358,7 @@ "UMask": "0x30" }, { - "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) while Microcode Sequenser (MS) is busy", + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) while Microcode Sequencer (MS) is busy", "EventCode": "0x79", "EventName": "IDQ.MS_UOPS", "PublicDescription": "Counts the total number of uops delivered by= the Microcode Sequencer (MS). Any instruction over 4 uops will be delivere= d by the MS. 
Some instructions such as transcendentals may additionally gen= erate uops from the MS.", diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json b/to= ols/perf/pmu-events/arch/x86/cascadelakex/pipeline.json index 64e1fe351333..0f06e314fe36 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/pipeline.json @@ -93,6 +93,22 @@ "SampleAfterValue": "400009", "UMask": "0x10" }, + { + "BriefDescription": "Speculative and retired mispredicted macro co= nditional branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.ALL_BRANCHES", + "PublicDescription": "This event counts both taken and not taken s= peculative and retired mispredicted branch instructions.", + "SampleAfterValue": "200003", + "UMask": "0xff" + }, + { + "BriefDescription": "Speculative mispredicted indirect branches", + "EventCode": "0x89", + "EventName": "BR_MISP_EXEC.INDIRECT", + "PublicDescription": "Counts speculatively miss-predicted indirect= branches at execution time. Counts for indirect near CALL or JMP instructi= ons (RET excluded).", + "SampleAfterValue": "200003", + "UMask": "0xe4" + }, { "BriefDescription": "All mispredicted macro branch instructions re= tired.", "EventCode": "0xC5", diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json= b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json index 70a2c0ff8dfd..aafd2c9b813b 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-memory.json @@ -192,7 +192,7 @@ "EventCode": "0x9", "EventName": "UNC_M_ECC_CORRECTABLE_ERRORS", "PerPkg": "1", - "PublicDescription": "Counts the number of ECC errors detected and= corrected by the iMC on this channel. This counter is only useful with EC= C DRAM devices. This count will increment one time for each correction reg= ardless of the number of bits corrected. 
The iMC can correct up to 4 bit e= rrors in independent channel mode and 8 bit erros in lockstep mode.", + "PublicDescription": "Counts the number of ECC errors detected and= corrected by the iMC on this channel. This counter is only useful with EC= C DRAM devices. This count will increment one time for each correction reg= ardless of the number of bits corrected. The iMC can correct up to 4 bit e= rrors in independent channel mode and 8 bit errors in lockstep mode.", "Unit": "iMC" }, { @@ -212,7 +212,7 @@ "Unit": "iMC" }, { - "BriefDescription": "UNC_M_MAJMODE2.PMM_CYC", + "BriefDescription": "Major Mode 2 : Cycles in PMM major mode", "EventCode": "0xED", "EventName": "UNC_M_MAJMODE2.PMM_CYC", "PerPkg": "1", @@ -220,7 +220,7 @@ "Unit": "iMC" }, { - "BriefDescription": "UNC_M_MAJMODE2.PMM_ENTER", + "BriefDescription": "Major Mode 2 : Entered PMM major mode", "EventCode": "0xED", "EventName": "UNC_M_MAJMODE2.PMM_ENTER", "PerPkg": "1", @@ -290,7 +290,7 @@ "Unit": "iMC" }, { - "BriefDescription": "All commands for Intel Optane DC persistent m= emory", + "BriefDescription": "All commands for Intel(R) Optane(TM) DC persi= stent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.ALL", "PerPkg": "1", @@ -314,7 +314,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Regular reads(RPQ) commands for Intel Optane = DC persistent memory", + "BriefDescription": "Regular reads(RPQ) commands for Intel(R) Opta= ne(TM) DC persistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.RD", "PerPkg": "1", @@ -331,7 +331,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Underfill read commands for Intel Optane DC p= ersistent memory", + "BriefDescription": "Underfill read commands for Intel(R) Optane(T= M) DC persistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.UFILL_RD", "PerPkg": "1", @@ -348,7 +348,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Write commands for Intel Optane DC persistent= memory", + "BriefDescription": "Write commands for Intel(R) 
Optane(TM) DC per= sistent memory", "EventCode": "0xEA", "EventName": "UNC_M_PMM_CMD1.WR", "PerPkg": "1", @@ -522,7 +522,7 @@ "Unit": "iMC" }, { - "BriefDescription": "Write Pending Queue Occupancy of all write re= quests for Intel Optane DC persistent memory", + "BriefDescription": "Write Pending Queue Occupancy of all write re= quests for Intel(R) Optane(TM) DC persistent memory", "EventCode": "0xE4", "EventName": "UNC_M_PMM_WPQ_OCCUPANCY.ALL", "PerPkg": "1", @@ -2735,7 +2735,7 @@ "EventCode": "0x81", "EventName": "UNC_M_WPQ_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Counts the number of entries in the Write Pe= nding Queue (WPQ) at each cycle. This can then be used to calculate both t= he average queue occupancy (in conjunction with the number of cycles not em= pty) and the average latency (in conjunction with the number of allocations= ). The WPQ is used to schedule writes out to the memory controller and to = track the requests. Requests allocate into the WPQ soon after they enter t= he memory controller, and need credits for an entry in this buffer before b= eing sent from the CHA to the iMC (memory controller). They deallocate aft= er being issued to DRAM. Write requests themselves are able to complete (f= rom the perspective of the rest of the system) as soon they have 'posted' t= o the iMC. This is not to be confused with actually performing the write t= o DRAM. Therefore, the average latency for this queue is actually not usef= ul for deconstruction intermediate write latencies. So, we provide filteri= ng based on if the request has posted or not. By using the 'not posted' fi= lter, we can track how long writes spent in the iMC before completions were= sent to the HA. The 'posted' filter, on the other hand, provides informat= ion about how much queueing is actually happenning in the iMC for writes be= fore they are actually issued to memory. High average occupancies will gen= erally coincide with high write major mode counts. 
Is there a filter of sor= ts???", + "PublicDescription": "Counts the number of entries in the Write Pe= nding Queue (WPQ) at each cycle. This can then be used to calculate both t= he average queue occupancy (in conjunction with the number of cycles not em= pty) and the average latency (in conjunction with the number of allocations= ). The WPQ is used to schedule writes out to the memory controller and to = track the requests. Requests allocate into the WPQ soon after they enter t= he memory controller, and need credits for an entry in this buffer before b= eing sent from the CHA to the iMC (memory controller). They deallocate aft= er being issued to DRAM. Write requests themselves are able to complete (f= rom the perspective of the rest of the system) as soon they have 'posted' t= o the iMC. This is not to be confused with actually performing the write t= o DRAM. Therefore, the average latency for this queue is actually not usef= ul for deconstruction intermediate write latencies. So, we provide filteri= ng based on if the request has posted or not. By using the 'not posted' fi= lter, we can track how long writes spent in the iMC before completions were= sent to the HA. The 'posted' filter, on the other hand, provides informat= ion about how much queueing is actually happening in the iMC for writes bef= ore they are actually issued to memory. High average occupancies will gene= rally coincide with high write major mode counts. 
Is there a filter of sort= s???", "Unit": "iMC" }, { diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json = b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json index ef4767feb4e2..5f3ed5e843b9 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-other.json @@ -44,7 +44,7 @@ "MetricName": "LLC_MISSES.PCIE_WRITE", "PerPkg": "1", "PortMask": "0x01", - "PublicDescription": "Counts every write request of 4 bytes of dat= a made by IIO Part0 to a unit onthe main die (generally memory). In the gen= eral case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) tha= t is plugged directly into one of the PCIe slots. Part0 could also refer to= any device plugged into the first slot of a PCIe riser card or to a device= attached to the IIO unit which starts its use of the bus using lane 0 of t= he 16 lanes supported by the bus.", + "PublicDescription": "Counts every write request of 4 bytes of dat= a made by IIO Part0 to a unit on the main die (generally memory). In the ge= neral case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) th= at is plugged directly into one of the PCIe slots. 
Part0 could also refer t= o any device plugged into the first slot of a PCIe riser card or to a devic= e attached to the IIO unit which starts its use of the bus using lane 0 of = the 16 lanes supported by the bus.", "ScaleUnit": "4Bytes", "UMask": "0x1", "Unit": "IIO" @@ -856,7 +856,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Counts the number of Allocate/Update to HitMe= Cache; Deallocate HtiME$ on Reads without RspFwdI*", + "BriefDescription": "Counts the number of Allocate/Update to HitMe= Cache; Deallocate HitME$ on Reads without RspFwdI*", "EventCode": "0x61", "EventName": "UNC_CHA_HITME_UPDATE.DEALLOCATE", "PerPkg": "1", @@ -1210,7 +1210,7 @@ "EventCode": "0x34", "EventName": "UNC_CHA_LLC_LOOKUP.WRITE", "PerPkg": "1", - "PublicDescription": "Counts the number of times the LLC was acces= sed - this includes code, data, prefetches and hints coming from L2. This = has numerous filters available. Note the non-standard filtering equation. = This event will count requests that lookup the cache multiple times with m= ultiple increments. One must ALWAYS set umask bit 0 and select a state or = states to match. Otherwise, the event will count nothing. CHAFilter0[24:= 21,17] bits correspond to [FMESI] state.; Writeback transactions from L2 to= the LLC This includes all write transactions -- both Cachable and UC.", + "PublicDescription": "Counts the number of times the LLC was acces= sed - this includes code, data, prefetches and hints coming from L2. This = has numerous filters available. Note the non-standard filtering equation. = This event will count requests that lookup the cache multiple times with m= ultiple increments. One must ALWAYS set umask bit 0 and select a state or = states to match. Otherwise, the event will count nothing. 
CHAFilter0[24:= 21,17] bits correspond to [FMESI] state.; Writeback transactions from L2 to= the LLC This includes all write transactions -- both Cacheable and UC.", "UMask": "0x5", "Unit": "CHA" }, @@ -3481,7 +3481,7 @@ "EventCode": "0x5D", "EventName": "UNC_CHA_SNOOP_RESP_LOCAL.RSPSFWD", "PerPkg": "1", - "PublicDescription": "Number of snoop responses received for a Loc= al request; Filters for a snoop response of RspSFwd to local CA requests. = This is returned when a remote caching agent forwards data but holds on to= its currentl copy. This is common for data and code reads that hit in a r= emote socket in E or F state.", + "PublicDescription": "Number of snoop responses received for a Loc= al request; Filters for a snoop response of RspSFwd to local CA requests. = This is returned when a remote caching agent forwards data but holds on to= its current copy. This is common for data and code reads that hit in a re= mote socket in E or F state.", "UMask": "0x8", "Unit": "CHA" }, @@ -4082,10 +4082,11 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.ALL", + "BriefDescription": "TOR Occupancy : All", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.ALL", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : All : For each cycle, this e= vent accumulates the number of valid entries in the TOR that match qualific= ations specified by the subevent. 
Does not include addressless requests= such as locks and interrupts.", "UMask": "0xff", "Unit": "CHA" }, @@ -4153,20 +4154,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", + "BriefDescription": "TOR Occupancy : CRds issued by iA Cores that = Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", "Filter": "config1=3D0x40233", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", + "BriefDescription": "TOR Occupancy : DRds issued by iA Cores that = Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", "Filter": "config1=3D0x40433", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, @@ -4189,20 +4192,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", + "BriefDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores= that hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", "Filter": "config1=3D0x4b033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Core= s that hit the LLC : For each cycle, this event accumulates the number of v= alid entries in the TOR that match qualifications specified by the subevent= . 
Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", + "BriefDescription": "TOR Occupancy : RFOs issued by iA Cores that = Hit the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", "Filter": "config1=3D0x40033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", "Unit": "CHA" }, @@ -4216,11 +4221,12 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", + "BriefDescription": "TOR Occupancy : CRds issued by iA Cores that = Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", "Filter": "config1=3D0x40233", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, @@ -4253,20 +4259,22 @@ "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", + "BriefDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores= that missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", "Filter": "config1=3D0x4b033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Core= s that missed the LLC : For each cycle, this event accumulates the number o= f valid entries in the TOR that match qualifications specified by the subev= ent. 
Does not include addressless requests such as locks and interrupts= .", "UMask": "0x21", "Unit": "CHA" }, { - "BriefDescription": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", + "BriefDescription": "TOR Occupancy : RFOs issued by iA Cores that = Missed the LLC", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", "Filter": "config1=3D0x40033", "PerPkg": "1", + "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", "Unit": "CHA" }, @@ -4308,7 +4316,7 @@ "Unit": "CHA" }, { - "BriefDescription": "TOR Occupancy; RDCUR isses from Local IO", + "BriefDescription": "TOR Occupancy; RDCUR misses from Local IO", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RDCUR", "Filter": "config1=3D0x43C33", @@ -11637,7 +11645,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x01", - "PublicDescription": "Counts every write request of 4 bytes of dat= a made by IIO Part0 to a unit onthe main die (generally memory). In the gen= eral case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) tha= t is plugged directly into one of the PCIe slots. Part0 could also refer to= any device plugged into the first slot of a PCIe riser card or to a device= attached to the IIO unit which starts its use of the bus using lane 0 of t= he 16 lanes supported by the bus.", + "PublicDescription": "Counts every write request of 4 bytes of dat= a made by IIO Part0 to a unit on the main die (generally memory). In the ge= neral case, Part0 refers to a standard PCIe card of any size (x16,x8,x4) th= at is plugged directly into one of the PCIe slots. 
Part0 could also refer t= o any device plugged into the first slot of a PCIe riser card or to a devic= e attached to the IIO unit which starts its use of the bus using lane 0 of = the 16 lanes supported by the bus.", "UMask": "0x1", "Unit": "IIO" }, @@ -12024,7 +12032,7 @@ "Unit": "IIO" }, { - "BriefDescription": "UNC_IIO_NOTHING", + "BriefDescription": "Counting disabled", "EventName": "UNC_IIO_NOTHING", "PerPkg": "1", "Unit": "IIO" @@ -15622,7 +15630,7 @@ "EventCode": "0xC", "EventName": "UNC_I_TxS_REQUEST_OCCUPANCY", "PerPkg": "1", - "PublicDescription": "Accumultes the number of outstanding outboun= d requests from the IRP to the switch (towards the devices). This can be u= sed in conjuection with the allocations event in order to calculate average= latency of outbound requests.", + "PublicDescription": "Accumulates the number of outstanding outbou= nd requests from the IRP to the switch (towards the devices). This can be = used in conjunction with the allocations event in order to calculate averag= e latency of outbound requests.", "Unit": "IRP" }, { @@ -16128,35 +16136,35 @@ "Unit": "M2M" }, { - "BriefDescription": "Number of reads in which direct to Intel UPI = transactions were overridden", + "BriefDescription": "Number of reads in which direct to Intel(R) U= PI transactions were overridden", "EventCode": "0x28", "EventName": "UNC_M2M_DIRECT2UPI_NOT_TAKEN_CREDITS", "PerPkg": "1", - "PublicDescription": "Counts reads in which direct to Intel Ultra = Path Interconnect (UPI) transactions (which would have bypassed the CHA) we= re overridden", + "PublicDescription": "Counts reads in which direct to Intel(R) Ult= ra Path Interconnect (UPI) transactions (which would have bypassed the CHA)= were overridden", "Unit": "M2M" }, { - "BriefDescription": "Cycles when direct to Intel UPI was disabled"= , + "BriefDescription": "Cycles when direct to Intel(R) UPI was disabl= ed", "EventCode": "0x27", "EventName": "UNC_M2M_DIRECT2UPI_NOT_TAKEN_DIRSTATE", "PerPkg": "1", - 
"PublicDescription": "Counts cycles when the ability to send messa= ges direct to the Intel Ultra Path Interconnect (bypassing the CHA) was dis= abled", + "PublicDescription": "Counts cycles when the ability to send messa= ges direct to the Intel(R) Ultra Path Interconnect (bypassing the CHA) was = disabled", "Unit": "M2M" }, { - "BriefDescription": "Messages sent direct to the Intel UPI", + "BriefDescription": "Messages sent direct to the Intel(R) UPI", "EventCode": "0x26", "EventName": "UNC_M2M_DIRECT2UPI_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts when messages were sent direct to the= Intel Ultra Path Interconnect (bypassing the CHA)", + "PublicDescription": "Counts when messages were sent direct to the= Intel(R) Ultra Path Interconnect (bypassing the CHA)", "Unit": "M2M" }, { - "BriefDescription": "Number of reads that a message sent direct2 I= ntel UPI was overridden", + "BriefDescription": "Number of reads that a message sent direct2 I= ntel(R) UPI was overridden", "EventCode": "0x29", "EventName": "UNC_M2M_DIRECT2UPI_TXN_OVERRIDE", "PerPkg": "1", - "PublicDescription": "Counts when a read message that was sent dir= ect to the Intel Ultra Path Interconnect (bypassing the CHA) was overridden= ", + "PublicDescription": "Counts when a read message that was sent dir= ect to the Intel(R) Ultra Path Interconnect (bypassing the CHA) was overrid= den", "Unit": "M2M" }, { @@ -16583,7 +16591,7 @@ "Unit": "M2M" }, { - "BriefDescription": "Read requests to Intel Optane DC persistent m= emory issued to the iMC from M2M", + "BriefDescription": "Read requests to Intel(R) Optane(TM) DC persi= stent memory issued to the iMC from M2M", "EventCode": "0x37", "EventName": "UNC_M2M_IMC_READS.TO_PMM", "PerPkg": "1", @@ -16650,7 +16658,7 @@ "Unit": "M2M" }, { - "BriefDescription": "Write requests to Intel Optane DC persistent = memory issued to the iMC from M2M", + "BriefDescription": "Write requests to Intel(R) Optane(TM) DC pers= istent memory issued to the iMC from 
M2M", "EventCode": "0x38", "EventName": "UNC_M2M_IMC_WRITES.TO_PMM", "PerPkg": "1", @@ -16675,7 +16683,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; C= hannel 0", + "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Chan= nel 0", "EventCode": "0x4F", "EventName": "UNC_M2M_PMM_RPQ_CYCLES_REG_CREDITS.CHN0", "PerPkg": "1", @@ -16683,7 +16691,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; C= hannel 1", + "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Chan= nel 1", "EventCode": "0x4F", "EventName": "UNC_M2M_PMM_RPQ_CYCLES_REG_CREDITS.CHN1", "PerPkg": "1", @@ -16691,7 +16699,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; C= hannel 2", + "BriefDescription": "M2M->iMC RPQ Cycles w/Credits - Regular; Chan= nel 2", "EventCode": "0x4F", "EventName": "UNC_M2M_PMM_RPQ_CYCLES_REG_CREDITS.CHN2", "PerPkg": "1", @@ -16699,7 +16707,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; C= hannel 0", + "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Chan= nel 0", "EventCode": "0x51", "EventName": "UNC_M2M_PMM_WPQ_CYCLES_REG_CREDITS.CHN0", "PerPkg": "1", @@ -16707,7 +16715,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; C= hannel 1", + "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Chan= nel 1", "EventCode": "0x51", "EventName": "UNC_M2M_PMM_WPQ_CYCLES_REG_CREDITS.CHN1", "PerPkg": "1", @@ -16715,7 +16723,7 @@ "Unit": "M2M" }, { - "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; C= hannel 2", + "BriefDescription": "M2M->iMC WPQ Cycles w/Credits - Regular; Chan= nel 2", "EventCode": "0x51", "EventName": "UNC_M2M_PMM_WPQ_CYCLES_REG_CREDITS.CHN2", "PerPkg": "1", @@ -16737,11 +16745,11 @@ "Unit": "M2M" }, { - "BriefDescription": "Prefecth requests that got turn into a demand= request", + "BriefDescription": "Prefetch requests 
that got turn into a demand= request", "EventCode": "0x56", "EventName": "UNC_M2M_PREFCAM_DEMAND_PROMOTIONS", "PerPkg": "1", - "PublicDescription": "Counts when the M2M (Mesh to Memory) promote= s a outstanding request in the prefetch queue due to a subsequent demand re= ad request that entered the M2M with the same address. Explanatory Side No= te: The Prefecth queue is made of CAM (Content Addressable Memory)", + "PublicDescription": "Counts when the M2M (Mesh to Memory) promote= s a outstanding request in the prefetch queue due to a subsequent demand re= ad request that entered the M2M with the same address. Explanatory Side No= te: The Prefetch queue is made of CAM (Content Addressable Memory)", "Unit": "M2M" }, { @@ -20804,7 +20812,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S0_BL_ARB", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around t= he Ingress Queue; AD is taking bypass to slot 0 of independent flit while b= l message is in arbitration", + "PublicDescription": "Number of times message is bypassed around t= he Ingress Queue; AD is taking bypass to slot 0 of independent flit while b= l message is in arbitration", "UMask": "0x2", "Unit": "M3UPI" }, @@ -20813,7 +20821,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S0_IDLE", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around t= he Ingress Queue; AD is taking bypass to slot 0 of independent flit while p= ipeline is idle", + "PublicDescription": "Number of times message is bypassed around t= he Ingress Queue; AD is taking bypass to slot 0 of independent flit while p= ipeline is idle", "UMask": "0x1", "Unit": "M3UPI" }, @@ -20822,7 +20830,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S1_BL_SLOT", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around t= he Ingress Queue; AD is taking bypass to flit slot 1 while merging with bl = message in same flit", + 
"PublicDescription": "Number of times message is bypassed around t= he Ingress Queue; AD is taking bypass to flit slot 1 while merging with bl = message in same flit", "UMask": "0x4", "Unit": "M3UPI" }, @@ -20831,7 +20839,7 @@ "EventCode": "0x40", "EventName": "UNC_M3UPI_RxC_BYPASSED.AD_S2_BL_SLOT", "PerPkg": "1", - "PublicDescription": "Number ot times message is bypassed around t= he Ingress Queue; AD is taking bypass to flit slot 2 while merging with bl = message in same flit", + "PublicDescription": "Number of times message is bypassed around t= he Ingress Queue; AD is taking bypass to flit slot 2 while merging with bl = message in same flit", "UMask": "0x8", "Unit": "M3UPI" }, @@ -21397,7 +21405,7 @@ "Unit": "M3UPI" }, { - "BriefDescription": "Flit Gen - Header 1; Acumullate", + "BriefDescription": "Flit Gen - Header 1; Accumulate", "EventCode": "0x53", "EventName": "UNC_M3UPI_RxC_FLIT_GEN_HDR1.ACCUM", "PerPkg": "1", @@ -24618,7 +24626,7 @@ "EventCode": "0x29", "EventName": "UNC_M3UPI_UPI_PREFETCH_SPAWN", "PerPkg": "1", - "PublicDescription": "Count cases where flow control queue that si= ts between the Intel Ultra Path Interconnect (UPI) and the mesh spawns a pr= efetch to the iMC (Memory Controller)", + "PublicDescription": "Count cases where flow control queue that si= ts between the Intel(R) Ultra Path Interconnect (UPI) and the mesh spawns a= prefetch to the iMC (Memory Controller)", "Unit": "M3UPI" }, { @@ -24973,11 +24981,11 @@ "Unit": "M2M" }, { - "BriefDescription": "Clocks of the Intel Ultra Path Interconnect (= UPI)", + "BriefDescription": "Clocks of the Intel(R) Ultra Path Interconnec= t (UPI)", "EventCode": "0x1", "EventName": "UNC_UPI_CLOCKTICKS", "PerPkg": "1", - "PublicDescription": "Counts clockticks of the fixed frequency clo= ck controlling the Intel Ultra Path Interconnect (UPI). This clock runs at= 1/8th the 'GT/s' speed of the UPI link. 
For example, a 9.6GT/s link will= have a fixed Frequency of 1.2 Ghz.", + "PublicDescription": "Counts clockticks of the fixed frequency clo= ck controlling the Intel(R) Ultra Path Interconnect (UPI). This clock runs= at1/8th the 'GT/s' speed of the UPI link. For example, a 9.6GT/s link w= ill have a fixed Frequency of 1.2 Ghz.", "Unit": "UPI LL" }, { @@ -24999,11 +25007,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Data Response packets that go direct to Intel= UPI", + "BriefDescription": "Data Response packets that go direct to Intel= (R) UPI", "EventCode": "0x12", "EventName": "UNC_UPI_DIRECT_ATTEMPTS.D2U", "PerPkg": "1", - "PublicDescription": "Counts Data Response (DRS) packets that atte= mpted to go direct to Intel Ultra Path Interconnect (UPI) bypassing the CHA= .", + "PublicDescription": "Counts Data Response (DRS) packets that atte= mpted to go direct to Intel(R) Ultra Path Interconnect (UPI) bypassing the = CHA .", "UMask": "0x2", "Unit": "UPI LL" }, @@ -25072,11 +25080,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles Intel UPI is in L1 power mode (shutdow= n)", + "BriefDescription": "Cycles Intel(R) UPI is in L1 power mode (shut= down)", "EventCode": "0x21", "EventName": "UNC_UPI_L1_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the Intel Ultra Path Inte= rconnect (UPI) is in L1 power mode. L1 is a mode that totally shuts down t= he UPI link. Link power states are per link and per direction, so for exam= ple the Tx direction could be in one state while Rx was in another, this ev= ent only coutns when both links are shutdown.", + "PublicDescription": "Counts cycles when the Intel(R) Ultra Path I= nterconnect (UPI) is in L1 power mode. L1 is a mode that totally shuts dow= n the UPI link. 
Link power states are per link and per direction, so for e= xample the Tx direction could be in one state while Rx was in another, this= event only coutns when both links are shutdown.", "Unit": "UPI LL" }, { @@ -25238,11 +25246,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles the Rx of the Intel UPI is in L0p powe= r mode", + "BriefDescription": "Cycles the Rx of the Intel(R) UPI is in L0p p= ower mode", "EventCode": "0x25", "EventName": "UNC_UPI_RxL0P_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the receive side (Rx) of = the Intel Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode = where we disable 60% of the UPI lanes, decreasing our bandwidth in order to= save power.", + "PublicDescription": "Counts cycles when the receive side (Rx) of = the Intel(R) Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mo= de where we disable 60% of the UPI lanes, decreasing our bandwidth in order= to save power.", "Unit": "UPI LL" }, { @@ -25451,7 +25459,7 @@ "EventCode": "0x3", "EventName": "UNC_UPI_RxL_FLITS.ALL_DATA", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow contro= l unITs: 64bits of data) received from any of the 3 Intel Ultra Path Interc= onnect (UPI) Receive Queue slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow contro= l unITs: 64bits of data) received from any of the 3 Intel(R) Ultra Path Int= erconnect (UPI) Receive Queue slots on this UPI unit.", "UMask": "0xf", "Unit": "UPI LL" }, @@ -25460,7 +25468,7 @@ "EventCode": "0x3", "EventName": "UNC_UPI_RxL_FLITS.ALL_NULL", "PerPkg": "1", - "PublicDescription": "Counts null FLITs (80 bit FLow control unITs= ) received from any of the 3 Intel Ultra Path Interconnect (UPI) Receive Qu= eue slots on this UPI unit.", + "PublicDescription": "Counts null FLITs (80 bit FLow control unITs= ) received from any of the 3 Intel(R) Ultra Path Interconnect (UPI) Receive= Queue slots on this UPI unit.", 
"UMask": "0x27", "Unit": "UPI LL" }, @@ -25784,11 +25792,11 @@ "Unit": "UPI LL" }, { - "BriefDescription": "Cycles in which the Tx of the Intel Ultra Pat= h Interconnect (UPI) is in L0p power mode", + "BriefDescription": "Cycles in which the Tx of the Intel(R) Ultra = Path Interconnect (UPI) is in L0p power mode", "EventCode": "0x27", "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts cycles when the transmit side (Tx) of= the Intel Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a mode= where we disable 60% of the UPI lanes, decreasing our bandwidth in order t= o save power.", + "PublicDescription": "Counts cycles when the transmit side (Tx) of= the Intel(R) Ultra Path Interconnect(UPI) is in L0p power mode. L0p is a m= ode where we disable 60% of the UPI lanes, decreasing our bandwidth in orde= r to save power.", "Unit": "UPI LL" }, { @@ -25960,7 +25968,7 @@ "EventCode": "0x41", "EventName": "UNC_UPI_TxL_BYPASSED", "PerPkg": "1", - "PublicDescription": "Counts incoming FLITs (FLow control unITs) w= hich bypassed the TxL(transmit) FLIT buffer and pass directly out the UPI L= ink. Generally, when data is transmitted across the Intel Ultra Path Interc= onnect (UPI), it will bypass the TxQ and pass directly to the link. Howeve= r, the TxQ will be used in L0p (Low Power) mode and (Link Layer Retry) LLR = mode, increasing latency to transfer out to the link.", + "PublicDescription": "Counts incoming FLITs (FLow control unITs) w= hich bypassed the TxL(transmit) FLIT buffer and pass directly out the UPI L= ink. Generally, when data is transmitted across the Intel(R) Ultra Path Int= erconnect (UPI), it will bypass the TxQ and pass directly to the link. 
How= ever, the TxQ will be used in L0p (Low Power) mode and (Link Layer Retry) L= LR mode, increasing latency to transfer out to the link.", "Unit": "UPI LL" }, { @@ -25968,7 +25976,7 @@ "EventCode": "0x2", "EventName": "UNC_UPI_TxL_FLITS.ALL_DATA", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control= unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel Ultra Path= Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control= unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra P= ath Interconnect (UPI) slots on this UPI unit.", "UMask": "0xf", "Unit": "UPI LL" }, @@ -25977,7 +25985,7 @@ "EventCode": "0x2", "EventName": "UNC_UPI_TxL_FLITS.ALL_NULL", "PerPkg": "1", - "PublicDescription": "Counts null FLITs (80 bit FLow control unITs= ) transmitted via any of the 3 Intel Ulra Path Interconnect (UPI) slots on = this UPI unit.", + "PublicDescription": "Counts null FLITs (80 bit FLow control unITs= ) transmitted via any of the 3 Intel(R) Ulra Path Interconnect (UPI) slots = on this UPI unit.", "UMask": "0x27", "Unit": "UPI LL" }, @@ -26328,7 +26336,7 @@ "EventCode": "0x2", "EventName": "UPI_DATA_BANDWIDTH_TX", "PerPkg": "1", - "PublicDescription": "Counts valid data FLITs (80 bit FLow control= unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel Ultra Path= Interconnect (UPI) slots on this UPI unit.", + "PublicDescription": "Counts valid data FLITs (80 bit FLow control= unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra P= ath Interconnect (UPI) slots on this UPI unit.", "ScaleUnit": "7.11E-06Bytes", "UMask": "0xf", "Unit": "UPI LL" diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json = b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json index 6835e14cd42c..c6254af7a468 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json +++ 
b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-power.json @@ -143,7 +143,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C0", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the n= umber of cores that are in the chosen C-State. It can be used by itself to= get the average number of cores in that C-state with threshholding to gene= rate histograms, or with other PCU events and occupancy triggering to captu= re other details.", + "PublicDescription": "This is an occupancy event that tracks the n= umber of cores that are in the chosen C-State. It can be used by itself to= get the average number of cores in that C-state with thresholding to gener= ate histograms, or with other PCU events and occupancy triggering to captur= e other details.", "Unit": "PCU" }, { @@ -151,7 +151,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C3", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the n= umber of cores that are in the chosen C-State. It can be used by itself to= get the average number of cores in that C-state with threshholding to gene= rate histograms, or with other PCU events and occupancy triggering to captu= re other details.", + "PublicDescription": "This is an occupancy event that tracks the n= umber of cores that are in the chosen C-State. It can be used by itself to= get the average number of cores in that C-state with thresholding to gener= ate histograms, or with other PCU events and occupancy triggering to captur= e other details.", "Unit": "PCU" }, { @@ -159,7 +159,7 @@ "EventCode": "0x80", "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C6", "PerPkg": "1", - "PublicDescription": "This is an occupancy event that tracks the n= umber of cores that are in the chosen C-State. 
It can be used by itself to= get the average number of cores in that C-state with threshholding to gene= rate histograms, or with other PCU events and occupancy triggering to captu= re other details.", + "PublicDescription": "This is an occupancy event that tracks the n= umber of cores that are in the chosen C-State. It can be used by itself to= get the average number of cores in that C-state with thresholding to gener= ate histograms, or with other PCU events and occupancy triggering to captur= e other details.", "Unit": "PCU" }, { @@ -175,7 +175,7 @@ "EventCode": "0x9", "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles that we are in I= nteral PROCHOT mode. This mode is triggered when a sensor on the die deter= mines that we are too hot and must throttle to avoid damaging the chip.", + "PublicDescription": "Counts the number of cycles that we are in I= nternal PROCHOT mode. This mode is triggered when a sensor on the die dete= rmines that we are too hot and must throttle to avoid damaging the chip.", "Unit": "PCU" }, { diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index cad51223d0ea..793076e00188 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -5,7 +5,7 @@ GenuineIntel-6-(1C|26|27|35|36),v4,bonnell,core GenuineIntel-6-(3D|47),v26,broadwell,core GenuineIntel-6-56,v7,broadwellde,core GenuineIntel-6-4F,v19,broadwellx,core -GenuineIntel-6-55-[56789ABCDEF],v1.16,cascadelakex,core +GenuineIntel-6-55-[56789ABCDEF],v1.17,cascadelakex,core GenuineIntel-6-9[6C],v1.03,elkhartlake,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core --=20 2.39.2.637.g21b0678d19-goog