From: "Liang, Kan"
To: "peterz@infradead.org", "mingo@redhat.com", "eranian@google.com",
	"linux-kernel@vger.kernel.org"
CC: "alexander.shishkin@linux.intel.com", "acme@redhat.com",
	"jolsa@redhat.com", "torvalds@linux-foundation.org", "tglx@linutronix.de",
	"vincent.weaver@maine.edu", "ak@linux.intel.com",
	"Don Zickus (dzickus@redhat.com)"
Subject: RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
Date: Fri, 23 Jun 2017 17:21:15 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753711263@SHSMSX103.ccr.corp.intel.com>
In-Reply-To: <1497029283-3332-1-git-send-email-kan.liang@intel.com>
References: <1497029283-3332-1-git-send-email-kan.liang@intel.com>

Hi all,

Any comments on the series?

Some users (on both client and server platforms) have reported spurious
NMI watchdog timeouts. There is now a proposed workaround from tglx,
which we are testing. However, this patch series is believed to be the
real fix for the issue, so it would be better to get the series merged
into mainline.

Here is the workaround, if you are interested:
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/

Thanks,
Kan

>
> From: Kan Liang
>
> The main motivation is to make it possible to switch the NMI watchdog
> from cycles to ref_cycles on the x86 platform. The new NMI watchdog could
> - Free the widely used, precious cycles fixed counter. For example,
>   topdown code needs the cycles fixed counter in group counting.
>   Otherwise, it has to fall back to not using groups, which can cause
>   measurement inaccuracy due to multiplexing errors.
> - Avoid the NMI watchdog expiring too fast. Cycles can tick faster than
>   the measured CPU frequency due to Turbo mode. Although the cycles
>   period can be enlarged to work around the fast expiry, it is still
>   limited by the Turbo ratio and may eventually fail.
>
> CPU ref_cycles can only be counted by fixed counter 2. It uses a
> pseudo-encoding which the GP counters do not recognize.
>
> BUS_CYCLES (0x013c) is another event which is not affected by core
> frequency changes.
> It has a constant ratio with the CPU ref_cycles. BUS_CYCLES could be
> used as an alternative event for ref_cycles on a GP counter.
> A hook is implemented in x86_schedule_events. If fixed counter 2 is
> occupied and a GP counter is assigned, BUS_CYCLES is used to replace
> ref_cycles. A new flag PERF_X86_EVENT_REF_CYCLES_REP in hw_perf_event
> is introduced to indicate the replacement.
> To make the switch transparent to the perf tool, counting and sampling
> are also specially handled.
> - For counting, the result is multiplied by the constant ratio after
>   being read.
> - For sampling with a fixed period, the BUS_CYCLES period = ref_cycles
>   period / the constant ratio.
> - For sampling with a fixed frequency, the adaptive frequency algorithm
>   works it out from the ref_cycles event's last_period and event count,
>   but the reload value has to be the remaining BUS_CYCLES period.
>
> User visible RDPMC issue
> It is impossible to transparently handle the case where ref-cycles is
> read with the RDPMC instruction in user space.
> ref_cycles_factor_for_rdpmc is exposed for RDPMC users.
> The few users who want to read ref-cycles via RDPMC have to correct the
> result by multiplying it by ref_cycles_factor_for_rdpmc manually, if a
> GP counter is used.
> The solution is only for ref-cycles. It doesn't impact the results of
> other events.
>
> The constant ratio is model specific.
> For models after Nehalem but before Skylake, the ratio is defined in
> MSR_PLATFORM_INFO.
> For models after Skylake, it can be obtained from CPUID.15H.
> For Knights Landing, Goldmont and later, the ratio is always 1.
>
> The old Silvermont/Airmont, Core2 and Atom machines are not covered by
> the patch. The behavior on those machines will not change.
>
> Signed-off-by: Kan Liang
> ---
>
> Changes since V1:
> - Retry the whole scheduling for the 0x0300 replacement event,
>   if the 0x0300 event is unscheduled.
> - Correct the method to calculate event->count, period_left, left
>   and offset in the different cases.
> - Expose the factor to user space.
> - Modify the changelog.
> - There is no transparent way to handle ref-cycles replacement events for
>   RDPMC. RDPMC users have to multiply the results by the factor if the
>   ref-cycles replacement event is scheduled on a GP counter.
>   It will not impact other events.
>   There are few RDPMC users who use ref-cycles, so the impact should be
>   limited. I will also send patches to other user tools, such as
>   pmu-tools and PAPI, to minimize the impact.
>
>
> arch/x86/events/core.c | 92 ++++++++++++++++++++++++++++++++++-- > arch/x86/events/intel/core.c | 110 > ++++++++++++++++++++++++++++++++++++++++++- > arch/x86/events/perf_event.h | 5 ++ > 3 files changed, 203 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index > e6f5e4b..18f8d37 100644 > --- a/arch/x86/events/core.c > +++ b/arch/x86/events/core.c > @@ -70,7 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event) > int shift = 64 - x86_pmu.cntval_bits; > u64 prev_raw_count, new_raw_count; > int idx = hwc->idx; > - u64 delta; > + u64 delta, adjust_delta; > > if (idx == INTEL_PMC_IDX_FIXED_BTS) > return 0; > @@ -101,8 +101,47 @@ u64 x86_perf_event > *event) > delta = (new_raw_count << shift) - (prev_raw_count << shift); > delta >>= shift; > > - local64_add(delta, &event->count); > - local64_sub(delta, &hwc->period_left); > + /* > + * Correct the count number if applying ref_cycles replacement.
> + * There is a constant ratio between ref_cycles (event A) and > + * ref_cycles replacement (event B). The delta result is from event B. > + * To get accurate event A count, the real delta result must be > + * multiplied with the constant ratio. > + * > + * It is handled differently for sampling and counting. > + * - Fixed period Sampling: The period is already adjusted in > + * scheduling for event B. The period_left is the remaining period > + * for event B. Don't need to modify period_left. > + * It's enough to only adjust event->count here. > + * > + * - Fixed freq Sampling: The adaptive frequency algorithm needs > + * the last_period and event counts for event A to estimate the next > + * period for event A. So the period_left is the remaining period > + * for event A. It needs to multiply the result with the ratio. > + * However, the period_left is also used to reload the event counter > + * for event B in x86_perf_event_set_period. It has to be adjusted to > + * event B's remaining period there. > + * > + * - Counting: It's enough to only adjust event->count for perf stat. > + * > + * - RDPMC: User can also read the counter directly by rdpmc/mmap. > + * Users have to handle the adjustment themselves. > + * For kernel, it only needs to guarantee that the offset is correct. > + * In x86's arch_perf_update_userpage, the offset will be corrected > + * if event B is used. > + */ > + adjust_delta = delta; > + if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) { > + adjust_delta = delta * x86_pmu.ref_cycles_factor; > + > + if (is_sampling_event(event) && event->attr.freq) > + local64_sub(adjust_delta, &hwc->period_left); > + else > + local64_sub(delta, &hwc->period_left); > + } else > + local64_sub(delta, &hwc->period_left); > + > + local64_add(adjust_delta, &event->count); > > return new_raw_count; > } > @@ -865,6 +904,7 @@ int x86_schedule_events(struct cpu_hw_events > *cpuc, int n, int *assign) > if (x86_pmu.start_scheduling) > x86_pmu.start_scheduling(cpuc); > > +retry: > for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) { > cpuc->event_constraint[i] = NULL; > c = x86_pmu.get_event_constraints(cpuc, i, cpuc- > >event_list[i]); @@ -952,6 +992,25 @@ int x86_schedule_events(struct > cpu_hw_events *cpuc, int n, int *assign) > */ > if (x86_pmu.put_event_constraints) > x86_pmu.put_event_constraints(cpuc, e); > + > + /* > + * 0x0300 is pseudo-encoding for REF_CPU_CYCLES. > + * It can only be scheduled on fixed counter 2. > + * If the fixed counter 2 is occupied, the event is > + * failed scheduling. > + * > + * If that's the case, an alternative event, which > + * can be counted in GP counters, could be used to > + * replace the pseudo-encoding REF_CPU_CYCLES > event. > + * > + * Here we reset the event and retry the whole > + * scheduling thing if there is unsched 0x0300 event. > + */ > + if (unsched && x86_pmu.ref_cycles_rep && > + ((e->hw.config & X86_RAW_EVENT_MASK) == > 0x0300)) { > + x86_pmu.ref_cycles_rep(e); > + goto retry; > + } > } > } > > @@ -1136,6 +1195,14 @@ int x86_perf_event_set_period(struct perf_event > *event) > hwc->last_period = period; > ret = 1; > } > + > + /* > + * Adjust left for ref_cycles replacement. > + * Please refer the comments in x86_perf_event_update for details. 
> + */ > + if ((hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) && > + event->attr.freq) > + left /= (u64)x86_pmu.ref_cycles_factor; > /* > * Quirk: certain CPUs dont like it if just 1 hw_event is left: > */ > @@ -1823,6 +1890,14 @@ static int __init init_hw_perf_events(void) > x86_pmu_attr_group.attrs = tmp; > } > > + if (x86_pmu.factor_attrs) { > + struct attribute **tmp; > + > + tmp = merge_attr(x86_pmu_attr_group.attrs, > x86_pmu.factor_attrs); > + if (!WARN_ON(!tmp)) > + x86_pmu_attr_group.attrs = tmp; > + } > + > pr_info("... version: %d\n", x86_pmu.version); > pr_info("... bit width: %d\n", x86_pmu.cntval_bits); > pr_info("... generic registers: %d\n", x86_pmu.num_counters); > @@ -2300,6 +2375,17 @@ void arch_perf_update_userpage(struct > perf_event *event, > } > > cyc2ns_read_end(data); > + > + /* > + * offset should be corrected if ref cycles replacement is used. > + * Please refer the comments in x86_perf_event_update for details. > + */ > + if (event->hw.flags & PERF_X86_EVENT_REF_CYCLES_REP) { > + userpg->offset = __perf_event_count(event); > + userpg->offset /= (u64)x86_pmu.ref_cycles_factor; > + if (userpg->index) > + userpg->offset -= local64_read(&event- > >hw.prev_count); > + } > } > > void > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index > da9047e..4853795 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -3063,6 +3063,55 @@ static unsigned bdw_limit_period(struct > perf_event *event, unsigned left) > return left; > } > > +static __init unsigned int glm_get_ref_cycles_factor(void) { > + return 1; > +} > + > +static __init unsigned int nhm_get_ref_cycles_factor(void) { > + u64 platform_info; > + > + rdmsrl(MSR_PLATFORM_INFO, platform_info); > + return (platform_info >> 8) & 0xff; > +} > + > +static __init unsigned int skl_get_ref_cycles_factor(void) { > + unsigned int cpuid21_eax, cpuid21_ebx, cpuid21_ecx, unused; > + > + cpuid(21, &cpuid21_eax, &cpuid21_ebx, &cpuid21_ecx, &unused); > + if (!cpuid21_eax || !cpuid21_ebx) > + return 0; > + > + return cpuid21_ebx / cpuid21_eax; > +} > + > +/* > + * BUS_CYCLES (0x013c) is another event which is not affected by core > + * frequency changes. It has a constant ratio with the CPU ref_cycles. > + * BUS_CYCLES could be used as an alternative event for ref_cycles on > + * GP counter. 
> + */ > +void nhm_ref_cycles_rep(struct perf_event *event) { > + struct hw_perf_event *hwc = &event->hw; > + > + hwc->config = (hwc->config & ~X86_RAW_EVENT_MASK) | > + > intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES]; > + hwc->flags |= PERF_X86_EVENT_REF_CYCLES_REP; > + > + /* adjust the sample period for ref_cycles replacement event */ > + if (is_sampling_event(event) && hwc->sample_period != 1) { > + > + hwc->sample_period /= (u64)x86_pmu.ref_cycles_factor; > + if (hwc->sample_period < 2) > + hwc->sample_period = 2; > + hwc->last_period = hwc->sample_period; > + local64_set(&hwc->period_left, hwc->sample_period); > + } > +} > + > PMU_FORMAT_ATTR(event, "config:0-7" ); > PMU_FORMAT_ATTR(umask, "config:8-15" ); > PMU_FORMAT_ATTR(edge, "config:18" ); > @@ -3656,6 +3705,21 @@ static struct attribute *intel_pmu_attrs[] = { > NULL, > }; > > + > +static ssize_t ref_cycles_factor_for_rdpmc_show(struct device *cdev, > + struct device_attribute *attr, > + char *buf) > +{ > + return sprintf(buf, "%u\n", x86_pmu.ref_cycles_factor); } > + > +DEVICE_ATTR_RO(ref_cycles_factor_for_rdpmc); > + > +static struct attribute *intel_ref_cycles_factor_attrs[] = { > + &dev_attr_ref_cycles_factor_for_rdpmc.attr, > + NULL, > +}; > + > __init int intel_pmu_init(void) > { > union cpuid10_edx edx; > @@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void) > > intel_pmu_pebs_data_source_nhm(); > x86_add_quirk(intel_nehalem_quirk); > - > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Nehalem events, "); > break; > > @@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void) > x86_pmu.lbr_pt_coexist = true; > x86_pmu.flags |= PMU_FL_HAS_RSP_1; > x86_pmu.cpu_events = glm_events_attrs; > + x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Goldmont events, "); > break; > > @@ -3864,6 +3937,11 @@ __init int intel_pmu_init(void) > > X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1); > > intel_pmu_pebs_data_source_nhm(); > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Westmere events, "); > break; > > @@ -3899,6 +3977,11 @@ __init int intel_pmu_init(void) > /* UOPS_DISPATCHED.THREAD,c=1,i=1 to count stall cycles*/ > > intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BAC > KEND] = > > X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1); > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > > pr_cont("SandyBridge events, "); > break; > @@ -3933,6 +4016,11 @@ __init int intel_pmu_init(void) > /* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */ > > intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRO > NTEND] = > > X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1); > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > > pr_cont("IvyBridge events, "); > break; > @@ -3962,6 +4050,11 @@ __init 
int intel_pmu_init(void) > x86_pmu.get_event_constraints = > hsw_get_event_constraints; > x86_pmu.cpu_events = hsw_events_attrs; > x86_pmu.lbr_double_abort = true; > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Haswell events, "); > break; > > @@ -3998,6 +4091,11 @@ __init int intel_pmu_init(void) > x86_pmu.get_event_constraints = > hsw_get_event_constraints; > x86_pmu.cpu_events = hsw_events_attrs; > x86_pmu.limit_period = bdw_limit_period; > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Broadwell events, "); > break; > > @@ -4016,6 +4114,11 @@ __init int intel_pmu_init(void) > /* all extra regs are per-cpu when HT is on */ > x86_pmu.flags |= PMU_FL_HAS_RSP_1; > x86_pmu.flags |= PMU_FL_NO_HT_SHARING; > + x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > > pr_cont("Knights Landing/Mill events, "); > break; > @@ -4051,6 +4154,11 @@ __init int intel_pmu_init(void) > skl_format_attr); > WARN_ON(!x86_pmu.format_attrs); > x86_pmu.cpu_events = hsw_events_attrs; > + x86_pmu.ref_cycles_factor = skl_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Skylake events, "); > break; > > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h > index 53728ee..3470178 100644 > --- a/arch/x86/events/perf_event.h > +++ b/arch/x86/events/perf_event.h > @@ -68,6 +68,7 @@ struct event_constraint { > #define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */ > #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto- > reload */ > #define PERF_X86_EVENT_FREERUNNING 0x0800 /* use freerunning > PEBS */ > +#define PERF_X86_EVENT_REF_CYCLES_REP 0x1000 /* use ref_cycles > replacement */ > > > struct amd_nb { > @@ -550,6 +551,8 @@ struct x86_pmu { > int perfctr_second_write; > bool late_ack; > unsigned (*limit_period)(struct perf_event *event, unsigned l); > + unsigned int ref_cycles_factor; > + void (*ref_cycles_rep)(struct perf_event *event); > > /* > * sysfs attrs > @@ -565,6 +568,8 @@ struct x86_pmu { > unsigned long attr_freeze_on_smi; > struct attribute **attrs; > > + unsigned int attr_ref_cycles_factor; > + struct attribute **factor_attrs; > /* > * CPU Hotplug hooks > */ > -- > 2.7.4
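
As a quick numeric illustration of the constant-ratio handling described
in the changelog (this is not part of the patch; the factor of 24 is a
made-up example value, think of a 2400 MHz TSC with a 100 MHz bus clock),
a standalone sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t factor = 24;	/* example ref_cycles_factor value */

	/*
	 * Fixed-period sampling: the GP counter is programmed with the
	 * requested ref-cycles period divided by the factor.
	 */
	uint64_t ref_period = 2400000;
	uint64_t bus_period = ref_period / factor;	/* 100000 */

	/*
	 * Counting: the raw BUS_CYCLES delta is multiplied by the factor
	 * before it is reported as ref-cycles.
	 */
	uint64_t bus_delta = 100000;
	uint64_t ref_count = bus_delta * factor;	/* 2400000 */

	printf("period %llu -> %llu, delta %llu -> %llu\n",
	       (unsigned long long)ref_period, (unsigned long long)bus_period,
	       (unsigned long long)bus_delta, (unsigned long long)ref_count);
	return 0;
}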
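
For the user-visible RDPMC issue, here is a minimal user-space sketch of
how a self-monitoring application might apply the exposed factor. Again,
this is not part of the patch: the sysfs path is an assumption based on
where the other x86 PMU attributes (e.g. rdpmc, freeze_on_smi) show up,
and read_factor()/correct_ref_cycles() are hypothetical helpers wrapped
around whatever RDPMC + perf_event_mmap_page loop the application already
uses.

#include <stdio.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Assumed location of the new attribute; 1 means no correction needed. */
static unsigned int read_factor(void)
{
	unsigned int factor = 1;
	FILE *f = fopen("/sys/devices/cpu/ref_cycles_factor_for_rdpmc", "r");

	if (f) {
		if (fscanf(f, "%u", &factor) != 1)
			factor = 1;
		fclose(f);
	}
	return factor;
}

/*
 * pc is the mmap'ed perf_event_mmap_page of the ref-cycles event and raw
 * is the count the application reconstructed from pc->offset + RDPMC.
 * pc->index - 1 is the RDPMC ECX value; bit 30 set selects the fixed
 * counters, so only values taken from a GP counter need the correction.
 */
static uint64_t correct_ref_cycles(const struct perf_event_mmap_page *pc,
				   uint64_t raw, unsigned int factor)
{
	if (pc->index && !((pc->index - 1) & (1u << 30)))
		return raw * factor;
	return raw;
}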