From: "Liang, Kan"
To: "peterz@infradead.org", "mingo@redhat.com", "eranian@google.com",
	"linux-kernel@vger.kernel.org"
CC: "alexander.shishkin@linux.intel.com", "acme@redhat.com",
	"jolsa@redhat.com", "torvalds@linux-foundation.org", "tglx@linutronix.de",
	"vincent.weaver@maine.edu", "ak@linux.intel.com",
	"Don Zickus (dzickus@redhat.com)"
Subject: RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
Date: Fri, 23 Jun 2017 17:21:15 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753711263@SHSMSX103.ccr.corp.intel.com>
In-Reply-To: <1497029283-3332-1-git-send-email-kan.liang@intel.com>
References: <1497029283-3332-1-git-send-email-kan.liang@intel.com>

Hi all,

Any comments on the series?

Some users (on both client and server platforms) have reported spurious
NMI watchdog timeouts. There is now a proposed workaround from tglx,
which we are testing. However, this patch series is believed to be the
real fix for the issue, so it would be better to get the series merged
into mainline.

Here is the workaround, if you are interested:
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/

Thanks,
Kan

>
> From: Kan Liang
>
> The main motivation is to make it possible to switch the NMI watchdog
> from cycles to ref_cycles on the x86 platform. The new NMI watchdog could
> - Free the widely used, precious cycles fixed counter. For example,
>   topdown code needs the cycles fixed counter in group counting.
>   Otherwise, it has to fall back to not using groups, which can cause
>   measurement inaccuracy due to multiplexing errors.
> - Avoid the NMI watchdog expiring too fast. Cycles can tick faster than
>   the measured CPU frequency due to Turbo mode. Although the cycles
>   period can be enlarged to work around the fast expiry, it is still
>   limited by the Turbo ratio and may eventually fail.
>
> CPU ref_cycles can only be counted by fixed counter 2. It uses a
> pseudo-encoding which the GP counters do not recognize.
>
> BUS_CYCLES (0x013c) is another event which is not affected by core
> frequency changes.
> It has a constant ratio with the CPU ref_cycles. BUS_CYCLES could be
> used as an alternative event for ref_cycles on a GP counter.
> A hook is implemented in x86_schedule_events. If fixed counter 2 is
> occupied and a GP counter is assigned, BUS_CYCLES is used to replace
> ref_cycles. A new flag PERF_X86_EVENT_REF_CYCLES_REP in hw_perf_event
> is introduced to indicate the replacement.
> To make the switch transparent to the perf tool, counting and sampling
> are also specially handled.
> - For counting, the result is multiplied by the constant ratio after
>   being read.
> - For sampling with a fixed period, the BUS_CYCLES period = ref_cycles
>   period / the constant ratio.
> - For sampling with a fixed frequency, the adaptive frequency algorithm
>   works it out from the ref_cycles event's last_period and event count,
>   but the reload value has to be the remaining BUS_CYCLES period.
>
> User visible RDPMC issue
> It is impossible to transparently handle the case where ref-cycles is
> read with the RDPMC instruction in user space.
> ref_cycles_factor_for_rdpmc is exposed for RDPMC users.
> The few users who want to read ref-cycles via RDPMC have to correct the
> result by multiplying it by ref_cycles_factor_for_rdpmc manually, if a
> GP counter is used.
> The solution is only for ref-cycles. It doesn't impact the results of
> other events.
>
> The constant ratio is model specific.
> For models after Nehalem but before Skylake, the ratio is defined in
> MSR_PLATFORM_INFO.
> For models after Skylake, it can be obtained from CPUID.15H.
> For Knights Landing, Goldmont and later, the ratio is always 1.
>
> The old Silvermont/Airmont, Core2 and Atom machines are not covered by
> the patch. The behavior on those machines will not change.
>
> Signed-off-by: Kan Liang
> ---
>
> Changes since V1:
> - Retry the whole scheduling for the 0x0300 replacement event,
>   if the 0x0300 event is unscheduled.
> - Correct the method to calculate event->count, period_left, left
>   and offset in the different cases.
> - Expose the factor to user space.
> - Modify the changelog.
> - There is no transparent way to handle ref-cycles replacement events for
>   RDPMC. RDPMC users have to multiply the results by the factor if the
>   ref-cycles replacement event is scheduled on a GP counter.
>   It will not impact other events.
>   There are few RDPMC users who use ref-cycles, so the impact should be
>   limited. I will also send patches to other user tools, such as
>   pmu-tools and PAPI, to minimize the impact.
>
>
> arch/x86/events/core.c | 92 ++++++++++++++++++++++++++++++++++-- > arch/x86/events/intel/core.c | 110 > ++++++++++++++++++++++++++++++++++++++++++- > arch/x86/events/perf_event.h | 5 ++ > 3 files changed, 203 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index > e6f5e4b..18f8d37 100644 > --- a/arch/x86/events/core.c > +++ b/arch/x86/events/core.c > @@ -70,7 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event) > int shift = 64 - x86_pmu.cntval_bits; > u64 prev_raw_count, new_raw_count; > int idx = hwc->idx; > - u64 delta; > + u64 delta, adjust_delta; > > if (idx == INTEL_PMC_IDX_FIXED_BTS) > return 0; > @@ -101,8 +101,47 @@ u64 x86_perf_event > *event) > delta = (new_raw_count << shift) - (prev_raw_count << shift); > delta >>= shift; > > - local64_add(delta, &event->count); > - local64_sub(delta, &hwc->period_left); > + /* > + * Correct the count number if applying ref_cycles replacement.
> + * There is a constant ratio between ref_cycles (event A) and > + * ref_cycles replacement (event B). The delta result is from event B. > + * To get accurate event A count, the real delta result must be > + * multiplied with the constant ratio. > + * > + * It is handled differently for sampling and counting. > + * - Fixed period Sampling: The period is already adjusted in > + * scheduling for event B. The period_left is the remaining period > + * for event B. Don't need to modify period_left. > + * It's enough to only adjust event->count here. > + * > + * - Fixed freq Sampling: The adaptive frequency algorithm needs > + * the last_period and event counts for event A to estimate the next > + * period for event A. So the period_left is the remaining period > + * for event A. It needs to multiply the result with the ratio. > + * However, the period_left is also used to reload the event counter > + * for event B in x86_perf_event_set_period. It has to be adjusted to > + * event B's remaining period there. > + * > + * - Counting: It's enough to only adjust event->count for perf stat. > + * > + * - RDPMC: User can also read the counter directly by rdpmc/mmap. > + * Users have to handle the adjustment themselves. > + * For kernel, it only needs to guarantee that the offset is correct. > + * In x86's arch_perf_update_userpage, the offset will be corrected > + * if event B is used. > + */ > + adjust_delta = delta; > + if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) { > + adjust_delta = delta * x86_pmu.ref_cycles_factor; > + > + if (is_sampling_event(event) && event->attr.freq) > + local64_sub(adjust_delta, &hwc->period_left); > + else > + local64_sub(delta, &hwc->period_left); > + } else > + local64_sub(delta, &hwc->period_left); > + > + local64_add(adjust_delta, &event->count); > > return new_raw_count; > } > @@ -865,6 +904,7 @@ int x86_schedule_events(struct cpu_hw_events > *cpuc, int n, int *assign) > if (x86_pmu.start_scheduling) > x86_pmu.start_scheduling(cpuc); > > +retry: > for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) { > cpuc->event_constraint[i] = NULL; > c = x86_pmu.get_event_constraints(cpuc, i, cpuc- > >event_list[i]); @@ -952,6 +992,25 @@ int x86_schedule_events(struct > cpu_hw_events *cpuc, int n, int *assign) > */ > if (x86_pmu.put_event_constraints) > x86_pmu.put_event_constraints(cpuc, e); > + > + /* > + * 0x0300 is pseudo-encoding for REF_CPU_CYCLES. > + * It can only be scheduled on fixed counter 2. > + * If the fixed counter 2 is occupied, the event is > + * failed scheduling. > + * > + * If that's the case, an alternative event, which > + * can be counted in GP counters, could be used to > + * replace the pseudo-encoding REF_CPU_CYCLES > event. > + * > + * Here we reset the event and retry the whole > + * scheduling thing if there is unsched 0x0300 event. > + */ > + if (unsched && x86_pmu.ref_cycles_rep && > + ((e->hw.config & X86_RAW_EVENT_MASK) == > 0x0300)) { > + x86_pmu.ref_cycles_rep(e); > + goto retry; > + } > } > } > > @@ -1136,6 +1195,14 @@ int x86_perf_event_set_period(struct perf_event > *event) > hwc->last_period = period; > ret = 1; > } > + > + /* > + * Adjust left for ref_cycles replacement. > + * Please refer the comments in x86_perf_event_update for details. 
> + */ > + if ((hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) && > + event->attr.freq) > + left /= (u64)x86_pmu.ref_cycles_factor; > /* > * Quirk: certain CPUs dont like it if just 1 hw_event is left: > */ > @@ -1823,6 +1890,14 @@ static int __init init_hw_perf_events(void) > x86_pmu_attr_group.attrs = tmp; > } > > + if (x86_pmu.factor_attrs) { > + struct attribute **tmp; > + > + tmp = merge_attr(x86_pmu_attr_group.attrs, > x86_pmu.factor_attrs); > + if (!WARN_ON(!tmp)) > + x86_pmu_attr_group.attrs = tmp; > + } > + > pr_info("... version: %d\n", x86_pmu.version); > pr_info("... bit width: %d\n", x86_pmu.cntval_bits); > pr_info("... generic registers: %d\n", x86_pmu.num_counters); > @@ -2300,6 +2375,17 @@ void arch_perf_update_userpage(struct > perf_event *event, > } > > cyc2ns_read_end(data); > + > + /* > + * offset should be corrected if ref cycles replacement is used. > + * Please refer the comments in x86_perf_event_update for details. > + */ > + if (event->hw.flags & PERF_X86_EVENT_REF_CYCLES_REP) { > + userpg->offset = __perf_event_count(event); > + userpg->offset /= (u64)x86_pmu.ref_cycles_factor; > + if (userpg->index) > + userpg->offset -= local64_read(&event- > >hw.prev_count); > + } > } > > void > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index > da9047e..4853795 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -3063,6 +3063,55 @@ static unsigned bdw_limit_period(struct > perf_event *event, unsigned left) > return left; > } > > +static __init unsigned int glm_get_ref_cycles_factor(void) { > + return 1; > +} > + > +static __init unsigned int nhm_get_ref_cycles_factor(void) { > + u64 platform_info; > + > + rdmsrl(MSR_PLATFORM_INFO, platform_info); > + return (platform_info >> 8) & 0xff; > +} > + > +static __init unsigned int skl_get_ref_cycles_factor(void) { > + unsigned int cpuid21_eax, cpuid21_ebx, cpuid21_ecx, unused; > + > + cpuid(21, &cpuid21_eax, &cpuid21_ebx, &cpuid21_ecx, &unused); > + if (!cpuid21_eax || !cpuid21_ebx) > + return 0; > + > + return cpuid21_ebx / cpuid21_eax; > +} > + > +/* > + * BUS_CYCLES (0x013c) is another event which is not affected by core > + * frequency changes. It has a constant ratio with the CPU ref_cycles. > + * BUS_CYCLES could be used as an alternative event for ref_cycles on > + * GP counter. 
> + */ > +void nhm_ref_cycles_rep(struct perf_event *event) { > + struct hw_perf_event *hwc = &event->hw; > + > + hwc->config = (hwc->config & ~X86_RAW_EVENT_MASK) | > + > intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES]; > + hwc->flags |= PERF_X86_EVENT_REF_CYCLES_REP; > + > + /* adjust the sample period for ref_cycles replacement event */ > + if (is_sampling_event(event) && hwc->sample_period != 1) { > + > + hwc->sample_period /= (u64)x86_pmu.ref_cycles_factor; > + if (hwc->sample_period < 2) > + hwc->sample_period = 2; > + hwc->last_period = hwc->sample_period; > + local64_set(&hwc->period_left, hwc->sample_period); > + } > +} > + > PMU_FORMAT_ATTR(event, "config:0-7" ); > PMU_FORMAT_ATTR(umask, "config:8-15" ); > PMU_FORMAT_ATTR(edge, "config:18" ); > @@ -3656,6 +3705,21 @@ static struct attribute *intel_pmu_attrs[] = { > NULL, > }; > > + > +static ssize_t ref_cycles_factor_for_rdpmc_show(struct device *cdev, > + struct device_attribute *attr, > + char *buf) > +{ > + return sprintf(buf, "%u\n", x86_pmu.ref_cycles_factor); } > + > +DEVICE_ATTR_RO(ref_cycles_factor_for_rdpmc); > + > +static struct attribute *intel_ref_cycles_factor_attrs[] = { > + &dev_attr_ref_cycles_factor_for_rdpmc.attr, > + NULL, > +}; > + > __init int intel_pmu_init(void) > { > union cpuid10_edx edx; > @@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void) > > intel_pmu_pebs_data_source_nhm(); > x86_add_quirk(intel_nehalem_quirk); > - > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Nehalem events, "); > break; > > @@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void) > x86_pmu.lbr_pt_coexist = true; > x86_pmu.flags |= PMU_FL_HAS_RSP_1; > x86_pmu.cpu_events = glm_events_attrs; > + x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Goldmont events, "); > break; > > @@ -3864,6 +3937,11 @@ __init int intel_pmu_init(void) > > X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1); > > intel_pmu_pebs_data_source_nhm(); > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Westmere events, "); > break; > > @@ -3899,6 +3977,11 @@ __init int intel_pmu_init(void) > /* UOPS_DISPATCHED.THREAD,c=1,i=1 to count stall cycles*/ > > intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BAC > KEND] = > > X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1); > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > > pr_cont("SandyBridge events, "); > break; > @@ -3933,6 +4016,11 @@ __init int intel_pmu_init(void) > /* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */ > > intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRO > NTEND] = > > X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1); > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > > pr_cont("IvyBridge events, "); > break; > @@ -3962,6 +4050,11 @@ __init 
int intel_pmu_init(void) > x86_pmu.get_event_constraints = > hsw_get_event_constraints; > x86_pmu.cpu_events = hsw_events_attrs; > x86_pmu.lbr_double_abort = true; > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Haswell events, "); > break; > > @@ -3998,6 +4091,11 @@ __init int intel_pmu_init(void) > x86_pmu.get_event_constraints = > hsw_get_event_constraints; > x86_pmu.cpu_events = hsw_events_attrs; > x86_pmu.limit_period = bdw_limit_period; > + x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Broadwell events, "); > break; > > @@ -4016,6 +4114,11 @@ __init int intel_pmu_init(void) > /* all extra regs are per-cpu when HT is on */ > x86_pmu.flags |= PMU_FL_HAS_RSP_1; > x86_pmu.flags |= PMU_FL_NO_HT_SHARING; > + x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > > pr_cont("Knights Landing/Mill events, "); > break; > @@ -4051,6 +4154,11 @@ __init int intel_pmu_init(void) > skl_format_attr); > WARN_ON(!x86_pmu.format_attrs); > x86_pmu.cpu_events = hsw_events_attrs; > + x86_pmu.ref_cycles_factor = skl_get_ref_cycles_factor(); > + if (x86_pmu.ref_cycles_factor) { > + x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep; > + x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs; > + } > pr_cont("Skylake events, "); > break; > > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h > index 53728ee..3470178 100644 > --- a/arch/x86/events/perf_event.h > +++ b/arch/x86/events/perf_event.h > @@ -68,6 +68,7 @@ struct event_constraint { > #define PERF_X86_EVENT_EXCL_ACCT 0x0200 /* accounted EXCL event */ > #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto- > reload */ > #define PERF_X86_EVENT_FREERUNNING 0x0800 /* use freerunning > PEBS */ > +#define PERF_X86_EVENT_REF_CYCLES_REP 0x1000 /* use ref_cycles > replacement */ > > > struct amd_nb { > @@ -550,6 +551,8 @@ struct x86_pmu { > int perfctr_second_write; > bool late_ack; > unsigned (*limit_period)(struct perf_event *event, unsigned l); > + unsigned int ref_cycles_factor; > + void (*ref_cycles_rep)(struct perf_event *event); > > /* > * sysfs attrs > @@ -565,6 +568,8 @@ struct x86_pmu { > unsigned long attr_freeze_on_smi; > struct attribute **attrs; > > + unsigned int attr_ref_cycles_factor; > + struct attribute **factor_attrs; > /* > * CPU Hotplug hooks > */ > -- > 2.7.4
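
As a quick numeric illustration of the constant-ratio handling described
in the changelog (this is not part of the patch; the factor of 24 is a
made-up example value, think of a 2400 MHz TSC with a 100 MHz bus clock),
a standalone sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t factor = 24;	/* example ref_cycles_factor value */

	/*
	 * Fixed-period sampling: the GP counter is programmed with the
	 * requested ref-cycles period divided by the factor.
	 */
	uint64_t ref_period = 2400000;
	uint64_t bus_period = ref_period / factor;	/* 100000 */

	/*
	 * Counting: the raw BUS_CYCLES delta is multiplied by the factor
	 * before it is reported as ref-cycles.
	 */
	uint64_t bus_delta = 100000;
	uint64_t ref_count = bus_delta * factor;	/* 2400000 */

	printf("period %llu -> %llu, delta %llu -> %llu\n",
	       (unsigned long long)ref_period, (unsigned long long)bus_period,
	       (unsigned long long)bus_delta, (unsigned long long)ref_count);
	return 0;
}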
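
For the user-visible RDPMC issue, here is a minimal user-space sketch of
how a self-monitoring application might apply the exposed factor. Again,
this is not part of the patch: the sysfs path is an assumption based on
where the other x86 PMU attributes (e.g. rdpmc, freeze_on_smi) show up,
and read_factor()/correct_ref_cycles() are hypothetical helpers wrapped
around whatever RDPMC + perf_event_mmap_page loop the application already
uses.

#include <stdio.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Assumed location of the new attribute; 1 means no correction needed. */
static unsigned int read_factor(void)
{
	unsigned int factor = 1;
	FILE *f = fopen("/sys/devices/cpu/ref_cycles_factor_for_rdpmc", "r");

	if (f) {
		if (fscanf(f, "%u", &factor) != 1)
			factor = 1;
		fclose(f);
	}
	return factor;
}

/*
 * pc is the mmap'ed perf_event_mmap_page of the ref-cycles event and raw
 * is the count the application reconstructed from pc->offset + RDPMC.
 * pc->index - 1 is the RDPMC ECX value; bit 30 set selects the fixed
 * counters, so only values taken from a GP counter need the correction.
 */
static uint64_t correct_ref_cycles(const struct perf_event_mmap_page *pc,
				   uint64_t raw, unsigned int factor)
{
	if (pc->index && !((pc->index - 1) & (1u << 30)))
		return raw * factor;
	return raw;
}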