From: kan.liang@intel.com
To: peterz@infradead.org, mingo@redhat.com, acme@kernel.org,
        linux-kernel@vger.kernel.org
Cc: tglx@linutronix.de, jolsa@redhat.com, eranian@google.com,
        ak@linux.intel.com, Kan Liang <kan.liang@linux.intel.com>
Subject: [RESEND PATCH V2 1/4] perf/x86/intel: fix event update for auto-reload
Date: Mon,  8 Jan 2018 07:15:13 -0800
Message-Id: <1515424516-143728-2-git-send-email-kan.liang@intel.com>
In-Reply-To: <1515424516-143728-1-git-send-email-kan.liang@intel.com>
References: <1515424516-143728-1-git-send-email-kan.liang@intel.com>
Sender: linux-kernel-owner@vger.kernel.org

From: Kan Liang <kan.liang@linux.intel.com>

There is bug when mmap read event->count with large PEBS enabled.
Here is an example.
 #./read_count
 0x71f0
 0x122c0
 0x1000000001c54
 0x100000001257d
 0x200000000bdc5

There is auto-reload mechanism enabled for PEBS events in fixed period
mode. But the calculation of event->count does not take the auto-reload
values into account. Anyone who read the event->count will get wrong
result, e.g x86_pmu_read. Also, the calculation of hwc->period_left is
wrong either. It impacts the accuracy of period for the first record in
PEBS multiple records.

The issue is introduced with the auto-reload mechanism enabled by
commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism
when possible")

Introduce intel_pmu_save_and_restart_reload only for auto-reload.

The formula to calculate the event->count is as below.
event->count = period left from last time +
               (reload_times - 1) * reload_val +
               latency of PMI handler

prev_count is the last observed hardware counter value. Just the same as
non-auto-reload, its absolute value is the period of the first record.
It should not update with each reload. Because it doesn't 'observe' the
hardware counter for each auto-reload.

For the second and later records, the period is exactly the reload
value. Just need to simply add (reload_times - 1) * reload_val to
event->count.

The calculation of the latency of PMI handler is a little bit different
as non-auto-reload. Because the start point is -reload_value. It needs
to be adjusted by adding reload_value.
The period_left needs to do the same adjustment.

There is nothing need to do in x86_perf_event_set_period(). Because it
is fixed period. The period_left is already adjusted.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b..cc1f373 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
 	return NULL;
 }
 
+/*
+ * Specific intel_pmu_save_and_restart() for auto-reload.
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+					     u64 reload_val,
+					     int reload_times)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	int shift = 64 - x86_pmu.cntval_bits;
+	u64 prev_raw_count, new_raw_count;
+	u64 delta;
+
+	if ((reload_times == 0) || (reload_val == 0))
+		return intel_pmu_save_and_restart(event);
+
+	/*
+	 * Careful: an NMI might modify the previous event value.
+	 *
+	 * Our tactic to handle this is to first atomically read and
+	 * exchange a new raw count - then add that new-prev delta
+	 * count to the generic event atomically:
+	 */
+again:
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+					new_raw_count) != prev_raw_count)
+		goto again;
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 *
+	 * event->count = period left from last time +
+	 *                (reload_times - 1) * reload_val +
+	 *                latency of PMI handler
+	 * The period left from last time can be got from -prev_count.
+	 * The start points of counting is always -reload_val.
+	 * So the real latency of PMI handler is reload_val + new_raw_count.
+	 */
+	delta = (reload_val << shift) + (new_raw_count << shift) -
+		(prev_raw_count << shift);
+	delta >>= shift;
+
+	local64_add(reload_val * (reload_times - 1), &event->count);
+	local64_add(delta, &event->count);
+	local64_sub(delta, &hwc->period_left);
+
+	return x86_perf_event_set_period(event);
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs,
 				   void *base, void *top,
 				   int bit, int count)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 
-	if (!intel_pmu_save_and_restart(event) &&
-	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Now, auto-reload is only enabled in fixed period mode.
+		 * The reload value is always hwc->sample_period.
+		 * May need to change it, if auto-reload is enabled in
+		 * freq mode later.
+		 */
+		intel_pmu_save_and_restart_reload(event, hwc->sample_period,
+						  count);
+	} else if (!intel_pmu_save_and_restart(event))
 		return;
 
 	while (count > 1) {
-- 
2.7.4