2016-03-10 23:33:57

by Shivappa Vikas

Subject: [PATCH V6 0/6] Intel memory b/w monitoring support

The patch series has two preparatory patches for cqm and then 4 MBM
patches. Patches are based on tip perf/core.
Thanks to Thomas and PeterZ for the feedback on V5; I have tried to
address it in this version.

Memory bandwidth monitoring (MBM) provides the OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring.
It supports both 'local bandwidth' and 'total bandwidth' monitoring for
the socket. Local bandwidth measures the amount of data sent through
the memory controller on the socket and total b/w measures the total
system bandwidth.

Extending the cache quality of service monitoring (CQM) code, we add two
more events to the perf infrastructure.

intel_cqm_llc/local_bytes - bytes sent through the local socket memory controller
intel_cqm_llc/total_bytes - total L3 external bytes sent

The tasks are associated with a Resource Monitoring ID (RMID) just as in
cqm, and the OS uses an MSR write to indicate the RMID of the task during
scheduling.
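
For illustration (this helper is not part of the series; it only sketches
what that MSR write looks like, with the PQR_ASSOC layout as I read it in
the SDM):

#define MSR_IA32_PQR_ASSOC      0x0c8f

/*
 * Low word bits 0-9 carry the RMID sampled by the CQM/MBM counters;
 * the high word carries the CLOSID used for cache allocation.
 */
static inline void sched_in_assoc_rmid(u32 rmid, u32 closid)
{
        wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}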

Changes in V6:

The following changes were made as per the feedback:

- Fixed the cleanup code for cqm and mbm and separated the cleanup for
them.
- Fixed a few changelogs.
- Removed the b/w related events and related code, as the total bytes can
just be used to measure the b/w.
- Fixed some of the init code and changed the overflow handling/counting
code to follow the perf conventions.
- Made changes to be consistent in the use of enums vs. #defines.

Changes in V5:

As per Thomas's feedback, made the below changes:
- Fixed the memory leak and notifier leak in cqm init and also made it a
separate patch.
- Changed the mbm patch to use topology_max_packages to count the max
packages rather than online packages.
- Removed the unnecessary out: label and goto in patch 0003.
- Fixed the restarting of the timer when the event list is empty.
- Fixed the incorrect usage of a mutex in timer context.

Changes in V4:

The V4 version of MBM is almost a complete rewrite of the prior
versions. This seemed the best way to address all of Thomas's earlier
comments.


[PATCH 1/6] x86/perf/intel/cqm: Fix cqm handling of grouping events
[PATCH 2/6] x86/perf/intel/cqm: Fix cqm memory leak and notifier leak
[PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
[PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
[PATCH 5/6] x86/mbm: RMID Recycling MBM changes
[PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling


2016-03-10 23:32:24

by Shivappa Vikas

Subject: [PATCH 1/6] x86/perf/intel/cqm: Fix cqm handling of grouping events into a cache_group

Currently cqm (cache quality of service monitoring) groups all events
belonging to the same PID so that they use one RMID. However, it is not
counting all of these different events; we end up with a count of zero
for all events other than the group leader. The patch addresses the
issue by keeping a flag in perf_event.hw, which holds the other cqm
related fields. The flag is updated at event creation and during grouping.
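
To make the failing scenario concrete, here is a hedged userspace sketch
(not part of the patch; the event codes are the ones used later in this
series and the PMU type has to be read from
/sys/bus/event_source/devices/intel_cqm/type at runtime). Two such events
opened for the same task land in one cqm cache group and share an RMID:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

static int open_cqm_event(int pmu_type, __u64 config, pid_t pid)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = pmu_type;   /* dynamic PMU type from sysfs */
        attr.size = sizeof(attr);
        attr.config = config;   /* e.g. 0x1 llc_occupancy, 0x2 total_bytes */
        return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

Before this fix, only the group leader of such a pair reported a count;
the other event always read as zero.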

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 13 ++++++++++---
include/linux/perf_event.h | 1 +
2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index a316ca9..e6be335 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -281,9 +281,13 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)

/*
* Events that target same task are placed into the same cache group.
+ * Mark it as a multi event group, so that we update ->count
+ * for every event rather than just the group leader later.
*/
- if (a->hw.target == b->hw.target)
+ if (a->hw.target == b->hw.target) {
+ b->hw.is_group_event = true;
return true;
+ }

/*
* Are we an inherited event?
@@ -849,6 +853,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
bool conflict = false;
u32 rmid;

+ event->hw.is_group_event = false;
list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
rmid = iter->hw.cqm_rmid;

@@ -940,7 +945,9 @@ static u64 intel_cqm_event_count(struct perf_event *event)
return __perf_event_count(event);

/*
- * Only the group leader gets to report values. This stops us
+ * Only the group leader gets to report values except in the case of
+ * multiple events in the same group; we still need to read the
+ * other events. This stops us
* reporting duplicate values to userspace, and gives us a clear
* rule for which task gets to report the values.
*
@@ -948,7 +955,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
* specific packages - we forfeit that ability when we create
* task events.
*/
- if (!cqm_group_leader(event))
+ if (!cqm_group_leader(event) && !event->hw.is_group_event)
return 0;

/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5c5a3f..a3ba886 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -121,6 +121,7 @@ struct hw_perf_event {
struct { /* intel_cqm */
int cqm_state;
u32 cqm_rmid;
+ bool is_group_event;
struct list_head cqm_events_entry;
struct list_head cqm_groups_entry;
struct list_head cqm_group_entry;
--
1.9.1

2016-03-10 23:32:32

by Shivappa Vikas

Subject: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management

From: Tony Luck <[email protected]>

Includes all the core infrastructure to measure the total_bytes and
bandwidth.

We have per socket counters for both total system wide L3 external bytes
and local socket memory-controller bytes. The OS writes the event id and
RMID to MSR_IA32_QM_EVTSEL, reads the counter from MSR_IA32_QM_CTR, and
uses the IA32_PQR_ASSOC MSR to associate the RMID with the task. The
tasks have a common RMID for cqm (cache quality of service monitoring)
and MBM, hence most of the scheduling code is reused from cqm.
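
As a quick illustration of that MSR sequence (a sketch only; the real
logic is in update_sample() below, which additionally handles the 24-bit
counter width and the prev_msr bookkeeping):

static u64 read_qm_counter(u32 rmid, u32 event_id)
{
        u64 val;

        /* low word: event id (0x02 total, 0x03 local), high word: RMID */
        wrmsr(MSR_IA32_QM_EVTSEL, event_id, rmid);
        rdmsrl(MSR_IA32_QM_CTR, val);

        /* bit 63: error, bit 62: data unavailable for this RMID */
        if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
                return 0;

        return val;     /* raw count, scaled later by cqm_l3_scale */
}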

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 130 +++++++++++++++++++++++++++--
1 file changed, 125 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 0496a56..7f1e6b3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -13,6 +13,13 @@
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

+/*
+ * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
+ * value
+ */
+#define MBM_CNTR_WIDTH 24
+#define MBM_CNTR_MAX ((1U << MBM_CNTR_WIDTH) - 1)
+
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
@@ -62,6 +69,16 @@ static struct sample *mbm_total;
*/
static struct sample *mbm_local;

+#define pkg_id topology_physical_package_id(smp_processor_id())
+/*
+ * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
+ * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
+ * rmids per socket, an example is given below
+ * RMID1 of Socket0: vrmid = 1
+ * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1
+ * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1
+ */
+#define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid)
/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
* Also protects event->hw.cqm_rmid
@@ -85,8 +102,12 @@ static cpumask_t cqm_cpumask;
#define RMID_VAL_UNAVAIL (1ULL << 62)

#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
-
-#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+/*
+ * MBM Event IDs as defined in SDM section 17.15.5
+ * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
+ */
+#define QOS_MBM_TOTAL_EVENT_ID 0x02
+#define QOS_MBM_LOCAL_EVENT_ID 0x03

/*
* This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -429,9 +450,16 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
struct rmid_read {
u32 rmid;
atomic64_t value;
+ u32 evt_type;
};

static void __intel_cqm_event_count(void *info);
+static void init_mbm_sample(u32 rmid, u32 evt_type);
+
+static bool is_mbm_event(int e)
+{
+ return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
+}

/*
* Exchange the RMID of a group of events.
@@ -873,6 +901,73 @@ static void intel_cqm_rmid_rotate(struct work_struct *work)
schedule_delayed_work(&intel_cqm_rmid_work, delay);
}

+static u64 update_sample(unsigned int rmid,
+ u32 evt_type, int first)
+{
+ struct sample *mbm_current;
+ u32 vrmid = rmid_2_index(rmid);
+ u64 val, bytes, shift;
+ u32 eventid;
+
+ if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
+ mbm_current = &mbm_local[vrmid];
+ eventid = QOS_MBM_LOCAL_EVENT_ID;
+ } else {
+ mbm_current = &mbm_total[vrmid];
+ eventid = QOS_MBM_TOTAL_EVENT_ID;
+ }
+
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ rdmsrl(MSR_IA32_QM_CTR, val);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return mbm_current->total_bytes;
+
+ if (first) {
+ mbm_current->prev_msr = val;
+ mbm_current->total_bytes = 0;
+ return mbm_current->total_bytes;
+ }
+
+ if (val < mbm_current->prev_msr) {
+ bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
+ } else {
+ shift = 64 - MBM_CNTR_WIDTH;
+ bytes = (val << shift) - (mbm_current->prev_msr << shift);
+ bytes >>= shift;
+ }
+
+ bytes *= cqm_l3_scale;
+
+ mbm_current->total_bytes += bytes;
+ mbm_current->prev_msr = val;
+
+ return mbm_current->total_bytes;
+}
+
+static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
+{
+ return update_sample(rmid, evt_type, 0);
+}
+
+static void __intel_mbm_event_init(void *info)
+{
+ struct rmid_read *rr = info;
+
+ update_sample(rr->rmid, rr->evt_type, 1);
+}
+
+static void init_mbm_sample(u32 rmid, u32 evt_type)
+{
+ struct rmid_read rr = {
+ .value = ATOMIC64_INIT(0),
+ .rmid = rmid,
+ .evt_type = evt_type,
+ };
+
+ /* on each socket, init sample */
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
+}
+
/*
* Find a group and setup RMID.
*
@@ -893,6 +988,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
return;
}

@@ -909,6 +1006,9 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
event->hw.cqm_rmid = rmid;
}

@@ -930,7 +1030,10 @@ static void intel_cqm_event_read(struct perf_event *event)
if (!__rmid_valid(rmid))
goto out;

- val = __rmid_read(rmid);
+ if (is_mbm_event(event->attr.config))
+ val = rmid_read_mbm(rmid, event->attr.config);
+ else
+ val = __rmid_read(rmid);

/*
* Ignore this reading on error states and do not update the value.
@@ -961,6 +1064,17 @@ static inline bool cqm_group_leader(struct perf_event *event)
return !list_empty(&event->hw.cqm_groups_entry);
}

+static void __intel_mbm_event_count(void *info)
+{
+ struct rmid_read *rr = info;
+ u64 val;
+
+ val = rmid_read_mbm(rr->rmid, rr->evt_type);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+ atomic64_add(val, &rr->value);
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1014,7 +1128,12 @@ static u64 intel_cqm_event_count(struct perf_event *event)
if (!__rmid_valid(rr.rmid))
goto out;

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ if (is_mbm_event(event->attr.config)) {
+ rr.evt_type = event->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ }

raw_spin_lock_irqsave(&cache_lock, flags);
if (event->hw.cqm_rmid == rr.rmid)
@@ -1129,7 +1248,8 @@ static int intel_cqm_event_init(struct perf_event *event)
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;

- if (event->attr.config & ~QOS_EVENT_MASK)
+ if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
+ (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
return -EINVAL;

/* unsupported modes and filters */
--
1.9.1

2016-03-10 23:32:30

by Shivappa Vikas

Subject: [PATCH 5/6] x86/mbm: RMID Recycling MBM changes

An RMID can be allocated or deallocated as part of RMID recycling.
When an RMID is allocated for an mbm event, the mbm counter needs to be
initialized because the next time we read the counter we need the
previous value to account for the total bytes that went to the memory
controller. Similarly, when an RMID is deallocated we need to update
the ->count variable.
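
As a rough outline (the helpers below are purely illustrative, not code
from this patch), the two cases map onto update_sample() from the
previous patch like this:

/* RMID newly allocated to an mbm event: latch prev_msr and zero
 * total_bytes so the next read starts from a clean baseline. */
static void mbm_on_rmid_alloc(u32 rmid, u32 evt_type)
{
        update_sample(rmid, evt_type, 1);
}

/* RMID about to be recycled away: take one last reading with the old
 * RMID and fold it into the perf ->count. */
static void mbm_on_rmid_free(struct perf_event *group, u32 old_rmid)
{
        local64_set(&group->count,
                    update_sample(old_rmid, group->attr.config, 0));
}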

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 32 ++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 7f1e6b3..6bca59d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -455,6 +455,7 @@ struct rmid_read {

static void __intel_cqm_event_count(void *info);
static void init_mbm_sample(u32 rmid, u32 evt_type);
+static void __intel_mbm_event_count(void *info);

static bool is_mbm_event(int e)
{
@@ -481,8 +482,14 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
.rmid = old_rmid,
};

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
- &rr, 1);
+ if (is_mbm_event(group->attr.config)) {
+ rr.evt_type = group->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count,
+ &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
+ &rr, 1);
+ }
local64_set(&group->count, atomic64_read(&rr.value));
}

@@ -494,6 +501,22 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)

raw_spin_unlock_irq(&cache_lock);

+ /*
+ * If the allocation is for mbm, init the mbm stats.
+ * Need to check if each event in the group is mbm event
+ * because there could be multiple type of events in the same group.
+ */
+ if (__rmid_valid(rmid)) {
+ event = group;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
+ list_for_each_entry(event, head, hw.cqm_group_entry) {
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+ }
+ }
+
return old_rmid;
}

@@ -988,7 +1011,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
- if (is_mbm_event(event->attr.config))
+ if (is_mbm_event(event->attr.config) &&
+ __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);
return;
}
@@ -1006,7 +1030,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

- if (is_mbm_event(event->attr.config))
+ if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);

event->hw.cqm_rmid = rmid;
--
1.9.1

2016-03-10 23:32:28

by Shivappa Vikas

Subject: [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling

This patch adds a per package timer which periodically updates the
memory bandwidth counters for the events that are currently active.
The current patch uses a periodic timer every 1s, since the SDM
guarantees that the counter will not overflow in 1s, but this period can
definitely be improved by calibrating on the system. The overflow is
really a function of the max memory b/w that the socket can support,
the max counter value and the scaling factor.
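
To make the calibration point concrete, a back-of-envelope sketch (the
helper is illustrative only; the inputs would come from measurement, not
from this patch):

/* With a 24-bit counter and an upscale factor of 'scale' bytes per
 * count, the counter wraps after (2^24 * scale) bytes, so the shortest
 * time to overflow at a peak bandwidth of 'bw' bytes/sec is roughly: */
static u64 mbm_min_wrap_time_ms(u64 scale, u64 bw)
{
        return div64_u64((1ULL << 24) * scale * 1000, bw);
}

A calibrated timer period could then be chosen comfortably below that
value instead of the fixed 1s used here.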

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 140 +++++++++++++++++++++++++++--
1 file changed, 135 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 6bca59d..a49b54e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -19,10 +19,15 @@
*/
#define MBM_CNTR_WIDTH 24
#define MBM_CNTR_MAX ((1U << MBM_CNTR_WIDTH) - 1)
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME 1000

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -50,6 +55,7 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
* @total_bytes #bytes since we began monitoring
@@ -951,6 +957,11 @@ static u64 update_sample(unsigned int rmid,
return mbm_current->total_bytes;
}

+ /*
+ * As per SDM, the h/w guarentees that counters will not
+ * overflow in 1s interval. The 1s periodic timers call
+ * update_sample to ensure the same.
+ */
if (val < mbm_current->prev_msr) {
bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
} else {
@@ -1099,6 +1110,84 @@ static void __intel_mbm_event_count(void *info)
atomic64_add(val, &rr->value);
}

+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct perf_event *iter, *iter1;
+ int ret = HRTIMER_RESTART;
+ struct list_head *head;
+ unsigned long flags;
+ u32 grp_rmid;
+
+ /*
+ * Need to cache_lock as the timer Event Select MSR reads
+ * can race with the mbm/cqm count() and mbm_init() reads.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ if (list_empty(&cache_groups)) {
+ ret = HRTIMER_NORESTART;
+ goto out;
+ }
+
+ list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+ grp_rmid = iter->hw.cqm_rmid;
+ if (!__rmid_valid(grp_rmid))
+ continue;
+ if (is_mbm_event(iter->attr.config))
+ update_sample(grp_rmid, iter->attr.config, 0);
+
+ head = &iter->hw.cqm_group_entry;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+ if (!iter1->hw.is_group_event)
+ break;
+ if (is_mbm_event(iter1->attr.config))
+ update_sample(iter1->hw.cqm_rmid,
+ iter1->attr.config, 0);
+ }
+ }
+
+ hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+ hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+ hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+ struct hrtimer *hr;
+ int i;
+
+ for (i = 0; i < mbm_socket_max; i++) {
+ hr = &mbm_timers[i];
+ hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hr->function = mbm_hrtimer_handle;
+ }
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1228,8 +1317,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
+ unsigned long flags;

mutex_lock(&cache_mutex);
+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);

/*
* If there's another event in this group...
@@ -1261,6 +1356,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
}
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ /*
+ * Stop the mbm overflow timers when the last event is destroyed.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_stop_timers();
+
mutex_unlock(&cache_mutex);
}

@@ -1268,6 +1371,7 @@ static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
bool rotate = false;
+ unsigned long flags;

if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -1293,9 +1397,21 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

+ /*
+ * Start the mbm overflow timers when the first event is created.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_start_timers();
+
/* Will also set rmid */
intel_cqm_setup_event(event, &group);

+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
if (group) {
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
@@ -1314,6 +1430,7 @@ static int intel_cqm_event_init(struct perf_event *event)
rotate = true;
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
mutex_unlock(&cache_mutex);

if (rotate)
@@ -1557,20 +1674,33 @@ static const struct x86_cpu_id intel_mbm_total_match[] = {

static int intel_mbm_init(void)
{
- int array_size, maxid = cqm_max_rmid + 1;
+ int ret = 0, array_size, maxid = cqm_max_rmid + 1;

- array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_socket_max = topology_max_packages();
+ array_size = sizeof(struct sample) * maxid * mbm_socket_max;
mbm_local = kmalloc(array_size, GFP_KERNEL);
if (!mbm_local)
return -ENOMEM;

mbm_total = kmalloc(array_size, GFP_KERNEL);
if (!mbm_total) {
- mbm_cleanup();
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out;
}

- return 0;
+ array_size = sizeof(struct hrtimer) * mbm_socket_max;
+ mbm_timers = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_timers) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ mbm_hrtimer_init();
+
+out:
+ if (ret)
+ mbm_cleanup();
+
+ return ret;
}

static int __init intel_cqm_init(void)
--
1.9.1

2016-03-10 23:33:36

by Shivappa Vikas

Subject: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

The MBM init patch enumerates Intel Memory Bandwidth Monitoring (MBM)
and initializes the perf events and data structures for monitoring
memory b/w. It is based on the original patch series by Tony Luck and
Kanaka Juvva.
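
For reference, a minimal sketch of the enumeration this init code relies
on (the probe function is illustrative; the real hunks are in
cpufeature.h and common.c below):

static void probe_cqm_mbm(void)
{
        unsigned int eax, ebx, ecx, edx;

        /*
         * CPUID leaf 0xF, sub-leaf 1: EDX bit 0 = LLC occupancy,
         * bit 1 = total MBM, bit 2 = local MBM; ECX holds the max RMID
         * and EBX the upscale factor used as cqm_l3_scale.
         */
        cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
}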

Memory bandwidth monitoring (MBM) provides the OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket and total b/w measures the total system bandwidth.

Extending the cache quality of service monitoring (CQM) code, we add two
more events to the perf infrastructure:
intel_cqm_llc/local_bytes - bytes sent through the local socket memory
controller
intel_cqm_llc/total_bytes - total L3 external bytes sent

The tasks are associated with a Resource Monitoring ID (RMID) just as in
cqm, and the OS uses an MSR write to indicate the RMID of the task
during scheduling.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 2 +
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 126 ++++++++++++++++++++++++++++-
3 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ad8c94..9b4233e 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -241,6 +241,8 @@

/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */

/* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
#define X86_FEATURE_CLZERO (13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa05680..13af76e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -635,7 +635,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
c->x86_capability[CPUID_F_1_EDX] = edx;

- if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
+ ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
+ (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
c->x86_cache_max_rmid = ecx;
c->x86_cache_occ_scale = ebx;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index fc704ed..0496a56 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -15,6 +15,7 @@

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+static bool cqm_enabled, mbm_enabled;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -42,6 +43,24 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+/**
+ * struct sample - mbm event's (local or total) data
+ * @total_bytes #bytes since we began monitoring
+ * @prev_msr previous value of MSR
+ */
+struct sample {
+ u64 total_bytes;
+ u64 prev_msr;
+};
+
+/*
+ * samples profiled for total memory bandwidth type events
+ */
+static struct sample *mbm_total;
+/*
+ * samples profiled for local memory bandwidth type events
+ */
+static struct sample *mbm_local;

/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -223,6 +242,7 @@ static void cqm_cleanup(void)

kfree(cqm_rmid_ptrs);
cqm_rmid_ptrs = NULL;
+ cqm_enabled = false;
}

static int intel_cqm_setup_rmid_cache(void)
@@ -1164,6 +1184,16 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");

+EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
+EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
+EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
+EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
+EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
+EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
+EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
+
static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
EVENT_PTR(intel_cqm_llc_pkg),
@@ -1173,9 +1203,38 @@ static struct attribute *intel_cqm_events_attr[] = {
NULL,
};

+static struct attribute *intel_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ NULL,
+};
+
+static struct attribute *intel_cmt_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_llc),
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_llc_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_llc_unit),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_llc_scale),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ EVENT_PTR(intel_cqm_llc_snapshot),
+ NULL,
+};
+
static struct attribute_group intel_cqm_events_group = {
.name = "events",
- .attrs = intel_cqm_events_attr,
+ .attrs = NULL,
};

PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1332,12 +1391,57 @@ static const struct x86_cpu_id intel_cqm_match[] = {
{}
};

+static void mbm_cleanup(void)
+{
+ if (!mbm_enabled)
+ return;
+
+ kfree(mbm_local);
+ kfree(mbm_total);
+ mbm_enabled = false;
+}
+
+static const struct x86_cpu_id intel_mbm_local_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
+ {}
+};
+
+static const struct x86_cpu_id intel_mbm_total_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
+ {}
+};
+
+static int intel_mbm_init(void)
+{
+ int array_size, maxid = cqm_max_rmid + 1;
+
+ array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_local = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_local)
+ return -ENOMEM;
+
+ mbm_total = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_total) {
+ mbm_cleanup();
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
static int __init intel_cqm_init(void)
{
char *str = NULL, scale[20];
int i, cpu, ret;

- if (!x86_match_cpu(intel_cqm_match))
+ if (x86_match_cpu(intel_cqm_match))
+ cqm_enabled = true;
+
+ if (x86_match_cpu(intel_mbm_local_match) &&
+ x86_match_cpu(intel_mbm_total_match))
+ mbm_enabled = true;
+
+ if (!cqm_enabled && !mbm_enabled)
return -ENODEV;

cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1394,13 +1498,28 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

+ if (mbm_enabled)
+ ret = intel_mbm_init();
+ if (ret && !cqm_enabled)
+ goto out;
+
+ if (cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
+ else if (!cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_mbm_events_attr;
+ else if (cqm_enabled && !mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cqm_events_attr;
+
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
goto out;
}

- pr_info("Intel CQM monitoring enabled\n");
+ if (cqm_enabled)
+ pr_info("Intel CQM monitoring enabled\n");
+ if (mbm_enabled)
+ pr_info("Intel MBM enabled\n");

/*
* Register the hot cpu notifier once we are sure cqm
@@ -1412,6 +1531,7 @@ out:
if (ret) {
kfree(str);
cqm_cleanup();
+ mbm_cleanup();
}

return ret;
--
1.9.1

2016-03-10 23:33:55

by Shivappa Vikas

Subject: [PATCH 2/6] x86/perf/intel/cqm: Fix cqm memory leak and notifier leak

Fixes the hotcpu notifier leak and other global variable memory leaks
during cqm (cache quality of service monitoring) initialization.
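
The shape of the fix, as a hedged outline (names and ordering follow the
diff below; the setup details are elided):

static int __init intel_cqm_init_outline(void)
{
        char *str = NULL;
        int ret;

        /* ... rmid cache setup, scale string allocated into str,
         * per-cpu reader setup ... */

        ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
        if (ret)
                goto out;

        /* Hook the hotplug notifier only after registration succeeds,
         * so an error path cannot leak the notifier. */
        __perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
        if (ret) {
                kfree(str);     /* scale string */
                cqm_cleanup();  /* frees cqm_rmid_ptrs and its entries */
        }
        return ret;
}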

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 43 ++++++++++++++++++++++--------
1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e6be335..fc704ed 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -211,6 +211,20 @@ static void __put_rmid(u32 rmid)
list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
}

+static void cqm_cleanup(void)
+{
+ int i;
+
+ if (!cqm_rmid_ptrs)
+ return;
+
+ for (i = 0; i < cqm_max_rmid; i++)
+ kfree(cqm_rmid_ptrs[i]);
+
+ kfree(cqm_rmid_ptrs);
+ cqm_rmid_ptrs = NULL;
+}
+
static int intel_cqm_setup_rmid_cache(void)
{
struct cqm_rmid_entry *entry;
@@ -218,7 +232,7 @@ static int intel_cqm_setup_rmid_cache(void)
int r = 0;

nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
+ cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
nr_rmids, GFP_KERNEL);
if (!cqm_rmid_ptrs)
return -ENOMEM;
@@ -249,11 +263,9 @@ static int intel_cqm_setup_rmid_cache(void)
mutex_unlock(&cache_mutex);

return 0;
-fail:
- while (r--)
- kfree(cqm_rmid_ptrs[r]);

- kfree(cqm_rmid_ptrs);
+fail:
+ cqm_cleanup();
return -ENOMEM;
}

@@ -1322,7 +1334,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {

static int __init intel_cqm_init(void)
{
- char *str, scale[20];
+ char *str = NULL, scale[20];
int i, cpu, ret;

if (!x86_match_cpu(intel_cqm_match))
@@ -1382,16 +1394,25 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

- __perf_cpu_notifier(intel_cqm_cpu_notifier);
-
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
- if (ret)
+ if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
- else
- pr_info("Intel CQM monitoring enabled\n");
+ goto out;
+ }
+
+ pr_info("Intel CQM monitoring enabled\n");

+ /*
+ * Register the hot cpu notifier once we are sure cqm
+ * is enabled to avoid notifier leak.
+ */
+ __perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
cpu_notifier_register_done();
+ if (ret) {
+ kfree(str);
+ cqm_cleanup();
+ }

return ret;
}
--
1.9.1

2016-03-11 19:26:17

by Luck, Tony

Subject: Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management

Includes all the core infrastructure to measure the total_bytes and
bandwidth.

We have per socket counters for both total system wide L3 external bytes
and local socket memory-controller bytes. The OS writes the event id and
RMID to MSR_IA32_QM_EVTSEL, reads the counter from MSR_IA32_QM_CTR, and
uses the IA32_PQR_ASSOC MSR to associate the RMID with the task. The
tasks have a common RMID for cqm (cache quality of service monitoring)
and MBM, hence most of the scheduling code is reused from cqm.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
Updated patch:
Further cleanup of EVENT_ID #defines for consistency
Vikas didn't quite understand all the cleverness of the "<< shift" trick
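
For anyone else puzzling over it, an illustration of what the "<< shift"
trick buys (the helper name is invented for the example):

/*
 * Shifting both readings up by (64 - MBM_CNTR_WIDTH) bits makes the
 * 24-bit hardware counter wrap naturally in 64-bit unsigned arithmetic,
 * so no explicit "val < prev_msr" wraparound branch is needed; the
 * result equals (val - prev) modulo 2^24.
 */
static u64 mbm_delta(u64 val, u64 prev)
{
        u64 shift = 64 - MBM_CNTR_WIDTH;
        u64 bytes = (val << shift) - (prev << shift);

        return bytes >> shift;
}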

arch/x86/kernel/cpu/perf_event_intel_cqm.c | 128 +++++++++++++++++++++++++++--
1 file changed, 122 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 0496a5697a45..dfe365edeff1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -13,6 +13,13 @@
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

+/*
+ * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
+ * value
+ */
+#define MBM_CNTR_WIDTH 24
+#define MBM_CNTR_MAX ((1U << MBM_CNTR_WIDTH) - 1)
+
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
@@ -62,6 +69,16 @@ static struct sample *mbm_total;
*/
static struct sample *mbm_local;

+#define pkg_id topology_physical_package_id(smp_processor_id())
+/*
+ * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
+ * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
+ * rmids per socket, an example is given below
+ * RMID1 of Socket0: vrmid = 1
+ * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1
+ * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1
+ */
+#define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid)
/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
* Also protects event->hw.cqm_rmid
@@ -84,9 +101,13 @@ static cpumask_t cqm_cpumask;
#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)

-#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
-
-#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+/*
+ * Event IDs are used to program IA32_QM_EVTSEL before reading event
+ * counter from IA32_QM_CTR
+ */
+#define QOS_L3_OCCUP_EVENT_ID 0x01
+#define QOS_MBM_TOTAL_EVENT_ID 0x02
+#define QOS_MBM_LOCAL_EVENT_ID 0x03

/*
* This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -429,9 +450,16 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
struct rmid_read {
u32 rmid;
atomic64_t value;
+ u32 evt_type;
};

static void __intel_cqm_event_count(void *info);
+static void init_mbm_sample(u32 rmid, u32 evt_type);
+
+static bool is_mbm_event(int e)
+{
+ return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
+}

/*
* Exchange the RMID of a group of events.
@@ -873,6 +901,69 @@ static void intel_cqm_rmid_rotate(struct work_struct *work)
schedule_delayed_work(&intel_cqm_rmid_work, delay);
}

+static u64 update_sample(unsigned int rmid,
+ u32 evt_type, int first)
+{
+ struct sample *mbm_current;
+ u32 vrmid = rmid_2_index(rmid);
+ u64 val, bytes, shift;
+ u32 eventid;
+
+ if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
+ mbm_current = &mbm_local[vrmid];
+ eventid = QOS_MBM_LOCAL_EVENT_ID;
+ } else {
+ mbm_current = &mbm_total[vrmid];
+ eventid = QOS_MBM_TOTAL_EVENT_ID;
+ }
+
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ rdmsrl(MSR_IA32_QM_CTR, val);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return mbm_current->total_bytes;
+
+ if (first) {
+ mbm_current->prev_msr = val;
+ mbm_current->total_bytes = 0;
+ return mbm_current->total_bytes;
+ }
+
+ shift = 64 - MBM_CNTR_WIDTH;
+ bytes = (val << shift) - (mbm_current->prev_msr << shift);
+ bytes >>= shift;
+
+ bytes *= cqm_l3_scale;
+
+ mbm_current->total_bytes += bytes;
+ mbm_current->prev_msr = val;
+
+ return mbm_current->total_bytes;
+}
+
+static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
+{
+ return update_sample(rmid, evt_type, 0);
+}
+
+static void __intel_mbm_event_init(void *info)
+{
+ struct rmid_read *rr = info;
+
+ update_sample(rr->rmid, rr->evt_type, 1);
+}
+
+static void init_mbm_sample(u32 rmid, u32 evt_type)
+{
+ struct rmid_read rr = {
+ .value = ATOMIC64_INIT(0),
+ .rmid = rmid,
+ .evt_type = evt_type,
+ };
+
+ /* on each socket, init sample */
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
+}
+
/*
* Find a group and setup RMID.
*
@@ -893,6 +984,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
return;
}

@@ -909,6 +1002,9 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
event->hw.cqm_rmid = rmid;
}

@@ -930,7 +1026,10 @@ static void intel_cqm_event_read(struct perf_event *event)
if (!__rmid_valid(rmid))
goto out;

- val = __rmid_read(rmid);
+ if (is_mbm_event(event->attr.config))
+ val = rmid_read_mbm(rmid, event->attr.config);
+ else
+ val = __rmid_read(rmid);

/*
* Ignore this reading on error states and do not update the value.
@@ -961,6 +1060,17 @@ static inline bool cqm_group_leader(struct perf_event *event)
return !list_empty(&event->hw.cqm_groups_entry);
}

+static void __intel_mbm_event_count(void *info)
+{
+ struct rmid_read *rr = info;
+ u64 val;
+
+ val = rmid_read_mbm(rr->rmid, rr->evt_type);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+ atomic64_add(val, &rr->value);
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1014,7 +1124,12 @@ static u64 intel_cqm_event_count(struct perf_event *event)
if (!__rmid_valid(rr.rmid))
goto out;

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ if (is_mbm_event(event->attr.config)) {
+ rr.evt_type = event->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ }

raw_spin_lock_irqsave(&cache_lock, flags);
if (event->hw.cqm_rmid == rr.rmid)
@@ -1129,7 +1244,8 @@ static int intel_cqm_event_init(struct perf_event *event)
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;

- if (event->attr.config & ~QOS_EVENT_MASK)
+ if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
+ (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
return -EINVAL;

/* unsupported modes and filters */
--
2.5.0

2016-03-11 19:26:19

by Luck, Tony

Subject: Re: [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling

From: Vikas Shivappa <[email protected]>

This patch adds a per package timer which periodically updates the
memory bandwidth counters for the events that are currently active.
The current patch uses a periodic timer every 1s, since the SDM
guarantees that the counter will not overflow in 1s, but this period can
definitely be improved by calibrating on the system. The overflow is
really a function of the max memory b/w that the socket can support,
the max counter value and the scaling factor.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
Updated patch:
Changes to part 4 made this not apply, so fix that.
Fixed spelling of "guarantees" and trimmed that comment

arch/x86/kernel/cpu/perf_event_intel_cqm.c | 139 +++++++++++++++++++++++++++--
1 file changed, 134 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index a85216f8b3ad..752d182e2d7b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -19,10 +19,15 @@
*/
#define MBM_CNTR_WIDTH 24
#define MBM_CNTR_MAX ((1U << MBM_CNTR_WIDTH) - 1)
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME 1000

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -50,6 +55,7 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
* @total_bytes #bytes since we began monitoring
@@ -951,6 +957,10 @@ static u64 update_sample(unsigned int rmid,
return mbm_current->total_bytes;
}

+ /*
+ * The h/w guarantees that counters will not overflow
+ * so long as we poll them at least once per second.
+ */
shift = 64 - MBM_CNTR_WIDTH;
bytes = (val << shift) - (mbm_current->prev_msr << shift);
bytes >>= shift;
@@ -1095,6 +1105,84 @@ static void __intel_mbm_event_count(void *info)
atomic64_add(val, &rr->value);
}

+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct perf_event *iter, *iter1;
+ int ret = HRTIMER_RESTART;
+ struct list_head *head;
+ unsigned long flags;
+ u32 grp_rmid;
+
+ /*
+ * Need to cache_lock as the timer Event Select MSR reads
+ * can race with the mbm/cqm count() and mbm_init() reads.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ if (list_empty(&cache_groups)) {
+ ret = HRTIMER_NORESTART;
+ goto out;
+ }
+
+ list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+ grp_rmid = iter->hw.cqm_rmid;
+ if (!__rmid_valid(grp_rmid))
+ continue;
+ if (is_mbm_event(iter->attr.config))
+ update_sample(grp_rmid, iter->attr.config, 0);
+
+ head = &iter->hw.cqm_group_entry;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+ if (!iter1->hw.is_group_event)
+ break;
+ if (is_mbm_event(iter1->attr.config))
+ update_sample(iter1->hw.cqm_rmid,
+ iter1->attr.config, 0);
+ }
+ }
+
+ hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+ hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+ hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+ struct hrtimer *hr;
+ int i;
+
+ for (i = 0; i < mbm_socket_max; i++) {
+ hr = &mbm_timers[i];
+ hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hr->function = mbm_hrtimer_handle;
+ }
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1224,8 +1312,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
+ unsigned long flags;

mutex_lock(&cache_mutex);
+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);

/*
* If there's another event in this group...
@@ -1257,6 +1351,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
}
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ /*
+ * Stop the mbm overflow timers when the last event is destroyed.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_stop_timers();
+
mutex_unlock(&cache_mutex);
}

@@ -1264,6 +1366,7 @@ static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
bool rotate = false;
+ unsigned long flags;

if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -1289,9 +1392,21 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

+ /*
+ * Start the mbm overflow timers when the first event is created.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_start_timers();
+
/* Will also set rmid */
intel_cqm_setup_event(event, &group);

+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
if (group) {
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
@@ -1310,6 +1425,7 @@ static int intel_cqm_event_init(struct perf_event *event)
rotate = true;
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
mutex_unlock(&cache_mutex);

if (rotate)
@@ -1553,20 +1669,33 @@ static const struct x86_cpu_id intel_mbm_total_match[] = {

static int intel_mbm_init(void)
{
- int array_size, maxid = cqm_max_rmid + 1;
+ int ret = 0, array_size, maxid = cqm_max_rmid + 1;

- array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_socket_max = topology_max_packages();
+ array_size = sizeof(struct sample) * maxid * mbm_socket_max;
mbm_local = kmalloc(array_size, GFP_KERNEL);
if (!mbm_local)
return -ENOMEM;

mbm_total = kmalloc(array_size, GFP_KERNEL);
if (!mbm_total) {
- mbm_cleanup();
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out;
}

- return 0;
+ array_size = sizeof(struct hrtimer) * mbm_socket_max;
+ mbm_timers = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_timers) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ mbm_hrtimer_init();
+
+out:
+ if (ret)
+ mbm_cleanup();
+
+ return ret;
}

static int __init intel_cqm_init(void)
--
2.5.0

2016-03-11 22:54:36

by Peter Zijlstra

Subject: Re: [PATCH V6 0/6] Intel memory b/w monitoring support

On Thu, Mar 10, 2016 at 03:32:06PM -0800, Vikas Shivappa wrote:
> The patch series has two preparatory patches for cqm and then 4 MBM
> patches. Patches are based on tip perf/core.

They were not (or at least not a recent copy of it); all the files got
moved about by someone..

But a little sed quickly fixed that.

I also fixed a bunch of little things while applying and added a little
cleanup patch at the end.

Please see if the branch below works for you:

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core

2016-03-11 23:22:12

by Shivappa Vikas

Subject: Re: [PATCH V6 0/6] Intel memory b/w monitoring support



On Fri, 11 Mar 2016, Peter Zijlstra wrote:

> On Thu, Mar 10, 2016 at 03:32:06PM -0800, Vikas Shivappa wrote:
>> The patch series has two preparatory patches for cqm and then 4 MBM
>> patches. Patches are based on tip perf/core.
>
> They were not (or at least not a recent copy of it); all the files got
> moved about by someone..

Right, I mentioned "based on perf/core" because Thomas had a patch for
topology_max_packages (which went to tip) on which these mbm patches
depend. I tested by applying the topology patch separately on rc7,
though .. the perf/core branch for some reason did not have a lot of
files, including perf_event_intel_cqm.c

>
> But a little sed quickly fixed that.
>
> I also fixed a bunch of little things while applying and added a little
> cleanup patch at the end.
>
> Please see if the branch below works for you:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>
You mean I test the mbm patches on top of this ?

Thanks,
Vikas

2016-03-11 23:25:25

by Shivappa Vikas

Subject: Re: [PATCH V6 0/6] Intel memory b/w monitoring support



>>
>> Please see if the branch below works for you:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
>>
> You mean I test the mbm patches on top of this ?

I see you applied the mbm patches here already

>
> Thanks,
> Vikas
>

2016-03-11 23:45:26

by Luck, Tony

Subject: RE: [PATCH V6 0/6] Intel memory b/w monitoring support

> Please see if the branch below works for you:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core

tragically no :-( The instant I started perf stat to trace some MBM events, I got a panic.

But I think something went awry with the base version you applied these patches to. I see
a whole lot of differences between the tree you pointed to and the version I have.

-Tony

[ 320.988548] BUG: unable to handle kernel paging request at ffff888f153d3b88
[ 320.998399] IP: [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.006622] PGD 1f88067 PUD 0
[ 321.011910] Oops: 0000 [#1] SMP
[ 321.017328] Modules linked in: af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) xfs(E) libcrc32c(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) joydev(E) dm_mod(E) ixgbe(E) kvm(E) irqbypass(E) ptp(E) crct10dif_pclmul(E) crc32_pclmul(E) iTCO_wdt(E) pps_core(E) mptctl(E) ghash_clmulni_intel(E) iTCO_vendor_support(E) mdio(E) mptbase(E) dca(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) pcspkr(E) sb_edac(E) mei_me(E) lpc_ich(E) mei(E) mfd_core(E) edac_core(E) i2c_i801(E) wmi(E) shpchp(E) ipmi_si(E) ipmi_msghandler(E) processor(E) acpi_pad(E) button(E) efivarfs(E) btrfs(E) xor(E) raid6_pq(E) sd_mod(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) mgag200(E) i2c_algo_bit(E) ahci(E) drm_kms_helper(E) syscopyarea(E) libahci(E) sysfillrect(E) ehci_pci(E) ehci_hcd(E) sysimgblt(E) fb_sys_fops(E) ttm(E) crc32c_intel(E) mpt3sas(E) usbcore(E) raid_class(E) drm(E) libata(E) usb_common(E) scsi_transport_sas(E) sg(E) scsi_mod(E) autofs4(E)
[ 321.136529] CPU: 72 PID: 0 Comm: swapper/72 Tainted: G E 4.5.0-rc6-371-g520a80bcb13b #2
[ 321.148713] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0336.V05.1603031638 03/03/2016
[ 321.162290] task: ffff881ff29994c0 ti: ffff881ff299c000 task.ti: ffff881ff299c000
[ 321.172684] RIP: 0010:[<ffffffff8100afff>] [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.183790] RSP: 0018:ffff887fff003f28 EFLAGS: 00010046
[ 321.191794] RAX: 0000000000000000 RBX: ffff888f153d3b80 RCX: 0000000000000000
[ 321.201872] RDX: 0000000000000000 RSI: ffff887fff003f2c RDI: 0000000000000c8e
[ 321.211928] RBP: ffff887fff003f40 R08: 000000000000001c R09: 0000000000004a23
[ 321.221990] R10: 00000000000000d8 R11: 0000000000000005 R12: 0000000000000000
[ 321.232047] R13: ffffffff8100b0b0 R14: ffff883ff2533d60 R15: 0000004a5791b2c7
[ 321.242113] FS: 0000000000000000(0000) GS:ffff887fff000000(0000) knlGS:0000000000000000
[ 321.253262] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 321.261778] CR2: ffff888f153d3b88 CR3: 0000000001a0a000 CR4: 00000000003406e0
[ 321.271856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 321.281937] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 321.292012] Stack:
[ 321.296348] 00000000810e0d58 ffff883ff2533d60 0000000000000000 ffff887fff003f58
[ 321.306804] ffffffff8100b0c9 ffffe8ffff8037c0 ffff887fff003f88 ffffffff810e562c
[ 321.317281] ffffe8ffff80f500 0000000000000002 0000000000000048 ffffffff81ad0a20
[ 321.327761] Call Trace:
[ 321.332639] <IRQ>
[ 321.334805] [<ffffffff8100b0c9>] __intel_mbm_event_count+0x19/0x30
[ 321.346244] [<ffffffff810e562c>] flush_smp_call_function_queue+0x4c/0x130
[ 321.356030] [<ffffffff810e6053>] generic_smp_call_function_single_interrupt+0x13/0x60
[ 321.366978] [<ffffffff8103ac07>] smp_call_function_interrupt+0x27/0x40
[ 321.376469] [<ffffffff815bd132>] call_function_interrupt+0x82/0x90
[ 321.385568] <EOI>
[ 321.387735] [<ffffffff81486fc5>] ? cpuidle_enter_state+0xd5/0x250
[ 321.399053] [<ffffffff81486fa1>] ? cpuidle_enter_state+0xb1/0x250
[ 321.408053] [<ffffffff81487177>] cpuidle_enter+0x17/0x20
[ 321.416184] [<ffffffff810aa68d>] cpu_startup_entry+0x25d/0x350
[ 321.424901] [<ffffffff8103b793>] start_secondary+0x113/0x140
[ 321.433439] Code: 04 00 66 90 bf 8e 0c 00 00 48 8d 75 ec e8 ea 35 04 00 66 90 48 ba 00 00 00 00 00 00 00 c0 48 85 d0 75 48 45 85 e4 75 2d 48 89 c2 <48> 2b 53 08 8b 0d 5b 93 d2 00 48 89 43 08 81 e2 ff ff ff 00 48
[ 321.459674] RIP [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.468328] RSP <ffff887fff003f28>
[ 321.474436] CR2: ffff888f153d3b88
[ 321.490193] ---[ end trace f73c5e7e5070d07b ]---
[ 321.490200] BUG: unable to handle kernel paging request at ffff888f153d2388
[ 321.490214] IP: [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.490216] PGD 1f88067 PUD 0
[ 321.490219] Oops: 0000 [#2] SMP
[ 321.490265] Modules linked in: af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) xfs(E) libcrc32c(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) joydev(E) dm_mod(E) ixgbe(E) kvm(E) irqbypass(E) ptp(E) crct10dif_pclmul(E) crc32_pclmul(E) iTCO_wdt(E) pps_core(E) mptctl(E) ghash_clmulni_intel(E) iTCO_vendor_support(E) mdio(E) mptbase(E) dca(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) pcspkr(E) sb_edac(E) mei_me(E) lpc_ich(E) mei(E) mfd_core(E) edac_core(E) i2c_i801(E) wmi(E) shpchp(E) ipmi_si(E) ipmi_msghandler(E) processor(E) acpi_pad(E) button(E) efivarfs(E) btrfs(E) xor(E) raid6_pq(E) sd_mod(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) mgag200(E) i2c_algo_bit(E) ahci(E) drm_kms_helper(E) syscopyarea(E) libahci(E) sysfillrect(E) ehci_pci(E) ehci_hcd(E) sysimgblt(E) fb_sys_fops(E) ttm(E) crc32c_intel(E) mpt3sas(E) usbcore(E) raid_class(E) drm(E) libata(E) usb_common(E) scsi_transport_sas(E) sg(E) scsi_mod(E) autofs4(E)
[ 321.490282] CPU: 24 PID: 0 Comm: swapper/24 Tainted: G D E 4.5.0-rc6-371-g520a80bcb13b #2
[ 321.490283] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0336.V05.1603031638 03/03/2016
[ 321.490285] task: ffff881ff280c8c0 ti: ffff881ff2810000 task.ti: ffff881ff2810000
[ 321.490290] RIP: 0010:[<ffffffff8100afff>] [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.490292] RSP: 0018:ffff883fff803f28 EFLAGS: 00010046
[ 321.490293] RAX: 0000000000000000 RBX: ffff888f153d2380 RCX: 0000000000000000
[ 321.490294] RDX: 0000000000000000 RSI: ffff883fff803f2c RDI: 0000000000000c8e
[ 321.490295] RBP: ffff883fff803f40 R08: 000000000000001c R09: 0000000000004a2e
[ 321.490296] R10: 00000000000000f4 R11: 0000000000000000 R12: 0000000000000000
[ 321.490297] R13: ffffffff8100b0b0 R14: ffff883ff2533d60 R15: 0000004a5791a454
[ 321.490299] FS: 0000000000000000(0000) GS:ffff883fff800000(0000) knlGS:0000000000000000
[ 321.490300] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 321.490301] CR2: ffff888f153d2388 CR3: 0000000001a0a000 CR4: 00000000003406e0
[ 321.490302] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 321.490303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 321.490304] Stack:
[ 321.490307] 00000000810e0d58 ffff883ff2533d60 0000000000000000 ffff883fff803f58
[ 321.490309] ffffffff8100b0c9 ffffe8c0000037c0 ffff883fff803f88 ffffffff810e562c
[ 321.490311] ffffe8c00000f500 0000000000000002 0000000000000018 ffffffff81ad0a20
[ 321.490311] Call Trace:
[ 321.490319] <IRQ>
[ 321.490319] [<ffffffff8100b0c9>] __intel_mbm_event_count+0x19/0x30
[ 321.490327] [<ffffffff810e562c>] flush_smp_call_function_queue+0x4c/0x130
[ 321.490330] [<ffffffff810e6053>] generic_smp_call_function_single_interrupt+0x13/0x60
[ 321.490337] [<ffffffff8103ac07>] smp_call_function_interrupt+0x27/0x40
[ 321.490347] [<ffffffff815bd132>] call_function_interrupt+0x82/0x90
[ 321.490356] <EOI>
[ 321.490356] [<ffffffff81486fc5>] ? cpuidle_enter_state+0xd5/0x250
[ 321.490359] [<ffffffff81486fa1>] ? cpuidle_enter_state+0xb1/0x250
[ 321.490362] [<ffffffff81487177>] cpuidle_enter+0x17/0x20
[ 321.490369] [<ffffffff810aa68d>] cpu_startup_entry+0x25d/0x350
[ 321.490372] [<ffffffff8103b793>] start_secondary+0x113/0x140
[ 321.490399] Code: 04 00 66 90 bf 8e 0c 00 00 48 8d 75 ec e8 ea 35 04 00 66 90 48 ba 00 00 00 00 00 00 00 c0 48 85 d0 75 48 45 85 e4 75 2d 48 89 c2 <48> 2b 53 08 8b 0d 5b 93 d2 00 48 89 43 08 81 e2 ff ff ff 00 48
[ 321.490402] RIP [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.490403] RSP <ffff883fff803f28>
[ 321.490404] CR2: ffff888f153d2388
[ 321.490409] ---[ end trace f73c5e7e5070d07c ]---
[ 321.490417] BUG: unable to handle kernel paging request at ffff888f153d1788
[ 321.491691] IP: [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.491695] PGD 1f88067 PUD 0
[ 321.491698] Oops: 0000 [#3] SMP
[ 321.495324] Modules linked in: af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) msr(E) xfs(E) libcrc32c(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E)
[ 321.495324] Kernel panic - not syncing: Fatal exception in interrupt
[ 321.495374] coretemp(E) joydev(E) dm_mod(E) ixgbe(E) kvm(E) irqbypass(E) ptp(E) crct10dif_pclmul(E) crc32_pclmul(E) iTCO_wdt(E) pps_core(E) mptctl(E) ghash_clmulni_intel(E) iTCO_vendor_support(E) mdio(E) mptbase(E) dca(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) pcspkr(E) sb_edac(E) mei_me(E) lpc_ich(E) mei(E) mfd_core(E) edac_core(E) i2c_i801(E) wmi(E) shpchp(E) ipmi_si(E) ipmi_msghandler(E) processor(E) acpi_pad(E) button(E) efivarfs(E) btrfs(E) xor(E) raid6_pq(E) sd_mod(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) mgag200(E) i2c_algo_bit(E) ahci(E) drm_kms_helper(E) syscopyarea(E) libahci(E) sysfillrect(E) ehci_pci(E) ehci_hcd(E) sysimgblt(E) fb_sys_fops(E) ttm(E) crc32c_intel(E) mpt3sas(E) usbcore(E) raid_class(E) drm(E) libata(E) usb_common(E) scsi_transport_sas(E) sg(E) scsi_mod(E) autofs4(E)
[ 321.495383] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 4.5.0-rc6-371-g520a80bcb13b #2
[ 321.495385] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0336.V05.1603031638 03/03/2016
[ 321.495386] task: ffffffff81a0f4c0 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[ 321.495392] RIP: 0010:[<ffffffff8100afff>] [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.495393] RSP: 0018:ffff881fff803f28 EFLAGS: 00010046
[ 321.495395] RAX: 0000000000013c38 RBX: ffff888f153d1780 RCX: 0000000000000000
[ 321.495396] RDX: 0000000000013c38 RSI: ffff881fff803f2c RDI: 0000000000000c8e
[ 321.495397] RBP: ffff881fff803f40 R08: 0000000000000018 R09: 0000000000069083
[ 321.495398] R10: 0000000000004aa6 R11: 0000000000000000 R12: 0000000000000000
[ 321.495399] R13: ffffffff8100b0b0 R14: ffff883ff2533d60 R15: 0000004a57919508
[ 321.495401] FS: 0000000000000000(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
[ 321.495402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 321.495403] CR2: ffff888f153d1788 CR3: 0000000001a0a000 CR4: 00000000003406f0
[ 321.495404] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 321.495405] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 321.495406] Stack:
[ 321.495409] 00000000810e0d58 ffff883ff2533d60 0000000000000000 ffff881fff803f58
[ 321.495411] ffffffff8100b0c9 ffffe8a0000037c0 ffff881fff803f88 ffffffff810e562c
[ 321.495413] ffffe8a00000f500 0000000000000004 0000000000000000 ffffffff81ad0a20
[ 321.495414] Call Trace:
[ 321.495421] <IRQ>
[ 321.495421] [<ffffffff8100b0c9>] __intel_mbm_event_count+0x19/0x30
[ 321.495426] [<ffffffff810e562c>] flush_smp_call_function_queue+0x4c/0x130
[ 321.495429] [<ffffffff810e6053>] generic_smp_call_function_single_interrupt+0x13/0x60
[ 321.495434] [<ffffffff8103ac07>] smp_call_function_interrupt+0x27/0x40
[ 321.495440] [<ffffffff815bd132>] call_function_interrupt+0x82/0x90
[ 321.495447] <EOI>
[ 321.495448] [<ffffffff81486fc5>] ? cpuidle_enter_state+0xd5/0x250
[ 321.495450] [<ffffffff81486fa1>] ? cpuidle_enter_state+0xb1/0x250
[ 321.495453] [<ffffffff81487177>] cpuidle_enter+0x17/0x20
[ 321.495457] [<ffffffff810aa68d>] cpu_startup_entry+0x25d/0x350
[ 321.495463] [<ffffffff815b03dc>] rest_init+0x7c/0x80
[ 321.495472] [<ffffffff81b5d0be>] start_kernel+0x486/0x493
[ 321.495475] [<ffffffff81b5ca26>] ? set_init_arg+0x55/0x55
[ 321.495479] [<ffffffff81b5c120>] ? early_idt_handler_array+0x120/0x120
[ 321.495482] [<ffffffff81b5c5ca>] x86_64_start_reservations+0x2a/0x2c
[ 321.495485] [<ffffffff81b5c709>] x86_64_start_kernel+0x13d/0x14c
[ 321.495511] Code: 04 00 66 90 bf 8e 0c 00 00 48 8d 75 ec e8 ea 35 04 00 66 90 48 ba 00 00 00 00 00 00 00 c0 48 85 d0 75 48 45 85 e4 75 2d 48 89 c2 <48> 2b 53 08 8b 0d 5b 93 d2 00 48 89 43 08 81 e2 ff ff ff 00 48
[ 321.495515] RIP [<ffffffff8100afff>] update_sample+0x8f/0xf0
[ 321.495516] RSP <ffff881fff803f28>
[ 321.495517] CR2: ffff888f153d1788
[ 321.495520] ---[ end trace f73c5e7e5070d07d ]---
[ 322.567367] Shutting down cpus with NMI
[ 322.579083] Kernel Offset: disabled
[ 322.603886] ---[ end Kernel panic - not syncing: Fatal exception in interrupt



2016-03-12 01:56:17

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH V6 0/6] Intel memory b/w monitoring support

Some tracing printk()s show that we are calling update_sample() with totally bogus arguments.

There are a few good calls, then I see rmid=-380863112 evt_type=-30689 first=0

That turns into a wild vrmid, and we fault accessing mbm_current->prev_msr

-Tony


2016-03-12 07:53:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V6 0/6] Intel memory b/w monitoring support

On Sat, Mar 12, 2016 at 01:56:13AM +0000, Luck, Tony wrote:
> Some tracing printk() show that we are calling update_sample() with totally bogus arguments.
>
> There are a few good calls, then I see rmid=-380863112 evt_type=-30689 first=0
>
> That turns into a wild vrmid, and we fault accessing mbm_current->prev_msr

It's because I'm a right idiot.. The below should sort that methinks.

Will push a new branch

--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -466,9 +466,9 @@ static bool is_mbm_event(int e)
static void cqm_mask_call(struct rmid_read *rr)
{
if (is_mbm_event(rr->evt_type))
- on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, &rr, 1);
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, rr, 1);
else
- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, rr, 1);
}

/*
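
The crash above comes down to a pointer-level type confusion: cqm_mask_call() already receives a struct rmid_read *, and passing &rr hands the IPI callback a struct rmid_read **, so __intel_mbm_event_count() ends up reinterpreting a stack slot as the rmid/evt_type fields. A minimal userspace sketch of the same mistake (the names below are illustrative, not the kernel code):

#include <stdio.h>

struct rmid_read {
	unsigned int rmid;
	unsigned int evt_type;
};

/* Callback in the style of an on_each_cpu_mask() handler: it expects a
 * pointer to struct rmid_read, delivered as an opaque void *. */
static void count_cb(void *info)
{
	struct rmid_read *rr = info;

	printf("rmid=%d evt_type=%d\n", (int)rr->rmid, (int)rr->evt_type);
}

static void mask_call(struct rmid_read *rr)
{
	count_cb(rr);	/* correct: pass the pointer straight through */
	count_cb(&rr);	/* bug: passes struct rmid_read **, so the callback
			 * reads stack garbage -- the same kind of bogus
			 * rmid/evt_type values seen in the trace above */
}

int main(void)
{
	struct rmid_read rr = { .rmid = 3, .evt_type = 2 };

	mask_call(&rr);
	return 0;
}

Dropping the extra '&', as the hunk above does, lets the callback see the real structure again.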

2016-03-12 16:14:43

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH V6 0/6] Intel memory b/w monitoring support

>> There are a few good calls, then I see rmid=-380863112 evt_type=-30689 first=0
>>
>> That turns into a wild vrmid, and we fault accessing mbm_current->prev_msr
>
> It's because I'm a right idiot.. The below should sort that methinks.

Tsk tsk ... don't insult the coder, just critique the code :-)

> Will push a new branch

Pulled the new branch ... seems to be working fine now.

Is the series good now (i.e. on a trajectory to be merged when Linus opens the 4.6 window)?
Or is there anything else that anyone would like to see cleaned up?

-Tony

Subject: [tip:perf/urgent] perf/x86/cqm: Fix CQM memory leak and notifier leak

Commit-ID: ada2f634cd50d050269b67b4e2966582387e7c27
Gitweb: http://git.kernel.org/tip/ada2f634cd50d050269b67b4e2966582387e7c27
Author: Vikas Shivappa <[email protected]>
AuthorDate: Thu, 10 Mar 2016 15:32:08 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:19 +0100

perf/x86/cqm: Fix CQM memory leak and notifier leak

Fixes the hotcpu notifier leak and other global variable memory leaks
during CQM (cache quality of service monitoring) initialization.

Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/cqm.c | 43 ++++++++++++++++++++++++++++++++-----------
1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index b0226f1..dbb058d 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -211,6 +211,20 @@ static void __put_rmid(u32 rmid)
list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
}

+static void cqm_cleanup(void)
+{
+ int i;
+
+ if (!cqm_rmid_ptrs)
+ return;
+
+ for (i = 0; i < cqm_max_rmid; i++)
+ kfree(cqm_rmid_ptrs[i]);
+
+ kfree(cqm_rmid_ptrs);
+ cqm_rmid_ptrs = NULL;
+}
+
static int intel_cqm_setup_rmid_cache(void)
{
struct cqm_rmid_entry *entry;
@@ -218,7 +232,7 @@ static int intel_cqm_setup_rmid_cache(void)
int r = 0;

nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
+ cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
nr_rmids, GFP_KERNEL);
if (!cqm_rmid_ptrs)
return -ENOMEM;
@@ -249,11 +263,9 @@ static int intel_cqm_setup_rmid_cache(void)
mutex_unlock(&cache_mutex);

return 0;
-fail:
- while (r--)
- kfree(cqm_rmid_ptrs[r]);

- kfree(cqm_rmid_ptrs);
+fail:
+ cqm_cleanup();
return -ENOMEM;
}

@@ -1312,7 +1324,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {

static int __init intel_cqm_init(void)
{
- char *str, scale[20];
+ char *str = NULL, scale[20];
int i, cpu, ret;

if (!x86_match_cpu(intel_cqm_match))
@@ -1372,16 +1384,25 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

- __perf_cpu_notifier(intel_cqm_cpu_notifier);
-
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
- if (ret)
+ if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
- else
- pr_info("Intel CQM monitoring enabled\n");
+ goto out;
+ }

+ pr_info("Intel CQM monitoring enabled\n");
+
+ /*
+ * Register the hot cpu notifier once we are sure cqm
+ * is enabled to avoid notifier leak.
+ */
+ __perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
cpu_notifier_register_done();
+ if (ret) {
+ kfree(str);
+ cqm_cleanup();
+ }

return ret;
}
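
One detail worth noting in the hunk above: switching the pointer array from kmalloc() to kzalloc() is what lets a single cqm_cleanup() serve both the partial-failure path and the error path in intel_cqm_init(), since kfree(NULL) is a no-op for slots that were never filled. A small userspace sketch of the same pattern, with calloc()/free() standing in for kzalloc()/kfree() (names are illustrative):

#include <stdlib.h>

struct entry { int id; };

static struct entry **entries;
static int nr_entries;

/* Free whatever was allocated. Safe on a partially built array because
 * the array itself is zero-initialized and free(NULL) is a no-op. */
static void cleanup(void)
{
	int i;

	if (!entries)
		return;
	for (i = 0; i < nr_entries; i++)
		free(entries[i]);
	free(entries);
	entries = NULL;
}

static int setup(int n)
{
	int i;

	nr_entries = n;
	entries = calloc(n, sizeof(*entries));	/* zeroed, like kzalloc() */
	if (!entries)
		return -1;

	for (i = 0; i < n; i++) {
		entries[i] = malloc(sizeof(struct entry));
		if (!entries[i]) {
			cleanup();	/* one teardown path for every failure */
			return -1;
		}
		entries[i]->id = i;
	}
	return 0;
}

int main(void)
{
	if (setup(16))
		return 1;
	cleanup();
	return 0;
}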

Subject: [tip:perf/urgent] perf/x86/cqm: Fix CQM handling of grouping events into a cache_group

Commit-ID: a223c1c7ab4cc64537dc4b911f760d851683768a
Gitweb: http://git.kernel.org/tip/a223c1c7ab4cc64537dc4b911f760d851683768a
Author: Vikas Shivappa <[email protected]>
AuthorDate: Thu, 10 Mar 2016 15:32:07 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:18 +0100

perf/x86/cqm: Fix CQM handling of grouping events into a cache_group

Currently CQM (cache quality of service monitoring) is grouping all
events belonging to the same PID to use one RMID. However, it is not
counting all of these different events, so we end up with a count of
zero for all events other than the group leader.

The patch addresses the issue by keeping a flag in perf_event.hw, which
holds the other CQM-related fields. The flag is updated at event
creation and during grouping.

Signed-off-by: Vikas Shivappa <[email protected]>
[peterz: Changed hw_perf_event::is_group_event to an int]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/cqm.c | 13 ++++++++++---
include/linux/perf_event.h | 1 +
2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 93cb412..b0226f1 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -281,9 +281,13 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)

/*
* Events that target same task are placed into the same cache group.
+ * Mark it as a multi event group, so that we update ->count
+ * for every event rather than just the group leader later.
*/
- if (a->hw.target == b->hw.target)
+ if (a->hw.target == b->hw.target) {
+ b->hw.is_group_event = true;
return true;
+ }

/*
* Are we an inherited event?
@@ -849,6 +853,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
bool conflict = false;
u32 rmid;

+ event->hw.is_group_event = false;
list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
rmid = iter->hw.cqm_rmid;

@@ -940,7 +945,9 @@ static u64 intel_cqm_event_count(struct perf_event *event)
return __perf_event_count(event);

/*
- * Only the group leader gets to report values. This stops us
+ * Only the group leader gets to report values except in case of
+ * multiple events in the same group, we still need to read the
+ * other events.This stops us
* reporting duplicate values to userspace, and gives us a clear
* rule for which task gets to report the values.
*
@@ -948,7 +955,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
* specific packages - we forfeit that ability when we create
* task events.
*/
- if (!cqm_group_leader(event))
+ if (!cqm_group_leader(event) && !event->hw.is_group_event)
return 0;

/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 79ec7bb..7bb315b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -121,6 +121,7 @@ struct hw_perf_event {
struct { /* intel_cqm */
int cqm_state;
u32 cqm_rmid;
+ int is_group_event;
struct list_head cqm_events_entry;
struct list_head cqm_groups_entry;
struct list_head cqm_group_entry;
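
The behavioral change boils down to one predicate in the count path: an event now reports a value if it is either the cache-group leader or flagged as a member of a multi-event group. A tiny standalone sketch of that rule (not the kernel code, just the decision):

#include <stdbool.h>
#include <stdio.h>

struct hw_event {
	bool is_group_leader;	/* stands in for cqm_group_leader(event) */
	bool is_group_event;	/* the flag added by this patch */
};

/* Before the patch only the leader reported; every other event read 0. */
static bool reports_count(const struct hw_event *e)
{
	return e->is_group_leader || e->is_group_event;
}

int main(void)
{
	struct hw_event leader = { .is_group_leader = true };
	struct hw_event member = { .is_group_event = true };	/* same target, shared RMID */
	struct hw_event other  = { 0 };				/* neither: still reads 0 */

	printf("%d %d %d\n", reports_count(&leader),
	       reports_count(&member), reports_count(&other));
	return 0;
}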

Subject: [tip:perf/urgent] perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init

Commit-ID: 33c3cc7acfd95968d74247f1a4e1b0727a07ed43
Gitweb: http://git.kernel.org/tip/33c3cc7acfd95968d74247f1a4e1b0727a07ed43
Author: Vikas Shivappa <[email protected]>
AuthorDate: Thu, 10 Mar 2016 15:32:09 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:19 +0100

perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init

The MBM init patch enumerates Intel MBM (Memory b/w monitoring) and
initializes the perf events and data structures for monitoring memory
b/w.

It's based on the original patch series by Tony Luck and Kanaka Juvva.

Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket and total b/w measures the total system bandwidth.

Extending the cache quality of service monitoring (CQM) we add two
more events to the perf infrastructure:

intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
intel_cqm_llc/total_bytes - total L3 external bytes sent

The tasks are associated with a Resource Monitoring ID (RMID) just like
in CQM, and the OS uses an MSR write to indicate the RMID of the task
during scheduling.

Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/cqm.c | 126 ++++++++++++++++++++++++++++++++++++-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/kernel/cpu/common.c | 4 +-
3 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index dbb058d..515df11 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -15,6 +15,7 @@

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+static bool cqm_enabled, mbm_enabled;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -42,6 +43,24 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+/**
+ * struct sample - mbm event's (local or total) data
+ * @total_bytes #bytes since we began monitoring
+ * @prev_msr previous value of MSR
+ */
+struct sample {
+ u64 total_bytes;
+ u64 prev_msr;
+};
+
+/*
+ * samples profiled for total memory bandwidth type events
+ */
+static struct sample *mbm_total;
+/*
+ * samples profiled for local memory bandwidth type events
+ */
+static struct sample *mbm_local;

/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -223,6 +242,7 @@ static void cqm_cleanup(void)

kfree(cqm_rmid_ptrs);
cqm_rmid_ptrs = NULL;
+ cqm_enabled = false;
}

static int intel_cqm_setup_rmid_cache(void)
@@ -1164,6 +1184,16 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");

+EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
+EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
+EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
+EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
+EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
+EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
+EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
+
static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
EVENT_PTR(intel_cqm_llc_pkg),
@@ -1173,9 +1203,38 @@ static struct attribute *intel_cqm_events_attr[] = {
NULL,
};

+static struct attribute *intel_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ NULL,
+};
+
+static struct attribute *intel_cmt_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_llc),
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_llc_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_llc_unit),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_llc_scale),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ EVENT_PTR(intel_cqm_llc_snapshot),
+ NULL,
+};
+
static struct attribute_group intel_cqm_events_group = {
.name = "events",
- .attrs = intel_cqm_events_attr,
+ .attrs = NULL,
};

PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1322,12 +1381,57 @@ static const struct x86_cpu_id intel_cqm_match[] = {
{}
};

+static void mbm_cleanup(void)
+{
+ if (!mbm_enabled)
+ return;
+
+ kfree(mbm_local);
+ kfree(mbm_total);
+ mbm_enabled = false;
+}
+
+static const struct x86_cpu_id intel_mbm_local_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
+ {}
+};
+
+static const struct x86_cpu_id intel_mbm_total_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
+ {}
+};
+
+static int intel_mbm_init(void)
+{
+ int array_size, maxid = cqm_max_rmid + 1;
+
+ array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_local = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_local)
+ return -ENOMEM;
+
+ mbm_total = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_total) {
+ mbm_cleanup();
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
static int __init intel_cqm_init(void)
{
char *str = NULL, scale[20];
int i, cpu, ret;

- if (!x86_match_cpu(intel_cqm_match))
+ if (x86_match_cpu(intel_cqm_match))
+ cqm_enabled = true;
+
+ if (x86_match_cpu(intel_mbm_local_match) &&
+ x86_match_cpu(intel_mbm_total_match))
+ mbm_enabled = true;
+
+ if (!cqm_enabled && !mbm_enabled)
return -ENODEV;

cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1384,13 +1488,28 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

+ if (mbm_enabled)
+ ret = intel_mbm_init();
+ if (ret && !cqm_enabled)
+ goto out;
+
+ if (cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
+ else if (!cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_mbm_events_attr;
+ else if (cqm_enabled && !mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cqm_events_attr;
+
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
goto out;
}

- pr_info("Intel CQM monitoring enabled\n");
+ if (cqm_enabled)
+ pr_info("Intel CQM monitoring enabled\n");
+ if (mbm_enabled)
+ pr_info("Intel MBM enabled\n");

/*
* Register the hot cpu notifier once we are sure cqm
@@ -1402,6 +1521,7 @@ out:
if (ret) {
kfree(str);
cqm_cleanup();
+ mbm_cleanup();
}

return ret;
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 074b760..746dd6a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -245,6 +245,8 @@

/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */

/* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
#define X86_FEATURE_CLZERO (13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 62590aa..e601c12 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -649,7 +649,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
c->x86_capability[CPUID_F_1_EDX] = edx;

- if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
+ ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
+ (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
c->x86_cache_max_rmid = ecx;
c->x86_cache_occ_scale = ebx;
}
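
The mbm_local/mbm_total arrays allocated here hold one struct sample per RMID per package and, as the follow-up patch's rmid_2_index() comment spells out, are indexed linearly as package * (cqm_max_rmid + 1) + rmid. A small sketch of that layout with hypothetical sizes (2 packages, 16 RMIDs each):

#include <stdio.h>
#include <stdlib.h>

struct sample {
	unsigned long long total_bytes;
	unsigned long long prev_msr;
};

/* Hypothetical topology, just for the sketch. */
#define MAX_RMID	15
#define NR_PACKAGES	2

static struct sample *mbm_local;

static unsigned int rmid_to_index(unsigned int pkg, unsigned int rmid)
{
	return pkg * (MAX_RMID + 1) + rmid;	/* same shape as rmid_2_index() */
}

int main(void)
{
	unsigned int maxid = MAX_RMID + 1;

	mbm_local = calloc(maxid * NR_PACKAGES, sizeof(struct sample));
	if (!mbm_local)
		return 1;

	/* RMID 1 gets an independent sample slot on each package. */
	printf("pkg0/rmid1 -> %u, pkg1/rmid1 -> %u\n",
	       rmid_to_index(0, 1), rmid_to_index(1, 1));

	free(mbm_local);
	return 0;
}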

Subject: [tip:perf/urgent] perf/x86/mbm: Add memory bandwidth monitoring event management

Commit-ID: 87f01cc2a2914b61ade5ec834377fa7819484173
Gitweb: http://git.kernel.org/tip/87f01cc2a2914b61ade5ec834377fa7819484173
Author: Tony Luck <[email protected]>
AuthorDate: Fri, 11 Mar 2016 11:26:11 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:20 +0100

perf/x86/mbm: Add memory bandwidth monitoring event management

Includes all the core infrastructure to measure the total_bytes and
bandwidth.

We have per-socket counters for both total system-wide L3 external
bytes and local socket memory-controller bytes. The OS writes
MSR_IA32_QM_EVTSEL to select the event and reads the counter from
MSR_IA32_QM_CTR, and uses the IA32_PQR_ASSOC_MSR to associate the RMID
with the task. The tasks have a common RMID for CQM (cache quality of
service monitoring) and MBM, hence most of the scheduling code is
reused from CQM.

Signed-off-by: Tony Luck <[email protected]>
[ Restructured rmid_read to not have an obvious hole, removed MBM_CNTR_MAX as its unused. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vikas Shivappa <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/abd7aac9a18d93b95b985b931cf258df0164746d.1457723885.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/cqm.c | 122 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 116 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 515df11..610bd8a 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -13,6 +13,8 @@
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

+#define MBM_CNTR_WIDTH 24
+
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
@@ -62,6 +64,16 @@ static struct sample *mbm_total;
*/
static struct sample *mbm_local;

+#define pkg_id topology_physical_package_id(smp_processor_id())
+/*
+ * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
+ * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
+ * rmids per socket, an example is given below
+ * RMID1 of Socket0: vrmid = 1
+ * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1
+ * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1
+ */
+#define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid)
/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
* Also protects event->hw.cqm_rmid
@@ -84,9 +96,13 @@ static cpumask_t cqm_cpumask;
#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)

-#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
-
-#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+/*
+ * Event IDs are used to program IA32_QM_EVTSEL before reading event
+ * counter from IA32_QM_CTR
+ */
+#define QOS_L3_OCCUP_EVENT_ID 0x01
+#define QOS_MBM_TOTAL_EVENT_ID 0x02
+#define QOS_MBM_LOCAL_EVENT_ID 0x03

/*
* This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -428,10 +444,17 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)

struct rmid_read {
u32 rmid;
+ u32 evt_type;
atomic64_t value;
};

static void __intel_cqm_event_count(void *info);
+static void init_mbm_sample(u32 rmid, u32 evt_type);
+
+static bool is_mbm_event(int e)
+{
+ return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
+}

/*
* Exchange the RMID of a group of events.
@@ -873,6 +896,68 @@ static void intel_cqm_rmid_rotate(struct work_struct *work)
schedule_delayed_work(&intel_cqm_rmid_work, delay);
}

+static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
+{
+ struct sample *mbm_current;
+ u32 vrmid = rmid_2_index(rmid);
+ u64 val, bytes, shift;
+ u32 eventid;
+
+ if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
+ mbm_current = &mbm_local[vrmid];
+ eventid = QOS_MBM_LOCAL_EVENT_ID;
+ } else {
+ mbm_current = &mbm_total[vrmid];
+ eventid = QOS_MBM_TOTAL_EVENT_ID;
+ }
+
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ rdmsrl(MSR_IA32_QM_CTR, val);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return mbm_current->total_bytes;
+
+ if (first) {
+ mbm_current->prev_msr = val;
+ mbm_current->total_bytes = 0;
+ return mbm_current->total_bytes;
+ }
+
+ shift = 64 - MBM_CNTR_WIDTH;
+ bytes = (val << shift) - (mbm_current->prev_msr << shift);
+ bytes >>= shift;
+
+ bytes *= cqm_l3_scale;
+
+ mbm_current->total_bytes += bytes;
+ mbm_current->prev_msr = val;
+
+ return mbm_current->total_bytes;
+}
+
+static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
+{
+ return update_sample(rmid, evt_type, 0);
+}
+
+static void __intel_mbm_event_init(void *info)
+{
+ struct rmid_read *rr = info;
+
+ update_sample(rr->rmid, rr->evt_type, 1);
+}
+
+static void init_mbm_sample(u32 rmid, u32 evt_type)
+{
+ struct rmid_read rr = {
+ .rmid = rmid,
+ .evt_type = evt_type,
+ .value = ATOMIC64_INIT(0),
+ };
+
+ /* on each socket, init sample */
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
+}
+
/*
* Find a group and setup RMID.
*
@@ -893,6 +978,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
return;
}

@@ -909,6 +996,9 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
event->hw.cqm_rmid = rmid;
}

@@ -930,7 +1020,10 @@ static void intel_cqm_event_read(struct perf_event *event)
if (!__rmid_valid(rmid))
goto out;

- val = __rmid_read(rmid);
+ if (is_mbm_event(event->attr.config))
+ val = rmid_read_mbm(rmid, event->attr.config);
+ else
+ val = __rmid_read(rmid);

/*
* Ignore this reading on error states and do not update the value.
@@ -961,6 +1054,17 @@ static inline bool cqm_group_leader(struct perf_event *event)
return !list_empty(&event->hw.cqm_groups_entry);
}

+static void __intel_mbm_event_count(void *info)
+{
+ struct rmid_read *rr = info;
+ u64 val;
+
+ val = rmid_read_mbm(rr->rmid, rr->evt_type);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+ atomic64_add(val, &rr->value);
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1014,7 +1118,12 @@ static u64 intel_cqm_event_count(struct perf_event *event)
if (!__rmid_valid(rr.rmid))
goto out;

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ if (is_mbm_event(event->attr.config)) {
+ rr.evt_type = event->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ }

raw_spin_lock_irqsave(&cache_lock, flags);
if (event->hw.cqm_rmid == rr.rmid)
@@ -1129,7 +1238,8 @@ static int intel_cqm_event_init(struct perf_event *event)
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;

- if (event->attr.config & ~QOS_EVENT_MASK)
+ if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
+ (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
return -EINVAL;

/* unsupported modes and filters */
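
The easiest part of update_sample() to misread is the delta computation: MSR_IA32_QM_CTR implements only MBM_CNTR_WIDTH (24) bits, so both readings are shifted up to the top of a 64-bit word before subtracting, which makes a single counter wrap fall out naturally. A standalone sketch of just that arithmetic:

#include <stdio.h>
#include <stdint.h>

#define MBM_CNTR_WIDTH	24

/* Delta between two raw 24-bit counter reads, tolerant of one wrap. */
static uint64_t mbm_delta(uint64_t prev, uint64_t cur)
{
	uint64_t shift = 64 - MBM_CNTR_WIDTH;
	uint64_t bytes = (cur << shift) - (prev << shift);

	return bytes >> shift;	/* back down to a 24-bit-sized result */
}

int main(void)
{
	/* No wrap: a plain difference (prints 128). */
	printf("%llu\n", (unsigned long long)mbm_delta(0x000100, 0x000180));
	/* One wrap: prev near the 24-bit limit, cur just past zero (prints 32). */
	printf("%llu\n", (unsigned long long)mbm_delta(0xfffff0, 0x000010));
	return 0;
}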

Subject: [tip:perf/urgent] perf/x86/mbm: Implement RMID recycling

Commit-ID: 2d4de8376ff1d94a5070cfa9092c59bfdc4e693e
Gitweb: http://git.kernel.org/tip/2d4de8376ff1d94a5070cfa9092c59bfdc4e693e
Author: Vikas Shivappa <[email protected]>
AuthorDate: Thu, 10 Mar 2016 15:32:11 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:20 +0100

perf/x86/mbm: Implement RMID recycling

An RMID can be allocated or deallocated as part of RMID recycling.

When an RMID is allocated for an MBM event, the MBM counter needs to be
initialized, because the next time we read the counter we need the
previous value to account for the total bytes that went to the memory
controller.

Similarly, when an RMID is deallocated we need to update the ->count
variable.

Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/cqm.c | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 610bd8a..a98f472 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -450,6 +450,7 @@ struct rmid_read {

static void __intel_cqm_event_count(void *info);
static void init_mbm_sample(u32 rmid, u32 evt_type);
+static void __intel_mbm_event_count(void *info);

static bool is_mbm_event(int e)
{
@@ -476,8 +477,14 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
.rmid = old_rmid,
};

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
- &rr, 1);
+ if (is_mbm_event(group->attr.config)) {
+ rr.evt_type = group->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count,
+ &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
+ &rr, 1);
+ }
local64_set(&group->count, atomic64_read(&rr.value));
}

@@ -489,6 +496,22 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)

raw_spin_unlock_irq(&cache_lock);

+ /*
+ * If the allocation is for mbm, init the mbm stats.
+ * Need to check if each event in the group is mbm event
+ * because there could be multiple type of events in the same group.
+ */
+ if (__rmid_valid(rmid)) {
+ event = group;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
+ list_for_each_entry(event, head, hw.cqm_group_entry) {
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+ }
+ }
+
return old_rmid;
}

@@ -978,7 +1001,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
- if (is_mbm_event(event->attr.config))
+ if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);
return;
}
@@ -996,7 +1019,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

- if (is_mbm_event(event->attr.config))
+ if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);

event->hw.cqm_rmid = rmid;
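
The reason init_mbm_sample() must run whenever a recycled RMID is handed out is that the MBM count is a delta against prev_msr: a stale baseline left by the RMID's previous owner would be charged to the new event. A small standalone sketch of the difference (illustrative only, without the 24-bit wrap handling shown earlier):

#include <stdio.h>
#include <stdint.h>

struct sample {
	uint64_t total_bytes;
	uint64_t prev_msr;
};

/* Accumulate the delta since the last read, like update_sample(..., first=0). */
static uint64_t read_sample(struct sample *s, uint64_t msr_val)
{
	s->total_bytes += msr_val - s->prev_msr;
	s->prev_msr = msr_val;
	return s->total_bytes;
}

/* Reset the baseline, like update_sample(..., first=1) on (re)allocation. */
static void init_sample(struct sample *s, uint64_t msr_val)
{
	s->prev_msr = msr_val;
	s->total_bytes = 0;
}

int main(void)
{
	struct sample s = { .total_bytes = 0, .prev_msr = 1000 };	/* old owner's state */

	/* Without re-init the new event inherits 4000 bogus bytes... */
	printf("stale baseline: %llu\n", (unsigned long long)read_sample(&s, 5000));

	/* ...with re-init it starts cleanly from the current counter value. */
	init_sample(&s, 5000);
	printf("fresh baseline: %llu\n", (unsigned long long)read_sample(&s, 5200));
	return 0;
}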

Subject: [tip:perf/urgent] perf/x86/mbm: Add support for MBM counter overflow handling

Commit-ID: e7ee3e8cb550ce43752ae1d1b190d6b5c4150a43
Gitweb: http://git.kernel.org/tip/e7ee3e8cb550ce43752ae1d1b190d6b5c4150a43
Author: Vikas Shivappa <[email protected]>
AuthorDate: Fri, 11 Mar 2016 11:26:17 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:21 +0100

perf/x86/mbm: Add support for MBM counter overflow handling

This patch adds a per-package timer which periodically updates the
memory bandwidth counters for the events that are currently active.

The current patch uses a periodic timer with a 1s period, since the SDM
guarantees that the counter will not overflow within 1s, but this
interval can certainly be improved by calibrating on the system. The
time to overflow is really a function of the max memory b/w that the
socket can support, the max counter value and the scaling factor.

Signed-off-by: Vikas Shivappa <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/013b756c5006b1c4ca411f3ecf43ed52f19fbf87.1457723885.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/intel/cqm.c | 139 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 134 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index a98f472..380d62d 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -14,10 +14,15 @@
#define MSR_IA32_QM_EVTSEL 0x0c8d

#define MBM_CNTR_WIDTH 24
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME 1000

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -45,6 +50,7 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
* @total_bytes #bytes since we began monitoring
@@ -945,6 +951,10 @@ static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
return mbm_current->total_bytes;
}

+ /*
+ * The h/w guarantees that counters will not overflow
+ * so long as we poll them at least once per second.
+ */
shift = 64 - MBM_CNTR_WIDTH;
bytes = (val << shift) - (mbm_current->prev_msr << shift);
bytes >>= shift;
@@ -1088,6 +1098,84 @@ static void __intel_mbm_event_count(void *info)
atomic64_add(val, &rr->value);
}

+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct perf_event *iter, *iter1;
+ int ret = HRTIMER_RESTART;
+ struct list_head *head;
+ unsigned long flags;
+ u32 grp_rmid;
+
+ /*
+ * Need to cache_lock as the timer Event Select MSR reads
+ * can race with the mbm/cqm count() and mbm_init() reads.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ if (list_empty(&cache_groups)) {
+ ret = HRTIMER_NORESTART;
+ goto out;
+ }
+
+ list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+ grp_rmid = iter->hw.cqm_rmid;
+ if (!__rmid_valid(grp_rmid))
+ continue;
+ if (is_mbm_event(iter->attr.config))
+ update_sample(grp_rmid, iter->attr.config, 0);
+
+ head = &iter->hw.cqm_group_entry;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+ if (!iter1->hw.is_group_event)
+ break;
+ if (is_mbm_event(iter1->attr.config))
+ update_sample(iter1->hw.cqm_rmid,
+ iter1->attr.config, 0);
+ }
+ }
+
+ hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+ hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+ hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+ struct hrtimer *hr;
+ int i;
+
+ for (i = 0; i < mbm_socket_max; i++) {
+ hr = &mbm_timers[i];
+ hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hr->function = mbm_hrtimer_handle;
+ }
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1217,8 +1305,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
+ unsigned long flags;

mutex_lock(&cache_mutex);
+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);

/*
* If there's another event in this group...
@@ -1250,6 +1344,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
}
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ /*
+ * Stop the mbm overflow timers when the last event is destroyed.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_stop_timers();
+
mutex_unlock(&cache_mutex);
}

@@ -1257,6 +1359,7 @@ static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
bool rotate = false;
+ unsigned long flags;

if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -1282,9 +1385,21 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

+ /*
+ * Start the mbm overflow timers when the first event is created.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_start_timers();
+
/* Will also set rmid */
intel_cqm_setup_event(event, &group);

+ /*
+ * Hold the cache_lock as mbm timer handlers be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
if (group) {
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
@@ -1303,6 +1418,7 @@ static int intel_cqm_event_init(struct perf_event *event)
rotate = true;
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
mutex_unlock(&cache_mutex);

if (rotate)
@@ -1536,20 +1652,33 @@ static const struct x86_cpu_id intel_mbm_total_match[] = {

static int intel_mbm_init(void)
{
- int array_size, maxid = cqm_max_rmid + 1;
+ int ret = 0, array_size, maxid = cqm_max_rmid + 1;

- array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_socket_max = topology_max_packages();
+ array_size = sizeof(struct sample) * maxid * mbm_socket_max;
mbm_local = kmalloc(array_size, GFP_KERNEL);
if (!mbm_local)
return -ENOMEM;

mbm_total = kmalloc(array_size, GFP_KERNEL);
if (!mbm_total) {
- mbm_cleanup();
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out;
}

- return 0;
+ array_size = sizeof(struct hrtimer) * mbm_socket_max;
+ mbm_timers = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_timers) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ mbm_hrtimer_init();
+
+out:
+ if (ret)
+ mbm_cleanup();
+
+ return ret;
}

static int __init intel_cqm_init(void)
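
To make the "function of the max memory b/w, max counter value and scaling factor" statement concrete, here is a back-of-the-envelope sketch; the bandwidth and scale values below are purely illustrative assumptions, not numbers taken from the SDM or from this hardware:

#include <stdio.h>
#include <stdint.h>

#define MBM_CNTR_WIDTH	24

int main(void)
{
	/* Assumed, illustrative values -- replace with calibrated ones. */
	uint64_t scale_bytes = 64 * 1024;	/* bytes per counter unit (assumption) */
	double max_bw = 80e9;			/* 80 GB/s socket bandwidth (assumption) */

	uint64_t max_count = (1ULL << MBM_CNTR_WIDTH) - 1;
	double seconds_to_wrap = (double)max_count * (double)scale_bytes / max_bw;

	/* The 1s hrtimer just has to fire comfortably inside this window. */
	printf("worst-case wrap in ~%.2f s\n", seconds_to_wrap);
	return 0;
}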

2016-03-21 14:57:47

by Matt Fleming

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/cqm: Fix CQM handling of grouping events into a cache_group

On Mon, 21 Mar, at 02:51:29AM, tip-bot for Vikas Shivappa wrote:
> Commit-ID: a223c1c7ab4cc64537dc4b911f760d851683768a
> Gitweb: http://git.kernel.org/tip/a223c1c7ab4cc64537dc4b911f760d851683768a
> Author: Vikas Shivappa <[email protected]>
> AuthorDate: Thu, 10 Mar 2016 15:32:07 -0800
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Mon, 21 Mar 2016 09:08:18 +0100
>
> perf/x86/cqm: Fix CQM handling of grouping events into a cache_group
>
> Currently CQM (cache quality of service monitoring) is grouping all
> events belonging to same PID to use one RMID. However its not counting
> all of these different events. Hence we end up with a count of zero
> for all events other than the group leader.

The reason that was done originally was because reporting for all events
in a group led to duplicate values, since you'd be emitting the same
RMID value multiple times.

Is this no longer a problem?

2016-03-21 15:09:12

by Matt Fleming

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/mbm: Implement RMID recycling

On Mon, 21 Mar, at 02:53:04AM, tip-bot for Vikas Shivappa wrote:
> @@ -489,6 +496,22 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
>
> raw_spin_unlock_irq(&cache_lock);
>
> + /*
> + * If the allocation is for mbm, init the mbm stats.
> + * Need to check if each event in the group is mbm event
> + * because there could be multiple type of events in the same group.
> + */
> + if (__rmid_valid(rmid)) {
> + event = group;
> + if (is_mbm_event(event->attr.config))
> + init_mbm_sample(rmid, event->attr.config);
> +
> + list_for_each_entry(event, head, hw.cqm_group_entry) {
> + if (is_mbm_event(event->attr.config))
> + init_mbm_sample(rmid, event->attr.config);
> + }
> + }
> +
> return old_rmid;
> }
>

You're calling init_mbm_sample() without holding cache_lock. Won't
this potentially trash the existing value in MSR_IA32_QM_EVTSEL if,
say, we're reading the counter at the same time as the recycling
worker is running?

2016-03-21 18:14:33

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/cqm: Fix CQM handling of grouping events into a cache_group



On Mon, 21 Mar 2016, Matt Fleming wrote:

> On Mon, 21 Mar, at 02:51:29AM, tip-bot for Vikas Shivappa wrote:
>> Commit-ID: a223c1c7ab4cc64537dc4b911f760d851683768a
>> Gitweb: http://git.kernel.org/tip/a223c1c7ab4cc64537dc4b911f760d851683768a
>> Author: Vikas Shivappa <[email protected]>
>> AuthorDate: Thu, 10 Mar 2016 15:32:07 -0800
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Mon, 21 Mar 2016 09:08:18 +0100
>>
>> perf/x86/cqm: Fix CQM handling of grouping events into a cache_group
>>
>> Currently CQM (cache quality of service monitoring) is grouping all
>> events belonging to same PID to use one RMID. However its not counting
>> all of these different events. Hence we end up with a count of zero
>> for all events other than the group leader.
>
> The reason that was done originally was because reporting for all events
> in a group led to duplicate values, since you'd be emitting the same
> RMID value multiple times.
>
> Is this no longer a problem?

Before MBM, the below condition was never hit because we had only one event?

- if (a->hw.target == b->hw.target)
+ if (a->hw.target == b->hw.target) {
+ b->hw.is_group_event = true;

We are trying to address this for cases where different MBM (local or total)
and CQM events are grouped into one RMID.

Which is the case that led to duplicate values?

Thanks,
Vikas

>

2016-03-21 19:49:29

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/mbm: Implement RMID recycling



On Mon, 21 Mar 2016, Matt Fleming wrote:

> On Mon, 21 Mar, at 02:53:04AM, tip-bot for Vikas Shivappa wrote:
>> @@ -489,6 +496,22 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
>>
>> raw_spin_unlock_irq(&cache_lock);
>>
>> + /*
>> + * If the allocation is for mbm, init the mbm stats.
>> + * Need to check if each event in the group is mbm event
>> + * because there could be multiple type of events in the same group.
>> + */
>> + if (__rmid_valid(rmid)) {
>> + event = group;
>> + if (is_mbm_event(event->attr.config))
>> + init_mbm_sample(rmid, event->attr.config);
>> +
>> + list_for_each_entry(event, head, hw.cqm_group_entry) {
>> + if (is_mbm_event(event->attr.config))
>> + init_mbm_sample(rmid, event->attr.config);
>> + }
>> + }
>> +
>> return old_rmid;
>> }
>>
>
> You're calling init_mbm_sample() without holding cache_lock. Won't
> this potentially trash the existing value in MSR_IA32_QM_EVTSEL, if
> say, we're reading the counter at the same time as the recycling
> worker is running?

init_mbm_sample() calls update_sample() to read the MSR in an IPI. Since the
count is also done in an IPI, they should not trash each other?

Basically all the MSR reads/writes are done at high irql, except for the mbm
overflow timer and the read calls, which hold an irqsave spinlock.

Thanks,
Vikas


>

2016-03-23 20:14:43

by Matt Fleming

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/cqm: Fix CQM handling of grouping events into a cache_group

On Mon, 21 Mar, at 11:14:37AM, Vikas Shivappa wrote:
>
>
> Before MBM , the below condition was never hit because we had only one event ?
>
> - if (a->hw.target == b->hw.target)
> + if (a->hw.target == b->hw.target) {
> + b->hw.is_group_event = true;
>
> We are trying to address this for cases where different MBM(local or total)
> and cqm events are grouped into one RMID.

I can't test these changes, so I'm only working from memory, but I
seem to recall that this condition is hit if monitoring simultaneously
from two invocations of perf. It's also possible to have pid/tid
groups overlapping, and that needs to be handled.

> Which is the case which led to duplicate values ?

Good question. Try monitoring a multithreaded process with these changes
and see if you get duplicate values reported.

2016-03-23 20:59:12

by Matt Fleming

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/mbm: Implement RMID recycling

On Mon, 21 Mar, at 11:27:55AM, Vikas Shivappa wrote:
>
> The init_mbm_sample calls the update_sample to read the MSR in IPI .. Since
> the count is also in IPI , they should not trash each other ?
>
> Basically all the MSR read/writes are in high irql , except for the mbm
> overflow timer and read calls which holds an irqsave spinlock.

Good point! This should be fine.

2016-03-23 22:49:52

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [tip:perf/urgent] perf/x86/cqm: Fix CQM handling of grouping events into a cache_group



On Wed, 23 Mar 2016, Matt Fleming wrote:

> On Mon, 21 Mar, at 11:14:37AM, Vikas Shivappa wrote:
>>
>>
>> Before MBM , the below condition was never hit because we had only one event ?
>>
>> - if (a->hw.target == b->hw.target)
>> + if (a->hw.target == b->hw.target) {
>> + b->hw.is_group_event = true;
>>
>> We are trying to address this for cases where different MBM(local or total)
>> and cqm events are grouped into one RMID.
>
> I can't test these changes, so I'm only working from memory, but I
> seem to recall that this condition is hit if monitoring simultaneously
> from two invocations of perf. It's also possible to have pid/tid
> groups overlapping, and that needs to be handled.

Each task in a multithreaded process has an event, so it gets a different RMID.
If two perf instances invoke the same pid then both instances expect to see the
counters, so the count is reported to both of them.

>
>> Which is the case which led to duplicate values ?
>
> Good question. Try monitoring a multithread process with these changes
> and see if you get duplicate values reported.

perf starts an event for each thread even when you give -p <process id> (a
process which has multiple threads). So it sends the pid of each thread to
monitor and they all get separate RMIDs. This should apply to overlapping
groups as well, since this is dealing only with the perf task events.

>