2016-03-01 23:47:59

by Shivappa Vikas

Subject: [PATCH V5 0/6] Intel memory b/w monitoring support

The patch series has two preparatory patches for cqm and then four MBM
patches. Patches are based on tip perf/core.
Thanks to Thomas for the feedback on V4; I have tried to address it in
this version.

Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring.
It supports both 'local bandwidth' and 'total bandwidth' monitoring for
the socket. Local bandwidth measures the amount of data sent through
the memory controller on the socket and total b/w measures the total
system bandwidth.
Extending the cache quality of service monitoring (CQM), we add four more
events to the perf infrastructure:

intel_cqm_llc/local_bytes - bytes sent through the local socket memory controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
intel_cqm_llc/local_bw - current local b/w
intel_cqm_llc/total_bw - current total b/w

The tasks are associated with a Resource Monitoring ID (RMID) just like in
cqm, and the OS uses an MSR write to indicate the RMID of the task during
scheduling.
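
For illustration only (not part of the series): a minimal user-space sketch of
reading one of these events through perf_event_open(), assuming the PMU shows
up under /sys/bus/event_source/devices/intel_cqm/ (the name used in the
perf_pmu_register() call) and using the event=0x02 encoding that patch 0003
assigns to total_bytes.

/* Sketch: read the raw total_bytes count for a task.  The raw value is in
 * bytes; perf tooling applies the advertised 1e-6 scale to report MB. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();
	struct perf_event_attr attr;
	unsigned int type;
	uint64_t count;
	FILE *f;
	int fd;

	/* dynamic PMU type assigned when the driver registers "intel_cqm" */
	f = fopen("/sys/bus/event_source/devices/intel_cqm/type", "r");
	if (!f || fscanf(f, "%u", &type) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = 0x02;	/* total_bytes (event=0x02 in patch 0003) */

	fd = perf_event_open(&attr, pid, -1, -1, 0);
	if (fd < 0)
		return 1;

	sleep(1);
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("total L3 external bytes: %llu\n",
		       (unsigned long long)count);
	close(fd);
	return 0;
}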

Changes in V5:
As per Thomas's feedback, made the below changes:
- Fixed the memory leak and notifier leak in cqm init, and also made it a
separate patch
- Changed the mbm patch to use topology_max_packages() to count the max
packages rather than online packages.
- Removed the unnecessary out: label and goto in patch 0003.
- Fixed the restarting of the timer when the event list is empty.
- Also fixed the incorrect usage of a mutex in timer context.

Changes in v4:
The V4 version of MBM is almost a complete rewrite of the prior
versions, except for some declarations.
It seemed the best way to address all of Thomas's earlier
comments.

[PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group
[PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak
[PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init
[PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management
[PATCH 5/6] x86/mbm: RMID Recycling MBM changes
[PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling


2016-03-01 23:48:10

by Shivappa Vikas

Subject: [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group

Currently cqm (cache quality of service monitoring) groups all events
belonging to the same PID so that they use one RMID. However, it is not
counting all of these different events; hence we end up with a count of
zero for all events other than the group leader. The patch addresses the
issue by keeping a flag in perf_event.hw, which holds the other cqm
related fields. The flag is updated at event creation and during grouping.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 13 ++++++++++---
include/linux/perf_event.h | 1 +
2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index a316ca9..e6be335 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -281,9 +281,13 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)

/*
* Events that target same task are placed into the same cache group.
+ * Mark it as a multi event group, so that we update ->count
+ * for every event rather than just the group leader later.
*/
- if (a->hw.target == b->hw.target)
+ if (a->hw.target == b->hw.target) {
+ b->hw.is_group_event = true;
return true;
+ }

/*
* Are we an inherited event?
@@ -849,6 +853,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
bool conflict = false;
u32 rmid;

+ event->hw.is_group_event = false;
list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
rmid = iter->hw.cqm_rmid;

@@ -940,7 +945,9 @@ static u64 intel_cqm_event_count(struct perf_event *event)
return __perf_event_count(event);

/*
- * Only the group leader gets to report values. This stops us
+ * Only the group leader gets to report values, except in the case of
+ * multiple events in the same group, where we still need to read the
+ * other events. This stops us
* reporting duplicate values to userspace, and gives us a clear
* rule for which task gets to report the values.
*
@@ -948,7 +955,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
* specific packages - we forfeit that ability when we create
* task events.
*/
- if (!cqm_group_leader(event))
+ if (!cqm_group_leader(event) && !event->hw.is_group_event)
return 0;

/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5c5a3f..a3ba886 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -121,6 +121,7 @@ struct hw_perf_event {
struct { /* intel_cqm */
int cqm_state;
u32 cqm_rmid;
+ bool is_group_event;
struct list_head cqm_events_entry;
struct list_head cqm_groups_entry;
struct list_head cqm_group_entry;
--
1.9.1

2016-03-01 23:48:04

by Shivappa Vikas

Subject: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

The MBM init patch enumerates Intel Memory bandwidth monitoring (MBM) and
initializes the perf events and data structures for monitoring memory
b/w. It's based on the original patch series by Tony Luck and Kanaka Juvva.

Memory bandwidth monitoring(MBM) provides OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket and total b/w measures the total system bandwidth.

Extending the cache quality of service monitoring(CQM) we add four more
events to the perf infrastructure:
intel_cqm_llc/local_bytes - bytes sent through local socket memory
controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
intel_cqm_llc/local_bw - Current local b/w
intel_cqm_llc/total_bw - current total b/w

The tasks are associated with a Resource Monitoring ID (RMID) just like in
cqm, and the OS uses an MSR write to indicate the RMID of the task during
scheduling.
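
For reference, a user-space sketch of the CPUID enumeration this patch keys
off (a sketch only; it skips the top-level QoS monitoring feature check the
kernel also performs). CPUID leaf 0xF, sub-leaf 1 reports the monitoring
capabilities in EDX, the max RMID in ECX and the counter upscaling factor in
EBX, matching the feature bits and the get_cpu_cap() change in the diff below.

/* Sketch: CPUID.(EAX=0xF, ECX=1) enumeration of MBM, mirroring the new
 * feature bits (EDX bit 0 = LLC occupancy, bit 1 = total MBM, bit 2 =
 * local MBM).  ECX holds the max RMID, EBX the upscaling factor the
 * driver stores as cqm_l3_scale. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_max(0, NULL) < 0xf)
		return 1;

	__cpuid_count(0xf, 1, eax, ebx, ecx, edx);

	printf("llc_occupancy: %s\n", (edx & (1 << 0)) ? "yes" : "no");
	printf("mbm_total    : %s\n", (edx & (1 << 1)) ? "yes" : "no");
	printf("mbm_local    : %s\n", (edx & (1 << 2)) ? "yes" : "no");
	printf("max RMID     : %u\n", ecx);
	printf("upscale      : %u\n", ebx);
	return 0;
}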

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 2 +
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 157 ++++++++++++++++++++++++++++-
3 files changed, 158 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ad8c94..9b4233e 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -241,6 +241,8 @@

/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */

/* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
#define X86_FEATURE_CLZERO (13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa05680..13af76e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -635,7 +635,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
c->x86_capability[CPUID_F_1_EDX] = edx;

- if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
+ ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
+ (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
c->x86_cache_max_rmid = ecx;
c->x86_cache_occ_scale = ebx;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 5666171..cf08a0f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -15,6 +15,7 @@

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+static bool cqm_enabled, mbm_enabled;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -42,6 +43,30 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+/**
+ * struct sample - mbm event's (local or total) data
+ * @interval_start Time this interval began
+ * @interval_bytes #bytes in this interval
+ * @total_bytes #bytes since we began monitoring
+ * @prev_msr previous value of MSR
+ * @bandwidth bytes/sec in previous completed interval
+ */
+struct sample {
+ ktime_t interval_start;
+ u64 interval_bytes;
+ u64 total_bytes;
+ u64 prev_msr;
+ u64 bandwidth;
+};
+
+/*
+ * samples profiled for total memory bandwidth type events
+ */
+static struct sample *mbm_total;
+/*
+ * samples profiled for local memory bandwidth type events
+ */
+static struct sample *mbm_local;

/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -1152,6 +1177,28 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");

+EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
+EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
+EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
+EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
+EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
+EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
+EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(total_bw, intel_cqm_total_bw, "event=0x04");
+EVENT_ATTR_STR(total_bw.per-pkg, intel_cqm_total_bw_pkg, "1");
+EVENT_ATTR_STR(total_bw.unit, intel_cqm_total_bw_unit, "MB/sec");
+EVENT_ATTR_STR(total_bw.scale, intel_cqm_total_bw_scale, "1e-6");
+EVENT_ATTR_STR(total_bw.snapshot, intel_cqm_total_bw_snapshot, "1");
+
+EVENT_ATTR_STR(local_bw, intel_cqm_local_bw, "event=0x05");
+EVENT_ATTR_STR(local_bw.per-pkg, intel_cqm_local_bw_pkg, "1");
+EVENT_ATTR_STR(local_bw.unit, intel_cqm_local_bw_unit, "MB/sec");
+EVENT_ATTR_STR(local_bw.scale, intel_cqm_local_bw_scale, "1e-6");
+EVENT_ATTR_STR(local_bw.snapshot, intel_cqm_local_bw_snapshot, "1");
+
static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
EVENT_PTR(intel_cqm_llc_pkg),
@@ -1161,9 +1208,58 @@ static struct attribute *intel_cqm_events_attr[] = {
NULL,
};

+static struct attribute *intel_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_total_bw),
+ EVENT_PTR(intel_cqm_local_bw),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_total_bw_pkg),
+ EVENT_PTR(intel_cqm_local_bw_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_total_bw_unit),
+ EVENT_PTR(intel_cqm_local_bw_unit),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ EVENT_PTR(intel_cqm_total_bw_scale),
+ EVENT_PTR(intel_cqm_local_bw_scale),
+ EVENT_PTR(intel_cqm_total_bw_snapshot),
+ EVENT_PTR(intel_cqm_local_bw_snapshot),
+ NULL,
+};
+
+static struct attribute *intel_cmt_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_llc),
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_total_bw),
+ EVENT_PTR(intel_cqm_local_bw),
+ EVENT_PTR(intel_cqm_llc_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_total_bw_pkg),
+ EVENT_PTR(intel_cqm_local_bw_pkg),
+ EVENT_PTR(intel_cqm_llc_unit),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_total_bw_unit),
+ EVENT_PTR(intel_cqm_local_bw_unit),
+ EVENT_PTR(intel_cqm_llc_scale),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ EVENT_PTR(intel_cqm_total_bw_scale),
+ EVENT_PTR(intel_cqm_local_bw_scale),
+ EVENT_PTR(intel_cqm_llc_snapshot),
+ EVENT_PTR(intel_cqm_total_bw_snapshot),
+ EVENT_PTR(intel_cqm_local_bw_snapshot),
+ NULL,
+};
+
static struct attribute_group intel_cqm_events_group = {
.name = "events",
- .attrs = intel_cqm_events_attr,
+ .attrs = NULL,
};

PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1320,12 +1416,47 @@ static const struct x86_cpu_id intel_cqm_match[] = {
{}
};

+static const struct x86_cpu_id intel_mbm_local_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
+ {}
+};
+
+static const struct x86_cpu_id intel_mbm_total_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
+ {}
+};
+
+static int intel_mbm_init(void)
+{
+ int ret = 0, array_size, maxid = cqm_max_rmid + 1;
+
+ array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_local = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_local)
+ return -ENOMEM;
+
+ mbm_total = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_total) {
+ kfree(mbm_local);
+ ret = -ENOMEM;
+ }
+
+ return ret;
+}
+
static int __init intel_cqm_init(void)
{
char *str = NULL, scale[20];
int i, cpu, ret;

- if (!x86_match_cpu(intel_cqm_match))
+ if (x86_match_cpu(intel_cqm_match))
+ cqm_enabled = true;
+
+ if (x86_match_cpu(intel_mbm_local_match) &&
+ x86_match_cpu(intel_mbm_total_match))
+ mbm_enabled = true;
+
+ if (!cqm_enabled && !mbm_enabled)
return -ENODEV;

cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1382,12 +1513,27 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

+ if (mbm_enabled)
+ ret = intel_mbm_init();
+ if (ret && !cqm_enabled)
+ goto out;
+
+ if (cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
+ else if (!cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_mbm_events_attr;
+ else if (cqm_enabled && !mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cqm_events_attr;
+
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
goto out;
} else {
- pr_info("Intel CQM monitoring enabled\n");
+ if (cqm_enabled)
+ pr_info("Intel CQM monitoring enabled\n");
+ if (mbm_enabled)
+ pr_info("Intel MBM enabled\n");
}

/*
@@ -1397,8 +1543,11 @@ static int __init intel_cqm_init(void)
__perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
cpu_notifier_register_done();
- if (ret)
+ if (ret) {
+ mbm_enabled = false;
+ cqm_enabled = false;
kfree(str);
+ }

return ret;
}
--
1.9.1

2016-03-01 23:48:09

by Shivappa Vikas

Subject: [PATCH 5/6] x86/mbm: RMID Recycling MBM changes

An RMID could be allocated or deallocated as part of RMID recycling.
When an RMID is allocated for an mbm event, the mbm counter needs to be
initialized because the next time we read the counter we need the
previous value to account for the total bytes that went to the memory
controller. Similarly, when an RMID is deallocated we need to update
the ->count variable.
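
As a toy illustration of the allocation case (made-up numbers; in the kernel
this is the init_mbm_sample()/update_sample(..., first=1) path added by the
previous patches):

/* Toy illustration (made-up numbers): the first read after an RMID is
 * (re)allocated must only capture a baseline, otherwise the previous
 * owner's traffic gets charged to the new event. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t msr_now = 900000;	/* count accumulated by the RMID's old user */
	uint64_t msr_later = 905000;	/* count at the next regular read */
	uint64_t prev_msr;

	/* What the init path does on allocation: record the baseline and
	 * start the totals from zero rather than computing a delta here. */
	prev_msr = msr_now;

	/* The next read then accounts only the new event's traffic. */
	printf("bytes charged to the new event: %llu\n",
	       (unsigned long long)(msr_later - prev_msr));	/* 5000, not 905000 */
	return 0;
}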

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 32 ++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 6638dcc..fa5ec85 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -453,6 +453,7 @@ struct rmid_read {

static void __intel_cqm_event_count(void *info);
static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type);
+static void __intel_mbm_event_count(void *info);

static bool is_mbm_event(int e)
{
@@ -479,8 +480,14 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
.rmid = old_rmid,
};

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
- &rr, 1);
+ if (is_mbm_event(group->attr.config)) {
+ rr.evt_type = group->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count,
+ &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
+ &rr, 1);
+ }
local64_set(&group->count, atomic64_read(&rr.value));
}

@@ -492,6 +499,22 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)

raw_spin_unlock_irq(&cache_lock);

+ /*
+ * If the allocation is for mbm, init the mbm stats.
+ * Need to check if each event in the group is mbm event
+ * because there could be multiple type of events in the same group.
+ */
+ if (__rmid_valid(rmid)) {
+ event = group;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
+ list_for_each_entry(event, head, hw.cqm_group_entry) {
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+ }
+ }
+
return old_rmid;
}

@@ -1011,7 +1034,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
- if (is_mbm_event(event->attr.config))
+ if (is_mbm_event(event->attr.config) &&
+ __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);
return;
}
@@ -1029,7 +1053,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

- if (is_mbm_event(event->attr.config))
+ if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);

event->hw.cqm_rmid = rmid;
--
1.9.1

2016-03-01 23:48:06

by Shivappa Vikas

Subject: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management

From: Tony Luck <[email protected]>

Includes all the core infrastructure to measure the total_bytes and
bandwidth.

We have per-socket counters for both total system-wide L3 external bytes
and local socket memory-controller bytes. The current b/w is calculated
for a minimum diff time (time since it was last counted) of 100ms. The OS
writes MSR_IA32_QM_EVTSEL to select an event, reads the count from
MSR_IA32_QM_CTR, and uses the IA32_PQR_ASSOC MSR to associate the RMID
with the task. The tasks have a common RMID for cqm (cache quality of
service monitoring) and MBM, hence most of the scheduling code is
reused from cqm.

A lot of the scheduling code was taken from Tony's patch, and 3-4 lines
of change were added in intel_cqm_event_read; this change was made
because the timer is no longer started on every context switch.
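
To make the arithmetic concrete, here is a stand-alone sketch of the
per-sample math that update_sample() in the diff below performs (the values
and the user-space form are illustrative only):

/* Stand-alone sketch of the per-sample math: 24-bit counter delta with
 * wraparound, scaled to bytes, and bandwidth recomputed only once at
 * least ~100ms has elapsed so rounding errors stay <= 1%. */
#include <stdio.h>
#include <stdint.h>

#define MBM_CNTR_MAX	0xffffff
#define MSEC_PER_SEC	1000ULL

struct sample {
	uint64_t interval_start_ms;
	uint64_t interval_bytes;
	uint64_t total_bytes;
	uint64_t prev_msr;
	uint64_t bandwidth;		/* bytes/sec of last completed interval */
};

static void update_sample(struct sample *s, uint64_t msr_val,
			  uint64_t now_ms, uint64_t scale)
{
	uint64_t bytes, diff_ms;

	msr_val &= MBM_CNTR_MAX;
	if (msr_val < s->prev_msr)	/* 24-bit counter wrapped */
		bytes = MBM_CNTR_MAX - s->prev_msr + msr_val + 1;
	else
		bytes = msr_val - s->prev_msr;
	bytes *= scale;

	s->total_bytes += bytes;
	s->interval_bytes += bytes;
	s->prev_msr = msr_val;

	diff_ms = now_ms - s->interval_start_ms;
	if (diff_ms > 100) {
		s->bandwidth = s->interval_bytes * MSEC_PER_SEC / diff_ms;
		s->interval_bytes = 0;
		s->interval_start_ms = now_ms;
	}
}

int main(void)
{
	/* hypothetical scale factor and counter readings */
	struct sample s = { .interval_start_ms = 0, .prev_msr = 0xfffff0 };

	update_sample(&s, 0x10, 250, 64);	/* wrapped: 0x20 counter units */
	printf("total %llu bytes, b/w %llu bytes/sec\n",
	       (unsigned long long)s.total_bytes,
	       (unsigned long long)s.bandwidth);
	return 0;
}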

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 158 ++++++++++++++++++++++++++++-
1 file changed, 154 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index cf08a0f..6638dcc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -13,6 +13,11 @@
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d

+/*
+ * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
+ * value
+ */
+#define MBM_CNTR_MAX 0xffffff
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
@@ -68,6 +73,16 @@ static struct sample *mbm_total;
*/
static struct sample *mbm_local;

+#define pkg_id topology_physical_package_id(smp_processor_id())
+/*
+ * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
+ * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
+ * rmids per socket, an example is given below
+ * RMID1 of Socket0: vrmid = 1
+ * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1
+ * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1
+ */
+#define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid)
/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
* Also protects event->hw.cqm_rmid
@@ -91,8 +106,19 @@ static cpumask_t cqm_cpumask;
#define RMID_VAL_UNAVAIL (1ULL << 62)

#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
+/*
+ * MBM Event IDs as defined in SDM section 17.15.5
+ * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
+ */
+enum mbm_evt_type {
+ QOS_MBM_TOTAL_EVENT_ID = 0x02,
+ QOS_MBM_LOCAL_EVENT_ID,
+ QOS_MBM_TOTAL_BW_EVENT_ID,
+ QOS_MBM_LOCAL_BW_EVENT_ID,
+};

-#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+#define QOS_MBM_BW_EVENT_MASK 0x04
+#define QOS_MBM_LOCAL_EVENT_MASK 0x01

/*
* This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -422,9 +448,16 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
struct rmid_read {
u32 rmid;
atomic64_t value;
+ enum mbm_evt_type evt_type;
};

static void __intel_cqm_event_count(void *info);
+static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type);
+
+static bool is_mbm_event(int e)
+{
+ return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_BW_EVENT_ID);
+}

/*
* Exchange the RMID of a group of events.
@@ -866,6 +899,98 @@ static void intel_cqm_rmid_rotate(struct work_struct *work)
schedule_delayed_work(&intel_cqm_rmid_work, delay);
}

+static struct sample *update_sample(unsigned int rmid,
+ enum mbm_evt_type evt_type, int first)
+{
+ ktime_t cur_time;
+ struct sample *mbm_current;
+ u32 vrmid = rmid_2_index(rmid);
+ u64 val, bytes, diff_time;
+ u32 eventid;
+
+ if (evt_type & QOS_MBM_LOCAL_EVENT_MASK) {
+ mbm_current = &mbm_local[vrmid];
+ eventid = QOS_MBM_LOCAL_EVENT_ID;
+ } else {
+ mbm_current = &mbm_total[vrmid];
+ eventid = QOS_MBM_TOTAL_EVENT_ID;
+ }
+
+ cur_time = ktime_get();
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ rdmsrl(MSR_IA32_QM_CTR, val);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return mbm_current;
+ val &= MBM_CNTR_MAX;
+
+ if (first) {
+ mbm_current->interval_start = cur_time;
+ mbm_current->prev_msr = val;
+ mbm_current->total_bytes = 0;
+ mbm_current->interval_bytes = 0;
+ mbm_current->bandwidth = 0;
+ return mbm_current;
+ }
+
+ if (val < mbm_current->prev_msr)
+ bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
+ else
+ bytes = val - mbm_current->prev_msr;
+ bytes *= cqm_l3_scale;
+
+ mbm_current->total_bytes += bytes;
+ mbm_current->interval_bytes += bytes;
+ mbm_current->prev_msr = val;
+ diff_time = ktime_ms_delta(cur_time, mbm_current->interval_start);
+
+ /*
+ * The b/w measured is really the most recent/current b/w.
+ * We wait till enough time has passed to avoid
+ * arthmetic rounding problems.Having it at >=100ms,
+ * such errors would be <=1%.
+ */
+ if (diff_time > 100) {
+ bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
+ do_div(bytes, diff_time);
+ mbm_current->bandwidth = bytes;
+ mbm_current->interval_bytes = 0;
+ mbm_current->interval_start = cur_time;
+ }
+
+ return mbm_current;
+}
+
+static u64 rmid_read_mbm(unsigned int rmid, enum mbm_evt_type evt_type)
+{
+ struct sample *mbm_current;
+
+ mbm_current = update_sample(rmid, evt_type, 0);
+
+ if (evt_type & QOS_MBM_BW_EVENT_MASK)
+ return mbm_current->bandwidth;
+ else
+ return mbm_current->total_bytes;
+}
+
+static void __intel_mbm_event_init(void *info)
+{
+ struct rmid_read *rr = info;
+
+ update_sample(rr->rmid, rr->evt_type, 1);
+}
+
+static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type)
+{
+ struct rmid_read rr = {
+ .value = ATOMIC64_INIT(0),
+ };
+
+ rr.rmid = rmid;
+ rr.evt_type = evt_type;
+ /* on each socket, init sample */
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
+}
+
/*
* Find a group and setup RMID.
*
@@ -886,6 +1011,8 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
return;
}

@@ -902,6 +1029,9 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();

+ if (is_mbm_event(event->attr.config))
+ init_mbm_sample(rmid, event->attr.config);
+
event->hw.cqm_rmid = rmid;
}

@@ -923,7 +1053,10 @@ static void intel_cqm_event_read(struct perf_event *event)
if (!__rmid_valid(rmid))
goto out;

- val = __rmid_read(rmid);
+ if (is_mbm_event(event->attr.config))
+ val = rmid_read_mbm(rmid, event->attr.config);
+ else
+ val = __rmid_read(rmid);

/*
* Ignore this reading on error states and do not update the value.
@@ -954,6 +1087,17 @@ static inline bool cqm_group_leader(struct perf_event *event)
return !list_empty(&event->hw.cqm_groups_entry);
}

+static void __intel_mbm_event_count(void *info)
+{
+ struct rmid_read *rr = info;
+ u64 val;
+
+ val = rmid_read_mbm(rr->rmid, rr->evt_type);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+ atomic64_add(val, &rr->value);
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1007,7 +1151,12 @@ static u64 intel_cqm_event_count(struct perf_event *event)
if (!__rmid_valid(rr.rmid))
goto out;

- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ if (is_mbm_event(event->attr.config)) {
+ rr.evt_type = event->attr.config;
+ on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, &rr, 1);
+ } else {
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
+ }

raw_spin_lock_irqsave(&cache_lock, flags);
if (event->hw.cqm_rmid == rr.rmid)
@@ -1122,7 +1271,8 @@ static int intel_cqm_event_init(struct perf_event *event)
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;

- if (event->attr.config & ~QOS_EVENT_MASK)
+ if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
+ (event->attr.config > QOS_MBM_LOCAL_BW_EVENT_ID))
return -EINVAL;

/* unsupported modes and filters */
--
1.9.1

2016-03-01 23:48:59

by Shivappa Vikas

Subject: [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling

This patch adds a per-package timer which periodically updates the
memory bandwidth counters for the events that are currently active.
The current patch uses a periodic timer every 1s, since the SDM
guarantees that the counter will not overflow in 1s, but this time can
definitely be improved by calibrating on the system. The overflow is
really a function of the max memory b/w that the socket can support,
the max counter value and the scaling factor.
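
As a rough illustration of that function (the scale factor and peak bandwidth
below are made-up example values, not platform data), the worst-case wrap time
can be estimated as follows; the 1s timer is safe whenever this comes out
comfortably above 1000ms:

/* Rough estimate of when a 24-bit MBM counter would wrap, given the
 * CPUID upscaling factor and the socket's peak memory bandwidth.  The
 * numbers below are illustrative assumptions, not measured data. */
#include <stdio.h>
#include <stdint.h>

#define MBM_CNTR_MAX	0xffffffULL

int main(void)
{
	uint64_t scale = 65536;			/* assumed bytes per counter unit */
	uint64_t peak_bw = 80ULL << 30;		/* assumed 80 GB/s per socket */

	uint64_t max_bytes = MBM_CNTR_MAX * scale;
	uint64_t wrap_ms = max_bytes * 1000 / peak_bw;

	printf("counter covers %llu bytes, wraps after ~%llu ms at peak b/w\n",
	       (unsigned long long)max_bytes, (unsigned long long)wrap_ms);
	return 0;
}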

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 132 ++++++++++++++++++++++++++++-
1 file changed, 130 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index fa5ec85..2870fc7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -18,9 +18,15 @@
* value
*/
#define MBM_CNTR_MAX 0xffffff
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME 1000
+
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -48,6 +54,7 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
* @interval_start Time this interval began
@@ -1122,6 +1129,84 @@ static void __intel_mbm_event_count(void *info)
atomic64_add(val, &rr->value);
}

+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct perf_event *iter, *iter1;
+ int ret = HRTIMER_RESTART;
+ struct list_head *head;
+ unsigned long flags;
+ u32 grp_rmid;
+
+ /*
+ * Need to hold the cache_lock as the timer Event Select MSR reads
+ * can race with the mbm/cqm count() and mbm_init() reads.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ if (list_empty(&cache_groups)) {
+ ret = HRTIMER_NORESTART;
+ goto out;
+ }
+
+ list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+ grp_rmid = iter->hw.cqm_rmid;
+ if (!__rmid_valid(grp_rmid))
+ continue;
+ if (is_mbm_event(iter->attr.config))
+ update_sample(grp_rmid, iter->attr.config, 0);
+
+ head = &iter->hw.cqm_group_entry;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+ if (!iter1->hw.is_group_event)
+ break;
+ if (is_mbm_event(iter1->attr.config))
+ update_sample(iter1->hw.cqm_rmid,
+ iter1->attr.config, 0);
+ }
+ }
+
+ hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+ hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+ hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+ struct hrtimer *hr;
+ int i;
+
+ for (i = 0; i < mbm_socket_max; i++) {
+ hr = &mbm_timers[i];
+ hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hr->function = mbm_hrtimer_handle;
+ }
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1251,8 +1336,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
+ unsigned long flags;

mutex_lock(&cache_mutex);
+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);

/*
* If there's another event in this group...
@@ -1284,6 +1375,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
}
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ /*
+ * Stop the mbm overflow timers when the last event is destroyed.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_stop_timers();
+
mutex_unlock(&cache_mutex);
}

@@ -1291,6 +1390,7 @@ static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
bool rotate = false;
+ unsigned long flags;

if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -1316,9 +1416,21 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

+ /*
+ * Start the mbm overflow timers when the first event is created.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_start_timers();
+
/* Will also set rmid */
intel_cqm_setup_event(event, &group);

+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
if (group) {
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
@@ -1337,6 +1449,7 @@ static int intel_cqm_event_init(struct perf_event *event)
rotate = true;
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
mutex_unlock(&cache_mutex);

if (rotate)
@@ -1604,15 +1717,30 @@ static int intel_mbm_init(void)
{
int ret = 0, array_size, maxid = cqm_max_rmid + 1;

- array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_socket_max = topology_max_packages();
+ array_size = sizeof(struct sample) * maxid * mbm_socket_max;
mbm_local = kmalloc(array_size, GFP_KERNEL);
if (!mbm_local)
return -ENOMEM;

mbm_total = kmalloc(array_size, GFP_KERNEL);
if (!mbm_total) {
- kfree(mbm_local);
ret = -ENOMEM;
+ goto out;
+ }
+
+ array_size = sizeof(struct hrtimer) * mbm_socket_max;
+ mbm_timers = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_timers) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ mbm_hrtimer_init();
+
+out:
+ if (ret) {
+ kfree(mbm_local);
+ kfree(mbm_total);
}

return ret;
--
1.9.1

2016-03-01 23:49:21

by Shivappa Vikas

Subject: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak

Fixes the hotcpu notifier leak and a memory leak during cqm (cache
quality of service monitoring) initialization.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e6be335..5666171 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -1322,7 +1322,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {

static int __init intel_cqm_init(void)
{
- char *str, scale[20];
+ char *str = NULL, scale[20];
int i, cpu, ret;

if (!x86_match_cpu(intel_cqm_match))
@@ -1382,16 +1382,23 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

- __perf_cpu_notifier(intel_cqm_cpu_notifier);
-
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
- if (ret)
+ if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
- else
+ goto out;
+ } else {
pr_info("Intel CQM monitoring enabled\n");
+ }

+ /*
+ * Register the hot cpu notifier once we are sure cqm
+ * is enabled to avoid notifier leak.
+ */
+ __perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
cpu_notifier_register_done();
+ if (ret)
+ kfree(str);

return ret;
}
--
1.9.1

2016-03-02 08:01:46

by Thomas Gleixner

Subject: Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak

On Tue, 1 Mar 2016, Vikas Shivappa wrote:

> Fixes the hotcpu notifier leak and a memory leak during cqm(cache
> quality of service monitoring) initialization.
>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Vikas Shivappa <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event_intel_cqm.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> index e6be335..5666171 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> @@ -1322,7 +1322,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>
> static int __init intel_cqm_init(void)
> {
> - char *str, scale[20];
> + char *str = NULL, scale[20];
> int i, cpu, ret;
>
> if (!x86_match_cpu(intel_cqm_match))
> @@ -1382,16 +1382,23 @@ static int __init intel_cqm_init(void)
> cqm_pick_event_reader(i);
> }
>
> - __perf_cpu_notifier(intel_cqm_cpu_notifier);
> -
> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> - if (ret)
> + if (ret) {
> pr_err("Intel CQM perf registration failed: %d\n", ret);
> - else
> + goto out;
> + } else {
> pr_info("Intel CQM monitoring enabled\n");
> + }
>
> + /*
> + * Register the hot cpu notifier once we are sure cqm
> + * is enabled to avoid notifier leak.
> + */
> + __perf_cpu_notifier(intel_cqm_cpu_notifier);
> out:
> cpu_notifier_register_done();
> + if (ret)
> + kfree(str);

This still leaks cqm_rmid_ptrs ....

2016-03-02 08:06:06

by Thomas Gleixner

Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

On Tue, 1 Mar 2016, Vikas Shivappa wrote:
> @@ -1397,8 +1543,11 @@ static int __init intel_cqm_init(void)
> __perf_cpu_notifier(intel_cqm_cpu_notifier);
> out:
> cpu_notifier_register_done();
> - if (ret)
> + if (ret) {
> + mbm_enabled = false;
> + cqm_enabled = false;
> kfree(str);


Leaks mbm_local and mbm_total ....

> + }
>
> return ret;
> }
> --
> 1.9.1
>
>

2016-03-02 17:58:06

by Shivappa Vikas

Subject: Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak



On Wed, 2 Mar 2016, Thomas Gleixner wrote:

> On Tue, 1 Mar 2016, Vikas Shivappa wrote:
>
>> Fixes the hotcpu notifier leak and a memory leak during cqm(cache
>> quality of service monitoring) initialization.
>>
>> Reviewed-by: Tony Luck <[email protected]>
>> Signed-off-by: Vikas Shivappa <[email protected]>
>> ---
>> arch/x86/kernel/cpu/perf_event_intel_cqm.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> index e6be335..5666171 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> @@ -1322,7 +1322,7 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>>
>> static int __init intel_cqm_init(void)
>> {
>> - char *str, scale[20];
>> + char *str = NULL, scale[20];
>> int i, cpu, ret;
>>
>> if (!x86_match_cpu(intel_cqm_match))
>> @@ -1382,16 +1382,23 @@ static int __init intel_cqm_init(void)
>> cqm_pick_event_reader(i);
>> }
>>
>> - __perf_cpu_notifier(intel_cqm_cpu_notifier);
>> -
>> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>> - if (ret)
>> + if (ret) {
>> pr_err("Intel CQM perf registration failed: %d\n", ret);
>> - else
>> + goto out;
>> + } else {
>> pr_info("Intel CQM monitoring enabled\n");
>> + }
>>
>> + /*
>> + * Register the hot cpu notifier once we are sure cqm
>> + * is enabled to avoid notifier leak.
>> + */
>> + __perf_cpu_notifier(intel_cqm_cpu_notifier);
>> out:
>> cpu_notifier_register_done();
>> + if (ret)
>> + kfree(str);
>
> This still leaks cqm_rmid_ptrs ....
>

That's right. I thought there must be something more than the str leak I
fixed, but missed this. Will fix.

Thanks,
Vikas

2016-03-02 17:59:15

by Shivappa Vikas

Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init



On Wed, 2 Mar 2016, Thomas Gleixner wrote:

> On Tue, 1 Mar 2016, Vikas Shivappa wrote:
>> @@ -1397,8 +1543,11 @@ static int __init intel_cqm_init(void)
>> __perf_cpu_notifier(intel_cqm_cpu_notifier);
>> out:
>> cpu_notifier_register_done();
>> - if (ret)
>> + if (ret) {
>> + mbm_enabled = false;
>> + cqm_enabled = false;
>> kfree(str);
>
>
> Leaks mbm_local and mbm_total ....

Will fix. Thanks for pointing it out. I missed the ones which are done at the
next level of calls from the init. Will do a check on all the globals as well.

Vikas

>
>> + }
>>
>> return ret;
>> }
>> --
>> 1.9.1
>>
>>
>

2016-03-02 21:30:52

by Shivappa Vikas

Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init



On Wed, 2 Mar 2016, Vikas Shivappa wrote:

>
>
> On Wed, 2 Mar 2016, Thomas Gleixner wrote:
>
>> Leaks mbm_local and mbm_total ....
>
> Will fix. Thanks for pointing out. I missed the ones which are done at the
> next level of calls from the init. Will do a check on all the globals as
> well.
>
> Vikas

Looks like the mbm_timers (declared in patch 06) is leaked in cqm_init; will
fix that as well.

>
>>
>>> + }
>>>
>>> return ret;
>>> }
>>> --
>>> 1.9.1
>>>
>>>
>>
>

2016-03-02 23:53:27

by Shivappa Vikas

Subject: Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak

Fixes the hotcpu notifier leak and other global-variable memory leaks
during cqm (cache quality of service monitoring) initialization.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---

Fixed the memory leak for cqm_rmid_ptrs as per Thomas's feedback.

arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e6be335..37a93fa 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -250,10 +250,13 @@ static int intel_cqm_setup_rmid_cache(void)

return 0;
fail:
- while (r--)
+ while (r--) {
kfree(cqm_rmid_ptrs[r]);
+ cqm_rmid_ptrs[r] = NULL;
+ }

kfree(cqm_rmid_ptrs);
+ cqm_rmid_ptrs = NULL;
return -ENOMEM;
}

@@ -1320,9 +1323,19 @@ static const struct x86_cpu_id intel_cqm_match[] = {
{}
};

+static void cqm_cleanup(void)
+{
+ int r = cqm_max_rmid;
+
+ while (r--)
+ kfree(cqm_rmid_ptrs[r]);
+
+ kfree(cqm_rmid_ptrs);
+}
+
static int __init intel_cqm_init(void)
{
- char *str, scale[20];
+ char *str = NULL, scale[20];
int i, cpu, ret;

if (!x86_match_cpu(intel_cqm_match))
@@ -1382,16 +1395,25 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

- __perf_cpu_notifier(intel_cqm_cpu_notifier);
-
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
- if (ret)
+ if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
- else
+ goto out;
+ } else {
pr_info("Intel CQM monitoring enabled\n");
+ }

+ /*
+ * Register the hot cpu notifier once we are sure cqm
+ * is enabled to avoid notifier leak.
+ */
+ __perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
cpu_notifier_register_done();
+ if (ret) {
+ kfree(str);
+ cqm_cleanup();
+ }

return ret;
}
--
1.9.1

2016-03-02 23:55:36

by Shivappa Vikas

Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

The MBM init patch enumerates Intel Memory bandwidth monitoring (MBM) and
initializes the perf events and data structures for monitoring memory
b/w. It's based on the original patch series by Tony Luck and Kanaka Juvva.

Memory bandwidth monitoring(MBM) provides OS/VMM a way to monitor
bandwidth from one level of cache to another. The current patches
support L3 external bandwidth monitoring. It supports both 'local
bandwidth' and 'total bandwidth' monitoring for the socket. Local
bandwidth measures the amount of data sent through the memory controller
on the socket and total b/w measures the total system bandwidth.

Extending the cache quality of service monitoring(CQM) we add four more
events to the perf infrastructure:
intel_cqm_llc/local_bytes - bytes sent through local socket memory
controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
intel_cqm_llc/local_bw - Current local b/w
intel_cqm_llc/total_bw - current total b/w

The tasks are associated with a Resource Monitoring ID (RMID) just like in
cqm, and the OS uses an MSR write to indicate the RMID of the task during
scheduling.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---

Fixed mbm_local and mbm_total leaks as per Thomas's feedback.

arch/x86/include/asm/cpufeature.h | 2 +
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/cpu/perf_event_intel_cqm.c | 157 ++++++++++++++++++++++++++++-
3 files changed, 159 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 7ad8c94..9b4233e 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -241,6 +241,8 @@

/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
+#define X86_FEATURE_CQM_MBM_TOTAL (12*32+ 1) /* LLC Total MBM monitoring */
+#define X86_FEATURE_CQM_MBM_LOCAL (12*32+ 2) /* LLC Local MBM monitoring */

/* AMD-defined CPU features, CPUID level 0x80000008 (ebx), word 13 */
#define X86_FEATURE_CLZERO (13*32+0) /* CLZERO instruction */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa05680..13af76e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -635,7 +635,9 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
c->x86_capability[CPUID_F_1_EDX] = edx;

- if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ if ((cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) ||
+ ((cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL)) ||
+ (cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL)))) {
c->x86_cache_max_rmid = ecx;
c->x86_cache_occ_scale = ebx;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 37a93fa..5da415d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -15,6 +15,7 @@

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+static bool cqm_enabled, mbm_enabled;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -42,6 +43,30 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+/**
+ * struct sample - mbm event's (local or total) data
+ * @interval_start Time this interval began
+ * @interval_bytes #bytes in this interval
+ * @total_bytes #bytes since we began monitoring
+ * @prev_msr previous value of MSR
+ * @bandwidth bytes/sec in previous completed interval
+ */
+struct sample {
+ ktime_t interval_start;
+ u64 interval_bytes;
+ u64 total_bytes;
+ u64 prev_msr;
+ u64 bandwidth;
+};
+
+/*
+ * samples profiled for total memory bandwidth type events
+ */
+static struct sample *mbm_total;
+/*
+ * samples profiled for local memory bandwidth type events
+ */
+static struct sample *mbm_local;

/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -1155,6 +1180,28 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");

+EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
+EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
+EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
+EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
+EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
+EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
+EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
+
+EVENT_ATTR_STR(total_bw, intel_cqm_total_bw, "event=0x04");
+EVENT_ATTR_STR(total_bw.per-pkg, intel_cqm_total_bw_pkg, "1");
+EVENT_ATTR_STR(total_bw.unit, intel_cqm_total_bw_unit, "MB/sec");
+EVENT_ATTR_STR(total_bw.scale, intel_cqm_total_bw_scale, "1e-6");
+EVENT_ATTR_STR(total_bw.snapshot, intel_cqm_total_bw_snapshot, "1");
+
+EVENT_ATTR_STR(local_bw, intel_cqm_local_bw, "event=0x05");
+EVENT_ATTR_STR(local_bw.per-pkg, intel_cqm_local_bw_pkg, "1");
+EVENT_ATTR_STR(local_bw.unit, intel_cqm_local_bw_unit, "MB/sec");
+EVENT_ATTR_STR(local_bw.scale, intel_cqm_local_bw_scale, "1e-6");
+EVENT_ATTR_STR(local_bw.snapshot, intel_cqm_local_bw_snapshot, "1");
+
static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
EVENT_PTR(intel_cqm_llc_pkg),
@@ -1164,9 +1211,58 @@ static struct attribute *intel_cqm_events_attr[] = {
NULL,
};

+static struct attribute *intel_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_total_bw),
+ EVENT_PTR(intel_cqm_local_bw),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_total_bw_pkg),
+ EVENT_PTR(intel_cqm_local_bw_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_total_bw_unit),
+ EVENT_PTR(intel_cqm_local_bw_unit),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ EVENT_PTR(intel_cqm_total_bw_scale),
+ EVENT_PTR(intel_cqm_local_bw_scale),
+ EVENT_PTR(intel_cqm_total_bw_snapshot),
+ EVENT_PTR(intel_cqm_local_bw_snapshot),
+ NULL,
+};
+
+static struct attribute *intel_cmt_mbm_events_attr[] = {
+ EVENT_PTR(intel_cqm_llc),
+ EVENT_PTR(intel_cqm_total_bytes),
+ EVENT_PTR(intel_cqm_local_bytes),
+ EVENT_PTR(intel_cqm_total_bw),
+ EVENT_PTR(intel_cqm_local_bw),
+ EVENT_PTR(intel_cqm_llc_pkg),
+ EVENT_PTR(intel_cqm_total_bytes_pkg),
+ EVENT_PTR(intel_cqm_local_bytes_pkg),
+ EVENT_PTR(intel_cqm_total_bw_pkg),
+ EVENT_PTR(intel_cqm_local_bw_pkg),
+ EVENT_PTR(intel_cqm_llc_unit),
+ EVENT_PTR(intel_cqm_total_bytes_unit),
+ EVENT_PTR(intel_cqm_local_bytes_unit),
+ EVENT_PTR(intel_cqm_total_bw_unit),
+ EVENT_PTR(intel_cqm_local_bw_unit),
+ EVENT_PTR(intel_cqm_llc_scale),
+ EVENT_PTR(intel_cqm_total_bytes_scale),
+ EVENT_PTR(intel_cqm_local_bytes_scale),
+ EVENT_PTR(intel_cqm_total_bw_scale),
+ EVENT_PTR(intel_cqm_local_bw_scale),
+ EVENT_PTR(intel_cqm_llc_snapshot),
+ EVENT_PTR(intel_cqm_total_bw_snapshot),
+ EVENT_PTR(intel_cqm_local_bw_snapshot),
+ NULL,
+};
+
static struct attribute_group intel_cqm_events_group = {
.name = "events",
- .attrs = intel_cqm_events_attr,
+ .attrs = NULL,
};

PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
kfree(cqm_rmid_ptrs[r]);

kfree(cqm_rmid_ptrs);
+ kfree(mbm_local);
+ kfree(mbm_total);
+ mbm_enabled = false;
+ cqm_enabled = false;
+}
+
+static const struct x86_cpu_id intel_mbm_local_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
+ {}
+};
+
+static const struct x86_cpu_id intel_mbm_total_match[] = {
+ { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
+ {}
+};
+
+static int intel_mbm_init(void)
+{
+ int ret = 0, array_size, maxid = cqm_max_rmid + 1;
+
+ array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_local = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_local)
+ return -ENOMEM;
+
+ mbm_total = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_total) {
+ kfree(mbm_local);
+ mbm_local = NULL;
+ ret = -ENOMEM;
+ }
+
+ return ret;
}

static int __init intel_cqm_init(void)
@@ -1338,7 +1467,14 @@ static int __init intel_cqm_init(void)
char *str = NULL, scale[20];
int i, cpu, ret;

- if (!x86_match_cpu(intel_cqm_match))
+ if (x86_match_cpu(intel_cqm_match))
+ cqm_enabled = true;
+
+ if (x86_match_cpu(intel_mbm_local_match) &&
+ x86_match_cpu(intel_mbm_total_match))
+ mbm_enabled = true;
+
+ if (!cqm_enabled && !mbm_enabled)
return -ENODEV;

cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1395,12 +1531,27 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}

+ if (mbm_enabled)
+ ret = intel_mbm_init();
+ if (ret && !cqm_enabled)
+ goto out;
+
+ if (cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
+ else if (!cqm_enabled && mbm_enabled)
+ intel_cqm_events_group.attrs = intel_mbm_events_attr;
+ else if (cqm_enabled && !mbm_enabled)
+ intel_cqm_events_group.attrs = intel_cqm_events_attr;
+
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
goto out;
} else {
- pr_info("Intel CQM monitoring enabled\n");
+ if (cqm_enabled)
+ pr_info("Intel CQM monitoring enabled\n");
+ if (mbm_enabled)
+ pr_info("Intel MBM enabled\n");
}

/*
--
1.9.1

2016-03-02 23:58:17

by Shivappa Vikas

Subject: Re: [PATCH 6/6] x86/mbm: Add support for MBM counter overflow handling

This patch adds a per-package timer which periodically updates the
memory bandwidth counters for the events that are currently active.
The current patch uses a periodic timer every 1s, since the SDM
guarantees that the counter will not overflow in 1s, but this time can
definitely be improved by calibrating on the system. The overflow is
really a function of the max memory b/w that the socket can support,
the max counter value and the scaling factor.

Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---

Fixed mbm_timers leak in the intel_cqm_init function.

arch/x86/kernel/cpu/perf_event_intel_cqm.c | 134 ++++++++++++++++++++++++++++-
1 file changed, 132 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index e4b520f..0b3a9ac 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -18,9 +18,15 @@
* value
*/
#define MBM_CNTR_MAX 0xffffff
+/*
+ * Guaranteed time in ms as per SDM where MBM counters will not overflow.
+ */
+#define MBM_CTR_OVERFLOW_TIME 1000
+
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
+unsigned int mbm_socket_max;

/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -48,6 +54,7 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
* @interval_start Time this interval began
@@ -1125,6 +1132,84 @@ static void __intel_mbm_event_count(void *info)
atomic64_add(val, &rr->value);
}

+static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct perf_event *iter, *iter1;
+ int ret = HRTIMER_RESTART;
+ struct list_head *head;
+ unsigned long flags;
+ u32 grp_rmid;
+
+ /*
+ * Need to hold the cache_lock as the timer Event Select MSR reads
+ * can race with the mbm/cqm count() and mbm_init() reads.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ if (list_empty(&cache_groups)) {
+ ret = HRTIMER_NORESTART;
+ goto out;
+ }
+
+ list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+ grp_rmid = iter->hw.cqm_rmid;
+ if (!__rmid_valid(grp_rmid))
+ continue;
+ if (is_mbm_event(iter->attr.config))
+ update_sample(grp_rmid, iter->attr.config, 0);
+
+ head = &iter->hw.cqm_group_entry;
+ if (list_empty(head))
+ continue;
+ list_for_each_entry(iter1, head, hw.cqm_group_entry) {
+ if (!iter1->hw.is_group_event)
+ break;
+ if (is_mbm_event(iter1->attr.config))
+ update_sample(iter1->hw.cqm_rmid,
+ iter1->attr.config, 0);
+ }
+ }
+
+ hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
+out:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return ret;
+}
+
+static void __mbm_start_timer(void *info)
+{
+ hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void __mbm_stop_timer(void *info)
+{
+ hrtimer_cancel(&mbm_timers[pkg_id]);
+}
+
+static void mbm_start_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
+}
+
+static void mbm_stop_timers(void)
+{
+ on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
+}
+
+static void mbm_hrtimer_init(void)
+{
+ struct hrtimer *hr;
+ int i;
+
+ for (i = 0; i < mbm_socket_max; i++) {
+ hr = &mbm_timers[i];
+ hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hr->function = mbm_hrtimer_handle;
+ }
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
@@ -1254,8 +1339,14 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
+ unsigned long flags;

mutex_lock(&cache_mutex);
+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);

/*
* If there's another event in this group...
@@ -1287,6 +1378,14 @@ static void intel_cqm_event_destroy(struct perf_event *event)
}
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ /*
+ * Stop the mbm overflow timers when the last event is destroyed.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_stop_timers();
+
mutex_unlock(&cache_mutex);
}

@@ -1294,6 +1393,7 @@ static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
bool rotate = false;
+ unsigned long flags;

if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -1319,9 +1419,21 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

+ /*
+ * Start the mbm overflow timers when the first event is created.
+ */
+ if (mbm_enabled && list_empty(&cache_groups))
+ mbm_start_timers();
+
/* Will also set rmid */
intel_cqm_setup_event(event, &group);

+ /*
+ * Hold the cache_lock as mbm timer handlers could be
+ * scanning the list of events.
+ */
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
if (group) {
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
@@ -1340,6 +1452,7 @@ static int intel_cqm_event_init(struct perf_event *event)
rotate = true;
}

+ raw_spin_unlock_irqrestore(&cache_lock, flags);
mutex_unlock(&cache_mutex);

if (rotate)
@@ -1603,6 +1716,7 @@ static void cqm_cleanup(void)
kfree(cqm_rmid_ptrs);
kfree(mbm_local);
kfree(mbm_total);
+ kfree(mbm_timers);
mbm_enabled = false;
cqm_enabled = false;
}
@@ -1621,16 +1735,32 @@ static int intel_mbm_init(void)
{
int ret = 0, array_size, maxid = cqm_max_rmid + 1;

- array_size = sizeof(struct sample) * maxid * topology_max_packages();
+ mbm_socket_max = topology_max_packages();
+ array_size = sizeof(struct sample) * maxid * mbm_socket_max;
mbm_local = kmalloc(array_size, GFP_KERNEL);
if (!mbm_local)
return -ENOMEM;

mbm_total = kmalloc(array_size, GFP_KERNEL);
if (!mbm_total) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ array_size = sizeof(struct hrtimer) * mbm_socket_max;
+ mbm_timers = kmalloc(array_size, GFP_KERNEL);
+ if (!mbm_timers) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ mbm_hrtimer_init();
+
+out:
+ if (ret) {
kfree(mbm_local);
mbm_local = NULL;
- ret = -ENOMEM;
+ kfree(mbm_total);
+ mbm_total = NULL;
}

return ret;
--
1.9.1

2016-03-03 07:36:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

On Wed, 2 Mar 2016, Vikas Shivappa wrote:
> + if (cqm_enabled && mbm_enabled)
> + intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
> + else if (!cqm_enabled && mbm_enabled)
> + intel_cqm_events_group.attrs = intel_mbm_events_attr;
> + else if (cqm_enabled && !mbm_enabled)
> + intel_cqm_events_group.attrs = intel_cqm_events_attr;
> +
> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> if (ret) {
> pr_err("Intel CQM perf registration failed: %d\n", ret);
> goto out;

So what cleans up mbm_local and mbm_total in that case?

Thanks,

tglx

2016-03-03 18:26:43

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init



On Wed, 2 Mar 2016, Thomas Gleixner wrote:

> On Wed, 2 Mar 2016, Vikas Shivappa wrote:
>> + if (cqm_enabled && mbm_enabled)
>> + intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
>> + else if (!cqm_enabled && mbm_enabled)
>> + intel_cqm_events_group.attrs = intel_mbm_events_attr;
>> + else if (cqm_enabled && !mbm_enabled)
>> + intel_cqm_events_group.attrs = intel_cqm_events_attr;
>> +
>> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>> if (ret) {
>> pr_err("Intel CQM perf registration failed: %d\n", ret);
>> goto out;
>
> So what cleans up mbm_local and mbm_total in that case?

I put all the cleanup code in the cqm_cleanup .. - please see copy below

@@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
kfree(cqm_rmid_ptrs[r]);

kfree(cqm_rmid_ptrs);
+ kfree(mbm_local);
+ kfree(mbm_total);
+ mbm_enabled = false;
+ cqm_enabled = false;
+}

Thanks,
Vikas


>
> Thanks,
>
> tglx
>

2016-03-03 18:38:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

On Thu, 3 Mar 2016, Vikas Shivappa wrote:

>
>
> On Wed, 2 Mar 2016, Thomas Gleixner wrote:
>
> > On Wed, 2 Mar 2016, Vikas Shivappa wrote:
> > > + if (cqm_enabled && mbm_enabled)
> > > + intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
> > > + else if (!cqm_enabled && mbm_enabled)
> > > + intel_cqm_events_group.attrs = intel_mbm_events_attr;
> > > + else if (cqm_enabled && !mbm_enabled)
> > > + intel_cqm_events_group.attrs = intel_cqm_events_attr;
> > > +
> > > ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> > > if (ret) {
> > > pr_err("Intel CQM perf registration failed: %d\n", ret);
> > > goto out;
> >
> > So what cleans up mbm_local and mbm_total in that case?
>
> I put all the cleanup code in the cqm_cleanup .. - please see copy below

Ok, missed that.

> @@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
> kfree(cqm_rmid_ptrs[r]);
>
> kfree(cqm_rmid_ptrs);
> + kfree(mbm_local);
> + kfree(mbm_total);
> + mbm_enabled = false;
> + cqm_enabled = false;
> +}
>
> Thanks,
> Vikas
>
>
> >
> > Thanks,
> >
> > tglx
> >
>

2016-03-07 23:04:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management

On Tue, Mar 01, 2016 at 03:48:26PM -0800, Vikas Shivappa wrote:

> Lot of the scheduling code was taken out from Tony's patch and a 3-4
> lines of change were added in the intel_cqm_event_read. Since the timer
> is no more added on every context switch this change was made.

Is this here to confuse people or is there some actual information in
it?

> +/*
> + * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
> + * value
> + */
> +#define MBM_CNTR_MAX 0xffffff

#define MBM_CNTR_WIDTH 24
#define MBM_CNTR_MAX ((1U << MBM_CNTR_WIDTH) - 1)


> #define QOS_L3_OCCUP_EVENT_ID (1 << 0)
> +/*
> + * MBM Event IDs as defined in SDM section 17.15.5
> + * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
> + */
> +enum mbm_evt_type {
> + QOS_MBM_TOTAL_EVENT_ID = 0x02,
> + QOS_MBM_LOCAL_EVENT_ID,
> + QOS_MBM_TOTAL_BW_EVENT_ID,
> + QOS_MBM_LOCAL_BW_EVENT_ID,
> +};

QOS_L3_*_EVENT_ID is a define, these are an enum. Rather inconsistent.

> struct rmid_read {
> u32 rmid;

Hole, you could've filled with the enum (which ends up being an int I
think).

> atomic64_t value;
> + enum mbm_evt_type evt_type;
> };

> +static bool is_mbm_event(int e)

You had an enum type, you might as well use it.

> +{
> + return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_BW_EVENT_ID);
> +}
>

> +static struct sample *update_sample(unsigned int rmid,
> + enum mbm_evt_type evt_type, int first)
> +{
> + ktime_t cur_time;
> + struct sample *mbm_current;
> + u32 vrmid = rmid_2_index(rmid);
> + u64 val, bytes, diff_time;
> + u32 eventid;
> +
> + if (evt_type & QOS_MBM_LOCAL_EVENT_MASK) {
> + mbm_current = &mbm_local[vrmid];
> + eventid = QOS_MBM_LOCAL_EVENT_ID;
> + } else {
> + mbm_current = &mbm_total[vrmid];
> + eventid = QOS_MBM_TOTAL_EVENT_ID;
> + }
> +
> + cur_time = ktime_get();
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + rdmsrl(MSR_IA32_QM_CTR, val);
> + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> + return mbm_current;

> + val &= MBM_CNTR_MAX;

> + if (val < mbm_current->prev_msr)
> + bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
> + else
> + bytes = val - mbm_current->prev_msr;

Would not something like:

shift = 64 - MBM_CNTR_WIDTH;

bytes = (val << shift) - (prev << shift);
bytes >>= shift;

be less obtuse? (and consistent with how every other perf update
function does it).

What guarantee is there we didn't wrap multiple times? Doesn't that
deserve a comment?

> + bytes *= cqm_l3_scale;
> +
> + mbm_current->total_bytes += bytes;
> + mbm_current->interval_bytes += bytes;
> + mbm_current->prev_msr = val;
> + diff_time = ktime_ms_delta(cur_time, mbm_current->interval_start);

Here we do a / 1e6

> +
> + /*
> + * The b/w measured is really the most recent/current b/w.
> + * We wait till enough time has passed to avoid
> + * arithmetic rounding problems. Having it at >=100ms,
> + * such errors would be <=1%.
> + */
> + if (diff_time > 100) {

This could well be > 100e6 instead, avoiding the above division most of
the time.

> + bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
> + do_div(bytes, diff_time);
> + mbm_current->bandwidth = bytes;
> + mbm_current->interval_bytes = 0;
> + mbm_current->interval_start = cur_time;
> + }
> +
> + return mbm_current;
> +}

How does the above time tracking deal with the event not actually having
been scheduled the whole time?


> +static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type)
> +{
> + struct rmid_read rr = {
> + .value = ATOMIC64_INIT(0),
> + };
> +
> + rr.rmid = rmid;
> + rr.evt_type = evt_type;

That's just sad.. put those two in the struct init as well.

> + /* on each socket, init sample */
> + on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
> +}

2016-03-07 23:04:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group

On Tue, Mar 01, 2016 at 03:48:23PM -0800, Vikas Shivappa wrote:
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -121,6 +121,7 @@ struct hw_perf_event {
> struct { /* intel_cqm */
> int cqm_state;
> u32 cqm_rmid;
> + bool is_group_event;
> struct list_head cqm_events_entry;
> struct list_head cqm_groups_entry;
> struct list_head cqm_group_entry;

Please, no 'bool' in structures.

2016-03-07 23:27:50

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management

>> + bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
>> + do_div(bytes, diff_time);
>> + mbm_current->bandwidth = bytes;
>> + mbm_current->interval_bytes = 0;
>> + mbm_current->interval_start = cur_time;
>> + }
>>> +
>> + return mbm_current;
>> +}
>
> How does the above time tracking deal with the event not actually having
> been scheduled the whole time?

That's been the topic of a few philosophical debates ... what exactly are
we trying to say when we report that a process has a "memory bandwidth"
of, say, 1523 MBytes/s? We need to know both the amount of data moved
and to pick an interval to measure and divide by. Does it make a difference
whether the process voluntarily gave up the cpu for some part of the interval
(by blocking on I/O)? Or did the scheduler time-slice it out to run other jobs?

The above code gives the average bandwidth across the last interval
(with a minimum interval size of 100ms to avoid craziness with rounding
errors on exceptionally tiny intervals). Some folks apparently want to get
a "rate" directly from perf. I think many folks will find the "bytes" counters
more helpful (where they control the sample interval with the '-I' flag to the perf
utility).
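
To put rough numbers on it: if an RMID moved, say, 380 MBytes over a
250 ms interval, the code above reports 380 * 1000 / 250 = 1520 MBytes/s
for that interval.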

-Tony

2016-03-08 08:49:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management

On Mon, Mar 07, 2016 at 11:27:26PM +0000, Luck, Tony wrote:
> >> + bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
> >> + do_div(bytes, diff_time);
> >> + mbm_current->bandwidth = bytes;
> >> + mbm_current->interval_bytes = 0;
> >> + mbm_current->interval_start = cur_time;
> >> + }
> >>> +
> >> + return mbm_current;
> >> +}
> >
> > How does the above time tracking deal with the event not actually having
> > been scheduled the whole time?
>
> That's been the topic of a few philosophical debates ... what exactly are
> we trying to say when we report that a process has a "memory bandwidth"
> of, say, 1523 MBytes/s? We need to know both the amount of data moved
> and to pick an interval to measure and divide by. Does it make a difference
> whether the process voluntarily gave up the cpu for some part of the interval
> (by blocking on I/O)? Or did the scheduler time-slice it out to run other jobs?
>
> The above code gives the average bandwidth across the last interval
> (with a minimum interval size of 100ms to avoid craziness with rounding
> errors on exceptionally tiny intervals). Some folks apparently want to get
> a "rate" directly from perf. I think many folks will find the "bytes" counters
> more helpful (where they control the sample interval with the '-I' flag to the perf
> utility).

So why didn't any of that make it into the Changelog? This is very much
different from any other perf driver, at the very least this debate
should have been mentioned and the choice defended.

Also, why are we doing the time tracking and divisions at all? Can't we
simply report the number of bytes transferred and let userspace sort out
the rest?

Userspace is provided the time the event was enabled, the time the event
was running and it can fairly trivially obtain walltime if it so desires
and then it can compute whatever definition of bandwidth it wants to
use.


2016-03-08 09:24:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak

On Wed, 2 Mar 2016, Vikas Shivappa wrote:

Please fix the subject line prefix: "x86/perf/intel/cqm:"

> Fixes the hotcpu notifier leak and other global variable memory leaks
> during cqm(cache quality of service monitoring) initialization.
>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Vikas Shivappa <[email protected]>
> ---
>
> Fixed the memory leak for cqm_rmid_ptrs as per Thomas feedback.
>
> arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++++++++++++++------
> 1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> index e6be335..37a93fa 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
> @@ -250,10 +250,13 @@ static int intel_cqm_setup_rmid_cache(void)
>
> return 0;
> fail:
> - while (r--)
> + while (r--) {
> kfree(cqm_rmid_ptrs[r]);
> + cqm_rmid_ptrs[r] = NULL;
> + }
>
> kfree(cqm_rmid_ptrs);
> + cqm_rmid_ptrs = NULL;

This partial rollback is crap. Why can't you utilize cqm_cleanup() ?
Performance certainly is not an issue here.

> return -ENOMEM;
> }
>
> @@ -1320,9 +1323,19 @@ static const struct x86_cpu_id intel_cqm_match[] = {
> {}
> };
>
> +static void cqm_cleanup(void)
> +{
> + int r = cqm_max_rmid;
> +
> + while (r--)
> + kfree(cqm_rmid_ptrs[r]);
> +
> + kfree(cqm_rmid_ptrs);
> +}
> +
> static int __init intel_cqm_init(void)
> {
> - char *str, scale[20];
> + char *str = NULL, scale[20];
> int i, cpu, ret;
>
> if (!x86_match_cpu(intel_cqm_match))
> @@ -1382,16 +1395,25 @@ static int __init intel_cqm_init(void)
> cqm_pick_event_reader(i);
> }
>
> - __perf_cpu_notifier(intel_cqm_cpu_notifier);
> -
> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> - if (ret)
> + if (ret) {
> pr_err("Intel CQM perf registration failed: %d\n", ret);
> - else
> + goto out;
> + } else {
> pr_info("Intel CQM monitoring enabled\n");
> + }
>
> + /*
> + * Register the hot cpu notifier once we are sure cqm
> + * is enabled to avoid notifier leak.
> + */
> + __perf_cpu_notifier(intel_cqm_cpu_notifier);
> out:
> cpu_notifier_register_done();
> + if (ret) {
> + kfree(str);
> + cqm_cleanup();

This is still broken. If intel_cqm_setup_rmid_cache() fails, you clear
cqm_rmid_ptrs and then jump to out. ret is -ENOMEM, so you will call
cqm_cleanup which will dereference cqm_rmid_ptrs .....

Find below a delta patch, which fixes that proper and safe.

Thanks,

tglx

8<----------------

arch/x86/events/intel/cqm.c | 40 +++++++++++++++++++---------------------
1 file changed, 19 insertions(+), 21 deletions(-)

--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -211,6 +211,20 @@ static void __put_rmid(u32 rmid)
list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
}

+static void cqm_cleanup(void)
+{
+ int i;
+
+ if (!cqm_rmid_ptrs)
+ return;
+
+ for (i = 0; i < cqm_max_rmid; i++)
+ kfree(cqm_rmid_ptrs[i]);
+
+ kfree(cqm_rmid_ptrs);
+ cqm_rmid_ptrs = NULL;
+}
+
static int intel_cqm_setup_rmid_cache(void)
{
struct cqm_rmid_entry *entry;
@@ -218,7 +232,7 @@ static int intel_cqm_setup_rmid_cache(vo
int r = 0;

nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
+ cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
nr_rmids, GFP_KERNEL);
if (!cqm_rmid_ptrs)
return -ENOMEM;
@@ -247,16 +261,10 @@ static int intel_cqm_setup_rmid_cache(vo
mutex_lock(&cache_mutex);
intel_cqm_rotation_rmid = __get_rmid();
mutex_unlock(&cache_mutex);
-
return 0;
-fail:
- while (r--) {
- kfree(cqm_rmid_ptrs[r]);
- cqm_rmid_ptrs[r] = NULL;
- }

- kfree(cqm_rmid_ptrs);
- cqm_rmid_ptrs = NULL;
+fail:
+ cqm_cleanup();
return -ENOMEM;
}

@@ -1313,16 +1321,6 @@ static const struct x86_cpu_id intel_cqm
{}
};

-static void cqm_cleanup(void)
-{
- int r = cqm_max_rmid;
-
- while (r--)
- kfree(cqm_rmid_ptrs[r]);
-
- kfree(cqm_rmid_ptrs);
-}
-
static int __init intel_cqm_init(void)
{
char *str = NULL, scale[20];
@@ -1389,10 +1387,10 @@ static int __init intel_cqm_init(void)
if (ret) {
pr_err("Intel CQM perf registration failed: %d\n", ret);
goto out;
- } else {
- pr_info("Intel CQM monitoring enabled\n");
}

+ pr_info("Intel CQM monitoring enabled\n");
+
/*
* Register the hot cpu notifier once we are sure cqm
* is enabled to avoid notifier leak.

2016-03-08 09:26:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init

On Thu, 3 Mar 2016, Thomas Gleixner wrote:
> On Thu, 3 Mar 2016, Vikas Shivappa wrote:
>
> >
> >
> > On Wed, 2 Mar 2016, Thomas Gleixner wrote:
> >
> > > On Wed, 2 Mar 2016, Vikas Shivappa wrote:
> > > > + if (cqm_enabled && mbm_enabled)
> > > > + intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
> > > > + else if (!cqm_enabled && mbm_enabled)
> > > > + intel_cqm_events_group.attrs = intel_mbm_events_attr;
> > > > + else if (cqm_enabled && !mbm_enabled)
> > > > + intel_cqm_events_group.attrs = intel_cqm_events_attr;
> > > > +
> > > > ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> > > > if (ret) {
> > > > pr_err("Intel CQM perf registration failed: %d\n", ret);
> > > > goto out;
> > >
> > > So what cleans up mbm_local and mbm_total in that case?
> >
> > I put all the cleanup code in the cqm_cleanup .. - please see copy below
>
> Ok, missed that.
>
> > @@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
> > kfree(cqm_rmid_ptrs[r]);
> >
> > kfree(cqm_rmid_ptrs);
> > + kfree(mbm_local);
> > + kfree(mbm_total);
> > + mbm_enabled = false;
> > + cqm_enabled = false;
> > +}

After looking at it closely, you really need a separate mbm_cleanup()
function, so you can utilize it from mbm_init() and in the exit path.

Thanks,

tglx

2016-03-08 19:35:51

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 2/6] x86,perf/cqm: Fix cqm memory leak and notifier leak



On Tue, 8 Mar 2016, Thomas Gleixner wrote:

> On Wed, 2 Mar 2016, Vikas Shivappa wrote:
>
> Please fix the subject line prefix: "x86/perf/intel/cqm:"

Will fix..

>
>> Fixes the hotcpu notifier leak and other global variable memory leaks
>> during cqm(cache quality of service monitoring) initialization.
>>
>> Reviewed-by: Tony Luck <[email protected]>
>> Signed-off-by: Vikas Shivappa <[email protected]>
>> ---
>>
>> Fixed the memory leak for cqm_rmid_ptrs as per Thomas feedback.
>>
>> arch/x86/kernel/cpu/perf_event_intel_cqm.c | 34 ++++++++++++++++++++++++------
>> 1 file changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> index e6be335..37a93fa 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
>> @@ -250,10 +250,13 @@ static int intel_cqm_setup_rmid_cache(void)
>>
>> return 0;
>> fail:
>> - while (r--)
>> + while (r--) {
>> kfree(cqm_rmid_ptrs[r]);
>> + cqm_rmid_ptrs[r] = NULL;
>> + }
>>
>> kfree(cqm_rmid_ptrs);
>> + cqm_rmid_ptrs = NULL;
>
> This partial rollback is crap. Why can't you utilize cqm_cleanup() ?
> Performance certainly is not an issue here.
>
>> return -ENOMEM;
>> }
>>
>> @@ -1320,9 +1323,19 @@ static const struct x86_cpu_id intel_cqm_match[] = {
>> {}
>> };
>>
>> +static void cqm_cleanup(void)
>> +{
>> + int r = cqm_max_rmid;
>> +
>> + while (r--)
>> + kfree(cqm_rmid_ptrs[r]);
>> +
>> + kfree(cqm_rmid_ptrs);
>> +}
>> +
>> static int __init intel_cqm_init(void)
>> {
>> - char *str, scale[20];
>> + char *str = NULL, scale[20];
>> int i, cpu, ret;
>>
>> if (!x86_match_cpu(intel_cqm_match))
>> @@ -1382,16 +1395,25 @@ static int __init intel_cqm_init(void)
>> cqm_pick_event_reader(i);
>> }
>>
>> - __perf_cpu_notifier(intel_cqm_cpu_notifier);
>> -
>> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>> - if (ret)
>> + if (ret) {
>> pr_err("Intel CQM perf registration failed: %d\n", ret);
>> - else
>> + goto out;
>> + } else {
>> pr_info("Intel CQM monitoring enabled\n");
>> + }
>>
>> + /*
>> + * Register the hot cpu notifier once we are sure cqm
>> + * is enabled to avoid notifier leak.
>> + */
>> + __perf_cpu_notifier(intel_cqm_cpu_notifier);
>> out:
>> cpu_notifier_register_done();
>> + if (ret) {
>> + kfree(str);
>> + cqm_cleanup();
>
> This is still broken. If intel_cqm_setup_rmid_cache() fails, you clear
> cqm_rmid_ptrs and then jump to out. ret is -ENOMEM, so you will call
> cqm_cleanup which will dereference cqm_rmid_ptrs .....
>
> Find below a delta patch, which fixes that proper and safe.

Thanks for sending it. Will resend the patch with your fix.

-Vikas

>
> Thanks,
>
> tglx
>
> 8<----------------
>
> arch/x86/events/intel/cqm.c | 40 +++++++++++++++++++---------------------
> 1 file changed, 19 insertions(+), 21 deletions(-)
>
> --- a/arch/x86/events/intel/cqm.c
> +++ b/arch/x86/events/intel/cqm.c
> @@ -211,6 +211,20 @@ static void __put_rmid(u32 rmid)
> list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
> }
>
> +static void cqm_cleanup(void)
> +{
> + int i;
> +
> + if (!cqm_rmid_ptrs)
> + return;
> +
> + for (i = 0; i < cqm_max_rmid; i++)
> + kfree(cqm_rmid_ptrs[i]);
> +
> + kfree(cqm_rmid_ptrs);
> + cqm_rmid_ptrs = NULL;
> +}
> +
> static int intel_cqm_setup_rmid_cache(void)
> {
> struct cqm_rmid_entry *entry;
> @@ -218,7 +232,7 @@ static int intel_cqm_setup_rmid_cache(vo
> int r = 0;
>
> nr_rmids = cqm_max_rmid + 1;
> - cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
> + cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
> nr_rmids, GFP_KERNEL);
> if (!cqm_rmid_ptrs)
> return -ENOMEM;
> @@ -247,16 +261,10 @@ static int intel_cqm_setup_rmid_cache(vo
> mutex_lock(&cache_mutex);
> intel_cqm_rotation_rmid = __get_rmid();
> mutex_unlock(&cache_mutex);
> -
> return 0;
> -fail:
> - while (r--) {
> - kfree(cqm_rmid_ptrs[r]);
> - cqm_rmid_ptrs[r] = NULL;
> - }
>
> - kfree(cqm_rmid_ptrs);
> - cqm_rmid_ptrs = NULL;
> +fail:
> + cqm_cleanup();
> return -ENOMEM;
> }
>
> @@ -1313,16 +1321,6 @@ static const struct x86_cpu_id intel_cqm
> {}
> };
>
> -static void cqm_cleanup(void)
> -{
> - int r = cqm_max_rmid;
> -
> - while (r--)
> - kfree(cqm_rmid_ptrs[r]);
> -
> - kfree(cqm_rmid_ptrs);
> -}
> -
> static int __init intel_cqm_init(void)
> {
> char *str = NULL, scale[20];
> @@ -1389,10 +1387,10 @@ static int __init intel_cqm_init(void)
> if (ret) {
> pr_err("Intel CQM perf registration failed: %d\n", ret);
> goto out;
> - } else {
> - pr_info("Intel CQM monitoring enabled\n");
> }
>
> + pr_info("Intel CQM monitoring enabled\n");
> +
> /*
> * Register the hot cpu notifier once we are sure cqm
> * is enabled to avoid notifier leak.
>

2016-03-08 19:36:29

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/mbm: Intel Memory B/W Monitoring enumeration and init



On Tue, 8 Mar 2016, Thomas Gleixner wrote:

> On Thu, 3 Mar 2016, Thomas Gleixner wrote:
>> On Thu, 3 Mar 2016, Vikas Shivappa wrote:
>>
>>>
>>>
>>> On Wed, 2 Mar 2016, Thomas Gleixner wrote:
>>>
>>>> On Wed, 2 Mar 2016, Vikas Shivappa wrote:
>>>>> + if (cqm_enabled && mbm_enabled)
>>>>> + intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
>>>>> + else if (!cqm_enabled && mbm_enabled)
>>>>> + intel_cqm_events_group.attrs = intel_mbm_events_attr;
>>>>> + else if (cqm_enabled && !mbm_enabled)
>>>>> + intel_cqm_events_group.attrs = intel_cqm_events_attr;
>>>>> +
>>>>> ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
>>>>> if (ret) {
>>>>> pr_err("Intel CQM perf registration failed: %d\n", ret);
>>>>> goto out;
>>>>
>>>> So what cleans up mbm_local and mbm_total in that case?
>>>
>>> I put all the cleanup code in the cqm_cleanup .. - please see copy below
>>
>> Ok, missed that.
>>
>>> @@ -1331,6 +1427,39 @@ static void cqm_cleanup(void)
>>> kfree(cqm_rmid_ptrs[r]);
>>>
>>> kfree(cqm_rmid_ptrs);
>>> + kfree(mbm_local);
>>> + kfree(mbm_total);
>>> + mbm_enabled = false;
>>> + cqm_enabled = false;
>>> +}
>
> After looking at it closely, you really need a separate mbm_cleanup()
> function, so you can utilize it from mbm_init() and in the exit path.

Will fix and separate the mbm and cqm cleanups..
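
Roughly along these lines (untested sketch, just to show the split):

static void mbm_cleanup(void)
{
	/* kfree(NULL) is a no-op, so this is safe on partial init failure */
	kfree(mbm_local);
	mbm_local = NULL;
	kfree(mbm_total);
	mbm_total = NULL;
	mbm_enabled = false;
}

intel_mbm_init() would then call mbm_cleanup() on its own error path instead
of open-coding the kfrees, and the exit/cleanup path would call it as well.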

Thanks,
Vikas

>
> Thanks,
>
> tglx
>

2016-03-10 00:17:27

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86,perf/cqm: Fix cqm handling of grouping events into a cache_group



On Mon, 7 Mar 2016, Peter Zijlstra wrote:

> On Tue, Mar 01, 2016 at 03:48:23PM -0800, Vikas Shivappa wrote:
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -121,6 +121,7 @@ struct hw_perf_event {
>> struct { /* intel_cqm */
>> int cqm_state;
>> u32 cqm_rmid;
>> + bool is_group_event;
>> struct list_head cqm_events_entry;
>> struct list_head cqm_groups_entry;
>> struct list_head cqm_group_entry;
>
> Please, no 'bool' in structures.

Will fix..
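
Something like this in the perf_event.h hunk, i.e. just turning the flag
into an int (sketch):

	struct { /* intel_cqm */
		int			cqm_state;
		u32			cqm_rmid;
		int			is_group_event;
		struct list_head	cqm_events_entry;
		struct list_head	cqm_groups_entry;
		struct list_head	cqm_group_entry;
	};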

thanks,
vikas

>

2016-03-10 22:46:41

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management



On Tue, 8 Mar 2016, Peter Zijlstra wrote:

> On Mon, Mar 07, 2016 at 11:27:26PM +0000, Luck, Tony wrote:
>>>> + bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
>>>> + do_div(bytes, diff_time);
>>>> + mbm_current->bandwidth = bytes;
>>>> + mbm_current->interval_bytes = 0;
>>>> + mbm_current->interval_start = cur_time;
>>>> + }
>>>>> +
>>>> + return mbm_current;
>>>> +}
>>>
>>> How does the above time tracking deal with the event not actually having
>>> been scheduled the whole time?
>>
>> That's been the topic of a few philosophical debates ... what exactly are
>> we trying to say when we report that a process has a "memory bandwidth"
>> of, say, 1523 MBytes/s? We need to know both the amount of data moved
>> and to pick an interval to measure and divide by. Does it make a difference
>> whether the process voluntarily gave up the cpu for some part of the interval
>> (by blocking on I/O)? Or did the scheduler time-slice it out to run other jobs?
>>
>> The above code gives the average bandwidth across the last interval
>> (with a minimum interval size of 100ms to avoid craziness with rounding
>> errors on exceptionally tiny intervals). Some folks apparently want to get
>> a "rate" directly from perf. I think many folks will find the "bytes" counters
>> more helpful (where they control the sample interval with the '-I' flag to the perf
>> utility).
>
> So why didn't any of that make it into the Changelog? This is very much
> different from any other perf driver, at the very least this debate
> should have been mentioned and the choice defended.
>
> Also, why are we doing the time tracking and divisions at all? Can't we
> simply report the number of bytes transferred and let userspace sort out
> the rest?
>
> Userspace is provided the time the event was enabled, the time the event
> was running and it can fairly trivially obtain walltime if it so desires
> and then it can compute whatever definition of bandwidth it wants to
> use.


We had discussions on removing the bw events. Discussed this with Tony and will
update the patch by removing them, so this code will be removed.
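
With the bw events gone, a tool that still wants a rate can derive it from
the bytes event plus the enabled/running times perf already exports, as
Peter points out. A rough userspace sketch (assumes an already-opened
intel_cqm local_bytes fd using PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING; error handling trimmed):

#include <stdint.h>
#include <unistd.h>

struct cqm_read {
	uint64_t value;		/* bytes, as counted by the driver */
	uint64_t time_enabled;	/* ns the event was enabled */
	uint64_t time_running;	/* ns the event was scheduled in */
};

/* MBytes/s averaged over the time the event was actually running */
static double local_bw_mbytes(int fd)
{
	struct cqm_read r;

	if (read(fd, &r, sizeof(r)) != (ssize_t)sizeof(r) || !r.time_running)
		return 0.0;

	return ((double)r.value / 1e6) / ((double)r.time_running / 1e9);
}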

thanks,
Vikas

>
>
>

2016-03-10 22:50:06

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86/mbm: Memory bandwidth monitoring event management



On Mon, 7 Mar 2016, Peter Zijlstra wrote:

> On Tue, Mar 01, 2016 at 03:48:26PM -0800, Vikas Shivappa wrote:
>
>> Lot of the scheduling code was taken out from Tony's patch and a 3-4
>> lines of change were added in the intel_cqm_event_read. Since the timer
>> is no more added on every context switch this change was made.
>
> Is this here to confuse people or is there some actual information in
> it?

Will remove the comment.

>
>> +/*
>> + * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
>> + * value
>> + */
>> +#define MBM_CNTR_MAX 0xffffff
>
> #define MBM_CNTR_WIDTH 24
> #define MBM_CNTR_MAX ((1U << MBM_CNTR_WIDTH) - 1)
>
>
Will fix

>> #define QOS_L3_OCCUP_EVENT_ID (1 << 0)
>> +/*
>> + * MBM Event IDs as defined in SDM section 17.15.5
>> + * Event IDs are used to program EVTSEL MSRs before reading mbm event counters
>> + */
>> +enum mbm_evt_type {
>> + QOS_MBM_TOTAL_EVENT_ID = 0x02,
>> + QOS_MBM_LOCAL_EVENT_ID,
>> + QOS_MBM_TOTAL_BW_EVENT_ID,
>> + QOS_MBM_LOCAL_BW_EVENT_ID,
>> +};
>
> QOS_L3_*_EVENT_ID is a define, these are an enum. Rather inconsistent.
>

Will be changing all of them to #define, and we are also removing the bw
events..
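
Rough idea of what the IDs become (sketch, matching the existing
QOS_L3_OCCUP_EVENT_ID style, with the bw IDs dropped):

#define QOS_L3_OCCUP_EVENT_ID	(1 << 0)
#define QOS_MBM_TOTAL_EVENT_ID	0x02
#define QOS_MBM_LOCAL_EVENT_ID	0x03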

>> struct rmid_read {
>> u32 rmid;
>
> Hole, you could've filled with the enum (which ends up being an int I
> think).
>
>> atomic64_t value;
>> + enum mbm_evt_type evt_type;
>> };
>
>> +static bool is_mbm_event(int e)
>
> You had an enum type, you might as well use it.

the enum will be gone..

>
>> +{
>> + return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_BW_EVENT_ID);
>> +}
>>
>
>> +static struct sample *update_sample(unsigned int rmid,
>> + enum mbm_evt_type evt_type, int first)
>> +{
>> + ktime_t cur_time;
>> + struct sample *mbm_current;
>> + u32 vrmid = rmid_2_index(rmid);
>> + u64 val, bytes, diff_time;
>> + u32 eventid;
>> +
>> + if (evt_type & QOS_MBM_LOCAL_EVENT_MASK) {
>> + mbm_current = &mbm_local[vrmid];
>> + eventid = QOS_MBM_LOCAL_EVENT_ID;
>> + } else {
>> + mbm_current = &mbm_total[vrmid];
>> + eventid = QOS_MBM_TOTAL_EVENT_ID;
>> + }
>> +
>> + cur_time = ktime_get();
>> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
>> + rdmsrl(MSR_IA32_QM_CTR, val);
>> + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>> + return mbm_current;
>
>> + val &= MBM_CNTR_MAX;
>
>> + if (val < mbm_current->prev_msr)
>> + bytes = MBM_CNTR_MAX - mbm_current->prev_msr + val + 1;
>> + else
>> + bytes = val - mbm_current->prev_msr;
>
> Would not something like:
>
> shift = 64 - MBM_CNTR_WIDTH;
>
> bytes = (val << shift) - (prev << shift);
> bytes >>= shift;
>
> be less obtuse? (and consistent with how every other perf update
> function does it).

Will fix.
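
i.e. roughly this shape (untested sketch; mbm_bytes_delta() is just an
illustrative helper name):

#define MBM_CNTR_WIDTH	24
#define MBM_CNTR_MAX	((1U << MBM_CNTR_WIDTH) - 1)

static u64 mbm_bytes_delta(u64 val, u64 prev)
{
	int shift = 64 - MBM_CNTR_WIDTH;
	u64 bytes;

	/* let the 24-bit counter wrap fall out of the u64 subtraction */
	bytes = (val << shift) - (prev << shift);
	bytes >>= shift;

	return bytes * cqm_l3_scale;
}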

>
> What guarantee is there we didn't wrap multiple times? Doesn't that
> deserve a comment?


This is taken care of in the next patch 0006. I have put a comment there, at
the definition of the timers, that h/w guarantees an overflow won't happen
within 1s, but can add another comment here in patch 0006.

>
>> + bytes *= cqm_l3_scale;
>> +
>> + mbm_current->total_bytes += bytes;
>> + mbm_current->interval_bytes += bytes;
>> + mbm_current->prev_msr = val;
>> + diff_time = ktime_ms_delta(cur_time, mbm_current->interval_start);
>
> Here we do a / 1e6
>
>> +
>> + /*
>> + * The b/w measured is really the most recent/current b/w.
>> + * We wait till enough time has passed to avoid
>> + * arithmetic rounding problems. Having it at >=100ms,
>> + * such errors would be <=1%.
>> + */
>> + if (diff_time > 100) {
>
> This could well be > 100e6 instead, avoiding the above division most of
> the time.
>
>> + bytes = mbm_current->interval_bytes * MSEC_PER_SEC;
>> + do_div(bytes, diff_time);
>> + mbm_current->bandwidth = bytes;
>> + mbm_current->interval_bytes = 0;
>> + mbm_current->interval_start = cur_time;
>> + }
>> +
>> + return mbm_current;
>> +}
>
> How does the above time tracking deal with the event not actually having
> been scheduled the whole time?

Will be removing the bw events - so should address all three comments above.

>
>
>> +static void init_mbm_sample(u32 rmid, enum mbm_evt_type evt_type)
>> +{
>> + struct rmid_read rr = {
>> + .value = ATOMIC64_INIT(0),
>> + };
>> +
>> + rr.rmid = rmid;
>> + rr.evt_type = evt_type;
>
> That's just sad.. put those two in the struct init as well.

Will fix
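
i.e. roughly (sketch):

	struct rmid_read rr = {
		.rmid	  = rmid,
		.evt_type = evt_type,
		.value	  = ATOMIC64_INIT(0),
	};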

thanks,
vikas
>
>> + /* on each socket, init sample */
>> + on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
>> +}
>