2014-01-03 20:35:25

by Waskiewicz Jr, Peter P

Subject: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

This patchset adds support for the new Cache QoS Monitoring (CQM)
feature found in future Intel Xeon processors.

CQM allows a process, or set of processes, to be tracked by the CPU
to determine the cache usage of that task group. Software can then
extract this data from the CPU and report cache usage and occupancy
for a particular process, or group of processes.

More information about Cache QoS Monitoring can be found in the
Intel (R) x86 Architecture Software Developer Manual, section 17.14.

This series is also laying the framework for additional Platform
QoS features in future Intel Xeon processors.

The CPU features themselves are relatively straight-forward, but
the presentation of the data is less straight-forward. Since this
tracks cache usage and occupancy per process (by swapping Resource
Monitor IDs, or RMIDs, when processes are rescheduled), perf would
not be a good fit for this data, which does not report on a
per-process level. Therefore, a new cgroup subsystem, cacheqos, has
been added. This operates very similarly to the cpu and cpuacct
cgroup subsystems, where tasks can be grouped into sub-leaves of the
root-level cgroup.

Peter P Waskiewicz Jr (4):
x86: Add support for Cache QoS Monitoring (CQM) detection
x86: Add Cache QoS Monitoring support to x86 perf uncore
cgroup: Add new cacheqos cgroup subsys to support Cache QoS Monitoring
Documentation: Add documentation for cacheqos cgroup


2014-01-03 20:35:36

by Waskiewicz Jr, Peter P

Subject: [PATCH 1/4] x86: Add support for Cache QoS Monitoring (CQM) detection

This patch adds support for the new Cache QoS Monitoring (CQM)
feature found in future Intel Xeon processors. It adds the new
values for tracking CQM resources to the cpuinfo_x86 structure,
plus the CPUID detection routines for CQM.

CQM allows a process, or set of processes, to be tracked by the CPU
to determine the cache usage of that task group. Software can then
extract this data from the CPU and report cache usage and occupancy
for a particular process, or group of processes.

More information about Cache QoS Monitoring can be found in the
Intel (R) x86 Architecture Software Developer Manual, section 17.14.

Signed-off-by: Peter P Waskiewicz Jr <[email protected]>
---
arch/x86/configs/x86_64_defconfig | 1 +
arch/x86/include/asm/cpufeature.h | 9 ++++++++-
arch/x86/include/asm/processor.h | 3 +++
arch/x86/kernel/cpu/common.c | 39 +++++++++++++++++++++++++++++++++++++++
4 files changed, 51 insertions(+), 1 deletion(-)
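
For reference, a minimal user-space sketch (not part of this patch) of the
same CPUID leaf 0x0000000F enumeration that the get_cpu_cap() hunk below
performs. It assumes a GCC toolchain (__cpuid_count() from <cpuid.h>); the
sub-leaf layout follows the SDM description of the QoS sub-leaves:

/*
 * Illustrative only: enumerate the CQM CPUID sub-leaves from user space.
 */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Sub-leaf 0: EBX = max RMID for the package, EDX bit 1 = LLC QoS */
	__cpuid_count(0x0000000F, 0, eax, ebx, ecx, edx);
	if (!(edx & (1 << 1))) {
		printf("LLC QoS monitoring not supported\n");
		return 0;
	}
	printf("max RMID (package): %u\n", ebx);

	/*
	 * Sub-leaf 1: EDX bit 0 = LLC occupancy monitoring, ECX = max RMID
	 * for this resource, EBX = factor to convert counter values to bytes.
	 */
	__cpuid_count(0x0000000F, 1, eax, ebx, ecx, edx);
	if (edx & 1)
		printf("occupancy: max RMID %u, scale %u bytes\n", ecx, ebx);

	return 0;
}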

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index c1119d4..8e98ed4 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -14,6 +14,7 @@ CONFIG_LOG_BUF_SHIFT=18
CONFIG_CGROUPS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
+CONFIG_CGROUP_CACHEQOS=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_SCHED=y
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 89270b4..5dd59a2 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -8,7 +8,7 @@
#include <asm/required-features.h>
#endif

-#define NCAPINTS 10 /* N 32-bit words worth of info */
+#define NCAPINTS 12 /* N 32-bit words worth of info */
#define NBUGINTS 1 /* N 32-bit bug flags */

/*
@@ -216,10 +216,17 @@
#define X86_FEATURE_ERMS (9*32+ 9) /* Enhanced REP MOVSB/STOSB */
#define X86_FEATURE_INVPCID (9*32+10) /* Invalidate Processor Context ID */
#define X86_FEATURE_RTM (9*32+11) /* Restricted Transactional Memory */
+#define X86_FEATURE_CQM (9*32+12) /* Cache QoS Monitoring */
#define X86_FEATURE_RDSEED (9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX (9*32+19) /* The ADCX and ADOX instructions */
#define X86_FEATURE_SMAP (9*32+20) /* Supervisor Mode Access Prevention */

+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:0 (edx), word 10 */
+#define X86_FEATURE_CQM_LLC (10*32+ 1) /* LLC QoS if 1 */
+
+/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 11 */
+#define X86_FEATURE_CQM_OCCUP_LLC (11*32+ 0) /* LLC occupancy monitoring if 1 */
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4..3892281 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -110,6 +110,9 @@ struct cpuinfo_x86 {
/* in KB - valid for CPUS which support this call: */
int x86_cache_size;
int x86_cache_alignment; /* In bytes */
+ /* Cache QoS architectural values: */
+ int x86_cache_max_rmid; /* max index */
+ int x86_cache_occ_scale; /* scale to bytes */
int x86_power;
unsigned long loops_per_jiffy;
/* cpuid returned max cores value: */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6abc172..f18bc43 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -626,6 +626,30 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_capability[9] = ebx;
}

+ /* Additional Intel-defined flags: level 0x0000000F */
+ if (c->cpuid_level >= 0x0000000F) {
+ u32 eax, ebx, ecx, edx;
+
+ /* QoS sub-leaf, EAX=0Fh, ECX=0 */
+ cpuid_count(0x0000000F, 0, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[10] = edx;
+ if (cpu_has(c, X86_FEATURE_CQM_LLC)) {
+ /* will be overridden if occupancy monitoring exists */
+ c->x86_cache_max_rmid = ebx;
+
+ /* QoS sub-leaf, EAX=0Fh, ECX=1 */
+ cpuid_count(0x0000000F, 1, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[11] = edx;
+ if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ c->x86_cache_max_rmid = ecx;
+ c->x86_cache_occ_scale = ebx;
+ }
+ } else {
+ c->x86_cache_max_rmid = -1;
+ c->x86_cache_occ_scale = -1;
+ }
+ }
+
/* AMD-defined flags: level 0x80000001 */
xlvl = cpuid_eax(0x80000000);
c->extended_cpuid_level = xlvl;
@@ -814,6 +838,20 @@ static void generic_identify(struct cpuinfo_x86 *c)
detect_nopl(c);
}

+static void x86_init_cache_qos(struct cpuinfo_x86 *c)
+{
+ /*
+ * The heavy lifting of max_rmid and cache_occ_scale is handled
+ * in get_cpu_cap(). Here we just set the max_rmid for the boot_cpu
+ * in case CQM bits really aren't there in this CPU.
+ */
+ if (c != &boot_cpu_data) {
+ boot_cpu_data.x86_cache_max_rmid =
+ min(boot_cpu_data.x86_cache_max_rmid,
+ c->x86_cache_max_rmid);
+ }
+}
+
/*
* This does the hard work of actually picking apart the CPU stuff...
*/
@@ -903,6 +941,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)

init_hypervisor(c);
x86_init_rdrand(c);
+ x86_init_cache_qos(c);

/*
* Clear/Set all flags overriden by options, need do it
--
1.8.3.1

2014-01-03 20:35:50

by Waskiewicz Jr, Peter P

Subject: [PATCH 4/4] Documentation: Add documentation for cacheqos cgroup

This patch adds the documentation for the new cacheqos cgroup
subsystem. It provides an overview of how the new subsystem
works, how Cache QoS Monitoring works in the x86 architecture,
and how everything is tied together between the hardware and the
cgroup software stack.

Signed-off-by: Peter P Waskiewicz Jr <[email protected]>
---
Documentation/cgroups/00-INDEX | 2 +
Documentation/cgroups/cacheqos.txt | 166 +++++++++++++++++++++++++++++++++++++
2 files changed, 168 insertions(+)
create mode 100644 Documentation/cgroups/cacheqos.txt
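
As a usage illustration (not part of this patch), a small user-space reader
following the interface described in the document below. It assumes the
hierarchy is mounted at /sys/fs/cgroup/cacheqos and that a group "g1" already
exists with monitoring enabled, as shown in the examples in cacheqos.txt:

#include <stdio.h>

static int print_file(const char *label, const char *path)
{
	char line[64];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return -1;
	}
	/* occupancy holds one value; the persocket file has one line per socket */
	while (fgets(line, sizeof(line), f))
		printf("%s: %s", label, line);
	fclose(f);
	return 0;
}

int main(void)
{
	print_file("total bytes",
		   "/sys/fs/cgroup/cacheqos/g1/cacheqos.occupancy");
	print_file("per-socket bytes",
		   "/sys/fs/cgroup/cacheqos/g1/cacheqos.occupancy_persocket");
	return 0;
}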

diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
index bc461b6..055655d 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroups/00-INDEX
@@ -2,6 +2,8 @@
- this file
blkio-controller.txt
- Description for Block IO Controller, implementation and usage details.
+cacheqos.txt
+ - Description for Cache QoS Monitoring; implementation and usage details
cgroups.txt
- Control Groups definition, implementation details, examples and API.
cpuacct.txt
diff --git a/Documentation/cgroups/cacheqos.txt b/Documentation/cgroups/cacheqos.txt
new file mode 100644
index 0000000..b7b85ce
--- /dev/null
+++ b/Documentation/cgroups/cacheqos.txt
@@ -0,0 +1,166 @@
+Cache QoS Monitoring Controller
+-------------------------------
+
+1. Overview
+===========
+
+The Cache QoS Monitoring controller is used to group tasks using cgroups and
+monitor the CPU cache usage and occupancy of the grouped tasks. This
+monitoring requires hardware support, especially since cache optimization
+and usage models will vary between CPU architectures.
+
+The Cache QoS Monitoring controller supports multi-hierarchy groups. A
+monitoring group accumulates the cache usage of all of its child groups and
+the tasks directly present in its group.
+
+Monitoring groups can be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -ocacheqos none /sys/fs/cgroup/cacheqos
+
+With the above step, the initial or the parent monitoring group becomes
+visible at /sys/fs/cgroup/cacheqos. At bootup, this group includes all the
+tasks in the system. /sys/fs/cgroup/cacheqos/tasks lists the tasks in this
+cgroup. Each file in the cgroup is described in greater detail below.
+
+
+2. Basic usage
+==============
+
+New monitoring groups can be created under the parent group
+/sys/fs/cgroup/cacheqos.
+
+# cd /sys/fs/cgroup/cacheqos
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it. At this point, the group is ready to be monitored.
+However, since monitoring requires hardware support to identify tasks,
+the hardware mechanisms are a finite resource. New monitoring groups are
+therefore not activated by default to monitor their respective task groups.
+
+To enable a task group for hardware monitoring:
+
+# cd /sys/fs/cgroup/cacheqos
+# mkdir g1
+# echo $$ > g1/tasks
+# echo 1 > g1/cacheqos.monitor_cache
+
+This will enable monitoring for the tasks in the g1 monitoring group. Note
+that the root monitoring group is always enabled and cannot be turned off.
+
+
+3. Overview of files
+====================
+
+- cacheqos.monitor_cache:
+ Controls whether the monitoring group is enabled. This is an R/W
+ field, and expects 0 for disable, 1 for enable.
+
+ If no available hardware resources are left for monitoring, writing a
+ 1 to this file will result in -EAGAIN being returned (Resource
+ temporarily unavailable).
+
+- cacheqos.occupancy:
+ This is a read-only field. It returns the total cache occupancy in
+ bytes of the task group for all CPUs it has run on.
+
+- cacheqos.occupancy_percent:
+ This is a read-only field. It returns the task group's total cache
+ occupancy as a percentage of the total cache size, across all CPUs it
+ has run on. The percentage is based on the size of the cache, which
+ can vary from CPU to CPU.
+
+- cacheqos.occupancy_persocket:
+ This is a read-only field. It returns the total cache occupancy used
+ by the task group, broken down per CPU socket (usually per NUMA node).
+
+- cacheqos.occupancy_percent_persocket:
+ This is a read-only field. It returns the total cache occupancy used
+ by the task group, broken down per CPU socket (usually per NUMA node).
+ Each socket's occupancy is presented as a percentage of the total
+ cache.
+
+4. Adding new architectures
+===========================
+
+Currently Cache QoS Monitoring support only exists in modern Intel Xeon
+processors. Due to this, the Kconfig option for Cache QoS Monitoring depends
+on X86_64 or X86. If another architecture supports cache monitoring, then
+a few functions need to be implemented by the architecture, and that
+architecture needs to be added to some #if clauses for support. These are:
+
+- init/Kconfig
+ Add the new architecture to the dependency list
+
+- kernel/sched/cacheqos.c
+ Add the new architecture to the #if condition that guards the
+ generic cacheqos_late_init() stub, so the stub is only compiled when
+ no architecture-specific implementation is built:
+
+ #if !defined(CONFIG_X86) && !defined(CONFIG_X86_64)
+ static int __init cacheqos_late_init(void)
+
+The following functions need to be implemented by the architecture:
+
+- void cacheqos_map_schedule_out(void);
+ This function is called by the scheduler when swapping out a task from
+ a CPU. This would be where the CPU architecture code to stop monitoring
+ for a particular task would be executed.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+- void cacheqos_map_schedule_in(struct cacheqos *);
+ This function is called by the scheduler when swapping a task into a
+ CPU core. This would be where the CPU architecture code to start
+ monitoring a particular task would be executed.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+- void cacheqos_read(void *);
+ This function is called by the cacheqos cgroup subsystem when
+ collating the cache usage data. This would be where the CPU
+ architecture code to pull information for a particular monitoring
+ unit would exist.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+- int __init cacheqos_late_init(void); (late_initcall)
+ This function needs to be implemented as a late_initcall for the
+ specific architecture. The late invocation ensures that CPU feature
+ detection has completed, which happens after the cgroup subsystem is
+ started in the kernel boot sequence. Since the configuration of the
+ cacheqos cgroup depends on how many monitoring resources are
+ available, the root_cacheqos_group's subsys_info field cannot be
+ initialized until the CPU features are discovered.
+
+ This function's responsibility is to allocate the
+ root_cacheqos_group.subsys_info field and initialize these fields:
+ - cache_max_rmid: Maximum resource monitoring ID on this CPU
+ - cache_occ_scale: This is used to scale the occupancy data
+ being collected, meant to help compress the
+ values being stored in the CPU. This may
+ exist or not in a particular architecture.
+ - cache_size: Size of the cache being monitored, used for the
+ percentage reporting.
+
+ Refer to arch/x86/kernel/cpu/perf_event_intel_uncore.c for an example.
+
+
+5. Intel-specific implementation
+================================
+
+Intel Xeon processors implement Cache QoS Monitoring using Resource Monitoring
+Identifiers, or RMIDs. When a task is scheduled on a CPU core, the RMID that
+is associated with that task (or group that task belongs to) is written to the
+IA32_PQR_ASSOC MSR for that CPU. This instructs the CPU to accumulate cache
+occupancy data while that task runs. When that task is scheduled out, the
+IA32_PQR_ASSOC MSR is written with 0, clearing the monitoring mechanism.
+
+To retrieve the monitoring data, the RMID for the task group being read is
+used to build a configuration map for the IA32_QM_EVTSEL MSR. Once the map is
+written to that MSR, the hardware places the result in the IA32_QM_CTR MSR,
+which software then reads. That data is multiplied by the cache_occ_scale,
+which is read from the CPUID sub-leaf during CPU initialization, and stored.
+
+For details on the implementation, please refer to the Intel Software
+Developer's Manual, Volume 3, Section 17.14: Cache Quality of Service Monitoring
--
1.8.3.1

2014-01-03 20:35:48

by Waskiewicz Jr, Peter P

Subject: [PATCH 3/4] cgroup: Add new cacheqos cgroup subsys to support Cache QoS Monitoring

This patch adds a new cgroup subsystem, named cacheqos. This cgroup
controller is intended to manage task groups to track cache occupancy
and usage of a CPU.

The cacheqos subsystem operates very similarly to the cpuacct
subsystem. Tasks can be grouped into different child subgroups,
and have separate cache occupancy accounting for each of the
subgroups. See Documentation/cgroups/cacheqos.txt for
more details.

The patch also adds the Kconfig option for enabling/disabling the
CGROUP_CACHEQOS subsystem. As this CPU feature is currently found
only in Intel Xeon processors, the cgroup subsystem depends on X86.

Signed-off-by: Peter P Waskiewicz Jr <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 112 ++++++++
include/linux/cgroup_subsys.h | 4 +
include/linux/perf_event.h | 14 +
init/Kconfig | 10 +
kernel/sched/Makefile | 1 +
kernel/sched/cacheqos.c | 397 ++++++++++++++++++++++++++
kernel/sched/cacheqos.h | 59 ++++
7 files changed, 597 insertions(+)
create mode 100644 kernel/sched/cacheqos.c
create mode 100644 kernel/sched/cacheqos.h

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 29c2487..4d48e26 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -1633,6 +1633,118 @@ static struct intel_uncore_type *snb_msr_uncores[] = {
};
/* end of Sandy Bridge uncore support */

+#ifdef CONFIG_CGROUP_CACHEQOS
+
+/* needed for the cacheqos cgroup structs */
+#include "../../../kernel/sched/cacheqos.h"
+
+extern struct cacheqos root_cacheqos_group;
+static DEFINE_MUTEX(cqm_mutex);
+
+static int __init cacheqos_late_init(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ struct rmid_list_element *elem;
+ int i;
+
+ mutex_lock(&cqm_mutex);
+
+ if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC)) {
+ root_cacheqos_group.subsys_info =
+ kzalloc(sizeof(struct cacheqos_subsys_info), GFP_KERNEL);
+ if (!root_cacheqos_group.subsys_info) {
+ mutex_unlock(&cqm_mutex);
+ return -ENOMEM;
+ }
+
+ root_cacheqos_group.subsys_info->cache_max_rmid =
+ c->x86_cache_max_rmid;
+ root_cacheqos_group.subsys_info->cache_occ_scale =
+ c->x86_cache_occ_scale;
+ root_cacheqos_group.subsys_info->cache_size = c->x86_cache_size;
+ } else {
+ root_cacheqos_group.monitor_cache = false;
+ root_cacheqos_group.css.ss->disabled = 1;
+ mutex_unlock(&cqm_mutex);
+ return -ENODEV;
+ }
+
+ /* Populate the unused rmid list with all rmids. */
+ INIT_LIST_HEAD(&root_cacheqos_group.subsys_info->rmid_unused_fifo);
+ INIT_LIST_HEAD(&root_cacheqos_group.subsys_info->rmid_inuse_list);
+ elem = kzalloc(sizeof(*elem), GFP_KERNEL);
+ if (!elem)
+ return -ENOMEM;
+
+ elem->rmid = 0;
+ list_add_tail(&elem->list,
+ &root_cacheqos_group.subsys_info->rmid_inuse_list);
+ for (i = 1; i <= root_cacheqos_group.subsys_info->cache_max_rmid; i++) {
+ elem = kzalloc(sizeof(*elem), GFP_KERNEL);
+ if (!elem)
+ return -ENOMEM;
+
+ elem->rmid = i;
+ INIT_LIST_HEAD(&elem->list);
+ list_add_tail(&elem->list,
+ &root_cacheqos_group.subsys_info->rmid_unused_fifo);
+ }
+
+ /* go live on the root group */
+ root_cacheqos_group.monitor_cache = true;
+
+ mutex_unlock(&cqm_mutex);
+ return 0;
+}
+late_initcall(cacheqos_late_init);
+
+void cacheqos_map_schedule_out(void)
+{
+ /*
+ * cacheqos_map_schedule_in() will set the MSR correctly, but
+ * clearing the MSR here will prevent occupancy counts against this
+ * task during the context switch. In other words, this gives a
+ * "better" representation of what's happening in the cache.
+ */
+ wrmsrl(IA32_PQR_ASSOC, 0);
+}
+
+void cacheqos_map_schedule_in(struct cacheqos *cq)
+{
+ u64 map;
+
+ map = cq->rmid & IA32_RMID_PQR_MASK;
+ wrmsrl(IA32_PQR_ASSOC, map);
+}
+
+void cacheqos_read(void *arg)
+{
+ struct cacheqos *cq = arg;
+ u64 config;
+ u64 result = 0;
+ int cpu, node;
+
+ cpu = smp_processor_id();
+ node = cpu_to_node(cpu);
+ config = cq->rmid;
+ config = ((config & IA32_RMID_PQR_MASK) <<
+ IA32_QM_EVTSEL_RMID_POSITION) |
+ IA32_QM_EVTSEL_EVTID_READ_OCC;
+
+ wrmsrl(IA32_QM_EVTSEL, config);
+ rdmsrl(IA32_QM_CTR, result);
+
+ /* place result in the subsys_info node_results area for the caller */
+ if (result & IA32_QM_CTR_ERR)
+ result = -1;
+ else
+ result &= ~IA32_QM_CTR_ERR;
+
+ cq->subsys_info->node_results[node] =
+ result * cq->subsys_info->cache_occ_scale;
+}
+#endif /* CONFIG_CGROUP_CACHEQOS */
+
/* Nehalem uncore support */
static void nhm_uncore_msr_disable_box(struct intel_uncore_box *box)
{
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index b613ffd..14b97e4 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -50,6 +50,10 @@ SUBSYS(net_prio)
#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_HUGETLB)
SUBSYS(hugetlb)
#endif
+
+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_CACHEQOS)
+SUBSYS(cacheqos)
+#endif
/*
* DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..59eabf3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -54,6 +54,11 @@ struct perf_guest_info_callbacks {
#include <linux/perf_regs.h>
#include <asm/local.h>

+#ifdef CONFIG_CGROUP_CACHEQOS
+inline void cacheqos_sched_out(struct task_struct *task);
+inline void cacheqos_sched_in(struct task_struct *task);
+#endif /* CONFIG_CGROUP_CACHEQOS */
+
struct perf_callchain_entry {
__u64 nr;
__u64 ip[PERF_MAX_STACK_DEPTH];
@@ -676,6 +681,10 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
{
if (static_key_false(&perf_sched_events.key))
__perf_event_task_sched_in(prev, task);
+
+#ifdef CONFIG_CGROUP_CACHEQOS
+ cacheqos_sched_in(task);
+#endif /* CONFIG_CGROUP_CACHEQOS */
}

static inline void perf_event_task_sched_out(struct task_struct *prev,
@@ -685,6 +694,11 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,

if (static_key_false(&perf_sched_events.key))
__perf_event_task_sched_out(prev, next);
+
+#ifdef CONFIG_CGROUP_CACHEQOS
+ /* use outgoing task to see if cacheqos is active or not */
+ cacheqos_sched_out(prev);
+#endif /* CONFIG_CGROUP_CACHEQOS */
}

extern void perf_event_mmap(struct vm_area_struct *vma);
diff --git a/init/Kconfig b/init/Kconfig
index 4e5d96a..9619cdc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -905,6 +905,16 @@ config PROC_PID_CPUSET
depends on CPUSETS
default y

+config CGROUP_CACHEQOS
+ bool "Simple Cache QoS Monitoring cgroup subsystem"
+ depends on X86 || X86_64
+ help
+ Provides a simple Resource Controller for monitoring the
+ total cache occupancy by the tasks in a cgroup. This requires
+ hardware support to track cache usage.
+
+ Say N if unsure.
+
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
help
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7b62140..30aa883 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -18,3 +18,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CGROUP_CACHEQOS) += cacheqos.o
diff --git a/kernel/sched/cacheqos.c b/kernel/sched/cacheqos.c
new file mode 100644
index 0000000..1ce799e
--- /dev/null
+++ b/kernel/sched/cacheqos.c
@@ -0,0 +1,397 @@
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/cpumask.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/kernel_stat.h>
+#include <linux/err.h>
+
+#include "cacheqos.h"
+#include "sched.h"
+
+struct cacheqos root_cacheqos_group;
+static DEFINE_MUTEX(cacheqos_mutex);
+
+#if !defined(CONFIG_X86) && !defined(CONFIG_X86_64)
+static int __init cacheqos_late_init(void)
+{
+ /* No Cache QoS support on this architecture, disable the subsystem */
+ root_cacheqos_group.monitor_cache = false;
+ root_cacheqos_group.css.ss->disabled = 1;
+ return -ENODEV;
+}
+late_initcall(cacheqos_late_init);
+#endif
+
+inline void cacheqos_sched_out(struct task_struct *task)
+{
+ struct cacheqos *cq = task_cacheqos(task);
+ /*
+ * Assumption is that this thread is running on the logical processor
+ * from which the task is being scheduled out.
+ *
+ * As the task is scheduled out mapping goes back to default map.
+ */
+ if (cq->monitor_cache)
+ cacheqos_map_schedule_out();
+}
+
+inline void cacheqos_sched_in(struct task_struct *task)
+{
+ struct cacheqos *cq = task_cacheqos(task);
+ /*
+ * Assumption is that this thread is running on the logical processor
+ * of which this task is being scheduled onto.
+ *
+ * As the task is scheduled in, the cgroup's rmid is loaded
+ */
+ if (cq->monitor_cache)
+ cacheqos_map_schedule_in(cq);
+}
+
+static void cacheqos_adjust_children_rmid(struct cacheqos *cq)
+{
+ struct cgroup_subsys_state *css, *pos;
+ struct cacheqos *p_cq, *pos_cq;
+
+ css = &cq->css;
+ rcu_read_lock();
+
+ css_for_each_descendant_pre(pos, css) {
+ pos_cq = css_cacheqos(pos);
+ if (!pos_cq->monitor_cache) {
+ /* monitoring is disabled, so use the parent's RMID */
+ p_cq = parent_cacheqos(pos_cq);
+ spin_lock_irq(&pos_cq->lock);
+ pos_cq->rmid = p_cq->rmid;
+ spin_unlock_irq(&pos_cq->lock);
+ }
+ }
+ rcu_read_unlock();
+}
+
+static int cacheqos_move_rmid_to_unused_list(struct cacheqos *cq)
+{
+ struct rmid_list_element *elem;
+
+ /*
+ * Assumes only called when cq->rmid is valid (ie, it is on the
+ * inuse list) and cacheqos_mutex is held.
+ */
+ lockdep_assert_held(&cacheqos_mutex);
+ list_for_each_entry(elem, &cq->subsys_info->rmid_inuse_list, list) {
+ if (cq->rmid == elem->rmid) {
+ /* Move rmid from inuse to unused list */
+ list_del_init(&elem->list);
+ list_add_tail(&elem->list,
+ &cq->subsys_info->rmid_unused_fifo);
+ goto quick_exit;
+ }
+ }
+ return -ELIBBAD;
+
+quick_exit:
+ return 0;
+}
+
+static int cacheqos_deallocate_rmid(struct cacheqos *cq)
+{
+ struct cacheqos *cq_parent = parent_cacheqos(cq);
+ int err;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_move_rmid_to_unused_list(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+ /* assign parent's rmid to cgroup */
+ cq->monitor_cache = false;
+ cq->rmid = cq_parent->rmid;
+
+ /* Check for children using this cgroup's rmid, iterate */
+ cacheqos_adjust_children_rmid(cq);
+
+ mutex_unlock(&cacheqos_mutex);
+ return 0;
+}
+
+static int cacheqos_allocate_rmid(struct cacheqos *cq)
+{
+ struct rmid_list_element *elem;
+ struct list_head *item;
+
+ mutex_lock(&cacheqos_mutex);
+
+ if (list_empty(&cq->subsys_info->rmid_unused_fifo)) {
+ mutex_unlock(&cacheqos_mutex);
+ return -EAGAIN;
+ }
+
+ /* Move rmid from unused to inuse list */
+ item = cq->subsys_info->rmid_unused_fifo.next;
+ list_del_init(item);
+ list_add_tail(item, &cq->subsys_info->rmid_inuse_list);
+
+ /* assign rmid to cgroup */
+ elem = list_entry(item, struct rmid_list_element, list);
+ cq->rmid = elem->rmid;
+ cq->monitor_cache = true;
+
+ /* Check for children using this cgroup's rmid, iterate */
+ cacheqos_adjust_children_rmid(cq);
+
+ mutex_unlock(&cacheqos_mutex);
+
+ return 0;
+}
+
+/* create a new cacheqos cgroup */
+static struct cgroup_subsys_state *
+cacheqos_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct cacheqos *parent = css_cacheqos(parent_css);
+ struct cacheqos *cq;
+
+ if (!parent) {
+ /* cacheqos_late_init() will enable monitoring on the root */
+ root_cacheqos_group.rmid = 0;
+ return &root_cacheqos_group.css;
+ }
+
+ cq = kzalloc(sizeof(struct cacheqos), GFP_KERNEL);
+ if (!cq)
+ goto out;
+
+ cq->cgrp = parent_css->cgroup;
+ cq->monitor_cache = false; /* disabled i.e., use parent's RMID */
+ cq->rmid = parent->rmid; /* Start by using parent's RMID*/
+ cq->subsys_info = root_cacheqos_group.subsys_info;
+ return &cq->css;
+
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+/* destroy an existing cacheqos task group */
+static void cacheqos_css_free(struct cgroup_subsys_state *css)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+
+ if (cq->monitor_cache) {
+ mutex_lock(&cacheqos_mutex);
+ cacheqos_move_rmid_to_unused_list(cq);
+ mutex_unlock(&cacheqos_mutex);
+ }
+ kfree(cq);
+}
+
+/* return task group's monitoring state */
+static u64 cacheqos_monitor_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+
+ return cq->monitor_cache;
+}
+
+/* set the task group's monitoring state */
+static int cacheqos_monitor_write(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 enable)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ int err = 0;
+
+ if (enable != 0 && enable != 1) {
+ err = -EINVAL;
+ goto monitor_out;
+ }
+
+ if (enable == cq->monitor_cache)
+ goto monitor_out; /* already in the requested state */
+
+ if (cq->monitor_cache)
+ err = cacheqos_deallocate_rmid(cq);
+ else
+ err = cacheqos_allocate_rmid(cq);
+
+monitor_out:
+ return err;
+}
+
+static int cacheqos_get_occupancy_data(struct cacheqos *cq)
+{
+ unsigned int cpu;
+ unsigned int node;
+ const struct cpumask *node_cpus;
+ int err = 0;
+
+ /* Assumes cacheqos_mutex is held */
+ lockdep_assert_held(&cacheqos_mutex);
+ for_each_node_with_cpus(node) {
+ node_cpus = cpumask_of_node(node);
+ cpu = any_online_cpu(*node_cpus);
+ err = smp_call_function_single(cpu, cacheqos_read, cq, 1);
+
+ if (err) {
+ break;
+ } else if (cq->subsys_info->node_results[node] == -1) {
+ err = -EPROTO;
+ break;
+ }
+ }
+ return err;
+}
+
+/* return total system LLC occupancy in bytes of a task group */
+static int cacheqos_occupancy_read(struct cgroup_subsys_state *css,
+ struct cftype *cft, struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ u64 total_occupancy = 0;
+ int err, node;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ for_each_node_with_cpus(node)
+ total_occupancy += cq->subsys_info->node_results[node];
+
+ mutex_unlock(&cacheqos_mutex);
+
+ seq_printf(m, "%llu\n", total_occupancy);
+ return 0;
+}
+
+/* return display each LLC's occupancy in bytes of a task group */
+static int
+cacheqos_occupancy_persocket_seq_read(struct cgroup_subsys_state *css,
+ struct cftype *cft, struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ int err, node;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ for_each_node_with_cpus(node) {
+ seq_printf(m, "%llu\n",
+ cq->subsys_info->node_results[node]);
+ }
+
+ mutex_unlock(&cacheqos_mutex);
+
+ return 0;
+}
+
+/* return total system LLC occupancy as a %of system LLC for the task group */
+static int cacheqos_occupancy_percent_read(struct cgroup_subsys_state *css,
+ struct cftype *cft,
+ struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ u64 total_occupancy = 0;
+ int err, node;
+ int node_cnt = 0;
+ int parts_of_100, parts_of_10000;
+ int cache_size;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ for_each_node_with_cpus(node) {
+ ++node_cnt;
+ total_occupancy += cq->subsys_info->node_results[node];
+ }
+
+ mutex_unlock(&cacheqos_mutex);
+
+ cache_size = cq->subsys_info->cache_size * node_cnt;
+ parts_of_100 = (total_occupancy * 100) / (cache_size * 1024);
+ parts_of_10000 = (total_occupancy * 10000) / (cache_size * 1024) -
+ parts_of_100 * 100;
+ seq_printf(m, "%d.%02d\n", parts_of_100, parts_of_10000);
+
+ return 0;
+}
+
+/* return display each LLC's % occupancy of the socket's LLC for task group */
+static int
+cacheqos_occupancy_percent_persocket_seq_read(struct cgroup_subsys_state *css,
+ struct cftype *cft,
+ struct seq_file *m)
+{
+ struct cacheqos *cq = css_cacheqos(css);
+ u64 total_occupancy;
+ int err, node;
+ int cache_size;
+ int parts_of_100, parts_of_10000;
+
+ mutex_lock(&cacheqos_mutex);
+ err = cacheqos_get_occupancy_data(cq);
+ if (err) {
+ mutex_unlock(&cacheqos_mutex);
+ return err;
+ }
+
+ cache_size = cq->subsys_info->cache_size;
+ for_each_node_with_cpus(node) {
+ total_occupancy = cq->subsys_info->node_results[node];
+ parts_of_100 = (total_occupancy * 100) / (cache_size * 1024);
+ parts_of_10000 = (total_occupancy * 10000) /
+ (cache_size * 1024) - parts_of_100 * 100;
+
+ seq_printf(m, "%d.%02d\n", parts_of_100, parts_of_10000);
+ }
+
+ mutex_unlock(&cacheqos_mutex);
+
+ return 0;
+}
+
+static struct cftype cacheqos_files[] = {
+ {
+ .name = "monitor_cache",
+ .read_u64 = cacheqos_monitor_read,
+ .write_u64 = cacheqos_monitor_write,
+ .mode = 0666,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "occupancy_persocket",
+ .read_seq_string = cacheqos_occupancy_persocket_seq_read,
+ },
+ {
+ .name = "occupancy",
+ .read_seq_string = cacheqos_occupancy_read,
+ },
+ {
+ .name = "occupancy_percent_persocket",
+ .read_seq_string = cacheqos_occupancy_percent_persocket_seq_read,
+ },
+ {
+ .name = "occupancy_percent",
+ .read_seq_string = cacheqos_occupancy_percent_read,
+ },
+ { } /* terminate */
+};
+
+struct cgroup_subsys cacheqos_subsys = {
+ .name = "cacheqos",
+ .css_alloc = cacheqos_css_alloc,
+ .css_free = cacheqos_css_free,
+ .subsys_id = cacheqos_subsys_id,
+ .base_cftypes = cacheqos_files,
+};
diff --git a/kernel/sched/cacheqos.h b/kernel/sched/cacheqos.h
new file mode 100644
index 0000000..b20f25e
--- /dev/null
+++ b/kernel/sched/cacheqos.h
@@ -0,0 +1,59 @@
+#ifndef _CACHEQOS_H_
+#define _CACHEQOS_H_
+#ifdef CONFIG_CGROUP_CACHEQOS
+
+#include <linux/cgroup.h>
+
+struct rmid_list_element {
+ int rmid;
+ struct list_head list;
+};
+
+struct cacheqos_subsys_info {
+ struct list_head rmid_unused_fifo;
+ struct list_head rmid_inuse_list;
+ int cache_max_rmid;
+ int cache_occ_scale;
+ int cache_size;
+ u64 node_results[MAX_NUMNODES];
+};
+
+struct cacheqos {
+ struct cgroup_subsys_state css;
+ struct cacheqos_subsys_info *subsys_info;
+ struct cgroup *cgrp;
+ bool monitor_cache; /* false - use parent RMID / true - new RMID */
+
+ /*
+ * Used for walking the task groups to update RMID's of the various
+ * sub-groups. If monitor_cache is false, the sub-groups will inherit
+ * the parent's RMID. If monitor_cache is true, then the group has its
+ * own RMID.
+ */
+ spinlock_t lock;
+ u32 rmid;
+};
+
+extern void cacheqos_map_schedule_out(void);
+extern void cacheqos_map_schedule_in(struct cacheqos *);
+extern void cacheqos_read(void *);
+
+/* return cacheqos group corresponding to this container */
+static inline struct cacheqos *css_cacheqos(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct cacheqos, css) : NULL;
+}
+
+/* return cacheqos group to which this task belongs */
+static inline struct cacheqos *task_cacheqos(struct task_struct *task)
+{
+ return css_cacheqos(task_css(task, cacheqos_subsys_id));
+}
+
+static inline struct cacheqos *parent_cacheqos(struct cacheqos *cacheqos)
+{
+ return css_cacheqos(css_parent(&cacheqos->css));
+}
+
+#endif /* CONFIG_CGROUP_CACHEQOS */
+#endif /* _CACHEQOS_H_ */
--
1.8.3.1

2014-01-03 20:35:33

by Waskiewicz Jr, Peter P

Subject: [PATCH 2/4] x86: Add Cache QoS Monitoring support to x86 perf uncore

This patch adds the MSRs and masks for CQM to the x86 uncore.

The actual scheduling functions using the MSRs will be included
in the next patch when the new cgroup subsystem is added, as there
are dependencies on structs from the cgroup.

Signed-off-by: Peter P Waskiewicz Jr <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_uncore.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
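
To illustrate intent only (the real scheduling and read paths arrive in the
next patch), a rough sketch of how these MSRs and masks would be used from
kernel context; the helper names and parameters here are made up for the
example and do not exist in this series:

/*
 * Rough sketch, assuming CONFIG_CGROUP_CACHEQOS=y so the constants above
 * are in scope, and kernel context on an x86 CPU with CQM.
 */
#include <linux/types.h>
#include <asm/msr.h>

/* Context switch in: tag this logical CPU with the task group's RMID. */
static void cqm_sched_in_example(u32 rmid)
{
	wrmsrl(IA32_PQR_ASSOC, rmid & IA32_RMID_PQR_MASK);
}

/* Context switch out: revert to the default RMID (0). */
static void cqm_sched_out_example(void)
{
	wrmsrl(IA32_PQR_ASSOC, 0);
}

/* Read the LLC occupancy, in bytes, accumulated against an RMID. */
static u64 cqm_read_occupancy_example(u32 rmid, u32 cache_occ_scale)
{
	u64 sel, count;

	sel = ((u64)(rmid & IA32_RMID_PQR_MASK) << IA32_QM_EVTSEL_RMID_POSITION) |
	      IA32_QM_EVTSEL_EVTID_READ_OCC;
	wrmsrl(IA32_QM_EVTSEL, sel);
	rdmsrl(IA32_QM_CTR, count);

	if (count & IA32_QM_CTR_ERR)
		return 0;	/* RMID out of range or data not available */

	return count * cache_occ_scale;
}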

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index a80ab71..f788145 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -412,6 +412,19 @@

#define NHMEX_W_PMON_GLOBAL_FIXED_EN (1ULL << 31)

+#ifdef CONFIG_CGROUP_CACHEQOS
+/* Intel Cache QoS Monitoring uncore support */
+#define IA32_QM_EVTSEL 0xc8d
+#define IA32_QM_CTR 0xc8e
+#define IA32_PQR_ASSOC 0xc8f
+
+#define IA32_QM_EVTSEL_EVTID_READ_OCC 0x01
+#define IA32_QM_CTR_ERR (0x03llu << 62)
+#define IA32_RMID_PQR_MASK 0x3ff
+#define IA32_QM_EVTSEL_RMID_POSITION 32
+
+#endif /* CONFIG_CGROUP_CACHEQOS */
+
struct intel_uncore_ops;
struct intel_uncore_pmu;
struct intel_uncore_box;
--
1.8.3.1

2014-01-04 16:10:57

by Tejun Heo

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Hello,

On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> The CPU features themselves are relatively straight-forward, but
> the presentation of the data is less straight-forward. Since this
> tracks cache usage and occupancy per process (by swapping Resource
> Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> not be a good fit for this data, which does not report on a
> per-process level. Therefore, a new cgroup subsystem, cacheqos, has
> been added. This operates very similarly to the cpu and cpuacct
> cgroup subsystems, where tasks can be grouped into sub-leaves of the
> root-level cgroup.

I don't really understand why this is implemented as part of cgroup.
There doesn't seem to be anything which requires cgroup. Wouldn't
just doing it per-process make more sense? Even grouping would be
better done along the traditional process hierarchy, no? And
per-cgroup accounting can be trivially achieved from userland by just
accumulating the stats according to the process's cgroup membership.
What am I missing here?

Thanks.

--
tejun

2014-01-04 22:43:07

by Waskiewicz Jr, Peter P

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Sat, 2014-01-04 at 11:10 -0500, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> > The CPU features themselves are relatively straight-forward, but
> > the presentation of the data is less straight-forward. Since this
> > tracks cache usage and occupancy per process (by swapping Resource
> > Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> > not be a good fit for this data, which does not report on a
> > per-process level. Therefore, a new cgroup subsystem, cacheqos, has
> > been added. This operates very similarly to the cpu and cpuacct
> > cgroup subsystems, where tasks can be grouped into sub-leaves of the
> > root-level cgroup.
>
> I don't really understand why this is implemented as part of cgroup.
> There doesn't seem to be anything which requires cgroup. Wouldn't
> just doing it per-process make more sense? Even grouping would be
> better done along the traditional process hierarchy, no? And
> per-cgroup accounting can be trivially achieved from userland by just
> accumulating the stats according to the process's cgroup membership.
> What am I missing here?

Thanks for the quick response! I knew the approach would generate
questions, so let me explain.

The feature I'm enabling in the Xeon processors is fairly simple. It
has a set of Resource Monitoring ID's (RMIDs), and those are used by the
CPU cores to track the cache usage while any process associated with the
RMID is running. The more complicated part is how to present the
interface of creating RMID groups and assigning processes to them for
both tracking, and for stat collection.

We discussed (internally) a few different approaches to implement this.
The first natural thought was this is similar to other PMU features, but
this deals with processes and groups of processes, not overall CPU core
or uncore state. Given the way processes in a cgroup can be grouped
together and treated as single entities, this felt like a natural fit
with the RMID concept.

Simply put, when we want to allocate an RMID for monitoring httpd
traffic, we can create a new child in the subsystem hierarchy, and
assign the httpd processes to it. Then the RMID can be assigned to the
subsystem, and each process inherits that RMID. So instead of dealing
with assigning an RMID to each and every process, we can leverage the
existing cgroup mechanisms for grouping processes and their children to
a group, and they inherit the RMID.

Please let me know if this is a better explanation, and gives a better
picture of why we decided to approach the implementation this way. Also
note that this feature, Cache QoS Monitoring, is the first in a series
of Platform QoS Monitoring features that will be coming. So this isn't
a one-off feature, so however this first piece gets accepted, we want to
make sure it's easy to expand and not impact userspace tools repeatedly
(if possible).

Cheers,
-PJ Waskiewicz

--------------
Intel Open Source Technology Center

2014-01-04 22:51:04

by Tejun Heo

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

Hello,

On Sat, Jan 04, 2014 at 10:43:00PM +0000, Waskiewicz Jr, Peter P wrote:
> Simply put, when we want to allocate an RMID for monitoring httpd
> traffic, we can create a new child in the subsystem hierarchy, and
> assign the httpd processes to it. Then the RMID can be assigned to the
> subsystem, and each process inherits that RMID. So instead of dealing
> with assigning an RMID to each and every process, we can leverage the
> existing cgroup mechanisms for grouping processes and their children to
> a group, and they inherit the RMID.

Here's one thing that I don't get, possibly because I'm not
understanding the processor feature too well. Why does the processor
have to be aware of the grouping? ie. why can't it be done
per-process and then aggregated? Is there something inherent about
the monitored events which requires such peculiarity? Or is it that
accessing the stats data is noticeably expensive to do per context
switch?

> Please let me know if this is a better explanation, and gives a better
> picture of why we decided to approach the implementation this way. Also
> note that this feature, Cache QoS Monitoring, is the first in a series
> of Platform QoS Monitoring features that will be coming. So this isn't
> a one-off feature, so however this first piece gets accepted, we want to
> make sure it's easy to expand and not impact userspace tools repeatedly
> (if possible).

In general, I'm quite strongly opposed against using cgroup as
arbitrary grouping mechanism for anything other than resource control,
especially given that we're moving away from multiple hierarchies.

Thanks.

--
tejun

2014-01-05 05:23:12

by Waskiewicz Jr, Peter P

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Sat, 2014-01-04 at 17:50 -0500, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Sat, Jan 04, 2014 at 10:43:00PM +0000, Waskiewicz Jr, Peter P wrote:
> > Simply put, when we want to allocate an RMID for monitoring httpd
> > traffic, we can create a new child in the subsystem hierarchy, and
> > assign the httpd processes to it. Then the RMID can be assigned to the
> > subsystem, and each process inherits that RMID. So instead of dealing
> > with assigning an RMID to each and every process, we can leverage the
> > existing cgroup mechanisms for grouping processes and their children to
> > a group, and they inherit the RMID.
>
> Here's one thing that I don't get, possibly because I'm not
> understanding the processor feature too well. Why does the processor
> have to be aware of the grouping? ie. why can't it be done
> per-process and then aggregated? Is there something inherent about
> the monitored events which requires such peculiarity? Or is it that
> accessing the stats data is noticeably expensive to do per context
> switch?

The processor doesn't need to understand the grouping at all, but it
also isn't tracking things per-process that are rolled up later.
They're tracked via the RMID resource in the hardware, which could
correspond to a single process, or 500 processes. It really comes down
to the ease of management of grouping tasks in groups for two consumers,
1) the end user, and 2) the process scheduler.

I think I still may not be explaining how the CPU side works well
enough, in order to better understand what I'm trying to do with the
cgroup. Let me try to be a bit more clear, and if I'm still sounding
vague or not making sense, please tell me what isn't clear and I'll try
to be more specific. The new Documentation addition in patch 4 also has
a good overview, but let's try this:

A CPU may have 32 RMID's in hardware. This is for the platform, not per
core. I may want to have a single process assigned to an RMID for
tracking, say qemu to monitor cache usage of a specific VM. But I also
may want to monitor cache usage of all MySQL database processes with
another RMID, or even split specific processes of that database between
different RMID's. It all comes down to how the end-user wants to
monitor their specific workloads, and how those workloads are impacting
cache usage and occupancy.

With this implementation I've sent, all tasks are in RMID 0 by default.
Then one can create a subdirectory, just like the cpuacct cgroup, and
then add tasks to that subdirectory's task list. Once that
subdirectory's task list is enabled (through the cacheqos.monitor_cache
handle), then a free RMID is assigned from the CPU, and when the
scheduler switches to any of the tasks in that cgroup under that RMID,
the RMID begins monitoring the usage.

The CPU side is easy and clean. When something in the software wants to
monitor when a particular task is scheduled and started, write whatever
RMID that task is assigned to (through some mechanism) to the proper MSR
in the CPU. When that task is swapped out, clear the MSR to stop
monitoring of that RMID. When that RMID's statistics are requested by
the software (through some mechanism), then the CPU's MSRs are written
with the RMID in question, and the value is read of what has been
collected so far. In my case, I decided to use a cgroup for this
"mechanism" since so much of the grouping and task/group association
already exists and doesn't need to be rebuilt or re-invented.
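
For concreteness, here is a small sketch of that flow from user space
(illustrative only, not from the patches), using the paths from the
Documentation patch: create a group, move the current process into it, then
enable monitoring so a free RMID gets assigned:

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
	const char *grp = "/sys/fs/cgroup/cacheqos/g1";
	char path[128];
	FILE *f;

	/* New child group; its tasks keep RMID 0 until monitoring is enabled. */
	if (mkdir(grp, 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* Move this process into the group. */
	snprintf(path, sizeof(path), "%s/tasks", grp);
	f = fopen(path, "w");
	if (!f || fprintf(f, "%d\n", getpid()) < 0) {
		perror("add task");
		return 1;
	}
	fclose(f);

	/* Ask for an RMID; fails with EAGAIN when the hardware pool is exhausted. */
	snprintf(path, sizeof(path), "%s/cacheqos.monitor_cache", grp);
	f = fopen(path, "w");
	if (!f || fprintf(f, "1\n") < 0 || fclose(f)) {
		perror("enable monitoring");
		return 1;
	}

	return 0;
}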

> > Please let me know if this is a better explanation, and gives a better
> > picture of why we decided to approach the implementation this way. Also
> > note that this feature, Cache QoS Monitoring, is the first in a series
> > of Platform QoS Monitoring features that will be coming. So this isn't
> > a one-off feature, so however this first piece gets accepted, we want to
> > make sure it's easy to expand and not impact userspace tools repeatedly
> > (if possible).
>
> In general, I'm quite strongly opposed against using cgroup as
> arbitrary grouping mechanism for anything other than resource control,
> especially given that we're moving away from multiple hierarchies.

Just to clarify then, would the mechanism in the cpuacct cgroup to
create a group off the root subsystem be considered multi-hierarchical?
If not, then the intent for this new cacheqos subsystem is to be
identical in that regard to cpuacct in the behavior.

This is a resource controller, it just happens to be tied to a hardware
resource instead of an OS resource.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 11:08:26

by Peter Zijlstra

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> The CPU features themselves are relatively straight-forward, but
> the presentation of the data is less straight-forward. Since this
> tracks cache usage and occupancy per process (by swapping Resource
> Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> not be a good fit for this data, which does not report on a
> per-process level. Therefore, a new cgroup subsystem, cacheqos, has
> been added. This operates very similarly to the cpu and cpuacct
> cgroup subsystems, where tasks can be grouped into sub-leaves of the
> root-level cgroup.

This doesn't make any sense.. From a quick SDM read you can do pretty
much whatever with those RMIDs. If you allocate a RMID per task (thread
in userspace) you can actually measure things on a task basis.

From then on you can use perf-cgroup to group whatever tasks you want.

So please be more explicit in why you think this doesn't fit into perf.

2014-01-06 11:16:46

by Peter Zijlstra

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> The processor doesn't need to understand the grouping at all, but it
> also isn't tracking things per-process that are rolled up later.
> They're tracked via the RMID resource in the hardware, which could
> correspond to a single process, or 500 processes. It really comes down
> to the ease of management of grouping tasks in groups for two consumers,
> 1) the end user, and 2) the process scheduler.
>
> I think I still may not be explaining how the CPU side works well
> enough, in order to better understand what I'm trying to do with the
> cgroup. Let me try to be a bit more clear, and if I'm still sounding
> vague or not making sense, please tell me what isn't clear and I'll try
> to be more specific. The new Documentation addition in patch 4 also has
> a good overview, but let's try this:
>
> A CPU may have 32 RMID's in hardware. This is for the platform, not per
> core. I may want to have a single process assigned to an RMID for
> tracking, say qemu to monitor cache usage of a specific VM. But I also
> may want to monitor cache usage of all MySQL database processes with
> another RMID, or even split specific processes of that database between
> different RMID's. It all comes down to how the end-user wants to
> monitor their specific workloads, and how those workloads are impacting
> cache usage and occupancy.
>
> With this implementation I've sent, all tasks are in RMID 0 by default.
> Then one can create a subdirectory, just like the cpuacct cgroup, and
> then add tasks to that subdirectory's task list. Once that
> subdirectory's task list is enabled (through the cacheqos.monitor_cache
> handle), then a free RMID is assigned from the CPU, and when the
> scheduler switches to any of the tasks in that cgroup under that RMID,
> the RMID begins monitoring the usage.
>
> The CPU side is easy and clean. When something in the software wants to
> monitor when a particular task is scheduled and started, write whatever
> RMID that task is assigned to (through some mechanism) to the proper MSR
> in the CPU. When that task is swapped out, clear the MSR to stop
> monitoring of that RMID. When that RMID's statistics are requested by
> the software (through some mechanism), then the CPU's MSRs are written
> with the RMID in question, and the value is read of what has been
> collected so far. In my case, I decided to use a cgroup for this
> "mechanism" since so much of the grouping and task/group association
> already exists and doesn't need to be rebuilt or re-invented.

This still doesn't explain why you can't use perf-cgroup for this.

> > In general, I'm quite strongly opposed against using cgroup as
> > arbitrary grouping mechanism for anything other than resource control,
> > especially given that we're moving away from multiple hierarchies.
>
> Just to clarify then, would the mechanism in the cpuacct cgroup to
> create a group off the root subsystem be considered multi-hierarchical?
> If not, then the intent for this new cacheqos subsystem is to be
> identical in that regard to cpuacct in the behavior.
>
> This is a resource controller, it just happens to be tied to a hardware
> resource instead of an OS resource.

No, cpuacct and perf-cgroup aren't actually controllers at all. They're
resource monitors at best. Same with your Cache QoS Monitor, it doesn't
control anything.

2014-01-06 16:34:27

by Waskiewicz Jr, Peter P

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 12:16 +0100, Peter Zijlstra wrote:
> On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> > The CPU side is easy and clean. When something in the software wants to
> > monitor when a particular task is scheduled and started, write whatever
> > RMID that task is assigned to (through some mechanism) to the proper MSR
> > in the CPU. When that task is swapped out, clear the MSR to stop
> > monitoring of that RMID. When that RMID's statistics are requested by
> > the software (through some mechanism), then the CPU's MSRs are written
> > with the RMID in question, and the value is read of what has been
> > collected so far. In my case, I decided to use a cgroup for this
> > "mechanism" since so much of the grouping and task/group association
> > already exists and doesn't need to be rebuilt or re-invented.
>
> This still doesn't explain why you can't use perf-cgroup for this.

I'm not completely familiar with perf-cgroup, so I looked for some
documentation for it to better understand it. Are you referring to perf
-G to monitor an existing cgroup/all cgroups? Or something else? If
it's the former, I'm not following you how this would fit.

> > > In general, I'm quite strongly opposed against using cgroup as
> > > arbitrary grouping mechanism for anything other than resource control,
> > > especially given that we're moving away from multiple hierarchies.
> >
> > Just to clarify then, would the mechanism in the cpuacct cgroup to
> > create a group off the root subsystem be considered multi-hierarchical?
> > If not, then the intent for this new cacheqos subsystem is to be
> > identical in that regard to cpuacct in the behavior.
> >
> > This is a resource controller, it just happens to be tied to a hardware
> > resource instead of an OS resource.
>
> No, cpuacct and perf-cgroup aren't actually controllers at all. They're
> resource monitors at best. Same with your Cache QoS Monitor, it doesn't
> control anything.

I may be using controller in a different way than you are. Yes, the
Cache QoS Monitor is monitoring cache data. But it is also controlling
the allocation and deallocation of RMIDs to tasks/task groups as
monitoring is enabled and disabled for those groups. That's why I
called it a controller. If that's not accurate, I apologize.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 16:42:13

by Peter Zijlstra

Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, Jan 06, 2014 at 04:34:04PM +0000, Waskiewicz Jr, Peter P wrote:
> On Mon, 2014-01-06 at 12:16 +0100, Peter Zijlstra wrote:
> > On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> > > The CPU side is easy and clean. When something in the software wants to
> > > monitor when a particular task is scheduled and started, write whatever
> > > RMID that task is assigned to (through some mechanism) to the proper MSR
> > > in the CPU. When that task is swapped out, clear the MSR to stop
> > > monitoring of that RMID. When that RMID's statistics are requested by
> > > the software (through some mechanism), then the CPU's MSRs are written
> > > with the RMID in question, and the value is read of what has been
> > > collected so far. In my case, I decided to use a cgroup for this
> > > "mechanism" since so much of the grouping and task/group association
> > > already exists and doesn't need to be rebuilt or re-invented.
> >
> > This still doesn't explain why you can't use perf-cgroup for this.
>
> I'm not completely familiar with perf-cgroup, so I looked for some
> documentation for it to better understand it. Are you referring to perf
> -G to monitor an existing cgroup/all cgroups? Or something else? If
> it's the former, I'm not following you how this would fit.

All the bits under CONFIG_CGROUP_PERF, I've no idea how userspace looks.

> > > > In general, I'm quite strongly opposed against using cgroup as
> > > > arbitrary grouping mechanism for anything other than resource control,
> > > > especially given that we're moving away from multiple hierarchies.
> > >
> > > Just to clarify then, would the mechanism in the cpuacct cgroup to
> > > create a group off the root subsystem be considered multi-hierarchical?
> > > If not, then the intent for this new cacheqos subsystem is to be
> > > identical in that regard to cpuacct in the behavior.
> > >
> > > This is a resource controller, it just happens to be tied to a hardware
> > > resource instead of an OS resource.
> >
> > No, cpuacct and perf-cgroup aren't actually controllers at all. They're
> > resource monitors at best. Same with your Cache QoS Monitor, it doesn't
> > control anything.
>
> I may be using controller in a different way than you are. Yes, the
> Cache QoS Monitor is monitoring cache data. But it is also controlling
> the allocation and deallocation of RMIDs to tasks/task groups as
> monitoring is enabled and disabled for those groups. That's why I
> called it a controller. If that's not accurate, I apologize.

Yeah that's not accurate, nor desired I think, because you get into
horrible problems with hierarchies: do child groups belong to your RMID
or not?

As is I don't really see a good use for RMIDs and I would simply not use
them.

2014-01-06 16:42:33

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 12:08 +0100, Peter Zijlstra wrote:
> On Fri, Jan 03, 2014 at 12:34:41PM -0800, Peter P Waskiewicz Jr wrote:
> > The CPU features themselves are relatively straight-forward, but
> > the presentation of the data is less straight-forward. Since this
> > tracks cache usage and occupancy per process (by swapping Resource
> > Monitor IDs, or RMIDs, when processes are rescheduled), perf would
> > not be a good fit for this data, which does not report on a
> > per-process level. Therefore, a new cgroup subsystem, cacheqos, has
> > been added. This operates very similarly to the cpu and cpuacct
> > cgroup subsystems, where tasks can be grouped into sub-leaves of the
> > root-level cgroup.
>
> This doesn't make any sense.. From a quick SDM read you can do pretty
> much whatever with those RMIDs. If you allocate a RMID per task (thread
> in userspace) you can actually measure things on a task basis.

Exactly. An RMID can be assigned to a single task or a group of tasks.
Because the RMID is a hardware resource and is limited, the
implementation of using it is what we're really discussing here. Our
approach is to either monitor per-task, or per group of tasks.

> From then on you can use perf-cgroup to group whatever tasks you want.
>
> So please be more explicit in why you think this doesn't fit into perf.

I said this in my other reply to the other thread, but I'll ask again
because I'm not following. I'm looking for information on perf-cgroup,
and all I see is a way to monitor CPU events for tasks in a cgroup (perf
-G option).

The other part I'm not seeing is how to control the RMIDs being
allocated across to different groups. There may be 100 task groups to
monitor, but only 32 RMIDs. So the RMIDs need to be handed out to
active tasks and then enabled, data extracted, then disabled. That was
the intent of the cacheqos.monitor_cache knob.
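
To make that hand-out cycle concrete, the kernel side is little more than
a small pool of RMIDs behind the knob; roughly (a sketch with invented
names, no locking shown, not the actual patch code):

#include <linux/errno.h>
#include <linux/types.h>

#define CQM_MAX_RMID	32	/* e.g. 32 RMIDs, far fewer than task groups */

struct cacheqos_group {		/* hypothetical per-group state */
	int rmid;		/* -1 while the group is not being monitored */
};

static unsigned long rmid_in_use;	/* one bit per hardware RMID */

/* Hand one of the limited RMIDs to a group when its monitoring knob is set. */
static int cacheqos_enable_monitor(struct cacheqos_group *grp)
{
	int rmid;

	for (rmid = 1; rmid < CQM_MAX_RMID; rmid++) {	/* RMID 0 stays with root */
		if (!(rmid_in_use & (1UL << rmid))) {
			rmid_in_use |= 1UL << rmid;
			grp->rmid = rmid;	/* sched path writes it to PQR_ASSOC */
			return 0;
		}
	}
	return -EBUSY;	/* all RMIDs handed out; extract and disable another group first */
}

/* After the data has been extracted, give the RMID back to the pool. */
static void cacheqos_disable_monitor(struct cacheqos_group *grp)
{
	rmid_in_use &= ~(1UL << grp->rmid);
	grp->rmid = -1;
}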

The bottom line is I'm asking for a bit more information from you about
perf-cgroup, since it sounds like you see a fit for CQM here, and I'm
not seeing what you're looking at yet. Any information is much
appreciated.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 16:48:01

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 17:41 +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 04:34:04PM +0000, Waskiewicz Jr, Peter P wrote:
> > On Mon, 2014-01-06 at 12:16 +0100, Peter Zijlstra wrote:
> > > On Sun, Jan 05, 2014 at 05:23:07AM +0000, Waskiewicz Jr, Peter P wrote:
> > > > The CPU side is easy and clean. When something in the software wants to
> > > > monitor when a particular task is scheduled and started, write whatever
> > > > RMID that task is assigned to (through some mechanism) to the proper MSR
> > > > in the CPU. When that task is swapped out, clear the MSR to stop
> > > > monitoring of that RMID. When that RMID's statistics are requested by
> > > > the software (through some mechanism), then the CPU's MSRs are written
> > > > with the RMID in question, and the value is read of what has been
> > > > collected so far. In my case, I decided to use a cgroup for this
> > > > "mechanism" since so much of the grouping and task/group association
> > > > already exists and doesn't need to be rebuilt or re-invented.
> > >
> > > This still doesn't explain why you can't use perf-cgroup for this.
> >
> > I'm not completely familiar with perf-cgroup, so I looked for some
> > documentation for it to better understand it. Are you referring to perf
> > -G to monitor an existing cgroup/all cgroups? Or something else? If
> > it's the former, I'm not following you how this would fit.
>
> All the bits under CONFIG_CGROUP_PERF, I've no idea how userspace looks.

Ah ok. Yes, the userspace side of perf really doesn't fit controlling
the CQM bits at all from what I see.

> > > > > In general, I'm quite strongly opposed against using cgroup as
> > > > > arbitrary grouping mechanism for anything other than resource control,
> > > > > especially given that we're moving away from multiple hierarchies.
> > > >
> > > > Just to clarify then, would the mechanism in the cpuacct cgroup to
> > > > create a group off the root subsystem be considered multi-hierarchical?
> > > > If not, then the intent for this new cacheqos subsystem is to be
> > > > identical in that regard to cpuacct in the behavior.
> > > >
> > > > This is a resource controller, it just happens to be tied to a hardware
> > > > resource instead of an OS resource.
> > >
> > > No, cpuacct and perf-cgroup aren't actually controllers at all. They're
> > > resource monitors at best. Same with your Cache QoS Monitor, it doesn't
> > > control anything.
> >
> > I may be using controller in a different way than you are. Yes, the
> > Cache QoS Monitor is monitoring cache data. But it is also controlling
> > the allocation and deallocation of RMIDs to tasks/task groups as
> > monitoring is enabled and disabled for those groups. That's why I
> > called it a controller. If that's not accurate, I apologize.
>
> Yeah that's not accurate, nor desired I think, because you get into
> horrible problems with hierarchies, do child groups belong to your RMID
> or not?

I'd rather not support a child group of a child group. Only groups off
the root, and each group would be assigned an RMID when it's activated
for monitoring.

> As is I don't really see a good use for RMIDs and I would simply not use
> them.

If you want to use CQM in the hardware, then the RMID is how you get the
cache usage data from the CPU. If you don't want to use CQM, then you
can ignore RMIDs.

One of the best use cases for using RMIDs is in virtualization. A VM
may be a heavy cache user, or a light cache user. Tracing different VMs
on different RMIDs can allow an admin to identify which VM may be
causing high levels of eviction, and either migrate it to another host,
or move other tasks/VMs to other hosts. Without CQM, it's much harder
to find which process is eating the cache up.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 17:54:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > Yeah that's not accurate, nor desired I think, because you get into
> > horrible problems with hierarchies, do child groups belong to your RMID
> > or not?
>
> I'd rather not support a child group of a child group. Only groups off
> the root, and each group would be assigned an RMID when it's activated
> for monitoring.

Yeah, that's a complete non-starter for cgroups. Cgroups need to be
completely hierarchical.

So even the root group should represent all tasks; which if you fragment
RMIDs on child cgroups doesn't work anymore.

2014-01-06 18:06:24

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 18:53 +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > > Yeah that's not accurate, nor desired I think, because you get into
> > > horrible problems with hierarchies, do child groups belong to your RMID
> > > or not?
> >
> > I'd rather not support a child group of a child group. Only groups off
> > the root, and each group would be assigned an RMID when it's activated
> > for monitoring.
>
> Yeah, that's a complete non-starter for cgroups. Cgroups need to be
> completely hierarchical.
>
> So even the root group should represent all tasks; which if you fragment
> RMIDs on child cgroups doesn't work anymore.

The root group does represent all tasks in the current patchset on RMID
0. Then any child assigned to another group will be assigned to a
different RMID. It looks like this:

            root (rmid 0)
           /             \
(rmid 4) g1               g2 (rmid 16)

We could keep going down from there, but I don't see it buying anything
extra.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 18:06:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > As is I don't really see a good use for RMIDs and I would simply not use
> > them.
>
> If you want to use CQM in the hardware, then the RMID is how you get the
> cache usage data from the CPU. If you don't want to use CQM, then you
> can ignore RMIDs.

I think you can make do with a single RMID (per cpu). When you program
the counter (be it for a task, cpu or cgroup context) you set the 1 RMID
and EVSEL and read the CTR.

What I'm not entirely clear on is if the EVSEL and CTR MSR are per
logical CPU or per L3 (package); /me prays they're per logical CPU.

> One of the best use cases for using RMIDs is in virtualization.

*groan*.. /me plugs wax in ears and goes la-la-la-la

> A VM
> may be a heavy cache user, or a light cache user. Tracing different VMs
> on different RMIDs can allow an admin to identify which VM may be
> causing high levels of eviction, and either migrate it to another host,
> or move other tasks/VMs to other hosts. Without CQM, it's much harder
> to find which process is eating the cache up.

Not necessarily VMs, there's plenty large processes that exhibit similar
problems.. why must people always do VMs :-(

That said, even with a single RMID you can get that information by
simply running it against all competing processes one at a time. Since
there's limited RMID space you need to rotate at some point anyway.

The cgroup interface you propose wouldn't allow for rotation; other than
manually, by creating different cgroups one after another.

2014-01-06 20:10:52

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 19:06 +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 04:47:57PM +0000, Waskiewicz Jr, Peter P wrote:
> > > As is I don't really see a good use for RMIDs and I would simply not use
> > > them.
> >
> > If you want to use CQM in the hardware, then the RMID is how you get the
> > cache usage data from the CPU. If you don't want to use CQM, then you
> > can ignore RMIDs.
>
> I think you can make do with a single RMID (per cpu). When you program
> the counter (be it for a task, cpu or cgroup context) you set the 1 RMID
> and EVSEL and read the CTR.
>
> What I'm not entirely clear on is if the EVSEL and CTR MSR are per
> logical CPU or per L3 (package); /me prays they're per logical CPU.

There is one per logical CPU. However, in the current generation, they
report on the usage of the same L3 cache. But the CPU takes care of the
resolution of which MSR write and read comes from the logical CPU, so
software doesn't need to lock access to it from different CPUs.
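
As a sketch of what that buys us (illustrative only, not code from the
series): a read can simply be run on whichever CPU of the package you
like, without locking against the other CPUs:

#include <linux/smp.h>
#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_QM_EVTSEL	0x0c8d
#define MSR_IA32_QM_CTR		0x0c8e

struct qm_read {
	u32 rmid;
	u64 val;
};

static void __qm_read_local(void *info)
{
	struct qm_read *r = info;

	/* event 1 = L3 occupancy, bits 41:32 select the RMID */
	wrmsrl(MSR_IA32_QM_EVTSEL, 1 | ((u64)r->rmid << 32));
	rdmsrl(MSR_IA32_QM_CTR, r->val);
}

/* Run the read on @cpu, any CPU sharing the L3 we care about. */
static u64 qm_read_on_cpu(int cpu, u32 rmid)
{
	struct qm_read r = { .rmid = rmid };

	smp_call_function_single(cpu, __qm_read_local, &r, 1);
	return r.val;
}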

> > One of the best use cases for using RMIDs is in virtualization.
>
> *groan*.. /me plugs wax in ears and goes la-la-la-la
>
> > A VM
> > may be a heavy cache user, or a light cache user. Tracing different VMs
> > on different RMIDs can allow an admin to identify which VM may be
> > causing high levels of eviction, and either migrate it to another host,
> > or move other tasks/VMs to other hosts. Without CQM, it's much harder
> > to find which process is eating the cache up.
>
> Not necessarily VMs, there's plenty large processes that exhibit similar
> problems.. why must people always do VMs :-(

Completely agreed. It's just the loudest people right now asking for
this capability are using VMs for the most part.

> That said, even with a single RMID you can get that information by
> simply running it against all competing processes one at a time. Since
> there's limited RMID space you need to rotate at some point anyway.
>
> The cgroup interface you propose wouldn't allow for rotation; other than
> manual by creating different cgroups one after another.

I see your points, and I also think that the cgroup approach now isn't
the best way to make this completely flexible. What about this:

Add a new read/write entry to the /proc/<pid> attributes that is the
RMID to assign that process to. Then expose all the available RMIDs
in /sys/devices/system/cpu, say in a new directory platformqos (or
whatever), which would then have all the statistics inside, plus a knob
to enable or disable monitoring. Then all the kernel exposes is a way to
assign a PID to an RMID, a way to turn monitoring on or off, and a way
to get the data out. I can then put a simple userspace tool together to
make the management suck less.
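
As a strawman of the userspace side (every path and file name below is
made up to illustrate the shape of the interface; nothing here exists
yet):

#include <stdio.h>
#include <sys/types.h>

/* Tell the kernel which RMID a pid should be monitored under. */
static int assign_pid_to_rmid(pid_t pid, int rmid)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/cacheqos_rmid", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", rmid);
	fclose(f);
	return 0;
}

/* Read back the occupancy the kernel collected for that RMID. */
static long long read_rmid_occupancy(int rmid)
{
	char path[96];
	long long bytes = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/platformqos/rmid%d/llc_occupancy", rmid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &bytes) != 1)
		bytes = -1;
	fclose(f);
	return bytes;
}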

Thoughts?

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 21:26:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, Jan 06, 2014 at 08:10:45PM +0000, Waskiewicz Jr, Peter P wrote:
> There is one per logical CPU. However, in the current generation, they
> report on the usage of the same L3 cache. But the CPU takes care of the
> resolution of which MSR write and read comes from the logical CPU, so
> software doesn't need to lock access to it from different CPUs.

What are the rules of RMIDs, I can't seem to find that in the SDM and I
think you're tagging cachelines with them. Which would mean that in
order to (re) use them you need a complete cache (L3) wipe.

Without a wipe you keep having stale entries of the former user and no
clear indication on when your numbers are any good.

Also, is there any sane way of shooting down the entire L3?

2014-01-06 21:48:35

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 22:26 +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 08:10:45PM +0000, Waskiewicz Jr, Peter P wrote:
> > There is one per logical CPU. However, in the current generation, they
> > report on the usage of the same L3 cache. But the CPU takes care of the
> > resolution of which MSR write and read comes from the logical CPU, so
> > software doesn't need to lock access to it from different CPUs.
>
> What are the rules of RMIDs, I can't seem to find that in the SDM and I
> think you're tagging cachelines with them. Which would mean that in
> order to (re) use them you need a complete cache (L3) wipe.

The cacheline is tagged internally with the RMID as part of the waymask
for the thread in the core.

> Without a wipe you keep having stale entries of the former user and no
> clear indication on when your numbers are any good.

That can happen, yes. If you have leftover cache data from a process
that died that hasn't been evicted yet and it's assigned to the RMID
you're using, you will see its cache occupancy included in the overall
numbers.

> Also, is there any sane way of shooting down the entire L3?

That is a question I'd punt to hpa, but I'll ask him. Looking around
though, a WBINVD would certainly nuke things, but would hurt
performance. We could get creative with INVPCID as a process dies. Let
me ask him though and see if there's a good way to tidy up.

-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-06 22:13:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, Jan 06, 2014 at 09:48:29PM +0000, Waskiewicz Jr, Peter P wrote:
> On Mon, 2014-01-06 at 22:26 +0100, Peter Zijlstra wrote:
> > On Mon, Jan 06, 2014 at 08:10:45PM +0000, Waskiewicz Jr, Peter P wrote:
> > > There is one per logical CPU. However, in the current generation, they
> > > report on the usage of the same L3 cache. But the CPU takes care of the
> > > resolution of which MSR write and read comes from the logical CPU, so
> > > software doesn't need to lock access to it from different CPUs.
> >
> > What are the rules of RMIDs, I can't seem to find that in the SDM and I
> > think you're tagging cachelines with them. Which would mean that in
> > order to (re) use them you need a complete cache (L3) wipe.
>
> The cacheline is tagged internally with the RMID as part of the waymask
> for the thread in the core.
>
> > Without a wipe you keep having stale entries of the former user and no
> > clear indication on when your numbers are any good.
>
> That can happen, yes. If you have leftover cache data from a process
> that died that hasn't been evicted yet and it's assigned to the RMID
> you're using, you will see its included cache occupancy to the overall
> numbers.
>
> > Also, is there any sane way of shooting down the entire L3?
>
> That is a question I'd punt to hpa, but I'll ask him. Looking around
> though, a WBINVD would certainly nuke things, but would hurt
> performance. We could get creative with INVPCID as a process dies. Let
> me ask him though and see if there's a good way to tidy up.

You seem to be assuming a RMID is for the entire task lifetime.

Since its a very limited resource that seems like a weird assumption to
me; there's plenty scenarios in which you'd want to re-use RMIDs that
belong to a still running context.

At which point you need to force a wipe... otherwise it's impossible to tell
when the number reported makes any kind of sense.

2014-01-06 22:45:43

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-06 at 23:12 +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 09:48:29PM +0000, Waskiewicz Jr, Peter P wrote:
> > The cacheline is tagged internally with the RMID as part of the waymask
> > for the thread in the core.
> >
> > > Without a wipe you keep having stale entries of the former user and no
> > > clear indication on when your numbers are any good.
> >
> > That can happen, yes. If you have leftover cache data from a process
> > that died that hasn't been evicted yet and it's assigned to the RMID
> > you're using, you will see its included cache occupancy to the overall
> > numbers.
> >
> > > Also, is there any sane way of shooting down the entire L3?
> >
> > That is a question I'd punt to hpa, but I'll ask him. Looking around
> > though, a WBINVD would certainly nuke things, but would hurt
> > performance. We could get creative with INVPCID as a process dies. Let
> > me ask him though and see if there's a good way to tidy up.
>
> You seem to be assuming a RMID is for the entire task lifetime.

No, the RMID can be changed if the user wants to reassign the process to
a different group/RMID. If I'm coming across otherwise, then my
apologies.

> Since its a very limited resource that seems like a weird assumption to
> me; there's plenty scenarios in which you'd want to re-use RMIDs that
> belong to a still running context.

I think I see what you're really asking, let me rephrase to see if I'm
now understanding you: What happens to a running process' cache
assigned to one RMID when it's reassigned to a different RMID? Does the
RMID get updated internally or does it appear as still belonging to the
old RMID?

If that's your question, then the CPU will update the cache entry with
the correct RMID. It knows to do this because when the process is
scheduled, the OS will write the IA32_PQR_ASSOC MSR with the RMID to
start monitoring. That's when the RMID will be updated in the cache.
However, any cacheline for a process that is moved to a different RMID,
but doesn't have an opportunity to be scheduled, will still show up on
the old RMID until it gets to run with the new RMID.
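
In other words, the schedule-time hook is just an MSR write, roughly
(sketch only, not the code from the series):

#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC	0x0c8f

/*
 * On schedule-in: lines this logical CPU brings into the cache from now
 * on are tagged with the task's RMID (0 is the default/root RMID).
 */
static inline void cqm_sched_in(u32 rmid)
{
	wrmsrl(MSR_IA32_PQR_ASSOC, rmid);
}

/* On schedule-out: fall back to RMID 0 so we stop charging that RMID. */
static inline void cqm_sched_out(void)
{
	wrmsrl(MSR_IA32_PQR_ASSOC, 0);
}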

Let me know if that better explains what I think you're asking.

-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-07 08:35:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, Jan 06, 2014 at 10:45:24PM +0000, Waskiewicz Jr, Peter P wrote:
> > Since its a very limited resource that seems like a weird assumption to
> > me; there's plenty scenarios in which you'd want to re-use RMIDs that
> > belong to a still running context.
>
> I think I see what you're really asking, let me rephrase to see if I'm
> now understanding you: What happens to a running process' cache
> assigned to one RMID when it's reassigned to a different RMID? Does the
> RMID get updated internally or does it appear as still belonging to the
> old RMID?
>
> If that's your question, then the CPU will update the cache entry with
> the correct RMID. It knows to do this because the when the process is
> scheduled, the OS will write the IA32_PQR_ASSOC MSR with the RMID map to
> start monitoring. That's when the RMID will be updated in the cache.
> However, any cacheline for a process that is moved to a different RMID,
> but doesn't have an opportunity to be scheduled, will still show up on
> the old RMID until it gets to run with the new RMID.
>
> Let me know if that better explains what I think you're asking.

Still confused here. So what you're saying is that cachelines get tagged
with {CR3,RMID} and when they observe the same CR3 with a different RMID
the hardware will iterate the entire cache and update all tuples?

That seems both very expensive and undesirable. It would mean you could
never use the RMID to create slices of a process since you're stuck to
the CR3.

It also makes me wonder why we have the RMID at all; because if you're
already tagging every line with the CR3, why not build the cache monitor
on that. Just query the occupancy for all CR3s in your group and add.


The other possible interpretation is that it updates on-demand whenever
it touches a cacheline. But in that case, how do you deal with the
non-exclusive states? Does the last RMID to touch a non-exclusive
cacheline simply claim the entire line?

But that doesn't avoid the problem; because as soon as you change the
PQR_ASSOC RMID you still need to go run for a while to touch 'all' your
lines.

This duration is indeterminate; which again brings us back to needing to
first wipe the entire cache.

2014-01-07 15:16:04

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, 2014-01-07 at 09:34 +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 10:45:24PM +0000, Waskiewicz Jr, Peter P wrote:
> > > Since its a very limited resource that seems like a weird assumption to
> > > me; there's plenty scenarios in which you'd want to re-use RMIDs that
> > > belong to a still running context.
> >
> > I think I see what you're really asking, let me rephrase to see if I'm
> > now understanding you: What happens to a running process' cache
> > assigned to one RMID when it's reassigned to a different RMID? Does the
> > RMID get updated internally or does it appear as still belonging to the
> > old RMID?
> >
> > If that's your question, then the CPU will update the cache entry with
> > the correct RMID. It knows to do this because the when the process is
> > scheduled, the OS will write the IA32_PQR_ASSOC MSR with the RMID map to
> > start monitoring. That's when the RMID will be updated in the cache.
> > However, any cacheline for a process that is moved to a different RMID,
> > but doesn't have an opportunity to be scheduled, will still show up on
> > the old RMID until it gets to run with the new RMID.
> >
> > Let me know if that better explains what I think you're asking.
>
> Still confused here. So what you're saying is that cachelines get tagged
> with {CR3,RMID} and when they observe the same CR3 with a different RMID
> the hardware will iterate the entire cache and update all tuples?
>
> That seems both very expensive and undesirable. It would mean you could
> never use the RMID to creates slices of a process since you're stuck to
> the CR3.
>
> It also makes me wonder why we have the RMID at all; because if you're
> already tagging every line with the CR3, why not build the cache monitor
> on that. Just query the occupancy for all CR3s in your group and add.

The reason is the RMID needs to be retained on the cache entry when it
is promoted to another layer of cache, and (possibly) returns to the LLC
later. And the mechanism to report the occupancy is as you hope:
query the occupancy for all CR3s and add. If you didn't have the RMID
tagged on the cache entry, then you couldn't do that.

> The other possible interpretation is that it updates on-demand whenever
> it touches a cacheline. But in that case, how do you deal with the
> non-exclusive states? Does the last RMID to touch a non-exclusive
> cacheline simply claim the entire line?

I don't believe it claims the whole line; I had that exact discussion
a while ago with the CPU architect, and this didn't appear broken before.
I will ask him again though since that discussion was over a year ago.

> But that doesn't avoid the problem; because as soon as you change the
> PQR_ASSOC RMID you still need to go run for a while to touch 'all' your
> lines.
>
> This duration is indeterminate; which again brings us back to needing to
> first wipe the entire cache.

I asked hpa if there is a clean way to do that outside of a WBINVD, and
the answer is no.

I've sent the two outstanding questions off to the CPU architect, I'll
let you know what he says once I hear.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-07 21:12:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, Jan 07, 2014 at 03:15:52PM +0000, Waskiewicz Jr, Peter P wrote:
> > Still confused here. So what you're saying is that cachelines get tagged
> > with {CR3,RMID} and when they observe the same CR3 with a different RMID
> > the hardware will iterate the entire cache and update all tuples?
> >
> > That seems both very expensive and undesirable. It would mean you could
> > never use the RMID to creates slices of a process since you're stuck to
> > the CR3.
> >
> > It also makes me wonder why we have the RMID at all; because if you're
> > already tagging every line with the CR3, why not build the cache monitor
> > on that. Just query the occupancy for all CR3s in your group and add.
>
> The reason is the RMID needs to be retained on the cache entry when it
> is promoted to another layer of cache, and (possibly) returns to the LLC
> later. And the mechanism to return the occupancy is how you hope it is,
> query the occupancy for all CR3s and add. If you didn't have the RMID
> tagged on the cache entry, then you couldn't do that.

Maybe it's me (it's late) but I can't follow.

So if every cacheline is tagged with both CR3 and RMID (on all levels --
I get that it needs to propagate etc..) then you can, upon observing a
new CR3,RMID pair, iterate the entire cache for the matching CR3 and
update its RMID.

This, while expensive, would fairly quickly propagate changes.

Now I'm not at all sure cachelines are CR3 tagged.

The above has downsides in that you cannot use RMIDs to slice into
processes, where a pure RMID (without CR3 relation, even if cachelines
are CR3 tagged) can slice processes -- note that process is an
address-space/CR3 collection of threads.

A pure RMID tagging solution would not allow the immediate update and
would require on demand updates on new cacheline usage.

This makes the effects of switching RMIDs slower to propagate.

> > The other possible interpretation is that it updates on-demand whenever
> > it touches a cacheline. But in that case, how do you deal with the
> > non-exclusive states? Does the last RMID to touch a non-exclusive
> > cacheline simply claim the entire line?
>
> I don't believe it claims the whole line; I had that exact discussion
> awhile ago with the CPU architect, and this didn't appear broken before.
> I will ask him again though since that discussion was over a year ago.
>
> > But that doesn't avoid the problem; because as soon as you change the
> > PQR_ASSOC RMID you still need to go run for a while to touch 'all' your
> > lines.
> >
> > This duration is indeterminate; which again brings us back to needing to
> > first wipe the entire cache.
>
> I asked hpa if there is a clean way to do that outside of a WBINVD, and
> the answer is no.
>
> I've sent the two outstanding questions off to the CPU architect, I'll
> let you know what he says once I hear.

Much appreciated; so I'd like a complete description of how this thing
works, with in particular when exactly lines are tagged.

So my current mental model would tag a line with the current (ASSOC)
RMID on:
- load from DRAM -> L*, even for non-exclusive
- any to exclusive transition

The result of such rules is that when the effective RMID of a task
changes it takes an indeterminate amount of time before the residency
stats reflect reality again.

Furthermore, the IA32_QM_CTR is a misnomer as it's a VALUE, not a COUNTER.
Not to mention the entire SDM 17.14.2 section is a mess; it purports to
describe how to detect the thing using CPUID but then also maybe
describes how to program it.

2014-01-10 18:55:19

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, 2014-01-07 at 22:12 +0100, Peter Zijlstra wrote:
> Maybe its me (its late) but I can't follow.
>
> So if every cacheline is tagged with both CR3 and RMID (on all levels --
> I get that it needs to propagate etc..) then you can, upon observing a
> new CR3,RMID pair, iterate the entire cache for the matching CR3 and
> update its RMID.
>
> This, while expensive, would fairly quickly propagate changes.
>
> Now I'm not at all sure cachelines are CR3 tagged.
>
> The above has downsides in that you cannot use RMIDs to slice into
> processes, where a pure RMID (without CR3 relation, even if cachelines
> are CR3 tagged) can slice processes -- note that process is an
> address-space/CR3 collection of threads.
>
> A pure RMID tagging solution would not allow the immediate update and
> would require on demand updates on new cacheline usage.
>
> This makes switching RMIDs effects slower to propagate.

> > > The other possible interpretation is that it updates on-demand whenever
> > > it touches a cacheline. But in that case, how do you deal with the
> > > non-exclusive states? Does the last RMID to touch a non-exclusive
> > > cacheline simply claim the entire line?
> >
> > I don't believe it claims the whole line; I had that exact discussion
> > awhile ago with the CPU architect, and this didn't appear broken before.
> > I will ask him again though since that discussion was over a year ago.
> >
> > > But that doesn't avoid the problem; because as soon as you change the
> > > PQR_ASSOC RMID you still need to go run for a while to touch 'all' your
> > > lines.
> > >
> > > This duration is indeterminate; which again brings us back to needing to
> > > first wipe the entire cache.
> >
> > I asked hpa if there is a clean way to do that outside of a WBINVD, and
> > the answer is no.
> >
> > I've sent the two outstanding questions off to the CPU architect, I'll
> > let you know what he says once I hear.
>
> Much appreciated; so I'd like a complete description of how this thing
> works, with in particular when exactly lines are tagged.

I've spoken with the CPU architect, and he's set me straight. I was
getting some simulation data and reality mixed up, so apologies.

The cacheline is tagged with the RMID being tracked when it's brought
into the cache. That is the only time it's tagged, it does not get
updated (I was looking at data showing impacts if it was updated).

If there are frequent RMID updates for a particular process, then there
is the possibility that any remaining old data for that process can be
accounted for on a different RMID. This really is workload dependent,
and my architect provided their data showing that this occurrence is
pretty much in the noise.

Also, I did ask about the granularity of the RMID, and it is
per-cacheline. So if there is a non-exclusive cacheline, then the
occupancy data in the other part of the cacheline will count against the
RMID.

> So my current mental model would tag a line with the current (ASSOC)
> RMID on:
> - load from DRAM -> L*, even for non-exclusive
> - any to exclusive transition
>
> The result of such rules is that when the effective RMID of a task
> changes it takes an indeterminate amount of time before the residency
> stats reflect reality again.
>
> Furthermore; the IA32_QM_CTR is a misnomer as its a VALUE not a COUNTER.
> Not to mention the entire SDM 17.14.2 section is a mess; it purports to
> describe how to detect the thing using CPUID but then also maybe
> describes how to program it.

I've given this feedback to the section owner in the SDM. There is an
update due this month, and there will be some updates to this section
(along with some additions).

I should have my alternate implementation sent out shortly, just working
a few kinks out of it. This is the proc-based and sysfs-based interface
that will rely on a userspace program to handle the logic of grouping
and assigning stuff together.

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-13 07:55:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Fri, Jan 10, 2014 at 06:55:11PM +0000, Waskiewicz Jr, Peter P wrote:
> I've spoken with the CPU architect, and he's set me straight. I was
> getting some simulation data and reality mixed up, so apologies.
>
> The cacheline is tagged with the RMID being tracked when it's brought
> into the cache. That is the only time it's tagged, it does not get
> updated (I was looking at data showing impacts if it was updated).
>
> If there are frequent RMID updates for a particular process, then there
> is the possibility that any remaining old data for that process can be
> accounted for on a different RMID. This really is workload dependent,
> and my architect provided their data showing that this occurrence is
> pretty much in the noise.

What change frequency and what sized workloads did they test?

I can make it significant; take a multi-threaded workload that mostly
fits in cache, then assign all threads but one RMID 0, then fairly
quickly rotate RMID 1 between the threads.

The problem is, since there's a limited number of RMIDs we have to
rotate at some point, but since changing RMIDs is nondeterministic we
can't.

> Also, I did ask about the granularity of the RMID, and it is
> per-cacheline. So if there is a non-exclusive cacheline, then the
> occupancy data in the other part of the cacheline will count against the
> RMID.

One more question:

u64 i;
u64 rmid_val[];

for (i = 0; i < rmid_max; i++) {
	wrmsr(IA32_QM_EVTSEL, 1 | (i << 32));
	rdmsr(IA32_QM_CTR, rmid_val[i]);
}

Is this the right way of reading these values? I couldn't find anything
that says the event must 'run' to accumulate a value at all, so it
seems to be a direct value read with a multiplexer on the RMID.

> > So my current mental model would tag a line with the current (ASSOC)
> > RMID on:
> > - load from DRAM -> L*, even for non-exclusive
> > - any to exclusive transition
> >
> > The result of such rules is that when the effective RMID of a task
> > changes it takes an indeterminate amount of time before the residency
> > stats reflect reality again.
> >
> > Furthermore; the IA32_QM_CTR is a misnomer as its a VALUE not a COUNTER.
> > Not to mention the entire SDM 17.14.2 section is a mess; it purports to
> > describe how to detect the thing using CPUID but then also maybe
> > describes how to program it.
>
> I've given this feedback to the section owner in the SDM. There is an
> update due this month, and there will be some updates to this section
> (along with some additions).
>
> I should have my alternate implementation sent out shortly, just working
> a few kinks out of it. This is the proc-based and sysfs-based interface
> that will rely on a userspace program to handle the logic of grouping
> and assigning stuff together.

I've not figured out how to deal with this stuff yet; exposing RMIDs to
userspace is a guaranteed fail though. Any interface that disallows the
kernel to manage the RMIDs is broken.

2014-01-14 17:59:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On 01/12/2014 11:55 PM, Peter Zijlstra wrote:
>
> The problem is, since there's a limited number of RMIDs we have to
> rotate at some point, but since changing RMIDs is nondeterministic we
> can't.
>

This is fundamentally the crux here. RMIDs are quite expensive for the
hardware to implement, so they are limited - but recycling them is
*very* expensive because you literally have to touch every line in the
cache.

-hpa

2014-01-14 20:46:57

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-13 at 08:55 +0100, Peter Zijlstra wrote:
> On Fri, Jan 10, 2014 at 06:55:11PM +0000, Waskiewicz Jr, Peter P wrote:
> > I've spoken with the CPU architect, and he's set me straight. I was
> > getting some simulation data and reality mixed up, so apologies.
> >
> > The cacheline is tagged with the RMID being tracked when it's brought
> > into the cache. That is the only time it's tagged, it does not get
> > updated (I was looking at data showing impacts if it was updated).
> >
> > If there are frequent RMID updates for a particular process, then there
> > is the possibility that any remaining old data for that process can be
> > accounted for on a different RMID. This really is workload dependent,
> > and my architect provided their data showing that this occurrence is
> > pretty much in the noise.
>
> What change frequency and what sided workloads did they test?

I will see what data I can share, as much of this is internal testing
with open access to hardware implementation details.

> I can make it significant; take a multi-threaded workload that mostly
> fits in cache, then assign all threads but one RMID 0, then fairly
> quickly rotate RMID 1 between the threads.
>
> The problem is, since there's a limited number of RMIDs we have to
> rotate at some point, but since changing RMIDs is nondeterministic we
> can't.
>
> > Also, I did ask about the granularity of the RMID, and it is
> > per-cacheline. So if there is a non-exclusive cacheline, then the
> > occupancy data in the other part of the cacheline will count against the
> > RMID.
>
> One more question:
>
> u64 i;
> u64 rmid_val[];
>
> for (i = 0; i < rmid_max; i++) {
> wrmsr(IA32_QM_EVTSEL, 1 | (i << 32));
> rdmsr(IA32_QM_CTR, rmid_val[i]);
> }
>
> Is this the right way of reading these values? I couldn't find anything
> that says the event must 'run' to accumulate a value at all, so all it
> seems it a direct value read with a multiplexer to the RMID.

Yes, this is correct. In the SDM, the layout of the IA32_QM_CTR MSR has
bits 61:0 contain the data, then bits 62 and 63 are error bits. In
order to select the RMID to read, the IA32_QM_EVTSEL must be programmed
to get the data out; that's the only way to tell the CPU what RMID needs
to be inspected.
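
So a read helper ends up looking roughly like this (a sketch; the mask
names are mine, the bit layout is as described above):

#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_QM_EVTSEL	0x0c8d
#define MSR_IA32_QM_CTR		0x0c8e

#define QM_CTR_ERROR		(1ULL << 63)		/* bad RMID/event selection */
#define QM_CTR_UNAVAIL		(1ULL << 62)		/* no data available */
#define QM_CTR_DATA_MASK	((1ULL << 62) - 1)	/* bits 61:0 hold the value */

/* Returns the raw occupancy value for @rmid, or -1 on error/unavailable. */
static s64 qm_read_rmid(u32 rmid)
{
	u64 val;

	wrmsrl(MSR_IA32_QM_EVTSEL, 1 | ((u64)rmid << 32));	/* event 1: L3 occupancy */
	rdmsrl(MSR_IA32_QM_CTR, val);

	if (val & (QM_CTR_ERROR | QM_CTR_UNAVAIL))
		return -1;

	return val & QM_CTR_DATA_MASK;
}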

[...]

> I've not figured out how to deal with this stuff yet; exposing RMIDs to
> userspace is a guaranteed fail though. Any interface that disallows the
> kernel to manage the RMIDs is broken.

Hence the first implementation in this patch series using cgroups. The
backend would assign an RMID to the task group when monitoring was
enabled. The RMID itself had no exposure to userspace. It's quite a
nice association that works well. I still think it's a viable way to do
it, and am trying to convince myself otherwise (but keep coming back to
it).

The implementation I'm talking about is to assign an arbitrary group
number to the tasks, then have the kernel assign an RMID to that group
number when the group's monitoring is enabled. So basically the same
functionality that the current patchset uses, but minus the cgroup. I
don't like the approach because it will reinvent some of the cgroup's
functionality, but it does separate it from cgroups.
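
Data-structure wise that alternative is not much more than this (all
names invented here for illustration, this is not posted code):

#include <linux/list.h>
#include <linux/types.h>

struct cacheqos_group {
	int			id;	/* arbitrary group number from userspace */
	int			rmid;	/* -1 until monitoring is enabled */
	struct list_head	tasks;	/* tasks attached to this group */
	struct list_head	entry;	/* on the global list of groups */
};

static LIST_HEAD(cacheqos_groups);

/* Userspace only ever names groups by @id; the RMIDs stay kernel-internal. */
static struct cacheqos_group *cacheqos_find_group(int id)
{
	struct cacheqos_group *grp;

	list_for_each_entry(grp, &cacheqos_groups, entry)
		if (grp->id == id)
			return grp;
	return NULL;
}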

Cheers,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.

2014-01-27 17:34:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, Jan 14, 2014 at 09:58:26AM -0800, H. Peter Anvin wrote:
> On 01/12/2014 11:55 PM, Peter Zijlstra wrote:
> >
> > The problem is, since there's a limited number of RMIDs we have to
> > rotate at some point, but since changing RMIDs is nondeterministic we
> > can't.
> >
>
> This is fundamentally the crux here. RMIDs are quite expensive for the
> hardware to implement, so they are limited - but recycling them is
> *very* expensive because you literally have to touch every line in the
> cache.

It's not a problem that changing the task:RMID map is expensive; what is
a problem is that there's no deterministic way of doing it.

That said; I think I've got a sort-of workaround for that. See the
largish comment near cache_pmu_rotate().

I've also illustrated how to use perf-cgroup for this.

The below is a rough draft, most if not all XXXs should be
fixed/finished. But given I don't actually have hardware that supports
this stuff (afaik) I couldn't be arsed.

---
include/linux/perf_event.h | 33 +
kernel/events/core.c | 22 -
x86/kernel/cpu/perf_event_intel_cache.c | 687 ++++++++++++++++++++++++++++++++
3 files changed, 725 insertions(+), 17 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -126,6 +126,14 @@ struct hw_perf_event {
/* for tp_event->class */
struct list_head tp_list;
};
+ struct { /* cache_pmu */
+ struct task_struct *cache_target;
+ int cache_state;
+ int cache_rmid;
+ struct list_head cache_events_entry;
+ struct list_head cache_groups_entry;
+ struct list_head cache_group_entry;
+ };
#ifdef CONFIG_HAVE_HW_BREAKPOINT
struct { /* breakpoint */
/*
@@ -526,6 +534,31 @@ struct perf_output_handle {
int page;
};

+#ifdef CONFIG_CGROUP_PERF
+
+struct perf_cgroup_info;
+
+struct perf_cgroup {
+ struct cgroup_subsys_state css;
+ struct perf_cgroup_info __percpu *info;
+};
+
+/*
+ * Must ensure cgroup is pinned (css_get) before calling
+ * this function. In other words, we cannot call this function
+ * if there is no cgroup event for the current CPU context.
+ *
+ * XXX: its not safe to use this thing!!!
+ */
+static inline struct perf_cgroup *
+perf_cgroup_from_task(struct task_struct *task)
+{
+ return container_of(task_css(task, perf_subsys_id),
+ struct perf_cgroup, css);
+}
+
+#endif /* CONFIG_CGROUP_PERF */
+
#ifdef CONFIG_PERF_EVENTS

extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -329,23 +329,6 @@ struct perf_cgroup_info {
u64 timestamp;
};

-struct perf_cgroup {
- struct cgroup_subsys_state css;
- struct perf_cgroup_info __percpu *info;
-};
-
-/*
- * Must ensure cgroup is pinned (css_get) before calling
- * this function. In other words, we cannot call this function
- * if there is no cgroup event for the current CPU context.
- */
-static inline struct perf_cgroup *
-perf_cgroup_from_task(struct task_struct *task)
-{
- return container_of(task_css(task, perf_subsys_id),
- struct perf_cgroup, css);
-}
-
static inline bool
perf_cgroup_match(struct perf_event *event)
{
@@ -6711,6 +6694,11 @@ perf_event_alloc(struct perf_event_attr
if (task) {
event->attach_state = PERF_ATTACH_TASK;

+ /*
+ * XXX fix for cache_target, dynamic type won't have an easy test,
+ * maybe move target crap into generic event.
+ */
+
if (attr->type == PERF_TYPE_TRACEPOINT)
event->hw.tp_target = task;
#ifdef CONFIG_HAVE_HW_BREAKPOINT
--- /dev/null
+++ b/x86/kernel/cpu/perf_event_intel_cache.c
@@ -0,0 +1,687 @@
+#include <asm/processor.h>
+#include <linux/idr.h>
+#include <linux/spinlock.h>
+#include <linux/perf_event.h>
+
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+#define MSR_IA32_QM_CTR 0x0c8e
+#define MSR_IA32_QM_EVTSEL 0x0c8d
+
+unsigned int max_rmid;
+
+unsigned int l3_scale; /* supposedly cacheline size */
+unsigned int l3_max_rmid;
+
+
+struct cache_pmu_state {
+ raw_spin_lock lock;
+ int rmid;
+ int cnt;
+};
+
+static DEFINE_PER_CPU(struct cache_pmu_state, state);
+
+/*
+ * Protects the global state, hold both for modification, hold either for
+ * stability.
+ *
+ * XXX we modify RMID with only cache_mutex held, racy!
+ */
+static DEFINE_MUTEX(cache_mutex);
+static DEFINE_RAW_SPINLOCK(cache_lock);
+
+static unsigned long *cache_rmid_bitmap;
+
+/*
+ * All events
+ */
+static LIST_HEAD(cache_events);
+
+/*
+ * Groups of events that have the same target(s), one RMID per group.
+ */
+static LIST_HEAD(cache_groups);
+
+/*
+ * The new RMID we must not use until cache_pmu_stable().
+ * See cache_pmu_rotate().
+ */
+static unsigned long *cache_limbo_bitmap;
+
+/*
+ * The spare RMID that make rotation possible; keep out of the
+ * cache_rmid_bitmap to avoid it getting used for new events.
+ */
+static int cache_rotation_rmid;
+
+/*
+ * The freed RMIDs, see cache_pmu_rotate().
+ */
+static int cache_freed_nr;
+static int *cache_freed_rmid;
+
+/*
+ * One online cpu per package, for cache_pmu_stable().
+ */
+static cpumask_t cache_cpus;
+
+/*
+ * Returns < 0 on fail.
+ */
+static int __get_rmid(void)
+{
+ return bitmap_find_free_region(cache_rmid_bitmap, max_rmid, 0);
+}
+
+static void __put_rmid(int rmid)
+{
+ bitmap_release_region(cache_rmid_bitmap, rmid, 0);
+}
+
+/*
+ * Needs a quiescent state before __put, see cache_pmu_stabilize().
+ */
+static void __free_rmid(int rmid)
+{
+ cache_freed_rmid[cache_freed_nr++] = rmid;
+}
+
+#define RMID_VAL_ERROR (1ULL << 63)
+#define RMID_VAL_UNAVAIL (1ULL << 62)
+
+static u64 __rmid_read(unsigned long rmid)
+{
+ u64 val;
+
+ /*
+ * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
+ * it just says that to increase confusion.
+ */
+ wrmsr(MSR_IA32_QM_EVTSEL, 1 | (rmid << 32));
+ rdmsr(MSR_IA32_QM_CTR, val);
+
+ /*
+ * Aside from the ERROR and UNAVAIL bits, assume this thing returns
+ * the number of cachelines tagged with @rmid.
+ */
+ return val;
+}
+
+static void smp_test_stable(void *info)
+{
+ bool *used = info;
+ int i;
+
+ for (i = 0; i < cache_freed_nr; i++) {
+ if (__rmid_read(cache_freed_rmid[i]))
+ *used = false;
+ }
+}
+
+/*
+ * Test if the freed RMIDs are unused; see the comment near
+ * cache_pmu_rotate().
+ */
+static bool cache_pmu_is_stable(void)
+{
+ bool used = true;
+
+ smp_call_function_many(&cache_cpus, smp_test_stable, &used, true);
+
+ return used;
+}
+
+/*
+ * Quiescent state; wait for all the 'freed' RMIDs to become unused. After this
+ * we can reuse them and know that the current set of active RMIDs is
+ * stable.
+ */
+static void cache_pmu_stabilize(void)
+{
+ int i = 0;
+
+ if (!cache_freed_nr)
+ return;
+
+ /*
+ * Now wait until the old RMID drops back to 0 again, this means all
+ * cachelines have acquired a new tag and the new RMID is now stable.
+ */
+ while (!cache_pmu_is_stable()) {
+ /*
+ * XXX adaptive timeout? Ideally the hardware would get us an
+ * interrupt :/
+ */
+ schedule_timeout_uninterruptible(1);
+ }
+
+ bitmap_clear(cache_limbo_bitmap, 0, max_rmid);
+
+ if (cache_rotation_rmid <= 0) {
+ cache_rotation_rmid = cache_freed_rmid[0];
+ i++;
+ }
+
+ for (; i < cache_freed_nr; i++)
+ __put_rmid(cache_freed_rmid[i]);
+
+ cache_freed_nr = 0;
+}
+
+/*
+ * Exchange the RMID of a group of events.
+ */
+static unsigned long cache_group_xchg_rmid(struct perf_event *group, unsigned long rmid)
+{
+ struct perf_event *event;
+ unsigned long old_rmid = group->hw.cache_rmid;
+
+ group->hw.cache_rmid = rmid;
+ list_for_each_entry(event, &group->hw.cache_group_entry, hw.cache_group_entry)
+ event->hw.cache_rmid = rmid;
+
+ return old_rmid;
+}
+
+/*
+ * Determine if @a and @b measure the same set of tasks.
+ */
+static bool __match_event(struct perf_event *a, struct perf_event *b)
+{
+ if ((a->attach_state & PERF_ATTACH_TASK) !=
+ (b->attach_state & PERF_ATTACH_TASK))
+ return false;
+
+ if (a->attach_state & PERF_ATTACH_TASK) {
+ if (a->hw.cache_target != b->hw.cache_target)
+ return false;
+
+ return true;
+ }
+
+ /* not task */
+
+#ifdef CONFIG_CGROUP_PERF
+ if ((a->cgrp == b->cgrp) && a->cgrp)
+ return true;
+#endif
+
+ return true; /* if not task or cgroup, we're machine wide */
+}
+
+static struct perf_cgroup *event_to_cgroup(struct perf_event *event)
+{
+ if (event->cgrp)
+ return event->cgrp;
+
+ if (event->attach_state & PERF_ATTACH_TASK) /* XXX */
+ return perf_cgroup_from_task(event->hw.cache_target);
+
+ return NULL;
+}
+
+/*
+ * Determine if @na's tasks intersect with @b's tasks
+ */
+static bool __conflict_event(struct perf_event *a, struct perf_event *b)
+{
+#ifdef CONFIG_CGROUP_PERF
+ struct perf_cgroup *ac, *bc;
+
+ ac = event_to_cgroup(a);
+ bc = event_to_cgroup(b);
+
+ if (!ac || !bc) {
+ /*
+ * If either is NULL, its a system wide event and that
+ * always conflicts with a cgroup one.
+ *
+ * If both are system wide, __match_event() should've
+ * been true and we'll never get here, if we did fail.
+ */
+ return true;
+ }
+
+ /*
+ * If one is a parent of the other, we've got an intersection.
+ */
+ if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
+ cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
+ return true;
+#endif
+
+ /*
+ * If one of them is not a task, same story as above with cgroups.
+ */
+ if (!(a->attach_state & PERF_ATTACH_TASK) ||
+ !(b->attach_state & PERF_ATTACH_TASK))
+ return true;
+
+ /*
+ * Again, if they're the same __match_event() should've caught us, if not fail.
+ */
+ if (a->hw.cache_target == b->hw.cache_target)
+ return true;
+
+ /*
+ * Must be non-overlapping.
+ */
+ return false;
+}
+
+/*
+ * Attempt to rotate the groups and assign new RMIDs, ought to run from a
+ * delayed work or somesuch.
+ *
+ * Rotating RMIDs is complicated; firstly because the hardware doesn't give us
+ * any clues; secondly because of cgroups.
+ *
+ * There's problems with the hardware interface; when you change the task:RMID
+ * map cachelines retain their 'old' tags, giving a skewed picture. In order to
+ * work around this, we must always keep one free RMID.
+ *
+ * Rotation works by taking away an RMID from a group (the old RMID), and
+ * assigning the free RMID to another group (the new RMID). We must then wait
+ * for the old RMID to not be used (no cachelines tagged). This ensures that all
+ * cachelines are tagged with 'active' RMIDs. At this point we can start
+ * reading values for the new RMID and treat the old RMID as the free RMID for
+ * the next rotation.
+ *
+ * Secondly, since cgroups can nest, we must make sure to not program
+ * conflicting cgroups at the same time. A conflicting cgroup is one that has a
+ * parent<->child relation. After all, a task of the child cgroup will also be
+ * covered by the parent cgroup.
+ *
+ * Therefore, when selecting a new group, we must invalidate all conflicting
+ * groups. Rotation allows us to measure all (conflicting) groups
+ * sequentially.
+ *
+ * XXX there's a further problem in that because we do our own rotation and
+ * cheat with schedulability the event {enabled,running} times are incorrect.
+ */
+static void cache_pmu_rotate(void)
+{
+ struct perf_event *rotor, *group;
+ int rmid;
+
+ mutex_lock(&cache_mutex);
+
+ if (list_empty(&cache_groups))
+ goto unlock_mutex;
+
+ rotor = list_first_entry(&cache_groups, struct perf_event, hw.cache_groups_entry);
+
+ raw_spin_lock_irq(&cache_lock);
+ list_del(&rotor->hw.cache_groups_entry);
+ rmid = cache_group_xchg_rmid(rotor, -1);
+ WARN_ON_ONCE(rmid <= 0); /* first entry must always have an RMID */
+ __free_rmid(rmid);
+ raw_spin_unlock_irq(&cache_lock);
+
+ /*
+ * XXX O(n^2) schedulability
+ */
+
+ list_for_each_entry(group, &cache_groups, hw.cache_groups_entry) {
+ bool conflicts = false;
+ struct perf_event *iter;
+
+ list_for_each_entry(iter, &cache_groups, hw.cache_groups_entry) {
+ if (iter == group)
+ break;
+ if (__conflict_event(group, iter)) {
+ conflicts = true;
+ break;
+ }
+ }
+
+ if (conflicts && group->hw.cache_rmid > 0) {
+ rmid = cache_group_xchg_rmid(group, -1);
+ WARN_ON_ONCE(rmid <= 0);
+ __free_rmid(rmid);
+ continue;
+ }
+
+ if (!conflicts && group->hw.cache_rmid <= 0) {
+ rmid = __get_rmid();
+ if (rmid <= 0) {
+ rmid = cache_rotation_rmid;
+ cache_rotation_rmid = -1;
+ }
+ if (rmid <= 0)
+ break; /* we're out of RMIDs, more next time */
+ set_bit(rmid, cache_limbo_bitmap);
+
+ rmid = cache_group_xchg_rmid(group, rmid);
+ WARN_ON_ONCE(rmid > 0);
+ continue;
+ }
+
+ /*
+ * either we conflict and do not have an RMID -> good,
+ * or we do not conflict and have an RMID -> also good.
+ */
+ }
+
+ raw_spin_lock_irq(&cache_lock);
+ list_add_tail(&rotor->hw.cache_groups_entry, &cache_groups);
+ raw_spin_unlock_irq(&cache_lock);
+
+ /*
+ * XXX force a PMU reprogram here such that the new RMIDs are in
+ * effect.
+ */
+
+ cache_pmu_stabilize();
+
+unlock_mutex:
+ mutex_unlock(&cache_mutex);
+
+ /*
+ * XXX reschedule work.
+ */
+}
+
+/*
+ * Find a group and setup RMID
+ */
+static struct perf_event *cache_pmu_setup_event(struct perf_event *event)
+{
+ struct perf_event *iter;
+ int rmid = 0; /* unset */
+
+ list_for_each_entry(iter, &cache_groups, hw.cache_groups_entry) {
+ if (__match_event(iter, event)) {
+ event->hw.cache_rmid = iter->hw.cache_rmid;
+ return iter;
+ }
+ if (__conflict_event(iter, event))
+ rmid = -1; /* conflicting rmid */
+ }
+
+ if (!rmid) {
+ /* XXX lacks stabilization */
+ event->hw.cache_rmid = __get_rmid();
+ }
+
+ return NULL;
+}
+
+static void cache_pmu_event_read(struct perf_event *event)
+{
+ unsigned long rmid = event->hw.cache_rmid;
+ u64 val = RMID_VAL_UNAVAIL;
+
+ if (!test_bit(rmid, cache_limbo_bitmap))
+ val = __rmid_read(rmid);
+
+ /*
+ * Ignore this reading on error states and do not update the value.
+ */
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+
+ val *= l3_scale; /* cachelines -> bytes */
+
+ local64_set(&event->count, val);
+}
+
+static void cache_pmu_event_start(struct perf_event *event, int mode)
+{
+ struct cache_pmu_state *state = &__get_cpu_var(state);
+ int rmid = event->hw.cache_rmid;
+ unsigned long flags;
+
+ if (!(event->hw.cache_state & PERF_HES_STOPPED))
+ return;
+
+ event->hw.cache_state &= ~PERF_HES_STOPPED;
+
+ raw_spin_lock_irqsave(&state->lock, flags);
+ if (state->cnt++)
+ WARN_ON_ONCE(state->rmid != rmid);
+ else
+ WARN_ON_ONCE(state->rmid);
+ state->rmid = rmid;
+ wrmsr(MSR_IA32_PQR_ASSOC, state->rmid);
+ raw_spin_unlock_irqrestore(&state->lock, flags);
+}
+
+static void cache_pmu_event_stop(struct perf_event *event, int mode)
+{
+ struct cache_pmu_state *state = &__get_cpu_var(state);
+ unsigned long flags;
+
+ if (event->hw.cache_state & PERF_HES_STOPPED)
+ return;
+
+ event->hw.cache_state |= PERF_HES_STOPPED;
+
+ raw_spin_lock_irqsave(&state->lock, flags);
+ cache_pmu_event_read(event);
+ if (!--state->cnt) {
+ state->rmid = 0;
+ wrmsr(MSR_IA32_PQR_ASSOC, 0);
+ } else {
+ WARN_ON_ONCE(!state->rmid);
+ }
+ raw_spin_unlock_irqrestore(&state->lock, flags);
+}
+
+static int cache_pmu_event_add(struct perf_event *event, int mode)
+{
+ struct cache_pmu_state *state = &__get_cpu_var(state);
+ unsigned long flags;
+ int rmid;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ event->hw.cache_state = PERF_HES_STOPPED;
+ rmid = event->hw.cache_rmid;
+ if (rmid <= 0)
+ goto unlock;
+
+ if (mode & PERF_EF_START)
+ cache_pmu_event_start(event, mode);
+
+unlock:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return 0;
+}
+
+static void cache_pmu_event_del(struct perf_event *event, int mode)
+{
+ struct cache_pmu_state *state = &__get_cpu_var(state);
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+ cache_pmu_event_stop(event, mode);
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+}
+
+static void cache_pmu_event_destroy(struct perf_event *event)
+{
+ struct perf_event *group_other = NULL;
+
+ mutex_lock(&cache_mutex);
+ raw_spin_lock_irq(&cache_lock);
+
+ list_del(&event->hw.cache_events_entry);
+
+ /*
+ * If there's another event in this group...
+ */
+ if (!list_empty(&event->hw.cache_group_entry)) {
+ group_other = list_first_entry(&event->hw.cache_group_entry,
+ struct perf_event,
+ hw.cache_group_entry);
+ list_del(&event->hw.cache_group_entry);
+ }
+ /*
+ * And we're the group leader..
+ */
+ if (!list_empty(&event->hw.cache_groups_entry)) {
+ /*
+ * If there was a group_other, make that leader, otherwise
+ * destroy the group and return the RMID.
+ */
+ if (group_other) {
+ list_replace(&event->hw.cache_groups_entry,
+ &group_other->hw.cache_groups_entry);
+ } else {
+ int rmid = event->hw.cache_rmid;
+ if (rmid > 0)
+ __put_rmid(rmid);
+ list_del(&event->hw.cache_groups_entry);
+ }
+ }
+
+ raw_spin_unlock_irq(&cache_lock);
+ mutex_unlock(&cache_mutex);
+}
+
+static struct pmu cache_pmu;
+
+/*
+ * Takes non-sampling task,cgroup or machine wide events.
+ *
+ * XXX there's a bit of a problem in that we cannot simply do the one event per
+ * node as one would want, since that one event would only get scheduled on
+ * one cpu. But we want to 'schedule' the RMID on all CPUs.
+ *
+ * This means we want events for each CPU, however, that generates a lot of
+ * duplicate values out to userspace -- this is not to be helped unless we want
+ * to change the core code in some way.
+ */
+static int cache_pmu_event_init(struct perf_event *event)
+{
+ struct perf_event *group;
+
+ if (event->attr.type != cache_pmu.type)
+ return -ENOENT;
+
+ if (event->attr.config != 0)
+ return -EINVAL;
+
+ if (event->cpu == -1) /* must have per-cpu events; see above */
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ event->attr.sample_period) /* no sampling */
+ return -EINVAL;
+
+ event->destroy = cache_pmu_event_destroy;
+
+ mutex_lock(&cache_mutex);
+
+ group = cache_pmu_setup_event(event); /* will also set rmid */
+
+ raw_spin_lock_irq(&cache_lock);
+ if (group) {
+ event->hw.cache_rmid = group->hw.cache_rmid;
+ list_add_tail(&event->hw.cache_group_entry,
+ &group->hw.cache_group_entry);
+ } else {
+ list_add_tail(&event->hw.cache_groups_entry,
+ &cache_groups);
+ }
+
+ list_add_tail(&event->hw.cache_events_entry, &cache_events);
+ raw_spin_unlock_irq(&cache_lock);
+
+ mutex_unlock(&cache_mutex);
+
+ return 0;
+}
+
+static struct pmu cache_pmu = {
+ .task_ctx_nr = perf_sw_context, /* we cheat: our add will never fail */
+ .event_init = cache_pmu_event_init,
+ .add = cache_pmu_event_add,
+ .del = cache_pmu_event_del,
+ .start = cache_pmu_event_start,
+ .stop = cache_pmu_event_stop,
+ .read = cache_pmu_event_read,
+};
+
+static int __init cache_pmu_init(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+ int i, ret;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return 0;
+
+ if (boot_cpu_data.x86 != 6)
+ return 0;
+
+ cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx);
+
+ /* CPUID.(EAX=07H, ECX=0).EBX.QOS[bit12] */
+ if (!(ebx & (1 << 12)))
+ return 0;
+
+ cpuid_count(0x0f, 0, &eax, &ebx, &ecx, &edx);
+
+ max_rmid = ebx;
+
+ /*
+ * We should iterate bits in CPUID(EAX=0FH, ECX=0).EDX
+ * For now, only support L3 (bit 1).
+ */
+ if (!(edx & (1 << 1)))
+ return 0;
+
+ cpuid_count(0x0f, 1, &eax, &ebx, &ecx, &edx);
+
+ l3_scale = ebx;
+ l3_max_rmid = ecx;
+
+ if (l3_max_rmid != max_rmid)
+ return 0;
+
+ cache_rmid_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(max_rmid), GFP_KERNEL);
+ if (!cache_rmid_bitmap)
+ return -ENOMEM;
+
+ cache_limbo_bitmap = kmalloc(sizeof(long) * BITS_TO_LONGS(max_rmid), GFP_KERNEL);
+ if (!cache_limbo_bitmap)
+ return -ENOMEM; /* XXX frees */
+
+ cache_freed_rmid = kmalloc(sizeof(int) * max_rmid, GFP_KERNEL);
+ if (!cache_freed_rmid)
+ return -ENOMEM; /* XXX free bitmaps */
+
+ bitmap_zero(cache_rmid_bitmap, max_rmid);
+ bitmap_set(cache_rmid_bitmap, 0, 1); /* RMID 0 is special */
+ cache_rotation_rmid = __get_rmid(); /* keep one free RMID for rotation */
+ if (WARN_ON_ONCE(cache_rotation_rmid < 0))
+ return cache_rotation_rmid;
+
+ /*
+ * XXX hotplug notifiers!
+ */
+ for_each_possible_cpu(i) {
+ struct cache_pmu_state *state = &per_cpu(state, i);
+
+ raw_spin_lock_init(&state->lock);
+ state->rmid = 0;
+ }
+
+ ret = perf_pmu_register(&cache_pmu, "cache_qos", -1);
+ if (WARN_ON(ret)) {
+ pr_info("Cache QoS detected, registration failed (%d), disabled\n", ret);
+ return -1;
+ }
+
+ return 0;
+}
+device_initcall(cache_pmu_init);

2014-02-18 17:30:05

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Mon, 2014-01-27 at 18:34 +0100, Peter Zijlstra wrote:

Hi Peter,

First of all, sorry for the delay in responding. I've been talking with
the CPU architects to make sure we're going down the right path here
before coming back to this. Responses below.

> On Tue, Jan 14, 2014 at 09:58:26AM -0800, H. Peter Anvin wrote:
> > On 01/12/2014 11:55 PM, Peter Zijlstra wrote:
> > >
> > > The problem is, since there's a limited number of RMIDs we have to
> > > rotate at some point, but since changing RMIDs is nondeterministic we
> > > can't.
> > >
> >
> > This is fundamentally the crux here. RMIDs are quite expensive for the
> > hardware to implement, so they are limited - but recycling them is
> > *very* expensive because you literally have to touch every line in the
> > cache.
>
> Its not a problem that changing the task:RMID map is expensive, what is
> a problem is that there's no deterministic fashion of doing it.

We are going to add to the SDM that changing RMID's often/frequently is
not the intended use case for this feature, and can cause bogus data.
The real intent is to land threads into an RMID, and run that until the
threads are effectively done.

That being said, reassigning a thread to a new RMID is certainly
supported, just "frequent" updates is not encouraged at all.

> That said; I think I've got a sort-of workaround for that. See the
> largish comment near cache_pmu_rotate().



> I've also illustrated how to use perf-cgroup for this.

I do see that, however the userspace interface for this isn't ideal for
how the feature is intended to be used. I'm still planning to have this
be managed per process in /proc/<pid>, I just had other priorities push
this back a bit on my stovetop.

Also, now that the new SDM is available, there is a new feature added to
the same family as CQM, called Memory Bandwidth Monitoring (MBM). The
original cgroup approach would have allowed another subsystem to be added
next to cacheqos; the perf-cgroup here is not easily expandable.
The /proc/<pid> approach can add MBM pretty easily alongside CQM.

> The below is a rough draft, most if not all XXXs should be
> fixed/finished. But given I don't actually have hardware that supports
> this stuff (afaik) I couldn't be arsed.

The hardware is not publicly available yet, but I know that Red Hat and
others have some of these platforms for testing.

I really appreciate the patch. There was a good amount of thought put
into this, and it gave a good set of different viewpoints. I'll keep the
comments all here in one place; it'll be easier to discuss than having
them disjointed in the code.

The rotation idea to reclaim RMIDs no longer in use is interesting.
This differs from the original patch, which would reclaim the RMID when
monitoring was disabled for that group of processes.

I can see a merged sort of approach, where if monitoring for a group of
processes is disabled, we can place that RMID onto a reclaim list. The
next time an RMID is requested (monitoring is enabled for a
process/group of processes), the reclaim list is searched for an RMID
that has 0 occupancy (i.e. not in use), or worst-case, find and assign
one with the lowest occupancy. I did discuss this with hpa offline and
this seemed reasonable.
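
Roughly, the search I have in mind would look something like the below
(just a sketch: the reclaim list and its entries are made up here,
__rmid_read() is the helper from your draft, and it glosses over the fact
that occupancy has to be checked on every package):

struct reclaim_rmid {
        struct list_head list;
        int rmid;
};

static LIST_HEAD(cache_reclaim_list);

static int reclaim_rmid(void)
{
        struct reclaim_rmid *r, *best = NULL;
        u64 occ, best_occ = ~0ULL;
        int rmid;

        list_for_each_entry(r, &cache_reclaim_list, list) {
                occ = __rmid_read(r->rmid);
                if (occ & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
                        continue;
                if (!occ) {             /* 0 occupancy, take it right away */
                        best = r;
                        break;
                }
                if (occ < best_occ) {   /* otherwise remember the least occupied */
                        best = r;
                        best_occ = occ;
                }
        }

        if (!best)
                return -1;              /* nothing to reclaim */

        rmid = best->rmid;
        list_del(&best->list);
        kfree(best);
        return rmid;
}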

Thoughts?

Thanks,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.



2014-02-18 19:35:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, Feb 18, 2014 at 05:29:42PM +0000, Waskiewicz Jr, Peter P wrote:
> > Its not a problem that changing the task:RMID map is expensive, what is
> > a problem is that there's no deterministic fashion of doing it.
>
> We are going to add to the SDM that changing RMID's often/frequently is
> not the intended use case for this feature, and can cause bogus data.
> The real intent is to land threads into an RMID, and run that until the
> threads are effectively done.
>
> That being said, reassigning a thread to a new RMID is certainly
> supported, just "frequent" updates is not encouraged at all.

You don't even need really high frequency, just unsynchronized wrt
reading the counter. Suppose A flips the RMIDs about and just when its
done programming B reads them.

At that point you've got 0 guarantee the data makes any kind of sense.

> I do see that, however the userspace interface for this isn't ideal for
> how the feature is intended to be used. I'm still planning to have this
> be managed per process in /proc/<pid>, I just had other priorities push
> this back a bit on my stovetop.

So I really don't like anything /proc/$pid/ nor do I really see a point in
doing that. What are you going to do in the /proc/$pid/ thing anyway?
Exposing raw RMIDs is an absolute no-no, and anything else is going to
end up being yet-another-grouping thing and thus not much different from
cgroups.

> Also, now that the new SDM is available

Can you guys please set up a mailing list already so we know when
there's new versions out? Ideally mailing out the actual PDF too so I
get the automagic download and archive for all versions.

> , there is a new feature added to
> the same family as CQM, called Memory Bandwidth Monitoring (MBM). The
> original cgroup approach would have allowed another subsystem be added
> next to cacheqos; the perf-cgroup here is not easily expandable.
> The /proc/<pid> approach can add MBM pretty easily alongside CQM.

I'll have to go read up what you've done now, but if its also RMID based
I don't see why the proposed scheme won't work.

> > The below is a rough draft, most if not all XXXs should be
> > fixed/finished. But given I don't actually have hardware that supports
> > this stuff (afaik) I couldn't be arsed.
>
> The hardware is not publicly available yet, but I know that Red Hat and
> others have some of these platforms for testing.

Yeah, not in my house therefore it doesn't exist :-)

> I really appreciate the patch. There was a good amount of thought put
> into this, and gave a good set of different viewpoints. I'll keep the
> comments all here in one place, it'll be easier to discuss than
> disjointed in the code.
>
> The rotation idea to reclaim RMID's no longer in use is interesting.
> This differs from the original patch where the original patch would
> reclaim the RMID when monitoring was disabled for that group of
> processes.
>
> I can see a merged sort of approach, where if monitoring for a group of
> processes is disabled, we can place that RMID onto a reclaim list. The
> next time an RMID is requested (monitoring is enabled for a
> process/group of processes), the reclaim list is searched for an RMID
> that has 0 occupancy (i.e. not in use), or worst-case, find and assign
> one with the lowest occupancy. I did discuss this with hpa offline and
> this seemed reasonable.
>
> Thoughts?

So you have to wait for one 'freed' RMID to become empty before
'allowing' reads of the other RMIDs, otherwise the visible value can be
complete rubbish. Even for low frequency rotation, see the above
scenario about asynchronous operations.

This means you have to always have at least one free RMID.
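
In pseudo-code, something like the below, where rmid_occupancy() and
nr_packages are made-up stand-ins for an IA32_QM_CTR read of the L3
occupancy event on one CPU in each package:

static bool freed_rmid_is_reusable(int rmid)
{
        int pkg;

        for (pkg = 0; pkg < nr_packages; pkg++) {
                if (rmid_occupancy(pkg, rmid))
                        return false;   /* old lines still tagged, keep waiting */
        }

        /* only now can reads of the other RMIDs be trusted */
        return true;
}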

2014-02-18 19:54:39

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, 2014-02-18 at 20:35 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 05:29:42PM +0000, Waskiewicz Jr, Peter P wrote:
> > > Its not a problem that changing the task:RMID map is expensive, what is
> > > a problem is that there's no deterministic fashion of doing it.
> >
> > We are going to add to the SDM that changing RMID's often/frequently is
> > not the intended use case for this feature, and can cause bogus data.
> > The real intent is to land threads into an RMID, and run that until the
> > threads are effectively done.
> >
> > That being said, reassigning a thread to a new RMID is certainly
> > supported, just "frequent" updates is not encouraged at all.
>
> You don't even need really high frequency, just unsynchronized wrt
> reading the counter. Suppose A flips the RMIDs about and just when its
> done programming B reads them.
>
> At that point you've got 0 guarantee the data makes any kind of sense.

Agreed, there is no guarantee with how the hardware is designed. We
don't have an instruction that can nuke RMID-tagged cachelines from the
cache, and the CPU guys (along with hpa) have been very explicit that
wbinv is not an option.

> > I do see that, however the userspace interface for this isn't ideal for
> > how the feature is intended to be used. I'm still planning to have this
> > be managed per process in /proc/<pid>, I just had other priorities push
> > this back a bit on my stovetop.
>
> So I really don't like anything /proc/$pid/ nor do I really see a point in
> doing that. What are you going to do in the /proc/$pid/ thing anyway?
> Exposing raw RMIDs is an absolute no-no, and anything else is going to
> end up being yet-another-grouping thing and thus not much different from
> cgroups.

Exactly. The cgroup grouping mechanisms fit really well with this
feature. I was exploring another way to do it given the pushback on
using cgroups initially. The RMID's won't be exposed, rather a group
identifier (in cgroups it's the new subdirectory in the subsystem), and
RMIDs are assigned by the kernel, completely hidden to userspace.

>
> > Also, now that the new SDM is available
>
> Can you guys please set up a mailing list already so we know when
> there's new versions out? Ideally mailing out the actual PDF too so I
> get the automagic download and archive for all versions.

I assume this has been requested before. As I'm typing this, I just
received the notification internally that the new SDM is now published.
I'll forward your request along and see what I hear back.

> > , there is a new feature added to
> > the same family as CQM, called Memory Bandwidth Monitoring (MBM). The
> > original cgroup approach would have allowed another subsystem be added
> > next to cacheqos; the perf-cgroup here is not easily expandable.
> > The /proc/<pid> approach can add MBM pretty easily alongside CQM.
>
> I'll have to go read up what you've done now, but if its also RMID based
> I don't see why the proposed scheme won't work.

Yes please do look at the cgroup patches. For the RMID allocation, we
could use your proposal to manage allocation/reclamation, and the
management interface to userspace will match the use cases I'm trying to
enable.

> > > The below is a rough draft, most if not all XXXs should be
> > > fixed/finished. But given I don't actually have hardware that supports
> > > this stuff (afaik) I couldn't be arsed.
> >
> > The hardware is not publicly available yet, but I know that Red Hat and
> > others have some of these platforms for testing.
>
> Yeah, not in my house therefore it doesn't exist :-)
>
> > I really appreciate the patch. There was a good amount of thought put
> > into this, and gave a good set of different viewpoints. I'll keep the
> > comments all here in one place, it'll be easier to discuss than
> > disjointed in the code.
> >
> > The rotation idea to reclaim RMID's no longer in use is interesting.
> > This differs from the original patch where the original patch would
> > reclaim the RMID when monitoring was disabled for that group of
> > processes.
> >
> > I can see a merged sort of approach, where if monitoring for a group of
> > processes is disabled, we can place that RMID onto a reclaim list. The
> > next time an RMID is requested (monitoring is enabled for a
> > process/group of processes), the reclaim list is searched for an RMID
> > that has 0 occupancy (i.e. not in use), or worst-case, find and assign
> > one with the lowest occupancy. I did discuss this with hpa offline and
> > this seemed reasonable.
> >
> > Thoughts?
>
> So you have to wait for one 'freed' RMID to become empty before
> 'allowing' reads of the other RMIDs, otherwise the visible value can be
> complete rubbish. Even for low frequency rotation, see the above
> scenario about asynchronous operations.
>
> This means you have to always have at least one free RMID.

Understood now, I was missing the asynchronous point you were trying to
make. I thought you wanted the free RMID so you could always assign one
you know is "empty," not to get around the twiddling that can
occur.

Let me know what you think about the cacheqos cgroup implementation I
sent, and if things don't look horrible, I can respin with your RMID
management scheme.

Thanks,
-PJ

--
PJ Waskiewicz Open Source Technology Center
[email protected] Intel Corp.



2014-02-20 16:58:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] x86: Add Cache QoS Monitoring (CQM) support

On Tue, Feb 18, 2014 at 07:54:34PM +0000, Waskiewicz Jr, Peter P wrote:
> On Tue, 2014-02-18 at 20:35 +0100, Peter Zijlstra wrote:
> > On Tue, Feb 18, 2014 at 05:29:42PM +0000, Waskiewicz Jr, Peter P wrote:
> > > > Its not a problem that changing the task:RMID map is expensive, what is
> > > > a problem is that there's no deterministic fashion of doing it.
> > >
> > > We are going to add to the SDM that changing RMID's often/frequently is
> > > not the intended use case for this feature, and can cause bogus data.
> > > The real intent is to land threads into an RMID, and run that until the
> > > threads are effectively done.
> > >
> > > That being said, reassigning a thread to a new RMID is certainly
> > > supported, just "frequent" updates is not encouraged at all.
> >
> > You don't even need really high frequency, just unsynchronized wrt
> > reading the counter. Suppose A flips the RMIDs about and just when its
> > done programming B reads them.
> >
> > At that point you've got 0 guarantee the data makes any kind of sense.
>
> Agreed, there is no guarantee with how the hardware is designed. We
> don't have an instruction that can nuke RMID-tagged cachelines from the
> cache, and the CPU guys (along with hpa) have been very explicit that
> wbinv is not an option.

Right; but if you wait for the 'unused' RMID to drop to 0 occupancy you
have a fair chance all lines have an active RMID tag. There are a few
corner cases where this is not so, but given the hardware this is the
best I could come up with.

Under constant L3 pressure it basically means that your new RMID
assignment has reached steady state (in as far as the workload has one
to begin with).

wbinv is actually worse in that it wipes everything; it will guarantee
any occupancy read will not over-report, but almost guarantees
under-reporting if you're 'quick'.

The only really sucky part is that we have to poll for this situation to
occur.

> > > I do see that, however the userspace interface for this isn't ideal for
> > > how the feature is intended to be used. I'm still planning to have this
> > > be managed per process in /proc/<pid>, I just had other priorities push
> > > this back a bit on my stovetop.
> >
> > So I really don't like anything /proc/$pid/ nor do I really see a point in
> > doing that. What are you going to do in the /proc/$pid/ thing anyway?
> > Exposing raw RMIDs is an absolute no-no, and anything else is going to
> > end up being yet-another-grouping thing and thus not much different from
> > cgroups.
>
> Exactly. The cgroup grouping mechanisms fit really well with this
> feature. I was exploring another way to do it given the pushback on
> using cgroups initially. The RMID's won't be exposed, rather a group
> identifier (in cgroups it's the new subdirectory in the subsystem), and
> RMIDs are assigned by the kernel, completely hidden to userspace.

So I don't see the need for a custom controller; what's wrong with the
perf-cgroup approach I proposed?

The thing is, a custom controller will have to jump through most of the
same hoops anyway.

> > > Also, now that the new SDM is available
> >
> > Can you guys please set up a mailing list already so we know when
> > there's new versions out? Ideally mailing out the actual PDF too so I
> > get the automagic download and archive for all versions.
>
> I assume this has been requested before. As I'm typing this, I just
> received the notification internally that the new SDM is now published.
> I'll forward your request along and see what I hear back.

Yeah, just about every time an Intel person tells me I've been staring
at the wrong version -- usually several emails down a confused
discussion.

The even better option would be the TeX source of the document so we can
diff(1) for changes (and yes; I suspect you're not using TeX like you
should be :-).

Currently we manually keep histerical versions and hope to spot the
differences by hand, but it's very painful.

> > > , there is a new feature added to
> > > the same family as CQM, called Memory Bandwidth Monitoring (MBM). The
> > > original cgroup approach would have allowed another subsystem be added
> > > next to cacheqos; the perf-cgroup here is not easily expandable.
> > > The /proc/<pid> approach can add MBM pretty easily alongside CQM.
> >
> > I'll have to go read up what you've done now, but if its also RMID based
> > I don't see why the proposed scheme won't work.

OK; so in the Feb 2014 edition of the Intel SDM for x86_64...

Vol 3c, table 35-23, lists the QM_EVTSEL, QM_CTR and PQR_ASSOC as per
thread, which I read to mean per logical cpu.

(and here I ask what's a PQR)

Vol 3b. 17.14.7 has the following text:

"Thread access to the IA32_QM_EVTSEL and IA32_QM_CTR MSR pair should be
serialized to avoid situations where one thread changes the RMID/EvtID
just before another thread reads monitoring data from IA32_QM_CTR."

The PQR_ASSOC is also stated to be per logical CPU in 17.14.3; but that
same section fails to be explicit for the QM_* thingies.

So which is it; are the QM_* MSRs shared across threads or is it per
thread?
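
Either way, it reads like the EVTSEL write and the CTR read must be done
back to back under a lock; roughly the below -- a sketch only, and the
scope of the lock (core? package?) is exactly what's unclear above:

static DEFINE_RAW_SPINLOCK(qm_msr_lock);

static u64 qm_read(u32 rmid, u32 evtid)
{
        unsigned long flags;
        u64 val;

        raw_spin_lock_irqsave(&qm_msr_lock, flags);
        /* RMID in the upper half, event id in the low bits, as in the draft */
        wrmsrl(MSR_IA32_QM_EVTSEL, ((u64)rmid << 32) | evtid);
        rdmsrl(MSR_IA32_QM_CTR, val);
        raw_spin_unlock_irqrestore(&qm_msr_lock, flags);

        return val;
}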

Vol 3b. 17.14.5.2 MBM is rather sparse, but as far as I can gather from
the text in 17.14.5 the MBM events work more like normal PMU events in
that once you program the QM_EVTSEL it starts counting.

However, there doesn't appear to be an EN bit, nor is CTR writable. So
it appears we must simply set EVTSEL, quickly read CTR as start value,
and at some time later (while also keeping track of time) read it again
and compute the lines/time for bandwidth?
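
IOW something like the below, with qm_read() being the serialized read
sketched earlier; the struct, the MBM event id and the scale factor are
all guesses that need checking against the SDM, and counter wrap is
ignored:

struct mbm_sample {
        u64 count;      /* raw IA32_QM_CTR value */
        u64 ns;         /* timestamp of that read */
};

static u64 mbm_bytes_per_sec(const struct mbm_sample *a,
                             const struct mbm_sample *b, u64 scale)
{
        u64 dcount = b->count - a->count;
        u64 dns = b->ns - a->ns;

        if (!dns)
                return 0;

        /* counts -> bytes via the CPUID scale factor, then per second */
        return div64_u64(dcount * scale * NSEC_PER_SEC, dns);
}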

I suppose that since we have multiple cores (or threads, depending on
how the MSRs are implemented) per L3 we can model the thing as having
that many counters.

A bit crappy because we'll have to IPI ourselves into oblivion to
control all those counters; a better deal would've been that many MSRs
package wide -- like the other uncore PMUs have.