This series introduces the next iteration of kernel support for the
Cache QoS Monitoring (CQM) technology available in Intel Xeon processors.
One of the main limitations of the previous version is the inability
to simultaneously monitor:
1) a CPU event and any other event in that CPU.
2) cgroup events for cgroups in the same descendancy line.
3) cgroup events and any thread event of a cgroup in the same
descendancy line.
Another limitation is that monitoring for a cgroup was enabled/disabled by
the existence of a perf event for that cgroup. Since the event
llc_occupancy measures changes in occupancy rather than total occupancy,
in order to read meaningful llc_occupancy values, an event must remain
enabled for a long enough period of time. The context-switch overhead
caused by these perf events is undesirable in some sensitive scenarios.
This series of patches addresses the shortcomings mentioned above and
adds some other improvements. The main changes are:
- No more potential conflicts between different events. The new
version builds a hierarchy of RMIDs that captures the dependency
between monitored cgroups. llc_occupancy for a cgroup is the sum of
the llc_occupancies of that cgroup's RMID and all other RMIDs in the
cgroup's subtree (both monitored cgroups and threads); see the
sketch after this list.
- A cgroup integration that allows monitoring a cgroup without
creating a perf event, decreasing the context switch overhead.
Monitoring is controlled by a boolean cgroup subsystem attribute
in each perf cgroup:
echo 1 > cgroup_path/perf_event.cqm_cont_monitoring
starts CQM monitoring whether or not there is a perf_event
attached to the cgroup. Setting the attribute to 0 makes
monitoring dependent on the existence of a perf_event.
A perf_event is always required in order to read llc_occupancy.
This cgroup integration uses Intel's PQR code and is intended to
be used by upcoming versions of Intel's CAT.
- A more stable rotation algorithm: The new algorithm uses SLOs that
guarantee:
- A minimum amount of enabled time for monitored cgroups and
threads.
- A maximum time disabled before error is introduced by
reusing dirty RMIDs.
- A minimum rate at which RMID recycling must progress.
- Reduced impact of stealing/rotation of RMIDs: The new algorithm
accounts the residual occupancy held by limbo RMIDs towards the
former owner of the limbo RMID, decreasing the error introduced
by RMID rotation.
It also allows a limbo RMID to be reused by its former owner when
appropriate, decreasing the potential error of reusing dirty RMIDs
and allowing progress to be made even if most limbo RMIDs do not
drop occupancy fast enough.
- Elimination of pmu::count: perf's generic perf_event_count()
performs a quick add of atomic types. The introduction of
pmu::count in the previous CQM series to read occupancy for thread
events changed the behavior of perf_event_count() by performing a
potentially slow IPI and a write/read of an MSR. It also made pmu::read
behave differently depending on whether the event was a
cpu/cgroup event or a thread event. This patch series removes the custom
pmu::count from CQM and provides consistent behavior for all
calls of perf_event_read.
- Added error return for pmu::read: Reads of CQM events may fail
due to stealing of RMIDs, even after successfully adding an event
to a PMU. This patch series expands pmu::read with an int return
value and propagates the error to callers that can fail
(i.e. perf_read).
The ability of pmu::read to fail is consistent with the recent
changes that allow perf_event_read to fail for transactional
reading of event groups (a sketch of the new callback shape is
given further below).
- Introduces the field pmu_event_flags that contains flags set by
the PMU to signal variations on the default behavior to perf's
generic code. In this series, three flags are introduced:
- PERF_CGROUP_NO_RECURSION : Signals generic code not to add
events of the cgroup ancestors of a cgroup.
- PERF_INACTIVE_CPU_READ_PKG: Signals generic code that
this CPU event can be read in any CPU in its event::cpu's
package, even if the event is not active.
- PERF_INACTIVE_EV_READ_ANY_CPU: Signals generic code that
this event can be read in any CPU in any package in the
system even if the event is not active.
Using the above flags takes advantage of CQM's hw ability to
read llc_occupancy even when the associated perf event is not
running in a CPU.
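To make the RMID hierarchy item above concrete, below is a minimal
sketch of how llc_occupancy composes over a cgroup's subtree. The node
type and helper names (struct cqm_node, read_rmid_occupancy) are
illustrative only, not the identifiers used in the patches, and the
per-package state and locking the real code requires are omitted:

	/*
	 * Sketch only: the occupancy reported for a monitored cgroup is its
	 * own RMID's occupancy plus that of every monitored cgroup and
	 * thread in its subtree. Assumes <linux/list.h> and <linux/types.h>.
	 */
	struct cqm_node {
		u32			rmid;		/* RMID owned by this node */
		struct list_head	children;	/* monitored cgroups/threads below */
		struct list_head	sibling;	/* link in parent's children list */
	};

	/* Stand-in for a per-package read of QOS_L3_OCCUP_EVENT_ID. */
	u64 read_rmid_occupancy(u32 rmid);

	static u64 read_subtree_occupancy(struct cqm_node *node)
	{
		struct cqm_node *child;
		u64 total = read_rmid_occupancy(node->rmid);

		list_for_each_entry(child, &node->children, sibling)
			total += read_subtree_occupancy(child);

		return total;
	}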
This patch series also updates the perf tool to fix error handling and to
better handle the idiosyncrasies of snapshot and per-pkg events.
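As a rough illustration of the pmu::read change described above (and of
why the tool's error handling needed revisiting), a CQM-style read
callback with the proposed int return could look like the sketch below.
This is a simplified shape rather than the exact code in the patches,
and the error code chosen here is an assumption:

	/*
	 * Sketch only: pmu::read returning int lets callers such as
	 * perf_read() propagate a failure instead of silently reporting
	 * a stale count.
	 */
	static int intel_cqm_event_read_sketch(struct perf_event *event)
	{
		u32 rmid = event->hw.cqm_rmid;

		/* The RMID may have been stolen by the rotation logic. */
		if (!__rmid_valid(rmid))
			return -ENODATA;	/* assumed error code */

		local64_set(&event->count, __rmid_read(rmid));
		return 0;
	}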
David Carrillo-Cisneros (31):
perf/x86/intel/cqm: temporarily remove MBM from CQM and cleanup
perf/x86/intel/cqm: remove check for conflicting events
perf/x86/intel/cqm: remove all code for rotation of RMIDs
perf/x86/intel/cqm: make read of RMIDs per package (Temporal)
perf/core: remove unused pmu->count
x86/intel,cqm: add CONFIG_INTEL_RDT configuration flag and refactor
PQR
perf/x86/intel/cqm: separate CQM PMU's attributes from x86 PMU
perf/x86/intel/cqm: prepare for next patches
perf/x86/intel/cqm: add per-package RMIDs, data and locks
perf/x86/intel/cqm: basic RMID hierarchy with per package rmids
perf/x86/intel/cqm: (I)state and limbo prmids
perf/x86/intel/cqm: add per-package RMID rotation
perf/x86/intel/cqm: add polled update of RMID's llc_occupancy
perf/x86/intel/cqm: add preallocation of anodes
perf/core: add hooks to expose architecture specific features in
perf_cgroup
perf/x86/intel/cqm: add cgroup support
perf/core: adding pmu::event_terminate
perf/x86/intel/cqm: use pmu::event_terminate
perf/core: introduce PMU event flag PERF_CGROUP_NO_RECURSION
x86/intel/cqm: use PERF_CGROUP_NO_RECURSION in CQM
perf/x86/intel/cqm: handle inherit event and inherit_stat flag
perf/x86/intel/cqm: introduce read_subtree
perf/core: introduce PERF_INACTIVE_*_READ_* flags
perf/x86/intel/cqm: use PERF_INACTIVE_*_READ_* flags in CQM
sched: introduce the finish_arch_pre_lock_switch() scheduler hook
perf/x86/intel/cqm: integrate CQM cgroups with scheduler
perf/core: add perf_event cgroup hooks for subsystem attributes
perf/x86/intel/cqm: add CQM attributes to perf_event cgroup
perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to
pmu::read
perf,perf/x86: add hook perf_event_arch_exec
perf/stat: revamp error handling for snapshot and per_pkg events
Stephane Eranian (1):
perf/stat: fix bug in handling events in error state
arch/alpha/kernel/perf_event.c | 3 +-
arch/arc/kernel/perf_event.c | 3 +-
arch/arm64/include/asm/hw_breakpoint.h | 2 +-
arch/arm64/kernel/hw_breakpoint.c | 3 +-
arch/metag/kernel/perf/perf_event.c | 5 +-
arch/mips/kernel/perf_event_mipsxx.c | 3 +-
arch/powerpc/include/asm/hw_breakpoint.h | 2 +-
arch/powerpc/kernel/hw_breakpoint.c | 3 +-
arch/powerpc/perf/core-book3s.c | 11 +-
arch/powerpc/perf/core-fsl-emb.c | 5 +-
arch/powerpc/perf/hv-24x7.c | 5 +-
arch/powerpc/perf/hv-gpci.c | 3 +-
arch/s390/kernel/perf_cpum_cf.c | 5 +-
arch/s390/kernel/perf_cpum_sf.c | 3 +-
arch/sh/include/asm/hw_breakpoint.h | 2 +-
arch/sh/kernel/hw_breakpoint.c | 3 +-
arch/sparc/kernel/perf_event.c | 2 +-
arch/tile/kernel/perf_event.c | 3 +-
arch/x86/Kconfig | 6 +
arch/x86/events/amd/ibs.c | 2 +-
arch/x86/events/amd/iommu.c | 5 +-
arch/x86/events/amd/uncore.c | 3 +-
arch/x86/events/core.c | 3 +-
arch/x86/events/intel/Makefile | 3 +-
arch/x86/events/intel/bts.c | 3 +-
arch/x86/events/intel/cqm.c | 3847 +++++++++++++++++++++---------
arch/x86/events/intel/cqm.h | 519 ++++
arch/x86/events/intel/cstate.c | 3 +-
arch/x86/events/intel/pt.c | 3 +-
arch/x86/events/intel/rapl.c | 3 +-
arch/x86/events/intel/uncore.c | 3 +-
arch/x86/events/intel/uncore.h | 2 +-
arch/x86/events/msr.c | 3 +-
arch/x86/include/asm/hw_breakpoint.h | 2 +-
arch/x86/include/asm/perf_event.h | 41 +
arch/x86/include/asm/pqr_common.h | 74 +
arch/x86/include/asm/processor.h | 4 +
arch/x86/kernel/cpu/Makefile | 4 +
arch/x86/kernel/cpu/pqr_common.c | 43 +
arch/x86/kernel/hw_breakpoint.c | 3 +-
arch/x86/kvm/pmu.h | 10 +-
drivers/bus/arm-cci.c | 3 +-
drivers/bus/arm-ccn.c | 3 +-
drivers/perf/arm_pmu.c | 3 +-
include/linux/perf_event.h | 91 +-
kernel/events/core.c | 170 +-
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 3 +
kernel/trace/bpf_trace.c | 5 +-
tools/perf/builtin-stat.c | 43 +-
tools/perf/util/counts.h | 19 +
tools/perf/util/evsel.c | 44 +-
tools/perf/util/evsel.h | 8 +-
tools/perf/util/stat.c | 35 +-
54 files changed, 3746 insertions(+), 1337 deletions(-)
create mode 100644 arch/x86/events/intel/cqm.h
create mode 100644 arch/x86/include/asm/pqr_common.h
create mode 100644 arch/x86/kernel/cpu/pqr_common.c
--
2.8.0.rc3.226.g39d4020
Remove MBM code from arch/x86/events/intel/cqm.c. MBM will be added back
using the new RMID infrastructure introduced in this patch series.
Also, remove updates to CQM that are superseded by this series.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 486 ++++----------------------------------------
include/linux/perf_event.h | 1 -
2 files changed, 44 insertions(+), 443 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 7b5fd81..1b064c4 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -13,16 +13,8 @@
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
-#define MBM_CNTR_WIDTH 24
-/*
- * Guaranteed time in ms as per SDM where MBM counters will not overflow.
- */
-#define MBM_CTR_OVERFLOW_TIME 1000
-
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
/**
* struct intel_pqr_state - State cache for the PQR MSR
@@ -50,37 +42,8 @@ struct intel_pqr_state {
* interrupts disabled, which is sufficient for the protection.
*/
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-static struct hrtimer *mbm_timers;
-/**
- * struct sample - mbm event's (local or total) data
- * @total_bytes #bytes since we began monitoring
- * @prev_msr previous value of MSR
- */
-struct sample {
- u64 total_bytes;
- u64 prev_msr;
-};
/*
- * samples profiled for total memory bandwidth type events
- */
-static struct sample *mbm_total;
-/*
- * samples profiled for local memory bandwidth type events
- */
-static struct sample *mbm_local;
-
-#define pkg_id topology_physical_package_id(smp_processor_id())
-/*
- * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
- * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
- * rmids per socket, an example is given below
- * RMID1 of Socket0: vrmid = 1
- * RMID1 of Socket1: vrmid = 1 * (cqm_max_rmid + 1) + 1
- * RMID1 of Socket2: vrmid = 2 * (cqm_max_rmid + 1) + 1
- */
-#define rmid_2_index(rmid) ((pkg_id * (cqm_max_rmid + 1)) + rmid)
-/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
* Also protects event->hw.cqm_rmid
*
@@ -102,13 +65,9 @@ static cpumask_t cqm_cpumask;
#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)
-/*
- * Event IDs are used to program IA32_QM_EVTSEL before reading event
- * counter from IA32_QM_CTR
- */
-#define QOS_L3_OCCUP_EVENT_ID 0x01
-#define QOS_MBM_TOTAL_EVENT_ID 0x02
-#define QOS_MBM_LOCAL_EVENT_ID 0x03
+#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
+
+#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
/*
* This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -252,21 +211,6 @@ static void __put_rmid(u32 rmid)
list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
}
-static void cqm_cleanup(void)
-{
- int i;
-
- if (!cqm_rmid_ptrs)
- return;
-
- for (i = 0; i < cqm_max_rmid; i++)
- kfree(cqm_rmid_ptrs[i]);
-
- kfree(cqm_rmid_ptrs);
- cqm_rmid_ptrs = NULL;
- cqm_enabled = false;
-}
-
static int intel_cqm_setup_rmid_cache(void)
{
struct cqm_rmid_entry *entry;
@@ -274,7 +218,7 @@ static int intel_cqm_setup_rmid_cache(void)
int r = 0;
nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
+ cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
nr_rmids, GFP_KERNEL);
if (!cqm_rmid_ptrs)
return -ENOMEM;
@@ -305,9 +249,11 @@ static int intel_cqm_setup_rmid_cache(void)
mutex_unlock(&cache_mutex);
return 0;
-
fail:
- cqm_cleanup();
+ while (r--)
+ kfree(cqm_rmid_ptrs[r]);
+
+ kfree(cqm_rmid_ptrs);
return -ENOMEM;
}
@@ -335,13 +281,9 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
/*
* Events that target same task are placed into the same cache group.
- * Mark it as a multi event group, so that we update ->count
- * for every event rather than just the group leader later.
*/
- if (a->hw.target == b->hw.target) {
- b->hw.is_group_event = true;
+ if (a->hw.target == b->hw.target)
return true;
- }
/*
* Are we an inherited event?
@@ -450,26 +392,10 @@ static bool __conflict_event(struct perf_event *a, struct perf_event *b)
struct rmid_read {
u32 rmid;
- u32 evt_type;
atomic64_t value;
};
static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
-static void __intel_mbm_event_count(void *info);
-
-static bool is_mbm_event(int e)
-{
- return (e >= QOS_MBM_TOTAL_EVENT_ID && e <= QOS_MBM_LOCAL_EVENT_ID);
-}
-
-static void cqm_mask_call(struct rmid_read *rr)
-{
- if (is_mbm_event(rr->evt_type))
- on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_count, rr, 1);
- else
- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, rr, 1);
-}
/*
* Exchange the RMID of a group of events.
@@ -487,12 +413,12 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
*/
if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
struct rmid_read rr = {
- .rmid = old_rmid,
- .evt_type = group->attr.config,
.value = ATOMIC64_INIT(0),
+ .rmid = old_rmid,
};
- cqm_mask_call(&rr);
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
+ &rr, 1);
local64_set(&group->count, atomic64_read(&rr.value));
}
@@ -504,22 +430,6 @@ static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
raw_spin_unlock_irq(&cache_lock);
- /*
- * If the allocation is for mbm, init the mbm stats.
- * Need to check if each event in the group is mbm event
- * because there could be multiple type of events in the same group.
- */
- if (__rmid_valid(rmid)) {
- event = group;
- if (is_mbm_event(event->attr.config))
- init_mbm_sample(rmid, event->attr.config);
-
- list_for_each_entry(event, head, hw.cqm_group_entry) {
- if (is_mbm_event(event->attr.config))
- init_mbm_sample(rmid, event->attr.config);
- }
- }
-
return old_rmid;
}
@@ -927,72 +837,6 @@ static void intel_cqm_rmid_rotate(struct work_struct *work)
schedule_delayed_work(&intel_cqm_rmid_work, delay);
}
-static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
-{
- struct sample *mbm_current;
- u32 vrmid = rmid_2_index(rmid);
- u64 val, bytes, shift;
- u32 eventid;
-
- if (evt_type == QOS_MBM_LOCAL_EVENT_ID) {
- mbm_current = &mbm_local[vrmid];
- eventid = QOS_MBM_LOCAL_EVENT_ID;
- } else {
- mbm_current = &mbm_total[vrmid];
- eventid = QOS_MBM_TOTAL_EVENT_ID;
- }
-
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
- rdmsrl(MSR_IA32_QM_CTR, val);
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return mbm_current->total_bytes;
-
- if (first) {
- mbm_current->prev_msr = val;
- mbm_current->total_bytes = 0;
- return mbm_current->total_bytes;
- }
-
- /*
- * The h/w guarantees that counters will not overflow
- * so long as we poll them at least once per second.
- */
- shift = 64 - MBM_CNTR_WIDTH;
- bytes = (val << shift) - (mbm_current->prev_msr << shift);
- bytes >>= shift;
-
- bytes *= cqm_l3_scale;
-
- mbm_current->total_bytes += bytes;
- mbm_current->prev_msr = val;
-
- return mbm_current->total_bytes;
-}
-
-static u64 rmid_read_mbm(unsigned int rmid, u32 evt_type)
-{
- return update_sample(rmid, evt_type, 0);
-}
-
-static void __intel_mbm_event_init(void *info)
-{
- struct rmid_read *rr = info;
-
- update_sample(rr->rmid, rr->evt_type, 1);
-}
-
-static void init_mbm_sample(u32 rmid, u32 evt_type)
-{
- struct rmid_read rr = {
- .rmid = rmid,
- .evt_type = evt_type,
- .value = ATOMIC64_INIT(0),
- };
-
- /* on each socket, init sample */
- on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
-}
-
/*
* Find a group and setup RMID.
*
@@ -1005,7 +849,6 @@ static void intel_cqm_setup_event(struct perf_event *event,
bool conflict = false;
u32 rmid;
- event->hw.is_group_event = false;
list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
rmid = iter->hw.cqm_rmid;
@@ -1013,8 +856,6 @@ static void intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
- if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
- init_mbm_sample(rmid, event->attr.config);
return;
}
@@ -1031,9 +872,6 @@ static void intel_cqm_setup_event(struct perf_event *event,
else
rmid = __get_rmid();
- if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
- init_mbm_sample(rmid, event->attr.config);
-
event->hw.cqm_rmid = rmid;
}
@@ -1055,10 +893,7 @@ static void intel_cqm_event_read(struct perf_event *event)
if (!__rmid_valid(rmid))
goto out;
- if (is_mbm_event(event->attr.config))
- val = rmid_read_mbm(rmid, event->attr.config);
- else
- val = __rmid_read(rmid);
+ val = __rmid_read(rmid);
/*
* Ignore this reading on error states and do not update the value.
@@ -1089,100 +924,10 @@ static inline bool cqm_group_leader(struct perf_event *event)
return !list_empty(&event->hw.cqm_groups_entry);
}
-static void __intel_mbm_event_count(void *info)
-{
- struct rmid_read *rr = info;
- u64 val;
-
- val = rmid_read_mbm(rr->rmid, rr->evt_type);
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
- atomic64_add(val, &rr->value);
-}
-
-static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
-{
- struct perf_event *iter, *iter1;
- int ret = HRTIMER_RESTART;
- struct list_head *head;
- unsigned long flags;
- u32 grp_rmid;
-
- /*
- * Need to cache_lock as the timer Event Select MSR reads
- * can race with the mbm/cqm count() and mbm_init() reads.
- */
- raw_spin_lock_irqsave(&cache_lock, flags);
-
- if (list_empty(&cache_groups)) {
- ret = HRTIMER_NORESTART;
- goto out;
- }
-
- list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
- grp_rmid = iter->hw.cqm_rmid;
- if (!__rmid_valid(grp_rmid))
- continue;
- if (is_mbm_event(iter->attr.config))
- update_sample(grp_rmid, iter->attr.config, 0);
-
- head = &iter->hw.cqm_group_entry;
- if (list_empty(head))
- continue;
- list_for_each_entry(iter1, head, hw.cqm_group_entry) {
- if (!iter1->hw.is_group_event)
- break;
- if (is_mbm_event(iter1->attr.config))
- update_sample(iter1->hw.cqm_rmid,
- iter1->attr.config, 0);
- }
- }
-
- hrtimer_forward_now(hrtimer, ms_to_ktime(MBM_CTR_OVERFLOW_TIME));
-out:
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-
- return ret;
-}
-
-static void __mbm_start_timer(void *info)
-{
- hrtimer_start(&mbm_timers[pkg_id], ms_to_ktime(MBM_CTR_OVERFLOW_TIME),
- HRTIMER_MODE_REL_PINNED);
-}
-
-static void __mbm_stop_timer(void *info)
-{
- hrtimer_cancel(&mbm_timers[pkg_id]);
-}
-
-static void mbm_start_timers(void)
-{
- on_each_cpu_mask(&cqm_cpumask, __mbm_start_timer, NULL, 1);
-}
-
-static void mbm_stop_timers(void)
-{
- on_each_cpu_mask(&cqm_cpumask, __mbm_stop_timer, NULL, 1);
-}
-
-static void mbm_hrtimer_init(void)
-{
- struct hrtimer *hr;
- int i;
-
- for (i = 0; i < mbm_socket_max; i++) {
- hr = &mbm_timers[i];
- hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- hr->function = mbm_hrtimer_handle;
- }
-}
-
static u64 intel_cqm_event_count(struct perf_event *event)
{
unsigned long flags;
struct rmid_read rr = {
- .evt_type = event->attr.config,
.value = ATOMIC64_INIT(0),
};
@@ -1195,9 +940,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
return __perf_event_count(event);
/*
- * Only the group leader gets to report values except in case of
- * multiple events in the same group, we still need to read the
- * other events.This stops us
+ * Only the group leader gets to report values. This stops us
* reporting duplicate values to userspace, and gives us a clear
* rule for which task gets to report the values.
*
@@ -1205,7 +948,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
* specific packages - we forfeit that ability when we create
* task events.
*/
- if (!cqm_group_leader(event) && !event->hw.is_group_event)
+ if (!cqm_group_leader(event))
return 0;
/*
@@ -1232,7 +975,7 @@ static u64 intel_cqm_event_count(struct perf_event *event)
if (!__rmid_valid(rr.rmid))
goto out;
- cqm_mask_call(&rr);
+ on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
raw_spin_lock_irqsave(&cache_lock, flags);
if (event->hw.cqm_rmid == rr.rmid)
@@ -1303,14 +1046,8 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
- unsigned long flags;
mutex_lock(&cache_mutex);
- /*
- * Hold the cache_lock as mbm timer handlers could be
- * scanning the list of events.
- */
- raw_spin_lock_irqsave(&cache_lock, flags);
/*
* If there's another event in this group...
@@ -1342,14 +1079,6 @@ static void intel_cqm_event_destroy(struct perf_event *event)
}
}
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-
- /*
- * Stop the mbm overflow timers when the last event is destroyed.
- */
- if (mbm_enabled && list_empty(&cache_groups))
- mbm_stop_timers();
-
mutex_unlock(&cache_mutex);
}
@@ -1357,13 +1086,11 @@ static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
bool rotate = false;
- unsigned long flags;
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
- if ((event->attr.config < QOS_L3_OCCUP_EVENT_ID) ||
- (event->attr.config > QOS_MBM_LOCAL_EVENT_ID))
+ if (event->attr.config & ~QOS_EVENT_MASK)
return -EINVAL;
/* unsupported modes and filters */
@@ -1383,21 +1110,9 @@ static int intel_cqm_event_init(struct perf_event *event)
mutex_lock(&cache_mutex);
- /*
- * Start the mbm overflow timers when the first event is created.
- */
- if (mbm_enabled && list_empty(&cache_groups))
- mbm_start_timers();
-
/* Will also set rmid */
intel_cqm_setup_event(event, &group);
- /*
- * Hold the cache_lock as mbm timer handlers be
- * scanning the list of events.
- */
- raw_spin_lock_irqsave(&cache_lock, flags);
-
if (group) {
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
@@ -1416,7 +1131,6 @@ static int intel_cqm_event_init(struct perf_event *event)
rotate = true;
}
- raw_spin_unlock_irqrestore(&cache_lock, flags);
mutex_unlock(&cache_mutex);
if (rotate)
@@ -1431,16 +1145,6 @@ EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
-EVENT_ATTR_STR(total_bytes, intel_cqm_total_bytes, "event=0x02");
-EVENT_ATTR_STR(total_bytes.per-pkg, intel_cqm_total_bytes_pkg, "1");
-EVENT_ATTR_STR(total_bytes.unit, intel_cqm_total_bytes_unit, "MB");
-EVENT_ATTR_STR(total_bytes.scale, intel_cqm_total_bytes_scale, "1e-6");
-
-EVENT_ATTR_STR(local_bytes, intel_cqm_local_bytes, "event=0x03");
-EVENT_ATTR_STR(local_bytes.per-pkg, intel_cqm_local_bytes_pkg, "1");
-EVENT_ATTR_STR(local_bytes.unit, intel_cqm_local_bytes_unit, "MB");
-EVENT_ATTR_STR(local_bytes.scale, intel_cqm_local_bytes_scale, "1e-6");
-
static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
EVENT_PTR(intel_cqm_llc_pkg),
@@ -1450,38 +1154,9 @@ static struct attribute *intel_cqm_events_attr[] = {
NULL,
};
-static struct attribute *intel_mbm_events_attr[] = {
- EVENT_PTR(intel_cqm_total_bytes),
- EVENT_PTR(intel_cqm_local_bytes),
- EVENT_PTR(intel_cqm_total_bytes_pkg),
- EVENT_PTR(intel_cqm_local_bytes_pkg),
- EVENT_PTR(intel_cqm_total_bytes_unit),
- EVENT_PTR(intel_cqm_local_bytes_unit),
- EVENT_PTR(intel_cqm_total_bytes_scale),
- EVENT_PTR(intel_cqm_local_bytes_scale),
- NULL,
-};
-
-static struct attribute *intel_cmt_mbm_events_attr[] = {
- EVENT_PTR(intel_cqm_llc),
- EVENT_PTR(intel_cqm_total_bytes),
- EVENT_PTR(intel_cqm_local_bytes),
- EVENT_PTR(intel_cqm_llc_pkg),
- EVENT_PTR(intel_cqm_total_bytes_pkg),
- EVENT_PTR(intel_cqm_local_bytes_pkg),
- EVENT_PTR(intel_cqm_llc_unit),
- EVENT_PTR(intel_cqm_total_bytes_unit),
- EVENT_PTR(intel_cqm_local_bytes_unit),
- EVENT_PTR(intel_cqm_llc_scale),
- EVENT_PTR(intel_cqm_total_bytes_scale),
- EVENT_PTR(intel_cqm_local_bytes_scale),
- EVENT_PTR(intel_cqm_llc_snapshot),
- NULL,
-};
-
static struct attribute_group intel_cqm_events_group = {
.name = "events",
- .attrs = NULL,
+ .attrs = intel_cqm_events_attr,
};
PMU_FORMAT_ATTR(event, "config:0-7");
@@ -1569,12 +1244,15 @@ static struct pmu intel_cqm_pmu = {
static inline void cqm_pick_event_reader(int cpu)
{
- int reader;
+ int phys_id = topology_physical_package_id(cpu);
+ int i;
+
+ for_each_cpu(i, &cqm_cpumask) {
+ if (phys_id == topology_physical_package_id(i))
+ return; /* already got reader for this socket */
+ }
- /* First online cpu in package becomes the reader */
- reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
- if (reader >= nr_cpu_ids)
- cpumask_set_cpu(cpu, &cqm_cpumask);
+ cpumask_set_cpu(cpu, &cqm_cpumask);
}
static void intel_cqm_cpu_starting(unsigned int cpu)
@@ -1592,17 +1270,24 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
static void intel_cqm_cpu_exit(unsigned int cpu)
{
- int target;
+ int phys_id = topology_physical_package_id(cpu);
+ int i;
- /* Is @cpu the current cqm reader for this package ? */
+ /*
+ * Is @cpu a designated cqm reader?
+ */
if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
return;
- /* Find another online reader in this package */
- target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+ for_each_online_cpu(i) {
+ if (i == cpu)
+ continue;
- if (target < nr_cpu_ids)
- cpumask_set_cpu(target, &cqm_cpumask);
+ if (phys_id == topology_physical_package_id(i)) {
+ cpumask_set_cpu(i, &cqm_cpumask);
+ break;
+ }
+ }
}
static int intel_cqm_cpu_notifier(struct notifier_block *nb,
@@ -1628,70 +1313,12 @@ static const struct x86_cpu_id intel_cqm_match[] = {
{}
};
-static void mbm_cleanup(void)
-{
- if (!mbm_enabled)
- return;
-
- kfree(mbm_local);
- kfree(mbm_total);
- mbm_enabled = false;
-}
-
-static const struct x86_cpu_id intel_mbm_local_match[] = {
- { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_LOCAL },
- {}
-};
-
-static const struct x86_cpu_id intel_mbm_total_match[] = {
- { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_MBM_TOTAL },
- {}
-};
-
-static int intel_mbm_init(void)
-{
- int ret = 0, array_size, maxid = cqm_max_rmid + 1;
-
- mbm_socket_max = topology_max_packages();
- array_size = sizeof(struct sample) * maxid * mbm_socket_max;
- mbm_local = kmalloc(array_size, GFP_KERNEL);
- if (!mbm_local)
- return -ENOMEM;
-
- mbm_total = kmalloc(array_size, GFP_KERNEL);
- if (!mbm_total) {
- ret = -ENOMEM;
- goto out;
- }
-
- array_size = sizeof(struct hrtimer) * mbm_socket_max;
- mbm_timers = kmalloc(array_size, GFP_KERNEL);
- if (!mbm_timers) {
- ret = -ENOMEM;
- goto out;
- }
- mbm_hrtimer_init();
-
-out:
- if (ret)
- mbm_cleanup();
-
- return ret;
-}
-
static int __init intel_cqm_init(void)
{
- char *str = NULL, scale[20];
+ char *str, scale[20];
int i, cpu, ret;
- if (x86_match_cpu(intel_cqm_match))
- cqm_enabled = true;
-
- if (x86_match_cpu(intel_mbm_local_match) &&
- x86_match_cpu(intel_mbm_total_match))
- mbm_enabled = true;
-
- if (!cqm_enabled && !mbm_enabled)
+ if (!x86_match_cpu(intel_cqm_match))
return -ENODEV;
cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
@@ -1748,41 +1375,16 @@ static int __init intel_cqm_init(void)
cqm_pick_event_reader(i);
}
- if (mbm_enabled)
- ret = intel_mbm_init();
- if (ret && !cqm_enabled)
- goto out;
-
- if (cqm_enabled && mbm_enabled)
- intel_cqm_events_group.attrs = intel_cmt_mbm_events_attr;
- else if (!cqm_enabled && mbm_enabled)
- intel_cqm_events_group.attrs = intel_mbm_events_attr;
- else if (cqm_enabled && !mbm_enabled)
- intel_cqm_events_group.attrs = intel_cqm_events_attr;
+ __perf_cpu_notifier(intel_cqm_cpu_notifier);
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
- if (ret) {
+ if (ret)
pr_err("Intel CQM perf registration failed: %d\n", ret);
- goto out;
- }
-
- if (cqm_enabled)
+ else
pr_info("Intel CQM monitoring enabled\n");
- if (mbm_enabled)
- pr_info("Intel MBM enabled\n");
- /*
- * Register the hot cpu notifier once we are sure cqm
- * is enabled to avoid notifier leak.
- */
- __perf_cpu_notifier(intel_cqm_cpu_notifier);
out:
cpu_notifier_register_done();
- if (ret) {
- kfree(str);
- cqm_cleanup();
- mbm_cleanup();
- }
return ret;
}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9f64044..00bb6b5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -121,7 +121,6 @@ struct hw_perf_event {
struct { /* intel_cqm */
int cqm_state;
u32 cqm_rmid;
- int is_group_event;
struct list_head cqm_events_entry;
struct list_head cqm_groups_entry;
struct list_head cqm_group_entry;
--
2.8.0.rc3.226.g39d4020
The new version of Intel's CQM uses an RMID hierarchy to avoid conflicts
between cpu, cgroup and task events, making it unnecessary to check and
resolve conflicts between events of different types (i.e. cgroup vs task).
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 148 --------------------------------------------
1 file changed, 148 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 1b064c4..a3fde49 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -304,92 +304,6 @@ static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
}
#endif
-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- * PROHIBITS
- * system-wide -> cgroup and task
- * cgroup -> system-wide
- * -> task in cgroup
- * task -> system-wide
- * -> task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
- /*
- * We can have any number of cgroups but only one system-wide
- * event at a time.
- */
- if (a->cgrp && b->cgrp) {
- struct perf_cgroup *ac = a->cgrp;
- struct perf_cgroup *bc = b->cgrp;
-
- /*
- * This condition should have been caught in
- * __match_event() and we should be sharing an RMID.
- */
- WARN_ON_ONCE(ac == bc);
-
- if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
- cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
- return true;
-
- return false;
- }
-
- if (a->cgrp || b->cgrp) {
- struct perf_cgroup *ac, *bc;
-
- /*
- * cgroup and system-wide events are mutually exclusive
- */
- if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
- (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
- return true;
-
- /*
- * Ensure neither event is part of the other's cgroup
- */
- ac = event_to_cgroup(a);
- bc = event_to_cgroup(b);
- if (ac == bc)
- return true;
-
- /*
- * Must have cgroup and non-intersecting task events.
- */
- if (!ac || !bc)
- return false;
-
- /*
- * We have cgroup and task events, and the task belongs
- * to a cgroup. Check for for overlap.
- */
- if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
- cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
- return true;
-
- return false;
- }
-#endif
- /*
- * If one of them is not a task, same story as above with cgroups.
- */
- if (!(a->attach_state & PERF_ATTACH_TASK) ||
- !(b->attach_state & PERF_ATTACH_TASK))
- return true;
-
- /*
- * Must be non-overlapping.
- */
- return false;
-}
-
struct rmid_read {
u32 rmid;
atomic64_t value;
@@ -465,10 +379,6 @@ static void intel_cqm_stable(void *arg)
}
}
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
static bool intel_cqm_sched_in_event(u32 rmid)
{
struct perf_event *leader, *event;
@@ -484,9 +394,6 @@ static bool intel_cqm_sched_in_event(u32 rmid)
if (__rmid_valid(event->hw.cqm_rmid))
continue;
- if (__conflict_event(event, leader))
- continue;
-
intel_cqm_xchg_rmid(event, rmid);
return true;
}
@@ -592,10 +499,6 @@ static bool intel_cqm_rmid_stabilize(unsigned int *available)
continue;
}
- /*
- * If we have groups waiting for RMIDs, hand
- * them one now provided they don't conflict.
- */
if (intel_cqm_sched_in_event(entry->rmid))
continue;
@@ -638,46 +541,8 @@ static void __intel_cqm_pick_and_rotate(struct perf_event *next)
}
/*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
- struct perf_event *group, *g;
- u32 rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
- if (group == event)
- continue;
-
- rmid = group->hw.cqm_rmid;
-
- /*
- * Skip events that don't have a valid RMID.
- */
- if (!__rmid_valid(rmid))
- continue;
-
- /*
- * No conflict? No problem! Leave the event alone.
- */
- if (!__conflict_event(group, event))
- continue;
-
- intel_cqm_xchg_rmid(group, INVALID_RMID);
- __put_rmid(rmid);
- }
-}
-
-/*
* Attempt to rotate the groups and assign new RMIDs.
*
- * We rotate for two reasons,
- * 1. To handle the scheduling of conflicting events
- * 2. To recycle RMIDs
- *
* Rotating RMIDs is complicated because the hardware doesn't give us
* any clues.
*
@@ -732,10 +597,6 @@ again:
goto stabilize;
/*
- * We have more event groups without RMIDs than available RMIDs,
- * or we have event groups that conflict with the ones currently
- * scheduled.
- *
* We force deallocate the rmid of the group at the head of
* cache_groups. The first event group without an RMID then gets
* assigned intel_cqm_rotation_rmid. This ensures we always make
@@ -754,8 +615,6 @@ again:
intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
intel_cqm_rotation_rmid = __get_rmid();
- intel_cqm_sched_out_conflicting_events(start);
-
if (__intel_cqm_threshold)
__intel_cqm_threshold--;
}
@@ -858,13 +717,6 @@ static void intel_cqm_setup_event(struct perf_event *event,
*group = iter;
return;
}
-
- /*
- * We only care about conflicts for events that are
- * actually scheduled in (and hence have a valid RMID).
- */
- if (__conflict_event(iter, event) && __rmid_valid(rmid))
- conflict = true;
}
if (conflict)
--
2.8.0.rc3.226.g39d4020
Remove all code for rotation of RMIDs, in preparation for future patches
that will introduce a per-package rotation of RMIDs.
The new rotation logic follows the same ideas as the rotation logic
removed here, but takes advantage of the per-package RMID design and
more detailed bookkeeping to guarantee that user SLOs are met.
It also avoids IPIs and does not keep an unused rotation RMID in some
cases (as the present version does).
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 371 --------------------------------------------
1 file changed, 371 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index a3fde49..3c1e247 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -312,42 +312,6 @@ struct rmid_read {
static void __intel_cqm_event_count(void *info);
/*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
- struct perf_event *event;
- struct list_head *head = &group->hw.cqm_group_entry;
- u32 old_rmid = group->hw.cqm_rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- /*
- * If our RMID is being deallocated, perform a read now.
- */
- if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
- struct rmid_read rr = {
- .value = ATOMIC64_INIT(0),
- .rmid = old_rmid,
- };
-
- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
- &rr, 1);
- local64_set(&group->count, atomic64_read(&rr.value));
- }
-
- raw_spin_lock_irq(&cache_lock);
-
- group->hw.cqm_rmid = rmid;
- list_for_each_entry(event, head, hw.cqm_group_entry)
- event->hw.cqm_rmid = rmid;
-
- raw_spin_unlock_irq(&cache_lock);
-
- return old_rmid;
-}
-
-/*
* If we fail to assign a new RMID for intel_cqm_rotation_rmid because
* cachelines are still tagged with RMIDs in limbo, we progressively
* increment the threshold until we find an RMID in limbo with <=
@@ -364,44 +328,6 @@ static unsigned int __intel_cqm_threshold;
static unsigned int __intel_cqm_max_threshold;
/*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
- struct cqm_rmid_entry *entry;
-
- list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
- if (entry->state != RMID_AVAILABLE)
- break;
-
- if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
- entry->state = RMID_DIRTY;
- }
-}
-
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
- struct perf_event *leader, *event;
-
- lockdep_assert_held(&cache_mutex);
-
- leader = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
- event = leader;
-
- list_for_each_entry_continue(event, &cache_groups,
- hw.cqm_groups_entry) {
- if (__rmid_valid(event->hw.cqm_rmid))
- continue;
-
- intel_cqm_xchg_rmid(event, rmid);
- return true;
- }
-
- return false;
-}
-
-/*
* Initially use this constant for both the limbo queue time and the
* rotation timer interval, pmu::hrtimer_interval_ms.
*
@@ -411,291 +337,8 @@ static bool intel_cqm_sched_in_event(u32 rmid)
*/
#define RMID_DEFAULT_QUEUE_TIME 250 /* ms */
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @nr_available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @nr_available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
- struct cqm_rmid_entry *entry, *tmp;
-
- lockdep_assert_held(&cache_mutex);
-
- *available = 0;
- list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
- unsigned long min_queue_time;
- unsigned long now = jiffies;
-
- /*
- * We hold RMIDs placed into limbo for a minimum queue
- * time. Before the minimum queue time has elapsed we do
- * not recycle RMIDs.
- *
- * The reasoning is that until a sufficient time has
- * passed since we stopped using an RMID, any RMID
- * placed onto the limbo list will likely still have
- * data tagged in the cache, which means we'll probably
- * fail to recycle it anyway.
- *
- * We can save ourselves an expensive IPI by skipping
- * any RMIDs that have not been queued for the minimum
- * time.
- */
- min_queue_time = entry->queue_time +
- msecs_to_jiffies(__rmid_queue_time_ms);
-
- if (time_after(min_queue_time, now))
- break;
-
- entry->state = RMID_AVAILABLE;
- (*available)++;
- }
-
- /*
- * Fast return if none of the RMIDs on the limbo list have been
- * sitting on the queue for the minimum queue time.
- */
- if (!*available)
- return false;
-
- /*
- * Test whether an RMID is free for each package.
- */
- on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
-
- list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
- /*
- * Exhausted all RMIDs that have waited min queue time.
- */
- if (entry->state == RMID_YOUNG)
- break;
-
- if (entry->state == RMID_DIRTY)
- continue;
-
- list_del(&entry->list); /* remove from limbo */
-
- /*
- * The rotation RMID gets priority if it's
- * currently invalid. In which case, skip adding
- * the RMID to the the free lru.
- */
- if (!__rmid_valid(intel_cqm_rotation_rmid)) {
- intel_cqm_rotation_rmid = entry->rmid;
- continue;
- }
-
- if (intel_cqm_sched_in_event(entry->rmid))
- continue;
-
- /*
- * Otherwise place it onto the free list.
- */
- list_add_tail(&entry->list, &cqm_rmid_free_lru);
- }
-
-
- return __rmid_valid(intel_cqm_rotation_rmid);
-}
-
-/*
- * Pick a victim group and move it to the tail of the group list.
- * @next: The first group without an RMID
- */
-static void __intel_cqm_pick_and_rotate(struct perf_event *next)
-{
- struct perf_event *rotor;
- u32 rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- rotor = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
-
- /*
- * The group at the front of the list should always have a valid
- * RMID. If it doesn't then no groups have RMIDs assigned and we
- * don't need to rotate the list.
- */
- if (next == rotor)
- return;
-
- rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
- __put_rmid(rmid);
-
- list_rotate_left(&cache_groups);
-}
-
-/*
- * Attempt to rotate the groups and assign new RMIDs.
- *
- * Rotating RMIDs is complicated because the hardware doesn't give us
- * any clues.
- *
- * There's problems with the hardware interface; when you change the
- * task:RMID map cachelines retain their 'old' tags, giving a skewed
- * picture. In order to work around this, we must always keep one free
- * RMID - intel_cqm_rotation_rmid.
- *
- * Rotation works by taking away an RMID from a group (the old RMID),
- * and assigning the free RMID to another group (the new RMID). We must
- * then wait for the old RMID to not be used (no cachelines tagged).
- * This ensure that all cachelines are tagged with 'active' RMIDs. At
- * this point we can start reading values for the new RMID and treat the
- * old RMID as the free RMID for the next rotation.
- *
- * Return %true or %false depending on whether we did any rotating.
- */
-static bool __intel_cqm_rmid_rotate(void)
-{
- struct perf_event *group, *start = NULL;
- unsigned int threshold_limit;
- unsigned int nr_needed = 0;
- unsigned int nr_available;
- bool rotated = false;
-
- mutex_lock(&cache_mutex);
-
-again:
- /*
- * Fast path through this function if there are no groups and no
- * RMIDs that need cleaning.
- */
- if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
- goto out;
-
- list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
- if (!__rmid_valid(group->hw.cqm_rmid)) {
- if (!start)
- start = group;
- nr_needed++;
- }
- }
-
- /*
- * We have some event groups, but they all have RMIDs assigned
- * and no RMIDs need cleaning.
- */
- if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
- goto out;
-
- if (!nr_needed)
- goto stabilize;
-
- /*
- * We force deallocate the rmid of the group at the head of
- * cache_groups. The first event group without an RMID then gets
- * assigned intel_cqm_rotation_rmid. This ensures we always make
- * forward progress.
- *
- * Rotate the cache_groups list so the previous head is now the
- * tail.
- */
- __intel_cqm_pick_and_rotate(start);
-
- /*
- * If the rotation is going to succeed, reduce the threshold so
- * that we don't needlessly reuse dirty RMIDs.
- */
- if (__rmid_valid(intel_cqm_rotation_rmid)) {
- intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
- intel_cqm_rotation_rmid = __get_rmid();
-
- if (__intel_cqm_threshold)
- __intel_cqm_threshold--;
- }
-
- rotated = true;
-
-stabilize:
- /*
- * We now need to stablize the RMID we freed above (if any) to
- * ensure that the next time we rotate we have an RMID with zero
- * occupancy value.
- *
- * Alternatively, if we didn't need to perform any rotation,
- * we'll have a bunch of RMIDs in limbo that need stabilizing.
- */
- threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
-
- while (intel_cqm_rmid_stabilize(&nr_available) &&
- __intel_cqm_threshold < threshold_limit) {
- unsigned int steal_limit;
-
- /*
- * Don't spin if nobody is actively waiting for an RMID,
- * the rotation worker will be kicked as soon as an
- * event needs an RMID anyway.
- */
- if (!nr_needed)
- break;
-
- /* Allow max 25% of RMIDs to be in limbo. */
- steal_limit = (cqm_max_rmid + 1) / 4;
-
- /*
- * We failed to stabilize any RMIDs so our rotation
- * logic is now stuck. In order to make forward progress
- * we have a few options:
- *
- * 1. rotate ("steal") another RMID
- * 2. increase the threshold
- * 3. do nothing
- *
- * We do both of 1. and 2. until we hit the steal limit.
- *
- * The steal limit prevents all RMIDs ending up on the
- * limbo list. This can happen if every RMID has a
- * non-zero occupancy above threshold_limit, and the
- * occupancy values aren't dropping fast enough.
- *
- * Note that there is prioritisation at work here - we'd
- * rather increase the number of RMIDs on the limbo list
- * than increase the threshold, because increasing the
- * threshold skews the event data (because we reuse
- * dirty RMIDs) - threshold bumps are a last resort.
- */
- if (nr_available < steal_limit)
- goto again;
-
- __intel_cqm_threshold++;
- }
-
-out:
- mutex_unlock(&cache_mutex);
- return rotated;
-}
-
-static void intel_cqm_rmid_rotate(struct work_struct *work);
-
-static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
-
static struct pmu intel_cqm_pmu;
-static void intel_cqm_rmid_rotate(struct work_struct *work)
-{
- unsigned long delay;
-
- __intel_cqm_rmid_rotate();
-
- delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
- schedule_delayed_work(&intel_cqm_rmid_work, delay);
-}
-
/*
* Find a group and setup RMID.
*
@@ -937,7 +580,6 @@ static void intel_cqm_event_destroy(struct perf_event *event)
static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
- bool rotate = false;
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -971,23 +613,10 @@ static int intel_cqm_event_init(struct perf_event *event)
} else {
list_add_tail(&event->hw.cqm_groups_entry,
&cache_groups);
-
- /*
- * All RMIDs are either in use or have recently been
- * used. Kick the rotation worker to clean/free some.
- *
- * We only do this for the group leader, rather than for
- * every event in a group to save on needless work.
- */
- if (!__rmid_valid(event->hw.cqm_rmid))
- rotate = true;
}
mutex_unlock(&cache_mutex);
- if (rotate)
- schedule_delayed_work(&intel_cqm_rmid_work, 0);
-
return 0;
}
--
2.8.0.rc3.226.g39d4020
First part of the new CQM logic. This patch introduces the struct pkg_data
that contains all per-package CQM data required by the new RMID hierarchy.
The raw RMID value is encapsulated in a Package RMID (prmid) structure
that provides atomic updates and caches recent reads. This caching
throttles the frequency at which (slow) hardware reads are performed and
ameliorates the impact of the worst case scenarios while traversing the
hierarchy of RMIDs (hierarchy and operations are introduced in future
patches within this series).
There is a set of prmids per physical package (socket) in the system. Each
package may have a different number of prmids (a different hw max_rmid_index).
Each package maintains its own pool of prmids (only a free pool as of this
patch; more pools are added by future patches in this series). Also, each
package has its own mutex and lock to protect the RMID pools and rotation
logic. This per-package separation reduces the contention on each lock
and mutex compared with the previous version's system-wide mutex
and lock.
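For reference, the struct prmid declared in cqm.h looks roughly like the
sketch below, reconstructed from its uses in this patch; the exact layout
and any members beyond the four used here are assumptions:

	struct prmid {
		atomic64_t		last_read_value; /* cached occupancy from MSR_IA32_QM_CTR */
		atomic64_t		last_read_time;	 /* jiffies of the last hardware read */
		struct list_head	pool_entry;	 /* link into a per-package prmid pool */
		u32			rmid;		 /* raw hardware RMID within this package */
	};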
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 426 +++++++++++++++++++++-----------------
arch/x86/events/intel/cqm.h | 154 ++++++++++++++
arch/x86/include/asm/pqr_common.h | 2 +
3 files changed, 392 insertions(+), 190 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index f678014..541e515 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -12,7 +12,6 @@
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
-static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
#define RMID_VAL_ERROR (1ULL << 63)
@@ -30,39 +29,13 @@ static struct perf_pmu_events_attr event_attr_##v = { \
}
/*
- * Updates caller cpu's cache.
- */
-static inline void __update_pqr_rmid(u32 rmid)
-{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- if (state->rmid == rmid)
- return;
- state->rmid = rmid;
- wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
-}
-
-/*
* Groups of events that have the same target(s), one RMID per group.
* Protected by cqm_mutex.
*/
static LIST_HEAD(cache_groups);
static DEFINE_MUTEX(cqm_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
-/*
- * Mask of CPUs for reading CQM values. We only need one per-socket.
- */
-static cpumask_t cqm_cpumask;
-
-
-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
+struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
/*
* Is @rmid valid for programming the hardware?
@@ -82,162 +55,220 @@ static inline bool __rmid_valid(u32 rmid)
static u64 __rmid_read(u32 rmid)
{
+ /* XXX: Placeholder, will be removed in next patch. */
+ return 0;
+}
+
+/*
+ * Update if enough time has passed since last read.
+ *
+ * Must be called in a cpu in the package where prmid belongs.
+ * This function is safe to be called concurrently since it is guaranteed
+ * that entry->last_read_value is updated to a occupancy value obtained
+ * after the time set in entry->last_read_time .
+ * Return 1 if value was updated, 0 if not, negative number if error.
+ */
+static inline int __cqm_prmid_update(struct prmid *prmid,
+ unsigned long jiffies_min_delta)
+{
+ unsigned long now = jiffies;
+ unsigned long last_read_time;
u64 val;
/*
- * Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
- * it just says that to increase confusion.
+ * Shortcut the calculation of elapsed time for the
+ * case jiffies_min_delta == 0
*/
- wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
+ if (jiffies_min_delta > 0) {
+ last_read_time = atomic64_read(&prmid->last_read_time);
+ if (time_after(last_read_time + jiffies_min_delta, now))
+ return 0;
+ }
+
+ wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, prmid->rmid);
rdmsrl(MSR_IA32_QM_CTR, val);
/*
- * Aside from the ERROR and UNAVAIL bits, assume this thing returns
- * the number of cachelines tagged with @rmid.
+ * Ignore this reading on error states and do not update the value.
*/
- return val;
-}
+ WARN_ON_ONCE(val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL));
+ if (val & RMID_VAL_ERROR)
+ return -EINVAL;
+ if (val & RMID_VAL_UNAVAIL)
+ return -ENODATA;
-enum rmid_recycle_state {
- RMID_YOUNG = 0,
- RMID_AVAILABLE,
- RMID_DIRTY,
-};
+ atomic64_set(&prmid->last_read_value, val);
+ /*
+ * Protect last_read_time from being updated before last_read_value is.
+ * So reader always receive an updated value even if sometimes values
+ * are updated twice.
+ */
+ smp_wmb();
-struct cqm_rmid_entry {
- u32 rmid;
- enum rmid_recycle_state state;
- struct list_head list;
- unsigned long queue_time;
-};
+ atomic64_set(&prmid->last_read_time, now);
-/*
- * cqm_rmid_free_lru - A least recently used list of RMIDs.
- *
- * Oldest entry at the head, newest (most recently used) entry at the
- * tail. This list is never traversed, it's only used to keep track of
- * the lru order. That is, we only pick entries of the head or insert
- * them on the tail.
- *
- * All entries on the list are 'free', and their RMIDs are not currently
- * in use. To mark an RMID as in use, remove its entry from the lru
- * list.
- *
- *
- * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
- *
- * This list is contains RMIDs that no one is currently using but that
- * may have a non-zero occupancy value associated with them. The
- * rotation worker moves RMIDs from the limbo list to the free list once
- * the occupancy value drops below __intel_cqm_threshold.
- *
- * Both lists are protected by cqm_mutex.
- */
-static LIST_HEAD(cqm_rmid_free_lru);
-static LIST_HEAD(cqm_rmid_limbo_lru);
+ return 1;
+}
+
+static inline int cqm_prmid_update(struct prmid *prmid)
+{
+ return __cqm_prmid_update(prmid, __rmid_min_update_time);
+}
/*
- * We use a simple array of pointers so that we can lookup a struct
- * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
- * and __put_rmid() from having to worry about dealing with struct
- * cqm_rmid_entry - they just deal with rmids, i.e. integers.
- *
- * Once this array is initialized it is read-only. No locks are required
- * to access it.
- *
- * All entries for all RMIDs can be looked up in the this array at all
- * times.
+ * Updates caller cpu's cache.
*/
-static struct cqm_rmid_entry **cqm_rmid_ptrs;
-
-static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
+static inline void __update_pqr_prmid(struct prmid *prmid)
{
- struct cqm_rmid_entry *entry;
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- entry = cqm_rmid_ptrs[rmid];
- WARN_ON(entry->rmid != rmid);
+ if (state->rmid == prmid->rmid)
+ return;
+ state->rmid = prmid->rmid;
+ wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
+}
- return entry;
+static inline bool __valid_pkg_id(u16 pkg_id)
+{
+ return pkg_id < PQR_MAX_NR_PKGS;
}
/*
* Returns < 0 on fail.
*
- * We expect to be called with cqm_mutex held.
+ * We expect to be called with cache_mutex held.
*/
static u32 __get_rmid(void)
{
- struct cqm_rmid_entry *entry;
+ /* XXX: Placeholder, will be removed in next patch. */
+ return 0;
+}
+
+static void __put_rmid(u32 rmid)
+{
+ /* XXX: Placeholder, will be removed in next patch. */
+}
+
+/* Init cqm pkg_data for @cpu 's package. */
+static int pkg_data_init_cpu(int cpu)
+{
+ struct pkg_data *pkg_data;
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ u16 pkg_id = topology_physical_package_id(cpu);
+
+ if (cqm_pkgs_data[pkg_id])
+ return 0;
- lockdep_assert_held(&cqm_mutex);
- if (list_empty(&cqm_rmid_free_lru))
- return INVALID_RMID;
+ pkg_data = kmalloc_node(sizeof(struct pkg_data),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!pkg_data)
+ return -ENOMEM;
+
+ pkg_data->max_rmid = c->x86_cache_max_rmid;
- entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
- list_del(&entry->list);
+ /* Does hardware has more rmids than this driver can handle? */
+ if (WARN_ON(pkg_data->max_rmid >= INVALID_RMID))
+ pkg_data->max_rmid = INVALID_RMID - 1;
- return entry->rmid;
+ if (c->x86_cache_occ_scale != cqm_l3_scale) {
+ pr_err("Multiple LLC scale values, disabling\n");
+ kfree(pkg_data);
+ return -EINVAL;
+ }
+
+ pkg_data->prmids_by_rmid = kmalloc_node(
+ sizeof(struct prmid *) * (1 + pkg_data->max_rmid),
+ GFP_KERNEL, cpu_to_node(cpu));
+
+ if (!pkg_data) {
+ kfree(pkg_data);
+ return -ENOMEM;
+ }
+
+ INIT_LIST_HEAD(&pkg_data->free_prmids_pool);
+
+ mutex_init(&pkg_data->pkg_data_mutex);
+ raw_spin_lock_init(&pkg_data->pkg_data_lock);
+
+ /* XXX: Chose randomly*/
+ pkg_data->rotation_cpu = cpu;
+
+ cqm_pkgs_data[pkg_id] = pkg_data;
+ return 0;
}
-static void __put_rmid(u32 rmid)
+static inline bool __valid_rmid(u16 pkg_id, u32 rmid)
{
- struct cqm_rmid_entry *entry;
+ return rmid <= cqm_pkgs_data[pkg_id]->max_rmid;
+}
- lockdep_assert_held(&cqm_mutex);
+static inline bool __valid_prmid(u16 pkg_id, struct prmid *prmid)
+{
+ struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+ bool valid = __valid_rmid(pkg_id, prmid->rmid);
- WARN_ON(!__rmid_valid(rmid));
- entry = __rmid_entry(rmid);
+ WARN_ON_ONCE(valid && pkg_data->prmids_by_rmid[
+ prmid->rmid]->rmid != prmid->rmid);
+ return valid;
+}
- entry->queue_time = jiffies;
- entry->state = RMID_YOUNG;
+static inline struct prmid *
+__prmid_from_rmid(u16 pkg_id, u32 rmid)
+{
+ struct prmid *prmid;
- list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
+ if (!__valid_rmid(pkg_id, rmid))
+ return NULL;
+ prmid = cqm_pkgs_data[pkg_id]->prmids_by_rmid[rmid];
+ WARN_ON_ONCE(!__valid_prmid(pkg_id, prmid));
+ return prmid;
}
-static int intel_cqm_setup_rmid_cache(void)
+static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
{
- struct cqm_rmid_entry *entry;
- unsigned int nr_rmids;
- int r = 0;
-
- nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
- nr_rmids, GFP_KERNEL);
- if (!cqm_rmid_ptrs)
- return -ENOMEM;
+ int r;
+ unsigned long flags;
+ struct prmid *prmid;
+ struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+
+ if (!__valid_pkg_id(pkg_id))
+ return -EINVAL;
- for (; r <= cqm_max_rmid; r++) {
- struct cqm_rmid_entry *entry;
+ for (r = 0; r <= pkg_data->max_rmid; r++) {
- entry = kmalloc(sizeof(*entry), GFP_KERNEL);
- if (!entry)
+ prmid = kmalloc_node(sizeof(struct prmid), GFP_KERNEL,
+ cpu_to_node(pkg_data->rotation_cpu));
+ if (!prmid)
goto fail;
- INIT_LIST_HEAD(&entry->list);
- entry->rmid = r;
- cqm_rmid_ptrs[r] = entry;
+ atomic64_set(&prmid->last_read_value, 0L);
+ atomic64_set(&prmid->last_read_time, 0L);
+ INIT_LIST_HEAD(&prmid->pool_entry);
+ prmid->rmid = r;
- list_add_tail(&entry->list, &cqm_rmid_free_lru);
- }
+ /* Lock needed if called during CPU hotplug. */
+ raw_spin_lock_irqsave_nested(
+ &pkg_data->pkg_data_lock, flags, pkg_id);
+ pkg_data->prmids_by_rmid[r] = prmid;
- /*
- * RMID 0 is special and is always allocated. It's used for all
- * tasks that are not monitored.
- */
- entry = __rmid_entry(0);
- list_del(&entry->list);
- mutex_lock(&cqm_mutex);
- intel_cqm_rotation_rmid = __get_rmid();
- mutex_unlock(&cqm_mutex);
+ /* RMID 0 is special and forms the root of the rmid hierarchy. */
+ if (r != 0)
+ list_add_tail(&prmid->pool_entry,
+ &pkg_data->free_prmids_pool);
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ }
return 0;
fail:
- while (r--)
- kfree(cqm_rmid_ptrs[r]);
-
- kfree(cqm_rmid_ptrs);
+ while (!list_empty(&pkg_data->free_prmids_pool)) {
+ prmid = list_first_entry(&pkg_data->free_prmids_pool,
+ struct prmid, pool_entry);
+ list_del(&prmid->pool_entry);
+ /* prmids_by_rmid[rmid] is the same object as prmid; free it only once. */
+ pkg_data->prmids_by_rmid[prmid->rmid] = NULL;
+ kfree(prmid);
+ }
return -ENOMEM;
}
@@ -322,8 +353,9 @@ static void intel_cqm_event_read(struct perf_event *event)
unsigned long flags;
u32 rmid;
u64 val;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
- raw_spin_lock_irqsave(&cache_lock, flags);
+ raw_spin_lock_irqsave(&cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
rmid = event->hw.cqm_rmid;
if (!__rmid_valid(rmid))
@@ -339,7 +371,8 @@ static void intel_cqm_event_read(struct perf_event *event)
local64_set(&event->count, val);
out:
- raw_spin_unlock_irqrestore(&cache_lock, flags);
+ raw_spin_unlock_irqrestore(
+ &cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
}
static inline bool cqm_group_leader(struct perf_event *event)
@@ -349,29 +382,32 @@ static inline bool cqm_group_leader(struct perf_event *event)
static void intel_cqm_event_start(struct perf_event *event, int mode)
{
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
if (!(event->hw.state & PERF_HES_STOPPED))
return;
event->hw.state &= ~PERF_HES_STOPPED;
- __update_pqr_rmid(event->hw.cqm_rmid);
+ __update_pqr_prmid(__prmid_from_rmid(pkg_id, event->hw.cqm_rmid));
}
static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
if (event->hw.state & PERF_HES_STOPPED)
return;
event->hw.state |= PERF_HES_STOPPED;
intel_cqm_event_read(event);
- __update_pqr_rmid(0);
+ __update_pqr_prmid(__prmid_from_rmid(pkg_id, 0));
}
static int intel_cqm_event_add(struct perf_event *event, int mode)
{
unsigned long flags;
u32 rmid;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
- raw_spin_lock_irqsave(&cache_lock, flags);
+ raw_spin_lock_irqsave(&cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
event->hw.state = PERF_HES_STOPPED;
rmid = event->hw.cqm_rmid;
@@ -379,7 +415,8 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
if (__rmid_valid(rmid) && (mode & PERF_EF_START))
intel_cqm_event_start(event, mode);
- raw_spin_unlock_irqrestore(&cache_lock, flags);
+ raw_spin_unlock_irqrestore(
+ &cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
return 0;
}
@@ -503,9 +540,10 @@ max_recycle_threshold_show(
{
ssize_t rv;
- mutex_lock(&cqm_mutex);
- rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold);
- mutex_unlock(&cqm_mutex);
+ monr_hrchy_acquire_mutexes();
+ rv = snprintf(page, PAGE_SIZE - 1, "%u\n",
+ __intel_cqm_max_threshold);
+ monr_hrchy_release_mutexes();
return rv;
}
@@ -522,9 +560,12 @@ max_recycle_threshold_store(struct device *dev,
if (ret)
return ret;
- mutex_lock(&cqm_mutex);
+ /* Mutex waits for rotation logic in all packages to complete. */
+ monr_hrchy_acquire_mutexes();
+
__intel_cqm_max_threshold = bytes;
- mutex_unlock(&cqm_mutex);
+
+ monr_hrchy_release_mutexes();
return count;
}
@@ -561,49 +602,42 @@ static struct pmu intel_cqm_pmu = {
static inline void cqm_pick_event_reader(int cpu)
{
- int phys_id = topology_physical_package_id(cpu);
- int i;
-
- for_each_cpu(i, &cqm_cpumask) {
- if (phys_id == topology_physical_package_id(i))
- return; /* already got reader for this socket */
- }
-
- cpumask_set_cpu(cpu, &cqm_cpumask);
+ u16 pkg_id = topology_physical_package_id(cpu);
+ /* XXX: lock, check if rotation cpu is online, maybe */
+ /*
+ * Pick a reader if there isn't one already.
+ */
+ if (cqm_pkgs_data[pkg_id]->rotation_cpu == -1)
+ cqm_pkgs_data[pkg_id]->rotation_cpu = cpu;
}
static void intel_cqm_cpu_starting(unsigned int cpu)
{
struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
struct cpuinfo_x86 *c = &cpu_data(cpu);
+ u16 pkg_id = topology_physical_package_id(cpu);
state->rmid = 0;
state->closid = 0;
- WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
+ /* XXX: lock */
+ /* XXX: Make sure this case is handled when hotplug happens. */
+ WARN_ON(c->x86_cache_max_rmid != cqm_pkgs_data[pkg_id]->max_rmid);
WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
}
static void intel_cqm_cpu_exit(unsigned int cpu)
{
- int phys_id = topology_physical_package_id(cpu);
- int i;
-
/*
* Is @cpu a designated cqm reader?
*/
- if (!cpumask_test_and_clear_cpu(cpu, &cqm_cpumask))
- return;
-
- for_each_online_cpu(i) {
- if (i == cpu)
- continue;
+ u16 pkg_id = topology_physical_package_id(cpu);
- if (phys_id == topology_physical_package_id(i)) {
- cpumask_set_cpu(i, &cqm_cpumask);
- break;
- }
- }
+ if (cqm_pkgs_data[pkg_id]->rotation_cpu != cpu)
+ return;
+ /* XXX: do remove unused packages */
+ cqm_pkgs_data[pkg_id]->rotation_cpu = cpumask_any_but(
+ topology_core_cpumask(cpu), cpu);
}
static int intel_cqm_cpu_notifier(struct notifier_block *nb,
@@ -616,6 +650,7 @@ static int intel_cqm_cpu_notifier(struct notifier_block *nb,
intel_cqm_cpu_exit(cpu);
break;
case CPU_STARTING:
+ pkg_data_init_cpu(cpu);
intel_cqm_cpu_starting(cpu);
cqm_pick_event_reader(cpu);
break;
@@ -632,12 +667,17 @@ static const struct x86_cpu_id intel_cqm_match[] = {
static int __init intel_cqm_init(void)
{
char *str, scale[20];
- int i, cpu, ret;
+ int i, cpu, ret = 0, min_max_rmid = 0;
if (!x86_match_cpu(intel_cqm_match))
return -ENODEV;
cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
+ if (WARN_ON(cqm_l3_scale == 0))
+ cqm_l3_scale = 1;
+
+ for (i = 0; i < PQR_MAX_NR_PKGS; i++)
+ cqm_pkgs_data[i] = NULL;
/*
* It's possible that not all resources support the same number
@@ -650,17 +690,20 @@ static int __init intel_cqm_init(void)
*/
cpu_notifier_register_begin();
+ /* XXX: assert all cpus in pkg have same nr rmids (they should). */
for_each_online_cpu(cpu) {
- struct cpuinfo_x86 *c = &cpu_data(cpu);
-
- if (c->x86_cache_max_rmid < cqm_max_rmid)
- cqm_max_rmid = c->x86_cache_max_rmid;
+ ret = pkg_data_init_cpu(cpu);
+ if (ret)
+ goto error;
+ }
- if (c->x86_cache_occ_scale != cqm_l3_scale) {
- pr_err("Multiple LLC scale values, disabling\n");
- ret = -EINVAL;
- goto out;
- }
+ /* Select the minimum of the maximum rmids to use as limit for
+ * threshold. XXX: per-package threshold.
+ */
+ cqm_pkg_id_for_each_online(i) {
+ if (!min_max_rmid || min_max_rmid > cqm_pkgs_data[i]->max_rmid)
+ min_max_rmid = cqm_pkgs_data[i]->max_rmid;
+ intel_cqm_setup_pkg_prmid_pools(i);
}
/*
@@ -671,21 +714,17 @@ static int __init intel_cqm_init(void)
* For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
*/
__intel_cqm_max_threshold =
- boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);
+ boot_cpu_data.x86_cache_size * 1024 / (min_max_rmid + 1);
snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
str = kstrdup(scale, GFP_KERNEL);
if (!str) {
ret = -ENOMEM;
- goto out;
+ goto error;
}
event_attr_intel_cqm_llc_scale.event_str = str;
- ret = intel_cqm_setup_rmid_cache();
- if (ret)
- goto out;
-
for_each_online_cpu(i) {
intel_cqm_cpu_starting(i);
cqm_pick_event_reader(i);
@@ -695,13 +734,20 @@ static int __init intel_cqm_init(void)
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
if (ret)
- pr_err("Intel CQM perf registration failed: %d\n", ret);
- else
- pr_info("Intel CQM monitoring enabled\n");
+ goto error;
-out:
+ cpu_notifier_register_done();
+
+ pr_info("Intel CQM monitoring enabled with at least %u rmids per package.\n",
+ min_max_rmid + 1);
+
+ return ret;
+
+error:
+ pr_err("Intel CQM perf registration failed: %d\n", ret);
cpu_notifier_register_done();
return ret;
}
+
device_initcall(intel_cqm_init);
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index e25d0a1..a25d49b 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -19,14 +19,168 @@
#include <asm/pqr_common.h>
/*
+ * struct prmid: Package RMID. Per-package wrapper for an rmid.
+ * @last_read_value: Last value read.
+ * @last_read_time: Time of the last read, used to throttle the read rate.
+ * @pool_entry: Attaches to a prmid pool in cqm_pkg_data.
+ * @rmid: The rmid value to be programmed into hardware.
+ *
+ * Its accessors ensure that CQM events for this rmid are read atomically and
+ * allow the read frequency to be throttled to at most one read every
+ * __rmid_min_update_time ms.
+ */
+struct prmid {
+ atomic64_t last_read_value;
+ atomic64_t last_read_time;
+ struct list_head pool_entry;
+ u32 rmid;
+};
+
+/*
* Minimum time elapsed between reads of occupancy value for an RMID when
* traversing the monr hierarchy.
*/
#define RMID_DEFAULT_MIN_UPDATE_TIME 20 /* ms */
+static unsigned int __rmid_min_update_time = RMID_DEFAULT_MIN_UPDATE_TIME;
+
+static inline int cqm_prmid_update(struct prmid *prmid);
# define INVALID_RMID (-1)
/*
+ * struct pkg_data: Per-package CQM data.
+ * @max_rmid: Max rmid valid for cpus in this package.
+ * @prmids_by_rmid: Utility mapping between rmid values and prmids.
+ * XXX: Make it an array of prmids.
+ * @free_prmids_pool: Free prmids.
+ * @pkg_data_mutex: Hold for stability when modifying the pmonr
+ * hierarchy.
+ * @pkg_data_lock: Hold to protect variables that may be accessed
+ * during process scheduling. The locks for all
+ * packages must be held when modifying the monr
+ * hierarchy.
+ * @rotation_cpu: CPU to run @rotation_work on, it must be in the
+ * package associated to this instance of pkg_data.
+ */
+struct pkg_data {
+ u32 max_rmid;
+ /* Quick map from rmids to prmids. */
+ struct prmid **prmids_by_rmid;
+
+ /*
+ * Pools of prmids used in rotation logic.
+ */
+ struct list_head free_prmids_pool;
+
+ struct mutex pkg_data_mutex;
+ raw_spinlock_t pkg_data_lock;
+
+ int rotation_cpu;
+};
+
+extern struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
+
+static inline u16 __cqm_pkgs_data_next_online(u16 pkg_id)
+{
+ while (++pkg_id < PQR_MAX_NR_PKGS && !cqm_pkgs_data[pkg_id])
+ ;
+ return pkg_id;
+}
+
+static inline u16 __cqm_pkgs_data_first_online(void)
+{
+ if (cqm_pkgs_data[0])
+ return 0;
+ return __cqm_pkgs_data_next_online(0);
+}
+
+/* Iterate for each online pkgs data */
+#define cqm_pkg_id_for_each_online(pkg_id__) \
+ for (pkg_id__ = __cqm_pkgs_data_first_online(); \
+ pkg_id__ < PQR_MAX_NR_PKGS; \
+ pkg_id__ = __cqm_pkgs_data_next_online(pkg_id__))
+
+#define __pkg_data(pmonr, member) cqm_pkgs_data[pmonr->pkg_id]->member
+
+/*
+ * Utility functions and macros to manage per-package locks.
+ * Use macros to keep the irq flags on the caller's stack.
+ * Hold the locks in all packages when altering the monr hierarchy.
+ */
+static inline void monr_hrchy_acquire_mutexes(void)
+{
+ int i;
+
+ cqm_pkg_id_for_each_online(i)
+ mutex_lock_nested(&cqm_pkgs_data[i]->pkg_data_mutex, i);
+}
+
+# define monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i) \
+ do { \
+ raw_local_irq_save(flags); \
+ cqm_pkg_id_for_each_online(i) {\
+ raw_spin_lock_nested( \
+ &cqm_pkgs_data[i]->pkg_data_lock, i); \
+ } \
+ } while (0)
+
+#define monr_hrchy_acquire_locks(flags, i) \
+ do {\
+ monr_hrchy_acquire_mutexes(); \
+ monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i); \
+ } while (0)
+
+static inline void monr_hrchy_release_mutexes(void)
+{
+ int i;
+
+ cqm_pkg_id_for_each_online(i)
+ mutex_unlock(&cqm_pkgs_data[i]->pkg_data_mutex);
+}
+
+# define monr_hrchy_release_raw_spin_locks_irq_restore(flags, i) \
+ do { \
+ cqm_pkg_id_for_each_online(i) {\
+ raw_spin_unlock(&cqm_pkgs_data[i]->pkg_data_lock); \
+ } \
+ raw_local_irq_restore(flags); \
+ } while (0)
+
+#define monr_hrchy_release_locks(flags, i) \
+ do {\
+ monr_hrchy_release_raw_spin_locks_irq_restore(flags, i); \
+ monr_hrchy_release_mutexes(); \
+ } while (0)
+
+static inline void monr_hrchy_assert_held_mutexes(void)
+{
+ int i;
+
+ cqm_pkg_id_for_each_online(i)
+ lockdep_assert_held(&cqm_pkgs_data[i]->pkg_data_mutex);
+}
+
+static inline void monr_hrchy_assert_held_raw_spin_locks(void)
+{
+ int i;
+
+ cqm_pkg_id_for_each_online(i)
+ lockdep_assert_held(&cqm_pkgs_data[i]->pkg_data_lock);
+}
+#ifdef CONFIG_LOCKDEP
+static inline int monr_hrchy_count_held_raw_spin_locks(void)
+{
+ int i, nr_held = 0;
+
+ cqm_pkg_id_for_each_online(i) {
+ if (lockdep_is_held(&cqm_pkgs_data[i]->pkg_data_lock))
+ nr_held++;
+ }
+ return nr_held;
+}
+#endif
+
+/*
* Time between execution of rotation logic. The frequency of execution does
* not affect the rate at which RMIDs are recycled, except for the delay in
* updating the prmids and their pools.
diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
index 0c2001b..f770637 100644
--- a/arch/x86/include/asm/pqr_common.h
+++ b/arch/x86/include/asm/pqr_common.h
@@ -27,5 +27,7 @@ struct intel_pqr_state {
DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+#define PQR_MAX_NR_PKGS 8
+
#endif
#endif
--
2.8.0.rc3.226.g39d4020
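A note on the locking helpers added to cqm.h above: the mutexes are always
taken before the raw spinlocks, and both are taken for every online package
in pkg_id order so the *_nested annotations keep lockdep quiet. Below is a
minimal sketch, assuming the helpers and cqm_pkgs_data declared above, of
the intended caller pattern when editing the monr hierarchy; the function
name is made up for illustration.

/* Illustration only, not part of the series. */
static void example_monr_hrchy_update(void)
{
	unsigned long flags;
	int i;

	/* Every online package: mutexes first, then raw spinlocks. */
	monr_hrchy_acquire_locks(flags, i);

	/* ... modify the monr/pmonr hierarchy here ... */

	monr_hrchy_release_locks(flags, i);
}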
Add Intel's PQR as its own build target, remove its build dependency
on CQM, and add CONFIG_INTEL_RDT as a configuration flag to build PQR
and all of its related drivers (currently CQM, future: MBM, CAT, CDP).
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/Kconfig | 6 ++++++
arch/x86/events/intel/Makefile | 3 ++-
arch/x86/events/intel/cqm.c | 27 +--------------------------
arch/x86/include/asm/pqr_common.h | 31 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/Makefile | 4 ++++
arch/x86/kernel/cpu/pqr_common.c | 9 +++++++++
include/linux/perf_event.h | 2 ++
7 files changed, 55 insertions(+), 27 deletions(-)
create mode 100644 arch/x86/include/asm/pqr_common.h
create mode 100644 arch/x86/kernel/cpu/pqr_common.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a494fa3..7b81e6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -160,6 +160,12 @@ config X86
select ARCH_USES_HIGH_VMA_FLAGS if X86_INTEL_MEMORY_PROTECTION_KEYS
select ARCH_HAS_PKEYS if X86_INTEL_MEMORY_PROTECTION_KEYS
+config INTEL_RDT
+ def_bool y
+ depends on PERF_EVENTS && CPU_SUP_INTEL
+ ---help---
+ Enable Resource Director Technology for Intel Xeon Microprocessors.
+
config INSTRUCTION_DECODER
def_bool y
depends on KPROBES || PERF_EVENTS || UPROBES
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 3660b2c..7e610bf 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,4 @@
-obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o
obj-$(CONFIG_CPU_SUP_INTEL) += ds.o knc.o
obj-$(CONFIG_CPU_SUP_INTEL) += lbr.o p4.o p6.o pt.o
obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL) += intel-rapl.o
@@ -7,3 +7,4 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += intel-uncore.o
intel-uncore-objs := uncore.o uncore_nhmex.o uncore_snb.o uncore_snbep.o
obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
intel-cstate-objs := cstate.o
+obj-$(CONFIG_INTEL_RDT) += cqm.o
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index afd60dd..8457dd0 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -7,40 +7,15 @@
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
+#include <asm/pqr_common.h>
#include "../perf_event.h"
-#define MSR_IA32_PQR_ASSOC 0x0c8f
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid: The cached Resource Monitoring ID
- * @closid: The cached Class Of Service ID
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
- u32 rmid;
- u32 closid;
-};
-
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-
/*
* Updates caller cpu's cache.
*/
diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
new file mode 100644
index 0000000..0c2001b
--- /dev/null
+++ b/arch/x86/include/asm/pqr_common.h
@@ -0,0 +1,31 @@
+#ifndef _X86_PQR_COMMON_H_
+#define _X86_PQR_COMMON_H_
+
+#if defined(CONFIG_INTEL_RDT)
+
+#include <linux/types.h>
+#include <asm/percpu.h>
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid: The cached Resource Monitoring ID
+ * @closid: The cached Class Of Service ID
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+ u32 rmid;
+ u32 closid;
+};
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f..87e6279 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,6 +34,10 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o
+ifdef CONFIG_CPU_SUP_INTEL
+obj-$(CONFIG_INTEL_RDT) += pqr_common.o
+endif
+
obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
obj-$(CONFIG_MICROCODE) += microcode/
diff --git a/arch/x86/kernel/cpu/pqr_common.c b/arch/x86/kernel/cpu/pqr_common.c
new file mode 100644
index 0000000..9eff5d9
--- /dev/null
+++ b/arch/x86/kernel/cpu/pqr_common.c
@@ -0,0 +1,9 @@
+#include <asm/pqr_common.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Both functions which modify the state
+ * (intel_cqm_event_start and intel_cqm_event_stop) are called with
+ * interrupts disabled, which is sufficient for the protection.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8bb1532..3a847bf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -118,6 +118,7 @@ struct hw_perf_event {
/* for tp_event->class */
struct list_head tp_list;
};
+#ifdef CONFIG_INTEL_RDT
struct { /* intel_cqm */
int cqm_state;
u32 cqm_rmid;
@@ -125,6 +126,7 @@ struct hw_perf_event {
struct list_head cqm_groups_entry;
struct list_head cqm_group_entry;
};
+#endif
struct { /* itrace */
int itrace_started;
};
--
2.8.0.rc3.226.g39d4020
Pre-allocate enough anodes to hold at least one full set of the package's
RMIDs before running out of anodes.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 10 ++++++++++
1 file changed, 10 insertions(+)
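The pre-allocation sizes the pool as ceil((max_rmid + 1) / NR_RMIDS_PER_NODE)
plus one spare anode. A standalone model of that arithmetic follows; the
NR_RMIDS_PER_NODE value is made up for illustration, the real constant lives
in the driver.

#include <stdio.h>

#define NR_RMIDS_PER_NODE 32	/* illustrative value only */

/* Same rounding as the patch: max_rmid + 1 RMIDs, round up, one spare. */
static int nr_anodes_for(unsigned int max_rmid)
{
	return (max_rmid + NR_RMIDS_PER_NODE) / NR_RMIDS_PER_NODE + 1;
}

int main(void)
{
	printf("max_rmid=55  -> %d anodes\n", nr_anodes_for(55));
	printf("max_rmid=127 -> %d anodes\n", nr_anodes_for(127));
	return 0;
}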
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 904f2d3..98a919f 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -194,6 +194,7 @@ static int anode_pool__alloc_one(u16 pkg_id);
/* Init cqm pkg_data for @cpu 's package. */
static int pkg_data_init_cpu(int cpu)
{
+ int i, nr_anodes;
struct pkg_data *pkg_data;
struct cpuinfo_x86 *c = &cpu_data(cpu);
u16 pkg_id = topology_physical_package_id(cpu);
@@ -257,6 +258,15 @@ static int pkg_data_init_cpu(int cpu)
pkg_data->timed_update_cpu = cpu;
cqm_pkgs_data[pkg_id] = pkg_data;
+
+ /* Pre-allocate the pool with one anode more than the minimum needed to
+ * contain all the RMIDs in the package.
+ */
+ nr_anodes = (pkg_data->max_rmid + NR_RMIDS_PER_NODE) /
+ NR_RMIDS_PER_NODE + 1;
+
+ for (i = 0; i < nr_anodes; i++)
+ anode_pool__alloc_one(pkg_id);
return 0;
}
--
2.8.0.rc3.226.g39d4020
Read llc_occupancy for cgroup events by adding the occupancy of all
pmonrs that have a read_rmid along the monr's subtree in the pmonr
hierarchy for the event's package.
The RMID to read for a monr is the same as its RMID to schedule in hw if
the monr is in (A)state. If in (IL)state, the RMID to read is that of its
limbo_prmid. This reduces the error introduced by (IL)states since the
llc_occupancy of limbo_prmid is a lower bound of its real llc_occupancy.
monrs in (U)state can be safely ignored since they do not have any
occupancy.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 218 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 211 insertions(+), 7 deletions(-)
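The llc_occupancy reported for a cgroup is simply the sum, over the monr
subtree on the event's package, of each pmonr's readable occupancy: (U)state
pmonrs contribute nothing and (IL)state pmonrs contribute their limbo_prmid's
value. The toy program below models only that summation, using a simplified
node type instead of the kernel's struct monr/pmonr.

#include <stdio.h>

/* Toy node: occupancy in bytes, or -1 for "no readable RMID" (counts as 0). */
struct node {
	long long occupancy;
	struct node *child, *sibling;
};

static long long subtree_occupancy(const struct node *n)
{
	long long total = n->occupancy > 0 ? n->occupancy : 0;
	const struct node *c;

	for (c = n->child; c; c = c->sibling)
		total += subtree_occupancy(c);
	return total;
}

int main(void)
{
	struct node grandchild = { 4096, NULL, NULL };
	struct node child = { -1, &grandchild, NULL };	/* e.g. (U)state */
	struct node root = { 8192, &child, NULL };

	printf("llc_occupancy = %lld bytes\n", subtree_occupancy(&root));
	return 0;
}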
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 6e85021..c14f1c7 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2305,18 +2305,222 @@ intel_cqm_setup_event(struct perf_event *event, struct perf_event **group)
return monr_hrchy_attach_event(event);
}
+static struct monr *
+monr_next_child(struct monr *pos, struct monr *parent)
+{
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(!monr_hrchy_count_held_raw_spin_locks());
+#endif
+ if (!pos)
+ return list_first_entry_or_null(
+ &parent->children, struct monr, parent_entry);
+ if (list_is_last(&pos->parent_entry, &parent->children))
+ return NULL;
+ return list_next_entry(pos, parent_entry);
+}
+
+static struct monr *
+monr_next_descendant_pre(struct monr *pos, struct monr *root)
+{
+ struct monr *next;
+
+#ifdef CONFIG_LOCKDEP
+ WARN_ON(!monr_hrchy_count_held_raw_spin_locks());
+#endif
+ if (!pos)
+ return root;
+ next = monr_next_child(NULL, pos);
+ if (next)
+ return next;
+ while (pos != root) {
+ next = monr_next_child(pos, pos->parent);
+ if (next)
+ return next;
+ pos = pos->parent;
+ }
+ return NULL;
+}
+
+/* Read pmonr's summary, safe to call without pkg's prmids lock.
+ * The possible scenarios are:
+ * - summary's occupancy cannot be read, return -1.
+ * - summary has no RMID but could be read as zero occupancy, return 0 and set
+ * rmid = INVALID_RMID.
+ * - summary has valid read RMID, set rmid to it.
+ */
+static inline int
+pmonr__get_read_rmid(struct pmonr *pmonr, u32 *rmid, bool fail_on_inherited)
+{
+ union prmid_summary summary;
+
+ *rmid = INVALID_RMID;
+
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+ /* A pmonr in (I)state that doesn't fail can report its limbo_prmid
+ * or NULL.
+ */
+ if (prmid_summary__is_istate(summary) && fail_on_inherited)
+ return -1;
+ /* A pmonr with inactive monitoring can be safely ignored. */
+ if (!prmid_summary__is_mon_active(summary))
+ return 0;
+
+ /* A pmonr that hasn't run in a pkg is safe to ignore since it
+ * cannot have occupancy there.
+ */
+ if (prmid_summary__is_ustate(summary))
+ return 0;
+ /* At this point the pmonr is either in (A)state or (I)state
+ * with fail_on_inherited=false. In the latter case,
+ * read_rmid is INVALID_RMID and the read is considered successful.
+ */
+ *rmid = summary.read_rmid;
+ return 0;
+}
+
+/* Read occupancy for all pmonrs in the subtree rooted at monr
+ * for the current package.
+ * Best-effort two-stage read. First, obtain all RMIDs in the subtree
+ * with the lock held, pushing them onto a stack; if the stack is full,
+ * update and read in place (slower). Then, with the lock released,
+ * update and read occupancy for the RMIDs left in the stack.
+ */
+static int pmonr__read_subtree(struct monr *monr, u16 pkg_id,
+ u64 *total, bool fail_on_inh_descendant)
+{
+ struct monr *pos = NULL;
+ struct astack astack;
+ int ret;
+ unsigned long flags;
+ u64 count;
+ struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+
+ *total = 0;
+ /* Must run in a CPU in the package to read. */
+ if (WARN_ON_ONCE(pkg_id !=
+ topology_physical_package_id(smp_processor_id())))
+ return -1;
+
+ astack__init(&astack, NR_RMIDS_PER_NODE - 1, pkg_id);
+
+ /* Lock to protect against changes in the pmonr hierarchy. */
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+
+ while ((pos = monr_next_descendant_pre(pos, monr))) {
+ struct prmid *prmid;
+ u32 rmid;
+ /* The pmonr of the monr being read cannot be inherited;
+ * descendants may be, depending on the flag.
+ */
+ bool fail_on_inh = pos == monr || fail_on_inh_descendant;
+
+ ret = pmonr__get_read_rmid(pos->pmonrs[pkg_id],
+ &rmid, fail_on_inh);
+ if (ret)
+ goto exit_error;
+
+ if (rmid == INVALID_RMID)
+ continue;
+
+ ret = astack__push(&astack);
+ if (!ret) {
+ __astack__top(&astack, rmids) = rmid;
+ continue;
+ }
+ /* If no space in stack, update and read here (slower). */
+ prmid = __prmid_from_rmid(pkg_id, rmid);
+ if (WARN_ON_ONCE(!prmid))
+ goto exit_error;
+
+ ret = cqm_prmid_update(prmid);
+ if (ret < 0)
+ goto exit_error;
+
+ *total += atomic64_read(&prmid->last_read_value);
+ }
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+
+ ret = astack__rmids_sum_apply(&astack, pkg_id,
+ &__rmid_fn__cqm_prmid_update, &count);
+ if (ret < 0)
+ return ret;
+
+ *total += count;
+ astack__release(&astack);
+
+ return 0;
+
+exit_error:
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ astack__release(&astack);
+ return ret;
+}
+
+/* Read current package immediately and remote pkg (if any) from cache. */
+static void __read_task_event(struct perf_event *event)
+{
+ int i, ret;
+ u64 count = 0;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ struct monr *monr = monr_from_event(event);
+
+ /* Read either local or polled occupancy from all packages. */
+ cqm_pkg_id_for_each_online(i) {
+ struct prmid *prmid;
+ u32 rmid;
+ struct pmonr *pmonr = monr->pmonrs[i];
+
+ ret = pmonr__get_read_rmid(pmonr, &rmid, true);
+ if (ret)
+ return;
+ if (rmid == INVALID_RMID)
+ continue;
+ prmid = __prmid_from_rmid(i, rmid);
+ if (WARN_ON_ONCE(!prmid))
+ return;
+
+ /* update and read local for this cpu's package. */
+ if (i == pkg_id)
+ cqm_prmid_update(prmid);
+ count += atomic64_read(&prmid->last_read_value);
+ }
+ local64_set(&event->count, count);
+}
+
/* Read current package immediately and remote pkg (if any) from cache. */
static void intel_cqm_event_read(struct perf_event *event)
{
- union prmid_summary summary;
- struct prmid *prmid;
+ struct monr *monr;
+ u64 count;
u16 pkg_id = topology_physical_package_id(smp_processor_id());
- struct pmonr *pmonr = monr_from_event(event)->pmonrs[pkg_id];
- summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
- prmid = __prmid_from_rmid(pkg_id, summary.read_rmid);
- cqm_prmid_update(prmid);
- local64_set(&event->count, atomic64_read(&prmid->last_read_value));
+ monr = monr_from_event(event);
+
+ WARN_ON_ONCE(event->cpu != -1 &&
+ topology_physical_package_id(event->cpu) != pkg_id);
+
+ /* Only the perf_event group leader returns a value; all other events
+ * in the group share the same RMID.
+ */
+ if (event->parent) {
+ local64_set(&event->count, 0);
+ return;
+ }
+
+ if (event->attach_state & PERF_ATTACH_TASK) {
+ __read_task_event(event);
+ return;
+ }
+
+ /* It's either a cgroup or a cpu event. */
+ if (WARN_ON_ONCE(event->cpu < 0))
+ return;
+
+ /* XXX: expose fail_on_inh_descendant as a configuration parameter? */
+ pmonr__read_subtree(monr, pkg_id, &count, false);
+
+ local64_set(&event->count, count);
+ return;
}
static inline bool cqm_group_leader(struct perf_event *event)
--
2.8.0.rc3.226.g39d4020
Allow monitored cgroups to update the PQR MSR during task switch even
without an associated perf_event.
The package RMID for the current monr associated with a monitored
cgroup is written to hw during task switch (after perf_events runs)
if perf_event did not write an RMID for an event.
perf_event and any other caller of pqr_cache_update_rmid can update the
CPU's RMID using one of two modes:
- PQR_RMID_MODE_NOEVENT: An RMID that does not correspond to an event,
e.g. the RMID of the root pmonr when no event is scheduled.
- PQR_RMID_MODE_EVENT: An RMID used by an event. Set during pmu::add,
unset on pmu::del. This mode prevents a non-event cgroup RMID from
being used.
This patch also introduces caching of writes to the PQR MSR within the
per-cpu pqr state variable. This interface to update RMIDs and CLOSIDs
will also be utilized in upcoming versions of Intel's MBM and CAT drivers.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 65 +++++++++++++++++++++++++++++----------
arch/x86/events/intel/cqm.h | 2 --
arch/x86/include/asm/pqr_common.h | 53 +++++++++++++++++++++++++++----
arch/x86/kernel/cpu/pqr_common.c | 46 +++++++++++++++++++++++----
4 files changed, 135 insertions(+), 31 deletions(-)
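The intended per-CPU sequence is: pmu::add marks the next RMID as an event
RMID, pmu::del drops it back to PQR_RMID_MODE_NOEVENT, and the single write
to MSR_IA32_PQR_ASSOC happens at the end of the task switch, falling back to
the monitored cgroup's RMID when no event claimed the slot. The sketch below
only illustrates that ordering with the functions added here; it is not meant
to compile stand-alone and the caller name is hypothetical.

/* Illustration only; not part of the series. */
static void example_cqm_sched_in(u32 event_rmid)
{
	/* pmu::add for a CQM event: claim this CPU's RMID slot. */
	pqr_cache_update_rmid(event_rmid, PQR_RMID_MODE_EVENT);

	/*
	 * At the end of the task switch, pqr_update() runs: if the cached
	 * mode is still PQR_RMID_MODE_NOEVENT, __intel_cqm_no_event_sched_in()
	 * supplies the cgroup's RMID; the MSR is then written at most once,
	 * and only if rmid/closid actually changed.
	 */
	pqr_update();
}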
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index daf9fdf..4ece0a4 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -198,19 +198,6 @@ static inline int cqm_prmid_update(struct prmid *prmid)
return __cqm_prmid_update(prmid, __rmid_min_update_time);
}
-/*
- * Updates caller cpu's cache.
- */
-static inline void __update_pqr_prmid(struct prmid *prmid)
-{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
- if (state->rmid == prmid->rmid)
- return;
- state->rmid = prmid->rmid;
- wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
-}
-
static inline bool __valid_pkg_id(u16 pkg_id)
{
return pkg_id < PQR_MAX_NR_PKGS;
@@ -2531,12 +2518,11 @@ static inline bool cqm_group_leader(struct perf_event *event)
static inline void __intel_cqm_event_start(
struct perf_event *event, union prmid_summary summary)
{
- u16 pkg_id = topology_physical_package_id(smp_processor_id());
if (!(event->hw.state & PERF_HES_STOPPED))
return;
-
event->hw.state &= ~PERF_HES_STOPPED;
- __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
+
+ pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_EVENT);
}
static void intel_cqm_event_start(struct perf_event *event, int mode)
@@ -2566,7 +2552,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
/* Occupancy of CQM events is obtained at read. No need to read
* when the event is stopped, since reads on inactive cpus succeed.
*/
- __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
+ pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
}
static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -2977,6 +2963,8 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
state->rmid = 0;
state->closid = 0;
+ state->next_rmid = 0;
+ state->next_closid = 0;
/* XXX: lock */
/* XXX: Make sure this case is handled when hotplug happens. */
@@ -3152,6 +3140,12 @@ static int __init intel_cqm_init(void)
pr_info("Intel CQM monitoring enabled with at least %u rmids per package.\n",
min_max_rmid + 1);
+ /* Make sure pqr_common_enable_key is enabled after
+ * cqm_initialized_key.
+ */
+ barrier();
+
+ static_branch_enable(&pqr_common_enable_key);
return ret;
error_init_mutex:
@@ -3163,4 +3157,41 @@ error:
return ret;
}
+/* Schedule task without a CQM perf_event. */
+inline void __intel_cqm_no_event_sched_in(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+ struct monr *monr;
+ struct pmonr *pmonr;
+ union prmid_summary summary;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
+
+ /* Assume CQM is likely to be enabled, given that PQR is enabled. */
+ if (!static_branch_likely(&cqm_initialized_key))
+ return;
+
+ /* Safe to call perf_cgroup_from_task() since we hold the scheduler lock. */
+ monr = monr_from_perf_cgroup(perf_cgroup_from_task(current, NULL));
+ pmonr = monr->pmonrs[pkg_id];
+
+ /* Utilize most up to date pmonr summary. */
+ monr_hrchy_get_next_prmid_summary(pmonr);
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+
+ if (!prmid_summary__is_mon_active(summary))
+ goto no_rmid;
+
+ if (WARN_ON_ONCE(!__valid_rmid(pkg_id, summary.sched_rmid)))
+ goto no_rmid;
+
+ pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
+ return;
+
+no_rmid:
+ summary.value = atomic64_read(&root_pmonr->prmid_summary_atomic);
+ pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
+#endif
+}
+
device_initcall(intel_cqm_init);
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index 0f3da94..e1f8bd0 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -82,8 +82,6 @@ union prmid_summary {
};
};
-# define INVALID_RMID (-1)
-
/* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or INVALID_RMID
* depending on whether monitoring is active or not.
*/
diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
index f770637..abbb235 100644
--- a/arch/x86/include/asm/pqr_common.h
+++ b/arch/x86/include/asm/pqr_common.h
@@ -3,31 +3,72 @@
#if defined(CONFIG_INTEL_RDT)
+#include <linux/jump_label.h>
#include <linux/types.h>
#include <asm/percpu.h>
+#include <asm/msr.h>
#define MSR_IA32_PQR_ASSOC 0x0c8f
+#define INVALID_RMID (-1)
+#define INVALID_CLOSID (-1)
+
+
+extern struct static_key_false pqr_common_enable_key;
+
+enum intel_pqr_rmid_mode {
+ /* RMID has no perf_event associated. */
+ PQR_RMID_MODE_NOEVENT = 0,
+ /* RMID has a perf_event associated. */
+ PQR_RMID_MODE_EVENT
+};
/**
* struct intel_pqr_state - State cache for the PQR MSR
- * @rmid: The cached Resource Monitoring ID
- * @closid: The cached Class Of Service ID
+ * @rmid: Last rmid written to hw.
+ * @next_rmid: Next rmid to write to hw.
+ * @next_rmid_mode: Next rmid's mode.
+ * @closid: The current Class Of Service ID
+ * @next_closid: The Class Of Service ID to use.
*
* The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
* lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
* contains both parts, so we need to cache them.
*
- * The cache also helps to avoid pointless updates if the value does
- * not change.
+ * The cache also helps to avoid pointless updates if the value does not
+ * change. It also keeps track of the type of RMID set (event vs no event)
+ * used to determine when a cgroup RMID is required.
*/
struct intel_pqr_state {
- u32 rmid;
- u32 closid;
+ u32 rmid;
+ u32 next_rmid;
+ enum intel_pqr_rmid_mode next_rmid_mode;
+ u32 closid;
+ u32 next_closid;
};
DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
#define PQR_MAX_NR_PKGS 8
+void __pqr_update(void);
+
+inline void __intel_cqm_no_event_sched_in(void);
+
+inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode);
+
+inline void pqr_cache_update_closid(u32 closid);
+
+static inline void pqr_update(void)
+{
+ if (static_branch_unlikely(&pqr_common_enable_key))
+ __pqr_update();
+}
+
+#else
+
+static inline void pqr_update(void)
+{
+}
+
#endif
#endif
diff --git a/arch/x86/kernel/cpu/pqr_common.c b/arch/x86/kernel/cpu/pqr_common.c
index 9eff5d9..d91c127 100644
--- a/arch/x86/kernel/cpu/pqr_common.c
+++ b/arch/x86/kernel/cpu/pqr_common.c
@@ -1,9 +1,43 @@
#include <asm/pqr_common.h>
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
+
+inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+ state->next_rmid_mode = mode;
+ state->next_rmid = rmid;
+}
+
+inline void pqr_cache_update_closid(u32 closid)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+ state->next_closid = closid;
+}
+
+/* Update hw's RMID using the cgroup's RMID if perf_event did not set one.
+ * Sync the pqr cache with the MSR.
+ */
+inline void __pqr_update(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+ /* If perf_event already set an event RMID, do not try to obtain
+ * another one from the current task's cgroup.
+ */
+ if (state->next_rmid_mode == PQR_RMID_MODE_NOEVENT)
+ __intel_cqm_no_event_sched_in();
+
+ /* __intel_cqm_no_event_sched_in might have changed next_rmid. */
+ if (state->rmid == state->next_rmid &&
+ state->closid == state->next_closid)
+ return;
+
+ state->rmid = state->next_rmid;
+ state->closid = state->next_closid;
+ wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
+}
--
2.8.0.rc3.226.g39d4020
From: Stephane Eranian <[email protected]>
When an event is in error state, read() returns 0
instead of the size of the buffer. In certain modes, such
as interval printing, ignoring the 0 return value
may cause bogus count deltas to be computed and
thus invalid results printed.
This patch fixes the problem by modifying read_counters()
to mark the event as not scaled (scaled = -1), forcing
the printout routine to show <NOT COUNTED>.
Signed-off-by: Stephane Eranian <[email protected]>
---
tools/perf/builtin-stat.c | 12 +++++++++---
tools/perf/util/evsel.c | 4 ++--
2 files changed, 11 insertions(+), 5 deletions(-)
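A toy model of the failure mode: if the failed read is silently treated as a
zeroed buffer, the next interval delta wraps into garbage, whereas marking
the counter as not scaled lets the printout show <not counted>. All values
below are made up.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t prev = 120000, cur;
	int read_ok = 0;	/* event is in error state */
	int scaled;

	/* Old behaviour: ignore the 0 return and keep a zeroed buffer. */
	cur = read_ok ? 150000 : 0;
	printf("bogus delta: %llu\n", (unsigned long long)(cur - prev));

	/* Patched behaviour: flag the counter instead of using the value. */
	scaled = read_ok ? 0 : -1;
	if (scaled == -1)
		printf("<not counted>\n");
	return 0;
}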
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 1f19f2f..a4e5610 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -289,8 +289,12 @@ static int read_counter(struct perf_evsel *counter)
struct perf_counts_values *count;
count = perf_counts(counter->counts, cpu, thread);
- if (perf_evsel__read(counter, cpu, thread, count))
+ if (perf_evsel__read(counter, cpu, thread, count)) {
+ counter->counts->scaled = -1;
+ perf_counts(counter->counts, cpu, thread)->ena = 0;
+ perf_counts(counter->counts, cpu, thread)->run = 0;
return -1;
+ }
if (STAT_RECORD) {
if (perf_evsel__write_stat_event(counter, cpu, thread, count)) {
@@ -307,12 +311,14 @@ static int read_counter(struct perf_evsel *counter)
static void read_counters(bool close_counters)
{
struct perf_evsel *counter;
+ int ret;
evlist__for_each(evsel_list, counter) {
- if (read_counter(counter))
+ ret = read_counter(counter);
+ if (ret)
pr_debug("failed to read counter %s\n", counter->name);
- if (perf_stat_process_counter(&stat_config, counter))
+ if (ret == 0 && perf_stat_process_counter(&stat_config, counter))
pr_warning("failed to process counter %s\n", counter->name);
if (close_counters) {
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 545bb3f..52a0c35 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1150,7 +1150,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;
- if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
+ if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
return -errno;
return 0;
@@ -1168,7 +1168,7 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
return -ENOMEM;
- if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) < 0)
+ if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
return -errno;
perf_evsel__compute_deltas(evsel, cpu, thread, &count);
--
2.8.0.rc3.226.g39d4020
A package-wide event can return a valid read even if it has not run on a
specific cpu; this does not fit well with the assumption that run == 0
is equivalent to <not counted>.
To fix the problem, this patch defines special error values for val,
run and ena (~0ULL) and uses them to signal read errors, allowing run == 0
to be a valid value for package events. A new value, NA, is output on
read error and when the event has not been enabled (time enabled == 0).
Finally, this patch revamps the calculation of deltas and scaling for
snapshot events, removing the calculation of deltas for time running and
enabled in snapshot events, as it should be.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
tools/perf/builtin-stat.c | 37 ++++++++++++++++++++++++++-----------
tools/perf/util/counts.h | 19 +++++++++++++++++++
tools/perf/util/evsel.c | 44 +++++++++++++++++++++++++++++++++-----------
tools/perf/util/evsel.h | 8 ++++++--
tools/perf/util/stat.c | 35 +++++++++++------------------------
5 files changed, 95 insertions(+), 48 deletions(-)
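The NA sentinel is all-ones, and a sample is treated as unusable if any of
its three fields is NA, so run == 0 stays a legitimate value for per-package
events. A small standalone program mirroring the helpers added to
tools/perf/util/counts.h:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PERF_COUNTS_NA ((uint64_t)~0ULL)

struct counts { uint64_t val, ena, run; };

static void make_na(struct counts *c)
{
	c->val = c->ena = c->run = PERF_COUNTS_NA;
}

static bool is_na(const struct counts *c)
{
	return c->val == PERF_COUNTS_NA || c->ena == PERF_COUNTS_NA ||
	       c->run == PERF_COUNTS_NA;
}

int main(void)
{
	struct counts c = { 123, 456, 0 };	/* run == 0: valid per-pkg read */

	printf("per-pkg sample:    %s\n", is_na(&c) ? "NA" : "ok");
	make_na(&c);			/* what a failed read produces */
	printf("after failed read: %s\n", is_na(&c) ? "NA" : "ok");
	return 0;
}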
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index a4e5610..f1c2166 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -63,6 +63,7 @@
#include "util/tool.h"
#include "asm/bug.h"
+#include <math.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <locale.h>
@@ -290,10 +291,8 @@ static int read_counter(struct perf_evsel *counter)
count = perf_counts(counter->counts, cpu, thread);
if (perf_evsel__read(counter, cpu, thread, count)) {
- counter->counts->scaled = -1;
- perf_counts(counter->counts, cpu, thread)->ena = 0;
- perf_counts(counter->counts, cpu, thread)->run = 0;
- return -1;
+ /* do not write stat for failed reads. */
+ continue;
}
if (STAT_RECORD) {
@@ -668,12 +667,16 @@ static int run_perf_stat(int argc, const char **argv)
static void print_running(u64 run, u64 ena)
{
+ bool is_na = run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || !ena;
+
if (csv_output) {
- fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
- csv_sep,
- run,
- csv_sep,
- ena ? 100.0 * run / ena : 100.0);
+ if (is_na)
+ fprintf(stat_config.output, "%sNA%sNA", csv_sep, csv_sep);
+ else
+ fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
+ csv_sep, run, csv_sep, 100.0 * run / ena);
+ } else if (is_na) {
+ fprintf(stat_config.output, " (NA)");
} else if (run != ena) {
fprintf(stat_config.output, " (%.2f%%)", 100.0 * run / ena);
}
@@ -1046,7 +1049,7 @@ static void printout(int id, int nr, struct perf_evsel *counter, double uval,
if (counter->cgrp)
os.nfields++;
}
- if (run == 0 || ena == 0 || counter->counts->scaled == -1) {
+ if (run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || counter->counts->scaled == -1) {
if (metric_only) {
pm(&os, NULL, "", "", 0);
return;
@@ -1152,12 +1155,17 @@ static void print_aggr(char *prefix)
id = aggr_map->map[s];
first = true;
evlist__for_each(evsel_list, counter) {
+ bool all_nan = true;
val = ena = run = 0;
nr = 0;
for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
s2 = aggr_get_id(perf_evsel__cpus(counter), cpu);
if (s2 != id)
continue;
+ /* skip NA reads. */
+ if (perf_counts_values__is_na(perf_counts(counter->counts, cpu, 0)))
+ continue;
+ all_nan = false;
val += perf_counts(counter->counts, cpu, 0)->val;
ena += perf_counts(counter->counts, cpu, 0)->ena;
run += perf_counts(counter->counts, cpu, 0)->run;
@@ -1171,6 +1179,10 @@ static void print_aggr(char *prefix)
fprintf(output, "%s", prefix);
uval = val * counter->scale;
+ if (all_nan) {
+ run = PERF_COUNTS_NA;
+ ena = PERF_COUNTS_NA;
+ }
printout(id, nr, counter, uval, prefix, run, ena, 1.0);
if (!metric_only)
fputc('\n', output);
@@ -1249,7 +1261,10 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
if (prefix)
fprintf(output, "%s", prefix);
- uval = val * counter->scale;
+ if (val != PERF_COUNTS_NA)
+ uval = val * counter->scale;
+ else
+ uval = NAN;
printout(cpu, 0, counter, uval, prefix, run, ena, 1.0);
fputc('\n', output);
diff --git a/tools/perf/util/counts.h b/tools/perf/util/counts.h
index 34d8baa..b65e97a 100644
--- a/tools/perf/util/counts.h
+++ b/tools/perf/util/counts.h
@@ -3,6 +3,9 @@
#include "xyarray.h"
+/* Not Available (NA) value. Any operation involving NA yields NA. */
+#define PERF_COUNTS_NA ((u64)~0ULL)
+
struct perf_counts_values {
union {
struct {
@@ -14,6 +17,22 @@ struct perf_counts_values {
};
};
+static inline void
+perf_counts_values__make_na(struct perf_counts_values *values)
+{
+ values->val = PERF_COUNTS_NA;
+ values->ena = PERF_COUNTS_NA;
+ values->run = PERF_COUNTS_NA;
+}
+
+static inline bool
+perf_counts_values__is_na(struct perf_counts_values *values)
+{
+ return values->val == PERF_COUNTS_NA ||
+ values->ena == PERF_COUNTS_NA ||
+ values->run == PERF_COUNTS_NA;
+}
+
struct perf_counts {
s8 scaled;
struct perf_counts_values aggr;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 52a0c35..da63a87 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1109,6 +1109,9 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
if (!evsel->prev_raw_counts)
return;
+ if (perf_counts_values__is_na(count))
+ return;
+
if (cpu == -1) {
tmp = evsel->prev_raw_counts->aggr;
evsel->prev_raw_counts->aggr = *count;
@@ -1117,26 +1120,38 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
*perf_counts(evsel->prev_raw_counts, cpu, thread) = *count;
}
- count->val = count->val - tmp.val;
+ /* Snapshot events do not calculate deltas for count values. */
+ if (!evsel->snapshot)
+ count->val = count->val - tmp.val;
count->ena = count->ena - tmp.ena;
count->run = count->run - tmp.run;
}
void perf_counts_values__scale(struct perf_counts_values *count,
- bool scale, s8 *pscaled)
+ bool scale, bool per_pkg, bool snapshot, s8 *pscaled)
{
s8 scaled = 0;
+ if (perf_counts_values__is_na(count)) {
+ if (pscaled)
+ *pscaled = -1;
+ return;
+ }
+
if (scale) {
- if (count->run == 0) {
+ /* per-pkg events can have run == 0 and be valid. */
+ if (count->run == 0 && !per_pkg) {
scaled = -1;
count->val = 0;
} else if (count->run < count->ena) {
scaled = 1;
- count->val = (u64)((double) count->val * count->ena / count->run + 0.5);
+ /* Snapshot events do not scale count values. */
+ if (!snapshot && count->run)
+ count->val = (u64)((double) count->val * count->ena /
+ count->run + 0.5);
}
- } else
- count->ena = count->run = 0;
+ }
+ count->run = count->ena;
if (pscaled)
*pscaled = scaled;
@@ -1150,8 +1165,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;
- if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
+ if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0) {
+ perf_counts_values__make_na(count);
return -errno;
+ }
return 0;
}
@@ -1159,6 +1176,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
int cpu, int thread, bool scale)
{
+ int ret = 0;
struct perf_counts_values count;
size_t nv = scale ? 3 : 1;
@@ -1168,13 +1186,17 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
return -ENOMEM;
- if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
- return -errno;
+ if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0) {
+ perf_counts_values__make_na(&count);
+ ret = -errno;
+ goto exit;
+ }
perf_evsel__compute_deltas(evsel, cpu, thread, &count);
- perf_counts_values__scale(&count, scale, NULL);
+ perf_counts_values__scale(&count, scale, evsel->per_pkg, evsel->snapshot, NULL);
+exit:
*perf_counts(evsel->counts, cpu, thread) = count;
- return 0;
+ return ret;
}
static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index b993218..e6a5854 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -74,6 +74,10 @@ struct perf_evsel_config_term {
* @is_pos: the position (counting backwards) of the event id (PERF_SAMPLE_ID or
* PERF_SAMPLE_IDENTIFIER) in a non-sample event i.e. if sample_id_all
* is used there is an id sample appended to non-sample events
+ * @snapshot: an event whose raw value cannot be extrapolated based on
+ * the ratio of running/enabled time.
+ * @per_pkg: an event that runs package-wide. All cores in the same package
+ * will read the same value, even if running time == 0.
* @priv: And what is in its containing unnamed union are tool specific
*/
struct perf_evsel {
@@ -144,8 +148,8 @@ static inline int perf_evsel__nr_cpus(struct perf_evsel *evsel)
return perf_evsel__cpus(evsel)->nr;
}
-void perf_counts_values__scale(struct perf_counts_values *count,
- bool scale, s8 *pscaled);
+void perf_counts_values__scale(struct perf_counts_values *count, bool scale,
+ bool per_pkg, bool snapshot, s8 *pscaled);
void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
struct perf_counts_values *count);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 4d9b481..b0f0d41 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -197,7 +197,7 @@ static void zero_per_pkg(struct perf_evsel *counter)
}
static int check_per_pkg(struct perf_evsel *counter,
- struct perf_counts_values *vals, int cpu, bool *skip)
+ int cpu, bool *skip)
{
unsigned long *mask = counter->per_pkg_mask;
struct cpu_map *cpus = perf_evsel__cpus(counter);
@@ -219,17 +219,6 @@ static int check_per_pkg(struct perf_evsel *counter,
counter->per_pkg_mask = mask;
}
- /*
- * we do not consider an event that has not run as a good
- * instance to mark a package as used (skip=1). Otherwise
- * we may run into a situation where the first CPU in a package
- * is not running anything, yet the second is, and this function
- * would mark the package as used after the first CPU and would
- * not read the values from the second CPU.
- */
- if (!(vals->run && vals->ena))
- return 0;
-
s = cpu_map__get_socket(cpus, cpu, NULL);
if (s < 0)
return -1;
@@ -244,30 +233,27 @@ process_counter_values(struct perf_stat_config *config, struct perf_evsel *evsel
struct perf_counts_values *count)
{
struct perf_counts_values *aggr = &evsel->counts->aggr;
- static struct perf_counts_values zero;
bool skip = false;
- if (check_per_pkg(evsel, count, cpu, &skip)) {
+ if (check_per_pkg(evsel, cpu, &skip)) {
pr_err("failed to read per-pkg counter\n");
return -1;
}
- if (skip)
- count = &zero;
-
switch (config->aggr_mode) {
case AGGR_THREAD:
case AGGR_CORE:
case AGGR_SOCKET:
case AGGR_NONE:
- if (!evsel->snapshot)
- perf_evsel__compute_deltas(evsel, cpu, thread, count);
- perf_counts_values__scale(count, config->scale, NULL);
+ perf_evsel__compute_deltas(evsel, cpu, thread, count);
+ perf_counts_values__scale(count, config->scale,
+ evsel->per_pkg, evsel->snapshot, NULL);
if (config->aggr_mode == AGGR_NONE)
perf_stat__update_shadow_stats(evsel, count->values, cpu);
break;
case AGGR_GLOBAL:
- aggr->val += count->val;
+ if (!skip)
+ aggr->val += count->val;
if (config->scale) {
aggr->ena += count->ena;
aggr->run += count->run;
@@ -331,9 +317,10 @@ int perf_stat_process_counter(struct perf_stat_config *config,
if (config->aggr_mode != AGGR_GLOBAL)
return 0;
- if (!counter->snapshot)
- perf_evsel__compute_deltas(counter, -1, -1, aggr);
- perf_counts_values__scale(aggr, config->scale, &counter->counts->scaled);
+ perf_evsel__compute_deltas(counter, -1, -1, aggr);
+ perf_counts_values__scale(aggr, config->scale,
+ counter->per_pkg, counter->snapshot,
+ &counter->counts->scaled);
for (i = 0; i < 3; i++)
update_stats(&ps->res_stats[i], count[i]);
--
2.8.0.rc3.226.g39d4020
This hook allows architecture-specific code to be called at the end of
the task switch, after perf_events' context switch but before the
scheduler lock is released.
The specific use case in this series is to avoid multiple writes to a slow
MSR by deferring the hardware write until all functions that modify the
register during task switch have finished.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/include/asm/processor.h | 4 ++++
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 3 +++
3 files changed, 8 insertions(+)
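A toy model of the write-batching idea behind the hook: callers during the
switch only update a cached "next" value, and the single hook at the end
performs the slow hardware write once, and only when something changed. The
names below are made up; the MSR write is stood in for by a counter.

#include <stdio.h>
#include <stdint.h>

static struct { uint32_t cur, next; } pqr;
static int hw_writes;

static void cache_update(uint32_t rmid) { pqr.next = rmid; }

static void finish_pre_lock_switch_hook(void)
{
	if (pqr.cur == pqr.next)
		return;
	pqr.cur = pqr.next;
	hw_writes++;			/* stands in for the MSR write */
}

int main(void)
{
	cache_update(3);		/* e.g. perf event sched_in */
	cache_update(5);		/* e.g. cgroup RMID wins */
	finish_pre_lock_switch_hook();	/* one hardware write */
	printf("hw writes: %d (value %u)\n", hw_writes, pqr.cur);
	return 0;
}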
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 9264476..036d94a 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@ struct vm86;
#include <asm/nops.h>
#include <asm/special_insns.h>
#include <asm/fpu/types.h>
+#include <asm/pqr_common.h>
#include <linux/personality.h>
#include <linux/cache.h>
@@ -841,4 +842,7 @@ bool xen_set_default_idle(void);
void stop_this_cpu(void *dummy);
void df_debug(struct pt_regs *regs, long error_code);
+
+#define finish_arch_pre_lock_switch pqr_update
+
#endif /* _ASM_X86_PROCESSOR_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..bcd5473 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2620,6 +2620,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
prev_state = prev->state;
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
+ finish_arch_pre_lock_switch();
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d2..cb48b5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1077,6 +1077,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
#ifndef prepare_arch_switch
# define prepare_arch_switch(next) do { } while (0)
#endif
+#ifndef finish_arch_pre_lock_switch
+# define finish_arch_pre_lock_switch() do { } while (0)
+#endif
#ifndef finish_arch_post_lock_switch
# define finish_arch_post_lock_switch() do { } while (0)
#endif
--
2.8.0.rc3.226.g39d4020
New PMUs, such as CQM's, do not guarantee that a read will succeed even
if pmu::add was successful.
In the generic code, this patch adds an int error return and completes the
error checking path up to perf_read().
In CQM's PMU, it adds proper error handling of hw read failures.
In other PMUs, pmu::read() simply returns 0.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/alpha/kernel/perf_event.c | 3 +-
arch/arc/kernel/perf_event.c | 3 +-
arch/arm64/include/asm/hw_breakpoint.h | 2 +-
arch/arm64/kernel/hw_breakpoint.c | 3 +-
arch/metag/kernel/perf/perf_event.c | 5 ++-
arch/mips/kernel/perf_event_mipsxx.c | 3 +-
arch/powerpc/include/asm/hw_breakpoint.h | 2 +-
arch/powerpc/kernel/hw_breakpoint.c | 3 +-
arch/powerpc/perf/core-book3s.c | 11 +++---
arch/powerpc/perf/core-fsl-emb.c | 5 ++-
arch/powerpc/perf/hv-24x7.c | 5 ++-
arch/powerpc/perf/hv-gpci.c | 3 +-
arch/s390/kernel/perf_cpum_cf.c | 5 ++-
arch/s390/kernel/perf_cpum_sf.c | 3 +-
arch/sh/include/asm/hw_breakpoint.h | 2 +-
arch/sh/kernel/hw_breakpoint.c | 3 +-
arch/sparc/kernel/perf_event.c | 2 +-
arch/tile/kernel/perf_event.c | 3 +-
arch/x86/events/amd/ibs.c | 2 +-
arch/x86/events/amd/iommu.c | 5 ++-
arch/x86/events/amd/uncore.c | 3 +-
arch/x86/events/core.c | 3 +-
arch/x86/events/intel/bts.c | 3 +-
arch/x86/events/intel/cqm.c | 30 ++++++++------
arch/x86/events/intel/cstate.c | 3 +-
arch/x86/events/intel/pt.c | 3 +-
arch/x86/events/intel/rapl.c | 3 +-
arch/x86/events/intel/uncore.c | 3 +-
arch/x86/events/intel/uncore.h | 2 +-
arch/x86/events/msr.c | 3 +-
arch/x86/include/asm/hw_breakpoint.h | 2 +-
arch/x86/kernel/hw_breakpoint.c | 3 +-
arch/x86/kvm/pmu.h | 10 +++--
drivers/bus/arm-cci.c | 3 +-
drivers/bus/arm-ccn.c | 3 +-
drivers/perf/arm_pmu.c | 3 +-
include/linux/perf_event.h | 6 +--
kernel/events/core.c | 68 +++++++++++++++++++++-----------
38 files changed, 141 insertions(+), 86 deletions(-)
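With the int return, a driver's pmu::read() can report a transient hardware
failure instead of silently leaving a stale count for perf_read() to
propagate. A hedged sketch of such a callback under the new signature; the
mydrv_* names and the -ENODATA choice are illustrative, only the int return
comes from this series.

/* Sketch only; mydrv_hw_read() is a hypothetical helper. */
static int mydrv_pmu_read(struct perf_event *event)
{
	u64 val;

	/* e.g. the backing counter/RMID was reclaimed after pmu::add(). */
	if (mydrv_hw_read(event, &val))
		return -ENODATA;

	local64_set(&event->count, val);
	return 0;
}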
diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 5c218aa..3bf8a60 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -520,11 +520,12 @@ static void alpha_pmu_del(struct perf_event *event, int flags)
}
-static void alpha_pmu_read(struct perf_event *event)
+static int alpha_pmu_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
alpha_perf_event_update(event, hwc, hwc->idx, 0);
+ return 0;
}
diff --git a/arch/arc/kernel/perf_event.c b/arch/arc/kernel/perf_event.c
index 8b134cf..6e4f819 100644
--- a/arch/arc/kernel/perf_event.c
+++ b/arch/arc/kernel/perf_event.c
@@ -116,9 +116,10 @@ static void arc_perf_event_update(struct perf_event *event,
local64_sub(delta, &hwc->period_left);
}
-static void arc_pmu_read(struct perf_event *event)
+static int arc_pmu_read(struct perf_event *event)
{
arc_perf_event_update(event, &event->hw, event->hw.idx);
+ return 0;
}
static int arc_pmu_cache_event(u64 config)
diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
index 115ea2a..869ce97 100644
--- a/arch/arm64/include/asm/hw_breakpoint.h
+++ b/arch/arm64/include/asm/hw_breakpoint.h
@@ -126,7 +126,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
extern int arch_install_hw_breakpoint(struct perf_event *bp);
extern void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-extern void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern int hw_breakpoint_pmu_read(struct perf_event *bp);
extern int hw_breakpoint_slots(int type);
#ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index 4ef5373..ac1a6ca 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -942,8 +942,9 @@ static int __init arch_hw_breakpoint_init(void)
}
arch_initcall(arch_hw_breakpoint_init);
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
{
+ return 0;
}
/*
diff --git a/arch/metag/kernel/perf/perf_event.c b/arch/metag/kernel/perf/perf_event.c
index 2478ec6..9721c1a 100644
--- a/arch/metag/kernel/perf/perf_event.c
+++ b/arch/metag/kernel/perf/perf_event.c
@@ -360,15 +360,16 @@ static void metag_pmu_del(struct perf_event *event, int flags)
perf_event_update_userpage(event);
}
-static void metag_pmu_read(struct perf_event *event)
+static int metag_pmu_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
/* Don't read disabled counters! */
if (hwc->idx < 0)
- return;
+ return 0;
metag_pmu_event_update(event, hwc, hwc->idx);
+ return 0;
}
static struct pmu pmu = {
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index 9bc1191..bdc0915 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -509,7 +509,7 @@ static void mipspmu_del(struct perf_event *event, int flags)
perf_event_update_userpage(event);
}
-static void mipspmu_read(struct perf_event *event)
+static int mipspmu_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -518,6 +518,7 @@ static void mipspmu_read(struct perf_event *event)
return;
mipspmu_event_update(event, hwc, hwc->idx);
+ return 0;
}
static void mipspmu_enable(struct pmu *pmu)
diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h
index ac6432d..5218696 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -66,7 +66,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
unsigned long val, void *data);
int arch_install_hw_breakpoint(struct perf_event *bp);
void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
extern void flush_ptrace_hw_breakpoint(struct task_struct *tsk);
extern struct pmu perf_ops_bp;
diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index aec9a1b..d462d8a 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -361,7 +361,8 @@ void flush_ptrace_hw_breakpoint(struct task_struct *tsk)
t->ptrace_bps[0] = NULL;
}
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
{
/* TODO */
+ return 0;
}
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 97a1d40..0baf04e 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1002,20 +1002,20 @@ static u64 check_and_compute_delta(u64 prev, u64 val)
return delta;
}
-static void power_pmu_read(struct perf_event *event)
+static int power_pmu_read(struct perf_event *event)
{
s64 val, delta, prev;
if (event->hw.state & PERF_HES_STOPPED)
- return;
+ return 0;
if (!event->hw.idx)
- return;
+ return 0;
if (is_ebb_event(event)) {
val = read_pmc(event->hw.idx);
local64_set(&event->hw.prev_count, val);
- return;
+ return 0;
}
/*
@@ -1029,7 +1029,7 @@ static void power_pmu_read(struct perf_event *event)
val = read_pmc(event->hw.idx);
delta = check_and_compute_delta(prev, val);
if (!delta)
- return;
+ return 0;
} while (local64_cmpxchg(&event->hw.prev_count, prev, val) != prev);
local64_add(delta, &event->count);
@@ -1049,6 +1049,7 @@ static void power_pmu_read(struct perf_event *event)
if (val < 1)
val = 1;
} while (local64_cmpxchg(&event->hw.period_left, prev, val) != prev);
+ return 0;
}
/*
diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
index 5d747b4..46d982e 100644
--- a/arch/powerpc/perf/core-fsl-emb.c
+++ b/arch/powerpc/perf/core-fsl-emb.c
@@ -176,12 +176,12 @@ static void write_pmlcb(int idx, unsigned long val)
isync();
}
-static void fsl_emb_pmu_read(struct perf_event *event)
+static int fsl_emb_pmu_read(struct perf_event *event)
{
s64 val, delta, prev;
if (event->hw.state & PERF_HES_STOPPED)
- return;
+ return 0;
/*
* Performance monitor interrupts come even when interrupts
@@ -198,6 +198,7 @@ static void fsl_emb_pmu_read(struct perf_event *event)
delta = (val - prev) & 0xfffffffful;
local64_add(delta, &event->count);
local64_sub(delta, &event->hw.period_left);
+ return 0;
}
/*
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 2da41b7..eddc853 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -1268,7 +1268,7 @@ static void update_event_count(struct perf_event *event, u64 now)
local64_add(now - prev, &event->count);
}
-static void h_24x7_event_read(struct perf_event *event)
+static int h_24x7_event_read(struct perf_event *event)
{
u64 now;
struct hv_24x7_request_buffer *request_buffer;
@@ -1289,7 +1289,7 @@ static void h_24x7_event_read(struct perf_event *event)
int ret;
if (__this_cpu_read(hv_24x7_txn_err))
- return;
+ return 0;
request_buffer = (void *)get_cpu_var(hv_24x7_reqb);
@@ -1323,6 +1323,7 @@ static void h_24x7_event_read(struct perf_event *event)
now = h_24x7_get_value(event);
update_event_count(event, now);
}
+ return 0;
}
static void h_24x7_event_start(struct perf_event *event, int flags)
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 7aa3723..d467ee6 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -191,12 +191,13 @@ static u64 h_gpci_get_value(struct perf_event *event)
return count;
}
-static void h_gpci_event_update(struct perf_event *event)
+static int h_gpci_event_update(struct perf_event *event)
{
s64 prev;
u64 now = h_gpci_get_value(event);
prev = local64_xchg(&event->hw.prev_count, now);
local64_add(now - prev, &event->count);
+ return 0;
}
static void h_gpci_event_start(struct perf_event *event, int flags)
diff --git a/arch/s390/kernel/perf_cpum_cf.c b/arch/s390/kernel/perf_cpum_cf.c
index 62f066b..719ec56 100644
--- a/arch/s390/kernel/perf_cpum_cf.c
+++ b/arch/s390/kernel/perf_cpum_cf.c
@@ -471,12 +471,13 @@ out:
return err;
}
-static void cpumf_pmu_read(struct perf_event *event)
+static int cpumf_pmu_read(struct perf_event *event)
{
if (event->hw.state & PERF_HES_STOPPED)
- return;
+ return 0;
hw_perf_event_update(event);
+ return 0;
}
static void cpumf_pmu_start(struct perf_event *event, int flags)
diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index eaab9a7..605055c 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -1298,9 +1298,10 @@ static void hw_perf_event_update(struct perf_event *event, int flush_all)
sampl_overflow, event_overflow);
}
-static void cpumsf_pmu_read(struct perf_event *event)
+static int cpumsf_pmu_read(struct perf_event *event)
{
/* Nothing to do ... updates are interrupt-driven */
+ return 0;
}
/* Activate sampling control.
diff --git a/arch/sh/include/asm/hw_breakpoint.h b/arch/sh/include/asm/hw_breakpoint.h
index ec9ad59..d3ad1bf 100644
--- a/arch/sh/include/asm/hw_breakpoint.h
+++ b/arch/sh/include/asm/hw_breakpoint.h
@@ -60,7 +60,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
int arch_install_hw_breakpoint(struct perf_event *bp);
void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
extern void arch_fill_perf_breakpoint(struct perf_event *bp);
extern int register_sh_ubc(struct sh_ubc *);
diff --git a/arch/sh/kernel/hw_breakpoint.c b/arch/sh/kernel/hw_breakpoint.c
index 2197fc5..3a2e719 100644
--- a/arch/sh/kernel/hw_breakpoint.c
+++ b/arch/sh/kernel/hw_breakpoint.c
@@ -401,9 +401,10 @@ int __kprobes hw_breakpoint_exceptions_notify(struct notifier_block *unused,
return hw_breakpoint_handler(data);
}
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
{
/* TODO */
+ return 0;
}
int register_sh_ubc(struct sh_ubc *ubc)
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 6596f66..ec454cd 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1131,7 +1131,7 @@ static void sparc_pmu_del(struct perf_event *event, int _flags)
local_irq_restore(flags);
}
-static void sparc_pmu_read(struct perf_event *event)
+static int sparc_pmu_read(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
int idx = active_event_index(cpuc, event);
diff --git a/arch/tile/kernel/perf_event.c b/arch/tile/kernel/perf_event.c
index 8767060..2b27890 100644
--- a/arch/tile/kernel/perf_event.c
+++ b/arch/tile/kernel/perf_event.c
@@ -734,9 +734,10 @@ static void tile_pmu_del(struct perf_event *event, int flags)
/*
* Propagate event elapsed time into the event.
*/
-static inline void tile_pmu_read(struct perf_event *event)
+static inline int tile_pmu_read(struct perf_event *event)
{
tile_perf_event_update(event);
+ return 0;
}
/*
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index feb90f6..03032ed 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -510,7 +510,7 @@ static void perf_ibs_del(struct perf_event *event, int flags)
perf_event_update_userpage(event);
}
-static void perf_ibs_read(struct perf_event *event) { }
+static int perf_ibs_read(struct perf_event *event) { return 0; }
PMU_FORMAT_ATTR(rand_en, "config:57");
PMU_FORMAT_ATTR(cnt_ctl, "config:19");
diff --git a/arch/x86/events/amd/iommu.c b/arch/x86/events/amd/iommu.c
index 40625ca..57340ae 100644
--- a/arch/x86/events/amd/iommu.c
+++ b/arch/x86/events/amd/iommu.c
@@ -317,7 +317,7 @@ static void perf_iommu_start(struct perf_event *event, int flags)
}
-static void perf_iommu_read(struct perf_event *event)
+static int perf_iommu_read(struct perf_event *event)
{
u64 count = 0ULL;
u64 prev_raw_count = 0ULL;
@@ -335,13 +335,14 @@ static void perf_iommu_read(struct perf_event *event)
prev_raw_count = local64_read(&hwc->prev_count);
if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
count) != prev_raw_count)
- return;
+ return 0;
/* Handling 48-bit counter overflowing */
delta = (count << COUNTER_SHIFT) - (prev_raw_count << COUNTER_SHIFT);
delta >>= COUNTER_SHIFT;
local64_add(delta, &event->count);
+ return 0;
}
static void perf_iommu_stop(struct perf_event *event, int flags)
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 98ac573..5e4f1e7 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -71,7 +71,7 @@ static struct amd_uncore *event_to_amd_uncore(struct perf_event *event)
return NULL;
}
-static void amd_uncore_read(struct perf_event *event)
+static int amd_uncore_read(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
u64 prev, new;
@@ -88,6 +88,7 @@ static void amd_uncore_read(struct perf_event *event)
delta = (new << COUNTER_SHIFT) - (prev << COUNTER_SHIFT);
delta >>= COUNTER_SHIFT;
local64_add(delta, &event->count);
+ return 0;
}
static void amd_uncore_start(struct perf_event *event, int flags)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 041e442..8323ecd 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1765,9 +1765,10 @@ static int __init init_hw_perf_events(void)
}
early_initcall(init_hw_perf_events);
-static inline void x86_pmu_read(struct perf_event *event)
+static inline int x86_pmu_read(struct perf_event *event)
{
x86_perf_event_update(event);
+ return 0;
}
/*
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 0a6e393..95e18c6 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -510,8 +510,9 @@ static int bts_event_init(struct perf_event *event)
return 0;
}
-static void bts_event_read(struct perf_event *event)
+static int bts_event_read(struct perf_event *event)
{
+ return 0;
}
static __init int bts_init(void)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 33691c1..8189e47 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2444,7 +2444,7 @@ exit_error:
}
/* Read current package immediately and remote pkg (if any) from cache. */
-static void __read_task_event(struct perf_event *event)
+static int __read_task_event(struct perf_event *event)
{
int i, ret;
u64 count = 0;
@@ -2459,26 +2459,31 @@ static void __read_task_event(struct perf_event *event)
ret = pmonr__get_read_rmid(pmonr, &rmid, true);
if (ret)
- return;
+ return ret;
if (rmid == INVALID_RMID)
continue;
prmid = __prmid_from_rmid(i, rmid);
if (WARN_ON_ONCE(!prmid))
- return;
+ return -1;
/* update and read local for this cpu's package. */
- if (i == pkg_id)
- cqm_prmid_update(prmid);
+ if (i == pkg_id) {
+ ret = cqm_prmid_update(prmid);
+ if (ret < 0)
+ return ret;
+ }
count += atomic64_read(&prmid->last_read_value);
}
local64_set(&event->count, count);
+ return 0;
}
/* Read current package immediately and remote pkg (if any) from cache. */
-static void intel_cqm_event_read(struct perf_event *event)
+static int intel_cqm_event_read(struct perf_event *event)
{
struct monr *monr;
u64 count;
+ int ret;
u16 pkg_id = topology_physical_package_id(smp_processor_id());
monr = monr_from_event(event);
@@ -2491,23 +2496,24 @@ static void intel_cqm_event_read(struct perf_event *event)
*/
if (event->parent) {
local64_set(&event->count, 0);
- return;
+ return 0;
}
if (event->attach_state & PERF_ATTACH_TASK) {
- __read_task_event(event);
- return;
+ return __read_task_event(event);
}
/* It's either a cgroup or a cpu event. */
if (WARN_ON_ONCE(event->cpu < 0))
- return;
+ return -1;
/* XXX: expose fail_on_inh_descendant as a configuration parameter? */
- pmonr__read_subtree(monr, pkg_id, &count, false);
+ ret = pmonr__read_subtree(monr, pkg_id, &count, false);
+ if (ret < 0)
+ return ret;
local64_set(&event->count, count);
- return;
+ return 0;
}
static inline bool cqm_group_leader(struct perf_event *event)
diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 9ba4e41..390745c1 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -322,7 +322,7 @@ static inline u64 cstate_pmu_read_counter(struct perf_event *event)
return val;
}
-static void cstate_pmu_event_update(struct perf_event *event)
+static int cstate_pmu_event_update(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
u64 prev_raw_count, new_raw_count;
@@ -336,6 +336,7 @@ again:
goto again;
local64_add(new_raw_count - prev_raw_count, &event->count);
+ return 0;
}
static void cstate_pmu_event_start(struct perf_event *event, int mode)
diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
index 05ef87d..477eb4f 100644
--- a/arch/x86/events/intel/pt.c
+++ b/arch/x86/events/intel/pt.c
@@ -1120,8 +1120,9 @@ fail:
return ret;
}
-static void pt_event_read(struct perf_event *event)
+static int pt_event_read(struct perf_event *event)
{
+ return 0;
}
static void pt_event_destroy(struct perf_event *event)
diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 99c4bab..b01abd1 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -408,9 +408,10 @@ static int rapl_pmu_event_init(struct perf_event *event)
return ret;
}
-static void rapl_pmu_event_read(struct perf_event *event)
+static int rapl_pmu_event_read(struct perf_event *event)
{
rapl_event_update(event);
+ return 0;
}
static ssize_t rapl_get_attr_cpumask(struct device *dev,
diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index 17734a6..c01bcc9 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -577,10 +577,11 @@ static void uncore_pmu_event_del(struct perf_event *event, int flags)
event->hw.last_tag = ~0ULL;
}
-void uncore_pmu_event_read(struct perf_event *event)
+int uncore_pmu_event_read(struct perf_event *event)
{
struct intel_uncore_box *box = uncore_event_to_box(event);
uncore_perf_event_update(box, event);
+ return 0;
}
/*
diff --git a/arch/x86/events/intel/uncore.h b/arch/x86/events/intel/uncore.h
index 79766b9..1c5db22 100644
--- a/arch/x86/events/intel/uncore.h
+++ b/arch/x86/events/intel/uncore.h
@@ -337,7 +337,7 @@ struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int cpu
u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event *event);
void uncore_pmu_start_hrtimer(struct intel_uncore_box *box);
void uncore_pmu_cancel_hrtimer(struct intel_uncore_box *box);
-void uncore_pmu_event_read(struct perf_event *event);
+int uncore_pmu_event_read(struct perf_event *event);
void uncore_perf_event_update(struct intel_uncore_box *box, struct perf_event *event);
struct event_constraint *
uncore_get_constraint(struct intel_uncore_box *box, struct perf_event *event);
diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index 7111400..62d33f6 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -165,7 +165,7 @@ static inline u64 msr_read_counter(struct perf_event *event)
return now;
}
-static void msr_event_update(struct perf_event *event)
+static int msr_event_update(struct perf_event *event)
{
u64 prev, now;
s64 delta;
@@ -183,6 +183,7 @@ again:
delta = sign_extend64(delta, 31);
local64_add(now - prev, &event->count);
+ return 0;
}
static void msr_event_start(struct perf_event *event, int flags)
diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index 6c98be8..a1c4ce00 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -59,7 +59,7 @@ extern int hw_breakpoint_exceptions_notify(struct notifier_block *unused,
int arch_install_hw_breakpoint(struct perf_event *bp);
void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-void hw_breakpoint_pmu_read(struct perf_event *bp);
+int hw_breakpoint_pmu_read(struct perf_event *bp);
void hw_breakpoint_pmu_unthrottle(struct perf_event *bp);
extern void
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index 2bcfb5f..60b8cab7 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -539,7 +539,8 @@ int hw_breakpoint_exceptions_notify(
return hw_breakpoint_handler(data);
}
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
{
/* TODO */
+ return 0;
}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f96e1f9..46fd299 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -39,12 +39,14 @@ static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
{
- u64 counter, enabled, running;
+ u64 counter, counter_tmp, enabled, running;
counter = pmc->counter;
- if (pmc->perf_event)
- counter += perf_event_read_value(pmc->perf_event,
- &enabled, &running);
+ if (pmc->perf_event) {
+ if (!perf_event_read_value(pmc->perf_event, &counter_tmp,
+ &enabled, &running))
+ counter += counter_tmp;
+ }
/* FIXME: Scaling needed? */
return counter & pmc_bitmask(pmc);
}
diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
index a49b283..9fa7b4e 100644
--- a/drivers/bus/arm-cci.c
+++ b/drivers/bus/arm-cci.c
@@ -1033,9 +1033,10 @@ static u64 pmu_event_update(struct perf_event *event)
return new_raw_count;
}
-static void pmu_read(struct perf_event *event)
+static int pmu_read(struct perf_event *event)
{
pmu_event_update(event);
+ return 0;
}
static void pmu_event_set_period(struct perf_event *event)
diff --git a/drivers/bus/arm-ccn.c b/drivers/bus/arm-ccn.c
index 7082c72..a2e4a9c 100644
--- a/drivers/bus/arm-ccn.c
+++ b/drivers/bus/arm-ccn.c
@@ -1123,9 +1123,10 @@ static void arm_ccn_pmu_event_del(struct perf_event *event, int flags)
arm_ccn_pmu_event_release(event);
}
-static void arm_ccn_pmu_event_read(struct perf_event *event)
+static int arm_ccn_pmu_event_read(struct perf_event *event)
{
arm_ccn_pmu_event_update(event);
+ return 0;
}
static irqreturn_t arm_ccn_pmu_overflow_handler(struct arm_ccn_dt *dt)
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 32346b5..7a78230 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -163,10 +163,11 @@ again:
return new_raw_count;
}
-static void
+static int
armpmu_read(struct perf_event *event)
{
armpmu_event_update(event);
+ return 0;
}
static void
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b0f6088..9c973bd 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -333,7 +333,7 @@ struct pmu {
* For sampling capable PMUs this will also update the software period
* hw_perf_event::period_left field.
*/
- void (*read) (struct perf_event *event);
+ int (*read) (struct perf_event *event);
/*
* Group events scheduling is treated as a transaction, add
@@ -786,8 +786,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
extern u64 perf_event_read_local(struct perf_event *event);
-extern u64 perf_event_read_value(struct perf_event *event,
- u64 *enabled, u64 *running);
+extern int perf_event_read_value(struct perf_event *event,
+ u64 *total, u64 *enabled, u64 *running);
struct perf_sample_data {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 804fdd1..cfffa50 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2539,7 +2539,7 @@ static void __perf_event_sync_stat(struct perf_event *event,
*/
switch (event->state) {
case PERF_EVENT_STATE_ACTIVE:
- event->pmu->read(event);
+ (void)event->pmu->read(event);
/* fall-through */
case PERF_EVENT_STATE_INACTIVE:
@@ -3291,6 +3291,7 @@ static void __perf_event_read(void *info)
return;
raw_spin_lock(&ctx->lock);
+
if (ctx->is_active) {
update_context_time(ctx);
update_cgrp_time_from_event(event);
@@ -3303,14 +3304,15 @@ static void __perf_event_read(void *info)
if (!data->group) {
- pmu->read(event);
- data->ret = 0;
+ data->ret = pmu->read(event);
goto unlock;
}
pmu->start_txn(pmu, PERF_PMU_TXN_READ);
- pmu->read(event);
+ data->ret = pmu->read(event);
+ if (data->ret)
+ goto unlock;
list_for_each_entry(sub, &event->sibling_list, group_entry) {
update_event_times(sub);
@@ -3320,7 +3322,9 @@ static void __perf_event_read(void *info)
* Use sibling's PMU rather than @event's since
* sibling could be on different (eg: software) PMU.
*/
- sub->pmu->read(sub);
+ data->ret = sub->pmu->read(sub);
+ if (data->ret)
+ goto unlock;
}
}
@@ -3341,6 +3345,7 @@ static inline u64 perf_event_count(struct perf_event *event)
* - either for the current task, or for this CPU
* - does not have inherit set, for inherited task events
* will not be local and we cannot read them atomically
+ * - pmu::read cannot fail
*/
u64 perf_event_read_local(struct perf_event *event)
{
@@ -3373,7 +3378,7 @@ u64 perf_event_read_local(struct perf_event *event)
* oncpu == -1).
*/
if (event->oncpu == smp_processor_id())
- event->pmu->read(event);
+ (void) event->pmu->read(event);
val = local64_read(&event->count);
local_irq_restore(flags);
@@ -3410,8 +3415,12 @@ static int perf_event_read(struct perf_event *event, bool group)
PERF_INACTIVE_EV_READ_ANY_CPU) ?
smp_processor_id() : event->cpu;
}
- smp_call_function_single(
- cpu_to_read, __perf_event_read, &data, 1);
+ ret = smp_call_function_single(cpu_to_read,
+ __perf_event_read, &data, 1);
+ if (ret) {
+ WARN_ON_ONCE(ret);
+ return ret;
+ }
ret = data.ret;
} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
struct perf_event_context *ctx = event->ctx;
@@ -3433,7 +3442,6 @@ static int perf_event_read(struct perf_event *event, bool group)
update_event_times(event);
raw_spin_unlock_irqrestore(&ctx->lock, flags);
}
-
return ret;
}
@@ -4035,18 +4043,22 @@ static int perf_release(struct inode *inode, struct file *file)
return 0;
}
-u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
+int perf_event_read_value(struct perf_event *event,
+ u64 *total, u64 *enabled, u64 *running)
{
struct perf_event *child;
- u64 total = 0;
+ int ret;
+ *total = 0;
*enabled = 0;
*running = 0;
mutex_lock(&event->child_mutex);
- (void)perf_event_read(event, false);
- total += perf_event_count(event);
+ ret = perf_event_read(event, false);
+ if (ret)
+ goto exit;
+ *total += perf_event_count(event);
*enabled += event->total_time_enabled +
atomic64_read(&event->child_total_time_enabled);
@@ -4054,14 +4066,17 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
atomic64_read(&event->child_total_time_running);
list_for_each_entry(child, &event->child_list, child_list) {
- (void)perf_event_read(child, false);
- total += perf_event_count(child);
+ ret = perf_event_read(child, false);
+ if (ret)
+ goto exit;
+ *total += perf_event_count(child);
*enabled += child->total_time_enabled;
*running += child->total_time_running;
}
+exit:
mutex_unlock(&event->child_mutex);
- return total;
+ return ret;
}
EXPORT_SYMBOL_GPL(perf_event_read_value);
@@ -4158,9 +4173,11 @@ static int perf_read_one(struct perf_event *event,
{
u64 enabled, running;
u64 values[4];
- int n = 0;
+ int n = 0, ret;
- values[n++] = perf_event_read_value(event, &enabled, &running);
+ ret = perf_event_read_value(event, &values[n++], &enabled, &running);
+ if (ret)
+ return ret;
if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
values[n++] = enabled;
if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
@@ -5427,7 +5444,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
values[n++] = running;
if (leader != event)
- leader->pmu->read(leader);
+ (void)leader->pmu->read(leader);
values[n++] = perf_event_count(leader);
if (read_format & PERF_FORMAT_ID)
@@ -5440,7 +5457,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
if ((sub != event) &&
(sub->state == PERF_EVENT_STATE_ACTIVE))
- sub->pmu->read(sub);
+ (void)sub->pmu->read(sub);
values[n++] = perf_event_count(sub);
if (read_format & PERF_FORMAT_ID)
@@ -6980,8 +6997,9 @@ fail:
preempt_enable_notrace();
}
-static void perf_swevent_read(struct perf_event *event)
+static int perf_swevent_read(struct perf_event *event)
{
+ return 0;
}
static int perf_swevent_add(struct perf_event *event, int flags)
@@ -7421,7 +7439,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
if (event->state != PERF_EVENT_STATE_ACTIVE)
return HRTIMER_NORESTART;
- event->pmu->read(event);
+ (void)event->pmu->read(event);
perf_sample_data_init(&data, 0, event->hw.last_period);
regs = get_irq_regs();
@@ -7536,9 +7554,10 @@ static void cpu_clock_event_del(struct perf_event *event, int flags)
cpu_clock_event_stop(event, flags);
}
-static void cpu_clock_event_read(struct perf_event *event)
+static int cpu_clock_event_read(struct perf_event *event)
{
cpu_clock_event_update(event);
+ return 0;
}
static int cpu_clock_event_init(struct perf_event *event)
@@ -7613,13 +7632,14 @@ static void task_clock_event_del(struct perf_event *event, int flags)
task_clock_event_stop(event, PERF_EF_UPDATE);
}
-static void task_clock_event_read(struct perf_event *event)
+static int task_clock_event_read(struct perf_event *event)
{
u64 now = perf_clock();
u64 delta = now - event->ctx->timestamp;
u64 time = event->ctx->time + delta;
task_clock_event_update(event, time);
+ return 0;
}
static int task_clock_event_init(struct perf_event *event)
--
2.8.0.rc3.226.g39d4020
perf_event context switches events to newly exec'ed tasks using
perf_event_exec. Add an architecture hook to that path.
On x86, perf_event_arch_exec is used to synchronize the software
cache of the PQR_ASSOC MSR, setting the right RMID for the new task.
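For reference, a consolidated sketch of the hook wiring introduced by the hunks below
(pqr_update is defined elsewhere in this series):
	/* arch/x86/include/asm/perf_event.h */
	#define perf_event_arch_exec pqr_update
	/* include/linux/perf_event.h: default to a no-op for other architectures */
	#ifndef perf_event_arch_exec
	#define perf_event_arch_exec() do { } while (0)
	#endif
	/* kernel/events/core.c: invoked at the end of perf_event_exec() */
	perf_event_arch_exec();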
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/include/asm/perf_event.h | 2 ++
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 1 +
3 files changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 99fc206..c13f501 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -332,6 +332,8 @@ extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
.dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
.legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
+#define perf_event_arch_exec pqr_update
+
#else
#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9c973bd..99b4393 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1289,4 +1289,9 @@ static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
#endif
+#ifndef perf_event_arch_exec
+#define perf_event_arch_exec() do { } while (0)
+#endif
+
+
#endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cfffa50..5c675b4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3248,6 +3248,7 @@ void perf_event_exec(void)
for_each_task_context_nr(ctxn)
perf_event_enable_on_exec(ctxn);
rcu_read_unlock();
+ perf_event_arch_exec();
}
struct perf_read_data {
--
2.8.0.rc3.226.g39d4020
Allow architectures to define additional attributes for the perf cgroup.
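As a sketch of the opt-in pattern (my_arch_cftypes is a placeholder name), an architecture
would define the macro to a designated-initializer fragment that is expanded inside
perf_event_cgrp_subsys:
	#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS \
		.dfl_cftypes = my_arch_cftypes, \
		.legacy_cftypes = my_arch_cftypes,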
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
include/linux/perf_event.h | 4 ++++
kernel/events/core.c | 2 ++
2 files changed, 6 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 054d7f4..b0f6088 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1285,4 +1285,8 @@ static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
# define perf_cgroup_arch_css_free(css) do { } while (0)
#endif
+#ifndef PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#endif
+
#endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28d1b51..804fdd1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9903,5 +9903,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
.css_released = perf_cgroup_css_released,
.css_free = perf_cgroup_css_free,
.attach = perf_cgroup_attach,
+ /* Expand architecture specific attributes. */
+ PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
};
#endif /* CONFIG_CGROUP_PERF */
--
2.8.0.rc3.226.g39d4020
Expose the boolean attribute intel_cqm.cont_monitoring. When set, the
associated cgroup will be monitored even if no perf cgroup event is
attached to it.
The occupancy of a cgroup must still be read using a perf_event, regardless
of the value of intel_cqm.cont_monitoring.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 81 +++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 6 +++
2 files changed, 87 insertions(+)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 4ece0a4..33691c1 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -3194,4 +3194,85 @@ no_rmid:
#endif
}
+#ifdef CONFIG_CGROUP_PERF
+
+/* kernfs guarantees that css doesn't need to be pinned. */
+static u64 cqm_cont_monitoring_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ int ret = -1;
+ struct perf_cgroup *perf_cgrp = css_to_perf_cgroup(css);
+ struct monr *monr;
+
+ mutex_lock(&cqm_init_mutex);
+ if (!static_branch_likely(&cqm_initialized_key))
+ goto out;
+
+ mutex_lock(&cqm_mutex);
+
+ ret = css_to_cqm_info(css)->cont_monitoring;
+ monr = monr_from_perf_cgroup(perf_cgrp);
+ WARN_ON(!monr->mon_event_group &&
+ (ret != perf_cgroup_is_monitored(perf_cgrp)));
+
+ mutex_unlock(&cqm_mutex);
+out:
+ mutex_unlock(&cqm_init_mutex);
+ return ret;
+}
+
+/* kernfs guarantees that css doesn't need to be pinned. */
+static int cqm_cont_monitoring_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 value)
+{
+ int ret = 0;
+ struct perf_cgroup *perf_cgrp = css_to_perf_cgroup(css);
+ struct monr *monr;
+
+ if (value > 1)
+ return -1;
+
+ mutex_lock(&cqm_init_mutex);
+ if (!static_branch_likely(&cqm_initialized_key)) {
+ ret = -1;
+ goto out;
+ }
+
+ /* Root cgroup cannot stop being monitored. */
+ if (css == get_root_perf_css())
+ goto out;
+
+ mutex_lock(&cqm_mutex);
+
+ monr = monr_from_perf_cgroup(perf_cgrp);
+
+ if (value && !perf_cgroup_is_monitored(perf_cgrp))
+ ret = __css_start_monitoring(css);
+ else if (!value &&
+ !monr->mon_event_group && perf_cgroup_is_monitored(perf_cgrp))
+ ret = __css_stop_monitoring(css);
+
+ WARN_ON(!monr->mon_event_group &&
+ (value != perf_cgroup_is_monitored(perf_cgrp)));
+
+ css_to_cqm_info(css)->cont_monitoring = value;
+
+ mutex_unlock(&cqm_mutex);
+out:
+ mutex_unlock(&cqm_init_mutex);
+ return ret;
+}
+
+struct cftype perf_event_cgrp_arch_subsys_cftypes[] = {
+ {
+ .name = "cqm_cont_monitoring",
+ .read_u64 = cqm_cont_monitoring_read_u64,
+ .write_u64 = cqm_cont_monitoring_write_u64,
+ },
+
+ {} /* terminate */
+};
+
+#endif
+
device_initcall(intel_cqm_init);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index c22d9e0..99fc206 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -326,6 +326,12 @@ inline void perf_cgroup_arch_css_released(struct cgroup_subsys_state *css);
perf_cgroup_arch_css_free
inline void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
+extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS \
+ .dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
+ .legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
+
#else
#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
--
2.8.0.rc3.226.g39d4020
Use the newly added pmu_event_flags to:
- Allow thread events to be read from any CPU even if not in ACTIVE
state. Since inter-package values are polled, a thread's occupancy is
always:
local occupancy (read from hw) + remote occupancy (polled values)
(see the sketch below).
- Allow cpu/cgroup events to be read from any CPU in the package where
they run. This potentially saves IPIs when the read function runs in the
same package but on a different CPU than the event.
Since reading always returns a fresh value and inherit_stats is not
supported (because all child events share the same RMID), there is no
need to read during sched_out of an event.
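For a thread event, the read then amounts to the following (a simplified sketch of
__read_task_event from an earlier hunk in this document; for_each_pkg() and the
per-package prmid lookup are elided placeholders):
	u64 count = 0;
	for_each_pkg(i, prmid) {
		if (i == local_pkg_id)
			cqm_prmid_update(prmid);	/* fresh hw read, local package */
		count += atomic64_read(&prmid->last_read_value); /* cached value for remote pkgs */
	}
	local64_set(&event->count, count);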
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index c14f1c7..daf9fdf 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2702,6 +2702,16 @@ static int intel_cqm_event_init(struct perf_event *event)
*/
event->pmu_event_flags |= PERF_CGROUP_NO_RECURSION;
+ /* Events in CQM PMU are per-package and can be read even when
+ * the cpu is not running the event.
+ */
+ if (event->cpu < 0) {
+ WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK));
+ event->pmu_event_flags |= PERF_INACTIVE_EV_READ_ANY_CPU;
+ } else {
+ event->pmu_event_flags |= PERF_INACTIVE_CPU_READ_PKG;
+ }
+
mutex_lock(&cqm_mutex);
--
2.8.0.rc3.226.g39d4020
To avoid IPIs from IRQ-disabled contexts, the occupancy for an RMID in a
remote package (a package other than the one the current cpu belongs to) is
obtained from a cache that is periodically updated.
This removes the need for an IPI when reading occupancy for a task event,
which was the reason to add the problematic pmu::count and dummy
perf_event_read() in the previous CQM version.
The occupancy of all active prmids is updated every
__rmid_timed_update_period ms.
To avoid holding raw_spin_locks on the prmid hierarchy for too long, the
raw rmids to be read are copied to a temporary array list. The array list
is then consumed to perform the wrmsrl and rdmsrl on each RMID required to
read its llc_occupancy.
This decoupling of traversing the RMID hierarchy from reading occupancy is
especially useful due to the high latency of the wrmsrl and rdmsrl for the
llc_occupancy event (thousands of cycles on my test machine).
To avoid unnecessary memory allocations, the objects used to temporarily
store RMIDs are pooled in a per-package list and allocated on demand.
The infrastructure introduced in this patch will be used in later patches
in this series to perform reads on subtrees of a prmid hierarchy.
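The two phases described above can be sketched as follows (snapshot_rmid() and
for_each_snapshotted_rmid() are illustrative placeholders for the array-list code
added below):
	raw_spin_lock_irqsave(&pkg_data->pkg_data_lock, flags);
	list_for_each_entry(pmonr, &pkg_data->astate_pmonrs_lru, rotation_entry)
		snapshot_rmid(pmonr->prmid->rmid);	/* phase 1: copy raw RMIDs */
	raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
	for_each_snapshotted_rmid(rmid)		/* phase 2: slow MSR access, locks dropped */
		cqm_prmid_update(__prmid_from_rmid(pkg_id, rmid));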
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 251 +++++++++++++++++++++++++++++++++++++++++++-
arch/x86/events/intel/cqm.h | 36 +++++++
2 files changed, 286 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 31f0fd6..904f2d3 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -189,6 +189,8 @@ static inline bool __valid_pkg_id(u16 pkg_id)
return pkg_id < PQR_MAX_NR_PKGS;
}
+static int anode_pool__alloc_one(u16 pkg_id);
+
/* Init cqm pkg_data for @cpu 's package. */
static int pkg_data_init_cpu(int cpu)
{
@@ -241,11 +243,19 @@ static int pkg_data_init_cpu(int cpu)
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
+ INIT_LIST_HEAD(&pkg_data->anode_pool_head);
+ raw_spin_lock_init(&pkg_data->anode_pool_lock);
+
INIT_DELAYED_WORK(
&pkg_data->rotation_work, intel_cqm_rmid_rotation_work);
/* XXX: Chose randomly*/
pkg_data->rotation_cpu = cpu;
+ INIT_DELAYED_WORK(
+ &pkg_data->timed_update_work, intel_cqm_timed_update_work);
+ /* XXX: Chose randomly*/
+ pkg_data->timed_update_cpu = cpu;
+
cqm_pkgs_data[pkg_id] = pkg_data;
return 0;
}
@@ -744,6 +754,189 @@ static void monr_dealloc(struct monr *monr)
}
/*
+ * Logic for reading sets of rmids into per-package lists.
+ * This package lists can be used to update occupancies without
+ * holding locks in the hierarchies of pmonrs.
+ * @pool: free pool.
+ */
+struct astack {
+ struct list_head pool;
+ struct list_head items;
+ int top_idx;
+ int max_idx;
+ u16 pkg_id;
+};
+
+static void astack__init(struct astack *astack, int max_idx, u16 pkg_id)
+{
+ INIT_LIST_HEAD(&astack->items);
+ INIT_LIST_HEAD(&astack->pool);
+ astack->top_idx = -1;
+ astack->max_idx = max_idx;
+ astack->pkg_id = pkg_id;
+}
+
+/* Try to enlarge astack->pool with an anode from this package's pool. */
+static int astack__try_add_pool(struct astack *astack)
+{
+ unsigned long flags;
+ int ret = -1;
+ struct pkg_data *pkg_data = cqm_pkgs_data[astack->pkg_id];
+
+ raw_spin_lock_irqsave(&pkg_data->anode_pool_lock, flags);
+
+ if (!list_empty(&pkg_data->anode_pool_head)) {
+ list_move_tail(pkg_data->anode_pool_head.prev, &astack->pool);
+ ret = 0;
+ }
+
+ raw_spin_unlock_irqrestore(&pkg_data->anode_pool_lock, flags);
+ return ret;
+}
+
+static int astack__push(struct astack *astack)
+{
+ if (!list_empty(&astack->items) && astack->top_idx < astack->max_idx) {
+ astack->top_idx++;
+ return 0;
+ }
+
+ if (list_empty(&astack->pool) && astack__try_add_pool(astack))
+ return -1;
+ list_move_tail(astack->pool.prev, &astack->items);
+ astack->top_idx = 0;
+ return 0;
+}
+
+/* Must be non-empty */
+# define __astack__top(astack_, member_) \
+ list_last_entry(&(astack_)->items, \
+ struct anode, entry)->member_[(astack_)->top_idx]
+
+static void astack__clear(struct astack *astack)
+{
+ list_splice_tail_init(&astack->items, &astack->pool);
+ astack->top_idx = -1;
+}
+
+/* Put back into pkg_data's pool. */
+static void astack__release(struct astack *astack)
+{
+ unsigned long flags;
+ struct pkg_data *pkg_data = cqm_pkgs_data[astack->pkg_id];
+
+ astack__clear(astack);
+ raw_spin_lock_irqsave(&pkg_data->anode_pool_lock, flags);
+ list_splice_tail_init(&astack->pool, &pkg_data->anode_pool_head);
+ raw_spin_unlock_irqrestore(&pkg_data->anode_pool_lock, flags);
+}
+
+static int anode_pool__alloc_one(u16 pkg_id)
+{
+ unsigned long flags;
+ struct anode *anode;
+ struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+
+ anode = kmalloc_node(sizeof(struct anode), GFP_KERNEL,
+ cpu_to_node(pkg_data->rotation_cpu));
+ if (!anode)
+ return -ENOMEM;
+ raw_spin_lock_irqsave(&pkg_data->anode_pool_lock, flags);
+ list_add_tail(&anode->entry, &pkg_data->anode_pool_head);
+ raw_spin_unlock_irqrestore(&pkg_data->anode_pool_lock, flags);
+ return 0;
+}
+
+static int astack__end(struct astack *astack, struct anode *anode, int idx)
+{
+ return list_is_last(&anode->entry, &astack->items) &&
+ idx > astack->top_idx;
+}
+
+static int __rmid_fn__cqm_prmid_update(struct prmid *prmid, u64 *val)
+{
+ int ret = cqm_prmid_update(prmid);
+
+ if (ret >= 0)
+ *val = atomic64_read(&prmid->last_read_value);
+ return ret;
+}
+
+/* Apply function to all elements in all nodes.
+ * On error returns first error in read, zero otherwise.
+ */
+static int astack__rmids_sum_apply(
+ struct astack *astack,
+ u16 pkg_id, int (*fn)(struct prmid *, u64 *), u64 *total)
+{
+ struct prmid *prmid;
+ struct anode *anode;
+ u32 rmid;
+ int i, ret, first_error = 0;
+ u64 count;
+ *total = 0;
+
+ list_for_each_entry(anode, &astack->items, entry) {
+ for (i = 0; i <= astack->max_idx; i++) {
+ /* node in tail only has astack->top_idx elements. */
+ if (astack__end(astack, anode, i))
+ break;
+ rmid = anode->rmids[i];
+ prmid = cqm_pkgs_data[pkg_id]->prmids_by_rmid[rmid];
+ WARN_ON_ONCE(!prmid);
+ ret = fn(prmid, &count);
+ if (ret < 0) {
+ if (!first_error)
+ first_error = ret;
+ continue;
+ }
+ *total += count;
+ }
+ }
+ return first_error;
+}
+
+/* Does not need mutex since protected by locks when traversing
+ * astate_pmonrs_lru and updating atomic prmids.
+ */
+static int update_rmids_in_astate_pmonrs_lru(u16 pkg_id)
+{
+ struct astack astack;
+ struct pkg_data *pkg_data;
+ struct pmonr *pmonr;
+ int ret = 0;
+ unsigned long flags;
+ u64 count;
+
+ astack__init(&astack, NR_RMIDS_PER_NODE - 1, pkg_id);
+ pkg_data = cqm_pkgs_data[pkg_id];
+
+retry:
+ if (ret) {
+ anode_pool__alloc_one(pkg_id);
+ ret = 0;
+ }
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+ list_for_each_entry(pmonr,
+ &pkg_data->astate_pmonrs_lru, rotation_entry) {
+ ret = astack__push(&astack);
+ if (ret)
+ break;
+ __astack__top(&astack, rmids) = pmonr->prmid->rmid;
+ }
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ if (ret) {
+ astack__clear(&astack);
+ goto retry;
+ }
+ /* count is not used. */
+ ret = astack__rmids_sum_apply(&astack, pkg_id,
+ &__rmid_fn__cqm_prmid_update, &count);
+ astack__release(&astack);
+ return ret;
+}
+
+/*
* Wrappers for monr manipulation in events.
*
*/
@@ -1532,6 +1725,17 @@ exit:
mutex_unlock(&pkg_data->pkg_data_mutex);
}
+static void
+__intel_cqm_timed_update(u16 pkg_id)
+{
+ int ret;
+
+ mutex_lock_nested(&cqm_pkgs_data[pkg_id]->pkg_data_mutex, pkg_id);
+ ret = update_rmids_in_astate_pmonrs_lru(pkg_id);
+ mutex_unlock(&cqm_pkgs_data[pkg_id]->pkg_data_mutex);
+ WARN_ON_ONCE(ret);
+}
+
static struct pmu intel_cqm_pmu;
/* Rotation only needs to be run when there is any pmonr in (I)state. */
@@ -1554,6 +1758,22 @@ static bool intel_cqm_need_rotation(u16 pkg_id)
return need_rot;
}
+static bool intel_cqm_need_timed_update(u16 pkg_id)
+{
+
+ struct pkg_data *pkg_data;
+ bool need_update;
+
+ pkg_data = cqm_pkgs_data[pkg_id];
+
+ mutex_lock_nested(&pkg_data->pkg_data_mutex, pkg_id);
+	/* Update is needed if there is any active prmid. */
+ need_update = !list_empty(&pkg_data->active_prmids_pool);
+ mutex_unlock(&pkg_data->pkg_data_mutex);
+
+ return need_update;
+}
+
/*
* Schedule rotation in one package.
*/
@@ -1568,6 +1788,19 @@ static void __intel_cqm_schedule_rotation_for_pkg(u16 pkg_id)
pkg_data->rotation_cpu, &pkg_data->rotation_work, delay);
}
+static void __intel_cqm_schedule_timed_update_for_pkg(u16 pkg_id)
+{
+ struct pkg_data *pkg_data;
+ unsigned long delay;
+
+ delay = msecs_to_jiffies(__rmid_timed_update_period);
+ pkg_data = cqm_pkgs_data[pkg_id];
+ schedule_delayed_work_on(
+ pkg_data->timed_update_cpu,
+ &pkg_data->timed_update_work, delay);
+}
+
+
/*
* Schedule rotation and rmid's timed update in all packages.
* Reescheduling will stop when no longer needed.
@@ -1576,8 +1809,10 @@ static void intel_cqm_schedule_work_all_pkgs(void)
{
int pkg_id;
- cqm_pkg_id_for_each_online(pkg_id)
+ cqm_pkg_id_for_each_online(pkg_id) {
__intel_cqm_schedule_rotation_for_pkg(pkg_id);
+ __intel_cqm_schedule_timed_update_for_pkg(pkg_id);
+ }
}
static void intel_cqm_rmid_rotation_work(struct work_struct *work)
@@ -1598,6 +1833,20 @@ static void intel_cqm_rmid_rotation_work(struct work_struct *work)
__intel_cqm_schedule_rotation_for_pkg(pkg_id);
}
+static void intel_cqm_timed_update_work(struct work_struct *work)
+{
+ struct pkg_data *pkg_data = container_of(
+ to_delayed_work(work), struct pkg_data, timed_update_work);
+ u16 pkg_id = topology_physical_package_id(pkg_data->timed_update_cpu);
+
+ WARN_ON_ONCE(pkg_data != cqm_pkgs_data[pkg_id]);
+
+ __intel_cqm_timed_update(pkg_id);
+
+ if (intel_cqm_need_timed_update(pkg_id))
+ __intel_cqm_schedule_timed_update_for_pkg(pkg_id);
+}
+
/*
* Find a group and setup RMID.
*
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index b0e1698..25646a2 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -45,6 +45,10 @@ static unsigned int __rmid_min_update_time = RMID_DEFAULT_MIN_UPDATE_TIME;
static inline int cqm_prmid_update(struct prmid *prmid);
+#define RMID_DEFAULT_TIMED_UPDATE_PERIOD 100 /* ms */
+static unsigned int __rmid_timed_update_period =
+ RMID_DEFAULT_TIMED_UPDATE_PERIOD;
+
/*
* union prmid_summary: Machine-size summary of a pmonr's prmid state.
* @value: One word accesor.
@@ -211,6 +215,21 @@ struct pmonr {
atomic64_t prmid_summary_atomic;
};
+/* Store all RMIDs that can fit in an anode while keeping sizeof(struct anode)
+ * within one cache line (for performance).
+ */
+#define NR_TYPE_PER_NODE(__type) ((SMP_CACHE_BYTES - (int)sizeof(struct list_head)) / \
+ (int)sizeof(__type))
+
+#define NR_RMIDS_PER_NODE NR_TYPE_PER_NODE(u32)
+
+/* struct anode: Node of an array list used to temporarily store RMIDs. */
+struct anode {
+ /* Last valid RMID is RMID_INVALID */
+ u32 rmids[NR_RMIDS_PER_NODE];
+ struct list_head entry;
+};
+
/*
* struct pkg_data: Per-package CQM data.
* @max_rmid: Max rmid valid for cpus in this package.
@@ -239,6 +258,14 @@ struct pmonr {
* @rotation_cpu: CPU to run @rotation_work on, it must be in the
* package associated to this instance of pkg_data.
* @rotation_work: Task that performs rotation of prmids.
+ * @timed_update_work: Task that performs periodic updates of values
+ * for active rmids. These values are used when
+ * inter-package event read is not available due to
+ * irqs disabled contexts.
+ * @timed_update_cpu: CPU to run @timed_update_work on, it must be a
+ * cpu in this package.
+ * @anode_pool_head: Pool of unused anodes.
+ * @anode_pool_lock: Protect @anode_pool_head.
*/
struct pkg_data {
u32 max_rmid;
@@ -268,6 +295,13 @@ struct pkg_data {
struct delayed_work rotation_work;
int rotation_cpu;
+
+ struct delayed_work timed_update_work;
+ int timed_update_cpu;
+
+	/* Pool of unused anodes and its lock */
+ struct list_head anode_pool_head;
+ raw_spinlock_t anode_pool_lock;
};
/*
@@ -438,6 +472,8 @@ static inline int monr_hrchy_count_held_raw_spin_locks(void)
*/
static void intel_cqm_rmid_rotation_work(struct work_struct *work);
+static void intel_cqm_timed_update_work(struct work_struct *work);
+
/*
* Service Level Objectives (SLO) for the rotation logic.
*
--
2.8.0.rc3.226.g39d4020
Some offcore and uncore events, such as the new intel_cqm/llc_occupancy,
can be read even if the event is not active on its CPU (or on any CPU).
In those cases, a freshly read value is more recent (and therefore
preferable) than the last value stored at event sched out.
This patch covers two cases that allow Intel's CQM (and
potentially other per-package events) to obtain updated values regardless
of whether the event is scheduled on a particular CPU. Each case is covered
by a new event::pmu_event_flags flag:
1) PERF_INACTIVE_CPU_READ_PKG: An event attached to a CPU that can
be read from any CPU in event::cpu's package, even if inactive.
2) PERF_INACTIVE_EV_READ_ANY_CPU: An event that can be read from any
CPU in any package in the system, even if inactive.
A consequence of reading a fresh value from hw on each call to
perf_event_read() is that reading and saving the event value at sched out
can be avoided, since that value will never be used. Therefore, a PMU
that sets any of the PERF_INACTIVE_*_READ_* flags can choose not to read
on context switch, at the cost of inherit_stats not working properly.
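For illustration, a PMU supporting both cases would tag its events at init time along
these lines (a sketch mirroring what the CQM patch in this series does, not generic-code
policy):
	if (event->attach_state & PERF_ATTACH_TASK)
		/* task event: remote packages are polled, readable anywhere */
		event->pmu_event_flags |= PERF_INACTIVE_EV_READ_ANY_CPU;
	else
		/* CPU/cgroup event: readable anywhere in event->cpu's package */
		event->pmu_event_flags |= PERF_INACTIVE_CPU_READ_PKG;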
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
include/linux/perf_event.h | 15 ++++++++++++
kernel/events/core.c | 59 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 60 insertions(+), 14 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e4c58b0..054d7f4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,21 @@ struct perf_event {
/* Do not enable cgroup events in descendant cgroups. */
#define PERF_CGROUP_NO_RECURSION (1 << 0)
+/* CPU Event can read from event::cpu's package even if not in
+ * PERF_EVENT_STATE_ACTIVE, event::cpu must be a valid CPU.
+ */
+#define PERF_INACTIVE_CPU_READ_PKG (1 << 1)
+
+/* Event can read from any package even if not in PERF_EVENT_STATE_ACTIVE. */
+#define PERF_INACTIVE_EV_READ_ANY_CPU (1 << 2)
+
+static inline bool __perf_can_read_inactive(struct perf_event *event)
+{
+ return (event->pmu_event_flags & PERF_INACTIVE_EV_READ_ANY_CPU) ||
+ ((event->pmu_event_flags & PERF_INACTIVE_CPU_READ_PKG) &&
+ (event->cpu != -1));
+}
+
/**
* struct perf_event_context - event context structure
*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 33961ec..28d1b51 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3266,15 +3266,28 @@ static void __perf_event_read(void *info)
struct perf_event_context *ctx = event->ctx;
struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct pmu *pmu = event->pmu;
+ bool read_inactive = __perf_can_read_inactive(event);
+
+ WARN_ON_ONCE(event->cpu == -1 &&
+ (event->pmu_event_flags & PERF_INACTIVE_CPU_READ_PKG));
+
+	/* If inactive, we should be reading in the appropriate package. */
+ WARN_ON_ONCE(
+ event->state != PERF_EVENT_STATE_ACTIVE &&
+ (event->pmu_event_flags & PERF_INACTIVE_CPU_READ_PKG) &&
+ (topology_physical_package_id(event->cpu) !=
+ topology_physical_package_id(smp_processor_id())));
/*
* If this is a task context, we need to check whether it is
- * the current task context of this cpu. If not it has been
+ * the current task context of this cpu or if the event
+	 * can be read while inactive. If it cannot be read while inactive
+	 * and is not on the current cpu, then the event has been
* scheduled out before the smp call arrived. In that case
* event->count would have been updated to a recent sample
* when the event was scheduled out.
*/
- if (ctx->task && cpuctx->task_ctx != ctx)
+ if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
return;
raw_spin_lock(&ctx->lock);
@@ -3284,9 +3297,11 @@ static void __perf_event_read(void *info)
}
update_event_times(event);
- if (event->state != PERF_EVENT_STATE_ACTIVE)
+
+ if (event->state != PERF_EVENT_STATE_ACTIVE && !read_inactive)
goto unlock;
+
if (!data->group) {
pmu->read(event);
data->ret = 0;
@@ -3299,7 +3314,8 @@ static void __perf_event_read(void *info)
list_for_each_entry(sub, &event->sibling_list, group_entry) {
update_event_times(sub);
- if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+ if (sub->state == PERF_EVENT_STATE_ACTIVE ||
+ __perf_can_read_inactive(sub)) {
/*
* Use sibling's PMU rather than @event's since
* sibling could be on different (eg: software) PMU.
@@ -3368,19 +3384,34 @@ u64 perf_event_read_local(struct perf_event *event)
static int perf_event_read(struct perf_event *event, bool group)
{
int ret = 0;
+ bool active = event->state == PERF_EVENT_STATE_ACTIVE;
/*
- * If event is enabled and currently active on a CPU, update the
- * value in the event structure:
+ * Read inactive event if PMU allows it. Otherwise, if event is
+ * enabled and currently active on a CPU, update the value in the
+ * event structure:
*/
- if (event->state == PERF_EVENT_STATE_ACTIVE) {
+
+ if (active || __perf_can_read_inactive(event)) {
struct perf_read_data data = {
.event = event,
.group = group,
.ret = 0,
};
- smp_call_function_single(event->oncpu,
- __perf_event_read, &data, 1);
+ int cpu_to_read = event->oncpu;
+
+ if (!active) {
+ cpu_to_read =
+ /* if __perf_can_read_inactive is true, it
+ * either is a CPU/cgroup event or can be
+ * read for any CPU.
+ */
+ (event->pmu_event_flags &
+ PERF_INACTIVE_EV_READ_ANY_CPU) ?
+ smp_processor_id() : event->cpu;
+ }
+ smp_call_function_single(
+ cpu_to_read, __perf_event_read, &data, 1);
ret = data.ret;
} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
struct perf_event_context *ctx = event->ctx;
@@ -8199,11 +8230,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
mutex_init(&event->mmap_mutex);
atomic_long_set(&event->refcount, 1);
- event->cpu = cpu;
- event->attr = *attr;
- event->group_leader = group_leader;
- event->pmu = NULL;
- event->oncpu = -1;
+ event->cpu = cpu;
+ event->attr = *attr;
+ event->group_leader = group_leader;
+ event->pmu = NULL;
+ event->oncpu = -1;
event->parent = parent_event;
--
2.8.0.rc3.226.g39d4020
Since inherited events are part of the same cqm cache group, they share the
RMID and therefore cannot provide the granularity required by
inherit_stats. Changing this would require creating a subtree of monrs for
each parent event and its inherited events, a potential improvement for
future patches.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index d8d3191..6e85021 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2483,6 +2483,7 @@ static int intel_cqm_event_init(struct perf_event *event)
event->attr.exclude_idle ||
event->attr.exclude_host ||
event->attr.exclude_guest ||
+ event->attr.inherit_stat || /* cqm groups share rmid */
event->attr.sample_period) /* no sampling */
return -EINVAL;
--
2.8.0.rc3.226.g39d4020
The CQM hardware is not compatible with the way the generic code handles
cgroup hierarchies (simultaneously adding the events of all ancestors
of the current cgroup). This version of Intel's CQM driver handles
cgroup hierarchy internally.
Set PERF_CGROUP_NO_RECURSION for llc_occupancy events to
signal perf's generic code not to add events for ancestors of the current
cgroup.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index dcf7f4a..d8d3191 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2489,6 +2489,14 @@ static int intel_cqm_event_init(struct perf_event *event)
INIT_LIST_HEAD(&event->hw.cqm_event_groups_entry);
INIT_LIST_HEAD(&event->hw.cqm_event_group_entry);
+ /*
+	 * The CQM driver handles cgroup recursion itself; since only one
+	 * RMID can be programmed at a time on each core, the hardware is
+	 * incompatible with the way the generic code handles
+	 * cgroup hierarchies.
+ */
+ event->pmu_event_flags |= PERF_CGROUP_NO_RECURSION;
+
mutex_lock(&cqm_mutex);
--
2.8.0.rc3.226.g39d4020
Some events, such as Intel's CQM llc_occupancy, need small deviations
from the traditional behavior of the generic code in a way that depends
on the event itself (and is known by the PMU) rather than on a field of
perf_event_attr.
An example is the recursive scope for cgroups: the generic code handles
cgroup hierarchy for a cgroup C by simultaneously adding to the PMU
the events of all cgroups that are ancestors of C. This approach is
incompatible with the CQM hw, which only allows one RMID per virtual core
at a time. CQM's PMU works around this limitation by internally
maintaining the hierarchical dependency between monitored cgroups and
only requires that the generic code add the current cgroup's event to
the PMU.
The introduction of the flag PERF_CGROUP_NO_RECURSION allows the PMU to
signal the generic code to avoid using recursive cgroup scope for
llc_occupancy events, preventing an undesired overwrite of RMIDs.
The PERF_CGROUP_NO_RECURSION, introduced in this patch, is the first flag
of this type, more will be added in this patch series.
To keep things tidy, this patch introduces the flag field pmu_event_flag,
intended to contain all flags that:
- Are not user-configurable event attributes (not suitable for
perf_event_attributes).
- Are known by the PMU during initialization of struct perf_event.
- Signal something to the generic code.
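As a rough sketch of how such a flag is meant to flow through the code
(illustrative only; the function names below are made up, the real changes
are in the diff that follows), the PMU marks the event during
initialization and the generic cgroup-matching path honors the flag:

/* PMU side, e.g. in its event_init: request non-recursive cgroup scope. */
static int example_pmu_event_init(struct perf_event *event)
{
	/* ... regular attribute validation ... */
	event->pmu_event_flags |= PERF_CGROUP_NO_RECURSION;
	return 0;
}

/* Generic side: with the flag set, only an exact cgroup match counts. */
static bool example_cgroup_match(struct perf_cpu_context *cpuctx,
				 struct perf_event *event)
{
	if (event->pmu_event_flags & PERF_CGROUP_NO_RECURSION)
		return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;

	/* Otherwise fall back to the recursive descendant check. */
	return cgroup_is_descendant(cpuctx->cgrp->css.cgroup,
				    event->cgrp->css.cgroup);
}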
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
include/linux/perf_event.h | 10 ++++++++++
kernel/events/core.c | 3 +++
2 files changed, 13 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 81e29c6..e4c58b0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -594,9 +594,19 @@ struct perf_event {
#endif
struct list_head sb_list;
+
+ /* Flags to generic code set by PMU. */
+ int pmu_event_flags;
+
#endif /* CONFIG_PERF_EVENTS */
};
+/*
+ * Possible flags for pmu_event_flags.
+ */
+/* Do not enable cgroup events in descendant cgroups. */
+#define PERF_CGROUP_NO_RECURSION (1 << 0)
+
/**
* struct perf_event_context - event context structure
*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2a868a6..33961ec 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -545,6 +545,9 @@ perf_cgroup_match(struct perf_event *event)
if (!cpuctx->cgrp)
return false;
+ if (event->pmu_event_flags & PERF_CGROUP_NO_RECURSION)
+ return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
+
/*
* Cgroup scoping is recursive. An event enabled for a cgroup is
* also enabled for all its descendant cgroups. If @cpuctx's
--
2.8.0.rc3.226.g39d4020
This version of RMID rotation improves over the original one by:
1. Being per-package. No need for IPIs to test for occupancy.
2. Since the monr hierarchy removed the potential conflicts between
events, the new RMID rotation logic does not need to check and
resolve conflicts.
3. No need to maintain an unused RMID as rotation_rmid, effectively
freeing one RMID per package.
4. Guaranteeing that monitored events and cgroups with a valid RMID keep
the RMID for a user-configurable time: __cqm_min_mon_slice ms.
Previously, it was likely to receive an RMID in one execution of the
rotation logic just to have it removed in the next. That was
especially problematic in the presence of event conflicts
(ie. cgroup events and thread events in a descendant cgroup).
5. Not increasing the dirty threshold unless strictly necessary to make
progress. The previous version simultaneously stole RMIDs and increased
the dirty threshold (the maximum number of cache lines with spurious
occupancy associated with a "clean" RMID). This version only increases
the dirty threshold when doing so is the only way to make progress in
the RMID rotation (the case when too many RMIDs in limbo do not drop
occupancy despite having spent enough time in limbo).
This change reduces spurious occupancy as a source of error.
6. Not stealing RMIDs unnecessarily. Thanks to more detailed
bookkeeping, this patch guarantees that the number of RMIDs in limbo
does not exceed the number of RMIDs needed by pmonrs currently waiting
for an RMID.
7. Reusing dirty limbo RMIDs when appropriate. In this new version, a
stolen RMID remains referenced by its former pmonr owner until it is
reused by another pmonr or it is moved from limbo into the pool
of free RMIDs.
These RMIDs that are referenced and in limbo are not written into the
MSR_IA32_PQR_ASSOC msr and therefore have the chance to drop
occupancy as any other limbo RMID. If the pmonr with a limbo RMID is
to be activated, then it reuses its former RMID even if it is still
dirty. The occupancy attributed to that RMID is part of the pmonr's
occupancy, so reusing the RMID even when dirty decreases
the error of the read.
This feature decreases the negative impact that RMIDs that do not drop
occupancy have on the efficiency of the rotation logic.
From a user perspective, the behavior of the new rotation logic is
controlled by SLO-type parameters:
__cqm_min_mon_slice : Minimum time a monr is to be monitored
before becoming eligible for the rotation logic to take away any of
its RMIDs.
__cqm_max_wait_mon : Maximum time a monr can remain deactivated
before forcing the rotation logic to be more aggressive (stealing more
RMIDs per iteration).
__cqm_min_progress_rate: Minimum number of pmonrs that must be
activated per second for the rotation logic's progress to be
considered acceptable.
Since the minimum progress rate is an SLO, the magnitude of the rotation
period (hrtimer_interval_ms) does not control the speed of RMID rotation;
it only controls the frequency at which the rotation logic runs.
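The following sketch (illustrative helpers only, not part of the patch)
shows how these SLOs map onto the per-pmonr checks performed by the
rotation logic further down in this patch:

/* Eligible to lose its RMID only after a minimum monitored slice. */
static bool example_pmonr_stealable(struct pmonr *pmonr, unsigned long now)
{
	return now - pmonr->last_enter_astate >=
	       msecs_to_jiffies(__cqm_min_mon_slice);
}

/* SLO violated: waited too long in (I)state without receiving an RMID. */
static bool example_pmonr_slo_violated(struct pmonr *pmonr, unsigned long now)
{
	return now - pmonr->last_enter_istate >=
	       msecs_to_jiffies(__cqm_max_wait_mon);
}

/* Minimum activations per rotation period needed to meet the progress SLO. */
static unsigned int example_min_activated(unsigned int period_ms)
{
	return max(1u, (period_ms * __cqm_min_progress_rate) / 1000);
}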
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 727 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/events/intel/cqm.h | 59 +++-
2 files changed, 784 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index caf7152..31f0fd6 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -235,9 +235,14 @@ static int pkg_data_init_cpu(int cpu)
INIT_LIST_HEAD(&pkg_data->istate_pmonrs_lru);
INIT_LIST_HEAD(&pkg_data->ilstate_pmonrs_lru);
+ pkg_data->nr_instate_pmonrs = 0;
+ pkg_data->nr_ilstate_pmonrs = 0;
+
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
+ INIT_DELAYED_WORK(
+ &pkg_data->rotation_work, intel_cqm_rmid_rotation_work);
/* XXX: Chose randomly*/
pkg_data->rotation_cpu = cpu;
@@ -295,6 +300,10 @@ static struct pmonr *pmonr_alloc(int cpu)
pmonr->monr = NULL;
INIT_LIST_HEAD(&pmonr->rotation_entry);
+ pmonr->last_enter_istate = 0;
+ pmonr->last_enter_astate = 0;
+ pmonr->nr_enter_istate = 0;
+
pmonr->pkg_id = topology_physical_package_id(cpu);
summary.sched_rmid = INVALID_RMID;
summary.read_rmid = INVALID_RMID;
@@ -346,6 +355,8 @@ __pmonr__finish_to_astate(struct pmonr *pmonr, struct prmid *prmid)
pmonr->prmid = prmid;
+ pmonr->last_enter_astate = jiffies;
+
list_move_tail(
&prmid->pool_entry, &__pkg_data(pmonr, active_prmids_pool));
list_move_tail(
@@ -373,6 +384,8 @@ __pmonr__instate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
*/
WARN_ON_ONCE(pmonr->limbo_prmid);
+ __pkg_data(pmonr, nr_instate_pmonrs)--;
+
/* Do not depend on ancestor_pmonr anymore. Make it (A)state. */
ancestor = pmonr->ancestor_pmonr;
list_del_init(&pmonr->pmonr_deps_entry);
@@ -394,6 +407,28 @@ __pmonr__instate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
}
}
+/*
+ * Transition from (IL)state to (A)state.
+ */
+static inline void
+__pmonr__ilstate_to_astate(struct pmonr *pmonr)
+{
+ struct prmid *prmid;
+
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ WARN_ON_ONCE(!pmonr->limbo_prmid);
+
+ prmid = pmonr->limbo_prmid;
+ pmonr->limbo_prmid = NULL;
+ list_del_init(&pmonr->limbo_rotation_entry);
+
+ __pkg_data(pmonr, nr_ilstate_pmonrs)--;
+ __pkg_data(pmonr, nr_instate_pmonrs)++;
+ list_del_init(&prmid->pool_entry);
+
+ __pmonr__instate_to_astate(pmonr, prmid);
+}
+
static inline void
__pmonr__ustate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
{
@@ -485,7 +520,9 @@ __pmonr__to_ustate(struct pmonr *pmonr)
pmonr->limbo_prmid = NULL;
list_del_init(&pmonr->limbo_rotation_entry);
+ __pkg_data(pmonr, nr_ilstate_pmonrs)--;
} else {
+ __pkg_data(pmonr, nr_instate_pmonrs)--;
}
pmonr->ancestor_pmonr = NULL;
} else {
@@ -542,6 +579,9 @@ __pmonr__to_istate(struct pmonr *pmonr)
__pmonr__move_dependants(pmonr, ancestor);
list_move_tail(&pmonr->limbo_prmid->pool_entry,
&__pkg_data(pmonr, pmonr_limbo_prmids_pool));
+ __pkg_data(pmonr, nr_ilstate_pmonrs)++;
+ } else {
+ __pkg_data(pmonr, nr_instate_pmonrs)++;
}
pmonr->ancestor_pmonr = ancestor;
@@ -554,10 +594,51 @@ __pmonr__to_istate(struct pmonr *pmonr)
list_move_tail(&pmonr->limbo_rotation_entry,
&__pkg_data(pmonr, ilstate_pmonrs_lru));
+ pmonr->last_enter_istate = jiffies;
+ pmonr->nr_enter_istate++;
+
__pmonr__set_istate_summary(pmonr);
}
+static inline void
+__pmonr__ilstate_to_instate(struct pmonr *pmonr)
+{
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+ list_move_tail(&pmonr->limbo_prmid->pool_entry,
+ &__pkg_data(pmonr, free_prmids_pool));
+ pmonr->limbo_prmid = NULL;
+
+ __pkg_data(pmonr, nr_ilstate_pmonrs)--;
+ __pkg_data(pmonr, nr_instate_pmonrs)++;
+
+ list_del_init(&pmonr->limbo_rotation_entry);
+ __pmonr__set_istate_summary(pmonr);
+}
+
+/* Count all limbo prmids, including the ones still attached to pmonrs.
+ * Maximum number of prmids is fixed by hw and generally small.
+ */
+static int count_limbo_prmids(struct pkg_data *pkg_data)
+{
+ unsigned int c = 0;
+ struct prmid *prmid;
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+ list_for_each_entry(
+ prmid, &pkg_data->pmonr_limbo_prmids_pool, pool_entry) {
+ c++;
+ }
+ list_for_each_entry(
+ prmid, &pkg_data->nopmonr_limbo_prmids_pool, pool_entry) {
+ c++;
+ }
+
+ return c;
+}
+
static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
{
int r;
@@ -871,8 +952,652 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
return false;
}
+/*
+ * Try to reuse limbo prmid's for pmonrs at the front of ilstate_pmonrs_lru.
+ */
+static int __try_reuse_ilstate_pmonrs(struct pkg_data *pkg_data)
+{
+ int reused = 0;
+ struct pmonr *pmonr;
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+ lockdep_assert_held(&pkg_data->pkg_data_lock);
+
+ while ((pmonr = list_first_entry_or_null(
+ &pkg_data->istate_pmonrs_lru, struct pmonr, rotation_entry))) {
+
+ if (__pmonr__in_instate(pmonr))
+ break;
+ __pmonr__ilstate_to_astate(pmonr);
+ reused++;
+ }
+ return reused;
+}
+
+static int try_reuse_ilstate_pmonrs(struct pkg_data *pkg_data)
+{
+ int reused;
+ unsigned long flags;
+#ifdef CONFIG_LOCKDEP
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+#endif
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+ reused = __try_reuse_ilstate_pmonrs(pkg_data);
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ return reused;
+}
+
+
+/*
+ * A monr is only readable when all its used pmonrs have an RMID.
+ * Therefore, the time a monr entered (A)state is the maximum of the
+ * last_enter_astate times for all (A)state pmonrs if no pmonr is in (I)state.
+ * A monr with any pmonr in (I)state has not entered (A)state.
+ * Returns the monr's enter-(A)state time if available, otherwise min_inh_pkg is
+ * set to the smallest pkg_id where the monr's pmonr is in (I)state and
+ * the return value is undefined.
+ */
+static unsigned long
+__monr__last_enter_astate(struct monr *monr, int *min_inh_pkg)
+{
+ struct pkg_data *pkg_data;
+ u16 pkg_id;
+ unsigned long flags, astate_time = 0;
+
+ *min_inh_pkg = -1;
+ cqm_pkg_id_for_each_online(pkg_id) {
+		struct pmonr *pmonr;
+
+		pkg_data = cqm_pkgs_data[pkg_id];
+		if (*min_inh_pkg >= 0)
+			break;
+
+		raw_spin_lock_irqsave_nested(
+			&pkg_data->pkg_data_lock, flags, pkg_id);
+
+		pmonr = monr->pmonrs[pkg_id];
+		if (__pmonr__in_istate(pmonr) && *min_inh_pkg < 0)
+ *min_inh_pkg = pkg_id;
+ else if (__pmonr__in_astate(pmonr) &&
+ astate_time < pmonr->last_enter_astate)
+ astate_time = pmonr->last_enter_astate;
+
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ }
+ return astate_time;
+}
+
+/*
+ * Steal as many rmids as possible.
+ * Transition pmonrs that have stayed at least __cqm_min_mon_slice in
+ * (A)state to (I)state.
+ */
+static inline int
+__try_steal_active_pmonrs(
+ struct pkg_data *pkg_data, unsigned int max_to_steal)
+{
+ struct pmonr *pmonr, *tmp;
+ int nr_stolen = 0, min_inh_pkg;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ unsigned long flags, monr_astate_end_time, now = jiffies;
+ struct list_head *alist = &pkg_data->astate_pmonrs_lru;
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+ /* pmonrs don't leave astate outside of rotation logic.
+ * The pkg mutex protects against the pmonr leaving
+ * astate_pmonrs_lru. The raw_spin_lock protects these list
+ * operations from list insertions at tail coming from the
+ * sched logic ( (U)state -> (A)state )
+ */
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+
+ pmonr = list_first_entry(alist, struct pmonr, rotation_entry);
+ WARN_ON_ONCE(pmonr != monr_hrchy_root->pmonrs[pkg_id]);
+ WARN_ON_ONCE(pmonr->pkg_id != pkg_id);
+
+ list_for_each_entry_safe_continue(pmonr, tmp, alist, rotation_entry) {
+ bool steal_rmid = false;
+
+ WARN_ON_ONCE(!__pmonr__in_astate(pmonr));
+ WARN_ON_ONCE(pmonr->pkg_id != pkg_id);
+
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+
+ monr_astate_end_time =
+ __monr__last_enter_astate(pmonr->monr, &min_inh_pkg) +
+			msecs_to_jiffies(__cqm_min_mon_slice);
+
+ /* pmonr in this pkg is supposed to be in (A)state. */
+ WARN_ON_ONCE(min_inh_pkg == pkg_id);
+
+ /* Steal a pmonr if:
+ * 1) Any pmonr in a pkg with pkg_id < local pkg_id is
+ * in (I)state.
+		 * 2) Its monr has been active for enough time.
+ * Note that since the min_inh_pkg for a monr cannot decrease
+ * while the monr is not active, then the monr eventually will
+ * become active again despite the stealing of pmonrs in pkgs
+ * with id larger than min_inh_pkg.
+ */
+ if (min_inh_pkg >= 0 && min_inh_pkg < pkg_id)
+ steal_rmid = true;
+ if (min_inh_pkg < 0 && monr_astate_end_time <= now)
+ steal_rmid = true;
+
+ raw_spin_lock_irqsave_nested(
+ &pkg_data->pkg_data_lock, flags, pkg_id);
+ if (!steal_rmid)
+ continue;
+
+ __pmonr__to_istate(pmonr);
+ nr_stolen++;
+ if (nr_stolen == max_to_steal)
+ break;
+ }
+
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+
+ return nr_stolen;
+}
+
+/* It will remove the prmid from the list it is attached to, if used. */
+static inline int __try_use_free_prmid(struct pkg_data *pkg_data,
+ struct prmid *prmid, bool *succeed)
+{
+ struct pmonr *pmonr;
+ int nr_activated = 0;
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+ lockdep_assert_held(&pkg_data->pkg_data_lock);
+
+ *succeed = false;
+ nr_activated += __try_reuse_ilstate_pmonrs(pkg_data);
+ pmonr = list_first_entry_or_null(&pkg_data->istate_pmonrs_lru,
+ struct pmonr, rotation_entry);
+ if (!pmonr)
+ return nr_activated;
+ WARN_ON_ONCE(__pmonr__in_ilstate(pmonr));
+ WARN_ON_ONCE(!__pmonr__in_instate(pmonr));
+
+ /* the state transition function will move the prmid to
+ * the active lru list.
+ */
+ __pmonr__instate_to_astate(pmonr, prmid);
+ nr_activated++;
+ *succeed = true;
+ return nr_activated;
+}
+
+static inline int __try_use_free_prmids(struct pkg_data *pkg_data)
+{
+ struct prmid *prmid, *tmp_prmid;
+ unsigned long flags;
+ int nr_activated = 0;
+ bool succeed;
+#ifdef CONFIG_DEBUG_SPINLOCK
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+#endif
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+ /* Lock protects free_prmids_pool, istate_pmonrs_lru and
+ * the monr hrchy.
+ */
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+
+ list_for_each_entry_safe(prmid, tmp_prmid,
+ &pkg_data->free_prmids_pool, pool_entry) {
+
+ /* Removes the free prmid if used. */
+ nr_activated += __try_use_free_prmid(pkg_data,
+ prmid, &succeed);
+ }
+
+ nr_activated += __try_reuse_ilstate_pmonrs(pkg_data);
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+
+ return nr_activated;
+}
+
+/* Update prmids of pmonrs in ilstate. To maintain fairness of rotation
+ * logic, try to activate (IN)state pmonrs with recovered prmids when
+ * possible rather than simply adding them to the free rmids list. This prevents
+ * ustate pmonrs (pmonrs that haven't waited in istate_pmonrs_lru) from obtaining
+ * the newly available RMIDs before those waiting in queue.
+ */
+static inline int
+__try_free_ilstate_prmids(struct pkg_data *pkg_data,
+ unsigned int cqm_threshold,
+ unsigned int *min_occupancy_dirty)
+{
+ struct pmonr *pmonr, *tmp_pmonr, *istate_pmonr;
+ struct prmid *prmid;
+ unsigned long flags;
+ u64 val;
+ bool succeed;
+ int ret, nr_activated = 0;
+#ifdef CONFIG_LOCKDEP
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+#endif
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+ WARN_ON_ONCE(try_reuse_ilstate_pmonrs(pkg_data));
+
+ /* No need to acquire pkg lock to iterate over ilstate_pmonrs_lru
+ * since only rotation logic modifies it.
+ */
+ list_for_each_entry_safe(
+ pmonr, tmp_pmonr,
+ &pkg_data->ilstate_pmonrs_lru, limbo_rotation_entry) {
+
+ if (WARN_ON_ONCE(list_empty(&pkg_data->istate_pmonrs_lru)))
+ return nr_activated;
+
+ istate_pmonr = list_first_entry(&pkg_data->istate_pmonrs_lru,
+ struct pmonr, rotation_entry);
+
+ if (pmonr == istate_pmonr) {
+ raw_spin_lock_irqsave_nested(
+ &pkg_data->pkg_data_lock, flags, pkg_id);
+
+ nr_activated++;
+ __pmonr__ilstate_to_astate(pmonr);
+
+ raw_spin_unlock_irqrestore(
+ &pkg_data->pkg_data_lock, flags);
+ continue;
+ }
+
+ ret = __cqm_prmid_update(pmonr->limbo_prmid,
+ __rmid_min_update_time);
+ if (WARN_ON_ONCE(ret < 0))
+ continue;
+
+ val = atomic64_read(&pmonr->limbo_prmid->last_read_value);
+ if (val > cqm_threshold) {
+ if (val < *min_occupancy_dirty)
+ *min_occupancy_dirty = val;
+ continue;
+ }
+
+ raw_spin_lock_irqsave_nested(
+ &pkg_data->pkg_data_lock, flags, pkg_id);
+
+ prmid = pmonr->limbo_prmid;
+
+ /* moves the prmid to free_prmids_pool. */
+ __pmonr__ilstate_to_instate(pmonr);
+
+ /* Do not affect ilstate_pmonrs_lru.
+ * If succeeds, prmid will end in active_prmids_pool,
+ * otherwise, stays in free_prmids_pool where the
+ * ilstate_to_instate transition left it.
+ */
+ nr_activated += __try_use_free_prmid(pkg_data,
+ prmid, &succeed);
+
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ }
+ return nr_activated;
+}
+
+/* Update limbo prmids not associated to a pmonr. To maintain fairness of
+ * rotation logic, try to activate (IN)state pmonrs with recovered prmids when
+ * possible rather than simply adding them to the free rmids list. This prevents
+ * ustate pmonrs (pmonrs that haven't waited in istate_pmonrs_lru) from obtaining
+ * the newly available RMIDs before those waiting in queue.
+ */
+static inline int
+__try_free_limbo_prmids(struct pkg_data *pkg_data,
+ unsigned int cqm_threshold,
+ unsigned int *min_occupancy_dirty)
+{
+ struct prmid *prmid, *tmp_prmid;
+ unsigned long flags;
+ bool succeed;
+ int ret, nr_activated = 0;
+
+#ifdef CONFIG_LOCKDEP
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+#endif
+ u64 val;
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+ list_for_each_entry_safe(
+ prmid, tmp_prmid,
+ &pkg_data->nopmonr_limbo_prmids_pool, pool_entry) {
+
+ /* If min update time is good enough for user, it is good
+ * enough for rotation.
+ */
+ ret = __cqm_prmid_update(prmid, __rmid_min_update_time);
+ if (WARN_ON_ONCE(ret < 0))
+ continue;
+
+ val = atomic64_read(&prmid->last_read_value);
+ if (val > cqm_threshold) {
+ if (val < *min_occupancy_dirty)
+ *min_occupancy_dirty = val;
+ continue;
+ }
+ raw_spin_lock_irqsave_nested(
+ &pkg_data->pkg_data_lock, flags, pkg_id);
+
+		nr_activated += __try_use_free_prmid(pkg_data, prmid, &succeed);
+ if (!succeed)
+ list_move_tail(&prmid->pool_entry,
+ &pkg_data->free_prmids_pool);
+
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ }
+ return nr_activated;
+}
+
+/*
+ * Activate (I)state pmonrs.
+ *
+ * @min_occupancy_dirty: pointer to store the minimum occupancy of any
+ * dirty prmid.
+ *
+ * Try to activate as many pmonrs as possible before utilizing limbo prmids
+ * pointed to by ilstate pmonrs in order to minimize the number of dirty rmids
+ * that move to other pmonrs when cqm_threshold > 0.
+ */
+static int __try_activate_istate_pmonrs(
+ struct pkg_data *pkg_data, unsigned int cqm_threshold,
+ unsigned int *min_occupancy_dirty)
+{
+ int nr_activated = 0;
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+	/* Start reusing limbo prmids not pointed to by any ilstate pmonr. */
+ nr_activated += __try_free_limbo_prmids(pkg_data, cqm_threshold,
+ min_occupancy_dirty);
+
+ /* Try to use newly available free prmids */
+ nr_activated += __try_use_free_prmids(pkg_data);
+
+	/* Continue reusing limbo prmids pointed to by an ilstate pmonr. */
+ nr_activated += __try_free_ilstate_prmids(pkg_data, cqm_threshold,
+ min_occupancy_dirty);
+ /* Try to use newly available free prmids */
+ nr_activated += __try_use_free_prmids(pkg_data);
+
+ WARN_ON_ONCE(try_reuse_ilstate_pmonrs(pkg_data));
+ return nr_activated;
+}
+
+/* Number of pmonrs that have been in (I)state for at least min_wait_jiffies.
+ * XXX: Use rcu to access istate_pmonrs_lru.
+ */
+static int
+count_istate_pmonrs(struct pkg_data *pkg_data,
+ unsigned int min_wait_jiffies, bool exclude_limbo)
+{
+ unsigned long flags;
+ unsigned int c = 0;
+ struct pmonr *pmonr;
+#ifdef CONFIG_DEBUG_SPINLOCK
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+#endif
+
+ lockdep_assert_held(&pkg_data->pkg_data_mutex);
+
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+ list_for_each_entry(
+ pmonr, &pkg_data->istate_pmonrs_lru, rotation_entry) {
+
+ if (jiffies - pmonr->last_enter_istate < min_wait_jiffies)
+ break;
+
+ WARN_ON_ONCE(!__pmonr__in_istate(pmonr));
+ if (exclude_limbo && __pmonr__in_ilstate(pmonr))
+ continue;
+ c++;
+ }
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+
+ return c;
+}
+
+static inline int
+read_nr_instate_pmonrs(struct pkg_data *pkg_data, u16 pkg_id) {
+ unsigned long flags;
+ int n;
+
+ raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+ n = READ_ONCE(cqm_pkgs_data[pkg_id]->nr_instate_pmonrs);
+ raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
+ WARN_ON_ONCE(n < 0);
+ return n;
+}
+
+/*
+ * Rotate RMIDs among pkgs.
+ *
+ * For reads to be meaningful, valid rmids have to be programmed for
+ * enough time to capture enough instances of cache allocation/retirement
+ * to yield useful occupancy values. The approach to handle that problem
+ * is to guarantee that every pmonr will spend at least T time in (A)state
+ * when such transition has occurred and hope that T is long enough.
+ *
+ * The hardware retains occupancy for 'old' tags, even after changing rmid
+ * for a task/cgroup. To workaround this problem, we keep retired rmids
+ * for a task/cgroup. To work around this problem, we keep retired rmids
+ * such limbo rmids rather than free ones since their residual occupancy
+ * is valid occupancy for the task/cgroup.
+ *
+ * Rotation works by taking away an RMID from a group (the old RMID),
+ * and assigning the free RMID to another group (the new RMID). We must
+ * then wait for the old RMID to not be used (no cachelines tagged).
+ * This ensures that all cachelines are tagged with 'active' RMIDs. At
+ * this point we can start reading values for the new RMID and treat the
+ * old RMID as the free RMID for the next rotation.
+ */
+static void
+__intel_cqm_rmid_rotate(struct pkg_data *pkg_data,
+ unsigned int nr_max_limbo,
+ unsigned int nr_min_activated)
+{
+ int nr_instate, nr_to_steal, nr_stolen, nr_slo_violated;
+ int limbo_cushion = 0;
+ unsigned int cqm_threshold = 0, min_occupancy_dirty;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+
+ /*
+ * To avoid locking the process, keep track of pmonrs that
+ * are activated during this execution of rotaton logic, so
+ * we don't have to rely on the state of the pmonrs lists
+ * to estimate progress, that can be modified during
+ * creation and destruction of events and cgroups.
+ */
+ int nr_activated = 0;
+
+ mutex_lock_nested(&pkg_data->pkg_data_mutex, pkg_id);
+
+ /*
+ * Since ilstates are created only during stealing or destroying pmonrs,
+ * but destroy requires pkg_data_mutex, then it is only necessary to
+	 * try to reuse ilstate once per call. Furthermore, new ilstates
+	 * appearing during iteration of the rotation logic are an error.
+ */
+ nr_activated += try_reuse_ilstate_pmonrs(pkg_data);
+
+again:
+ nr_stolen = 0;
+ min_occupancy_dirty = UINT_MAX;
+ /*
+ * Three types of actions are taken in rotation logic:
+ * 1) Try to activate pmonrs using limbo RMIDs.
+ * 2) Steal more RMIDs. Ideally the number of RMIDs in limbo equals
+ * the number of pmonrs in (I)state plus the limbo_cushion aimed to
+	 * compensate for limbo RMIDs that do not drop occupancy fast enough.
+	 * The actual number stolen is constrained to
+	 * prevent having more than nr_max_limbo RMIDs in limbo.
+	 * 3) Increase cqm_threshold so even RMIDs with residual occupancy
+	 * are utilized to activate (I)state pmonrs. Doing so increases the
+	 * error in the reported value in a way undetectable to the user, so
+	 * it is left as a last resort.
+ */
+
+	/* Verify all available ilimbo pmonrs were activated where they
+	 * were supposed to be.
+ */
+ WARN_ON_ONCE(try_reuse_ilstate_pmonrs(pkg_data) > 0);
+
+ /* Activate all pmonrs that we can by recycling rmids in limbo */
+ nr_activated += __try_activate_istate_pmonrs(
+ pkg_data, cqm_threshold, &min_occupancy_dirty);
+
+ /* Count nr of pmonrs that are inherited and do not have limbo_prmid */
+ nr_instate = read_nr_instate_pmonrs(pkg_data, pkg_id);
+ WARN_ON_ONCE(nr_instate < 0);
+ /*
+ * If no pmonr needs rmid, then it's time to let go. pmonrs in ilimbo
+	 * are not counted since the limbo_prmid can be reused once it's time
+ * to activate them.
+ */
+ if (nr_instate == 0)
+ goto exit;
+
+ WARN_ON_ONCE(!list_empty(&pkg_data->free_prmids_pool));
+ WARN_ON_ONCE(try_reuse_ilstate_pmonrs(pkg_data) > 0);
+
+ /* There are still pmonrs waiting for RMID, check if the SLO about
+	 * __cqm_max_wait_mon has been violated. If so, use a more
+	 * aggressive version of RMID stealing and reutilization.
+ */
+ nr_slo_violated = count_istate_pmonrs(
+ pkg_data, msecs_to_jiffies(__cqm_max_wait_mon), false);
+
+	/* The first measure against SLO violations is to increase the number of
+	 * stolen RMIDs beyond the number of pmonrs waiting for an RMID. The
+	 * magnitude of the limbo_cushion is proportional to nr_slo_violated
+	 * (but arbitrarily weighted).
+ */
+ if (nr_slo_violated)
+ limbo_cushion = (nr_slo_violated + 1) / 2;
+
+ /*
+ * Need more free rmids. Steal RMIDs from active pmonrs and place them
+ * into limbo lru. Steal enough to have high chances that eventually
+ * occupancy of enough RMIDs in limbo will drop enough to be reused
+ * (the limbo_cushion).
+ */
+ nr_to_steal = min(nr_instate + limbo_cushion,
+ max(0, (int)nr_max_limbo -
+ count_limbo_prmids(pkg_data)));
+
+ if (nr_to_steal)
+ nr_stolen = __try_steal_active_pmonrs(pkg_data, nr_to_steal);
+
+ /* Already stole as many as possible, finish if no SLO violations. */
+ if (!nr_slo_violated)
+ goto exit;
+
+ /*
+ * There are SLO violations due to recycling RMIDs not progressing
+ * fast enough. Possible (non-exclusive) causal factors are:
+ * 1) Too many RMIDs in limbo do not drop occupancy despite having
+ * spent a "reasonable" time in limbo lru.
+	 * 2) RMIDs in limbo have not spent long enough in limbo to have dropped
+	 *    occupancy, but they will within a "reasonable" time.
+	 *
+	 * If (2) only, it is ok to wait, since eventually the rmids
+	 * will rotate. If (1), there is a danger of being stuck; in that case
+	 * the dirty threshold, cqm_threshold, must be increased.
+	 * The notion of "reasonable" time is ambiguous since the more SLO
+	 * violations there are, the more urgent it is to rotate. For now just try
+ * to guarantee any progress is made (activate at least one prmid
+ * with SLO violated).
+ */
+
+	/* Using the minimum observed occupancy in dirty rmids guarantees
+	 * recovery of at least one rmid per iteration. Check if the constraints
+	 * would allow using such a threshold, otherwise it makes no sense to
+	 * retry.
+ */
+ if (nr_activated < nr_min_activated && min_occupancy_dirty <=
+ READ_ONCE(__intel_cqm_max_threshold) / cqm_l3_scale) {
+
+ cqm_threshold = min_occupancy_dirty;
+ goto again;
+ }
+exit:
+ mutex_unlock(&pkg_data->pkg_data_mutex);
+}
+
static struct pmu intel_cqm_pmu;
+/* Rotation only needs to be run when there is any pmonr in (I)state. */
+static bool intel_cqm_need_rotation(u16 pkg_id)
+{
+
+ struct pkg_data *pkg_data;
+ bool need_rot;
+
+ pkg_data = cqm_pkgs_data[pkg_id];
+
+ mutex_lock_nested(&pkg_data->pkg_data_mutex, pkg_id);
+ /* Rotation is needed if prmids in limbo need to be recycled or if
+ * there are pmonrs in (I)state.
+ */
+ need_rot = !list_empty(&pkg_data->nopmonr_limbo_prmids_pool) ||
+ !list_empty(&pkg_data->istate_pmonrs_lru);
+
+ mutex_unlock(&pkg_data->pkg_data_mutex);
+ return need_rot;
+}
+
+/*
+ * Schedule rotation in one package.
+ */
+static void __intel_cqm_schedule_rotation_for_pkg(u16 pkg_id)
+{
+ struct pkg_data *pkg_data;
+ unsigned long delay;
+
+ delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
+ pkg_data = cqm_pkgs_data[pkg_id];
+ schedule_delayed_work_on(
+ pkg_data->rotation_cpu, &pkg_data->rotation_work, delay);
+}
+
+/*
+ * Schedule rotation and rmid's timed update in all packages.
+ * Rescheduling will stop when no longer needed.
+ */
+static void intel_cqm_schedule_work_all_pkgs(void)
+{
+ int pkg_id;
+
+ cqm_pkg_id_for_each_online(pkg_id)
+ __intel_cqm_schedule_rotation_for_pkg(pkg_id);
+}
+
+static void intel_cqm_rmid_rotation_work(struct work_struct *work)
+{
+ struct pkg_data *pkg_data = container_of(
+ to_delayed_work(work), struct pkg_data, rotation_work);
+ /* Allow max 25% of RMIDs to be in limbo. */
+ unsigned int max_limbo_rmids = max(1u, (pkg_data->max_rmid + 1) / 4);
+ unsigned int min_activated = max(1u, (intel_cqm_pmu.hrtimer_interval_ms
+ * __cqm_min_progress_rate) / 1000);
+ u16 pkg_id = topology_physical_package_id(pkg_data->rotation_cpu);
+
+ WARN_ON_ONCE(pkg_data != cqm_pkgs_data[pkg_id]);
+
+ __intel_cqm_rmid_rotate(pkg_data, max_limbo_rmids, min_activated);
+
+ if (intel_cqm_need_rotation(pkg_id))
+ __intel_cqm_schedule_rotation_for_pkg(pkg_id);
+}
+
/*
* Find a group and setup RMID.
*
@@ -1099,6 +1824,8 @@ static int intel_cqm_event_init(struct perf_event *event)
mutex_unlock(&cqm_mutex);
+ intel_cqm_schedule_work_all_pkgs();
+
return 0;
}
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index 22635bc..b0e1698 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -123,9 +123,16 @@ struct monr;
* prmids.
* @limbo_rotation_entry: List entry to attach to ilstate_pmonrs_lru when
* this pmonr is in (IL)state.
- * @rotation_entry: List entry to attach to either astate_pmonrs_lru
- * or ilstate_pmonrs_lru in pkg_data.
+ * @last_enter_istate:	Time this pmonr last entered (I)state.
+ * @last_enter_astate:	Time this pmonr last entered (A)state. Used in
+ *			rotation logic to guarantee that each pmonr gets
+ *			a minimum time in (A)state.
+ * @rotation_entry: List entry to attach to pmonr rotation lists in
+ * pkg_data.
* @monr: The monr that contains this pmonr.
+ * @nr_enter_istate: Track number of times entered (I)state. Useful
+ * signal to diagnose excessive contention for
+ * rmids in this package.
* @pkg_id: Auxiliar variable with pkg id for this pmonr.
* @prmid_summary_atomic: Atomic accesor to store a union prmid_summary
* that represent the state of this pmonr.
@@ -194,6 +201,10 @@ struct pmonr {
struct monr *monr;
struct list_head rotation_entry;
+ unsigned long last_enter_istate;
+ unsigned long last_enter_astate;
+ unsigned int nr_enter_istate;
+
u16 pkg_id;
/* all writers are sync'ed by package's lock. */
@@ -218,6 +229,7 @@ struct pmonr {
* @ilsate_pmonrs_lru: pmonrs in (IL)state, these pmonrs have a valid
* limbo_prmid. It's a subset of istate_pmonrs_lru.
* Sorted increasingly by pmonr.last_enter_istate.
+ * @nr_instate_pmonrs:	nr of pmonrs in (IN)state.
+ * @nr_ilstate_pmonrs:	nr of pmonrs in (IL)state.
* @pkg_data_mutex: Hold for stability when modifying pmonrs
* hierarchy.
* @pkg_data_lock: Hold to protect variables that may be accessed
@@ -226,6 +238,7 @@ struct pmonr {
* hierarchy.
* @rotation_cpu: CPU to run @rotation_work on, it must be in the
* package associated to this instance of pkg_data.
+ * @rotation_work:	Delayed work that performs the rotation of prmids.
*/
struct pkg_data {
u32 max_rmid;
@@ -247,9 +260,13 @@ struct pkg_data {
struct list_head istate_pmonrs_lru;
struct list_head ilstate_pmonrs_lru;
+ int nr_instate_pmonrs;
+ int nr_ilstate_pmonrs;
+
struct mutex pkg_data_mutex;
raw_spinlock_t pkg_data_lock;
+ struct delayed_work rotation_work;
int rotation_cpu;
};
@@ -410,6 +427,44 @@ static inline int monr_hrchy_count_held_raw_spin_locks(void)
#define CQM_DEFAULT_ROTATION_PERIOD 1200 /* ms */
/*
+ * Rotation function.
+ * Rotation logic runs per-package. In each package, if free rmids are needed,
+ * it will steal prmids from the pmonr that has been the longest time in
+ * (A)state.
+ * The hardware provides no way to signal that a rmid will be reused, therefore,
+ * before reusing a rmid that has been stolen, the rmid should stay for some
+ * time in a "limbo" state where it is not associated to any thread, hoping that
+ * the cache lines allocated for this rmid will eventually be replaced.
+ */
+static void intel_cqm_rmid_rotation_work(struct work_struct *work);
+
+/*
+ * Service Level Objectives (SLO) for the rotation logic.
+ *
+ * @__cqm_min_mon_slice: Minimum duration of a monitored slice.
+ * @__cqm_max_wait_mon: Maximum time that a pmonr can pass waiting for an
+ * RMID without rotation logic making any progress. Once elapsed for any
+ * prmid, the reusing threshold (__intel_cqm_max_threshold) can be increased,
+ * potentially increasing the speed at which RMIDs are reused, but potentially
+ * introducing measurement error.
+ */
+#define CQM_DEFAULT_MIN_MON_SLICE 2000 /* ms */
+static unsigned int __cqm_min_mon_slice = CQM_DEFAULT_MIN_MON_SLICE;
+
+#define CQM_DEFAULT_MAX_WAIT_MON 20000 /* ms */
+static unsigned int __cqm_max_wait_mon = CQM_DEFAULT_MAX_WAIT_MON;
+
+#define CQM_DEFAULT_MIN_PROGRESS_RATE 1 /* activated pmonrs per second */
+static unsigned int __cqm_min_progress_rate = CQM_DEFAULT_MIN_PROGRESS_RATE;
+/*
+ * If we fail to assign any RMID for intel_cqm_rotation because cachelines are
+ * still tagged with RMIDs in limbo even after having stolen enough rmids (a
+ * maximum number of rmids in limbo at any time), then we increment the dirty
+ * threshold to allow at least one RMID to be recycled. This mitigates the
+ * problem caused when cachelines tagged with a RMID are not evicted;
+ * it introduces error in the occupancy reads but allows the rotation of rmids
+ * to proceed.
+ *
* __intel_cqm_max_threshold provides an upper bound on the threshold,
* and is measured in bytes because it's exposed to userland.
* It's units are bytes must be scaled by cqm_l3_scale to obtain cache lines.
--
2.8.0.rc3.226.g39d4020
The new event_terminate hook is used to detach a monr from a cgroup
before the event's reference to the cgroup is removed.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index f000fd0..dcf7f4a 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2391,7 +2391,7 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
return prmid_summary__is_istate(summary) ? -1 : 0;
}
-static void intel_cqm_event_destroy(struct perf_event *event)
+static void intel_cqm_event_terminate(struct perf_event *event)
{
struct perf_event *group_other = NULL;
struct monr *monr;
@@ -2438,6 +2438,17 @@ static void intel_cqm_event_destroy(struct perf_event *event)
if (monr__is_root(monr))
goto exit;
+ /* Handle cgroup event. */
+ if (event->cgrp) {
+ monr->mon_event_group = NULL;
+ if ((event->cgrp->css.flags & CSS_ONLINE) &&
+ !cgrp_to_cqm_info(event->cgrp)->cont_monitoring)
+ __css_stop_monitoring(&monr__get_mon_cgrp(monr)->css);
+
+ goto exit;
+ }
+ WARN_ON_ONCE(!monr_is_event_type(monr));
+
/* Transition all pmonrs to (U)state. */
monr_hrchy_acquire_locks(flags, i);
@@ -2478,8 +2489,6 @@ static int intel_cqm_event_init(struct perf_event *event)
INIT_LIST_HEAD(&event->hw.cqm_event_groups_entry);
INIT_LIST_HEAD(&event->hw.cqm_event_group_entry);
- event->destroy = intel_cqm_event_destroy;
-
mutex_lock(&cqm_mutex);
@@ -2595,6 +2604,7 @@ static struct pmu intel_cqm_pmu = {
.attr_groups = intel_cqm_attr_groups,
.task_ctx_nr = perf_sw_context,
.event_init = intel_cqm_event_init,
+ .event_terminate = intel_cqm_event_terminate,
.add = intel_cqm_event_add,
.del = intel_cqm_event_stop,
.start = intel_cqm_event_start,
--
2.8.0.rc3.226.g39d4020
Create a monr per monitored cgroup. Insert monrs into the monr hierarchy.
Task events become leaves under the monr of their lowest monitored
ancestor cgroup (the lowest cgroup ancestor with a monr).
CQM starts after the cgroup subsystem, and uses the cqm_initialized_key
static key to avoid interfering with the perf cgroup logic until
properly initialized. The cqm_init_mutex protects the initialization.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 594 +++++++++++++++++++++++++++++++++++++-
arch/x86/events/intel/cqm.h | 13 +
arch/x86/include/asm/perf_event.h | 33 +++
3 files changed, 637 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 98a919f..f000fd0 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -35,10 +35,17 @@ static struct perf_pmu_events_attr event_attr_##v = { \
static LIST_HEAD(cache_groups);
static DEFINE_MUTEX(cqm_mutex);
+/*
+ * Synchronizes initialization of cqm with cgroups.
+ */
+static DEFINE_MUTEX(cqm_init_mutex);
+
struct monr *monr_hrchy_root;
struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
+DEFINE_STATIC_KEY_FALSE(cqm_initialized_key);
+
static inline bool __pmonr__in_istate(struct pmonr *pmonr)
{
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
@@ -69,6 +76,9 @@ static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
return !pmonr->prmid && !pmonr->ancestor_pmonr;
}
+/* Whether the monr is root. Recall that a cgroup can be non-root and yet
+ * point to the root monr.
+ */
static inline bool monr__is_root(struct monr *monr)
{
return monr_hrchy_root == monr;
@@ -115,6 +125,23 @@ static inline void __monr__clear_mon_active(struct monr *monr)
monr->flags &= ~MONR_MON_ACTIVE;
}
+static inline bool monr__is_cgroup_type(struct monr *monr)
+{
+ return monr->mon_cgrp;
+}
+
+static inline bool monr_is_event_type(struct monr *monr)
+{
+ return !monr->mon_cgrp && monr->mon_event_group;
+}
+
+
+static inline struct cgroup_subsys_state *get_root_perf_css(void)
+{
+ /* Get css for root cgroup */
+ return init_css_set.subsys[perf_event_cgrp_id];
+}
+
/*
* Update if enough time has passed since last read.
*
@@ -725,6 +752,7 @@ static struct monr *monr_alloc(void)
monr->parent = NULL;
INIT_LIST_HEAD(&monr->children);
INIT_LIST_HEAD(&monr->parent_entry);
+ monr->mon_cgrp = NULL;
monr->mon_event_group = NULL;
/* Iterate over all pkgs, even unitialized ones. */
@@ -947,7 +975,7 @@ retry:
}
/*
- * Wrappers for monr manipulation in events.
+ * Wrappers for monr manipulation in events and cgroups.
*
*/
static inline struct monr *monr_from_event(struct perf_event *event)
@@ -960,6 +988,100 @@ static inline void event_set_monr(struct perf_event *event, struct monr *monr)
WRITE_ONCE(event->hw.cqm_monr, monr);
}
+#ifdef CONFIG_CGROUP_PERF
+static inline struct monr *monr_from_perf_cgroup(struct perf_cgroup *cgrp)
+{
+ struct monr *monr;
+ struct cgrp_cqm_info *cqm_info;
+
+ cqm_info = (struct cgrp_cqm_info *)READ_ONCE(cgrp->arch_info);
+ WARN_ON_ONCE(!cqm_info);
+ monr = READ_ONCE(cqm_info->monr);
+ return monr;
+}
+
+static inline struct perf_cgroup *monr__get_mon_cgrp(struct monr *monr)
+{
+ WARN_ON_ONCE(!monr);
+ return READ_ONCE(monr->mon_cgrp);
+}
+
+static inline void
+monr__set_mon_cgrp(struct monr *monr, struct perf_cgroup *cgrp)
+{
+ WRITE_ONCE(monr->mon_cgrp, cgrp);
+}
+
+static inline void
+perf_cgroup_set_monr(struct perf_cgroup *cgrp, struct monr *monr)
+{
+ WRITE_ONCE(cgrp_to_cqm_info(cgrp)->monr, monr);
+}
+
+/*
+ * A perf_cgroup is monitored when it is set as a monr's mon_cgrp.
+ * There is a many-to-one relationship between perf_cgroups' monrs
+ * and monrs' mon_cgrp. A monitored cgroup is necessarily referenced
+ * back by its monr's mon_cgrp.
+ */
+static inline bool perf_cgroup_is_monitored(struct perf_cgroup *cgrp)
+{
+ struct monr *monr;
+ struct perf_cgroup *monr_cgrp;
+
+ /* monr can be referenced by a cgroup other than the one in its
+ * mon_cgrp, be careful.
+ */
+ monr = monr_from_perf_cgroup(cgrp);
+
+ monr_cgrp = monr__get_mon_cgrp(monr);
+	/* The root monr does not have a cgroup associated before initialization.
+	 * mon_cgrp and mon_event_group are a union, so the pointer must be set
+ * for all non-root monrs.
+ */
+ return monr_cgrp && monr__get_mon_cgrp(monr) == cgrp;
+}
+
+/* Set css's monr to the monr of its lowest monitored ancestor. */
+static inline void __css_set_monr_to_lma(struct cgroup_subsys_state *css)
+{
+ lockdep_assert_held(&cqm_mutex);
+ if (!css->parent) {
+ perf_cgroup_set_monr(css_to_perf_cgroup(css), monr_hrchy_root);
+ return;
+ }
+ perf_cgroup_set_monr(
+ css_to_perf_cgroup(css),
+ monr_from_perf_cgroup(css_to_perf_cgroup(css->parent)));
+}
+
+static inline void
+perf_cgroup_make_monitored(struct perf_cgroup *cgrp, struct monr *monr)
+{
+ monr_hrchy_assert_held_mutexes();
+ perf_cgroup_set_monr(cgrp, monr);
+ /* Make sure that monr is a valid monr for css before it's visible
+ * to any reader of css.
+ */
+ smp_wmb();
+ monr__set_mon_cgrp(monr, cgrp);
+}
+
+static inline void
+perf_cgroup_make_unmonitored(struct perf_cgroup *cgrp)
+{
+ struct monr *monr = monr_from_perf_cgroup(cgrp);
+
+ monr_hrchy_assert_held_mutexes();
+ __css_set_monr_to_lma(&cgrp->css);
+	/* Make sure that all readers of css's monr see the lma css before
+ * monr stops being a valid monr for css.
+ */
+ smp_wmb();
+ monr__set_mon_cgrp(monr, NULL);
+}
+#endif
+
/*
* Always finds a rmid_entry to schedule. To be called during scheduler.
* A fast path that only uses read_lock for common case when rmid for current
@@ -1068,6 +1190,286 @@ __monr_hrchy_remove_leaf(struct monr *monr)
monr->parent = NULL;
}
+#ifdef CONFIG_CGROUP_PERF
+static struct perf_cgroup *__perf_cgroup_parent(struct perf_cgroup *cgrp)
+{
+ struct cgroup_subsys_state *parent_css = cgrp->css.parent;
+
+ if (parent_css)
+ return css_to_perf_cgroup(parent_css);
+ return NULL;
+}
+
+/* Get cgroup for both task and cgroup event. */
+static inline struct perf_cgroup *
+perf_cgroup_from_event(struct perf_event *event)
+{
+#ifdef CONFIG_LOCKDEP
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ bool rcu_safe = lockdep_is_held(
+ &cqm_pkgs_data[pkg_id]->pkg_data_lock);
+#endif
+
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ return event->cgrp;
+
+ return container_of(
+ task_css_check(event->hw.target, perf_event_cgrp_id, rcu_safe),
+ struct perf_cgroup, css);
+}
+
+/* Find lowest ancestor that is monitored, not including this cgrp.
+ * Return NULL if no ancestor is monitored.
+ */
+struct perf_cgroup *__cgroup_find_lma(struct perf_cgroup *cgrp)
+{
+ do {
+ cgrp = __perf_cgroup_parent(cgrp);
+ } while (cgrp && !perf_cgroup_is_monitored(cgrp));
+ return cgrp;
+}
+
+/* Similar to css_next_descendant_pre but skips the subtree rooted by pos. */
+struct cgroup_subsys_state *
+css_skip_subtree_pre(struct cgroup_subsys_state *pos,
+ struct cgroup_subsys_state *root)
+{
+ struct cgroup_subsys_state *next;
+
+ WARN_ON_ONCE(!pos);
+ while (pos != root) {
+ next = css_next_child(pos, pos->parent);
+ if (next)
+ return next;
+ pos = pos->parent;
+ }
+ return NULL;
+}
+
+/* Make the monrs of all of css's descendants depend on new_monr. */
+inline void __css_subtree_update_monrs(struct cgroup_subsys_state *css,
+ struct monr *new_monr)
+{
+ struct cgroup_subsys_state *pos_css;
+ int i;
+ unsigned long flags;
+
+ lockdep_assert_held(&cqm_mutex);
+ monr_hrchy_assert_held_mutexes();
+
+ rcu_read_lock();
+
+ /* Iterate over descendants of css in pre-order, in a way
+ * similar to css_for_each_descendant_pre, but skipping the subtrees
+ * rooted by css's with a monitored cgroup, since the elements
+ * in those subtrees do not need to be updated.
+ */
+ pos_css = css_next_descendant_pre(css, css);
+ while (pos_css) {
+ struct perf_cgroup *pos_cgrp = css_to_perf_cgroup(pos_css);
+ struct monr *pos_monr = monr_from_perf_cgroup(pos_cgrp);
+
+ /* Skip css that are not online, sync'ed with cqm_mutex. */
+ if (!(pos_css->flags & CSS_ONLINE)) {
+ pos_css = css_next_descendant_pre(pos_css, css);
+ continue;
+ }
+		/* Update descendant pos's monr pointer to new_monr. */
+ if (!perf_cgroup_is_monitored(pos_cgrp)) {
+ perf_cgroup_set_monr(pos_cgrp, new_monr);
+ pos_css = css_next_descendant_pre(pos_css, css);
+ continue;
+ }
+ monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i);
+ pos_monr->parent = new_monr;
+ list_move_tail(&pos_monr->parent_entry, &new_monr->children);
+ monr_hrchy_release_raw_spin_locks_irq_restore(flags, i);
+		/* Don't go down the subtree in pos_css since pos_monr is the
+ * lma for all its descendants.
+ */
+ pos_css = css_skip_subtree_pre(pos_css, css);
+ }
+ rcu_read_unlock();
+}
+
+static inline int __css_start_monitoring(struct cgroup_subsys_state *css)
+{
+ struct perf_cgroup *cgrp, *cgrp_lma, *pos_cgrp;
+ struct monr *monr, *monr_parent, *pos_monr, *tmp_monr;
+ unsigned long flags;
+ int i;
+
+ lockdep_assert_held(&cqm_mutex);
+
+ /* Hold mutexes to prevent all rotation threads in all packages from
+ * messing with this.
+ */
+ monr_hrchy_acquire_mutexes();
+ cgrp = css_to_perf_cgroup(css);
+	if (WARN_ON_ONCE(perf_cgroup_is_monitored(cgrp))) {
+		monr_hrchy_release_mutexes();
+		return -1;
+	}
+
+ /* When css is root cgroup's css, attach to the pre-existing
+ * and active root monr.
+ */
+ cgrp_lma = __cgroup_find_lma(cgrp);
+ if (!cgrp_lma) {
+ /* monr of root cgrp must be monr_hrchy_root. */
+ WARN_ON_ONCE(!monr__is_root(monr_from_perf_cgroup(cgrp)));
+ perf_cgroup_make_monitored(cgrp, monr_hrchy_root);
+ monr_hrchy_release_mutexes();
+ return 0;
+ }
+	/* The monr of the lowest monitored ancestor is the direct parent
+	 * of the new monr in the monr hierarchy.
+ */
+ monr_parent = monr_from_perf_cgroup(cgrp_lma);
+
+ /* Create new monr. */
+ monr = monr_alloc();
+ if (IS_ERR(monr)) {
+ monr_hrchy_release_mutexes();
+ return PTR_ERR(monr);
+ }
+
+ /* monr has no children yet so it is to be inserted in hierarchy with
+	 * all its pmonrs in (U)state.
+ * We hold locks until monr_hrchy changes are complete, to prevent
+ * possible state transition for the pmonrs in monr while still
+ * allowing to read the prmid_summary in the scheduler path.
+ */
+ monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i);
+ __monr_hrchy_insert_leaf(monr, monr_parent);
+ monr_hrchy_release_raw_spin_locks_irq_restore(flags, i);
+
+ /* Make sure monr is in hierarchy before attaching monr to cgroup. */
+ barrier();
+
+ perf_cgroup_make_monitored(cgrp, monr);
+ __css_subtree_update_monrs(css, monr);
+
+ monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i);
+	/* Move task-event monrs that are descendants of css's cgroup. */
+ list_for_each_entry_safe(pos_monr, tmp_monr,
+ &monr_parent->children, parent_entry) {
+ if (!monr_is_event_type(pos_monr))
+ continue;
+ /* all events in event group must have the same cgroup.
+ * No RCU read lock necessary for task_css_check since calling
+ * inside critical section.
+ */
+ pos_cgrp = perf_cgroup_from_event(pos_monr->mon_event_group);
+ if (!cgroup_is_descendant(pos_cgrp->css.cgroup,
+ cgrp->css.cgroup))
+ continue;
+ pos_monr->parent = monr;
+ list_move_tail(&pos_monr->parent_entry, &monr->children);
+ }
+ /* Make sure monitoring starts after all monrs have moved. */
+ barrier();
+
+ __monr__set_mon_active(monr);
+ monr_hrchy_release_raw_spin_locks_irq_restore(flags, i);
+
+ monr_hrchy_release_mutexes();
+ return 0;
+}
+
+static inline int __css_stop_monitoring(struct cgroup_subsys_state *css)
+{
+ struct perf_cgroup *cgrp, *cgrp_lma;
+ struct monr *monr, *monr_parent, *pos_monr;
+ unsigned long flags;
+ int i;
+
+ lockdep_assert_held(&cqm_mutex);
+
+ monr_hrchy_acquire_mutexes();
+ cgrp = css_to_perf_cgroup(css);
+	if (WARN_ON_ONCE(!perf_cgroup_is_monitored(cgrp))) {
+		monr_hrchy_release_mutexes();
+		return -1;
+	}
+
+ monr = monr_from_perf_cgroup(cgrp);
+
+ /* When css is root cgroup's css, detach cgroup but do not
+ * destroy monr.
+ */
+ cgrp_lma = __cgroup_find_lma(cgrp);
+ if (!cgrp_lma) {
+ /* monr of root cgrp must be monr_hrchy_root. */
+ WARN_ON_ONCE(!monr__is_root(monr_from_perf_cgroup(cgrp)));
+ perf_cgroup_make_unmonitored(cgrp);
+ monr_hrchy_release_mutexes();
+ return 0;
+ }
+	/* The monr of the lowest monitored ancestor is the direct parent
+	 * of monr in the monr hierarchy.
+ */
+ monr_parent = monr_from_perf_cgroup(cgrp_lma);
+
+ /* Lock together the transition to (U)state and clearing
+	 * MONR_MON_ACTIVE to prevent pmonrs from returning to (A)state
+ * or (I)state in between.
+ */
+ monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i);
+ cqm_pkg_id_for_each_online(i)
+ __pmonr__to_ustate(monr->pmonrs[i]);
+ barrier();
+ __monr__clear_mon_active(monr);
+ monr_hrchy_release_raw_spin_locks_irq_restore(flags, i);
+
+ __css_subtree_update_monrs(css, monr_parent);
+
+
+ /*
+	 * Move the children monrs that are not cgroups.
+ */
+ monr_hrchy_acquire_raw_spin_locks_irq_save(flags, i);
+
+ list_for_each_entry(pos_monr, &monr->children, parent_entry)
+ pos_monr->parent = monr_parent;
+ list_splice_tail_init(&monr->children, &monr_parent->children);
+ perf_cgroup_make_unmonitored(cgrp);
+ __monr_hrchy_remove_leaf(monr);
+
+ monr_hrchy_release_raw_spin_locks_irq_restore(flags, i);
+
+ monr_hrchy_release_mutexes();
+ monr_dealloc(monr);
+ return 0;
+}
+
+/* Attaching an event to a cgroup starts monitoring in the cgroup.
+ * If the cgroup is already monitored, just use its pre-existing monr.
+ */
+static int __monr_hrchy_attach_cgroup_event(struct perf_event *event,
+ struct perf_cgroup *perf_cgrp)
+{
+ struct monr *monr;
+ int ret;
+
+ lockdep_assert_held(&cqm_mutex);
+ WARN_ON_ONCE(event->attach_state & PERF_ATTACH_TASK);
+ WARN_ON_ONCE(monr_from_event(event));
+ WARN_ON_ONCE(!perf_cgrp);
+
+ if (!perf_cgroup_is_monitored(perf_cgrp)) {
+ css_get(&perf_cgrp->css);
+ ret = __css_start_monitoring(&perf_cgrp->css);
+ css_put(&perf_cgrp->css);
+ if (ret)
+ return ret;
+ }
+
+ /* At this point, cgrp is always monitored, use its monr. */
+ monr = monr_from_perf_cgroup(perf_cgrp);
+
+ event_set_monr(event, monr);
+ monr->mon_event_group = event;
+ return 0;
+}
+#endif
+
static int __monr_hrchy_attach_cpu_event(struct perf_event *event)
{
lockdep_assert_held(&cqm_mutex);
@@ -1109,12 +1511,27 @@ static int __monr_hrchy_attach_task_event(struct perf_event *event,
static int monr_hrchy_attach_event(struct perf_event *event)
{
struct monr *monr_parent;
+#ifdef CONFIG_CGROUP_PERF
+ struct perf_cgroup *perf_cgrp;
+#endif
if (!event->cgrp && !(event->attach_state & PERF_ATTACH_TASK))
return __monr_hrchy_attach_cpu_event(event);
+#ifdef CONFIG_CGROUP_PERF
+ /* Task events become leaves, cgroup events reuse the cgroup's monr */
+ if (event->cgrp)
+ return __monr_hrchy_attach_cgroup_event(event, event->cgrp);
+
+ rcu_read_lock();
+ perf_cgrp = perf_cgroup_from_event(event);
+ rcu_read_unlock();
+
+ monr_parent = monr_from_perf_cgroup(perf_cgrp);
+#else
/* Two-levels hierarchy: Root and all event monr underneath it. */
monr_parent = monr_hrchy_root;
+#endif
return __monr_hrchy_attach_task_event(event, monr_parent);
}
@@ -1126,7 +1543,7 @@ static int monr_hrchy_attach_event(struct perf_event *event)
*/
static bool __match_event(struct perf_event *a, struct perf_event *b)
{
- /* Per-cpu and task events don't mix */
+ /* Cgroup/non-task per-cpu and task events don't mix */
if ((a->attach_state & PERF_ATTACH_TASK) !=
(b->attach_state & PERF_ATTACH_TASK))
return false;
@@ -2185,6 +2602,129 @@ static struct pmu intel_cqm_pmu = {
.read = intel_cqm_event_read,
};
+#ifdef CONFIG_CGROUP_PERF
+/* XXX: Add hooks for attaching/detaching a task with a monr to/from a cgroup. */
+inline int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
+ struct cgroup_subsys_state *new_css)
+{
+ struct perf_cgroup *new_cgrp;
+ struct cgrp_cqm_info *cqm_info;
+
+ new_cgrp = css_to_perf_cgroup(new_css);
+ cqm_info = kmalloc(sizeof(struct cgrp_cqm_info), GFP_KERNEL);
+ if (!cqm_info)
+ return -ENOMEM;
+ cqm_info->cont_monitoring = false;
+ cqm_info->monr = NULL;
+ new_cgrp->arch_info = cqm_info;
+
+ return 0;
+}
+
+inline void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
+{
+ struct perf_cgroup *cgrp = css_to_perf_cgroup(css);
+
+ kfree(cgrp_to_cqm_info(cgrp));
+ cgrp->arch_info = NULL;
+}
+
+/* Do the bulk of arch_css_online. To be called when CQM starts after
+ * css has gone online.
+ */
+static inline int __css_go_online(struct cgroup_subsys_state *css)
+{
+ lockdep_assert_held(&cqm_mutex);
+
+ /* css must not be used in monr hierarchy before having
+ * set its monr in this step.
+ */
+ __css_set_monr_to_lma(css);
+ /* Root monr is always monitoring. */
+ if (!css->parent)
+ css_to_cqm_info(css)->cont_monitoring = true;
+
+ if (css_to_cqm_info(css)->cont_monitoring)
+ return __css_start_monitoring(css);
+ return 0;
+}
+
+inline int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css)
+{
+ int ret = 0;
+
+ /* use cqm_init_mutex to synchronize with
+ * __start_monitoring_all_cgroups.
+ */
+ mutex_lock(&cqm_init_mutex);
+
+ if (static_branch_unlikely(&cqm_initialized_key)) {
+ mutex_lock(&cqm_mutex);
+ ret = __css_go_online(css);
+ mutex_unlock(&cqm_mutex);
+ WARN_ON_ONCE(ret);
+ }
+
+ mutex_unlock(&cqm_init_mutex);
+ return ret;
+}
+
+inline void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css)
+{
+ int ret = 0;
+ struct monr *monr;
+ struct perf_cgroup *cgrp = css_to_perf_cgroup(css);
+
+ mutex_lock(&cqm_init_mutex);
+
+ if (!static_branch_unlikely(&cqm_initialized_key))
+ goto out;
+
+ mutex_lock(&cqm_mutex);
+
+ monr = monr_from_perf_cgroup(cgrp);
+ if (!perf_cgroup_is_monitored(cgrp))
+ goto out_cqm;
+
+ /* Stop monitoring for the css's monr only if no more events need it.
+ * If events need the monr, it will be destroyed when the events that
+ * use it are destroyed.
+ */
+ if (monr->mon_event_group) {
+ monr_hrchy_acquire_mutexes();
+ perf_cgroup_make_unmonitored(cgrp);
+ monr_hrchy_release_mutexes();
+ } else {
+ ret = __css_stop_monitoring(css);
+ WARN_ON_ONCE(ret);
+ }
+
+out_cqm:
+ mutex_unlock(&cqm_mutex);
+out:
+ mutex_unlock(&cqm_init_mutex);
+ WARN_ON_ONCE(ret);
+}
+
+inline void perf_cgroup_arch_css_released(struct cgroup_subsys_state *css)
+{
+ mutex_lock(&cqm_init_mutex);
+
+ if (static_branch_unlikely(&cqm_initialized_key)) {
+ mutex_lock(&cqm_mutex);
+ /*
+ * Remove css from monr hierarchy now that css is about to
+ * leave the cgroup hierarchy.
+ */
+ perf_cgroup_set_monr(css_to_perf_cgroup(css), NULL);
+ mutex_unlock(&cqm_mutex);
+ }
+
+ mutex_unlock(&cqm_init_mutex);
+}
+
+#endif
+
static inline void cqm_pick_event_reader(int cpu)
{
u16 pkg_id = topology_physical_package_id(cpu);
@@ -2249,6 +2789,39 @@ static const struct x86_cpu_id intel_cqm_match[] = {
{}
};
+#ifdef CONFIG_CGROUP_PERF
+/* Start monitoring for all cgroups in cgroup hierarchy. */
+static int __start_monitoring_all_cgroups(void)
+{
+ int ret;
+ struct cgroup_subsys_state *css, *css_root;
+
+ lockdep_assert_held(&cqm_init_mutex);
+
+ rcu_read_lock();
+ /* Get css for root cgroup */
+ css_root = get_root_perf_css();
+
+ css_for_each_descendant_pre(css, css_root) {
+ if (!css_tryget_online(css))
+ continue;
+
+ rcu_read_unlock();
+ mutex_lock(&cqm_mutex);
+ ret = __css_go_online(css);
+ mutex_unlock(&cqm_mutex);
+
+ css_put(css);
+ if (ret)
+ return ret;
+
+ rcu_read_lock();
+ }
+ rcu_read_unlock();
+ return 0;
+}
+#endif
+
static int __init intel_cqm_init(void)
{
char *str, scale[20];
@@ -2324,17 +2897,32 @@ static int __init intel_cqm_init(void)
__perf_cpu_notifier(intel_cqm_cpu_notifier);
+ /* Use cqm_init_mutex to synchronize with css's online/offline. */
+ mutex_lock(&cqm_init_mutex);
+
+#ifdef CONFIG_CGROUP_PERF
+ ret = __start_monitoring_all_cgroups();
+ if (ret)
+ goto error_init_mutex;
+#endif
+
ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
if (ret)
- goto error;
+ goto error_init_mutex;
cpu_notifier_register_done();
+ static_branch_enable(&cqm_initialized_key);
+
+ mutex_unlock(&cqm_init_mutex);
+
pr_info("Intel CQM monitoring enabled with at least %u rmids per package.\n",
min_max_rmid + 1);
return ret;
+error_init_mutex:
+ mutex_unlock(&cqm_init_mutex);
error:
pr_err("Intel CQM perf registration failed: %d\n", ret);
cpu_notifier_register_done();
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index 25646a2..0f3da94 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -313,6 +313,7 @@ struct pkg_data {
* struct monr: MONitored Resource.
* @flags: Flags field for monr (XXX: More flags will be added
* with MBM).
+ * @mon_cgrp: The cgroup associated with this monr, if any
* @mon_event_group: The head of event's group that use this monr, if any.
* @parent: Parent in monr hierarchy.
* @children: List of children in monr hierarchy.
@@ -333,6 +334,7 @@ struct pkg_data {
struct monr {
u16 flags;
/* Back reference pointers */
+ struct perf_cgroup *mon_cgrp;
struct perf_event *mon_event_group;
struct monr *parent;
@@ -506,3 +508,14 @@ static unsigned int __cqm_min_progress_rate = CQM_DEFAULT_MIN_PROGRESS_RATE;
 * Its unit is bytes and must be scaled by cqm_l3_scale to obtain cache lines.
*/
static unsigned int __intel_cqm_max_threshold;
+
+
+struct cgrp_cqm_info {
+ /* Should the cgroup be continuously monitored? */
+ bool cont_monitoring;
+ struct monr *monr;
+};
+
+# define css_to_perf_cgroup(css_) container_of(css_, struct perf_cgroup, css)
+# define cgrp_to_cqm_info(cgrp_) ((struct cgrp_cqm_info *)cgrp_->arch_info)
+# define css_to_cqm_info(css_) cgrp_to_cqm_info(css_to_perf_cgroup(css_))
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f353061..c22d9e0 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -299,4 +299,37 @@ static inline void perf_check_microcode(void) { }
#define arch_perf_out_copy_user copy_from_user_nmi
+
+/*
+ * Hooks for architecture specific features of perf_event cgroup.
+ * Currently used by Intel's CQM.
+ */
+#ifdef CONFIG_INTEL_RDT
+#define perf_cgroup_arch_css_alloc \
+ perf_cgroup_arch_css_alloc
+inline int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
+ struct cgroup_subsys_state *new_css);
+
+#define perf_cgroup_arch_css_online \
+ perf_cgroup_arch_css_online
+inline int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_css_offline \
+ perf_cgroup_arch_css_offline
+inline void perf_cgroup_arch_css_offline(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_css_released \
+ perf_cgroup_arch_css_released
+inline void perf_cgroup_arch_css_released(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_css_free \
+ perf_cgroup_arch_css_free
+inline void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
+
+#else
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+
+#endif
+
#endif /* _ASM_X86_PERF_EVENT_H */
--
2.8.0.rc3.226.g39d4020
Allow a PMU to clean up an event before the event's teardown in
perf_events begins.
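As a rough sketch of the intended usage (the example_* names are hypothetical
and only illustrate the pairing with event_init; they are not part of this
series), a PMU that reserves a hardware slot on init can release it in the new
callback:

static struct pmu example_pmu;

static int example_event_init(struct perf_event *event)
{
	int slot;

	if (event->attr.type != example_pmu.type)
		return -ENOENT;

	/* example_alloc_hw_slot() is a hypothetical helper. */
	slot = example_alloc_hw_slot();
	if (slot < 0)
		return slot;
	event->hw.idx = slot;
	return 0;
}

static void example_event_terminate(struct perf_event *event)
{
	/* Runs before the event's fields are torn down in perf core. */
	example_release_hw_slot(event->hw.idx);
}

static struct pmu example_pmu = {
	.event_init	 = example_event_init,
	.event_terminate = example_event_terminate,
	/* remaining callbacks omitted in this sketch */
};

The hunks below call the new callback from _free_event() and from the
event_init error path, so a PMU sees it for every successfully initialized
event.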
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
include/linux/perf_event.h | 6 ++++++
kernel/events/core.c | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b010b55..81e29c6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -265,6 +265,12 @@ struct pmu {
int (*event_init) (struct perf_event *event);
/*
+ * Terminate the event for this PMU. Optional complement for a
+ * successful event_init. Called before the event fields are torn down.
+ */
+ void (*event_terminate) (struct perf_event *event);
+
+ /*
* Notification that the event was mapped or unmapped. Called
* in the context of the mapping task.
*/
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6fd226f..2a868a6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3787,6 +3787,8 @@ static void _free_event(struct perf_event *event)
ring_buffer_attach(event, NULL);
mutex_unlock(&event->mmap_mutex);
}
+ if (event->pmu->event_terminate)
+ event->pmu->event_terminate(event);
if (is_cgroup_event(event))
perf_detach_cgroup(event);
@@ -8293,6 +8295,8 @@ err_per_task:
exclusive_event_destroy(event);
err_pmu:
+ if (event->pmu->event_terminate)
+ event->pmu->event_terminate(event);
if (event->destroy)
event->destroy(event);
module_put(pmu->module);
--
2.8.0.rc3.226.g39d4020
The hooks allow architectures to extend the behavior of the
perf subsystem.
In this patch series, the hooks will be used by Intel's CQM PMU to
provide support for the llc_occupancy event.
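For reference, a minimal sketch of how an architecture would opt into one of
these hooks (the file paths and the function body are illustrative; only the
define-to-itself pattern is taken from this series). Hooks that an
architecture does not define fall back to the no-op defaults added to
perf_event.h below:

/* arch/foo/include/asm/perf_event.h (illustrative) */
#define perf_cgroup_arch_css_online perf_cgroup_arch_css_online
int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css);

/* arch/foo/events/perf_cgroup.c (illustrative) */
int perf_cgroup_arch_css_online(struct cgroup_subsys_state *css)
{
	/* Arch-specific bookkeeping for the newly onlined perf cgroup. */
	return 0;
}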
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
include/linux/perf_event.h | 28 +++++++++++++++++++++++++++-
kernel/events/core.c | 27 +++++++++++++++++++++++++++
2 files changed, 54 insertions(+), 1 deletion(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf29258..b010b55 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -690,7 +690,9 @@ struct perf_cgroup_info {
};
struct perf_cgroup {
- struct cgroup_subsys_state css;
+ /* Architecture specific information. */
+ void *arch_info;
+ struct cgroup_subsys_state css;
struct perf_cgroup_info __percpu *info;
};
@@ -1228,4 +1230,28 @@ _name##_show(struct device *dev, \
\
static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
+
+/*
+ * Hooks for architecture specific extensions for perf_cgroup.
+ */
+#ifndef perf_cgroup_arch_css_alloc
+# define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_online
+# define perf_cgroup_arch_css_online(css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_offline
+# define perf_cgroup_arch_css_offline(css) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_css_released
+# define perf_cgroup_arch_css_released(css) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_css_free
+# define perf_cgroup_arch_css_free(css) do { } while (0)
+#endif
+
#endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4aaec01..6fd226f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9794,6 +9794,7 @@ static struct cgroup_subsys_state *
perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct perf_cgroup *jc;
+ int ret;
jc = kzalloc(sizeof(*jc), GFP_KERNEL);
if (!jc)
@@ -9805,13 +9806,36 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
return ERR_PTR(-ENOMEM);
}
+ jc->arch_info = NULL;
+
+ ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
+ if (ret)
+ return ERR_PTR(ret);
+
return &jc->css;
}
+static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
+{
+ return perf_cgroup_arch_css_online(css);
+}
+
+static void perf_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+ perf_cgroup_arch_css_offline(css);
+}
+
+static void perf_cgroup_css_released(struct cgroup_subsys_state *css)
+{
+ perf_cgroup_arch_css_released(css);
+}
+
static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
{
struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
+ perf_cgroup_arch_css_free(css);
+
free_percpu(jc->info);
kfree(jc);
}
@@ -9836,6 +9860,9 @@ static void perf_cgroup_attach(struct cgroup_taskset *tset)
struct cgroup_subsys perf_event_cgrp_subsys = {
.css_alloc = perf_cgroup_css_alloc,
+ .css_online = perf_cgroup_css_online,
+ .css_offline = perf_cgroup_css_offline,
+ .css_released = perf_cgroup_css_released,
.css_free = perf_cgroup_css_free,
.attach = perf_cgroup_attach,
};
--
2.8.0.rc3.226.g39d4020
Move code around, delete unnecessary code and do some renaming in
order to increase readability of the next patches. Create cqm.h file.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 170 +++++++++++++++-----------------------------
arch/x86/events/intel/cqm.h | 42 +++++++++++
include/linux/perf_event.h | 8 +--
3 files changed, 103 insertions(+), 117 deletions(-)
create mode 100644 arch/x86/events/intel/cqm.h
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index d5eac8f..f678014 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -4,10 +4,9 @@
* Based very, very heavily on work by Peter Zijlstra.
*/
-#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
-#include <asm/pqr_common.h>
+#include "cqm.h"
#include "../perf_event.h"
#define MSR_IA32_QM_CTR 0x0c8e
@@ -16,13 +15,26 @@
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
+#define RMID_VAL_ERROR (1ULL << 63)
+#define RMID_VAL_UNAVAIL (1ULL << 62)
+
+#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
+
+#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+
+#define CQM_EVENT_ATTR_STR(_name, v, str) \
+static struct perf_pmu_events_attr event_attr_##v = { \
+ .attr = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), \
+ .id = 0, \
+ .event_str = str, \
+}
+
/*
* Updates caller cpu's cache.
*/
static inline void __update_pqr_rmid(u32 rmid)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
if (state->rmid == rmid)
return;
state->rmid = rmid;
@@ -30,37 +42,18 @@ static inline void __update_pqr_rmid(u32 rmid)
}
/*
- * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
- * Also protects event->hw.cqm_rmid
- *
- * Hold either for stability, both for modification of ->hw.cqm_rmid.
- */
-static DEFINE_MUTEX(cache_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
-
-#define CQM_EVENT_ATTR_STR(_name, v, str) \
-static struct perf_pmu_events_attr event_attr_##v = { \
- .attr = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), \
- .id = 0, \
- .event_str = str, \
-}
-
-/*
* Groups of events that have the same target(s), one RMID per group.
+ * Protected by cqm_mutex.
*/
static LIST_HEAD(cache_groups);
+static DEFINE_MUTEX(cqm_mutex);
+static DEFINE_RAW_SPINLOCK(cache_lock);
/*
* Mask of CPUs for reading CQM values. We only need one per-socket.
*/
static cpumask_t cqm_cpumask;
-#define RMID_VAL_ERROR (1ULL << 63)
-#define RMID_VAL_UNAVAIL (1ULL << 62)
-
-#define QOS_L3_OCCUP_EVENT_ID (1 << 0)
-
-#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
/*
* This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -71,8 +64,6 @@ static cpumask_t cqm_cpumask;
*/
static u32 intel_cqm_rotation_rmid;
-#define INVALID_RMID (-1)
-
/*
* Is @rmid valid for programming the hardware?
*
@@ -140,7 +131,7 @@ struct cqm_rmid_entry {
* rotation worker moves RMIDs from the limbo list to the free list once
* the occupancy value drops below __intel_cqm_threshold.
*
- * Both lists are protected by cache_mutex.
+ * Both lists are protected by cqm_mutex.
*/
static LIST_HEAD(cqm_rmid_free_lru);
static LIST_HEAD(cqm_rmid_limbo_lru);
@@ -172,13 +163,13 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
/*
* Returns < 0 on fail.
*
- * We expect to be called with cache_mutex held.
+ * We expect to be called with cqm_mutex held.
*/
static u32 __get_rmid(void)
{
struct cqm_rmid_entry *entry;
- lockdep_assert_held(&cache_mutex);
+ lockdep_assert_held(&cqm_mutex);
if (list_empty(&cqm_rmid_free_lru))
return INVALID_RMID;
@@ -193,7 +184,7 @@ static void __put_rmid(u32 rmid)
{
struct cqm_rmid_entry *entry;
- lockdep_assert_held(&cache_mutex);
+ lockdep_assert_held(&cqm_mutex);
WARN_ON(!__rmid_valid(rmid));
entry = __rmid_entry(rmid);
@@ -237,9 +228,9 @@ static int intel_cqm_setup_rmid_cache(void)
entry = __rmid_entry(0);
list_del(&entry->list);
- mutex_lock(&cache_mutex);
+ mutex_lock(&cqm_mutex);
intel_cqm_rotation_rmid = __get_rmid();
- mutex_unlock(&cache_mutex);
+ mutex_unlock(&cqm_mutex);
return 0;
fail:
@@ -250,6 +241,7 @@ fail:
return -ENOMEM;
}
+
/*
* Determine if @a and @b measure the same set of tasks.
*
@@ -287,49 +279,11 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
return false;
}
-#ifdef CONFIG_CGROUP_PERF
-static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
-{
- if (event->attach_state & PERF_ATTACH_TASK)
- return perf_cgroup_from_task(event->hw.target, event->ctx);
-
- return event->cgrp;
-}
-#endif
-
struct rmid_read {
u32 rmid;
atomic64_t value;
};
-static void intel_cqm_event_read(struct perf_event *event);
-
-/*
- * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
- * cachelines are still tagged with RMIDs in limbo, we progressively
- * increment the threshold until we find an RMID in limbo with <=
- * __intel_cqm_threshold lines tagged. This is designed to mitigate the
- * problem where cachelines tagged with an RMID are not steadily being
- * evicted.
- *
- * On successful rotations we decrease the threshold back towards zero.
- *
- * __intel_cqm_max_threshold provides an upper bound on the threshold,
- * and is measured in bytes because it's exposed to userland.
- */
-static unsigned int __intel_cqm_threshold;
-static unsigned int __intel_cqm_max_threshold;
-
-/*
- * Initially use this constant for both the limbo queue time and the
- * rotation timer interval, pmu::hrtimer_interval_ms.
- *
- * They don't need to be the same, but the two are related since if you
- * rotate faster than you recycle RMIDs, you may run out of available
- * RMIDs.
- */
-#define RMID_DEFAULT_QUEUE_TIME 250 /* ms */
-
static struct pmu intel_cqm_pmu;
/*
@@ -344,7 +298,7 @@ static void intel_cqm_setup_event(struct perf_event *event,
bool conflict = false;
u32 rmid;
- list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
+ list_for_each_entry(iter, &cache_groups, hw.cqm_event_groups_entry) {
rmid = iter->hw.cqm_rmid;
if (__match_event(iter, event)) {
@@ -390,24 +344,24 @@ out:
static inline bool cqm_group_leader(struct perf_event *event)
{
- return !list_empty(&event->hw.cqm_groups_entry);
+ return !list_empty(&event->hw.cqm_event_groups_entry);
}
static void intel_cqm_event_start(struct perf_event *event, int mode)
{
- if (!(event->hw.cqm_state & PERF_HES_STOPPED))
+ if (!(event->hw.state & PERF_HES_STOPPED))
return;
- event->hw.cqm_state &= ~PERF_HES_STOPPED;
+ event->hw.state &= ~PERF_HES_STOPPED;
__update_pqr_rmid(event->hw.cqm_rmid);
}
static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
- if (event->hw.cqm_state & PERF_HES_STOPPED)
+ if (event->hw.state & PERF_HES_STOPPED)
return;
- event->hw.cqm_state |= PERF_HES_STOPPED;
+ event->hw.state |= PERF_HES_STOPPED;
intel_cqm_event_read(event);
__update_pqr_rmid(0);
}
@@ -419,7 +373,7 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
raw_spin_lock_irqsave(&cache_lock, flags);
- event->hw.cqm_state = PERF_HES_STOPPED;
+ event->hw.state = PERF_HES_STOPPED;
rmid = event->hw.cqm_rmid;
if (__rmid_valid(rmid) && (mode & PERF_EF_START))
@@ -433,16 +387,16 @@ static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
- mutex_lock(&cache_mutex);
+ mutex_lock(&cqm_mutex);
/*
* If there's another event in this group...
*/
- if (!list_empty(&event->hw.cqm_group_entry)) {
- group_other = list_first_entry(&event->hw.cqm_group_entry,
+ if (!list_empty(&event->hw.cqm_event_group_entry)) {
+ group_other = list_first_entry(&event->hw.cqm_event_group_entry,
struct perf_event,
- hw.cqm_group_entry);
- list_del(&event->hw.cqm_group_entry);
+ hw.cqm_event_group_entry);
+ list_del(&event->hw.cqm_event_group_entry);
}
/*
@@ -454,18 +408,18 @@ static void intel_cqm_event_destroy(struct perf_event *event)
* destroy the group and return the RMID.
*/
if (group_other) {
- list_replace(&event->hw.cqm_groups_entry,
- &group_other->hw.cqm_groups_entry);
+ list_replace(&event->hw.cqm_event_groups_entry,
+ &group_other->hw.cqm_event_groups_entry);
} else {
u32 rmid = event->hw.cqm_rmid;
if (__rmid_valid(rmid))
__put_rmid(rmid);
- list_del(&event->hw.cqm_groups_entry);
+ list_del(&event->hw.cqm_event_groups_entry);
}
}
- mutex_unlock(&cache_mutex);
+ mutex_unlock(&cqm_mutex);
}
static int intel_cqm_event_init(struct perf_event *event)
@@ -488,25 +442,26 @@ static int intel_cqm_event_init(struct perf_event *event)
event->attr.sample_period) /* no sampling */
return -EINVAL;
- INIT_LIST_HEAD(&event->hw.cqm_group_entry);
- INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
+ INIT_LIST_HEAD(&event->hw.cqm_event_groups_entry);
+ INIT_LIST_HEAD(&event->hw.cqm_event_group_entry);
event->destroy = intel_cqm_event_destroy;
- mutex_lock(&cache_mutex);
+ mutex_lock(&cqm_mutex);
+
/* Will also set rmid */
intel_cqm_setup_event(event, &group);
if (group) {
- list_add_tail(&event->hw.cqm_group_entry,
- &group->hw.cqm_group_entry);
+ list_add_tail(&event->hw.cqm_event_group_entry,
+ &group->hw.cqm_event_group_entry);
} else {
- list_add_tail(&event->hw.cqm_groups_entry,
- &cache_groups);
+ list_add_tail(&event->hw.cqm_event_groups_entry,
+ &cache_groups);
}
- mutex_unlock(&cache_mutex);
+ mutex_unlock(&cqm_mutex);
return 0;
}
@@ -543,14 +498,14 @@ static struct attribute_group intel_cqm_format_group = {
};
static ssize_t
-max_recycle_threshold_show(struct device *dev, struct device_attribute *attr,
- char *page)
+max_recycle_threshold_show(
+ struct device *dev, struct device_attribute *attr, char *page)
{
ssize_t rv;
- mutex_lock(&cache_mutex);
+ mutex_lock(&cqm_mutex);
rv = snprintf(page, PAGE_SIZE-1, "%u\n", __intel_cqm_max_threshold);
- mutex_unlock(&cache_mutex);
+ mutex_unlock(&cqm_mutex);
return rv;
}
@@ -560,25 +515,16 @@ max_recycle_threshold_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
- unsigned int bytes, cachelines;
+ unsigned int bytes;
int ret;
ret = kstrtouint(buf, 0, &bytes);
if (ret)
return ret;
- mutex_lock(&cache_mutex);
-
+ mutex_lock(&cqm_mutex);
__intel_cqm_max_threshold = bytes;
- cachelines = bytes / cqm_l3_scale;
-
- /*
- * The new maximum takes effect immediately.
- */
- if (__intel_cqm_threshold > cachelines)
- __intel_cqm_threshold = cachelines;
-
- mutex_unlock(&cache_mutex);
+ mutex_unlock(&cqm_mutex);
return count;
}
@@ -602,7 +548,7 @@ static const struct attribute_group *intel_cqm_attr_groups[] = {
};
static struct pmu intel_cqm_pmu = {
- .hrtimer_interval_ms = RMID_DEFAULT_QUEUE_TIME,
+ .hrtimer_interval_ms = CQM_DEFAULT_ROTATION_PERIOD,
.attr_groups = intel_cqm_attr_groups,
.task_ctx_nr = perf_sw_context,
.event_init = intel_cqm_event_init,
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
new file mode 100644
index 0000000..e25d0a1
--- /dev/null
+++ b/arch/x86/events/intel/cqm.h
@@ -0,0 +1,42 @@
+/*
+ * Intel Cache Quality-of-Service Monitoring (CQM) support.
+ *
+ * A Resource Manager ID (RMID) is a u32 value that, when programmed in a
+ * logical CPU, will allow the LLC cache to associate the changes in occupancy
+ * generated by that cpu (cache lines allocations - deallocations) to the RMID.
+ * If an rmid has been assigned to a thread T long enough for all cache lines
+ * used by T to be allocated, then the occupancy reported by the hardware is
+ * equal to the total cache occupancy for T.
+ *
+ * Groups of threads that are to be monitored together (such as cgroups
+ * or processes) can share an RMID.
+ *
+ * This driver implements a tree hierarchy of Monitored Resources (monr). Each
+ * monr is a cgroup, a process or a thread that needs one single RMID.
+ */
+
+#include <linux/perf_event.h>
+#include <asm/pqr_common.h>
+
+/*
+ * Minimum time elapsed between reads of occupancy value for an RMID when
+ * traversing the monr hierarchy.
+ */
+#define RMID_DEFAULT_MIN_UPDATE_TIME 20 /* ms */
+
+# define INVALID_RMID (-1)
+
+/*
+ * Time between execution of rotation logic. The frequency of execution does
+ * not affect the rate at which RMIDs are recycled, except for the delay in
+ * updating the prmids and their pools.
+ * The rotation period is stored in pmu->hrtimer_interval_ms.
+ */
+#define CQM_DEFAULT_ROTATION_PERIOD 1200 /* ms */
+
+/*
+ * __intel_cqm_max_threshold provides an upper bound on the threshold,
+ * and is measured in bytes because it's exposed to userland.
+ * It's units are bytes must be scaled by cqm_l3_scale to obtain cache lines.
+ * Its unit is bytes and must be scaled by cqm_l3_scale to obtain cache lines.
+static unsigned int __intel_cqm_max_threshold;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3a847bf..5eb7dea 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -120,11 +120,9 @@ struct hw_perf_event {
};
#ifdef CONFIG_INTEL_RDT
struct { /* intel_cqm */
- int cqm_state;
- u32 cqm_rmid;
- struct list_head cqm_events_entry;
- struct list_head cqm_groups_entry;
- struct list_head cqm_group_entry;
+ u32 cqm_rmid;
+ struct list_head cqm_event_group_entry;
+ struct list_head cqm_event_groups_entry;
};
#endif
struct { /* itrace */
--
2.8.0.rc3.226.g39d4020
Cgroups and/or tasks that need to be monitored using an RMID
are abstracted as MOnitored Resources (monrs). A CQM event points
to a monr to read occupancy (and in the future other attributes) of the
RMIDs associated to the monr.
The monrs form a hierarchy that captures the dependency within the
monitored cgroups and/or tasks/threads. The monr of a cgroup A which
contains another monitored cgroup, B, is an ancestor of B's monr.
Each monr contains one Package MONitored Resource (pmonr) per package.
The monitoring of a monr in a package starts when its corresponding
pmonr receives an RMID for that package (a prmid).
The prmids are lazily assigned to a pmonr the first time a thread
using the monr is scheduled in the package. When a pmonr with a
valid prmid is scheduled, that pmonr's prmid's RMID is written to the
msr MSR_IA32_PQR_ASSOC. If no prmid is available, the prmid of the lowest
ancestor in the monr hierarchy with a valid prmid for that package is
used instead.
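In simplified pseudo-C (locking, the prmid_summary fast path and the lazy
allocation itself are omitted; sched_in_rmid() is only an illustration of the
fallback, not a function added by this patch), the RMID that ends up in the
MSR is chosen as follows:

static u32 sched_in_rmid(struct monr *monr, u16 pkg_id)
{
	struct pmonr *pmonr = monr->pmonrs[pkg_id];

	/* Use this pmonr's own prmid when it has one ((A)state). */
	if (pmonr->prmid)
		return pmonr->prmid->rmid;

	/* Otherwise borrow from the lowest ancestor that has one. */
	while ((monr = monr->parent)) {
		pmonr = monr->pmonrs[pkg_id];
		if (pmonr->prmid)
			return pmonr->prmid->rmid;
	}

	/* Unreachable: the root monr is always (A)ctive. */
	return 0;
}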
A pmonr can be in one of the following three states:
- (A)ctive: When it has a prmid available.
- (I)nherited: When no prmid is available. In this state, it "borrows"
the prmid of its lowest ancestor in (A)ctive state during sched in
(writes its ancestor's RMID into hw while any associated thread is
executed). But, since the "borrowed" prmid does not monitor the
occupancy of this monr, the monr cannot report occupancy individually.
- (U)nused: When the monr does not have a prmid yet and has not failed
to acquire one (either because no thread has been scheduled while
monitoring for this pmonr is active, or because it has completed a
transition to (U)state, i.e. termination of the associated
event/cgroup).
To avoid synchronization overhead, each pmonr contains a prmid_summary.
The union prmid_summary is a concise representation of the pmonr's prmid
state and its raw RMIDs. Due to its size, the prmid_summary can be read
atomically without a LOCK instruction. Every state transition atomically
updates the prmid_summary. This avoids locking during sched in and sched
out of threads, except when a prmid needs to be allocated, which only
occurs the first time a monr is scheduled in a package.
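A condensed view of that summary (the union below matches the one added to
cqm.h in this patch; pmonr_sched_rmid() is only an illustration of the
lock-free read, not a function added here):

union prmid_summary {
	long long value;
	struct {
		u32 sched_rmid;	/* RMID to write into MSR_IA32_PQR_ASSOC. */
		u32 read_rmid;	/* RMID to read occupancy from. */
	};
};

static inline u32 pmonr_sched_rmid(struct pmonr *pmonr)
{
	union prmid_summary summary;

	/* One 64-bit atomic read; no LOCK prefix and no spinlock. */
	summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
	return summary.sched_rmid;
}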
This patch introduces a first iteration of the monr hierarchy
that maintains two levels: the root monr, at the top, and all other monrs
as leaves. The root monr is always (A)ctive.
This patch also implements the essential mechanism of per-package lazy
allocation of RMIDs.
The (I)state and the transitions from and to it are introduced in the
next patch in this series.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 633 ++++++++++++++++++++++++++++++++++++--------
arch/x86/events/intel/cqm.h | 149 +++++++++++
include/linux/perf_event.h | 2 +-
3 files changed, 674 insertions(+), 110 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 541e515..65551bb 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -35,28 +35,66 @@ static struct perf_pmu_events_attr event_attr_##v = { \
static LIST_HEAD(cache_groups);
static DEFINE_MUTEX(cqm_mutex);
+struct monr *monr_hrchy_root;
+
struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
-/*
- * Is @rmid valid for programming the hardware?
- *
- * rmid 0 is reserved by the hardware for all non-monitored tasks, which
- * means that we should never come across an rmid with that value.
- * Likewise, an rmid value of -1 is used to indicate "no rmid currently
- * assigned" and is used as part of the rotation code.
- */
-static inline bool __rmid_valid(u32 rmid)
+static inline bool __pmonr__in_astate(struct pmonr *pmonr)
{
- if (!rmid || rmid == INVALID_RMID)
- return false;
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ return pmonr->prmid;
+}
- return true;
+static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
+{
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ return !pmonr->prmid;
}
-static u64 __rmid_read(u32 rmid)
+static inline bool monr__is_root(struct monr *monr)
{
- /* XXX: Placeholder, will be removed in next patch. */
- return 0;
+ return monr_hrchy_root == monr;
+}
+
+static inline bool monr__is_mon_active(struct monr *monr)
+{
+ return monr->flags & MONR_MON_ACTIVE;
+}
+
+static inline void __monr__set_summary_read_rmid(struct monr *monr, u32 rmid)
+{
+ int i;
+ struct pmonr *pmonr;
+ union prmid_summary summary;
+
+ monr_hrchy_assert_held_raw_spin_locks();
+
+ cqm_pkg_id_for_each_online(i) {
+ pmonr = monr->pmonrs[i];
+ WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+ summary.read_rmid = rmid;
+ atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+ }
+}
+
+static inline void __monr__set_mon_active(struct monr *monr)
+{
+ monr_hrchy_assert_held_raw_spin_locks();
+ __monr__set_summary_read_rmid(monr, 0);
+ monr->flags |= MONR_MON_ACTIVE;
+}
+
+/*
+ * All pmonrs must be in (U)state.
+ * clearing MONR_MON_ACTIVE prevents (U)state prmids from transitioning
+ * to another state.
+ */
+static inline void __monr__clear_mon_active(struct monr *monr)
+{
+ monr_hrchy_assert_held_raw_spin_locks();
+ __monr__set_summary_read_rmid(monr, INVALID_RMID);
+ monr->flags &= ~MONR_MON_ACTIVE;
}
/*
@@ -133,22 +171,6 @@ static inline bool __valid_pkg_id(u16 pkg_id)
return pkg_id < PQR_MAX_NR_PKGS;
}
-/*
- * Returns < 0 on fail.
- *
- * We expect to be called with cache_mutex held.
- */
-static u32 __get_rmid(void)
-{
- /* XXX: Placeholder, will be removed in next patch. */
- return 0;
-}
-
-static void __put_rmid(u32 rmid)
-{
- /* XXX: Placeholder, will be removed in next patch. */
-}
-
/* Init cqm pkg_data for @cpu 's package. */
static int pkg_data_init_cpu(int cpu)
{
@@ -187,6 +209,10 @@ static int pkg_data_init_cpu(int cpu)
}
INIT_LIST_HEAD(&pkg_data->free_prmids_pool);
+ INIT_LIST_HEAD(&pkg_data->active_prmids_pool);
+ INIT_LIST_HEAD(&pkg_data->nopmonr_limbo_prmids_pool);
+
+ INIT_LIST_HEAD(&pkg_data->astate_pmonrs_lru);
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
@@ -225,12 +251,129 @@ __prmid_from_rmid(u16 pkg_id, u32 rmid)
return prmid;
}
+static struct pmonr *pmonr_alloc(int cpu)
+{
+ struct pmonr *pmonr;
+ union prmid_summary summary;
+
+ pmonr = kmalloc_node(sizeof(struct pmonr),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!pmonr)
+ return ERR_PTR(-ENOMEM);
+
+ pmonr->prmid = NULL;
+
+ pmonr->monr = NULL;
+ INIT_LIST_HEAD(&pmonr->rotation_entry);
+
+ pmonr->pkg_id = topology_physical_package_id(cpu);
+ summary.sched_rmid = INVALID_RMID;
+ summary.read_rmid = INVALID_RMID;
+ atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+
+ return pmonr;
+}
+
+static void pmonr_dealloc(struct pmonr *pmonr)
+{
+ kfree(pmonr);
+}
+
+/*
+ * @root: Common ancestor.
+ * @a must be distinct from @b.
+ * Returns true if @a is an ancestor of @b.
+ */
+static inline bool
+__monr_hrchy_is_ancestor(struct monr *root,
+ struct monr *a, struct monr *b)
+{
+ WARN_ON_ONCE(!root || !a || !b);
+ WARN_ON_ONCE(a == b);
+
+ if (root == a)
+ return true;
+ if (root == b)
+ return false;
+
+ b = b->parent;
+ /* Break at the root */
+ while (b != root) {
+ WARN_ON_ONCE(!b);
+ if (a == b)
+ return true;
+ b = b->parent;
+ }
+ return false;
+}
+
+/* helper function to finish transition to astate. */
+static inline void
+__pmonr__finish_to_astate(struct pmonr *pmonr, struct prmid *prmid)
+{
+ union prmid_summary summary;
+
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+ pmonr->prmid = prmid;
+
+ list_move_tail(
+ &prmid->pool_entry, &__pkg_data(pmonr, active_prmids_pool));
+ list_move_tail(
+ &pmonr->rotation_entry, &__pkg_data(pmonr, astate_pmonrs_lru));
+
+ summary.sched_rmid = pmonr->prmid->rmid;
+ summary.read_rmid = pmonr->prmid->rmid;
+ atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+}
+
+static inline void
+__pmonr__ustate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
+{
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ __pmonr__finish_to_astate(pmonr, prmid);
+}
+
+static inline void
+__pmonr__to_ustate(struct pmonr *pmonr)
+{
+ union prmid_summary summary;
+
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+ /* Do not warn on re-entering (U)state, to simplify cleanup
+ * of initialized states that were not scheduled.
+ */
+ if (__pmonr__in_ustate(pmonr))
+ return;
+
+ if (__pmonr__in_astate(pmonr)) {
+ WARN_ON_ONCE(!pmonr->prmid);
+
+ list_move_tail(&pmonr->prmid->pool_entry,
+ &__pkg_data(pmonr, nopmonr_limbo_prmids_pool));
+ pmonr->prmid = NULL;
+ } else {
+ WARN_ON_ONCE(true);
+ return;
+ }
+ list_del_init(&pmonr->rotation_entry);
+
+ summary.sched_rmid = INVALID_RMID;
+ summary.read_rmid =
+ monr__is_mon_active(pmonr->monr) ? 0 : INVALID_RMID;
+
+ atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+ WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+}
+
static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
{
int r;
unsigned long flags;
struct prmid *prmid;
struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+ struct pmonr *root_pmonr;
if (!__valid_pkg_id(pkg_id))
return -EINVAL;
@@ -252,12 +395,13 @@ static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
&pkg_data->pkg_data_lock, flags, pkg_id);
pkg_data->prmids_by_rmid[r] = prmid;
+ list_add_tail(&prmid->pool_entry, &pkg_data->free_prmids_pool);
/* RMID 0 is special and makes the root of rmid hierarchy. */
- if (r != 0)
- list_add_tail(&prmid->pool_entry,
- &pkg_data->free_prmids_pool);
-
+ if (r == 0) {
+ root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
+ __pmonr__ustate_to_astate(root_pmonr, prmid);
+ }
raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
}
return 0;
@@ -273,6 +417,232 @@ fail:
}
+/* Alloc monr with all pmonrs in (U)state. */
+static struct monr *monr_alloc(void)
+{
+ int i;
+ struct pmonr *pmonr;
+ struct monr *monr;
+
+ monr = kmalloc(sizeof(struct monr), GFP_KERNEL);
+
+ if (!monr)
+ return ERR_PTR(-ENOMEM);
+
+ monr->flags = 0;
+ monr->parent = NULL;
+ INIT_LIST_HEAD(&monr->children);
+ INIT_LIST_HEAD(&monr->parent_entry);
+ monr->mon_event_group = NULL;
+
+ /* Iterate over all pkgs, even uninitialized ones. */
+ for (i = 0; i < PQR_MAX_NR_PKGS; i++) {
+ /* Do not create pmonrs for uninitialized packages. */
+ if (!cqm_pkgs_data[i]) {
+ monr->pmonrs[i] = NULL;
+ continue;
+ }
+ /* Rotation cpu is on pmonr's package. */
+ pmonr = pmonr_alloc(cqm_pkgs_data[i]->rotation_cpu);
+ if (IS_ERR(pmonr))
+ goto clean_pmonrs;
+ pmonr->monr = monr;
+ monr->pmonrs[i] = pmonr;
+ }
+ return monr;
+
+clean_pmonrs:
+ while (i--) {
+ if (cqm_pkgs_data[i])
+ kfree(monr->pmonrs[i]);
+ }
+ kfree(monr);
+ return ERR_PTR(PTR_ERR(pmonr));
+}
+
+/* Only can dealloc monrs with all pmonrs in (U)state. */
+static void monr_dealloc(struct monr *monr)
+{
+ int i;
+
+ cqm_pkg_id_for_each_online(i)
+ pmonr_dealloc(monr->pmonrs[i]);
+
+ kfree(monr);
+}
+
+/*
+ * Wrappers for monr manipulation in events.
+ *
+ */
+static inline struct monr *monr_from_event(struct perf_event *event)
+{
+ return (struct monr *) READ_ONCE(event->hw.cqm_monr);
+}
+
+static inline void event_set_monr(struct perf_event *event, struct monr *monr)
+{
+ WRITE_ONCE(event->hw.cqm_monr, monr);
+}
+
+/*
+ * Always finds an rmid to schedule. To be called from the scheduler path.
+ * The fast path is lock-free and covers the common case where the rmid for
+ * the current package has been used before.
+ * On failure, verify that the monr is active and, if it is, try to obtain a
+ * free rmid and set the pmonr to (A)state.
+ * On failure, traverse up the monr_hrchy until finding a prmid for this
+ * pkg_id and set the pmonr to (I)state.
+ * Called during task switch, it will set the pmonr's prmid_summary to the
+ * sched and read rmids that reflect the pmonr's state.
+ */
+static inline void
+monr_hrchy_get_next_prmid_summary(struct pmonr *pmonr)
+{
+ union prmid_summary summary;
+
+ /*
+ * First, do lock-free fastpath.
+ */
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+ if (summary.sched_rmid != INVALID_RMID)
+ return;
+
+ if (!prmid_summary__is_mon_active(summary))
+ return;
+
+ /*
+ * Lock-free path failed at first attempt. Now acquire lock and repeat
+ * in case the monr was modified in the mean time.
+ * This time try to obtain free rmid and update pmonr accordingly,
+ * instead of failing fast.
+ */
+ raw_spin_lock_nested(&__pkg_data(pmonr, pkg_data_lock), pmonr->pkg_id);
+
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+ if (summary.sched_rmid != INVALID_RMID) {
+ raw_spin_unlock(&__pkg_data(pmonr, pkg_data_lock));
+ return;
+ }
+
+ /* Do not try to obtain RMID if monr is not active. */
+ if (!prmid_summary__is_mon_active(summary)) {
+ raw_spin_unlock(&__pkg_data(pmonr, pkg_data_lock));
+ return;
+ }
+
+ /*
+ * Can only fail if it was in (U)state.
+ * Try to obtain a free prmid and go to (A)state, if not possible,
+ * it should go to (I)state.
+ */
+ WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+
+ if (list_empty(&__pkg_data(pmonr, free_prmids_pool))) {
+ /* Failed to obtain a valid rmid in this package for this
+ * monr. In next patches it will transition to (I)state.
+ * For now, stay in (U)state (do nothing).
+ */
+ } else {
+ /* Transition to (A)state using free prmid. */
+ __pmonr__ustate_to_astate(
+ pmonr,
+ list_first_entry(&__pkg_data(pmonr, free_prmids_pool),
+ struct prmid, pool_entry));
+ }
+ raw_spin_unlock(&__pkg_data(pmonr, pkg_data_lock));
+}
+
+static inline void __assert_monr_is_leaf(struct monr *monr)
+{
+ int i;
+
+ monr_hrchy_assert_held_mutexes();
+ monr_hrchy_assert_held_raw_spin_locks();
+
+ cqm_pkg_id_for_each_online(i)
+ WARN_ON_ONCE(!__pmonr__in_ustate(monr->pmonrs[i]));
+
+ WARN_ON_ONCE(!list_empty(&monr->children));
+}
+
+static inline void
+__monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
+{
+ monr_hrchy_assert_held_mutexes();
+ monr_hrchy_assert_held_raw_spin_locks();
+
+ __assert_monr_is_leaf(monr);
+
+ list_add_tail(&monr->parent_entry, &parent->children);
+ monr->parent = parent;
+}
+
+static inline void
+__monr_hrchy_remove_leaf(struct monr *monr)
+{
+ /* Since root cannot be removed, monr must have a parent */
+ WARN_ON_ONCE(!monr->parent);
+
+ monr_hrchy_assert_held_mutexes();
+ monr_hrchy_assert_held_raw_spin_locks();
+
+ __assert_monr_is_leaf(monr);
+
+ list_del_init(&monr->parent_entry);
+ monr->parent = NULL;
+}
+
+static int __monr_hrchy_attach_cpu_event(struct perf_event *event)
+{
+ lockdep_assert_held(&cqm_mutex);
+ WARN_ON_ONCE(monr_from_event(event));
+
+ event_set_monr(event, monr_hrchy_root);
+ return 0;
+}
+
+/* task events are always leaves in the monr_hierarchy */
+static int __monr_hrchy_attach_task_event(struct perf_event *event,
+ struct monr *parent_monr)
+{
+ struct monr *monr;
+ unsigned long flags;
+ int i;
+
+ lockdep_assert_held(&cqm_mutex);
+
+ monr = monr_alloc();
+ if (IS_ERR(monr))
+ return PTR_ERR(monr);
+ event_set_monr(event, monr);
+ monr->mon_event_group = event;
+
+ monr_hrchy_acquire_locks(flags, i);
+ __monr_hrchy_insert_leaf(monr, parent_monr);
+ __monr__set_mon_active(monr);
+ monr_hrchy_release_locks(flags, i);
+
+ return 0;
+}
+
+/*
+ * Find appropriate position in hierarchy and set monr. Create new
+ * monr if necessary.
+ * Locks rmid hrchy.
+ */
+static int monr_hrchy_attach_event(struct perf_event *event)
+{
+ struct monr *monr_parent;
+
+ if (!event->cgrp && !(event->attach_state & PERF_ATTACH_TASK))
+ return __monr_hrchy_attach_cpu_event(event);
+
+ /* Two-levels hierarchy: Root and all event monr underneath it. */
+ monr_parent = monr_hrchy_root;
+ return __monr_hrchy_attach_task_event(event, monr_parent);
+}
+
/*
* Determine if @a and @b measure the same set of tasks.
*
@@ -291,7 +661,7 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
return false;
#endif
- /* If not task event, we're machine wide */
+ /* If not a task event, it's a cgroup or a non-task cpu event. */
if (!(b->attach_state & PERF_ATTACH_TASK))
return true;
@@ -310,69 +680,51 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
return false;
}
-struct rmid_read {
- u32 rmid;
- atomic64_t value;
-};
-
static struct pmu intel_cqm_pmu;
/*
* Find a group and setup RMID.
*
- * If we're part of a group, we use the group's RMID.
+ * If we're part of a group, we use the group's monr.
*/
-static void intel_cqm_setup_event(struct perf_event *event,
- struct perf_event **group)
+static int
+intel_cqm_setup_event(struct perf_event *event, struct perf_event **group)
{
struct perf_event *iter;
- bool conflict = false;
- u32 rmid;
+ struct monr *monr;
+ *group = NULL;
- list_for_each_entry(iter, &cache_groups, hw.cqm_event_groups_entry) {
- rmid = iter->hw.cqm_rmid;
+ lockdep_assert_held(&cqm_mutex);
+ list_for_each_entry(iter, &cache_groups, hw.cqm_event_groups_entry) {
+ monr = monr_from_event(iter);
if (__match_event(iter, event)) {
- /* All tasks in a group share an RMID */
- event->hw.cqm_rmid = rmid;
+ /* All tasks in a group share an monr. */
+ event_set_monr(event, monr);
*group = iter;
- return;
+ return 0;
}
}
-
- if (conflict)
- rmid = INVALID_RMID;
- else
- rmid = __get_rmid();
-
- event->hw.cqm_rmid = rmid;
+ /*
+ * Since no match was found, create a new monr and set this
+ * event as head of a new cache group. All events in this cache group
+ * will share the monr.
+ */
+ return monr_hrchy_attach_event(event);
}
+/* Read current package immediately and remote pkg (if any) from cache. */
static void intel_cqm_event_read(struct perf_event *event)
{
- unsigned long flags;
- u32 rmid;
- u64 val;
+ union prmid_summary summary;
+ struct prmid *prmid;
u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ struct pmonr *pmonr = monr_from_event(event)->pmonrs[pkg_id];
- raw_spin_lock_irqsave(&cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
- rmid = event->hw.cqm_rmid;
-
- if (!__rmid_valid(rmid))
- goto out;
-
- val = __rmid_read(rmid);
-
- /*
- * Ignore this reading on error states and do not update the value.
- */
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- goto out;
-
- local64_set(&event->count, val);
-out:
- raw_spin_unlock_irqrestore(
- &cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+ prmid = __prmid_from_rmid(pkg_id, summary.read_rmid);
+ cqm_prmid_update(prmid);
+ local64_set(&event->count, atomic64_read(&prmid->last_read_value));
}
static inline bool cqm_group_leader(struct perf_event *event)
@@ -380,52 +732,81 @@ static inline bool cqm_group_leader(struct perf_event *event)
return !list_empty(&event->hw.cqm_event_groups_entry);
}
-static void intel_cqm_event_start(struct perf_event *event, int mode)
+static inline void __intel_cqm_event_start(
+ struct perf_event *event, union prmid_summary summary)
{
u16 pkg_id = topology_physical_package_id(smp_processor_id());
if (!(event->hw.state & PERF_HES_STOPPED))
return;
event->hw.state &= ~PERF_HES_STOPPED;
- __update_pqr_prmid(__prmid_from_rmid(pkg_id, event->hw.cqm_rmid));
+ __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
+}
+
+static void intel_cqm_event_start(struct perf_event *event, int mode)
+{
+ union prmid_summary summary;
+ u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ struct pmonr *pmonr = monr_from_event(event)->pmonrs[pkg_id];
+
+ /* Utilize most up to date pmonr summary. */
+ monr_hrchy_get_next_prmid_summary(pmonr);
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+ __intel_cqm_event_start(event, summary);
}
static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
+ union prmid_summary summary;
u16 pkg_id = topology_physical_package_id(smp_processor_id());
+ struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
+
if (event->hw.state & PERF_HES_STOPPED)
return;
event->hw.state |= PERF_HES_STOPPED;
- intel_cqm_event_read(event);
- __update_pqr_prmid(__prmid_from_rmid(pkg_id, 0));
+
+ summary.value = atomic64_read(&root_pmonr->prmid_summary_atomic);
+ /* Occupancy of CQM events is obtained at read. No need to read
+ * when the event is stopped since reads on inactive cpus succeed.
+ */
+ __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
}
static int intel_cqm_event_add(struct perf_event *event, int mode)
{
- unsigned long flags;
- u32 rmid;
+ struct monr *monr;
+ struct pmonr *pmonr;
+ union prmid_summary summary;
u16 pkg_id = topology_physical_package_id(smp_processor_id());
- raw_spin_lock_irqsave(&cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
+ monr = monr_from_event(event);
+ pmonr = monr->pmonrs[pkg_id];
event->hw.state = PERF_HES_STOPPED;
- rmid = event->hw.cqm_rmid;
- if (__rmid_valid(rmid) && (mode & PERF_EF_START))
- intel_cqm_event_start(event, mode);
+ /* Utilize most up to date pmonr summary. */
+ monr_hrchy_get_next_prmid_summary(pmonr);
+ summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+
+ if (!prmid_summary__is_mon_active(summary))
+ return -1;
- raw_spin_unlock_irqrestore(
- &cqm_pkgs_data[pkg_id]->pkg_data_lock, flags);
+ if (mode & PERF_EF_START)
+ __intel_cqm_event_start(event, summary);
+
+ /* (I)state pmonrs cannot report occupancy for themselves. */
return 0;
}
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
+ struct monr *monr;
+ int i;
+ unsigned long flags;
mutex_lock(&cqm_mutex);
-
/*
* If there's another event in this group...
*/
@@ -435,33 +816,56 @@ static void intel_cqm_event_destroy(struct perf_event *event)
hw.cqm_event_group_entry);
list_del(&event->hw.cqm_event_group_entry);
}
-
/*
* And we're the group leader..
*/
- if (cqm_group_leader(event)) {
- /*
- * If there was a group_other, make that leader, otherwise
- * destroy the group and return the RMID.
- */
- if (group_other) {
- list_replace(&event->hw.cqm_event_groups_entry,
- &group_other->hw.cqm_event_groups_entry);
- } else {
- u32 rmid = event->hw.cqm_rmid;
-
- if (__rmid_valid(rmid))
- __put_rmid(rmid);
- list_del(&event->hw.cqm_event_groups_entry);
- }
+ if (!cqm_group_leader(event))
+ goto exit;
+
+ monr = monr_from_event(event);
+
+ /*
+ * If there was a group_other, make that leader, otherwise
+ * destroy the group and return the RMID.
+ */
+ if (group_other) {
+ /* Update monr reference to group head. */
+ monr->mon_event_group = group_other;
+ list_replace(&event->hw.cqm_event_groups_entry,
+ &group_other->hw.cqm_event_groups_entry);
+ goto exit;
}
+ /*
+ * Event is the only event in cache group.
+ */
+
+ event_set_monr(event, NULL);
+ list_del(&event->hw.cqm_event_groups_entry);
+
+ if (monr__is_root(monr))
+ goto exit;
+
+ /* Transition all pmonrs to (U)state. */
+ monr_hrchy_acquire_locks(flags, i);
+
+ cqm_pkg_id_for_each_online(i)
+ __pmonr__to_ustate(monr->pmonrs[i]);
+
+ __monr__clear_mon_active(monr);
+ monr->mon_event_group = NULL;
+ __monr_hrchy_remove_leaf(monr);
+ monr_hrchy_release_locks(flags, i);
+
+ monr_dealloc(monr);
+exit:
mutex_unlock(&cqm_mutex);
}
static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
+ int ret;
if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -488,7 +892,11 @@ static int intel_cqm_event_init(struct perf_event *event)
/* Will also set rmid */
- intel_cqm_setup_event(event, &group);
+ ret = intel_cqm_setup_event(event, &group);
+ if (ret) {
+ mutex_unlock(&cqm_mutex);
+ return ret;
+ }
if (group) {
list_add_tail(&event->hw.cqm_event_group_entry,
@@ -697,6 +1105,12 @@ static int __init intel_cqm_init(void)
goto error;
}
+ monr_hrchy_root = monr_alloc();
+ if (IS_ERR(monr_hrchy_root)) {
+ ret = PTR_ERR(monr_hrchy_root);
+ goto error;
+ }
+
/* Select the minimum of the maximum rmids to use as limit for
* threshold. XXX: per-package threshold.
*/
@@ -705,6 +1119,7 @@ static int __init intel_cqm_init(void)
min_max_rmid = cqm_pkgs_data[i]->max_rmid;
intel_cqm_setup_pkg_prmid_pools(i);
}
+ monr_hrchy_root->flags |= MONR_MON_ACTIVE;
/*
* A reasonable upper limit on the max threshold is the number
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index a25d49b..81092f2 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -45,14 +45,111 @@ static unsigned int __rmid_min_update_time = RMID_DEFAULT_MIN_UPDATE_TIME;
static inline int cqm_prmid_update(struct prmid *prmid);
+/*
+ * union prmid_summary: Machine-size summary of a pmonr's prmid state.
+ * @value: One-word accessor.
+ * @sched_rmid: The rmid to write in the PQR MSR.
+ * @read_rmid: The rmid to read occupancy from.
+ *
+ * The prmid_summaries are read atomically and without the need of LOCK
+ * instructions during event and group scheduling in task context switch.
+ * They are set when a prmid changes state and allow lock-free fast paths for
+ * RMID scheduling and RMID read for the common case when prmid does not need
+ * to change state.
+ * The combination of values in sched_rmid and read_rmid indicate the state of
+ * the associated pmonr (see pmonr comments) as follows:
+ * pmonr state
+ * | (A)state (U)state
+ * ----------------------------------------------------------------------------
+ * sched_rmid | pmonr.prmid INVALID_RMID
+ * read_rmid | pmonr.prmid INVALID_RMID
+ * (or 0)
+ *
+ * The combination sched_rmid == INVALID_RMID and read_rmid == 0 for (U)state
+ * denotes that the flag MONR_MON_ACTIVE is set in the monr associated with
+ * the pmonr for this prmid_summary.
+ */
+union prmid_summary {
+ long long value;
+ struct {
+ u32 sched_rmid;
+ u32 read_rmid;
+ };
+};
+
# define INVALID_RMID (-1)
+/* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or INVALID_RMID
+ * depending on whether monitoring is active or not.
+ */
+inline bool prmid_summary__is_ustate(union prmid_summary summ)
+{
+ return summ.sched_rmid == INVALID_RMID;
+}
+
+inline bool prmid_summary__is_mon_active(union prmid_summary summ)
+{
+ /* If not in (U)state, then MONR_MON_ACTIVE must be set. */
+ return summ.sched_rmid != INVALID_RMID ||
+ summ.read_rmid == 0;
+}
+
+struct monr;
+
+/* struct pmonr: Node of per-package hierarchy of MONitored Resources.
+ * @prmid: The prmid of this pmonr -when in (A)state-.
+ * @rotation_entry: List entry to attach to astate_pmonrs_lru
+ * in pkg_data.
+ * @monr: The monr that contains this pmonr.
+ * @pkg_id: Auxiliary variable with the pkg id for this pmonr.
+ * @prmid_summary_atomic: Atomic accessor to store a union prmid_summary
+ * that represents the state of this pmonr.
+ *
+ * A pmonr forms a per-package hierarchy of prmids. Each one represents a
+ * resource to be monitored and can hold a prmid. Due to rmid scarcity,
+ * rmids can be recycled and rotated. When an rmid is not available for this
+ * pmonr, the pmonr utilizes the rmid of its ancestor.
+ * A pmonr is always in one of the following states:
+ * - (A)ctive: Has @prmid assigned, @ancestor_pmonr must be NULL.
+ * - (U)nused: No @ancestor_pmonr and no @prmid, hence no available
+ * prmid and no inherited one either. Not in rotation list.
+ * This state is unschedulable and a prmid
+ * should be found (either a free one or an ancestor's) before
+ * scheduling a thread with a (U)state pmonr on
+ * a cpu in this package.
+ *
+ * The state transitions are:
+ * (U) : The initial state. Starts there after allocation.
+ * (U) -> (A): If on first sched (or initialization) pmonr receives a prmid.
+ * (A) -> (U): On destruction of monr.
+ *
+ * Each pmonr is contained by a monr.
+ */
+struct pmonr {
+
+ struct prmid *prmid;
+
+ struct monr *monr;
+ struct list_head rotation_entry;
+
+ u16 pkg_id;
+
+ /* all writers are sync'ed by package's lock. */
+ atomic64_t prmid_summary_atomic;
+};
+
/*
* struct pkg_data: Per-package CQM data.
* @max_rmid: Max rmid valid for cpus in this package.
* @prmids_by_rmid: Utility mapping between rmid values and prmids.
* XXX: Make it an array of prmids.
* @free_prmid_pool: Free prmids.
+ * @active_prmid_pool: prmids associated with a (A)state pmonr.
+ * @nopmonr_limbo_prmid_pool: prmids in limbo state that are not referenced
+ * by a pmonr.
+ * @astate_pmonrs_lru: pmonrs in (A)state. LRU in increasing order of
+ * pmonr.last_enter_astate.
* @pkg_data_mutex: Hold for stability when modifying pmonrs
* hierarchy.
* @pkg_data_lock: Hold to protect variables that may be accessed
@@ -71,6 +168,12 @@ struct pkg_data {
* Pools of prmids used in rotation logic.
*/
struct list_head free_prmids_pool;
+ /* Can be modified during task switch with (U)state -> (A)state. */
+ struct list_head active_prmids_pool;
+ /* Only modified during rotation logic and deletion. */
+ struct list_head nopmonr_limbo_prmids_pool;
+
+ struct list_head astate_pmonrs_lru;
struct mutex pkg_data_mutex;
raw_spinlock_t pkg_data_lock;
@@ -78,6 +181,52 @@ struct pkg_data {
int rotation_cpu;
};
+/*
+ * Flags for monr.
+ */
+#define MONR_MON_ACTIVE 0x1
+
+/*
+ * struct monr: MONitored Resource.
+ * @flags: Flags field for monr (XXX: More flags will be added
+ * with MBM).
+ * @mon_event_group: The head of event's group that use this monr, if any.
+ * @parent: Parent in monr hierarchy.
+ * @children: List of children in monr hierarchy.
+ * @parent_entry: Entry in parent's children list.
+ * @pmonrs: Per-package pmonr for this monr.
+ *
+ * Each cgroup or thread that requires a RMID will have a corresponding
+ * monr in the system-wide hierarchy reflecting its position in the
+ * cgroup/thread hierarchy.
+ * A monr is assigned to every CQM event and/or monitored cgroup when
+ * monitoring is activated, and that instance's address does not change during
+ * the lifetime of the event or cgroup.
+ *
+ * On creation, the monr has flags cleared and all its pmonrs in (U)state.
+ * The flag MONR_MON_ACTIVE must be set to enable any transition out of
+ * (U)state to occur.
+ */
+struct monr {
+ u16 flags;
+ /* Back reference pointers */
+ struct perf_event *mon_event_group;
+
+ struct monr *parent;
+ struct list_head children;
+ struct list_head parent_entry;
+ struct pmonr *pmonrs[PQR_MAX_NR_PKGS];
+};
+
+/*
+ * Root for system-wide hierarchy of monr.
+ * A per-package raw_spin_lock protects changes to the per-pkg elements of
+ * the monr hierarchy.
+ * To modify the monr hierarchy, one must hold the locks of all packages,
+ * using the package id as the nesting parameter.
+ */
+extern struct monr *monr_hrchy_root;
+
extern struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
static inline u16 __cqm_pkgs_data_next_online(u16 pkg_id)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5eb7dea..bf29258 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -120,7 +120,7 @@ struct hw_perf_event {
};
#ifdef CONFIG_INTEL_RDT
struct { /* intel_cqm */
- u32 cqm_rmid;
+ void *cqm_monr;
struct list_head cqm_event_group_entry;
struct list_head cqm_event_groups_entry;
};
--
2.8.0.rc3.226.g39d4020
CQM defines a dirty threshold: the maximum number of cache lines still
tagged to a prmid for which that prmid remains eligible to be reused.
This threshold is zero unless there is significant contention for prmids
(more on this in the patch that introduces rotation of RMIDs).
A limbo prmid is a prmid that is no longer utilized by any pmonr, yet its
occupancy still exceeds the dirty threshold. This is a consequence of the
hardware design, which does not provide a mechanism to flush the cache
lines associated with an RMID.
If no pmonr schedules a limbo prmid, it is expected that its occupancy
will eventually drop below the dirty threshold. Nevertheless, the cache
lines tagged to a limbo prmid still hold valid occupancy for the previous
owner of the prmid. This creates a difference in the way the occupancy of
a pmonr is read, depending on whether it has held a prmid recently or not.
This patch introduces the (I)state mentioned in the previous changelog.
The (I)state is a superstate composed of two substates:
- (IL)state: (I)state with a limbo prmid; this pmonr held a prmid in
(A)state before its transition to (I)state.
- (IN)state: (I)state without a limbo prmid; this pmonr did not hold a
prmid recently.
A pmonr in (IL)state keeps the reference to its former prmid in the field
limbo_prmid; this occupancy is counted towards the occupancy of the
pmonr's ancestors, reducing the error caused by stealing of prmids during
RMID rotation.
In future patches (rotation logic), the occupancy of limbo_prmids is
polled periodically, and (IL)state pmonrs whose limbo prmids have become
clean will transition to (IN)state.
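A simplified sketch of the intended accounting (for_each_monr_in_subtree()
and prmid_read() are hypothetical helpers used only for illustration, and all
locking is omitted): the limbo prmids of descendant pmonrs still contribute to
a monr's reported occupancy.

static u64 monr_occupancy(struct monr *monr, u16 pkg_id)
{
	struct monr *pos;
	struct pmonr *pmonr;
	u64 total = 0;

	for_each_monr_in_subtree(pos, monr) {
		pmonr = pos->pmonrs[pkg_id];
		if (__pmonr__in_astate(pmonr))
			total += prmid_read(pmonr->prmid);
		else if (__pmonr__in_ilstate(pmonr))
			/* Residual lines still tagged to the old prmid. */
			total += prmid_read(pmonr->limbo_prmid);
	}
	return total;
}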
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 203 ++++++++++++++++++++++++++++++++++++++++++--
arch/x86/events/intel/cqm.h | 88 +++++++++++++++++--
2 files changed, 277 insertions(+), 14 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 65551bb..caf7152 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -39,16 +39,34 @@ struct monr *monr_hrchy_root;
struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
+static inline bool __pmonr__in_istate(struct pmonr *pmonr)
+{
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ return pmonr->ancestor_pmonr;
+}
+
+static inline bool __pmonr__in_ilstate(struct pmonr *pmonr)
+{
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ return __pmonr__in_istate(pmonr) && pmonr->limbo_prmid;
+}
+
+static inline bool __pmonr__in_instate(struct pmonr *pmonr)
+{
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+ return __pmonr__in_istate(pmonr) && !__pmonr__in_ilstate(pmonr);
+}
+
static inline bool __pmonr__in_astate(struct pmonr *pmonr)
{
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
- return pmonr->prmid;
+ return pmonr->prmid && !pmonr->ancestor_pmonr;
}
static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
{
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
- return !pmonr->prmid;
+ return !pmonr->prmid && !pmonr->ancestor_pmonr;
}
static inline bool monr__is_root(struct monr *monr)
@@ -210,9 +228,12 @@ static int pkg_data_init_cpu(int cpu)
INIT_LIST_HEAD(&pkg_data->free_prmids_pool);
INIT_LIST_HEAD(&pkg_data->active_prmids_pool);
+ INIT_LIST_HEAD(&pkg_data->pmonr_limbo_prmids_pool);
INIT_LIST_HEAD(&pkg_data->nopmonr_limbo_prmids_pool);
INIT_LIST_HEAD(&pkg_data->astate_pmonrs_lru);
+ INIT_LIST_HEAD(&pkg_data->istate_pmonrs_lru);
+ INIT_LIST_HEAD(&pkg_data->ilstate_pmonrs_lru);
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
@@ -261,7 +282,15 @@ static struct pmonr *pmonr_alloc(int cpu)
if (!pmonr)
return ERR_PTR(-ENOMEM);
+ pmonr->ancestor_pmonr = NULL;
+
+ /*
+ * Since (A)state and (I)state have union in members,
+ * initialize one of them only.
+ */
+ INIT_LIST_HEAD(&pmonr->pmonr_deps_head);
pmonr->prmid = NULL;
+ INIT_LIST_HEAD(&pmonr->limbo_rotation_entry);
pmonr->monr = NULL;
INIT_LIST_HEAD(&pmonr->rotation_entry);
@@ -327,6 +356,44 @@ __pmonr__finish_to_astate(struct pmonr *pmonr, struct prmid *prmid)
atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
}
+/*
+ * Transition to (A)state from (IN)state, given a valid prmid.
+ * Cannot fail. Updates ancestor dependants to use this pmonr as new ancestor.
+ */
+static inline void
+__pmonr__instate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
+{
+ struct pmonr *pos, *tmp, *ancestor;
+ union prmid_summary old_summary, summary;
+
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+ /* In (IN)state, the pmonr cannot have a limbo_prmid; otherwise the prmid
+ * in the function's argument would be superfluous.
+ */
+ WARN_ON_ONCE(pmonr->limbo_prmid);
+
+ /* Do not depend on ancestor_pmonr anymore. Make it (A)state. */
+ ancestor = pmonr->ancestor_pmonr;
+ list_del_init(&pmonr->pmonr_deps_entry);
+ pmonr->ancestor_pmonr = NULL;
+ __pmonr__finish_to_astate(pmonr, prmid);
+
+ /* Update ex ancestor's dependants that are pmonr descendants. */
+ list_for_each_entry_safe(pos, tmp, &ancestor->pmonr_deps_head,
+ pmonr_deps_entry) {
+ if (!__monr_hrchy_is_ancestor(monr_hrchy_root,
+ pmonr->monr, pos->monr))
+ continue;
+ list_move_tail(&pos->pmonr_deps_entry, &pmonr->pmonr_deps_head);
+ pos->ancestor_pmonr = pmonr;
+ old_summary.value = atomic64_read(&pos->prmid_summary_atomic);
+ summary.sched_rmid = prmid->rmid;
+ summary.read_rmid = old_summary.read_rmid;
+ atomic64_set(&pos->prmid_summary_atomic, summary.value);
+ }
+}
+
static inline void
__pmonr__ustate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
{
@@ -334,9 +401,59 @@ __pmonr__ustate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
__pmonr__finish_to_astate(pmonr, prmid);
}
+/*
+ * Find lowest active ancestor.
+ * Always successful since monr_hrchy_root is always in (A)state.
+ */
+static struct monr *
+__monr_hrchy__find_laa(struct monr *monr, u16 pkg_id)
+{
+ lockdep_assert_held(&cqm_pkgs_data[pkg_id]->pkg_data_lock);
+
+ while ((monr = monr->parent)) {
+ if (__pmonr__in_astate(monr->pmonrs[pkg_id]))
+ return monr;
+ }
+ /* Should have hit monr_hrchy_root */
+ WARN_ON_ONCE(true);
+ return NULL;
+}
+
+/*
+ * __pmonr__move_dependants: Move dependants from one ancestor to another.
+ * @old: Old ancestor.
+ * @new: New ancestor.
+ *
+ * To be called on valid pmonrs. @new must be ancestor of @old.
+ */
+static inline void
+__pmonr__move_dependants(struct pmonr *old, struct pmonr *new)
+{
+ struct pmonr *dep;
+ union prmid_summary old_summary, summary;
+
+ WARN_ON_ONCE(old->pkg_id != new->pkg_id);
+ lockdep_assert_held(&__pkg_data(old, pkg_data_lock));
+
+ /* Update this pmonr's dependants to use the new ancestor. */
+ list_for_each_entry(dep, &old->pmonr_deps_head, pmonr_deps_entry) {
+ /* Set next summary for dependent pmonrs. */
+ dep->ancestor_pmonr = new;
+
+ old_summary.value = atomic64_read(&dep->prmid_summary_atomic);
+ summary.sched_rmid = new->prmid->rmid;
+ summary.read_rmid = old_summary.read_rmid;
+ atomic64_set(&dep->prmid_summary_atomic, summary.value);
+ }
+ list_splice_tail_init(&old->pmonr_deps_head,
+ &new->pmonr_deps_head);
+}
+
static inline void
__pmonr__to_ustate(struct pmonr *pmonr)
{
+ struct pmonr *ancestor;
+ u16 pkg_id = pmonr->pkg_id;
union prmid_summary summary;
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
@@ -350,9 +467,27 @@ __pmonr__to_ustate(struct pmonr *pmonr)
if (__pmonr__in_astate(pmonr)) {
WARN_ON_ONCE(!pmonr->prmid);
+ ancestor = __monr_hrchy__find_laa(
+ pmonr->monr, pkg_id)->pmonrs[pkg_id];
+ WARN_ON_ONCE(!ancestor);
+ __pmonr__move_dependants(pmonr, ancestor);
list_move_tail(&pmonr->prmid->pool_entry,
&__pkg_data(pmonr, nopmonr_limbo_prmids_pool));
pmonr->prmid = NULL;
+ } else if (__pmonr__in_istate(pmonr)) {
+ list_del_init(&pmonr->pmonr_deps_entry);
+ /* limbo_prmid is already in limbo pool */
+ if (__pmonr__in_ilstate(pmonr)) {
+ WARN_ON(!pmonr->limbo_prmid);
+ list_move_tail(
+ &pmonr->limbo_prmid->pool_entry,
+ &__pkg_data(pmonr, nopmonr_limbo_prmids_pool));
+
+ pmonr->limbo_prmid = NULL;
+ list_del_init(&pmonr->limbo_rotation_entry);
+ } else {
+ }
+ pmonr->ancestor_pmonr = NULL;
} else {
WARN_ON_ONCE(true);
return;
@@ -367,6 +502,62 @@ __pmonr__to_ustate(struct pmonr *pmonr)
WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
}
+static inline void __pmonr__set_istate_summary(struct pmonr *pmonr)
+{
+ union prmid_summary summary;
+
+ summary.sched_rmid = pmonr->ancestor_pmonr->prmid->rmid;
+ summary.read_rmid =
+ pmonr->limbo_prmid ? pmonr->limbo_prmid->rmid : INVALID_RMID;
+ atomic64_set(
+ &pmonr->prmid_summary_atomic, summary.value);
+}
+
+/*
+ * Transition to (I)state from a non-(I)state.
+ * Finds a valid ancestor by traversing monr_hrchy. Cannot fail.
+ */
+static inline void
+__pmonr__to_istate(struct pmonr *pmonr)
+{
+ struct pmonr *ancestor;
+ u16 pkg_id = pmonr->pkg_id;
+
+ lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+ if (!(__pmonr__in_ustate(pmonr) || __pmonr__in_astate(pmonr))) {
+ /* Invalid initial state. */
+ WARN_ON_ONCE(true);
+ return;
+ }
+
+ ancestor = __monr_hrchy__find_laa(pmonr->monr, pkg_id)->pmonrs[pkg_id];
+ WARN_ON_ONCE(!ancestor);
+
+ if (__pmonr__in_astate(pmonr)) {
+ /* Active pmonr->prmid becomes limbo in transition to (I)state.
+ * Note that pmonr->prmid and pmonr->limbo_prmid are in a
+ * union, so no need to copy.
+ */
+ __pmonr__move_dependants(pmonr, ancestor);
+ list_move_tail(&pmonr->limbo_prmid->pool_entry,
+ &__pkg_data(pmonr, pmonr_limbo_prmids_pool));
+ }
+
+ pmonr->ancestor_pmonr = ancestor;
+ list_add_tail(&pmonr->pmonr_deps_entry, &ancestor->pmonr_deps_head);
+
+ list_move_tail(
+ &pmonr->rotation_entry, &__pkg_data(pmonr, istate_pmonrs_lru));
+
+ if (pmonr->limbo_prmid)
+ list_move_tail(&pmonr->limbo_rotation_entry,
+ &__pkg_data(pmonr, ilstate_pmonrs_lru));
+
+ __pmonr__set_istate_summary(pmonr);
+
+}
+
static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
{
int r;
@@ -538,11 +729,11 @@ monr_hrchy_get_next_prmid_summary(struct pmonr *pmonr)
*/
WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
- if (!list_empty(&__pkg_data(pmonr, free_prmids_pool))) {
+ if (list_empty(&__pkg_data(pmonr, free_prmids_pool))) {
/* Failed to obtain an valid rmid in this package for this
- * monr. In next patches it will transition to (I)state.
- * For now, stay in (U)state (do nothing)..
+ * monr. Use an inherited one.
*/
+ __pmonr__to_istate(pmonr);
} else {
/* Transition to (A)state using free prmid. */
__pmonr__ustate_to_astate(
@@ -796,7 +987,7 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
__intel_cqm_event_start(event, summary);
/* (I)state pmonrs cannot report occupancy for themselves. */
- return 0;
+ return prmid_summary__is_istate(summary) ? -1 : 0;
}
static void intel_cqm_event_destroy(struct perf_event *event)
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index 81092f2..22635bc 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -60,11 +60,11 @@ static inline int cqm_prmid_update(struct prmid *prmid);
* The combination of values in sched_rmid and read_rmid indicate the state of
* the associated pmonr (see pmonr comments) as follows:
* pmonr state
- * | (A)state (U)state
+ * | (A)state (IN)state (IL)state (U)state
* ----------------------------------------------------------------------------
- * sched_rmid | pmonr.prmid INVALID_RMID
- * read_rmid | pmonr.prmid INVALID_RMID
- * (or 0)
+ * sched_rmid | pmonr.prmid ancestor.prmid ancestor.prmid INVALID_RMID
+ * read_rmid | pmonr.prmid INVALID_RMID pmonr.limbo_prmid INVALID_RMID
+ * (or 0)
*
* The combination sched_rmid == INVALID_RMID and read_rmid == 0 for (U)state
* denotes that the flag MONR_MON_ACTIVE is set in the monr associated with
@@ -88,6 +88,13 @@ inline bool prmid_summary__is_ustate(union prmid_summary summ)
return summ.sched_rmid == INVALID_RMID;
}
+/* A pmonr in (I)state (either (IN)state or (IL)state). */
+inline bool prmid_summary__is_istate(union prmid_summary summ)
+{
+ return summ.sched_rmid != INVALID_RMID &&
+ summ.sched_rmid != summ.read_rmid;
+}
+
inline bool prmid_summary__is_mon_active(union prmid_summary summ)
{
/* If not in (U)state, then MONR_MON_ACTIVE must be set. */
@@ -98,9 +105,26 @@ inline bool prmid_summary__is_mon_active(union prmid_summary summ)
struct monr;
/* struct pmonr: Node of per-package hierarchy of MONitored Resources.
+ * @ancestor_pmonr: lowest active pmonr whose monr is ancestor of
+ * this pmonr's monr.
+ * @pmonr_deps_head: List of pmonrs without prmid that use
+ * this pmonr's prmid -when in (A)state-.
* @prmid: The prmid of this pmonr -when in (A)state-.
- * @rotation_entry: List entry to attach to astate_pmonrs_lru
- * in pkg_data.
+ * @pmonr_deps_entry: Entry into ancestor's @pmonr_deps_head
+ * -when inheriting, (I)state-.
+ * @limbo_prmid: A prmid previously used by this pmonr and that
+ * has not been reused yet and therefore contains
+ * occupancy that should be counted towards this
+ * pmonr's occupancy.
+ * The limbo_prmid can be reused in the same pmonr
+ * in the next transition to (A) state, even if
+ * the occupancy of @limbo_prmid is not below the
+ * dirty threshold, reducing the need for free
+ * prmids.
+ * @limbo_rotation_entry: List entry to attach to ilstate_pmonrs_lru when
+ * this pmonr is in (IL)state.
+ * @rotation_entry: List entry to attach to either astate_pmonrs_lru
+ * or istate_pmonrs_lru in pkg_data.
* @monr: The monr that contains this pmonr.
* @pkg_id: Auxiliar variable with pkg id for this pmonr.
* @prmid_summary_atomic: Atomic accesor to store a union prmid_summary
@@ -112,6 +136,15 @@ struct monr;
* pmonr, the pmonr utilizes the rmid of its ancestor.
* A pmonr is always in one of the following states:
* - (A)ctive: Has @prmid assigned, @ancestor_pmonr must be NULL.
+ * - (I)nherited: The prmid used is "Inherited" from @ancestor_pmonr.
+ * @ancestor_pmonr must be set. @prmid is unused. This is
+ * a super-state composed of two substates:
+ *
+ * - (IL)state: A pmonr in (I)state that has a valid limbo_prmid.
+ * - (IN)state: A pmonr in (I)state with NO valid limbo_prmid.
+ *
+ * When the distinction between the two substates is
+ * not relevant, the pmonr is simply in the (I)state.
* - (U)nused: No @ancestor_pmonr and no @prmid, hence no available
* prmid and no inhering one either. Not in rotation list.
* This state is unschedulable and a prmid
@@ -122,13 +155,41 @@ struct monr;
* The state transitions are:
* (U) : The initial state. Starts there after allocation.
* (U) -> (A): If on first sched (or initialization) pmonr receives a prmid.
+ * (U) -> (I): If on first sched (or initialization) pmonr cannot find a free
+ * prmid and resorts to using its ancestor's.
+ * (A) -> (I): On stealing of prmid from pmonr (by rotation logic only).
* (A) -> (U): On destruction of monr.
+ * (I) -> (A): On receiving a free prmid or on reuse of its @limbo_prmid (by
+ * rotation logic only).
+ * (I) -> (U): On destruction of pmonr.
+ *
+ * Note that the (I) -> (A) transition makes monitoring available, but can
+ * introduce error due to cache lines allocated before the transition. Such
+ * error is likely to decrease over time.
+ * When entering the (I)state, the reported count of the event is unavailable.
*
- * Each pmonr is contained by a monr.
+ * Each pmonr is contained by a monr. The monrs form a system-wide hierarchy
+ * that is used by the pmonrs to find ancestors and dependants. The per-package
+ * hierarchy spanned by the pmonrs follows the monr hierarchy, except that it
+ * collapses the nodes in (I)state into a super-node that contains an (A)state
+ * pmonr and all of its dependants (the pmonrs in pmonr_deps_head).
*/
struct pmonr {
- struct prmid *prmid;
+ /* If set, pmonr is in (I)state. */
+ struct pmonr *ancestor_pmonr;
+
+ union{
+ struct { /* (A)state variables. */
+ struct list_head pmonr_deps_head;
+ struct prmid *prmid;
+ };
+ struct { /* (I)state variables. */
+ struct list_head pmonr_deps_entry;
+ struct prmid *limbo_prmid;
+ struct list_head limbo_rotation_entry;
+ };
+ };
struct monr *monr;
struct list_head rotation_entry;
@@ -146,10 +207,17 @@ struct pmonr {
* XXX: Make it an array of prmids.
* @free_prmid_pool: Free prmids.
* @active_prmid_pool: prmids associated with a (A)state pmonr.
+ * @pmonr_limbo_prmid_pool: limbo prmids referenced by the limbo_prmid of a
+ * pmonr in (I)state.
* @nopmonr_limbo_prmid_pool: prmids in limbo state that are not referenced
* by a pmonr.
* @astate_pmonrs_lru: pmonrs in (A)state. LRU in increasing order of
* pmonr.last_enter_astate.
+ * @istate_pmonrs_lru: pmonrs in (I)state (superset of ilstate_pmonrs_lru). LRU in
+ * increasing order of pmonr.last_enter_istate.
+ * @ilstate_pmonrs_lru: pmonrs in (IL)state; these pmonrs have a valid
+ * limbo_prmid. It's a subset of istate_pmonrs_lru.
+ * Sorted increasingly by pmonr.last_enter_istate.
* @pkg_data_mutex: Hold for stability when modifying pmonrs
* hierarchy.
* @pkg_data_lock: Hold to protect variables that may be accessed
@@ -171,9 +239,13 @@ struct pkg_data {
/* Can be modified during task switch with (U)state -> (A)state. */
struct list_head active_prmids_pool;
/* Only modified during rotation logic and deletion. */
+ struct list_head pmonr_limbo_prmids_pool;
struct list_head nopmonr_limbo_prmids_pool;
struct list_head astate_pmonrs_lru;
+ /* Superset of ilstate_pmonrs_lru. */
+ struct list_head istate_pmonrs_lru;
+ struct list_head ilstate_pmonrs_lru;
struct mutex pkg_data_mutex;
raw_spinlock_t pkg_data_lock;
--
2.8.0.rc3.226.g39d4020
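To tie the new state predicates to the summary encoding above: the four pmonr
states can be recovered from a prmid_summary alone. A minimal sketch follows
(the enum and helper names are illustrative only, not part of the patch; only
the encoding itself comes from cqm.h above):

/*
 * Illustrative mapping from a prmid_summary to the pmonr state, following
 * the (A)/(IN)/(IL)/(U) table and the prmid_summary__is_* helpers in cqm.h.
 * The enum and function names are made up for this example.
 */
enum pmonr_example_state { EX_USTATE, EX_ASTATE, EX_INSTATE, EX_ILSTATE };

static enum pmonr_example_state example_pmonr_state(union prmid_summary s)
{
	if (s.sched_rmid == INVALID_RMID)
		return EX_USTATE;	/* no own prmid, no ancestor to inherit from */
	if (s.sched_rmid == s.read_rmid)
		return EX_ASTATE;	/* schedules and reads its own prmid */
	if (s.read_rmid == INVALID_RMID)
		return EX_INSTATE;	/* inherits ancestor's rmid, nothing to read */
	return EX_ILSTATE;		/* inherits ancestor's rmid, reads its limbo_prmid */
}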
Create a CQM_EVENT_ATTR_STR macro for CQM's event attributes to remove the
dependency on the unrelated x86 PMU macro EVENT_ATTR_STR.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 8457dd0..d5eac8f 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -38,6 +38,13 @@ static inline void __update_pqr_rmid(u32 rmid)
static DEFINE_MUTEX(cache_mutex);
static DEFINE_RAW_SPINLOCK(cache_lock);
+#define CQM_EVENT_ATTR_STR(_name, v, str) \
+static struct perf_pmu_events_attr event_attr_##v = { \
+ .attr = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), \
+ .id = 0, \
+ .event_str = str, \
+}
+
/*
* Groups of events that have the same target(s), one RMID per group.
*/
@@ -504,11 +511,11 @@ static int intel_cqm_event_init(struct perf_event *event)
return 0;
}
-EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
-EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
-EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
-EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
-EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
+CQM_EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
+CQM_EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
+CQM_EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
+CQM_EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
+CQM_EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
--
2.8.0.rc3.226.g39d4020
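For reference, the first of the new invocations above hand-expands to roughly
the following (an approximate expansion, not preprocessor output):

/*
 * Approximate hand-expansion of
 * CQM_EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01")
 * using the CQM_EVENT_ATTR_STR macro added by this patch.
 */
static struct perf_pmu_events_attr event_attr_intel_cqm_llc = {
	.attr		= __ATTR(llc_occupancy, 0444, perf_event_sysfs_show, NULL),
	.id		= 0,
	.event_str	= "event=0x01",
};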
CQM was the only user of pmu->count, so there is no need to keep it anymore.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
include/linux/perf_event.h | 6 ------
kernel/events/core.c | 10 ----------
kernel/trace/bpf_trace.c | 5 ++---
3 files changed, 2 insertions(+), 19 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 00bb6b5..8bb1532 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -373,12 +373,6 @@ struct pmu {
*/
size_t task_ctx_size;
-
- /*
- * Return the count value for a counter.
- */
- u64 (*count) (struct perf_event *event); /*optional*/
-
/*
* Set up pmu-private data structures for an AUX area
*/
diff --git a/kernel/events/core.c b/kernel/events/core.c
index aae72d3..4aaec01 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3313,9 +3313,6 @@ unlock:
static inline u64 perf_event_count(struct perf_event *event)
{
- if (event->pmu->count)
- return event->pmu->count(event);
-
return __perf_event_count(event);
}
@@ -3325,7 +3322,6 @@ static inline u64 perf_event_count(struct perf_event *event)
* - either for the current task, or for this CPU
* - does not have inherit set, for inherited task events
* will not be local and we cannot read them atomically
- * - must not have a pmu::count method
*/
u64 perf_event_read_local(struct perf_event *event)
{
@@ -3353,12 +3349,6 @@ u64 perf_event_read_local(struct perf_event *event)
WARN_ON_ONCE(event->attr.inherit);
/*
- * It must not have a pmu::count method, those are not
- * NMI safe.
- */
- WARN_ON_ONCE(event->pmu->count);
-
- /*
* If the event is currently on this CPU, its either a per-task event,
* or local to this CPU. Furthermore it means its ACTIVE (otherwise
* oncpu == -1).
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e4ffb3..7ef81b3 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -200,9 +200,8 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
event = file->private_data;
- /* make sure event is local and doesn't have pmu::count */
- if (event->oncpu != smp_processor_id() ||
- event->pmu->count)
+ /* make sure event is local */
+ if (event->oncpu != smp_processor_id())
return -EINVAL;
/*
--
2.8.0.rc3.226.g39d4020
The previous version of Intel's CQM introduced pmu::count as a replacement
for reading CQM events. This was done to avoid using an IPI to read the
CQM occupancy event when reading events attached to a thread.
Using pmu->count in place of pmu->read is inconsistent with the usage by
other PMUs and introduces several problems such as:
1) pmu::read for thread events returns bogus values when called from
interrupt-disabled contexts.
2) perf_event_count()'s behavior depends on whether interrupts are
enabled or not.
3) perf_event_count() will always read a fresh value from the PMU, which
is inconsistent with the behavior of other events.
4) perf_event_count() will perform slow MSR read and writes and IPIs.
This patch removes pmu::count from CQM and makes pmu::read always
read from the local socket (package). Future patches will add a mechanism
to aggregate the event count from other packages.
This patch also removes the unused field rmid_usecnt from intel_pqr_state.
Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
---
arch/x86/events/intel/cqm.c | 125 ++++++--------------------------------------
1 file changed, 16 insertions(+), 109 deletions(-)
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 3c1e247..afd60dd 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -20,7 +20,6 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline size */
* struct intel_pqr_state - State cache for the PQR MSR
* @rmid: The cached Resource Monitoring ID
* @closid: The cached Class Of Service ID
- * @rmid_usecnt: The usage counter for rmid
*
* The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
* lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -32,7 +31,6 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline size */
struct intel_pqr_state {
u32 rmid;
u32 closid;
- int rmid_usecnt;
};
/*
@@ -44,6 +42,19 @@ struct intel_pqr_state {
static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
/*
+ * Updates caller cpu's cache.
+ */
+static inline void __update_pqr_rmid(u32 rmid)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+ if (state->rmid == rmid)
+ return;
+ state->rmid = rmid;
+ wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
+}
+
+/*
* Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
* Also protects event->hw.cqm_rmid
*
@@ -309,7 +320,7 @@ struct rmid_read {
atomic64_t value;
};
-static void __intel_cqm_event_count(void *info);
+static void intel_cqm_event_read(struct perf_event *event);
/*
* If we fail to assign a new RMID for intel_cqm_rotation_rmid because
@@ -376,12 +387,6 @@ static void intel_cqm_event_read(struct perf_event *event)
u32 rmid;
u64 val;
- /*
- * Task events are handled by intel_cqm_event_count().
- */
- if (event->cpu == -1)
- return;
-
raw_spin_lock_irqsave(&cache_lock, flags);
rmid = event->hw.cqm_rmid;
@@ -401,123 +406,28 @@ out:
raw_spin_unlock_irqrestore(&cache_lock, flags);
}
-static void __intel_cqm_event_count(void *info)
-{
- struct rmid_read *rr = info;
- u64 val;
-
- val = __rmid_read(rr->rmid);
-
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
-
- atomic64_add(val, &rr->value);
-}
-
static inline bool cqm_group_leader(struct perf_event *event)
{
return !list_empty(&event->hw.cqm_groups_entry);
}
-static u64 intel_cqm_event_count(struct perf_event *event)
-{
- unsigned long flags;
- struct rmid_read rr = {
- .value = ATOMIC64_INIT(0),
- };
-
- /*
- * We only need to worry about task events. System-wide events
- * are handled like usual, i.e. entirely with
- * intel_cqm_event_read().
- */
- if (event->cpu != -1)
- return __perf_event_count(event);
-
- /*
- * Only the group leader gets to report values. This stops us
- * reporting duplicate values to userspace, and gives us a clear
- * rule for which task gets to report the values.
- *
- * Note that it is impossible to attribute these values to
- * specific packages - we forfeit that ability when we create
- * task events.
- */
- if (!cqm_group_leader(event))
- return 0;
-
- /*
- * Getting up-to-date values requires an SMP IPI which is not
- * possible if we're being called in interrupt context. Return
- * the cached values instead.
- */
- if (unlikely(in_interrupt()))
- goto out;
-
- /*
- * Notice that we don't perform the reading of an RMID
- * atomically, because we can't hold a spin lock across the
- * IPIs.
- *
- * Speculatively perform the read, since @event might be
- * assigned a different (possibly invalid) RMID while we're
- * busying performing the IPI calls. It's therefore necessary to
- * check @event's RMID afterwards, and if it has changed,
- * discard the result of the read.
- */
- rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
-
- if (!__rmid_valid(rr.rmid))
- goto out;
-
- on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count, &rr, 1);
-
- raw_spin_lock_irqsave(&cache_lock, flags);
- if (event->hw.cqm_rmid == rr.rmid)
- local64_set(&event->count, atomic64_read(&rr.value));
- raw_spin_unlock_irqrestore(&cache_lock, flags);
-out:
- return __perf_event_count(event);
-}
-
static void intel_cqm_event_start(struct perf_event *event, int mode)
{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- u32 rmid = event->hw.cqm_rmid;
-
if (!(event->hw.cqm_state & PERF_HES_STOPPED))
return;
event->hw.cqm_state &= ~PERF_HES_STOPPED;
-
- if (state->rmid_usecnt++) {
- if (!WARN_ON_ONCE(state->rmid != rmid))
- return;
- } else {
- WARN_ON_ONCE(state->rmid);
- }
-
- state->rmid = rmid;
- wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
+ __update_pqr_rmid(event->hw.cqm_rmid);
}
static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
if (event->hw.cqm_state & PERF_HES_STOPPED)
return;
event->hw.cqm_state |= PERF_HES_STOPPED;
-
intel_cqm_event_read(event);
-
- if (!--state->rmid_usecnt) {
- state->rmid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);
- } else {
- WARN_ON_ONCE(!state->rmid);
- }
+ __update_pqr_rmid(0);
}
static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -534,7 +444,6 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
intel_cqm_event_start(event, mode);
raw_spin_unlock_irqrestore(&cache_lock, flags);
-
return 0;
}
@@ -720,7 +629,6 @@ static struct pmu intel_cqm_pmu = {
.start = intel_cqm_event_start,
.stop = intel_cqm_event_stop,
.read = intel_cqm_event_read,
- .count = intel_cqm_event_count,
};
static inline void cqm_pick_event_reader(int cpu)
@@ -743,7 +651,6 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
state->rmid = 0;
state->closid = 0;
- state->rmid_usecnt = 0;
WARN_ON(c->x86_cache_max_rmid != cqm_max_rmid);
WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
--
2.8.0.rc3.226.g39d4020
On Thu, Apr 28, 2016 at 09:43:31PM -0700, David Carrillo-Cisneros wrote:
> This hook allows architecture specific code to be called at the end of
> the task switch and after perf_events' context switch but before the
> scheduler lock is released.
>
> The specific use case in this series is to avoid multiple writes to a slow
> MSR until all functions which modify such register in task switch have
> finished.
Yeah, no. This really needs way more justification. Why can't you use the
regular perf sched-in stuff for CQM?
On Thu, Apr 28, 2016 at 09:43:14PM -0700, David Carrillo-Cisneros wrote:
> Move code around, delete unnecessary code and do some renaming
> in order to increase readability of the next patches. Create cqm.h file.
*sigh*, this is a royal pain in the backside to review.
Please just completely wipe the old driver in patch 1, preserve
_nothing_.
Then start adding bits back, in gradual coherent pieces. Like that msr
write optimization you need new hooks for, that should be a patch doing
just that, optimize, it should not introduce new functionality etc..
This piecewise removal of small bits makes it entirely hard to see the
complete picture of what is introduced here.
(Re-sending in plain text)
This hook is used in the following patch in the series to write to
PQR_ASSOC_MSR, an MSR that is utilized both by CQM/CMT and by CAT.
Since CAT is not dependent on perf, I created this hook to start CQM
monitoring right after other events start while keeping it independent
of perf. The idea is to have future versions of CAT also rely on
this hook.
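As a rough sketch of the ordering being described (a sketch only:
pqr_cache_update_rmid() and pqr_update() are taken from the pqr_common patch
quoted later in this thread, while the hook's name and exact call site are
assumptions here):

/*
 * Sketch only: a placeholder for the end-of-task-switch hook.
 */
static void example_finish_switch_hook(void)
{
	/*
	 * If perf's sched-in path already installed an event RMID via
	 * pqr_cache_update_rmid(rmid, PQR_RMID_MODE_EVENT), pqr_update()
	 * keeps it; otherwise __pqr_update() falls back to the monitored
	 * cgroup's RMID via __intel_cqm_no_event_sched_in().  Either way,
	 * MSR_IA32_PQR_ASSOC is written at most once per task switch.
	 */
	pqr_update();
}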
On Fri, Apr 29, 2016 at 11:05 AM, David Carrillo-Cisneros
<[email protected]> wrote:
> This hook is used in the following patch in the series to write to
> PQR_ASSOC_MSR, a msr that is utilized both by CQM/CMT and by CAT. Since CAT
> is not dependent on perf, I created this hook to start CQM monitoring right
> after other events start while keeping it independent of perf. The idea is
> to have future versions of CAT to also rely on this hook.
>
> On Fri, Apr 29, 2016 at 1:52 AM Peter Zijlstra <[email protected]> wrote:
>>
>> On Thu, Apr 28, 2016 at 09:43:31PM -0700, David Carrillo-Cisneros wrote:
>> > This hook allows architecture specific code to be called at the end of
>> > the task switch and after perf_events' context switch but before the
>> > scheduler lock is released.
>> >
>> > The specific use case in this series is to avoid multiple writes to a
>> > slow
>> > MSR until all functions which modify such register in task switch have
>> > finished.
>>
>> Yeah, no. This really need way more justification. Why can't you use the
>> regular perf sched-in stuff for CQM?
On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
> Removing MBM code from arch/x86/events/intel/cqm.c. MBM will be added
> using the new RMID infrastructure introduced in this patch series.
I am still working on rebasing MBM on top of the new CQM
series (probably with the new quick fixes I sent yesterday).
Thanks,
Vikas
On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
> (Re-sending in plain text)
>
> This hook is used in the following patch in the series to write to
> PQR_ASSOC_MSR, a msr that is utilized both by CQM/CMT and by CAT.
> Since CAT is not dependent on perf, I created this hook to start CQM
> monitoring right after other events start while keeping it independent
> of perf. The idea is to have future versions of CAT to also rely on
> this hook.
CAT did the MSR write in switch_to, as Peter did not want a new hook to be used.
The same could be done here.
Thanks,
Vikas
>
> On Fri, Apr 29, 2016 at 11:05 AM, David Carrillo-Cisneros
> <[email protected]> wrote:
>> This hook is used in the following patch in the series to write to
>> PQR_ASSOC_MSR, a msr that is utilized both by CQM/CMT and by CAT. Since CAT
>> is not dependent on perf, I created this hook to start CQM monitoring right
>> after other events start while keeping it independent of perf. The idea is
>> to have future versions of CAT to also rely on this hook.
>>
>> On Fri, Apr 29, 2016 at 1:52 AM Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Thu, Apr 28, 2016 at 09:43:31PM -0700, David Carrillo-Cisneros wrote:
>>>> This hook allows architecture specific code to be called at the end of
>>>> the task switch and after perf_events' context switch but before the
>>>> scheduler lock is released.
>>>>
>>>> The specific use case in this series is to avoid multiple writes to a
>>>> slow
>>>> MSR until all functions which modify such register in task switch have
>>>> finished.
>>>
>>> Yeah, no. This really need way more justification. Why can't you use the
>>> regular perf sched-in stuff for CQM?
>
On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
> Allow monitored cgroups to update the PQR MSR during task switch even
> without an associated perf_event.
>
> The package RMID for the current monr associated with a monitored
> cgroup is written to hw during task switch (after perf_events is run)
> if perf_event did not write a RMID for an event.
>
> perf_event and any other caller of pqr_cache_update_rmid can update the
> CPU's RMID using one of two modes:
> - PQR_RMID_MODE_NOEVENT: A RMID that do not correspond to an event.
> e.g. the RMID of the root pmonr when no event is scheduled.
> - PQR_RMID_MODE_EVENT: A RMID used by an event. Set during pmu::add
> unset on pmu::del. This mode prevents from using a non-event
> cgroup RMID.
>
> This patch also introduces caching of writes to PQR MSR within the per-pcu
> pqr state variable. This interface to update RMIDs and CLOSIDs will be
> also utilized in upcoming versions of Intel's MBM and CAT drivers.
>
> Reviewed-by: Stephane Eranian <[email protected]>
> Signed-off-by: David Carrillo-Cisneros <[email protected]>
> ---
> arch/x86/events/intel/cqm.c | 65 +++++++++++++++++++++++++++++----------
> arch/x86/events/intel/cqm.h | 2 --
> arch/x86/include/asm/pqr_common.h | 53 +++++++++++++++++++++++++++----
> arch/x86/kernel/cpu/pqr_common.c | 46 +++++++++++++++++++++++----
> 4 files changed, 135 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
> index daf9fdf..4ece0a4 100644
> --- a/arch/x86/events/intel/cqm.c
> +++ b/arch/x86/events/intel/cqm.c
> @@ -198,19 +198,6 @@ static inline int cqm_prmid_update(struct prmid *prmid)
> return __cqm_prmid_update(prmid, __rmid_min_update_time);
> }
>
> -/*
> - * Updates caller cpu's cache.
> - */
> -static inline void __update_pqr_prmid(struct prmid *prmid)
> -{
> - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> -
> - if (state->rmid == prmid->rmid)
> - return;
> - state->rmid = prmid->rmid;
> - wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
> -}
> -
> static inline bool __valid_pkg_id(u16 pkg_id)
> {
> return pkg_id < PQR_MAX_NR_PKGS;
> @@ -2531,12 +2518,11 @@ static inline bool cqm_group_leader(struct perf_event *event)
> static inline void __intel_cqm_event_start(
> struct perf_event *event, union prmid_summary summary)
> {
> - u16 pkg_id = topology_physical_package_id(smp_processor_id());
> if (!(event->hw.state & PERF_HES_STOPPED))
> return;
> -
> event->hw.state &= ~PERF_HES_STOPPED;
> - __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
> +
> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_EVENT);
> }
>
> static void intel_cqm_event_start(struct perf_event *event, int mode)
> @@ -2566,7 +2552,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
> /* Occupancy of CQM events is obtained at read. No need to read
> * when event is stopped since read on inactive cpus succeed.
> */
> - __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
> }
>
> static int intel_cqm_event_add(struct perf_event *event, int mode)
> @@ -2977,6 +2963,8 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
>
> state->rmid = 0;
> state->closid = 0;
> + state->next_rmid = 0;
> + state->next_closid = 0;
>
> /* XXX: lock */
> /* XXX: Make sure this case is handled when hotplug happens. */
> @@ -3152,6 +3140,12 @@ static int __init intel_cqm_init(void)
> pr_info("Intel CQM monitoring enabled with at least %u rmids per package.\n",
> min_max_rmid + 1);
>
> + /* Make sure pqr_common_enable_key is enabled after
> + * cqm_initialized_key.
> + */
> + barrier();
> +
> + static_branch_enable(&pqr_common_enable_key);
> return ret;
>
> error_init_mutex:
> @@ -3163,4 +3157,41 @@ error:
> return ret;
> }
>
> +/* Schedule task without a CQM perf_event. */
> +inline void __intel_cqm_no_event_sched_in(void)
> +{
> +#ifdef CONFIG_CGROUP_PERF
> + struct monr *monr;
> + struct pmonr *pmonr;
> + union prmid_summary summary;
> + u16 pkg_id = topology_physical_package_id(smp_processor_id());
> + struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
> +
> + /* Assume CQM enabled is likely given that PQR is enabled. */
> + if (!static_branch_likely(&cqm_initialized_key))
> + return;
> +
> + /* Safe to call from_task since we are in scheduler lock. */
> + monr = monr_from_perf_cgroup(perf_cgroup_from_task(current, NULL));
> + pmonr = monr->pmonrs[pkg_id];
> +
> + /* Utilize most up to date pmonr summary. */
> + monr_hrchy_get_next_prmid_summary(pmonr);
> + summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
> +
> + if (!prmid_summary__is_mon_active(summary))
> + goto no_rmid;
> +
> + if (WARN_ON_ONCE(!__valid_rmid(pkg_id, summary.sched_rmid)))
> + goto no_rmid;
> +
> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
> + return;
> +
> +no_rmid:
> + summary.value = atomic64_read(&root_pmonr->prmid_summary_atomic);
> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
> +#endif
> +}
> +
> device_initcall(intel_cqm_init);
> diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
> index 0f3da94..e1f8bd0 100644
> --- a/arch/x86/events/intel/cqm.h
> +++ b/arch/x86/events/intel/cqm.h
> @@ -82,8 +82,6 @@ union prmid_summary {
> };
> };
>
> -# define INVALID_RMID (-1)
> -
> /* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or INVALID_RMID
> * depending on whether monitoring is active or not.
> */
> diff --git a/arch/x86/include/asm/pqr_common.h b/arch/x86/include/asm/pqr_common.h
> index f770637..abbb235 100644
> --- a/arch/x86/include/asm/pqr_common.h
> +++ b/arch/x86/include/asm/pqr_common.h
> @@ -3,31 +3,72 @@
>
> #if defined(CONFIG_INTEL_RDT)
>
> +#include <linux/jump_label.h>
> #include <linux/types.h>
> #include <asm/percpu.h>
> +#include <asm/msr.h>
>
> #define MSR_IA32_PQR_ASSOC 0x0c8f
> +#define INVALID_RMID (-1)
> +#define INVALID_CLOSID (-1)
> +
> +
> +extern struct static_key_false pqr_common_enable_key;
> +
> +enum intel_pqr_rmid_mode {
> + /* RMID has no perf_event associated. */
> + PQR_RMID_MODE_NOEVENT = 0,
> + /* RMID has a perf_event associated. */
> + PQR_RMID_MODE_EVENT
> +};
>
> /**
> * struct intel_pqr_state - State cache for the PQR MSR
> - * @rmid: The cached Resource Monitoring ID
> - * @closid: The cached Class Of Service ID
> + * @rmid: Last rmid written to hw.
> + * @next_rmid: Next rmid to write to hw.
> + * @next_rmid_mode: Next rmid's mode.
> + * @closid: The current Class Of Service ID
> + * @next_closid: The Class Of Service ID to use.
> *
> * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
> * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
> * contains both parts, so we need to cache them.
> *
> - * The cache also helps to avoid pointless updates if the value does
> - * not change.
> + * The cache also helps to avoid pointless updates if the value does not
> + * change. It also keeps track of the type of RMID set (event vs no event)
> + * used to determine when a cgroup RMID is required.
> */
> struct intel_pqr_state {
> - u32 rmid;
> - u32 closid;
> + u32 rmid;
> + u32 next_rmid;
> + enum intel_pqr_rmid_mode next_rmid_mode;
> + u32 closid;
> + u32 next_closid;
> };
>
> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>
> #define PQR_MAX_NR_PKGS 8
>
> +void __pqr_update(void);
> +
> +inline void __intel_cqm_no_event_sched_in(void);
> +
> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode);
> +
> +inline void pqr_cache_update_closid(u32 closid);
> +
> +static inline void pqr_update(void)
> +{
> + if (static_branch_unlikely(&pqr_common_enable_key))
> + __pqr_update();
> +}
> +
> +#else
> +
> +static inline void pqr_update(void)
> +{
> +}
> +
> #endif
> #endif
> diff --git a/arch/x86/kernel/cpu/pqr_common.c b/arch/x86/kernel/cpu/pqr_common.c
> index 9eff5d9..d91c127 100644
> --- a/arch/x86/kernel/cpu/pqr_common.c
> +++ b/arch/x86/kernel/cpu/pqr_common.c
> @@ -1,9 +1,43 @@
> #include <asm/pqr_common.h>
>
> -/*
> - * The cached intel_pqr_state is strictly per CPU and can never be
> - * updated from a remote CPU. Both functions which modify the state
> - * (intel_cqm_event_start and intel_cqm_event_stop) are called with
> - * interrupts disabled, which is sufficient for the protection.
> - */
> DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> +
> +DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
> +
> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode)
> +{
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> + state->next_rmid_mode = mode;
> + state->next_rmid = rmid;
> +}
> +
> +inline void pqr_cache_update_closid(u32 closid)
> +{
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> + state->next_closid = closid;
> +}
> +
> +/* Update hw's RMID using cgroup's if perf_event did not.
> + * Sync pqr cache with MSR.
> + */
> +inline void __pqr_update(void)
> +{
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> + /* If perf_event has set a next_rmid that is used, do not try
> + * to obtain another one from current task.
> + */
> + if (state->next_rmid_mode == PQR_RMID_MODE_NOEVENT)
> + __intel_cqm_no_event_sched_in();
If perf_cgroup is not defined, then the state is not updated here, so
state->rmid might still hold a stale ID?
1.event1 for PID1 has RMID1,
2.perf sched_in - state->next_rmid = rmid1
3.pqr_update - state->rmid = rmid1
4.sched_out - write PQR_NOEVENT -
5.next switch_to - state->rmid not reset nothing changes (when no perf_cgroup) ?
> +
> + /* __intel_cqm_no_event_sched_in might have changed next_rmid. */
> + if (state->rmid == state->next_rmid &&
> + state->closid == state->next_closid)
> + return;
> +
> + state->rmid = state->next_rmid;
> + state->closid = state->next_closid;
> + wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
> +}
> --
> 2.8.0.rc3.226.g39d4020
>
>
Not sure I see the problem you point out here. In step 3, PQR_ASSOC is
updated with RMID1; __pqr_update is the one called from the scheduler
hook, right after perf sched_in.
On Fri, Apr 29, 2016 at 1:25 PM, Vikas Shivappa
<[email protected]> wrote:
>
>
> On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
>
>> Allow monitored cgroups to update the PQR MSR during task switch even
>> without an associated perf_event.
>>
>> The package RMID for the current monr associated with a monitored
>> cgroup is written to hw during task switch (after perf_events is run)
>> if perf_event did not write a RMID for an event.
>>
>> perf_event and any other caller of pqr_cache_update_rmid can update the
>> CPU's RMID using one of two modes:
>> - PQR_RMID_MODE_NOEVENT: A RMID that do not correspond to an event.
>> e.g. the RMID of the root pmonr when no event is scheduled.
>> - PQR_RMID_MODE_EVENT: A RMID used by an event. Set during pmu::add
>> unset on pmu::del. This mode prevents from using a non-event
>> cgroup RMID.
>>
>> This patch also introduces caching of writes to PQR MSR within the per-pcu
>> pqr state variable. This interface to update RMIDs and CLOSIDs will be
>> also utilized in upcoming versions of Intel's MBM and CAT drivers.
>>
>> Reviewed-by: Stephane Eranian <[email protected]>
>> Signed-off-by: David Carrillo-Cisneros <[email protected]>
>> ---
>> arch/x86/events/intel/cqm.c | 65
>> +++++++++++++++++++++++++++++----------
>> arch/x86/events/intel/cqm.h | 2 --
>> arch/x86/include/asm/pqr_common.h | 53 +++++++++++++++++++++++++++----
>> arch/x86/kernel/cpu/pqr_common.c | 46 +++++++++++++++++++++++----
>> 4 files changed, 135 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
>> index daf9fdf..4ece0a4 100644
>> --- a/arch/x86/events/intel/cqm.c
>> +++ b/arch/x86/events/intel/cqm.c
>> @@ -198,19 +198,6 @@ static inline int cqm_prmid_update(struct prmid
>> *prmid)
>> return __cqm_prmid_update(prmid, __rmid_min_update_time);
>> }
>>
>> -/*
>> - * Updates caller cpu's cache.
>> - */
>> -static inline void __update_pqr_prmid(struct prmid *prmid)
>> -{
>> - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> -
>> - if (state->rmid == prmid->rmid)
>> - return;
>> - state->rmid = prmid->rmid;
>> - wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
>> -}
>> -
>> static inline bool __valid_pkg_id(u16 pkg_id)
>> {
>> return pkg_id < PQR_MAX_NR_PKGS;
>> @@ -2531,12 +2518,11 @@ static inline bool cqm_group_leader(struct
>> perf_event *event)
>> static inline void __intel_cqm_event_start(
>> struct perf_event *event, union prmid_summary summary)
>> {
>> - u16 pkg_id = topology_physical_package_id(smp_processor_id());
>> if (!(event->hw.state & PERF_HES_STOPPED))
>> return;
>> -
>> event->hw.state &= ~PERF_HES_STOPPED;
>> - __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
>> +
>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_EVENT);
>> }
>>
>> static void intel_cqm_event_start(struct perf_event *event, int mode)
>> @@ -2566,7 +2552,7 @@ static void intel_cqm_event_stop(struct perf_event
>> *event, int mode)
>> /* Occupancy of CQM events is obtained at read. No need to read
>> * when event is stopped since read on inactive cpus succeed.
>> */
>> - __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
>> }
>>
>> static int intel_cqm_event_add(struct perf_event *event, int mode)
>> @@ -2977,6 +2963,8 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
>>
>> state->rmid = 0;
>> state->closid = 0;
>> + state->next_rmid = 0;
>> + state->next_closid = 0;
>>
>> /* XXX: lock */
>> /* XXX: Make sure this case is handled when hotplug happens. */
>> @@ -3152,6 +3140,12 @@ static int __init intel_cqm_init(void)
>> pr_info("Intel CQM monitoring enabled with at least %u rmids per
>> package.\n",
>> min_max_rmid + 1);
>>
>> + /* Make sure pqr_common_enable_key is enabled after
>> + * cqm_initialized_key.
>> + */
>> + barrier();
>> +
>> + static_branch_enable(&pqr_common_enable_key);
>> return ret;
>>
>> error_init_mutex:
>> @@ -3163,4 +3157,41 @@ error:
>> return ret;
>> }
>>
>> +/* Schedule task without a CQM perf_event. */
>> +inline void __intel_cqm_no_event_sched_in(void)
>> +{
>> +#ifdef CONFIG_CGROUP_PERF
>> + struct monr *monr;
>> + struct pmonr *pmonr;
>> + union prmid_summary summary;
>> + u16 pkg_id = topology_physical_package_id(smp_processor_id());
>> + struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
>> +
>> + /* Assume CQM enabled is likely given that PQR is enabled. */
>> + if (!static_branch_likely(&cqm_initialized_key))
>> + return;
>> +
>> + /* Safe to call from_task since we are in scheduler lock. */
>> + monr = monr_from_perf_cgroup(perf_cgroup_from_task(current,
>> NULL));
>> + pmonr = monr->pmonrs[pkg_id];
>> +
>> + /* Utilize most up to date pmonr summary. */
>> + monr_hrchy_get_next_prmid_summary(pmonr);
>> + summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
>> +
>> + if (!prmid_summary__is_mon_active(summary))
>> + goto no_rmid;
>> +
>> + if (WARN_ON_ONCE(!__valid_rmid(pkg_id, summary.sched_rmid)))
>> + goto no_rmid;
>> +
>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
>> + return;
>> +
>> +no_rmid:
>> + summary.value = atomic64_read(&root_pmonr->prmid_summary_atomic);
>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
>> +#endif
>> +}
>> +
>> device_initcall(intel_cqm_init);
>> diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
>> index 0f3da94..e1f8bd0 100644
>> --- a/arch/x86/events/intel/cqm.h
>> +++ b/arch/x86/events/intel/cqm.h
>> @@ -82,8 +82,6 @@ union prmid_summary {
>> };
>> };
>>
>> -# define INVALID_RMID (-1)
>> -
>> /* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or
>> INVALID_RMID
>> * depending on whether monitoring is active or not.
>> */
>> diff --git a/arch/x86/include/asm/pqr_common.h
>> b/arch/x86/include/asm/pqr_common.h
>> index f770637..abbb235 100644
>> --- a/arch/x86/include/asm/pqr_common.h
>> +++ b/arch/x86/include/asm/pqr_common.h
>> @@ -3,31 +3,72 @@
>>
>> #if defined(CONFIG_INTEL_RDT)
>>
>> +#include <linux/jump_label.h>
>> #include <linux/types.h>
>> #include <asm/percpu.h>
>> +#include <asm/msr.h>
>>
>> #define MSR_IA32_PQR_ASSOC 0x0c8f
>> +#define INVALID_RMID (-1)
>> +#define INVALID_CLOSID (-1)
>> +
>> +
>> +extern struct static_key_false pqr_common_enable_key;
>> +
>> +enum intel_pqr_rmid_mode {
>> + /* RMID has no perf_event associated. */
>> + PQR_RMID_MODE_NOEVENT = 0,
>> + /* RMID has a perf_event associated. */
>> + PQR_RMID_MODE_EVENT
>> +};
>>
>> /**
>> * struct intel_pqr_state - State cache for the PQR MSR
>> - * @rmid: The cached Resource Monitoring ID
>> - * @closid: The cached Class Of Service ID
>> + * @rmid: Last rmid written to hw.
>> + * @next_rmid: Next rmid to write to hw.
>> + * @next_rmid_mode: Next rmid's mode.
>> + * @closid: The current Class Of Service ID
>> + * @next_closid: The Class Of Service ID to use.
>> *
>> * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
>> * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
>> * contains both parts, so we need to cache them.
>> *
>> - * The cache also helps to avoid pointless updates if the value does
>> - * not change.
>> + * The cache also helps to avoid pointless updates if the value does not
>> + * change. It also keeps track of the type of RMID set (event vs no
>> event)
>> + * used to determine when a cgroup RMID is required.
>> */
>> struct intel_pqr_state {
>> - u32 rmid;
>> - u32 closid;
>> + u32 rmid;
>> + u32 next_rmid;
>> + enum intel_pqr_rmid_mode next_rmid_mode;
>> + u32 closid;
>> + u32 next_closid;
>> };
>>
>> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>>
>> #define PQR_MAX_NR_PKGS 8
>>
>> +void __pqr_update(void);
>> +
>> +inline void __intel_cqm_no_event_sched_in(void);
>> +
>> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode
>> mode);
>> +
>> +inline void pqr_cache_update_closid(u32 closid);
>> +
>> +static inline void pqr_update(void)
>> +{
>> + if (static_branch_unlikely(&pqr_common_enable_key))
>> + __pqr_update();
>> +}
>> +
>> +#else
>> +
>> +static inline void pqr_update(void)
>> +{
>> +}
>> +
>> #endif
>> #endif
>> diff --git a/arch/x86/kernel/cpu/pqr_common.c
>> b/arch/x86/kernel/cpu/pqr_common.c
>> index 9eff5d9..d91c127 100644
>> --- a/arch/x86/kernel/cpu/pqr_common.c
>> +++ b/arch/x86/kernel/cpu/pqr_common.c
>> @@ -1,9 +1,43 @@
>> #include <asm/pqr_common.h>
>>
>> -/*
>> - * The cached intel_pqr_state is strictly per CPU and can never be
>> - * updated from a remote CPU. Both functions which modify the state
>> - * (intel_cqm_event_start and intel_cqm_event_stop) are called with
>> - * interrupts disabled, which is sufficient for the protection.
>> - */
>> DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
>> +
>> +DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
>> +
>> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode
>> mode)
>> +{
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +
>> + state->next_rmid_mode = mode;
>> + state->next_rmid = rmid;
>> +}
>> +
>> +inline void pqr_cache_update_closid(u32 closid)
>> +{
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +
>> + state->next_closid = closid;
>> +}
>> +
>> +/* Update hw's RMID using cgroup's if perf_event did not.
>> + * Sync pqr cache with MSR.
>> + */
>> +inline void __pqr_update(void)
>> +{
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +
>> + /* If perf_event has set a next_rmid that is used, do not try
>> + * to obtain another one from current task.
>> + */
>> + if (state->next_rmid_mode == PQR_RMID_MODE_NOEVENT)
>> + __intel_cqm_no_event_sched_in();
>
>
> if perf_cgroup is not defined then the state is not updated here so the
> state might have state->rmid might have IDs ?
>
> 1.event1 for PID1 has RMID1,
> 2.perf sched_in - state->next_rmid = rmid1
> 3.pqr_update - state->rmid = rmid1
> 4.sched_out - write PQR_NOEVENT -
> 5.next switch_to - state->rmid not reset nothing changes (when no
> perf_cgroup) ?
>
>
>> +
>> + /* __intel_cqm_no_event_sched_in might have changed next_rmid. */
>> + if (state->rmid == state->next_rmid &&
>> + state->closid == state->next_closid)
>> + return;
>> +
>> + state->rmid = state->next_rmid;
>> + state->closid = state->next_closid;
>> + wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
>> +}
>> --
>> 2.8.0.rc3.226.g39d4020
>>
>>
>
That's a possibility, although it will increase the distance between
pmu->add for other perf events and the effective time that CQM
monitoring starts.
On Fri, Apr 29, 2016 at 1:21 PM, Vikas Shivappa
<[email protected]> wrote:
>
>
> On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
>
>> (Re-sending in plain text)
>>
>> This hook is used in the following patch in the series to write to
>> PQR_ASSOC_MSR, a msr that is utilized both by CQM/CMT and by CAT.
>> Since CAT is not dependent on perf, I created this hook to start CQM
>> monitoring right after other events start while keeping it independent
>> of perf. The idea is to have future versions of CAT to also rely on
>> this hook.
>
>
> CAT did the msr write in switch_to as Peter did not want a new hook to be
> used. Same could be done here.
>
> Thanks,
> Vikas
>
>
>>
>> On Fri, Apr 29, 2016 at 11:05 AM, David Carrillo-Cisneros
>> <[email protected]> wrote:
>>>
>>> This hook is used in the following patch in the series to write to
>>> PQR_ASSOC_MSR, a msr that is utilized both by CQM/CMT and by CAT. Since
>>> CAT
>>> is not dependent on perf, I created this hook to start CQM monitoring
>>> right
>>> after other events start while keeping it independent of perf. The idea
>>> is
>>> to have future versions of CAT to also rely on this hook.
>>>
>>> On Fri, Apr 29, 2016 at 1:52 AM Peter Zijlstra <[email protected]>
>>> wrote:
>>>>
>>>>
>>>> On Thu, Apr 28, 2016 at 09:43:31PM -0700, David Carrillo-Cisneros wrote:
>>>>>
>>>>> This hook allows architecture specific code to be called at the end of
>>>>> the task switch and after perf_events' context switch but before the
>>>>> scheduler lock is released.
>>>>>
>>>>> The specific use case in this series is to avoid multiple writes to a
>>>>> slow
>>>>> MSR until all functions which modify such register in task switch have
>>>>> finished.
>>>>
>>>>
>>>> Yeah, no. This really need way more justification. Why can't you use the
>>>> regular perf sched-in stuff for CQM?
>>
>>
>
> @@ -27,5 +27,7 @@ struct intel_pqr_state {
>
> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>
> +#define PQR_MAX_NR_PKGS 8
topology_max_packages()
> +
> #endif
> #endif
> --
> 2.8.0.rc3.226.g39d4020
>
>
On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
> Not sure I see the problem you point here. In step 3, PQR_ASSOC is
> updated with RMID1, __pqr_update is the one called using the scheduler
> hook, right after perf sched_in .
>
> On Fri, Apr 29, 2016 at 1:25 PM, Vikas Shivappa
> <[email protected]> wrote:
>>
>>
>> On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
>>
>>> Allow monitored cgroups to update the PQR MSR during task switch even
>>> without an associated perf_event.
>>>
>>> The package RMID for the current monr associated with a monitored
>>> cgroup is written to hw during task switch (after perf_events is run)
>>> if perf_event did not write a RMID for an event.
>>>
>>> perf_event and any other caller of pqr_cache_update_rmid can update the
>>> CPU's RMID using one of two modes:
>>> - PQR_RMID_MODE_NOEVENT: A RMID that do not correspond to an event.
>>> e.g. the RMID of the root pmonr when no event is scheduled.
>>> - PQR_RMID_MODE_EVENT: A RMID used by an event. Set during pmu::add
>>> unset on pmu::del. This mode prevents from using a non-event
>>> cgroup RMID.
>>>
>>> This patch also introduces caching of writes to PQR MSR within the per-pcu
>>> pqr state variable. This interface to update RMIDs and CLOSIDs will be
>>> also utilized in upcoming versions of Intel's MBM and CAT drivers.
>>>
>>> Reviewed-by: Stephane Eranian <[email protected]>
>>> Signed-off-by: David Carrillo-Cisneros <[email protected]>
>>> ---
>>> arch/x86/events/intel/cqm.c | 65
>>> +++++++++++++++++++++++++++++----------
>>> arch/x86/events/intel/cqm.h | 2 --
>>> arch/x86/include/asm/pqr_common.h | 53 +++++++++++++++++++++++++++----
>>> arch/x86/kernel/cpu/pqr_common.c | 46 +++++++++++++++++++++++----
>>> 4 files changed, 135 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
>>> index daf9fdf..4ece0a4 100644
>>> --- a/arch/x86/events/intel/cqm.c
>>> +++ b/arch/x86/events/intel/cqm.c
>>> @@ -198,19 +198,6 @@ static inline int cqm_prmid_update(struct prmid
>>> *prmid)
>>> return __cqm_prmid_update(prmid, __rmid_min_update_time);
>>> }
>>>
>>> -/*
>>> - * Updates caller cpu's cache.
>>> - */
>>> -static inline void __update_pqr_prmid(struct prmid *prmid)
>>> -{
>>> - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>>> -
>>> - if (state->rmid == prmid->rmid)
>>> - return;
>>> - state->rmid = prmid->rmid;
>>> - wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
>>> -}
>>> -
>>> static inline bool __valid_pkg_id(u16 pkg_id)
>>> {
>>> return pkg_id < PQR_MAX_NR_PKGS;
>>> @@ -2531,12 +2518,11 @@ static inline bool cqm_group_leader(struct
>>> perf_event *event)
>>> static inline void __intel_cqm_event_start(
>>> struct perf_event *event, union prmid_summary summary)
>>> {
>>> - u16 pkg_id = topology_physical_package_id(smp_processor_id());
>>> if (!(event->hw.state & PERF_HES_STOPPED))
>>> return;
>>> -
>>> event->hw.state &= ~PERF_HES_STOPPED;
>>> - __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
>>> +
>>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_EVENT);
>>> }
>>>
>>> static void intel_cqm_event_start(struct perf_event *event, int mode)
>>> @@ -2566,7 +2552,7 @@ static void intel_cqm_event_stop(struct perf_event
>>> *event, int mode)
>>> /* Occupancy of CQM events is obtained at read. No need to read
>>> * when event is stopped since read on inactive cpus succeed.
>>> */
>>> - __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
>>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
>>> }
>>>
>>> static int intel_cqm_event_add(struct perf_event *event, int mode)
>>> @@ -2977,6 +2963,8 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
>>>
>>> state->rmid = 0;
>>> state->closid = 0;
>>> + state->next_rmid = 0;
>>> + state->next_closid = 0;
>>>
>>> /* XXX: lock */
>>> /* XXX: Make sure this case is handled when hotplug happens. */
>>> @@ -3152,6 +3140,12 @@ static int __init intel_cqm_init(void)
>>> pr_info("Intel CQM monitoring enabled with at least %u rmids per
>>> package.\n",
>>> min_max_rmid + 1);
>>>
>>> + /* Make sure pqr_common_enable_key is enabled after
>>> + * cqm_initialized_key.
>>> + */
>>> + barrier();
>>> +
>>> + static_branch_enable(&pqr_common_enable_key);
>>> return ret;
>>>
>>> error_init_mutex:
>>> @@ -3163,4 +3157,41 @@ error:
>>> return ret;
>>> }
>>>
>>> +/* Schedule task without a CQM perf_event. */
>>> +inline void __intel_cqm_no_event_sched_in(void)
>>> +{
>>> +#ifdef CONFIG_CGROUP_PERF
>>> + struct monr *monr;
>>> + struct pmonr *pmonr;
>>> + union prmid_summary summary;
>>> + u16 pkg_id = topology_physical_package_id(smp_processor_id());
>>> + struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
>>> +
>>> + /* Assume CQM enabled is likely given that PQR is enabled. */
>>> + if (!static_branch_likely(&cqm_initialized_key))
>>> + return;
>>> +
>>> + /* Safe to call from_task since we are in scheduler lock. */
>>> + monr = monr_from_perf_cgroup(perf_cgroup_from_task(current,
>>> NULL));
>>> + pmonr = monr->pmonrs[pkg_id];
>>> +
>>> + /* Utilize most up to date pmonr summary. */
>>> + monr_hrchy_get_next_prmid_summary(pmonr);
>>> + summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
>>> +
>>> + if (!prmid_summary__is_mon_active(summary))
>>> + goto no_rmid;
>>> +
>>> + if (WARN_ON_ONCE(!__valid_rmid(pkg_id, summary.sched_rmid)))
>>> + goto no_rmid;
>>> +
>>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
>>> + return;
>>> +
>>> +no_rmid:
>>> + summary.value = atomic64_read(&root_pmonr->prmid_summary_atomic);
>>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
>>> +#endif
>>> +}
>>> +
>>> device_initcall(intel_cqm_init);
>>> diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
>>> index 0f3da94..e1f8bd0 100644
>>> --- a/arch/x86/events/intel/cqm.h
>>> +++ b/arch/x86/events/intel/cqm.h
>>> @@ -82,8 +82,6 @@ union prmid_summary {
>>> };
>>> };
>>>
>>> -# define INVALID_RMID (-1)
>>> -
>>> /* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or
>>> INVALID_RMID
>>> * depending on whether monitoring is active or not.
>>> */
>>> diff --git a/arch/x86/include/asm/pqr_common.h
>>> b/arch/x86/include/asm/pqr_common.h
>>> index f770637..abbb235 100644
>>> --- a/arch/x86/include/asm/pqr_common.h
>>> +++ b/arch/x86/include/asm/pqr_common.h
>>> @@ -3,31 +3,72 @@
>>>
>>> #if defined(CONFIG_INTEL_RDT)
>>>
>>> +#include <linux/jump_label.h>
>>> #include <linux/types.h>
>>> #include <asm/percpu.h>
>>> +#include <asm/msr.h>
>>>
>>> #define MSR_IA32_PQR_ASSOC 0x0c8f
>>> +#define INVALID_RMID (-1)
>>> +#define INVALID_CLOSID (-1)
>>> +
>>> +
>>> +extern struct static_key_false pqr_common_enable_key;
>>> +
>>> +enum intel_pqr_rmid_mode {
>>> + /* RMID has no perf_event associated. */
>>> + PQR_RMID_MODE_NOEVENT = 0,
>>> + /* RMID has a perf_event associated. */
>>> + PQR_RMID_MODE_EVENT
>>> +};
>>>
>>> /**
>>> * struct intel_pqr_state - State cache for the PQR MSR
>>> - * @rmid: The cached Resource Monitoring ID
>>> - * @closid: The cached Class Of Service ID
>>> + * @rmid: Last rmid written to hw.
>>> + * @next_rmid: Next rmid to write to hw.
>>> + * @next_rmid_mode: Next rmid's mode.
>>> + * @closid: The current Class Of Service ID
>>> + * @next_closid: The Class Of Service ID to use.
>>> *
>>> * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
>>> * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
>>> * contains both parts, so we need to cache them.
>>> *
>>> - * The cache also helps to avoid pointless updates if the value does
>>> - * not change.
>>> + * The cache also helps to avoid pointless updates if the value does not
>>> + * change. It also keeps track of the type of RMID set (event vs no event)
>>> + * used to determine when a cgroup RMID is required.
>>> */
>>> struct intel_pqr_state {
>>> - u32 rmid;
>>> - u32 closid;
>>> + u32 rmid;
>>> + u32 next_rmid;
>>> + enum intel_pqr_rmid_mode next_rmid_mode;
>>> + u32 closid;
>>> + u32 next_closid;
>>> };
>>>
>>> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>>>
>>> #define PQR_MAX_NR_PKGS 8
>>>
>>> +void __pqr_update(void);
>>> +
>>> +inline void __intel_cqm_no_event_sched_in(void);
>>> +
>>> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode);
>>> +
>>> +inline void pqr_cache_update_closid(u32 closid);
>>> +
>>> +static inline void pqr_update(void)
>>> +{
>>> + if (static_branch_unlikely(&pqr_common_enable_key))
>>> + __pqr_update();
>>> +}
>>> +
>>> +#else
>>> +
>>> +static inline void pqr_update(void)
>>> +{
>>> +}
>>> +
>>> #endif
>>> #endif
>>> diff --git a/arch/x86/kernel/cpu/pqr_common.c b/arch/x86/kernel/cpu/pqr_common.c
>>> index 9eff5d9..d91c127 100644
>>> --- a/arch/x86/kernel/cpu/pqr_common.c
>>> +++ b/arch/x86/kernel/cpu/pqr_common.c
>>> @@ -1,9 +1,43 @@
>>> #include <asm/pqr_common.h>
>>>
>>> -/*
>>> - * The cached intel_pqr_state is strictly per CPU and can never be
>>> - * updated from a remote CPU. Both functions which modify the state
>>> - * (intel_cqm_event_start and intel_cqm_event_stop) are called with
>>> - * interrupts disabled, which is sufficient for the protection.
>>> - */
>>> DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
>>> +
>>> +DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
>>> +
>>> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode)
>>> +{
>>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>>> +
>>> + state->next_rmid_mode = mode;
>>> + state->next_rmid = rmid;
>>> +}
>>> +
>>> +inline void pqr_cache_update_closid(u32 closid)
>>> +{
>>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>>> +
>>> + state->next_closid = closid;
>>> +}
>>> +
>>> +/* Update hw's RMID using cgroup's if perf_event did not.
>>> + * Sync pqr cache with MSR.
>>> + */
>>> +inline void __pqr_update(void)
>>> +{
>>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>>> +
>>> + /* If perf_event has set a next_rmid that is used, do not try
>>> + * to obtain another one from current task.
>>> + */
>>> + if (state->next_rmid_mode == PQR_RMID_MODE_NOEVENT)
>>> + __intel_cqm_no_event_sched_in();
>>
>>
>> if perf_cgroup is not defined then the state is not updated here, so
>> state->rmid might still hold a stale RMID?
>>
>> 1.event1 for PID1 has RMID1,
>> 2.perf sched_in - state->next_rmid = rmid1
>> 3.pqr_update - state->rmid = rmid1
>> 4.sched_out - write PQR_NOEVENT -
>> 5.next switch_to - state->rmid not reset nothing changes (when no
>> perf_cgroup) ?
Between 4 and 5, say event1 is dead. Basically, on the next context switch (#5),
if perf sched_in wasn't called, PQR still has RMID1.
When you have perf_cgroup, you call __intel_cqm_no_event_sched_in(), which then
either sees continuous monitoring or sets next_rmid to 0, but all of that code
is inside #ifdef CONFIG_CGROUP_PERF, so without cgroups it is never reset to
zero?
>>
>>
>>> +
>>> + /* __intel_cqm_no_event_sched_in might have changed next_rmid. */
>>> + if (state->rmid == state->next_rmid &&
>>> + state->closid == state->next_closid)
>>> + return;
>>> +
>>> + state->rmid = state->next_rmid;
>>> + state->closid = state->next_closid;
>>> + wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
>>> +}
>>> --
>>> 2.8.0.rc3.226.g39d4020
>>>
>>>
>>
>
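To make the scenario above easier to follow, here is a minimal, self-contained
userspace model of the two-phase update being debated: perf callbacks stage
next_rmid/next_rmid_mode, and the switch_to-time sync writes the MSR only when
the staged value differs from the last one written. The names only mirror the
patch; this is an illustration, not the kernel code, and the value staged at
step 4 is a placeholder, since that is exactly what the rest of the thread
settles.

/* Illustrative model only; not from the patch. */
#include <stdio.h>

enum rmid_mode { MODE_NOEVENT = 0, MODE_EVENT };

struct pqr_state {
	unsigned int rmid;		/* last value "written" to hw */
	unsigned int next_rmid;		/* staged by perf callbacks */
	enum rmid_mode next_rmid_mode;
};

static unsigned int pqr_assoc;		/* stands in for MSR_IA32_PQR_ASSOC */

/* models pqr_cache_update_rmid(): only stages, never touches the MSR */
static void stage_rmid(struct pqr_state *s, unsigned int rmid,
		       enum rmid_mode mode)
{
	s->next_rmid = rmid;
	s->next_rmid_mode = mode;
}

/* models __pqr_update() with CONFIG_CGROUP_PERF compiled out */
static void sync_pqr(struct pqr_state *s)
{
	if (s->next_rmid_mode == MODE_NOEVENT) {
		/* __intel_cqm_no_event_sched_in() would run here; without
		 * CONFIG_CGROUP_PERF its body is empty, so nothing changes. */
	}
	if (s->rmid == s->next_rmid)
		return;
	s->rmid = s->next_rmid;
	pqr_assoc = s->rmid;		/* the wrmsr() */
}

int main(void)
{
	struct pqr_state s = { 0, 0, MODE_NOEVENT };

	stage_rmid(&s, 1, MODE_EVENT);		/* steps 1-2: event1 starts, RMID1 staged */
	sync_pqr(&s);				/* step 3: switch_to writes RMID1 */
	stage_rmid(&s, 0, MODE_NOEVENT);	/* step 4: teardown stages *some* RMID in
						 * NOEVENT mode; 0 is a placeholder, the
						 * real value is the point in dispute */
	sync_pqr(&s);				/* step 5: next switch_to */
	printf("PQR_ASSOC now holds RMID %u\n", pqr_assoc);
	return 0;
}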
On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
> This series introduces the next iteration of kernel support for the
> Cache QoS Monitoring (CQM) technology available in Intel Xeon processors.
Wondering what kernel version this compiles on?
Thanks,
Vikas
>
> One of the main limitations of the previous version is the inability
> to simultaneously monitor:
> 1) cpu event and any other event in that cpu.
> 2) cgroup events for cgroups in same descendancy line.
> 3) cgroup events and any thread event of a cgroup in the same
> descendancy line.
>
> Another limitation is that monitoring for a cgroup was enabled/disabled by
> the existence of a perf event for that cgroup. Since the event
> llc_occupancy measures changes in occupancy rather than total occupancy,
> in order to read meaningful llc_occupancy values, an event should be
> enabled for a long enough period of time. The overhead in context switches
> caused by the perf events is undesired in some sensitive scenarios.
>
> This series of patches addresses the shortcomings mentioned above and,
> add some other improvements. The main changes are:
> - No more potential conflicts between different events. New
> version builds a hierarchy of RMIDs that captures the dependency
> between monitored cgroups. llc_occupancy for cgroup is the sum of
> llc_occupancies for that cgroup RMID and all other RMIDs in the
> cgroups subtree (both monitored cgroups and threads).
>
> - A cgroup integration that allows to monitor the a cgroup without
> creating a perf event, decreasing the context switch overhead.
> Monitoring is controlled by a boolean cgroup subsystem attribute
> in each perf cgroup, this is:
>
> echo 1 > cgroup_path/perf_event.cqm_cont_monitoring
>
> starts CQM monitoring whether or not there is a perf_event
> attached to the cgroup. Setting the attribute to 0 makes
> monitoring dependent on the existence of a perf_event.
> A perf_event is always required in order to read llc_occupancy.
> This cgroup integration uses Intel's PQR code and is intended to
> be used by upcoming versions of Intel's CAT.
>
> - A more stable rotation algorithm: New algorithm uses SLOs that
> guarantee:
> - A minimum of enabled time for monitored cgroups and
> threads.
> - A maximum time disabled before error is introduced by
> reusing dirty RMIDs.
> - A minimum rate at which RMIDs recycling must progress.
>
> - Reduced impact of stealing/rotation of RMIDs: The new algorithm
> accounts the residual occupancy held by limbo RMIDs towards the
> former owner of the limbo RMID, decreasing the error introduced
> by RMID rotation.
> It also allows a limbo RMID to be reused by its former owner when
> appropriate, decreasing the potential error of reusing dirty RMIDs
> and allowing to make progress even if most limbo RMIDs do not
> drop occupancy fast enough.
>
> - Elimination of pmu::count: perf generic's perf_event_count()
> perform a quick add of atomic types. The introduction of
> pmu::count in the previous CQM series to read occupancy for thread
> events changed the behavior of perf_event_count() by performing a
> potentially slow IPI and write/read to MSR. It also made pmu::read
> to have different behaviors depending on whether the event was a
> cpu/cgroup event or a thread. This patches serie removes the custom
> pmu::count from CQM and provides a consistent behavior for all
> calls of perf_event_read .
>
> - Added error return for pmu::read: Reads to CQM events may fail
> due to stealing of RMIDs, even after successfully adding an event
> to a PMU. This patch series expands pmu::read with an int return
> value and propagates the error to callers that can fail
> (ie. perf_read).
> The ability to fail of pmu::read is consistent with the recent
> changes that allow perf_event_read to fail for transactional
> reading of event groups.
>
> - Introduces the field pmu_event_flags that contain flags set by
> the PMU to signal variations on the default behavior to perf's
> generic code. In this series, three flags are introduced:
> - PERF_CGROUP_NO_RECURSION : Signals generic code to add
> events of the cgroup ancestors of a cgroup.
> - PERF_INACTIVE_CPU_READ_PKG: Signals generic coda that
> this CPU event can be read in any CPU in its event::cpu's
> package, even if the event is not active.
> - PERF_INACTIVE_EV_READ_ANY_CPU: Signals generic code that
> this event can be read in any CPU in any package in the
> system even if the event is not active.
> Using the above flags takes advantage of the CQM's hw ability to
> read llc_occupancy even when the associated perf event is not
> running in a CPU.
>
> This patch series also updates the perf tool to fix error handling and to
> better handle the idiosyncrasies of snapshot and per-pkg events.
>
> David Carrillo-Cisneros (31):
> perf/x86/intel/cqm: temporarily remove MBM from CQM and cleanup
> perf/x86/intel/cqm: remove check for conflicting events
> perf/x86/intel/cqm: remove all code for rotation of RMIDs
> perf/x86/intel/cqm: make read of RMIDs per package (Temporal)
> perf/core: remove unused pmu->count
> x86/intel,cqm: add CONFIG_INTEL_RDT configuration flag and refactor
> PQR
> perf/x86/intel/cqm: separate CQM PMU's attributes from x86 PMU
> perf/x86/intel/cqm: prepare for next patches
> perf/x86/intel/cqm: add per-package RMIDs, data and locks
> perf/x86/intel/cqm: basic RMID hierarchy with per package rmids
> perf/x86/intel/cqm: (I)state and limbo prmids
> perf/x86/intel/cqm: add per-package RMID rotation
> perf/x86/intel/cqm: add polled update of RMID's llc_occupancy
> perf/x86/intel/cqm: add preallocation of anodes
> perf/core: add hooks to expose architecture specific features in
> perf_cgroup
> perf/x86/intel/cqm: add cgroup support
> perf/core: adding pmu::event_terminate
> perf/x86/intel/cqm: use pmu::event_terminate
> perf/core: introduce PMU event flag PERF_CGROUP_NO_RECURSION
> x86/intel/cqm: use PERF_CGROUP_NO_RECURSION in CQM
> perf/x86/intel/cqm: handle inherit event and inherit_stat flag
> perf/x86/intel/cqm: introduce read_subtree
> perf/core: introduce PERF_INACTIVE_*_READ_* flags
> perf/x86/intel/cqm: use PERF_INACTIVE_*_READ_* flags in CQM
> sched: introduce the finish_arch_pre_lock_switch() scheduler hook
> perf/x86/intel/cqm: integrate CQM cgroups with scheduler
> perf/core: add perf_event cgroup hooks for subsystem attributes
> perf/x86/intel/cqm: add CQM attributes to perf_event cgroup
> perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to
> pmu::read
> perf,perf/x86: add hook perf_event_arch_exec
> perf/stat: revamp error handling for snapshot and per_pkg events
>
> Stephane Eranian (1):
> perf/stat: fix bug in handling events in error state
>
> arch/alpha/kernel/perf_event.c | 3 +-
> arch/arc/kernel/perf_event.c | 3 +-
> arch/arm64/include/asm/hw_breakpoint.h | 2 +-
> arch/arm64/kernel/hw_breakpoint.c | 3 +-
> arch/metag/kernel/perf/perf_event.c | 5 +-
> arch/mips/kernel/perf_event_mipsxx.c | 3 +-
> arch/powerpc/include/asm/hw_breakpoint.h | 2 +-
> arch/powerpc/kernel/hw_breakpoint.c | 3 +-
> arch/powerpc/perf/core-book3s.c | 11 +-
> arch/powerpc/perf/core-fsl-emb.c | 5 +-
> arch/powerpc/perf/hv-24x7.c | 5 +-
> arch/powerpc/perf/hv-gpci.c | 3 +-
> arch/s390/kernel/perf_cpum_cf.c | 5 +-
> arch/s390/kernel/perf_cpum_sf.c | 3 +-
> arch/sh/include/asm/hw_breakpoint.h | 2 +-
> arch/sh/kernel/hw_breakpoint.c | 3 +-
> arch/sparc/kernel/perf_event.c | 2 +-
> arch/tile/kernel/perf_event.c | 3 +-
> arch/x86/Kconfig | 6 +
> arch/x86/events/amd/ibs.c | 2 +-
> arch/x86/events/amd/iommu.c | 5 +-
> arch/x86/events/amd/uncore.c | 3 +-
> arch/x86/events/core.c | 3 +-
> arch/x86/events/intel/Makefile | 3 +-
> arch/x86/events/intel/bts.c | 3 +-
> arch/x86/events/intel/cqm.c | 3847 +++++++++++++++++++++---------
> arch/x86/events/intel/cqm.h | 519 ++++
> arch/x86/events/intel/cstate.c | 3 +-
> arch/x86/events/intel/pt.c | 3 +-
> arch/x86/events/intel/rapl.c | 3 +-
> arch/x86/events/intel/uncore.c | 3 +-
> arch/x86/events/intel/uncore.h | 2 +-
> arch/x86/events/msr.c | 3 +-
> arch/x86/include/asm/hw_breakpoint.h | 2 +-
> arch/x86/include/asm/perf_event.h | 41 +
> arch/x86/include/asm/pqr_common.h | 74 +
> arch/x86/include/asm/processor.h | 4 +
> arch/x86/kernel/cpu/Makefile | 4 +
> arch/x86/kernel/cpu/pqr_common.c | 43 +
> arch/x86/kernel/hw_breakpoint.c | 3 +-
> arch/x86/kvm/pmu.h | 10 +-
> drivers/bus/arm-cci.c | 3 +-
> drivers/bus/arm-ccn.c | 3 +-
> drivers/perf/arm_pmu.c | 3 +-
> include/linux/perf_event.h | 91 +-
> kernel/events/core.c | 170 +-
> kernel/sched/core.c | 1 +
> kernel/sched/sched.h | 3 +
> kernel/trace/bpf_trace.c | 5 +-
> tools/perf/builtin-stat.c | 43 +-
> tools/perf/util/counts.h | 19 +
> tools/perf/util/evsel.c | 44 +-
> tools/perf/util/evsel.h | 8 +-
> tools/perf/util/stat.c | 35 +-
> 54 files changed, 3746 insertions(+), 1337 deletions(-)
> create mode 100644 arch/x86/events/intel/cqm.h
> create mode 100644 arch/x86/include/asm/pqr_common.h
> create mode 100644 arch/x86/kernel/cpu/pqr_common.c
>
> --
> 2.8.0.rc3.226.g39d4020
>
>
It compiles against the peterz/queue tree, perf/core branch.
On Fri, Apr 29, 2016 at 2:06 PM Vikas Shivappa
<[email protected]> wrote:
>
>
>
> On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
>
> > This series introduces the next iteration of kernel support for the
> > Cache QoS Monitoring (CQM) technology available in Intel Xeon processors.
>
> Wondering what kernel version this compiles on?
>
> Thanks,
> Vikas
If __intel_cqm_no_event_sched_in() does nothing, the PQR_ASSOC MSR is
still updated whenever state->rmid != state->next_rmid in __pqr_update,
even if next_rmid_mode == PQR_RMID_MODE_NOEVENT.
On Fri, Apr 29, 2016 at 2:01 PM, Vikas Shivappa
<[email protected]> wrote:
>
>>>> 1.event1 for PID1 has RMID1,
>>>> 2.perf sched_in - state->next_rmid = rmid1
>>>> 3.pqr_update - state->rmid = rmid1
>>>> 4.sched_out - write PQR_NOEVENT -
>>>> 5.next switch_to - state->rmid not reset nothing changes (when no
>>>> perf_cgroup) ?
>
> Between 4 and 5, say event1 is dead. Basically, on the next context switch (#5),
> if perf sched_in wasn't called, PQR still has RMID1.
>
> When you have perf_cgroup, you call __intel_cqm_no_event_sched_in(), which then
> either sees continuous monitoring or sets next_rmid to 0, but all of that code
> is inside #ifdef CONFIG_CGROUP_PERF, so without cgroups it is never reset to
> zero?
On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
> If __intel_cqm_no_event_sched_in() does nothing, the PQR_ASSOC MSR is
> still updated whenever state->rmid != state->next_rmid in __pqr_update,
But due to steps 2 and 3 below, they are equal?
> even if next_rmid_mode == PQR_RMID_MODE_NOEVENT.
>
> On Fri, Apr 29, 2016 at 2:01 PM, Vikas Shivappa
> <[email protected]> wrote:
>>
>>>> 1.event1 for PID1 has RMID1,
>>>> 2.perf sched_in - state->next_rmid = rmid1
>>>> 3.pqr_update - state->rmid = rmid1
>>>> 4.sched_out - write PQR_NOEVENT -
>>>> 5.next switch_to - state->rmid not reset nothing changes (when no
>>>> perf_cgroup) ?
>>
>> Between 4 and 5, say event1 is dead. Basically, on the next context switch (#5),
>> if perf sched_in wasn't called, PQR still has RMID1.
When an event is terminated, intel_cqm_event_stop calls
pqr_cache_update_rmid and sets state->next_rmid to the RMID of its
parent in the RMID hierarchy. That makes the next call to
__pqr_update update PQR_ASSOC.
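
In sketch form, the teardown ordering described here (a simplified model based
only on the hunks quoted in this thread; the RMID values are assumptions, with
0 standing in for whatever sched_rmid the parent reports):

/* Illustrative model only; not from the patch. */
#define EVENT_RMID	1U	/* RMID held by the dying event (assumed) */
#define PARENT_RMID	0U	/* its parent's sched_rmid (assumed) */

static unsigned int msr_rmid, cached_rmid, staged_rmid;

static void sync_pqr(void)		/* models __pqr_update() */
{
	if (cached_rmid != staged_rmid) {
		cached_rmid = staged_rmid;
		msr_rmid = cached_rmid;	/* the wrmsr() */
	}
}

int main(void)
{
	staged_rmid = EVENT_RMID;	/* pmu::add -> intel_cqm_event_start() */
	sync_pqr();			/* switch_to: MSR now holds EVENT_RMID */

	staged_rmid = PARENT_RMID;	/* pmu::del -> intel_cqm_event_stop() stages
					 * the parent's sched_rmid in NOEVENT mode */
	sync_pqr();			/* next switch_to: cached != staged, so the
					 * MSR is rewritten even though the cgroup
					 * path is compiled out */
	return msr_rmid == PARENT_RMID ? 0 : 1;
}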
On Fri, Apr 29, 2016 at 2:32 PM Vikas Shivappa
<[email protected]> wrote:
>
>
>
> On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
>
> > if __intel_cqm_no_event_sched_in does nothing, the PQR_ASSOC msr is
> > still updated if state->rmid != state->next_rmid in __pqr_update,
>
> But due to 2 and 3 below they are equal ?
>
>
> > even if next_rmid_mode == PQR_RMID_MODE_NOEVENT .
> >
> > On Fri, Apr 29, 2016 at 2:01 PM, Vikas Shivappa
> > <[email protected]> wrote:
> >>
> >>
> >> On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
> >>
> >>> Not sure I see the problem you point here. In step 3, PQR_ASSOC is
> >>> updated with RMID1, __pqr_update is the one called using the scheduler
> >>> hook, right after perf sched_in .
> >>>
> >>> On Fri, Apr 29, 2016 at 1:25 PM, Vikas Shivappa
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Thu, 28 Apr 2016, David Carrillo-Cisneros wrote:
> >>>>
> >>>>> Allow monitored cgroups to update the PQR MSR during task switch even
> >>>>> without an associated perf_event.
> >>>>>
> >>>>> The package RMID for the current monr associated with a monitored
> >>>>> cgroup is written to hw during task switch (after perf_events is run)
> >>>>> if perf_event did not write a RMID for an event.
> >>>>>
> >>>>> perf_event and any other caller of pqr_cache_update_rmid can update the
> >>>>> CPU's RMID using one of two modes:
> >>>>> - PQR_RMID_MODE_NOEVENT: A RMID that do not correspond to an event.
> >>>>> e.g. the RMID of the root pmonr when no event is scheduled.
> >>>>> - PQR_RMID_MODE_EVENT: A RMID used by an event. Set during pmu::add
> >>>>> unset on pmu::del. This mode prevents from using a non-event
> >>>>> cgroup RMID.
> >>>>>
> >>>>> This patch also introduces caching of writes to PQR MSR within the
> >>>>> per-pcu
> >>>>> pqr state variable. This interface to update RMIDs and CLOSIDs will be
> >>>>> also utilized in upcoming versions of Intel's MBM and CAT drivers.
> >>>>>
> >>>>> Reviewed-by: Stephane Eranian <[email protected]>
> >>>>> Signed-off-by: David Carrillo-Cisneros <[email protected]>
> >>>>> ---
> >>>>> arch/x86/events/intel/cqm.c | 65
> >>>>> +++++++++++++++++++++++++++++----------
> >>>>> arch/x86/events/intel/cqm.h | 2 --
> >>>>> arch/x86/include/asm/pqr_common.h | 53 +++++++++++++++++++++++++++----
> >>>>> arch/x86/kernel/cpu/pqr_common.c | 46 +++++++++++++++++++++++----
> >>>>> 4 files changed, 135 insertions(+), 31 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
> >>>>> index daf9fdf..4ece0a4 100644
> >>>>> --- a/arch/x86/events/intel/cqm.c
> >>>>> +++ b/arch/x86/events/intel/cqm.c
> >>>>> @@ -198,19 +198,6 @@ static inline int cqm_prmid_update(struct prmid
> >>>>> *prmid)
> >>>>> return __cqm_prmid_update(prmid, __rmid_min_update_time);
> >>>>> }
> >>>>>
> >>>>> -/*
> >>>>> - * Updates caller cpu's cache.
> >>>>> - */
> >>>>> -static inline void __update_pqr_prmid(struct prmid *prmid)
> >>>>> -{
> >>>>> - struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> >>>>> -
> >>>>> - if (state->rmid == prmid->rmid)
> >>>>> - return;
> >>>>> - state->rmid = prmid->rmid;
> >>>>> - wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
> >>>>> -}
> >>>>> -
> >>>>> static inline bool __valid_pkg_id(u16 pkg_id)
> >>>>> {
> >>>>> return pkg_id < PQR_MAX_NR_PKGS;
> >>>>> @@ -2531,12 +2518,11 @@ static inline bool cqm_group_leader(struct
> >>>>> perf_event *event)
> >>>>> static inline void __intel_cqm_event_start(
> >>>>> struct perf_event *event, union prmid_summary summary)
> >>>>> {
> >>>>> - u16 pkg_id = topology_physical_package_id(smp_processor_id());
> >>>>> if (!(event->hw.state & PERF_HES_STOPPED))
> >>>>> return;
> >>>>> -
> >>>>> event->hw.state &= ~PERF_HES_STOPPED;
> >>>>> - __update_pqr_prmid(__prmid_from_rmid(pkg_id,
> >>>>> summary.sched_rmid));
> >>>>> +
> >>>>> + pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_EVENT);
> >>>>> }
> >>>>>
> >>>>> static void intel_cqm_event_start(struct perf_event *event, int mode)
> >>>>> @@ -2566,7 +2552,7 @@ static void intel_cqm_event_stop(struct perf_event
> >>>>> *event, int mode)
> >>>>> /* Occupancy of CQM events is obtained at read. No need to read
> >>>>> * when event is stopped since read on inactive cpus succeed.
> >>>>> */
> >>>>> - __update_pqr_prmid(__prmid_from_rmid(pkg_id,
> >>>>> summary.sched_rmid));
> >>>>> + pqr_cache_update_rmid(summary.sched_rmid,
> >>>>> PQR_RMID_MODE_NOEVENT);
> >>>>> }
> >>>>>
> >>>>> static int intel_cqm_event_add(struct perf_event *event, int mode)
> >>>>> @@ -2977,6 +2963,8 @@ static void intel_cqm_cpu_starting(unsigned int
> >>>>> cpu)
> >>>>>
> >>>>> state->rmid = 0;
> >>>>> state->closid = 0;
> >>>>> + state->next_rmid = 0;
> >>>>> + state->next_closid = 0;
> >>>>>
> >>>>> /* XXX: lock */
> >>>>> /* XXX: Make sure this case is handled when hotplug happens. */
> >>>>> @@ -3152,6 +3140,12 @@ static int __init intel_cqm_init(void)
> >>>>> pr_info("Intel CQM monitoring enabled with at least %u rmids per
> >>>>> package.\n",
> >>>>> min_max_rmid + 1);
> >>>>>
> >>>>> + /* Make sure pqr_common_enable_key is enabled after
> >>>>> + * cqm_initialized_key.
> >>>>> + */
> >>>>> + barrier();
> >>>>> +
> >>>>> + static_branch_enable(&pqr_common_enable_key);
> >>>>> return ret;
> >>>>>
> >>>>> error_init_mutex:
> >>>>> @@ -3163,4 +3157,41 @@ error:
> >>>>> return ret;
> >>>>> }
> >>>>>
> >>>>> +/* Schedule task without a CQM perf_event. */
> >>>>> +inline void __intel_cqm_no_event_sched_in(void)
> >>>>> +{
> >>>>> +#ifdef CONFIG_CGROUP_PERF
> >>>>> + struct monr *monr;
> >>>>> + struct pmonr *pmonr;
> >>>>> + union prmid_summary summary;
> >>>>> + u16 pkg_id = topology_physical_package_id(smp_processor_id());
> >>>>> + struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
> >>>>> +
> >>>>> + /* Assume CQM enabled is likely given that PQR is enabled. */
> >>>>> + if (!static_branch_likely(&cqm_initialized_key))
> >>>>> + return;
> >>>>> +
> >>>>> + /* Safe to call from_task since we are in scheduler lock. */
> >>>>> + monr = monr_from_perf_cgroup(perf_cgroup_from_task(current,
> >>>>> NULL));
> >>>>> + pmonr = monr->pmonrs[pkg_id];
> >>>>> +
> >>>>> + /* Utilize most up to date pmonr summary. */
> >>>>> + monr_hrchy_get_next_prmid_summary(pmonr);
> >>>>> + summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
> >>>>> +
> >>>>> + if (!prmid_summary__is_mon_active(summary))
> >>>>> + goto no_rmid;
> >>>>> +
> >>>>> + if (WARN_ON_ONCE(!__valid_rmid(pkg_id, summary.sched_rmid)))
> >>>>> + goto no_rmid;
> >>>>> +
> >>>>> + pqr_cache_update_rmid(summary.sched_rmid,
> >>>>> PQR_RMID_MODE_NOEVENT);
> >>>>> + return;
> >>>>> +
> >>>>> +no_rmid:
> >>>>> + summary.value =
> >>>>> atomic64_read(&root_pmonr->prmid_summary_atomic);
> >>>>> + pqr_cache_update_rmid(summary.sched_rmid,
> >>>>> PQR_RMID_MODE_NOEVENT);
> >>>>> +#endif
> >>>>> +}
> >>>>> +
> >>>>> device_initcall(intel_cqm_init);
> >>>>> diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
> >>>>> index 0f3da94..e1f8bd0 100644
> >>>>> --- a/arch/x86/events/intel/cqm.h
> >>>>> +++ b/arch/x86/events/intel/cqm.h
> >>>>> @@ -82,8 +82,6 @@ union prmid_summary {
> >>>>> };
> >>>>> };
> >>>>>
> >>>>> -# define INVALID_RMID (-1)
> >>>>> -
> >>>>> /* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or
> >>>>> INVALID_RMID
> >>>>> * depending on whether monitoring is active or not.
> >>>>> */
> >>>>> diff --git a/arch/x86/include/asm/pqr_common.h
> >>>>> b/arch/x86/include/asm/pqr_common.h
> >>>>> index f770637..abbb235 100644
> >>>>> --- a/arch/x86/include/asm/pqr_common.h
> >>>>> +++ b/arch/x86/include/asm/pqr_common.h
> >>>>> @@ -3,31 +3,72 @@
> >>>>>
> >>>>> #if defined(CONFIG_INTEL_RDT)
> >>>>>
> >>>>> +#include <linux/jump_label.h>
> >>>>> #include <linux/types.h>
> >>>>> #include <asm/percpu.h>
> >>>>> +#include <asm/msr.h>
> >>>>>
> >>>>> #define MSR_IA32_PQR_ASSOC 0x0c8f
> >>>>> +#define INVALID_RMID (-1)
> >>>>> +#define INVALID_CLOSID (-1)
> >>>>> +
> >>>>> +
> >>>>> +extern struct static_key_false pqr_common_enable_key;
> >>>>> +
> >>>>> +enum intel_pqr_rmid_mode {
> >>>>> + /* RMID has no perf_event associated. */
> >>>>> + PQR_RMID_MODE_NOEVENT = 0,
> >>>>> + /* RMID has a perf_event associated. */
> >>>>> + PQR_RMID_MODE_EVENT
> >>>>> +};
> >>>>>
> >>>>> /**
> >>>>> * struct intel_pqr_state - State cache for the PQR MSR
> >>>>> - * @rmid: The cached Resource Monitoring ID
> >>>>> - * @closid: The cached Class Of Service ID
> >>>>> + * @rmid: Last rmid written to hw.
> >>>>> + * @next_rmid: Next rmid to write to hw.
> >>>>> + * @next_rmid_mode: Next rmid's mode.
> >>>>> + * @closid: The current Class Of Service ID
> >>>>> + * @next_closid: The Class Of Service ID to use.
> >>>>> *
> >>>>> * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
> >>>>> * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
> >>>>> * contains both parts, so we need to cache them.
> >>>>> *
> >>>>> - * The cache also helps to avoid pointless updates if the value does
> >>>>> - * not change.
> >>>>> + * The cache also helps to avoid pointless updates if the value does
> >>>>> not
> >>>>> + * change. It also keeps track of the type of RMID set (event vs no
> >>>>> event)
> >>>>> + * used to determine when a cgroup RMID is required.
> >>>>> */
> >>>>> struct intel_pqr_state {
> >>>>> - u32 rmid;
> >>>>> - u32 closid;
> >>>>> + u32 rmid;
> >>>>> + u32 next_rmid;
> >>>>> + enum intel_pqr_rmid_mode next_rmid_mode;
> >>>>> + u32 closid;
> >>>>> + u32 next_closid;
> >>>>> };
> >>>>>
> >>>>> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
> >>>>>
> >>>>> #define PQR_MAX_NR_PKGS 8
> >>>>>
> >>>>> +void __pqr_update(void);
> >>>>> +
> >>>>> +inline void __intel_cqm_no_event_sched_in(void);
> >>>>> +
> >>>>> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode);
> >>>>> +
> >>>>> +inline void pqr_cache_update_closid(u32 closid);
> >>>>> +
> >>>>> +static inline void pqr_update(void)
> >>>>> +{
> >>>>> + if (static_branch_unlikely(&pqr_common_enable_key))
> >>>>> + __pqr_update();
> >>>>> +}
> >>>>> +
> >>>>> +#else
> >>>>> +
> >>>>> +static inline void pqr_update(void)
> >>>>> +{
> >>>>> +}
> >>>>> +
> >>>>> #endif
> >>>>> #endif
> >>>>> diff --git a/arch/x86/kernel/cpu/pqr_common.c b/arch/x86/kernel/cpu/pqr_common.c
> >>>>> index 9eff5d9..d91c127 100644
> >>>>> --- a/arch/x86/kernel/cpu/pqr_common.c
> >>>>> +++ b/arch/x86/kernel/cpu/pqr_common.c
> >>>>> @@ -1,9 +1,43 @@
> >>>>> #include <asm/pqr_common.h>
> >>>>>
> >>>>> -/*
> >>>>> - * The cached intel_pqr_state is strictly per CPU and can never be
> >>>>> - * updated from a remote CPU. Both functions which modify the state
> >>>>> - * (intel_cqm_event_start and intel_cqm_event_stop) are called with
> >>>>> - * interrupts disabled, which is sufficient for the protection.
> >>>>> - */
> >>>>> DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> >>>>> +
> >>>>> +DEFINE_STATIC_KEY_FALSE(pqr_common_enable_key);
> >>>>> +
> >>>>> +inline void pqr_cache_update_rmid(u32 rmid, enum intel_pqr_rmid_mode mode)
> >>>>> +{
> >>>>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> >>>>> +
> >>>>> + state->next_rmid_mode = mode;
> >>>>> + state->next_rmid = rmid;
> >>>>> +}
> >>>>> +
> >>>>> +inline void pqr_cache_update_closid(u32 closid)
> >>>>> +{
> >>>>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> >>>>> +
> >>>>> + state->next_closid = closid;
> >>>>> +}
> >>>>> +
> >>>>> +/* Update hw's RMID using cgroup's if perf_event did not.
> >>>>> + * Sync pqr cache with MSR.
> >>>>> + */
> >>>>> +inline void __pqr_update(void)
> >>>>> +{
> >>>>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> >>>>> +
> >>>>> + /* If perf_event has set a next_rmid that is used, do not try
> >>>>> + * to obtain another one from current task.
> >>>>> + */
> >>>>> + if (state->next_rmid_mode == PQR_RMID_MODE_NOEVENT)
> >>>>> + __intel_cqm_no_event_sched_in();
> >>>>
> >>>>
> >>>>
> >>>> If perf_cgroup is not defined then the state is not updated here, so
> >>>> state->rmid might still hold a stale RMID?
> >>>>
> >>>> 1. event1 for PID1 has RMID1,
> >>>> 2. perf sched_in - state->next_rmid = rmid1
> >>>> 3. pqr_update - state->rmid = rmid1
> >>>> 4. sched_out - write PQR_NOEVENT -
> >>>> 5. next switch_to - state->rmid is not reset, nothing changes (when there is
> >>>> no perf_cgroup)?
> >>
> >>
> >> Between 4 and 5, say event1 is dead. Basically, on the next context switch (#5),
> >> if perf sched_in wasn't called the PQR still has RMID1.
> >>
> >> When you have perf_cgroup you call __intel_cqm_no_event.., which then checks
> >> whether there is continuous monitoring or sets next_rmid to 0 - but all that
> >> code is inside #ifdef CONFIG_CGROUP_PERF, so without cgroups it's never reset
> >> to zero?
> >>
> >>
> >>
> >>>>
> >>>>
> >>>>> +
> >>>>> + /* __intel_cqm_no_event_sched_in might have changed next_rmid. */
> >>>>> + if (state->rmid == state->next_rmid &&
> >>>>> + state->closid == state->next_closid)
> >>>>> + return;
> >>>>> +
> >>>>> + state->rmid = state->next_rmid;
> >>>>> + state->closid = state->next_closid;
> >>>>> + wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
> >>>>> +}
> >>>>> --
> >>>>> 2.8.0.rc3.226.g39d4020
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
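
To make the concern above about builds without CONFIG_CGROUP_PERF concrete,
here is a condensed sketch (not part of the patch) of what the switch-in path
reduces to in that configuration, assuming the helpers behave as in the quoted
hunks. The name __pqr_update_without_cgroups is made up for illustration:

/* With CONFIG_CGROUP_PERF=n the whole body of the helper is compiled out,
 * so nothing refreshes next_rmid from a cgroup at switch time.
 */
inline void __intel_cqm_no_event_sched_in(void)
{
}

/* Illustration only: what __pqr_update() effectively does on such a build.
 * It merely syncs whatever next_rmid/next_closid were last cached by
 * intel_cqm_event_start/stop (or another pqr_cache_update_* caller).
 */
static void __pqr_update_without_cgroups(void)
{
	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

	if (state->rmid == state->next_rmid &&
	    state->closid == state->next_closid)
		return;

	state->rmid = state->next_rmid;
	state->closid = state->next_closid;
	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
}

Whether the value that intel_cqm_event_stop leaves in next_rmid is always a
safe one is exactly what the exchange below is about.
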
On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
> When an event is terminated, intel_cqm_event_stop calls
> pqr_cache_update_rmid and sets state->next_rmid to the rmid of its
> parent in the RMID hierarchy. That makes the next call to
> __pqr_update update PQR_ASSOC.
What about all the cases like a context switch to a pid2 which is not monitored
(assume event1 still exists)?
>
> On Fri, Apr 29, 2016 at 2:32 PM Vikas Shivappa
> <[email protected]> wrote:
>>
>>
>>
>> On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
>>
>>> if __intel_cqm_no_event_sched_in does nothing, the PQR_ASSOC MSR is
>>> still updated if state->rmid != state->next_rmid in __pqr_update,
>>
>> But due to 2 and 3 below they are equal?
>>
>>
>>> even if next_rmid_mode == PQR_RMID_MODE_NOEVENT.
>>>
>>> On Fri, Apr 29, 2016 at 2:01 PM, Vikas Shivappa
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
>>>>
>>>>> Not sure I see the problem you point out here. In step 3, PQR_ASSOC is
>>>>> updated with RMID1; __pqr_update is the one called from the scheduler
>>>>> hook, right after perf sched_in.
In all those cases perf sched_out will call intel_cqm_event_stop, which
calls pqr_cache_update_rmid and sets state->next_rmid to 0. RMID 0
corresponds to the root monr in the RMID hierarchy.
On Fri, Apr 29, 2016 at 4:49 PM Vikas Shivappa
<[email protected]> wrote:
>
>
>
> On Fri, 29 Apr 2016, David Carrillo-Cisneros wrote:
>
> > When an event is terminated, intel_cqm_event_stop calls
> > pqr_cache_update_rmid and sets state->next_rmid to the rmid of its
> > parent in the RMID hierarchy. That makes the next call to
> > __pqr_update update PQR_ASSOC.
>
> What about all the cases like a context switch to a pid2 which is not monitored
> (assume event1 still exists)?
>
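
To illustrate the sequence described in the reply above, here is a small,
purely user-space model of the per-CPU PQR caching (a sketch, not kernel
code): it loosely borrows the names from the patch, replaces wrmsr() with a
printf, ignores the cgroup path, and uses made-up RMID numbers (1 for event1's
RMID1, 0 for the root monr).

#include <stdio.h>

enum rmid_mode { MODE_NOEVENT = 0, MODE_EVENT };

/* Models one CPU's struct intel_pqr_state. */
static struct {
	unsigned int rmid, next_rmid;
	unsigned int closid, next_closid;
	enum rmid_mode next_rmid_mode;
} state;

/* Models pqr_cache_update_rmid(): only caches the value for the next switch. */
static void cache_update_rmid(unsigned int rmid, enum rmid_mode mode)
{
	state.next_rmid_mode = mode;
	state.next_rmid = rmid;
}

/* Models __pqr_update() on a build without cgroup monitoring: write the MSR
 * only when the cached next values differ from what hardware already has.
 */
static void pqr_update(void)
{
	if (state.rmid == state.next_rmid && state.closid == state.next_closid)
		return;
	state.rmid = state.next_rmid;
	state.closid = state.next_closid;
	printf("wrmsr(PQR_ASSOC, rmid=%u, closid=%u)\n", state.rmid, state.closid);
}

int main(void)
{
	cache_update_rmid(1, MODE_EVENT);	/* 2. perf sched_in of event1 */
	pqr_update();				/* 3. switch finishes: MSR <- RMID1 */

	cache_update_rmid(0, MODE_NOEVENT);	/* 4. sched_out: intel_cqm_event_stop */
	pqr_update();				/* 5. switch to unmonitored pid2: MSR <- 0 */
	return 0;
}

Running it prints a write of RMID 1 at step 3 and a write of RMID 0 at step 5,
i.e. the stale RMID1 does not survive the switch to the unmonitored pid2 even
when no cgroup is being monitored.
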