2016-12-16 23:13:25

by Shivappa Vikas

Subject: [PATCH V4 00/14] Cqm2: Intel Cache Monitoring fixes and enhancements

Another attempt for cqm2 series-

The current upstream cqm (cache monitoring) code has major issues which
make the feature almost unusable. This series tries to fix them and also
addresses Thomas' comments on previous versions of the cqm2 patch series
by better documenting what we are trying to fix. The patches are based on
tip/x86/cache.

This is a continuation of the patch series David ([email protected])
previously posted and hence tries to fix the same issues.

Below are the issues and the fixes/enhancements we attempt-

- Issue: RMID recycling leads to inaccurate data, complicates the code
and increases the code footprint. Currently it almost makes the feature
*unusable* as we only see zeroes and inconsistent data once we run out
of RMIDs in the lifetime of a system boot. The only way to get correct
numbers is to reboot the system once we have run out of RMIDs.

Root cause: Recycling steals an RMID from an existing event x and gives
it to another event y. However, due to the nature of monitoring
llc_occupancy we may miss tracking an unknown (possibly large) part of
the cache fills during the time when an event does not have an RMID.
Hence the user ends up with inaccurate data for both events x and y, and
the inaccuracy is arbitrary and cannot be measured. Even if an event x
gets another RMID very soon after losing the previous one, we still miss
all the occupancy data that was tied to the previous RMID, which means
we cannot get accurate data even when the event has an RMID for most of
the time. There is no way to guarantee accurate results with recycling,
and the data is inaccurate by an arbitrary degree. The fact that an
event can lose an RMID at any time complicates a lot of code in
sched_in, init, count and read. It also complicates MBM, as we may lose
the RMID at any time and hence need to keep a history of all the old
counts.

Fix: Recycling is removed, based on Tony's original observation that it
introduces a lot of code, fails to provide accurate data and hence has
questionable benefits. Recycling was introduced to deal with scarce
RMIDs. We instead support the below to mitigate the scarce RMID issue
and also provide reliable output to the user:
-We ran out of RMIDs quickly because only global RMIDs were supported.
This series supports per-package RMIDs and dynamic RMID assignment only
when tasks are actually scheduled on a socket, to mitigate the scarcity
of RMIDs. Since the number of available RMIDs also increases by a factor
of x, where x is the number of packages, the issue is greatly minimized
given that we have 2-4 RMIDs per logical processor on each package.
-The user chooses the packages he wants to monitor and we throw an error
if that many RMIDs are not available. This can be used when the user
wants guaranteed monitoring.
-The user can also choose lazy RMID allocation, in which case an error
is thrown at read time. This may be better, as the user then does not
have events which he thinks are being monitored but which are actually
not being monitored 100% of the time.
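
A rough sketch of the two modes from the user's point of view
(illustrative; the file names follow the documentation patch later in
this series). NOLAZY - reserve RMIDs on packages 0-1 at open, and fail
the open if they are not available:

#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
#perf stat -I 1000 -e intel_cqm/llc_occupancy/ -a -G p1

LAZY (the default) - RMIDs are allocated at the first sched_in, and the
read reports an error if no RMID could be obtained:

#perf stat -I 1000 -e intel_cqm/llc_occupancy/ -p PID1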

- Issue: Inaccurate per-package and systemwide data. It just prints
zeros or arbitrary numbers.
Fix: The patches fix this by throwing an error if the mode is not
supported. The supported modes are task monitoring and cgroup
monitoring. The per-package data for, say, socket x is returned with the
-C <cpu on socket x> -G cgrpy option. The systemwide data can be looked
up by monitoring the root cgroup.
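
For example (illustrative), the per-socket data for cgroup p1 on socket
0 can be read by restricting the event to a CPU on that socket:

#perf stat -I 1000 -e intel_cqm/llc_occupancy/ -a -C 0 -G p1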

- Support per-package RMIDs, hence scale better with more packages, and
get more RMIDs to use, allocating them only when needed (i.e. when tasks
are actually scheduled on the package).

- Issue: Cgroup monitoring is incomplete. There is no hierarchical
monitoring support, and inconsistent or wrong data is seen when
monitoring a cgroup.

Fix: Full cgroup monitoring support is added. Different cgroups in the
same hierarchy can be monitored together and separately, and a task can
be monitored together with the cgroup it belongs to.

- Issue: A lot of inconsistent data is currently seen when we monitor
different kinds of events, like cgroup and task events, *together*.

Fix: The patches add support to monitor a cgroup x and a task p1 within
cgroup x, and also to monitor different cgroups and tasks together.

- Monitoring a task for its lifetime is not supported. The patches add
support to continuously monitor a cgroup even when perf is not running.
This provides lightweight, long-term/~lifetime monitoring.
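
For example, using the cgroup interface added later in this series:

#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring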

- Issue: CAT and CQM write the same PQR_ASSOC MSR separately.
Fix: Integrate the sched_in code and write the PQR MSR only once per
switch_to.
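
A minimal sketch of the combined write (as in the patches below, the
RMID goes in the low word of the MSR and the CLOSID in the high word):

	/* one PQR_ASSOC update per context switch, carrying both fields;
	 * state is the per-cpu pqr_state */
	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);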

What's working now (unit tested):
Task monitoring, cgroup hierarchical monitoring, monitoring multiple
cgroups, cgroup and task in the same cgroup, continuous cgroup
monitoring, per-package RMIDs, error on read, error on open.

TBD/Known issues:
- Most of MBM is working but will need updates for hierarchical
monitoring and the other new features we introduce.

[PATCH 02/14] x86/cqm: Remove cqm recycling/conflict handling

Before the patch: The user sees only zeros or wrong data once we run out
of RMIDs.
After: The user sees either correct data or an error that we have run
out of RMIDs.

[PATCH 03/14] x86/rdt: Add rdt common/cqm compile option
[PATCH 04/14] x86/cqm: Add Per pkg rmid support

Before patch: RMIDs are global.
Tests: Available RMIDs increase by a factor of x, where x is the number
of packages.
Adds LAZY RMID allocation - RMIDs are allocated at the first sched_in.

[PATCH 05/14] x86/cqm,perf/core: Cgroup support prepare
[PATCH 06/14] x86/cqm: Add cgroup hierarchical monitoring support
[PATCH 07/14] x86/rdt,cqm: Scheduling support update

Before patch: cgroup monitoring is not fully supported.
After: cgroup monitoring is fully supported, including hierarchical
monitoring.

[PATCH 08/14] x86/cqm: Add support for monitoring task and cgroup
[PATCH 09/14] x86/cqm: Add Continuous cgroup monitoring

Adds new features.

[PATCH 10/14] x86/cqm: Add RMID reuse
[PATCH 11/14] x86/cqm: Add failure on open and read

Before patch: Once an RMID is used, it is never used again.
After: We reuse the RMIDs that are freed. The user can specify NOLAZY
RMID allocation, and open fails if we fail to get all RMIDs at open.

[PATCH 12/14] perf/core,x86/cqm: Add read for Cgroup events,per pkg
[PATCH 13/14] perf/stat: fix bug in handling events in error state
[PATCH 14/14] perf/stat: revamp read error handling, snapshot and

Patches 1-10 add all the features, but the data is not visible to
perf/core nor to the perf user mode. Patches 11-14 fix this and make the
data available to the perf user mode.


2016-12-16 23:13:34

by Shivappa Vikas

Subject: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

Add documentation on the usage of cqm and mbm events, continuous
monitoring, and lazy and non-lazy monitoring.

Signed-off-by: Vikas Shivappa <[email protected]>
---
Documentation/x86/intel_rdt_mon_ui.txt | 91 ++++++++++++++++++++++++++++++++++
1 file changed, 91 insertions(+)
create mode 100644 Documentation/x86/intel_rdt_mon_ui.txt

diff --git a/Documentation/x86/intel_rdt_mon_ui.txt b/Documentation/x86/intel_rdt_mon_ui.txt
new file mode 100644
index 0000000..7d68a65
--- /dev/null
+++ b/Documentation/x86/intel_rdt_mon_ui.txt
@@ -0,0 +1,91 @@
+User Interface for Resource Monitoring in Intel Resource Director Technology
+
+Vikas Shivappa<[email protected]>
+David Carrillo-Cisneros<[email protected]>
+Stephane Eranian <[email protected]>
+
+This feature is enabled by the CONFIG_INTEL_RDT_M Kconfig option and the
+x86 /proc/cpuinfo flag bits cqm_llc, cqm_occup_llc, cqm_mbm_total and cqm_mbm_local.
+
+Resource Monitoring
+-------------------
+Resource Monitoring includes cqm (cache quality monitoring) and
+mbm (memory bandwidth monitoring) and uses the perf interface. A
+lightweight interface to enable monitoring without perf is also provided.
+
+CQM provides OS/VMM a way to monitor llc occupancy. It measures the
+amount of L3 cache fills per task or cgroup.
+
+MBM provides OS/VMM a way to monitor bandwidth from one level of cache
+to another. The current patches support L3 external bandwidth
+monitoring. It supports both 'local bandwidth' and 'total bandwidth'
+monitoring for the socket. Local bandwidth measures the amount of data
+sent through the memory controller on the socket and total b/w measures
+the total system bandwidth.
+
+To check the monitoring events enabled:
+
+# ./tools/perf/perf list | grep -i cqm
+intel_cqm/llc_occupancy/ [Kernel PMU event]
+intel_cqm/local_bytes/ [Kernel PMU event]
+intel_cqm/total_bytes/ [Kernel PMU event]
+
+Monitoring tasks and cgroups using perf
+---------------------------------------
+Monitoring tasks and cgroups is like using any other perf event.
+
+#perf stat -I 1000 -e intel_cqm/local_bytes/ -p PID1
+
+This will monitor the local_bytes event of PID1 and report once
+every 1000ms.
+
+#mkdir /sys/fs/cgroup/perf_event/p1
+#echo PID1 > /sys/fs/cgroup/perf_event/p1/tasks
+#echo PID2 > /sys/fs/cgroup/perf_event/p1/tasks
+
+#perf stat -I 1000 -e intel_cqm/llc_occupancy/ -a -G p1
+
+This will monitor the llc_occupancy event of the perf cgroup p1 in
+interval mode.
+
+Hierarchical monitoring should work just like other events: users can
+monitor a task within a cgroup and the cgroup together, or monitor
+different cgroups in the same hierarchy together.
+
+Continuous monitoring
+---------------------
+A new file, cqm_cont_monitoring, is added to the perf cgroup which
+enables cqm continuous monitoring. Enabling this field starts monitoring
+of the cgroup without perf being launched. This can be used for
+long-term, lightweight monitoring of tasks/cgroups.
+
+To enable continuous monitoring of cgroup p1:
+#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
+
+To disable continuous monitoring of cgroup p1:
+#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
+
+To read the counters at the end of monitoring, perf can be used.
+
+LAZY and NOLAZY Monitoring
+--------------------------
+LAZY:
+By default, when monitoring is enabled the RMIDs are not allocated
+immediately but are allocated lazily at the first sched_in.
+There are 2-4 RMIDs per logical processor on each package. So if a dual
+package system has 48 logical processors per package, there would be up
+to 192 RMIDs on each package, i.e. a total of 192x2 RMIDs.
+There is a possibility that RMIDs can run out, and in that case the read
+reports an error since there was no RMID available to monitor the
+event.
+
+NOLAZY:
+When the user wants guaranteed monitoring, he can set the 'monitoring
+mask', which specifies the packages he wants to monitor. The RMIDs are
+statically allocated at open, and failure is indicated if RMIDs are not
+available.
+
+To specify monitoring on package 0 and package 1:
+#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
+
+An error is returned if packages that are not online are specified.
--
1.9.1

2016-12-16 23:13:43

by Shivappa Vikas

Subject: [PATCH 02/14] x86/cqm: Remove cqm recycling/conflict handling

From: David Carrillo-Cisneros <[email protected]>

Until now we only supported global RMIDs and not per-package RMIDs,
hence the RMIDs did not scale and we ran out of them more easily. When
we ran out of RMIDs we did RMID recycling, which led to several issues,
some of which are listed below, and to complications which may outweigh
the benefits.

RMID recycling 'steals' an RMID from an event x that is being monitored
and gives it to an event y that is also being monitored.
- This does not guarantee that we get correct data for both events, as
there are always times when the events are not being monitored. Hence
incorrect data may be reported to the user. The extent of the error is
arbitrary and unknown, as we cannot measure how much of the occupancy we
missed.
- It complicated the usage of RMIDs, the reading of data and MBM counting
a lot, because when reading the counters we had to keep a history of
previous counts and also make sure that no one had stolen the RMID we
wanted to use. All this had a large code footprint.
- When a process changes RMID it no longer tracks the old cache fills it
had with the old RMID - so we do not even guarantee the data to be
correct for the time it is monitored, even if it had an RMID for almost
all of that time.
- There was inconsistent data with the current code due to the way RMIDs
were recycled: we ended up stealing RMIDs from monitored events before
using the freed RMIDs on the limbo list. This led to 0 counts as soon as
one ran out of RMIDs in the lifetime of a system boot, which almost
makes the feature unusable. Also, the state transitions were messy.

This patch removes RMID recycling and just throws an error when there
are no more RMIDs to monitor with. H/w provides ~4 RMIDs per logical
processor on each package, and if we use all of them, and use them only
when the tasks are scheduled on the packages, the problem of scarcity
can be mitigated.

The conflict handling is also removed, as it resulted in a lot of
inconsistent data when different kinds of events, like systemwide,
cgroup and task events, were monitored together.

David's original patch was modified by Vikas <[email protected]> to
remove only the recycling parts from the code and to edit the commit
message.

Tests: cqm should either give correct numbers for task events or throw
an error that it has run out of RMIDs.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 647 ++------------------------------------------
1 file changed, 30 insertions(+), 617 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 0c45cc8..badeaf4 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -4,6 +4,8 @@
* Based very, very heavily on work by Peter Zijlstra.
*/

+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/perf_event.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
@@ -18,6 +20,7 @@
* Guaranteed time in ms as per SDM where MBM counters will not overflow.
*/
#define MBM_CTR_OVERFLOW_TIME 1000
+#define RMID_DEFAULT_QUEUE_TIME 250

static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
@@ -91,15 +94,6 @@ struct sample {
#define QOS_MBM_TOTAL_EVENT_ID 0x02
#define QOS_MBM_LOCAL_EVENT_ID 0x03

-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
-
#define INVALID_RMID (-1)

/*
@@ -112,7 +106,7 @@ struct sample {
*/
static inline bool __rmid_valid(u32 rmid)
{
- if (!rmid || rmid == INVALID_RMID)
+ if (!rmid || rmid > cqm_max_rmid)
return false;

return true;
@@ -137,8 +131,7 @@ static u64 __rmid_read(u32 rmid)
}

enum rmid_recycle_state {
- RMID_YOUNG = 0,
- RMID_AVAILABLE,
+ RMID_AVAILABLE = 0,
RMID_DIRTY,
};

@@ -228,7 +221,7 @@ static void __put_rmid(u32 rmid)
entry = __rmid_entry(rmid);

entry->queue_time = jiffies;
- entry->state = RMID_YOUNG;
+ entry->state = RMID_DIRTY;

list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
}
@@ -281,10 +274,6 @@ static int intel_cqm_setup_rmid_cache(void)
entry = __rmid_entry(0);
list_del(&entry->list);

- mutex_lock(&cache_mutex);
- intel_cqm_rotation_rmid = __get_rmid();
- mutex_unlock(&cache_mutex);
-
return 0;

fail:
@@ -343,92 +332,6 @@ static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
}
#endif

-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- * PROHIBITS
- * system-wide -> cgroup and task
- * cgroup -> system-wide
- * -> task in cgroup
- * task -> system-wide
- * -> task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
- /*
- * We can have any number of cgroups but only one system-wide
- * event at a time.
- */
- if (a->cgrp && b->cgrp) {
- struct perf_cgroup *ac = a->cgrp;
- struct perf_cgroup *bc = b->cgrp;
-
- /*
- * This condition should have been caught in
- * __match_event() and we should be sharing an RMID.
- */
- WARN_ON_ONCE(ac == bc);
-
- if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
- cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
- return true;
-
- return false;
- }
-
- if (a->cgrp || b->cgrp) {
- struct perf_cgroup *ac, *bc;
-
- /*
- * cgroup and system-wide events are mutually exclusive
- */
- if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
- (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
- return true;
-
- /*
- * Ensure neither event is part of the other's cgroup
- */
- ac = event_to_cgroup(a);
- bc = event_to_cgroup(b);
- if (ac == bc)
- return true;
-
- /*
- * Must have cgroup and non-intersecting task events.
- */
- if (!ac || !bc)
- return false;
-
- /*
- * We have cgroup and task events, and the task belongs
- * to a cgroup. Check for for overlap.
- */
- if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
- cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
- return true;
-
- return false;
- }
-#endif
- /*
- * If one of them is not a task, same story as above with cgroups.
- */
- if (!(a->attach_state & PERF_ATTACH_TASK) ||
- !(b->attach_state & PERF_ATTACH_TASK))
- return true;
-
- /*
- * Must be non-overlapping.
- */
- return false;
-}
-
struct rmid_read {
u32 rmid;
u32 evt_type;
@@ -458,461 +361,14 @@ static void cqm_mask_call(struct rmid_read *rr)
}

/*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
- struct perf_event *event;
- struct list_head *head = &group->hw.cqm_group_entry;
- u32 old_rmid = group->hw.cqm_rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- /*
- * If our RMID is being deallocated, perform a read now.
- */
- if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
- struct rmid_read rr = {
- .rmid = old_rmid,
- .evt_type = group->attr.config,
- .value = ATOMIC64_INIT(0),
- };
-
- cqm_mask_call(&rr);
- local64_set(&group->count, atomic64_read(&rr.value));
- }
-
- raw_spin_lock_irq(&cache_lock);
-
- group->hw.cqm_rmid = rmid;
- list_for_each_entry(event, head, hw.cqm_group_entry)
- event->hw.cqm_rmid = rmid;
-
- raw_spin_unlock_irq(&cache_lock);
-
- /*
- * If the allocation is for mbm, init the mbm stats.
- * Need to check if each event in the group is mbm event
- * because there could be multiple type of events in the same group.
- */
- if (__rmid_valid(rmid)) {
- event = group;
- if (is_mbm_event(event->attr.config))
- init_mbm_sample(rmid, event->attr.config);
-
- list_for_each_entry(event, head, hw.cqm_group_entry) {
- if (is_mbm_event(event->attr.config))
- init_mbm_sample(rmid, event->attr.config);
- }
- }
-
- return old_rmid;
-}
-
-/*
- * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
- * cachelines are still tagged with RMIDs in limbo, we progressively
- * increment the threshold until we find an RMID in limbo with <=
- * __intel_cqm_threshold lines tagged. This is designed to mitigate the
- * problem where cachelines tagged with an RMID are not steadily being
- * evicted.
- *
- * On successful rotations we decrease the threshold back towards zero.
- *
* __intel_cqm_max_threshold provides an upper bound on the threshold,
* and is measured in bytes because it's exposed to userland.
*/
static unsigned int __intel_cqm_threshold;
static unsigned int __intel_cqm_max_threshold;

-/*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
- struct cqm_rmid_entry *entry;
-
- list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
- if (entry->state != RMID_AVAILABLE)
- break;
-
- if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
- entry->state = RMID_DIRTY;
- }
-}
-
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
- struct perf_event *leader, *event;
-
- lockdep_assert_held(&cache_mutex);
-
- leader = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
- event = leader;
-
- list_for_each_entry_continue(event, &cache_groups,
- hw.cqm_groups_entry) {
- if (__rmid_valid(event->hw.cqm_rmid))
- continue;
-
- if (__conflict_event(event, leader))
- continue;
-
- intel_cqm_xchg_rmid(event, rmid);
- return true;
- }
-
- return false;
-}
-
-/*
- * Initially use this constant for both the limbo queue time and the
- * rotation timer interval, pmu::hrtimer_interval_ms.
- *
- * They don't need to be the same, but the two are related since if you
- * rotate faster than you recycle RMIDs, you may run out of available
- * RMIDs.
- */
-#define RMID_DEFAULT_QUEUE_TIME 250 /* ms */
-
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @nr_available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @nr_available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
- struct cqm_rmid_entry *entry, *tmp;
-
- lockdep_assert_held(&cache_mutex);
-
- *available = 0;
- list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
- unsigned long min_queue_time;
- unsigned long now = jiffies;
-
- /*
- * We hold RMIDs placed into limbo for a minimum queue
- * time. Before the minimum queue time has elapsed we do
- * not recycle RMIDs.
- *
- * The reasoning is that until a sufficient time has
- * passed since we stopped using an RMID, any RMID
- * placed onto the limbo list will likely still have
- * data tagged in the cache, which means we'll probably
- * fail to recycle it anyway.
- *
- * We can save ourselves an expensive IPI by skipping
- * any RMIDs that have not been queued for the minimum
- * time.
- */
- min_queue_time = entry->queue_time +
- msecs_to_jiffies(__rmid_queue_time_ms);
-
- if (time_after(min_queue_time, now))
- break;
-
- entry->state = RMID_AVAILABLE;
- (*available)++;
- }
-
- /*
- * Fast return if none of the RMIDs on the limbo list have been
- * sitting on the queue for the minimum queue time.
- */
- if (!*available)
- return false;
-
- /*
- * Test whether an RMID is free for each package.
- */
- on_each_cpu_mask(&cqm_cpumask, intel_cqm_stable, NULL, true);
-
- list_for_each_entry_safe(entry, tmp, &cqm_rmid_limbo_lru, list) {
- /*
- * Exhausted all RMIDs that have waited min queue time.
- */
- if (entry->state == RMID_YOUNG)
- break;
-
- if (entry->state == RMID_DIRTY)
- continue;
-
- list_del(&entry->list); /* remove from limbo */
-
- /*
- * The rotation RMID gets priority if it's
- * currently invalid. In which case, skip adding
- * the RMID to the the free lru.
- */
- if (!__rmid_valid(intel_cqm_rotation_rmid)) {
- intel_cqm_rotation_rmid = entry->rmid;
- continue;
- }
-
- /*
- * If we have groups waiting for RMIDs, hand
- * them one now provided they don't conflict.
- */
- if (intel_cqm_sched_in_event(entry->rmid))
- continue;
-
- /*
- * Otherwise place it onto the free list.
- */
- list_add_tail(&entry->list, &cqm_rmid_free_lru);
- }
-
-
- return __rmid_valid(intel_cqm_rotation_rmid);
-}
-
-/*
- * Pick a victim group and move it to the tail of the group list.
- * @next: The first group without an RMID
- */
-static void __intel_cqm_pick_and_rotate(struct perf_event *next)
-{
- struct perf_event *rotor;
- u32 rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- rotor = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
-
- /*
- * The group at the front of the list should always have a valid
- * RMID. If it doesn't then no groups have RMIDs assigned and we
- * don't need to rotate the list.
- */
- if (next == rotor)
- return;
-
- rmid = intel_cqm_xchg_rmid(rotor, INVALID_RMID);
- __put_rmid(rmid);
-
- list_rotate_left(&cache_groups);
-}
-
-/*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
- struct perf_event *group, *g;
- u32 rmid;
-
- lockdep_assert_held(&cache_mutex);
-
- list_for_each_entry_safe(group, g, &cache_groups, hw.cqm_groups_entry) {
- if (group == event)
- continue;
-
- rmid = group->hw.cqm_rmid;
-
- /*
- * Skip events that don't have a valid RMID.
- */
- if (!__rmid_valid(rmid))
- continue;
-
- /*
- * No conflict? No problem! Leave the event alone.
- */
- if (!__conflict_event(group, event))
- continue;
-
- intel_cqm_xchg_rmid(group, INVALID_RMID);
- __put_rmid(rmid);
- }
-}
-
-/*
- * Attempt to rotate the groups and assign new RMIDs.
- *
- * We rotate for two reasons,
- * 1. To handle the scheduling of conflicting events
- * 2. To recycle RMIDs
- *
- * Rotating RMIDs is complicated because the hardware doesn't give us
- * any clues.
- *
- * There's problems with the hardware interface; when you change the
- * task:RMID map cachelines retain their 'old' tags, giving a skewed
- * picture. In order to work around this, we must always keep one free
- * RMID - intel_cqm_rotation_rmid.
- *
- * Rotation works by taking away an RMID from a group (the old RMID),
- * and assigning the free RMID to another group (the new RMID). We must
- * then wait for the old RMID to not be used (no cachelines tagged).
- * This ensure that all cachelines are tagged with 'active' RMIDs. At
- * this point we can start reading values for the new RMID and treat the
- * old RMID as the free RMID for the next rotation.
- *
- * Return %true or %false depending on whether we did any rotating.
- */
-static bool __intel_cqm_rmid_rotate(void)
-{
- struct perf_event *group, *start = NULL;
- unsigned int threshold_limit;
- unsigned int nr_needed = 0;
- unsigned int nr_available;
- bool rotated = false;
-
- mutex_lock(&cache_mutex);
-
-again:
- /*
- * Fast path through this function if there are no groups and no
- * RMIDs that need cleaning.
- */
- if (list_empty(&cache_groups) && list_empty(&cqm_rmid_limbo_lru))
- goto out;
-
- list_for_each_entry(group, &cache_groups, hw.cqm_groups_entry) {
- if (!__rmid_valid(group->hw.cqm_rmid)) {
- if (!start)
- start = group;
- nr_needed++;
- }
- }
-
- /*
- * We have some event groups, but they all have RMIDs assigned
- * and no RMIDs need cleaning.
- */
- if (!nr_needed && list_empty(&cqm_rmid_limbo_lru))
- goto out;
-
- if (!nr_needed)
- goto stabilize;
-
- /*
- * We have more event groups without RMIDs than available RMIDs,
- * or we have event groups that conflict with the ones currently
- * scheduled.
- *
- * We force deallocate the rmid of the group at the head of
- * cache_groups. The first event group without an RMID then gets
- * assigned intel_cqm_rotation_rmid. This ensures we always make
- * forward progress.
- *
- * Rotate the cache_groups list so the previous head is now the
- * tail.
- */
- __intel_cqm_pick_and_rotate(start);
-
- /*
- * If the rotation is going to succeed, reduce the threshold so
- * that we don't needlessly reuse dirty RMIDs.
- */
- if (__rmid_valid(intel_cqm_rotation_rmid)) {
- intel_cqm_xchg_rmid(start, intel_cqm_rotation_rmid);
- intel_cqm_rotation_rmid = __get_rmid();
-
- intel_cqm_sched_out_conflicting_events(start);
-
- if (__intel_cqm_threshold)
- __intel_cqm_threshold--;
- }
-
- rotated = true;
-
-stabilize:
- /*
- * We now need to stablize the RMID we freed above (if any) to
- * ensure that the next time we rotate we have an RMID with zero
- * occupancy value.
- *
- * Alternatively, if we didn't need to perform any rotation,
- * we'll have a bunch of RMIDs in limbo that need stabilizing.
- */
- threshold_limit = __intel_cqm_max_threshold / cqm_l3_scale;
-
- while (intel_cqm_rmid_stabilize(&nr_available) &&
- __intel_cqm_threshold < threshold_limit) {
- unsigned int steal_limit;
-
- /*
- * Don't spin if nobody is actively waiting for an RMID,
- * the rotation worker will be kicked as soon as an
- * event needs an RMID anyway.
- */
- if (!nr_needed)
- break;
-
- /* Allow max 25% of RMIDs to be in limbo. */
- steal_limit = (cqm_max_rmid + 1) / 4;
-
- /*
- * We failed to stabilize any RMIDs so our rotation
- * logic is now stuck. In order to make forward progress
- * we have a few options:
- *
- * 1. rotate ("steal") another RMID
- * 2. increase the threshold
- * 3. do nothing
- *
- * We do both of 1. and 2. until we hit the steal limit.
- *
- * The steal limit prevents all RMIDs ending up on the
- * limbo list. This can happen if every RMID has a
- * non-zero occupancy above threshold_limit, and the
- * occupancy values aren't dropping fast enough.
- *
- * Note that there is prioritisation at work here - we'd
- * rather increase the number of RMIDs on the limbo list
- * than increase the threshold, because increasing the
- * threshold skews the event data (because we reuse
- * dirty RMIDs) - threshold bumps are a last resort.
- */
- if (nr_available < steal_limit)
- goto again;
-
- __intel_cqm_threshold++;
- }
-
-out:
- mutex_unlock(&cache_mutex);
- return rotated;
-}
-
-static void intel_cqm_rmid_rotate(struct work_struct *work);
-
-static DECLARE_DELAYED_WORK(intel_cqm_rmid_work, intel_cqm_rmid_rotate);
-
static struct pmu intel_cqm_pmu;

-static void intel_cqm_rmid_rotate(struct work_struct *work)
-{
- unsigned long delay;
-
- __intel_cqm_rmid_rotate();
-
- delay = msecs_to_jiffies(intel_cqm_pmu.hrtimer_interval_ms);
- schedule_delayed_work(&intel_cqm_rmid_work, delay);
-}
-
static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
{
struct sample *mbm_current;
@@ -984,11 +440,10 @@ static void init_mbm_sample(u32 rmid, u32 evt_type)
*
* If we're part of a group, we use the group's RMID.
*/
-static void intel_cqm_setup_event(struct perf_event *event,
+static int intel_cqm_setup_event(struct perf_event *event,
struct perf_event **group)
{
struct perf_event *iter;
- bool conflict = false;
u32 rmid;

event->hw.is_group_event = false;
@@ -1001,26 +456,24 @@ static void intel_cqm_setup_event(struct perf_event *event,
*group = iter;
if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);
- return;
+ return 0;
}

- /*
- * We only care about conflicts for events that are
- * actually scheduled in (and hence have a valid RMID).
- */
- if (__conflict_event(iter, event) && __rmid_valid(rmid))
- conflict = true;
}

- if (conflict)
- rmid = INVALID_RMID;
- else
- rmid = __get_rmid();
+ rmid = __get_rmid();
+
+ if (!__rmid_valid(rmid)) {
+ pr_info("out of RMIDs\n");
+ return -EINVAL;
+ }

if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
init_mbm_sample(rmid, event->attr.config);

event->hw.cqm_rmid = rmid;
+
+ return 0;
}

static void intel_cqm_event_read(struct perf_event *event)
@@ -1166,7 +619,6 @@ static void mbm_hrtimer_init(void)

static u64 intel_cqm_event_count(struct perf_event *event)
{
- unsigned long flags;
struct rmid_read rr = {
.evt_type = event->attr.config,
.value = ATOMIC64_INIT(0),
@@ -1206,24 +658,11 @@ static u64 intel_cqm_event_count(struct perf_event *event)
* Notice that we don't perform the reading of an RMID
* atomically, because we can't hold a spin lock across the
* IPIs.
- *
- * Speculatively perform the read, since @event might be
- * assigned a different (possibly invalid) RMID while we're
- * busying performing the IPI calls. It's therefore necessary to
- * check @event's RMID afterwards, and if it has changed,
- * discard the result of the read.
*/
rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);
-
- if (!__rmid_valid(rr.rmid))
- goto out;
-
cqm_mask_call(&rr);
+ local64_set(&event->count, atomic64_read(&rr.value));

- raw_spin_lock_irqsave(&cache_lock, flags);
- if (event->hw.cqm_rmid == rr.rmid)
- local64_set(&event->count, atomic64_read(&rr.value));
- raw_spin_unlock_irqrestore(&cache_lock, flags);
out:
return __perf_event_count(event);
}
@@ -1238,34 +677,16 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)

event->hw.cqm_state &= ~PERF_HES_STOPPED;

- if (state->rmid_usecnt++) {
- if (!WARN_ON_ONCE(state->rmid != rmid))
- return;
- } else {
- WARN_ON_ONCE(state->rmid);
- }
-
state->rmid = rmid;
wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
}

static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
if (event->hw.cqm_state & PERF_HES_STOPPED)
return;

event->hw.cqm_state |= PERF_HES_STOPPED;
-
- intel_cqm_event_read(event);
-
- if (!--state->rmid_usecnt) {
- state->rmid = 0;
- wrmsr(MSR_IA32_PQR_ASSOC, 0, state->closid);
- } else {
- WARN_ON_ONCE(!state->rmid);
- }
}

static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -1342,8 +763,8 @@ static void intel_cqm_event_destroy(struct perf_event *event)
static int intel_cqm_event_init(struct perf_event *event)
{
struct perf_event *group = NULL;
- bool rotate = false;
unsigned long flags;
+ int ret = 0;

if (event->attr.type != intel_cqm_pmu.type)
return -ENOENT;
@@ -1373,46 +794,36 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

+ /* Will also set rmid, return error on RMID not being available*/
+ if (intel_cqm_setup_event(event, &group)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
/*
* Start the mbm overflow timers when the first event is created.
*/
if (mbm_enabled && list_empty(&cache_groups))
mbm_start_timers();

- /* Will also set rmid */
- intel_cqm_setup_event(event, &group);
-
/*
* Hold the cache_lock as mbm timer handlers be
* scanning the list of events.
*/
raw_spin_lock_irqsave(&cache_lock, flags);

- if (group) {
+ if (group)
list_add_tail(&event->hw.cqm_group_entry,
&group->hw.cqm_group_entry);
- } else {
+ else
list_add_tail(&event->hw.cqm_groups_entry,
&cache_groups);

- /*
- * All RMIDs are either in use or have recently been
- * used. Kick the rotation worker to clean/free some.
- *
- * We only do this for the group leader, rather than for
- * every event in a group to save on needless work.
- */
- if (!__rmid_valid(event->hw.cqm_rmid))
- rotate = true;
- }
-
raw_spin_unlock_irqrestore(&cache_lock, flags);
+out:
mutex_unlock(&cache_mutex);

- if (rotate)
- schedule_delayed_work(&intel_cqm_rmid_work, 0);
-
- return 0;
+ return ret;
}

EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
@@ -1706,6 +1117,8 @@ static int __init intel_cqm_init(void)
__intel_cqm_max_threshold =
boot_cpu_data.x86_cache_size * 1024 / (cqm_max_rmid + 1);

+ __intel_cqm_threshold = __intel_cqm_max_threshold / cqm_l3_scale;
+
snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
str = kstrdup(scale, GFP_KERNEL);
if (!str) {
--
1.9.1

2016-12-16 23:13:54

by Shivappa Vikas

Subject: [PATCH 04/14] x86/cqm: Add Per pkg rmid support

RMIDs are currently global, and this patch extends them to per-package
RMIDs. The h/w provides a set of RMIDs on each package, and the same
task can hence be associated with different RMIDs on each package.

The patch introduces a new cqm_pkgs_data structure to keep track of the
per-package free list, limbo list and other locking structures. The
corresponding rmid field in the perf_event is changed to hold an array
of u32 RMIDs instead of a single u32.

The RMIDs are not assigned at the time of event creation but are
assigned lazily at the first sched_in time for a task, so an RMID is
never allocated if a task is not scheduled on a package. This allows
better usage of RMIDs and scales with an increasing number of
sockets/packages.
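
A condensed sketch of the lazy per-package allocation path, mirroring
the code added further down in this patch (not a literal copy):

	/* event->hw.cqm_rmid is now an array with one slot per package */
	static void intel_cqm_event_start(struct perf_event *event, int mode)
	{
		struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

		/* allocate an RMID for this package at first sched_in only */
		alloc_needed_pkg_rmid(event->hw.cqm_rmid);

		state->rmid = event->hw.cqm_rmid[pkg_id];
		wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
	}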

Locking:
event list - perf init and terminate hold the mutex. The spin lock is
held to guard against the mbm hrtimer.
per pkg free and limbo lists - global spin lock. Used by get_rmid,
put_rmid, perf start and terminate.

Tests: Available RMIDs increase by a factor of x, where x is the number
of sockets, and the usage is dynamic so we save more.

The patch is based on David Carrillo-Cisneros' <[email protected]>
patches in the cqm2 series.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 340 ++++++++++++++++++++++++--------------------
arch/x86/events/intel/cqm.h | 37 +++++
include/linux/perf_event.h | 2 +-
3 files changed, 226 insertions(+), 153 deletions(-)
create mode 100644 arch/x86/events/intel/cqm.h

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index badeaf4..a0719af 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -11,6 +11,7 @@
#include <asm/cpu_device_id.h>
#include <asm/intel_rdt_common.h>
#include "../perf_event.h"
+#include "cqm.h"

#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
@@ -25,7 +26,7 @@
static u32 cqm_max_rmid = -1;
static unsigned int cqm_l3_scale; /* supposedly cacheline size */
static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
+unsigned int cqm_socket_max;

/*
* The cached intel_pqr_state is strictly per CPU and can never be
@@ -83,6 +84,8 @@ struct sample {
*/
static cpumask_t cqm_cpumask;

+struct pkg_data **cqm_pkgs_data;
+
#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)

@@ -142,50 +145,11 @@ struct cqm_rmid_entry {
unsigned long queue_time;
};

-/*
- * cqm_rmid_free_lru - A least recently used list of RMIDs.
- *
- * Oldest entry at the head, newest (most recently used) entry at the
- * tail. This list is never traversed, it's only used to keep track of
- * the lru order. That is, we only pick entries of the head or insert
- * them on the tail.
- *
- * All entries on the list are 'free', and their RMIDs are not currently
- * in use. To mark an RMID as in use, remove its entry from the lru
- * list.
- *
- *
- * cqm_rmid_limbo_lru - list of currently unused but (potentially) dirty RMIDs.
- *
- * This list is contains RMIDs that no one is currently using but that
- * may have a non-zero occupancy value associated with them. The
- * rotation worker moves RMIDs from the limbo list to the free list once
- * the occupancy value drops below __intel_cqm_threshold.
- *
- * Both lists are protected by cache_mutex.
- */
-static LIST_HEAD(cqm_rmid_free_lru);
-static LIST_HEAD(cqm_rmid_limbo_lru);
-
-/*
- * We use a simple array of pointers so that we can lookup a struct
- * cqm_rmid_entry in O(1). This alleviates the callers of __get_rmid()
- * and __put_rmid() from having to worry about dealing with struct
- * cqm_rmid_entry - they just deal with rmids, i.e. integers.
- *
- * Once this array is initialized it is read-only. No locks are required
- * to access it.
- *
- * All entries for all RMIDs can be looked up in the this array at all
- * times.
- */
-static struct cqm_rmid_entry **cqm_rmid_ptrs;
-
-static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
+static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
{
struct cqm_rmid_entry *entry;

- entry = cqm_rmid_ptrs[rmid];
+ entry = &cqm_pkgs_data[domain]->cqm_rmid_ptrs[rmid];
WARN_ON(entry->rmid != rmid);

return entry;
@@ -196,91 +160,56 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid)
*
* We expect to be called with cache_mutex held.
*/
-static u32 __get_rmid(void)
+static u32 __get_rmid(int domain)
{
+ struct list_head *cqm_flist;
struct cqm_rmid_entry *entry;

- lockdep_assert_held(&cache_mutex);
+ lockdep_assert_held(&cache_lock);

- if (list_empty(&cqm_rmid_free_lru))
+ cqm_flist = &cqm_pkgs_data[domain]->cqm_rmid_free_lru;
+
+ if (list_empty(cqm_flist))
return INVALID_RMID;

- entry = list_first_entry(&cqm_rmid_free_lru, struct cqm_rmid_entry, list);
+ entry = list_first_entry(cqm_flist, struct cqm_rmid_entry, list);
list_del(&entry->list);

return entry->rmid;
}

-static void __put_rmid(u32 rmid)
+static void __put_rmid(u32 rmid, int domain)
{
struct cqm_rmid_entry *entry;

- lockdep_assert_held(&cache_mutex);
+ lockdep_assert_held(&cache_lock);

- WARN_ON(!__rmid_valid(rmid));
- entry = __rmid_entry(rmid);
+ WARN_ON(!rmid);
+ entry = __rmid_entry(rmid, domain);

entry->queue_time = jiffies;
entry->state = RMID_DIRTY;

- list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
+ list_add_tail(&entry->list, &cqm_pkgs_data[domain]->cqm_rmid_limbo_lru);
}

static void cqm_cleanup(void)
{
int i;

- if (!cqm_rmid_ptrs)
+ if (!cqm_pkgs_data)
return;

- for (i = 0; i < cqm_max_rmid; i++)
- kfree(cqm_rmid_ptrs[i]);
-
- kfree(cqm_rmid_ptrs);
- cqm_rmid_ptrs = NULL;
- cqm_enabled = false;
-}
-
-static int intel_cqm_setup_rmid_cache(void)
-{
- struct cqm_rmid_entry *entry;
- unsigned int nr_rmids;
- int r = 0;
-
- nr_rmids = cqm_max_rmid + 1;
- cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
- nr_rmids, GFP_KERNEL);
- if (!cqm_rmid_ptrs)
- return -ENOMEM;
-
- for (; r <= cqm_max_rmid; r++) {
- struct cqm_rmid_entry *entry;
-
- entry = kmalloc(sizeof(*entry), GFP_KERNEL);
- if (!entry)
- goto fail;
-
- INIT_LIST_HEAD(&entry->list);
- entry->rmid = r;
- cqm_rmid_ptrs[r] = entry;
-
- list_add_tail(&entry->list, &cqm_rmid_free_lru);
+ for (i = 0; i < cqm_socket_max; i++) {
+ if (cqm_pkgs_data[i]) {
+ kfree(cqm_pkgs_data[i]->cqm_rmid_ptrs);
+ kfree(cqm_pkgs_data[i]);
+ }
}
-
- /*
- * RMID 0 is special and is always allocated. It's used for all
- * tasks that are not monitored.
- */
- entry = __rmid_entry(0);
- list_del(&entry->list);
-
- return 0;
-
-fail:
- cqm_cleanup();
- return -ENOMEM;
+ kfree(cqm_pkgs_data);
}

+
/*
* Determine if @a and @b measure the same set of tasks.
*
@@ -333,13 +262,13 @@ static inline struct perf_cgroup *event_to_cgroup(struct perf_event *event)
#endif

struct rmid_read {
- u32 rmid;
+ u32 *rmid;
u32 evt_type;
atomic64_t value;
};

static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
+static void init_mbm_sample(u32 *rmid, u32 evt_type);
static void __intel_mbm_event_count(void *info);

static bool is_cqm_event(int e)
@@ -420,10 +349,11 @@ static void __intel_mbm_event_init(void *info)
{
struct rmid_read *rr = info;

- update_sample(rr->rmid, rr->evt_type, 1);
+ if (__rmid_valid(rr->rmid[pkg_id]))
+ update_sample(rr->rmid[pkg_id], rr->evt_type, 1);
}

-static void init_mbm_sample(u32 rmid, u32 evt_type)
+static void init_mbm_sample(u32 *rmid, u32 evt_type)
{
struct rmid_read rr = {
.rmid = rmid,
@@ -444,7 +374,7 @@ static int intel_cqm_setup_event(struct perf_event *event,
struct perf_event **group)
{
struct perf_event *iter;
- u32 rmid;
+ u32 *rmid, sizet;

event->hw.is_group_event = false;
list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
@@ -454,24 +384,20 @@ static int intel_cqm_setup_event(struct perf_event *event,
/* All tasks in a group share an RMID */
event->hw.cqm_rmid = rmid;
*group = iter;
- if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
+ if (is_mbm_event(event->attr.config))
init_mbm_sample(rmid, event->attr.config);
return 0;
}
-
- }
-
- rmid = __get_rmid();
-
- if (!__rmid_valid(rmid)) {
- pr_info("out of RMIDs\n");
- return -EINVAL;
}

- if (is_mbm_event(event->attr.config) && __rmid_valid(rmid))
- init_mbm_sample(rmid, event->attr.config);
-
- event->hw.cqm_rmid = rmid;
+ /*
+ * RMIDs are allocated in LAZY mode by default only when
+ * tasks monitored are scheduled in.
+ */
+ sizet = sizeof(u32) * cqm_socket_max;
+ event->hw.cqm_rmid = kzalloc(sizet, GFP_KERNEL);
+ if (!event->hw.cqm_rmid)
+ return -ENOMEM;

return 0;
}
@@ -489,7 +415,7 @@ static void intel_cqm_event_read(struct perf_event *event)
return;

raw_spin_lock_irqsave(&cache_lock, flags);
- rmid = event->hw.cqm_rmid;
+ rmid = event->hw.cqm_rmid[pkg_id];

if (!__rmid_valid(rmid))
goto out;
@@ -515,12 +441,12 @@ static void __intel_cqm_event_count(void *info)
struct rmid_read *rr = info;
u64 val;

- val = __rmid_read(rr->rmid);
-
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
-
- atomic64_add(val, &rr->value);
+ if (__rmid_valid(rr->rmid[pkg_id])) {
+ val = __rmid_read(rr->rmid[pkg_id]);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+ atomic64_add(val, &rr->value);
+ }
}

static inline bool cqm_group_leader(struct perf_event *event)
@@ -533,10 +459,12 @@ static void __intel_mbm_event_count(void *info)
struct rmid_read *rr = info;
u64 val;

- val = rmid_read_mbm(rr->rmid, rr->evt_type);
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- return;
- atomic64_add(val, &rr->value);
+ if (__rmid_valid(rr->rmid[pkg_id])) {
+ val = rmid_read_mbm(rr->rmid[pkg_id], rr->evt_type);
+ if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
+ return;
+ atomic64_add(val, &rr->value);
+ }
}

static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
@@ -559,7 +487,7 @@ static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
}

list_for_each_entry(iter, &cache_groups, hw.cqm_groups_entry) {
- grp_rmid = iter->hw.cqm_rmid;
+ grp_rmid = iter->hw.cqm_rmid[pkg_id];
if (!__rmid_valid(grp_rmid))
continue;
if (is_mbm_event(iter->attr.config))
@@ -572,7 +500,7 @@ static enum hrtimer_restart mbm_hrtimer_handle(struct hrtimer *hrtimer)
if (!iter1->hw.is_group_event)
break;
if (is_mbm_event(iter1->attr.config))
- update_sample(iter1->hw.cqm_rmid,
+ update_sample(iter1->hw.cqm_rmid[pkg_id],
iter1->attr.config, 0);
}
}
@@ -610,7 +538,7 @@ static void mbm_hrtimer_init(void)
struct hrtimer *hr;
int i;

- for (i = 0; i < mbm_socket_max; i++) {
+ for (i = 0; i < cqm_socket_max; i++) {
hr = &mbm_timers[i];
hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hr->function = mbm_hrtimer_handle;
@@ -667,16 +595,39 @@ static u64 intel_cqm_event_count(struct perf_event *event)
return __perf_event_count(event);
}

+void alloc_needed_pkg_rmid(u32 *cqm_rmid)
+{
+ unsigned long flags;
+ u32 rmid;
+
+ if (WARN_ON(!cqm_rmid))
+ return;
+
+ if (cqm_rmid[pkg_id])
+ return;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ rmid = __get_rmid(pkg_id);
+ if (__rmid_valid(rmid))
+ cqm_rmid[pkg_id] = rmid;
+
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+}
+
static void intel_cqm_event_start(struct perf_event *event, int mode)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- u32 rmid = event->hw.cqm_rmid;
+ u32 rmid;

if (!(event->hw.cqm_state & PERF_HES_STOPPED))
return;

event->hw.cqm_state &= ~PERF_HES_STOPPED;

+ alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+
+ rmid = event->hw.cqm_rmid[pkg_id];
state->rmid = rmid;
wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
}
@@ -691,22 +642,27 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)

static int intel_cqm_event_add(struct perf_event *event, int mode)
{
- unsigned long flags;
- u32 rmid;
-
- raw_spin_lock_irqsave(&cache_lock, flags);
-
event->hw.cqm_state = PERF_HES_STOPPED;
- rmid = event->hw.cqm_rmid;

- if (__rmid_valid(rmid) && (mode & PERF_EF_START))
+ if ((mode & PERF_EF_START))
intel_cqm_event_start(event, mode);

- raw_spin_unlock_irqrestore(&cache_lock, flags);
-
return 0;
}

+static inline void
+ cqm_event_free_rmid(struct perf_event *event)
+{
+ u32 *rmid = event->hw.cqm_rmid;
+ int d;
+
+ for (d = 0; d < cqm_socket_max; d++) {
+ if (__rmid_valid(rmid[d]))
+ __put_rmid(rmid[d], d);
+ }
+ kfree(event->hw.cqm_rmid);
+ list_del(&event->hw.cqm_groups_entry);
+}
static void intel_cqm_event_destroy(struct perf_event *event)
{
struct perf_event *group_other = NULL;
@@ -737,16 +693,11 @@ static void intel_cqm_event_destroy(struct perf_event *event)
* If there was a group_other, make that leader, otherwise
* destroy the group and return the RMID.
*/
- if (group_other) {
+ if (group_other)
list_replace(&event->hw.cqm_groups_entry,
&group_other->hw.cqm_groups_entry);
- } else {
- u32 rmid = event->hw.cqm_rmid;
-
- if (__rmid_valid(rmid))
- __put_rmid(rmid);
- list_del(&event->hw.cqm_groups_entry);
- }
+ else
+ cqm_event_free_rmid(event);
}

raw_spin_unlock_irqrestore(&cache_lock, flags);
@@ -794,7 +745,7 @@ static int intel_cqm_event_init(struct perf_event *event)

mutex_lock(&cache_mutex);

- /* Will also set rmid, return error on RMID not being available*/
+ /* Delay allocating RMIDs */
if (intel_cqm_setup_event(event, &group)) {
ret = -EINVAL;
goto out;
@@ -1036,12 +987,95 @@ static void mbm_cleanup(void)
{}
};

+static int pkg_data_init_cpu(int cpu)
+{
+ struct cqm_rmid_entry *ccqm_rmid_ptrs = NULL, *entry = NULL;
+ int curr_pkgid = topology_physical_package_id(cpu);
+ struct pkg_data *pkg_data = NULL;
+ int i = 0, nr_rmids, ret = 0;
+
+ if (cqm_pkgs_data[curr_pkgid])
+ return 0;
+
+ pkg_data = kzalloc_node(sizeof(struct pkg_data),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!pkg_data)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&pkg_data->cqm_rmid_free_lru);
+ INIT_LIST_HEAD(&pkg_data->cqm_rmid_limbo_lru);
+
+ mutex_init(&pkg_data->pkg_data_mutex);
+ raw_spin_lock_init(&pkg_data->pkg_data_lock);
+
+ pkg_data->rmid_work_cpu = cpu;
+
+ nr_rmids = cqm_max_rmid + 1;
+ ccqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry) *
+ nr_rmids, GFP_KERNEL);
+ if (!ccqm_rmid_ptrs) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ for (; i <= cqm_max_rmid; i++) {
+ entry = &ccqm_rmid_ptrs[i];
+ INIT_LIST_HEAD(&entry->list);
+ entry->rmid = i;
+
+ list_add_tail(&entry->list, &pkg_data->cqm_rmid_free_lru);
+ }
+
+ pkg_data->cqm_rmid_ptrs = ccqm_rmid_ptrs;
+ cqm_pkgs_data[curr_pkgid] = pkg_data;
+
+ /*
+ * RMID 0 is special and is always allocated. It's used for all
+ * tasks that are not monitored.
+ */
+ entry = __rmid_entry(0, curr_pkgid);
+ list_del(&entry->list);
+
+ return 0;
+fail:
+ kfree(ccqm_rmid_ptrs);
+ ccqm_rmid_ptrs = NULL;
+ kfree(pkg_data);
+ pkg_data = NULL;
+ cqm_pkgs_data[curr_pkgid] = NULL;
+ return ret;
+}
+
+static int cqm_init_pkgs_data(void)
+{
+ int i, cpu, ret = 0;
+
+ cqm_pkgs_data = kzalloc(
+ sizeof(struct pkg_data *) * cqm_socket_max,
+ GFP_KERNEL);
+ if (!cqm_pkgs_data)
+ return -ENOMEM;
+
+ for (i = 0; i < cqm_socket_max; i++)
+ cqm_pkgs_data[i] = NULL;
+
+ for_each_online_cpu(cpu) {
+ ret = pkg_data_init_cpu(cpu);
+ if (ret)
+ goto fail;
+ }
+
+ return 0;
+fail:
+ cqm_cleanup();
+ return ret;
+}
+
static int intel_mbm_init(void)
{
int ret = 0, array_size, maxid = cqm_max_rmid + 1;

- mbm_socket_max = topology_max_packages();
- array_size = sizeof(struct sample) * maxid * mbm_socket_max;
+ array_size = sizeof(struct sample) * maxid * cqm_socket_max;
mbm_local = kmalloc(array_size, GFP_KERNEL);
if (!mbm_local)
return -ENOMEM;
@@ -1052,7 +1086,7 @@ static int intel_mbm_init(void)
goto out;
}

- array_size = sizeof(struct hrtimer) * mbm_socket_max;
+ array_size = sizeof(struct hrtimer) * cqm_socket_max;
mbm_timers = kmalloc(array_size, GFP_KERNEL);
if (!mbm_timers) {
ret = -ENOMEM;
@@ -1128,7 +1162,8 @@ static int __init intel_cqm_init(void)

event_attr_intel_cqm_llc_scale.event_str = str;

- ret = intel_cqm_setup_rmid_cache();
+ cqm_socket_max = topology_max_packages();
+ ret = cqm_init_pkgs_data();
if (ret)
goto out;

@@ -1171,6 +1206,7 @@ static int __init intel_cqm_init(void)
if (ret) {
kfree(str);
cqm_cleanup();
+ cqm_enabled = false;
mbm_cleanup();
}

diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
new file mode 100644
index 0000000..4415497
--- /dev/null
+++ b/arch/x86/events/intel/cqm.h
@@ -0,0 +1,37 @@
+#ifndef _ASM_X86_CQM_H
+#define _ASM_X86_CQM_H
+
+#ifdef CONFIG_INTEL_RDT_M
+
+#include <linux/perf_event.h>
+
+/**
+ * struct pkg_data - cqm per package(socket) meta data
+ * @cqm_rmid_free_lru A least recently used list of free RMIDs
+ * These RMIDs are guaranteed to have an occupancy less than the
+ * threshold occupancy
+ * @cqm_rmid_limbo_lru list of currently unused but (potentially)
+ * dirty RMIDs.
+ * This list contains RMIDs that no one is currently using but that
+ * may have an occupancy value > __intel_cqm_threshold. User can change
+ * the threshold occupancy value.
+ * @cqm_rmid_entry - The entry in the limbo and free lists.
+ * @delayed_work - Work to reuse the RMIDs that have been freed.
+ * @rmid_work_cpu - The cpu on the package on which work is scheduled.
+ */
+struct pkg_data {
+ struct list_head cqm_rmid_free_lru;
+ struct list_head cqm_rmid_limbo_lru;
+
+ struct cqm_rmid_entry *cqm_rmid_ptrs;
+
+ struct mutex pkg_data_mutex;
+ raw_spinlock_t pkg_data_lock;
+
+ struct delayed_work intel_cqm_rmid_work;
+ atomic_t reuse_scheduled;
+
+ int rmid_work_cpu;
+};
+#endif
+#endif
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecd..a8f4749 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -141,7 +141,7 @@ struct hw_perf_event {
};
struct { /* intel_cqm */
int cqm_state;
- u32 cqm_rmid;
+ u32 *cqm_rmid;
int is_group_event;
struct list_head cqm_events_entry;
struct list_head cqm_groups_entry;
--
1.9.1

2016-12-16 23:14:06

by Shivappa Vikas

Subject: [PATCH 03/14] x86/rdt: Add rdt common/cqm compile option

Add a compile option INTEL_RDT, which enables common code for all of
RDT (Resource Director Technology), and a specific INTEL_RDT_M, which
enables code for RDT monitoring. CQM (cache quality monitoring) and
MBM (memory b/w monitoring) are part of Intel RDT monitoring.
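
With INTEL_RDT_M selected, the resulting configuration fragment would
look like this (illustrative):

	CONFIG_INTEL_RDT=y
	CONFIG_INTEL_RDT_M=y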

Signed-off-by: Vikas Shivappa <[email protected]>

Conflicts:
arch/x86/Kconfig
---
arch/x86/Kconfig | 17 +++++++++++++++++
arch/x86/events/intel/Makefile | 3 ++-
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dcca4ec..67c01cf 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -407,11 +407,28 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH

+config INTEL_RDT
+ bool
+
+config INTEL_RDT_M
+ bool "Intel Resource Director Technology Monitoring support"
+ default n
+ depends on X86 && CPU_SUP_INTEL
+ select INTEL_RDT
+ help
+ Select to enable resource monitoring which is a sub-feature of
+ Intel Resource Director Technology(RDT). More information about
+ RDT can be found in the Intel x86 Architecture Software
+ Developer Manual.
+
+ Say N if unsure.
+
config INTEL_RDT_A
bool "Intel Resource Director Technology Allocation support"
default n
depends on X86 && CPU_SUP_INTEL
select KERNFS
+ select INTEL_RDT
help
Select to enable resource allocation which is a sub-feature of
Intel Resource Director Technology(RDT). More information about
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 06c2baa..2e002a5 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,5 @@
-obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o
+obj-$(CONFIG_INTEL_RDT_M) += cqm.o
obj-$(CONFIG_CPU_SUP_INTEL) += ds.o knc.o
obj-$(CONFIG_CPU_SUP_INTEL) += lbr.o p4.o p6.o pt.o
obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL) += intel-rapl-perf.o
--
1.9.1

2016-12-16 23:14:14

by Shivappa Vikas

Subject: [PATCH 10/14] x86/cqm: Add RMID reuse

When an RMID is freed by an event it cannot be reused immediately, as
the RMID may still have some cache occupancy. Hence, when an RMID is
freed it goes onto the limbo list and not the free list. This patch adds
support to periodically check the occupancy values of such RMIDs and to
move them to the free list once their occupancy drops below the
threshold occupancy value. The threshold occupancy value can be modified
by the user based on his requirements.
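
A condensed sketch of the limbo -> free transition implemented by the
worker added below (not a literal copy; 'pkg' stands for the current
package's cqm_pkgs_data entry):

	/* move RMIDs whose measured occupancy dropped below the threshold */
	list_for_each_entry_safe(entry, tmp, &pkg->cqm_rmid_limbo_lru, list) {
		if (__rmid_read(entry->rmid) < __intel_cqm_threshold) {
			list_del(&entry->list);
			list_add_tail(&entry->list, &pkg->cqm_rmid_free_lru);
		}
	}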

Tests: Before the patch, task monitoring would just throw an error once
the RMIDs were used up in the lifetime of a system boot.
After this patch, we are able to reuse the RMIDs that are freed.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 107 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 73f566a..85162aa 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -173,6 +173,13 @@ u32 __get_rmid(int domain)
return entry->rmid;
}

+static void cqm_schedule_rmidwork(int domain);
+
+static inline bool is_first_cqmwork(int domain)
+{
+ return (!atomic_cmpxchg(&cqm_pkgs_data[domain]->reuse_scheduled, 0, 1));
+}
+
static void __put_rmid(u32 rmid, int domain)
{
struct cqm_rmid_entry *entry;
@@ -293,6 +300,93 @@ static void cqm_mask_call(struct rmid_read *rr)
static unsigned int __intel_cqm_threshold;
static unsigned int __intel_cqm_max_threshold;

+/*
+ * Test whether an RMID has a zero occupancy value on this cpu.
+ */
+static void intel_cqm_stable(void)
+{
+ struct cqm_rmid_entry *entry;
+ struct list_head *llist;
+
+ llist = &cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru;
+ list_for_each_entry(entry, llist, list) {
+
+ if (__rmid_read(entry->rmid) < __intel_cqm_threshold)
+ entry->state = RMID_AVAILABLE;
+ }
+}
+
+static void __intel_cqm_rmid_reuse(void)
+{
+ struct cqm_rmid_entry *entry, *tmp;
+ struct list_head *llist, *flist;
+ struct pkg_data *pdata;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+ pdata = cqm_pkgs_data[pkg_id];
+ llist = &pdata->cqm_rmid_limbo_lru;
+ flist = &pdata->cqm_rmid_free_lru;
+
+ if (list_empty(llist))
+ goto end;
+ /*
+ * Test whether an RMID is free
+ */
+ intel_cqm_stable();
+
+ list_for_each_entry_safe(entry, tmp, llist, list) {
+
+ if (entry->state == RMID_DIRTY)
+ continue;
+ /*
+ * Otherwise remove from limbo and place it onto the free list.
+ */
+ list_del(&entry->list);
+ list_add_tail(&entry->list, flist);
+ }
+
+end:
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+}
+
+static bool reschedule_cqm_work(void)
+{
+ unsigned long flags;
+ bool nwork = false;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+
+ if (!list_empty(&cqm_pkgs_data[pkg_id]->cqm_rmid_limbo_lru))
+ nwork = true;
+ else
+ atomic_set(&cqm_pkgs_data[pkg_id]->reuse_scheduled, 0U);
+
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return nwork;
+}
+
+static void cqm_schedule_rmidwork(int domain)
+{
+ struct delayed_work *dwork;
+ unsigned long delay;
+
+ dwork = &cqm_pkgs_data[domain]->intel_cqm_rmid_work;
+ delay = msecs_to_jiffies(RMID_DEFAULT_QUEUE_TIME);
+
+ schedule_delayed_work_on(cqm_pkgs_data[domain]->rmid_work_cpu,
+ dwork, delay);
+}
+
+static void intel_cqm_rmid_reuse(struct work_struct *work)
+{
+ __intel_cqm_rmid_reuse();
+
+ if (reschedule_cqm_work())
+ cqm_schedule_rmidwork(pkg_id);
+}
+
static struct pmu intel_cqm_pmu;

static u64 update_sample(unsigned int rmid, u32 evt_type, int first)
@@ -540,7 +634,7 @@ static int intel_cqm_setup_event(struct perf_event *event,
}
#ifdef CONFIG_CGROUP_PERF
/*
- * For continously monitored cgroups, *rmid is allocated already.
+ * For continously monitored cgroups, rmid is allocated already.
*/
if (event->cgrp) {
cqm_info = cgrp_to_cqm_info(event->cgrp);
@@ -882,6 +976,7 @@ static void intel_cqm_event_terminate(struct perf_event *event)
{
struct perf_event *group_other = NULL;
unsigned long flags;
+ int d;

mutex_lock(&cache_mutex);
/*
@@ -924,6 +1019,13 @@ static void intel_cqm_event_terminate(struct perf_event *event)
mbm_stop_timers();

mutex_unlock(&cache_mutex);
+
+ for (d = 0; d < cqm_socket_max; d++) {
+
+ if (cqm_pkgs_data[d] != NULL && is_first_cqmwork(d)) {
+ cqm_schedule_rmidwork(d);
+ }
+ }
}

static int intel_cqm_event_init(struct perf_event *event)
@@ -1430,6 +1532,9 @@ static int pkg_data_init_cpu(int cpu)
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);

+ INIT_DEFERRABLE_WORK(
+ &pkg_data->intel_cqm_rmid_work, intel_cqm_rmid_reuse);
+
pkg_data->rmid_work_cpu = cpu;

nr_rmids = cqm_max_rmid + 1;
--
1.9.1

2016-12-16 23:14:24

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 11/14] x86/cqm: Add failure on open and read

To provide reliable output to the user, cqm throws an error when it does
not have enough RMIDs to monitor, depending upon the mode the user
chooses. This also takes care not to overuse RMIDs. The default is LAZY
mode.

NOLAZY mode: This patch adds a file, cqm_mon_mask, to the perf_cgroup
which indicates the packages for which the user wants guaranteed
monitoring. For such cgroup events RMIDs are assigned at event create
and we fail if enough RMIDs are not available. This is basically a
NOLAZY allocation of RMIDs. This mode can be used in real-time scenarios
where the user is sure that the monitored tasks will be scheduled.

LAZY mode: If the user did not enable NOLAZY mode, RMIDs are allocated
only when tasks are actually scheduled. A failure to obtain an RMID is
reported at read time. A typical use case for this mode is to start
monitoring cgroups which do not yet have any tasks in them, when such
cgroups are part of a large number of monitored cgroups; that way we do
not overuse RMIDs.
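
A usage sketch for the two modes (the cgroup v1 mount point and the
"perf_event." file name prefix are assumptions; the cqm_mon_mask file
and its package-list format come from the patch below):

  # NOLAZY: ask for guaranteed monitoring on packages 0 and 1 for this
  # cgroup; later event creation fails if those packages cannot supply
  # the RMIDs.
  echo 0-1 > /sys/fs/cgroup/perf_event/work/perf_event.cqm_mon_mask

  # LAZY (default): nothing to configure; an RMID shortage shows up as
  # an error when the event is read.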

Patch is based on David Carrillo-Cisneros <[email protected]> patches
in cqm2 series.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 145 +++++++++++++++++++++++++++++---
arch/x86/events/intel/cqm.h | 1 +
arch/x86/include/asm/intel_rdt_common.h | 7 +-
3 files changed, 141 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 85162aa..e0d4017 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -78,6 +78,11 @@ struct sample {
*/
static cpumask_t cqm_cpumask;

+/*
+ * Mask of online sockets.
+ */
+static cpumask_t cqm_pkgmask;
+
struct pkg_data **cqm_pkgs_data;
struct cgrp_cqm_info cqm_rootcginfo;

@@ -110,6 +115,14 @@ bool __rmid_valid(u32 rmid)
return true;
}

+static inline bool __rmid_valid_raw(u32 rmid)
+{
+ if (rmid > cqm_max_rmid)
+ return false;
+
+ return true;
+}
+
static u64 __rmid_read(u32 rmid)
{
u64 val;
@@ -159,16 +172,19 @@ u32 __get_rmid(int domain)
{
struct list_head *cqm_flist;
struct cqm_rmid_entry *entry;
+ struct pkg_data *pdata;

lockdep_assert_held(&cache_lock);

- cqm_flist = &cqm_pkgs_data[domain]->cqm_rmid_free_lru;
+ pdata = cqm_pkgs_data[domain];
+ cqm_flist = &pdata->cqm_rmid_free_lru;

if (list_empty(cqm_flist))
return INVALID_RMID;

entry = list_first_entry(cqm_flist, struct cqm_rmid_entry, list);
list_del(&entry->list);
+ pdata->rmid_used_count++;

return entry->rmid;
}
@@ -344,6 +360,7 @@ static void __intel_cqm_rmid_reuse(void)
*/
list_del(&entry->list);
list_add_tail(&entry->list, flist);
+ pdata->rmid_used_count--;
}

end:
@@ -607,6 +624,33 @@ static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
return 0;
}

+static inline int check_min_rmids(struct cgrp_cqm_info *cqm_info)
+{
+ int pkg = cpumask_first_and(&cqm_info->mon_mask, &cqm_pkgmask);
+
+ for (; pkg < nr_cpu_ids;
+ pkg = cpumask_next_and(pkg, &cqm_info->mon_mask, &cqm_pkgmask)) {
+ if (cqm_pkgs_data[pkg]->rmid_used_count >= cqm_max_rmid)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static inline void alloc_min_rmids(struct cgrp_cqm_info *cqm_info)
+{
+ int pkg = cpumask_first_and(&cqm_info->mon_mask, &cqm_pkgmask);
+ u32 rmid;
+
+ for ( ; pkg < nr_cpu_ids;
+ pkg = cpumask_next_and(pkg, &cqm_info->mon_mask, &cqm_pkgmask)) {
+
+ rmid = __get_rmid(pkg);
+ if (__rmid_valid(rmid))
+ cqm_info->rmid[pkg] = rmid;
+ }
+}
+
/*
* Find a group and setup RMID.
*
@@ -642,6 +686,14 @@ static int intel_cqm_setup_event(struct perf_event *event,
event->hw.cqm_rmid = cqm_info->rmid;
return 0;
}
+
+ /*
+ * For cgroups which must have RMIDs check if enough
+ * RMIDs are available.
+ */
+ if (cpumask_weight(&cqm_info->mon_mask) &&
+ check_min_rmids(cqm_info))
+ return -EINVAL;
}
#endif

@@ -656,6 +708,11 @@ static int intel_cqm_setup_event(struct perf_event *event,

cqm_assign_rmid(event, event->hw.cqm_rmid);

+#ifdef CONFIG_CGROUP_PERF
+ if (event->cgrp && cpumask_weight(&cqm_info->mon_mask))
+ alloc_min_rmids(cqm_info);
+#endif
+
return 0;
}

@@ -896,16 +953,16 @@ static u64 intel_cqm_event_count(struct perf_event *event)
return __perf_event_count(event);
}

-void alloc_needed_pkg_rmid(u32 *cqm_rmid)
+u32 alloc_needed_pkg_rmid(u32 *cqm_rmid)
{
unsigned long flags;
u32 rmid;

if (WARN_ON(!cqm_rmid))
- return;
+ return -EINVAL;

if (cqm_rmid == cqm_rootcginfo.rmid || cqm_rmid[pkg_id])
- return;
+ return 0;

raw_spin_lock_irqsave(&cache_lock, flags);

@@ -914,6 +971,8 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
cqm_rmid[pkg_id] = rmid;

raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return rmid;
}

static void intel_cqm_event_start(struct perf_event *event, int mode)
@@ -925,10 +984,8 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)

event->hw.cqm_state &= ~PERF_HES_STOPPED;

- if (is_task_event(event)) {
- alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+ if (is_task_event(event))
state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
- }
}

static void intel_cqm_event_stop(struct perf_event *event, int mode)
@@ -944,11 +1001,19 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)

static int intel_cqm_event_add(struct perf_event *event, int mode)
{
+ u32 rmid;
+
event->hw.cqm_state = PERF_HES_STOPPED;

- if ((mode & PERF_EF_START))
+ /*
+ * If Lazy RMID alloc fails indicate the error to the user.
+ */
+ if ((mode & PERF_EF_START)) {
+ rmid = alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+ if (!__rmid_valid_raw(rmid))
+ return -EINVAL;
intel_cqm_event_start(event, mode);
-
+ }
return 0;
}

@@ -1426,12 +1491,67 @@ static int cqm_cont_monitoring_write_u64(struct cgroup_subsys_state *css,
return ret;
}

+static int cqm_mon_mask_seq_show(struct seq_file *sf, void *v)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+ seq_printf(sf, "%*pbl\n",
+ cpumask_pr_args(&css_to_cqm_info(seq_css(sf))->mon_mask));
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ return 0;
+}
+
+static ssize_t cqm_mon_mask_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ cpumask_var_t tmp_cpus, tmp_cpus1;
+ struct cgrp_cqm_info *cqm_info;
+ unsigned long flags;
+ int ret = 0;
+
+ buf = strstrip(buf);
+
+ if (!zalloc_cpumask_var(&tmp_cpus, GFP_KERNEL) ||
+ !zalloc_cpumask_var(&tmp_cpus1, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = cpulist_parse(buf, tmp_cpus);
+ if (ret)
+ goto out;
+
+ if (cpumask_andnot(tmp_cpus1, tmp_cpus, &cqm_pkgmask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+ cqm_info = css_to_cqm_info(of_css(of));
+ cpumask_copy(&cqm_info->mon_mask, tmp_cpus);
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+out:
+ free_cpumask_var(tmp_cpus);
+ free_cpumask_var(tmp_cpus1);
+
+ return ret ?: nbytes;
+}
+
struct cftype perf_event_cgrp_arch_subsys_cftypes[] = {
{
.name = "cqm_cont_monitoring",
.read_u64 = cqm_cont_monitoring_read_u64,
.write_u64 = cqm_cont_monitoring_write_u64,
},
+ {
+ .name = "cqm_mon_mask",
+ .seq_show = cqm_mon_mask_seq_show,
+ .write = cqm_mon_mask_write,
+ .max_write_len = (100U + 6 * NR_CPUS),
+ },

{} /* terminate */
};
@@ -1449,8 +1569,10 @@ static inline void cqm_pick_event_reader(int cpu)

/* First online cpu in package becomes the reader */
reader = cpumask_any_and(&cqm_cpumask, topology_core_cpumask(cpu));
- if (reader >= nr_cpu_ids)
+ if (reader >= nr_cpu_ids) {
cpumask_set_cpu(cpu, &cqm_cpumask);
+ cpumask_set_cpu(pkg_id, &cqm_pkgmask);
+ }
}

static int intel_cqm_cpu_starting(unsigned int cpu)
@@ -1482,6 +1604,8 @@ static int intel_cqm_cpu_exit(unsigned int cpu)

if (target < nr_cpu_ids)
cpumask_set_cpu(target, &cqm_cpumask);
+ else
+ cpumask_clear_cpu(pkg_id, &cqm_pkgmask);

return 0;
}
@@ -1562,6 +1686,7 @@ static int pkg_data_init_cpu(int cpu)
*/
entry = __rmid_entry(0, curr_pkgid);
list_del(&entry->list);
+ pkg_data->rmid_used_count++;

cqm_rootcginfo.rmid = kzalloc(sizeof(u32) * cqm_socket_max, GFP_KERNEL);
if (!cqm_rootcginfo.rmid) {
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index 4415497..063956d 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -32,6 +32,7 @@ struct pkg_data {
atomic_t reuse_scheduled;

int rmid_work_cpu;
+ int rmid_used_count;
};
#endif
#endif
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index 6424322..39fa4fb 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -29,7 +29,7 @@ struct intel_pqr_state {

u32 __get_rmid(int domain);
bool __rmid_valid(u32 rmid);
-void alloc_needed_pkg_rmid(u32 *cqm_rmid);
+u32 alloc_needed_pkg_rmid(u32 *cqm_rmid);
struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);

extern struct cgrp_cqm_info cqm_rootcginfo;
@@ -42,7 +42,9 @@ struct intel_pqr_state {
* @cont_mon Continuous monitoring flag
* @mon_enabled Whether monitoring is enabled
* @level Level in the cgroup tree. Root is level 0.
- * @rmid The rmids of the cgroup.
+ * @rmid The rmids of the cgroup.
+ * @mon_mask Package Mask to indicate packages which must
+ * must have RMIDs(guaranteed cqm monitoring).
* @mfa 'Monitoring for ancestor' points to the cqm_info
* of the ancestor the cgroup is monitoring for. 'Monitoring for ancestor'
* means you will use an ancestors RMID at sched_in if you are
@@ -79,6 +81,7 @@ struct cgrp_cqm_info {
bool mon_enabled;
int level;
u32 *rmid;
+ struct cpumask mon_mask;
struct cgrp_cqm_info *mfa;
struct list_head tskmon_rlist;
};
--
1.9.1

2016-12-16 23:14:49

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 14/14] perf/stat: revamp read error handling, snapshot and per_pkg events

From: David Carrillo-Cisneros <[email protected]>

A package-wide event can return a valid read even if it has not run on a
specific CPU. This does not fit well with the assumption that run == 0
is equivalent to <not counted>.

To fix the problem, this patch defines special error values for val,
run and ena (~0ULL), and uses them to signal read errors, allowing
run == 0 to be a valid value for package events. A new value, (NA), is
output on read error and when the event has not been enabled
(time enabled == 0).

Finally, this patch revamps the calculation of deltas and scaling for
snapshot events: a snapshot event's count value is neither delta-adjusted
nor extrapolated by the enabled/running ratio, while deltas for time
enabled and time running are still computed, as it should be.

Tests: After this patch, user space can see the package llc_occupancy
count when the user runs perf stat -C <cpux> -G cgroup_y, and the
system-wide count when run with the -a option for the cgroup.
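
A concrete form of the above (event and cgroup names are placeholders):

  # Package-local occupancy, read from one CPU of the package of interest:
  perf stat -e intel_cqm/llc_occupancy/ -C 2 -G work -- sleep 1
  # System-wide occupancy for the same cgroup:
  perf stat -a -e intel_cqm/llc_occupancy/ -G work -- sleep 1
  # Reads that fail (or events never enabled) are now printed as (NA),
  # or NA,NA in -x, CSV output, instead of being scaled to bogus values.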

Reviewed-by: Stephane Eranian <[email protected]>
Signed-off-by: David Carrillo-Cisneros <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
tools/perf/builtin-stat.c | 36 +++++++++++++++++++++++-----------
tools/perf/util/counts.h | 19 ++++++++++++++++++
tools/perf/util/evsel.c | 49 ++++++++++++++++++++++++++++++++++++-----------
tools/perf/util/evsel.h | 8 ++++++--
tools/perf/util/stat.c | 35 +++++++++++----------------------
5 files changed, 99 insertions(+), 48 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c3c4b49..79043a3 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -311,10 +311,8 @@ static int read_counter(struct perf_evsel *counter)

count = perf_counts(counter->counts, cpu, thread);
if (perf_evsel__read(counter, cpu, thread, count)) {
- counter->counts->scaled = -1;
- perf_counts(counter->counts, cpu, thread)->ena = 0;
- perf_counts(counter->counts, cpu, thread)->run = 0;
- return -1;
+ /* do not write stat for failed reads. */
+ continue;
}

if (STAT_RECORD) {
@@ -725,12 +723,16 @@ static int run_perf_stat(int argc, const char **argv)

static void print_running(u64 run, u64 ena)
{
+ bool is_na = run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || !ena;
+
if (csv_output) {
- fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
- csv_sep,
- run,
- csv_sep,
- ena ? 100.0 * run / ena : 100.0);
+ if (is_na)
+ fprintf(stat_config.output, "%sNA%sNA", csv_sep, csv_sep);
+ else
+ fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
+ csv_sep, run, csv_sep, 100.0 * run / ena);
+ } else if (is_na) {
+ fprintf(stat_config.output, " (NA)");
} else if (run != ena) {
fprintf(stat_config.output, " (%.2f%%)", 100.0 * run / ena);
}
@@ -1103,7 +1105,7 @@ static void printout(int id, int nr, struct perf_evsel *counter, double uval,
if (counter->cgrp)
os.nfields++;
}
- if (run == 0 || ena == 0 || counter->counts->scaled == -1) {
+ if (run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || counter->counts->scaled == -1) {
if (metric_only) {
pm(&os, NULL, "", "", 0);
return;
@@ -1209,12 +1211,17 @@ static void print_aggr(char *prefix)
id = aggr_map->map[s];
first = true;
evlist__for_each_entry(evsel_list, counter) {
+ bool all_nan = true;
val = ena = run = 0;
nr = 0;
for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
s2 = aggr_get_id(perf_evsel__cpus(counter), cpu);
if (s2 != id)
continue;
+ /* skip NA reads. */
+ if (perf_counts_values__is_na(perf_counts(counter->counts, cpu, 0)))
+ continue;
+ all_nan = false;
val += perf_counts(counter->counts, cpu, 0)->val;
ena += perf_counts(counter->counts, cpu, 0)->ena;
run += perf_counts(counter->counts, cpu, 0)->run;
@@ -1228,6 +1235,10 @@ static void print_aggr(char *prefix)
fprintf(output, "%s", prefix);

uval = val * counter->scale;
+ if (all_nan) {
+ run = PERF_COUNTS_NA;
+ ena = PERF_COUNTS_NA;
+ }
printout(id, nr, counter, uval, prefix, run, ena, 1.0);
if (!metric_only)
fputc('\n', output);
@@ -1306,7 +1317,10 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
if (prefix)
fprintf(output, "%s", prefix);

- uval = val * counter->scale;
+ if (val != PERF_COUNTS_NA)
+ uval = val * counter->scale;
+ else
+ uval = NAN;
printout(cpu, 0, counter, uval, prefix, run, ena, 1.0);

fputc('\n', output);
diff --git a/tools/perf/util/counts.h b/tools/perf/util/counts.h
index 34d8baa..b65e97a 100644
--- a/tools/perf/util/counts.h
+++ b/tools/perf/util/counts.h
@@ -3,6 +3,9 @@

#include "xyarray.h"

+/* Not Available (NA) value. Any operation with a NA equals a NA. */
+#define PERF_COUNTS_NA ((u64)~0ULL)
+
struct perf_counts_values {
union {
struct {
@@ -14,6 +17,22 @@ struct perf_counts_values {
};
};

+static inline void
+perf_counts_values__make_na(struct perf_counts_values *values)
+{
+ values->val = PERF_COUNTS_NA;
+ values->ena = PERF_COUNTS_NA;
+ values->run = PERF_COUNTS_NA;
+}
+
+static inline bool
+perf_counts_values__is_na(struct perf_counts_values *values)
+{
+ return values->val == PERF_COUNTS_NA ||
+ values->ena == PERF_COUNTS_NA ||
+ values->run == PERF_COUNTS_NA;
+}
+
struct perf_counts {
s8 scaled;
struct perf_counts_values aggr;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d54efb5..fa0ba96 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1180,6 +1180,9 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
if (!evsel->prev_raw_counts)
return;

+ if (perf_counts_values__is_na(count))
+ return;
+
if (cpu == -1) {
tmp = evsel->prev_raw_counts->aggr;
evsel->prev_raw_counts->aggr = *count;
@@ -1188,26 +1191,43 @@ void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
*perf_counts(evsel->prev_raw_counts, cpu, thread) = *count;
}

- count->val = count->val - tmp.val;
+ /* Snapshot events do not calculate deltas for count values. */
+ if (!evsel->snapshot)
+ count->val = count->val - tmp.val;
count->ena = count->ena - tmp.ena;
count->run = count->run - tmp.run;
}

void perf_counts_values__scale(struct perf_counts_values *count,
- bool scale, s8 *pscaled)
+ bool scale, bool per_pkg, bool snapshot, s8 *pscaled)
{
s8 scaled = 0;

+ if (perf_counts_values__is_na(count)) {
+ if (pscaled)
+ *pscaled = -1;
+ return;
+ }
+
if (scale) {
- if (count->run == 0) {
+ /*
+ * per-pkg events can have run == 0 in a CPU and still be
+ * valid.
+ */
+ if (count->run == 0 && !per_pkg) {
scaled = -1;
count->val = 0;
} else if (count->run < count->ena) {
scaled = 1;
- count->val = (u64)((double) count->val * count->ena / count->run + 0.5);
+ /* Snapshot events do not scale counts values. */
+ if (!snapshot && count->run)
+ count->val = (u64)((double) count->val * count->ena /
+ count->run + 0.5);
}
- } else
- count->ena = count->run = 0;
+
+ } else {
+ count->run = count->ena;
+ }

if (pscaled)
*pscaled = scaled;
@@ -1221,8 +1241,10 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;

- if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
+ if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0) {
+ perf_counts_values__make_na(count);
return -errno;
+ }

return 0;
}
@@ -1230,6 +1252,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
int cpu, int thread, bool scale)
{
+ int ret = 0;
struct perf_counts_values count;
size_t nv = scale ? 3 : 1;

@@ -1239,13 +1262,17 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
return -ENOMEM;

- if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
- return -errno;
+ if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0) {
+ perf_counts_values__make_na(&count);
+ ret = -errno;
+ goto exit;
+ }

perf_evsel__compute_deltas(evsel, cpu, thread, &count);
- perf_counts_values__scale(&count, scale, NULL);
+ perf_counts_values__scale(&count, scale, evsel->per_pkg, evsel->snapshot, NULL);
+exit:
*perf_counts(evsel->counts, cpu, thread) = count;
- return 0;
+ return ret;
}

static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index b1503b0..facb6494 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -80,6 +80,10 @@ struct perf_evsel_config_term {
* @is_pos: the position (counting backwards) of the event id (PERF_SAMPLE_ID or
* PERF_SAMPLE_IDENTIFIER) in a non-sample event i.e. if sample_id_all
* is used there is an id sample appended to non-sample events
+ * @snapshot: an event that whose raw value cannot be extrapolated based on
+ * the ratio of running/enabled time.
+ * @per_pkg: an event that runs package wide. All cores in same package will
+ * read the same value, even if running time == 0.
* @priv: And what is in its containing unnamed union are tool specific
*/
struct perf_evsel {
@@ -150,8 +154,8 @@ static inline int perf_evsel__nr_cpus(struct perf_evsel *evsel)
return perf_evsel__cpus(evsel)->nr;
}

-void perf_counts_values__scale(struct perf_counts_values *count,
- bool scale, s8 *pscaled);
+void perf_counts_values__scale(struct perf_counts_values *count, bool scale,
+ bool per_pkg, bool snapshot, s8 *pscaled);

void perf_evsel__compute_deltas(struct perf_evsel *evsel, int cpu, int thread,
struct perf_counts_values *count);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 39345c2d..514b953 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -202,7 +202,7 @@ static void zero_per_pkg(struct perf_evsel *counter)
}

static int check_per_pkg(struct perf_evsel *counter,
- struct perf_counts_values *vals, int cpu, bool *skip)
+ int cpu, bool *skip)
{
unsigned long *mask = counter->per_pkg_mask;
struct cpu_map *cpus = perf_evsel__cpus(counter);
@@ -224,17 +224,6 @@ static int check_per_pkg(struct perf_evsel *counter,
counter->per_pkg_mask = mask;
}

- /*
- * we do not consider an event that has not run as a good
- * instance to mark a package as used (skip=1). Otherwise
- * we may run into a situation where the first CPU in a package
- * is not running anything, yet the second is, and this function
- * would mark the package as used after the first CPU and would
- * not read the values from the second CPU.
- */
- if (!(vals->run && vals->ena))
- return 0;
-
s = cpu_map__get_socket(cpus, cpu, NULL);
if (s < 0)
return -1;
@@ -249,30 +238,27 @@ static int check_per_pkg(struct perf_evsel *counter,
struct perf_counts_values *count)
{
struct perf_counts_values *aggr = &evsel->counts->aggr;
- static struct perf_counts_values zero;
bool skip = false;

- if (check_per_pkg(evsel, count, cpu, &skip)) {
+ if (check_per_pkg(evsel, cpu, &skip)) {
pr_err("failed to read per-pkg counter\n");
return -1;
}

- if (skip)
- count = &zero;
-
switch (config->aggr_mode) {
case AGGR_THREAD:
case AGGR_CORE:
case AGGR_SOCKET:
case AGGR_NONE:
- if (!evsel->snapshot)
- perf_evsel__compute_deltas(evsel, cpu, thread, count);
- perf_counts_values__scale(count, config->scale, NULL);
+ perf_evsel__compute_deltas(evsel, cpu, thread, count);
+ perf_counts_values__scale(count, config->scale,
+ evsel->per_pkg, evsel->snapshot, NULL);
if (config->aggr_mode == AGGR_NONE)
perf_stat__update_shadow_stats(evsel, count->values, cpu);
break;
case AGGR_GLOBAL:
- aggr->val += count->val;
+ if (!skip)
+ aggr->val += count->val;
if (config->scale) {
aggr->ena += count->ena;
aggr->run += count->run;
@@ -337,9 +323,10 @@ int perf_stat_process_counter(struct perf_stat_config *config,
if (config->aggr_mode != AGGR_GLOBAL)
return 0;

- if (!counter->snapshot)
- perf_evsel__compute_deltas(counter, -1, -1, aggr);
- perf_counts_values__scale(aggr, config->scale, &counter->counts->scaled);
+ perf_evsel__compute_deltas(counter, -1, -1, aggr);
+ perf_counts_values__scale(aggr, config->scale,
+ counter->per_pkg, counter->snapshot,
+ &counter->counts->scaled);

for (i = 0; i < 3; i++)
update_stats(&ps->res_stats[i], count[i]);
--
1.9.1

2016-12-16 23:15:00

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 13/14] perf/stat: fix bug in handling events in error state

From: Stephane Eranian <[email protected]>

When an event is in error state, read() returns 0
instead of the size of the buffer. In certain modes,
such as interval printing, ignoring the 0 return value
may cause bogus count deltas to be computed and
thus invalid results to be printed.

This patch fixes the problem by modifying read_counter()
to mark the event as not scaled (scaled = -1) to force
the printout routine to show <NOT COUNTED>.
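
A hedged reproduction sketch (the event choice is a placeholder):

  # Interval mode reads the counters repeatedly; with this fix an event
  # that has gone into error state is shown as <not counted> instead of
  # producing bogus deltas.
  perf stat -a -I 1000 -e intel_cqm/llc_occupancy/ -- sleep 5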

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Vikas Shivappa <[email protected]>
---
tools/perf/builtin-stat.c | 12 +++++++++---
tools/perf/util/evsel.c | 4 ++--
2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 688dea7..c3c4b49 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -310,8 +310,12 @@ static int read_counter(struct perf_evsel *counter)
struct perf_counts_values *count;

count = perf_counts(counter->counts, cpu, thread);
- if (perf_evsel__read(counter, cpu, thread, count))
+ if (perf_evsel__read(counter, cpu, thread, count)) {
+ counter->counts->scaled = -1;
+ perf_counts(counter->counts, cpu, thread)->ena = 0;
+ perf_counts(counter->counts, cpu, thread)->run = 0;
return -1;
+ }

if (STAT_RECORD) {
if (perf_evsel__write_stat_event(counter, cpu, thread, count)) {
@@ -336,12 +340,14 @@ static int read_counter(struct perf_evsel *counter)
static void read_counters(void)
{
struct perf_evsel *counter;
+ int ret;

evlist__for_each_entry(evsel_list, counter) {
- if (read_counter(counter))
+ ret = read_counter(counter);
+ if (ret)
pr_debug("failed to read counter %s\n", counter->name);

- if (perf_stat_process_counter(&stat_config, counter))
+ if (ret == 0 && perf_stat_process_counter(&stat_config, counter))
pr_warning("failed to process counter %s\n", counter->name);
}
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 8bc2711..d54efb5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1221,7 +1221,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;

- if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
+ if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
return -errno;

return 0;
@@ -1239,7 +1239,7 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, thread + 1) < 0)
return -ENOMEM;

- if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) < 0)
+ if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
return -errno;

perf_evsel__compute_deltas(evsel, cpu, thread, &count);
--
1.9.1

2016-12-16 23:15:12

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 12/14] perf/core,x86/cqm: Add read for Cgroup events,per pkg reads.

For cqm cgroup events, the events can be read even if the event was not
active on the CPU on which the event is being read. This is because the
RMIDs are per package, and hence if we read the llc_occupancy value on a
CPU x, we are really reading the occupancy of the package to which CPU x
belongs.

This patch adds a PERF_EV_CAP_INACTIVE_CPU_READ_PKG capability to
indicate this behaviour of cqm and also changes perf/core to still
perform the read for cgroup events even when the event is inactive on
the CPU. Task events have event->cpu set to -1 and hence this does not
apply to them.

Tests: Before this patch, perf stat -C <cpux> would not return a count
to perf/core. After this patch, the count for the package is returned to
perf/core. We still don't see the count in perf user mode; that is fixed
in the next patches.
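
A hedged usage sketch (event and cgroup names are placeholders):

  # Read the cgroup's llc_occupancy from one CPU of the package of
  # interest; with this patch the package count reaches perf/core even
  # if the cgroup event is not active on that CPU (the user-visible
  # output is completed by the later perf/stat patches).
  perf stat -e intel_cqm/llc_occupancy/ -C 2 -G work -- sleep 1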

Patch is based on David Carrillo-Cisneros <[email protected]> patches
in cqm2 series.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 1 +
include/linux/perf_event.h | 19 ++++++++++++++++---
kernel/events/core.c | 16 ++++++++++++----
3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index e0d4017..04723cc 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -1130,6 +1130,7 @@ static int intel_cqm_event_init(struct perf_event *event)
* cgroup hierarchies.
*/
event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
+ event->event_caps |= PERF_EV_CAP_INACTIVE_CPU_READ_PKG;

mutex_lock(&cache_mutex);

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index abeacb5..e55d709 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -525,10 +525,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
* PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
* cgroup scoping. It does not need to be enabled for all of its descendants
* cgroups.
+ * PERF_EV_CAP_INACTIVE_CPU_READ_PKG: A cgroup event where we can read
+ * the package count on any cpu on the pkg even if inactive.
*/
-#define PERF_EV_CAP_SOFTWARE BIT(0)
-#define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
-#define PERF_EV_CAP_CGROUP_NO_RECURSION BIT(2)
+#define PERF_EV_CAP_SOFTWARE BIT(0)
+#define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
+#define PERF_EV_CAP_CGROUP_NO_RECURSION BIT(2)
+#define PERF_EV_CAP_INACTIVE_CPU_READ_PKG BIT(3)

#define SWEVENT_HLIST_BITS 8
#define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
@@ -722,6 +725,16 @@ struct perf_event {
#endif /* CONFIG_PERF_EVENTS */
};

+#ifdef CONFIG_PERF_EVENTS
+static inline bool __perf_can_read_inactive(struct perf_event *event)
+{
+ if ((event->group_caps & PERF_EV_CAP_INACTIVE_CPU_READ_PKG))
+ return true;
+
+ return false;
+}
+#endif
+
/**
* struct perf_event_context - event context structure
*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a290c53..9c070b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3435,9 +3435,13 @@ struct perf_read_data {

static int find_cpu_to_read(struct perf_event *event, int local_cpu)
{
+ bool active = event->state == PERF_EVENT_STATE_ACTIVE;
int event_cpu = event->oncpu;
u16 local_pkg, event_pkg;

+ if (__perf_can_read_inactive(event) && !active)
+ event_cpu = event->cpu;
+
if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
event_pkg = topology_physical_package_id(event_cpu);
local_pkg = topology_physical_package_id(local_cpu);
@@ -3459,6 +3463,7 @@ static void __perf_event_read(void *info)
struct perf_event_context *ctx = event->ctx;
struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct pmu *pmu = event->pmu;
+ bool read_inactive = __perf_can_read_inactive(event);

/*
* If this is a task context, we need to check whether it is
@@ -3467,7 +3472,7 @@ static void __perf_event_read(void *info)
* event->count would have been updated to a recent sample
* when the event was scheduled out.
*/
- if (ctx->task && cpuctx->task_ctx != ctx)
+ if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
return;

raw_spin_lock(&ctx->lock);
@@ -3477,7 +3482,7 @@ static void __perf_event_read(void *info)
}

update_event_times(event);
- if (event->state != PERF_EVENT_STATE_ACTIVE)
+ if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
goto unlock;

if (!data->group) {
@@ -3492,7 +3497,8 @@ static void __perf_event_read(void *info)

list_for_each_entry(sub, &event->sibling_list, group_entry) {
update_event_times(sub);
- if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+ if (sub->state == PERF_EVENT_STATE_ACTIVE ||
+ __perf_can_read_inactive(sub)) {
/*
* Use sibling's PMU rather than @event's since
* sibling could be on different (eg: software) PMU.
@@ -3570,13 +3576,15 @@ u64 perf_event_read_local(struct perf_event *event)

static int perf_event_read(struct perf_event *event, bool group)
{
+ bool active = event->state == PERF_EVENT_STATE_ACTIVE;
int ret = 0, cpu_to_read, local_cpu;

/*
* If event is enabled and currently active on a CPU, update the
* value in the event structure:
*/
- if (event->state == PERF_EVENT_STATE_ACTIVE) {
+ if (active || __perf_can_read_inactive(event)) {
+
struct perf_read_data data = {
.event = event,
.group = group,
--
1.9.1

2016-12-16 23:15:24

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 09/14] x86/cqm: Add Continuous cgroup monitoring

This patch adds support for continuous cgroup monitoring, which makes it
possible to start monitoring a cgroup by toggling the cont_mon flag in
the cgroup without any perf overhead.
The cgroup is monitored from the time this flag is set,
and the user can fetch the data from perf whenever it is needed.
This avoids perf overhead for the whole time the cgroup is being
monitored; if one has to monitor a cgroup for its lifetime, perf does
not need to run the whole time.

A new file, cqm_cont_monitoring, is introduced in the cgroup. Once this
is enabled a new RMID is assigned to the cgroup. If an event is later
created to monitor this cgroup, the event just reuses the same RMID. At
switch_to time, we add a check to see if continuous monitoring is
enabled. During read, data is fetched by reading the counters in the
same way as is done for other cgroups.

Tests: One should be able to monitor a cgroup continuously without perf
by toggling the new cqm_cont_monitoring file in the cgroup.
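
A hedged usage sketch (the cgroup v1 mount point and the "perf_event."
file name prefix are assumptions; the cqm_cont_monitoring file is the
one added by this patch):

  # Start continuous monitoring: an RMID is assigned now and occupancy
  # accumulates with no perf event open.
  echo 1 > /sys/fs/cgroup/perf_event/work/perf_event.cqm_cont_monitoring
  # Fetch the data later with a normal cgroup event:
  perf stat -a -e intel_cqm/llc_occupancy/ -G work -- sleep 1
  # Stop continuous monitoring and release the RMIDs:
  echo 0 > /sys/fs/cgroup/perf_event/work/perf_event.cqm_cont_monitoring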

Patch is based on David Carrillo-Cisneros <[email protected]> patches
in cqm2 series.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 119 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 114 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 8017886..73f566a 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -521,6 +521,7 @@ static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
static int intel_cqm_setup_event(struct perf_event *event,
struct perf_event **group)
{
+ struct cgrp_cqm_info *cqm_info;
struct perf_event *iter;
u32 *rmid, sizet;

@@ -537,6 +538,18 @@ static int intel_cqm_setup_event(struct perf_event *event,
return 0;
}
}
+#ifdef CONFIG_CGROUP_PERF
+ /*
+ * For continously monitored cgroups, *rmid is allocated already.
+ */
+ if (event->cgrp) {
+ cqm_info = cgrp_to_cqm_info(event->cgrp);
+ if (cqm_info->cont_mon) {
+ event->hw.cqm_rmid = cqm_info->rmid;
+ return 0;
+ }
+ }
+#endif

/*
* RMIDs are allocated in LAZY mode by default only when
@@ -547,6 +560,8 @@ static int intel_cqm_setup_event(struct perf_event *event,
if (!event->hw.cqm_rmid)
return -ENOMEM;

+ cqm_assign_rmid(event, event->hw.cqm_rmid);
+
return 0;
}

@@ -843,18 +858,23 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
return 0;
}

+static inline bool is_cont_mon_event(struct perf_event *event);
+
static inline void
cqm_event_free_rmid(struct perf_event *event)
{
u32 *rmid = event->hw.cqm_rmid;
int d;

- for (d = 0; d < cqm_socket_max; d++) {
- if (__rmid_valid(rmid[d]))
- __put_rmid(rmid[d], d);
+ if (!is_cont_mon_event(event)) {
+
+ for (d = 0; d < cqm_socket_max; d++) {
+ if (__rmid_valid(rmid[d]))
+ __put_rmid(rmid[d], d);
+ }
+ cqm_assign_rmid(event, NULL);
+ kfree(event->hw.cqm_rmid);
}
- kfree(event->hw.cqm_rmid);
- cqm_assign_rmid(event, NULL);
list_del(&event->hw.cqm_groups_entry);
}

@@ -1122,6 +1142,11 @@ static int intel_cqm_event_init(struct perf_event *event)
};

#ifdef CONFIG_CGROUP_PERF
+static inline bool is_cont_mon_event(struct perf_event *event)
+{
+ return (is_cgroup_event(event) && cgrp_to_cqm_info(event->cgrp)->cont_mon);
+}
+
int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
struct cgroup_subsys_state *new_css)
{
@@ -1230,6 +1255,90 @@ int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)

return 0;
}
+
+/* kernfs guarantees that css doesn't need to be pinned. */
+static u64 cqm_cont_monitoring_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ int ret = -1;
+
+ mutex_lock(&cache_mutex);
+ ret = css_to_cqm_info(css)->cont_mon;
+ mutex_unlock(&cache_mutex);
+
+ return ret;
+}
+
+/* kernfs guarantees that css doesn't need to be pinned. */
+static int cqm_cont_monitoring_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 value)
+{
+ struct cgrp_cqm_info *cqm_info;
+ unsigned long flags;
+ int ret = 0, d;
+
+ if (value > 1)
+ return -1;
+
+ mutex_lock(&cache_mutex);
+
+ /* Root cgroup cannot stop being monitored. */
+ if (!css->parent)
+ goto out;
+
+ cqm_info = css_to_cqm_info(css);
+
+ /*
+ * Alloc and free rmid when cont monitoring is being set
+ * and reset.
+ */
+ if (!cqm_info->cont_mon && value && !cqm_info->rmid) {
+ cqm_info->rmid =
+ kzalloc(sizeof(u32) * cqm_socket_max, GFP_KERNEL);
+ if (!cqm_info->rmid) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ cqm_assign_hier_rmid(css, cqm_info->rmid);
+ }
+
+ if (cqm_info->cont_mon && !value) {
+ u32 *rmid = cqm_info->rmid;
+
+ raw_spin_lock_irqsave(&cache_lock, flags);
+ for (d = 0; d < cqm_socket_max; d++) {
+ if (__rmid_valid(rmid[d]))
+ __put_rmid(rmid[d], d);
+ }
+ raw_spin_unlock_irqrestore(&cache_lock, flags);
+
+ kfree(cqm_info->rmid);
+ cqm_assign_hier_rmid(css, NULL);
+ }
+
+ cqm_info->cont_mon = value;
+out:
+ mutex_unlock(&cache_mutex);
+
+ return ret;
+}
+
+struct cftype perf_event_cgrp_arch_subsys_cftypes[] = {
+ {
+ .name = "cqm_cont_monitoring",
+ .read_u64 = cqm_cont_monitoring_read_u64,
+ .write_u64 = cqm_cont_monitoring_write_u64,
+ },
+
+ {} /* terminate */
+};
+#else
+
+static inline bool is_cont_mon_event(struct perf_event *event)
+{
+ return false;
+}
#endif

static inline void cqm_pick_event_reader(int cpu)
--
1.9.1

2016-12-16 23:15:33

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 08/14] x86/cqm: Add support for monitoring task and cgroup together

From: David Carrillo-Cisneros <[email protected]>

This patch adds support to monitor a cgroup x and a task p1
when p1 is part of cgroup x. Since we cannot write two RMIDs during
sched_in, the driver handles this case itself.

This patch introduces a u32 *rmid in the task_struct which keeps track
of the RMIDs associated with the task. There is also a list in the
arch_info of the perf_cgroup, tskmon_rlist, which keeps track of the
tasks in the cgroup that are monitored.

The tskmon_rlist is modified in two scenarios:
- At event_init of a task p1 which is part of a cgroup, add p1 to the
cgroup's tskmon_rlist. At event_destroy, delete the task from the list.
- When a task moves from cgroup x to cgroup y, if the task was
monitored, remove the task from cgroup x's tskmon_rlist and add it to
cgroup y's tskmon_rlist.

sched_in: When the task p1 is scheduled in, we write the task's RMID
into the PQR_ASSOC MSR.

read (for task p1): Same as any other cqm task event.

read (for cgroup x): When counting for the cgroup, the tskmon_rlist is
traversed and the corresponding RMID counts are added.

Tests: Monitoring a cgroup x and a task within cgroup x at the same time
should work.
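
A hedged test sketch (cgroup name and task pid are placeholders):

  # Monitor the cgroup and, at the same time, one task inside it:
  perf stat -a -e intel_cqm/llc_occupancy/ -G work -- sleep 5 &
  perf stat -e intel_cqm/llc_occupancy/ -p $TASK_PID -- sleep 5
  # The task is counted with its own RMID, and its occupancy is added
  # back into the cgroup's reading when the cgroup event is read.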

Patch modified/refactored by Vikas Shivappa
<[email protected]> to support recycling removal,
changes in the arch_info.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 137 +++++++++++++++++++++++++++++++++++++++++++-
include/linux/sched.h | 3 +
2 files changed, 137 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 506e187..8017886 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -362,6 +362,36 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
}

+static inline int add_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
+{
+ struct tsk_rmid_entry *entry;
+
+ entry = kzalloc(sizeof(struct tsk_rmid_entry), GFP_KERNEL);
+ if (!entry)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&entry->list);
+ entry->rmid = rmid;
+
+ list_add_tail(&entry->list, l);
+
+ return 0;
+}
+
+static inline void del_cgrp_tskmon_entry(u32 *rmid, struct list_head *l)
+{
+ struct tsk_rmid_entry *entry = NULL, *tmp1;
+
+ list_for_each_entry_safe(entry, tmp1, l, list) {
+ if (entry->rmid == rmid) {
+
+ list_del(&entry->list);
+ kfree(entry);
+ break;
+ }
+ }
+}
+
#ifdef CONFIG_CGROUP_PERF
struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
{
@@ -379,6 +409,49 @@ struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
}
#endif

+static inline void
+ cgrp_tskmon_update(struct task_struct *tsk, u32 *rmid, bool ena)
+{
+ struct cgrp_cqm_info *ccinfo = NULL;
+
+#ifdef CONFIG_CGROUP_PERF
+ ccinfo = cqminfo_from_tsk(tsk);
+#endif
+ if (!ccinfo)
+ return;
+
+ if (ena)
+ add_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
+ else
+ del_cgrp_tskmon_entry(rmid, &ccinfo->tskmon_rlist);
+}
+
+static int cqm_assign_task_rmid(struct perf_event *event, u32 *rmid)
+{
+ struct task_struct *tsk;
+ int ret = 0;
+
+ rcu_read_lock();
+ tsk = event->hw.target;
+ if (pid_alive(tsk)) {
+ get_task_struct(tsk);
+
+ if (rmid != NULL)
+ cgrp_tskmon_update(tsk, rmid, true);
+ else
+ cgrp_tskmon_update(tsk, tsk->rmid, false);
+
+ tsk->rmid = rmid;
+
+ put_task_struct(tsk);
+ } else {
+ ret = -EINVAL;
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+
static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
{
if (rmid != NULL) {
@@ -428,8 +501,12 @@ static void cqm_assign_hier_rmid(struct cgroup_subsys_state *rcss, u32 *rmid)

static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
{
+ if (is_task_event(event)) {
+ if (cqm_assign_task_rmid(event, rmid))
+ return -EINVAL;
+ }
#ifdef CONFIG_CGROUP_PERF
- if (is_cgroup_event(event)) {
+ else if (is_cgroup_event(event)) {
cqm_assign_hier_rmid(&event->cgrp->css, rmid);
}
#endif
@@ -630,6 +707,8 @@ static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)

struct cgroup_subsys_state *rcss, *pos_css;
struct cgrp_cqm_info *ccqm_info;
+ struct tsk_rmid_entry *entry;
+ struct list_head *l;

cqm_mask_call_local(rr);
local64_set(&event->count, atomic64_read(&(rr->value)));
@@ -645,6 +724,13 @@ static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)
/* Add the descendent 'monitored cgroup' counts */
if (pos_css != rcss && ccqm_info->mon_enabled)
delta_local(event, rr, ccqm_info->rmid);
+
+ /* Add your and descendent 'monitored task' counts */
+ if (!list_empty(&ccqm_info->tskmon_rlist)) {
+ l = &ccqm_info->tskmon_rlist;
+ list_for_each_entry(entry, l, list)
+ delta_local(event, rr, entry->rmid);
+ }
}
rcu_read_unlock();
#endif
@@ -1095,10 +1181,55 @@ void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
mutex_unlock(&cache_mutex);
}

+/*
+ * Called while attaching/detaching task to a cgroup.
+ */
+static bool is_task_monitored(struct task_struct *tsk)
+{
+ return (tsk->rmid != NULL);
+}
+
void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
-{}
+{
+ struct cgroup_subsys_state *new_css;
+ struct cgrp_cqm_info *cqm_info;
+ struct task_struct *task;
+
+ mutex_lock(&cache_mutex);
+
+ cgroup_taskset_for_each(task, new_css, tset) {
+ if (!is_task_monitored(task))
+ continue;
+
+ cqm_info = cqminfo_from_tsk(task);
+ if (cqm_info)
+ add_cgrp_tskmon_entry(task->rmid,
+ &cqm_info->tskmon_rlist);
+ }
+ mutex_unlock(&cache_mutex);
+}
+
int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
-{}
+{
+ struct cgroup_subsys_state *new_css;
+ struct cgrp_cqm_info *cqm_info;
+ struct task_struct *task;
+
+ mutex_lock(&cache_mutex);
+ cgroup_taskset_for_each(task, new_css, tset) {
+ if (!is_task_monitored(task))
+ continue;
+ cqm_info = cqminfo_from_tsk(task);
+
+ if (cqm_info)
+ del_cgrp_tskmon_entry(task->rmid,
+ &cqm_info->tskmon_rlist);
+
+ }
+ mutex_unlock(&cache_mutex);
+
+ return 0;
+}
#endif

static inline void cqm_pick_event_reader(int cpu)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c8f4152..a6f8060b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1794,6 +1794,9 @@ struct task_struct {
#ifdef CONFIG_INTEL_RDT_A
int closid;
#endif
+#ifdef CONFIG_INTEL_RDT_M
+ u32 *rmid;
+#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
--
1.9.1

2016-12-16 23:15:54

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 07/14] x86/rdt,cqm: Scheduling support update

Introduce a scheduling hook, finish_arch_pre_lock_switch, which is
called just after the perf sched_in during context switch. This method
handles both the CAT and cqm sched_in scenarios.
The IA32_PQR_ASSOC MSR is used by both CAT (cache allocation) and cqm,
and this patch integrates the two MSR writes into one. The common
sched_in path checks whether the per-CPU cached value has a different
RMID or CLOSid than the task and, if so, does the MSR write.

During sched_in the task uses its own RMID if the task is monitored;
otherwise it uses the task's cgroup RMID.

Patch is based on David Carrillo-Cisneros <[email protected]> patches
in cqm2 series.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 46 +++++++++------
arch/x86/include/asm/intel_pqr_common.h | 38 +++++++++++++
arch/x86/include/asm/intel_rdt.h | 39 -------------
arch/x86/include/asm/intel_rdt_common.h | 13 +++++
arch/x86/include/asm/processor.h | 4 ++
arch/x86/kernel/cpu/Makefile | 1 +
arch/x86/kernel/cpu/intel_rdt_common.c | 97 ++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +-
arch/x86/kernel/process_32.c | 4 --
arch/x86/kernel/process_64.c | 4 --
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 3 +
12 files changed, 189 insertions(+), 65 deletions(-)
create mode 100644 arch/x86/include/asm/intel_pqr_common.h
create mode 100644 arch/x86/kernel/cpu/intel_rdt_common.c

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 57edbfc..506e187 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -28,13 +28,6 @@
static bool cqm_enabled, mbm_enabled;
unsigned int cqm_socket_max;

-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
static struct hrtimer *mbm_timers;
/**
* struct sample - mbm event's (local or total) data
@@ -55,7 +48,6 @@ struct sample {
*/
static struct sample *mbm_local;

-#define pkg_id topology_physical_package_id(smp_processor_id())
/*
* rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
* mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
@@ -74,6 +66,8 @@ struct sample {
static DEFINE_MUTEX(cache_mutex);
static DEFINE_RAW_SPINLOCK(cache_lock);

+DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
+
/*
* Groups of events that have the same target(s), one RMID per group.
*/
@@ -108,7 +102,7 @@ struct sample {
* Likewise, an rmid value of -1 is used to indicate "no rmid currently
* assigned" and is used as part of the rotation code.
*/
-static inline bool __rmid_valid(u32 rmid)
+bool __rmid_valid(u32 rmid)
{
if (!rmid || rmid > cqm_max_rmid)
return false;
@@ -161,7 +155,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
*
* We expect to be called with cache_mutex held.
*/
-static u32 __get_rmid(int domain)
+u32 __get_rmid(int domain)
{
struct list_head *cqm_flist;
struct cqm_rmid_entry *entry;
@@ -368,6 +362,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
}

+#ifdef CONFIG_CGROUP_PERF
+struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
+{
+ struct cgrp_cqm_info *ccinfo = NULL;
+ struct perf_cgroup *pcgrp;
+
+ pcgrp = perf_cgroup_from_task(tsk, NULL);
+
+ if (!pcgrp)
+ return NULL;
+ else
+ ccinfo = cgrp_to_cqm_info(pcgrp);
+
+ return ccinfo;
+}
+#endif
+
static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
{
if (rmid != NULL) {
@@ -713,26 +724,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
static void intel_cqm_event_start(struct perf_event *event, int mode)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- u32 rmid;

if (!(event->hw.cqm_state & PERF_HES_STOPPED))
return;

event->hw.cqm_state &= ~PERF_HES_STOPPED;

- alloc_needed_pkg_rmid(event->hw.cqm_rmid);
-
- rmid = event->hw.cqm_rmid[pkg_id];
- state->rmid = rmid;
- wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
+ if (is_task_event(event)) {
+ alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+ state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
+ }
}

static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
if (event->hw.cqm_state & PERF_HES_STOPPED)
return;

event->hw.cqm_state |= PERF_HES_STOPPED;
+ state->next_task_rmid = 0;
}

static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -1366,6 +1378,8 @@ static int __init intel_cqm_init(void)
if (mbm_enabled)
pr_info("Intel MBM enabled\n");

+ static_branch_enable(&cqm_enable_key);
+
/*
* Setup the hot cpu notifier once we are sure cqm
* is enabled to avoid notifier leak.
diff --git a/arch/x86/include/asm/intel_pqr_common.h b/arch/x86/include/asm/intel_pqr_common.h
new file mode 100644
index 0000000..8fe9d8e
--- /dev/null
+++ b/arch/x86/include/asm/intel_pqr_common.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_X86_INTEL_PQR_COMMON_H
+#define _ASM_X86_INTEL_PQR_COMMON_H
+
+#ifdef CONFIG_INTEL_RDT
+
+#include <linux/jump_label.h>
+#include <linux/types.h>
+#include <asm/percpu.h>
+#include <asm/msr.h>
+#include <asm/intel_rdt_common.h>
+
+void __intel_rdt_sched_in(void);
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports resource control and we enable by mounting the
+ * resctrl file system.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+ if (static_branch_likely(&rdt_enable_key) ||
+ static_branch_unlikely(&cqm_enable_key)) {
+ __intel_rdt_sched_in();
+ }
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 95ce5c8..3b4a099 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -5,7 +5,6 @@

#include <linux/kernfs.h>
#include <linux/jump_label.h>
-
#include <asm/intel_rdt_common.h>

#define IA32_L3_QOS_CFG 0xc81
@@ -182,43 +181,5 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);

-/*
- * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
- *
- * Following considerations are made so that this has minimal impact
- * on scheduler hot path:
- * - This will stay as no-op unless we are running on an Intel SKU
- * which supports resource control and we enable by mounting the
- * resctrl file system.
- * - Caches the per cpu CLOSid values and does the MSR write only
- * when a task with a different CLOSid is scheduled in.
- *
- * Must be called with preemption disabled.
- */
-static inline void intel_rdt_sched_in(void)
-{
- if (static_branch_likely(&rdt_enable_key)) {
- struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- int closid;
-
- /*
- * If this task has a closid assigned, use it.
- * Else use the closid assigned to this cpu.
- */
- closid = current->closid;
- if (closid == 0)
- closid = this_cpu_read(cpu_closid);
-
- if (closid != state->closid) {
- state->closid = closid;
- wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
- }
- }
-}
-
-#else
-
-static inline void intel_rdt_sched_in(void) {}
-
#endif /* CONFIG_INTEL_RDT_A */
#endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index e11ed5e..6424322 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -18,12 +18,25 @@
*/
struct intel_pqr_state {
u32 rmid;
+ u32 next_task_rmid;
u32 closid;
int rmid_usecnt;
};

DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);

+#define pkg_id topology_physical_package_id(smp_processor_id())
+
+u32 __get_rmid(int domain);
+bool __rmid_valid(u32 rmid);
+void alloc_needed_pkg_rmid(u32 *cqm_rmid);
+struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
+
+extern struct cgrp_cqm_info cqm_rootcginfo;
+
+DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
/**
* struct cgrp_cqm_info - perf_event cgroup metadata for cqm
* @cont_mon Continuous monitoring flag
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e7f8c62..b0ce5cc 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@
#include <asm/nops.h>
#include <asm/special_insns.h>
#include <asm/fpu/types.h>
+#include <asm/intel_pqr_common.h>

#include <linux/personality.h>
#include <linux/cache.h>
@@ -870,4 +871,7 @@ static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)

void stop_this_cpu(void *dummy);
void df_debug(struct pt_regs *regs, long error_code);
+
+#define finish_arch_pre_lock_switch intel_rdt_sched_in
+
#endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index c9f8c81..1035c97 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

+obj-$(CONFIG_INTEL_RDT) += intel_rdt_common.o
obj-$(CONFIG_INTEL_RDT_A) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o

obj-$(CONFIG_X86_MCE) += mcheck/
diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
new file mode 100644
index 0000000..83c8c00
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -0,0 +1,97 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/cacheinfo.h>
+#include <linux/cpuhotplug.h>
+
+#include <asm/intel-family.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Both functions which modify the state
+ * (intel_cqm_event_start and intel_cqm_event_stop) are called with
+ * interrupts disabled, which is sufficient for the protection.
+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#ifdef CONFIG_INTEL_RDT_M
+static inline int get_cgroup_sched_rmid(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+ struct cgrp_cqm_info *ccinfo = NULL;
+
+ ccinfo = cqminfo_from_tsk(current);
+
+ if (!ccinfo)
+ return 0;
+
+ /*
+ * A cgroup is always monitoring for itself or
+ * for an ancestor(default is root).
+ */
+ if (ccinfo->mon_enabled) {
+ alloc_needed_pkg_rmid(ccinfo->rmid);
+ return ccinfo->rmid[pkg_id];
+ } else {
+ alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
+ return ccinfo->mfa->rmid[pkg_id];
+ }
+#endif
+
+ return 0;
+}
+
+static inline int get_sched_in_rmid(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ u32 rmid = 0;
+
+ rmid = state->next_task_rmid;
+
+ return rmid ? rmid : get_cgroup_sched_rmid();
+}
+#endif
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports resource control and we enable by mounting the
+ * resctrl file system or it supports resource monitoring.
+ * - Caches the per cpu CLOSid/RMID values and does the MSR write only
+ * when a task with a different CLOSid/RMID is scheduled in.
+ */
+void __intel_rdt_sched_in(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ int closid = 0;
+ u32 rmid = 0;
+
+#ifdef CONFIG_INTEL_RDT_A
+ if (static_branch_likely(&rdt_enable_key)) {
+ /*
+ * If this task has a closid assigned, use it.
+ * Else use the closid assigned to this cpu.
+ */
+ closid = current->closid;
+ if (closid == 0)
+ closid = this_cpu_read(cpu_closid);
+ }
+#endif
+
+#ifdef CONFIG_INTEL_RDT_M
+ if (static_branch_unlikely(&cqm_enable_key))
+ rmid = get_sched_in_rmid();
+#endif
+
+ if (closid != state->closid || rmid != state->rmid) {
+
+ state->closid = closid;
+ state->rmid = rmid;
+ wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
+ }
+}
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 8af04af..8b6b429 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -206,7 +206,7 @@ static void rdt_update_cpu_closid(void *closid)
* executing task might have its own closid selected. Just reuse
* the context switch code.
*/
- intel_rdt_sched_in();
+ __intel_rdt_sched_in();
}

/*
@@ -328,7 +328,7 @@ static void move_myself(struct callback_head *head)

preempt_disable();
/* update PQR_ASSOC MSR to make resource group go into effect */
- intel_rdt_sched_in();
+ __intel_rdt_sched_in();
preempt_enable();

kfree(callback);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index efe7f9f..bd7be8e 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -54,7 +54,6 @@
#include <asm/debugreg.h>
#include <asm/switch_to.h>
#include <asm/vm86.h>
-#include <asm/intel_rdt.h>

void __show_regs(struct pt_regs *regs, int all)
{
@@ -300,8 +299,5 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,

this_cpu_write(current_task, next_p);

- /* Load the Intel cache allocation PQR MSR. */
- intel_rdt_sched_in();
-
return prev_p;
}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index acd7d6f..b3760b3 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -50,7 +50,6 @@
#include <asm/switch_to.h>
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
-#include <asm/intel_rdt.h>

__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);

@@ -474,9 +473,6 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp)
loadsegment(ss, __KERNEL_DS);
}

- /* Load the Intel cache allocation PQR MSR. */
- intel_rdt_sched_in();
-
return prev_p;
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 154fd68..b2c9106 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2766,6 +2766,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
prev_state = prev->state;
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
+ finish_arch_pre_lock_switch();
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..0a0208e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1112,6 +1112,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
#ifndef prepare_arch_switch
# define prepare_arch_switch(next) do { } while (0)
#endif
+#ifndef finish_arch_pre_lock_switch
+# define finish_arch_pre_lock_switch() do { } while (0)
+#endif
#ifndef finish_arch_post_lock_switch
# define finish_arch_post_lock_switch() do { } while (0)
#endif
--
1.9.1

2016-12-16 23:16:04

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 06/14] x86/cqm: Add cgroup hierarchical monitoring support

From: David Carrillo-Cisneros <[email protected]>

This patch adds support for monitoring a cgroup hierarchy. The
arch_info field that was introduced in perf_cgroup is used to maintain
the cgroup-related RMID and hierarchy information.

Since cgroups support hierarchical monitoring, a cgroup is always
monitoring for some ancestor. By default the root is always monitored
with RMID 0, hence when any cgroup is first created it reports its data
to the root. mfa, or 'monitor for ancestor', keeps track of which
ancestor the cgroup is actually monitoring for, i.e. which ancestor it
has to report its data to.

By default, every cgroup's mfa points to the root.
1. event init: Whenever a new cgroup x starts to be monitored, the mfa
of each of x's descendants is updated to point to cgroup x.
2. switch_to: The task finds the cgroup it is associated with; if that
cgroup is itself monitored, it uses its own RMID (a), else it uses the
RMID of its mfa (b).
3. read: During the read call, cgroup x adds the counts of those
descendants that had cgroup x as their mfa and were themselves
monitored (to account for scenario (a) in switch_to). A minimal sketch
of these rules follows.
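
The sketch below uses simplified, illustrative structures and is not
the driver code itself; the actual implementation lives in
cqm_assign_hier_rmid() and the sched_in path of this series.

/*
 * Minimal sketch of the mfa bookkeeping described above.
 */
struct cgrp_sketch {
	int level;			/* root == 0 */
	int mon_enabled;		/* cgroup has its own RMID */
	unsigned int rmid;
	struct cgrp_sketch *mfa;	/* nearest monitored ancestor */
};

/* 1. event init: cgroup x starts (rmid != 0) or stops (rmid == 0) monitoring */
static void sketch_assign_rmid(struct cgrp_sketch *x, struct cgrp_sketch **desc,
			       int ndesc, unsigned int rmid)
{
	int i;

	x->mon_enabled = rmid != 0;
	x->rmid = rmid;
	for (i = 0; i < ndesc; i++) {
		if (rmid && desc[i]->mfa->level < x->level)
			desc[i]->mfa = x;	/* start monitoring for x */
		if (!rmid && desc[i]->mfa == x)
			desc[i]->mfa = x->mfa;	/* fall back to x's mfa */
	}
}

/* 2. switch_to: pick the RMID to load for a task in cgroup c */
static unsigned int sketch_sched_in_rmid(struct cgrp_sketch *c)
{
	return c->mon_enabled ? c->rmid : c->mfa->rmid;
}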

Locking: cgroup traversal: rcu_read_lock. cgroup->arch_info: css_alloc,
css_free, event terminate and init hold the mutex.

Tests: Cgroup monitoring should work. Monitoring multiple cgroups in the
same hierarchy works. Monitoring a cgroup and a task within the same
cgroup doesn't work yet.

Patch modified/refactored by Vikas Shivappa
<[email protected]> to support recycling removal.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 227 +++++++++++++++++++++++++++-----
arch/x86/include/asm/intel_rdt_common.h | 64 +++++++++
2 files changed, 257 insertions(+), 34 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 536c8ad..57edbfc 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -85,6 +85,7 @@ struct sample {
static cpumask_t cqm_cpumask;

struct pkg_data **cqm_pkgs_data;
+struct cgrp_cqm_info cqm_rootcginfo;

#define RMID_VAL_ERROR (1ULL << 63)
#define RMID_VAL_UNAVAIL (1ULL << 62)
@@ -193,6 +194,11 @@ static void __put_rmid(u32 rmid, int domain)
list_add_tail(&entry->list, &cqm_pkgs_data[domain]->cqm_rmid_limbo_lru);
}

+static bool is_task_event(struct perf_event *e)
+{
+ return (e->attach_state & PERF_ATTACH_TASK);
+}
+
static void cqm_cleanup(void)
{
int i;
@@ -209,7 +215,6 @@ static void cqm_cleanup(void)
kfree(cqm_pkgs_data);
}

-
/*
* Determine if @a and @b measure the same set of tasks.
*
@@ -224,20 +229,18 @@ static bool __match_event(struct perf_event *a, struct perf_event *b)
return false;

#ifdef CONFIG_CGROUP_PERF
- if (a->cgrp != b->cgrp)
- return false;
-#endif
-
- /* If not task event, we're machine wide */
- if (!(b->attach_state & PERF_ATTACH_TASK))
+ if ((is_cgroup_event(a) && is_cgroup_event(b)) &&
+ (a->cgrp == b->cgrp))
return true;
+#endif

/*
* Events that target same task are placed into the same cache group.
* Mark it as a multi event group, so that we update ->count
* for every event rather than just the group leader later.
*/
- if (a->hw.target == b->hw.target) {
+ if ((is_task_event(a) && is_task_event(b)) &&
+ (a->hw.target == b->hw.target)) {
b->hw.is_group_event = true;
return true;
}
@@ -365,6 +368,63 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
}

+static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
+{
+ if (rmid != NULL) {
+ cqm_info->mon_enabled = true;
+ cqm_info->rmid = rmid;
+ } else {
+ cqm_info->mon_enabled = false;
+ cqm_info->rmid = NULL;
+ }
+}
+
+static void cqm_assign_hier_rmid(struct cgroup_subsys_state *rcss, u32 *rmid)
+{
+ struct cgrp_cqm_info *ccqm_info, *rcqm_info;
+ struct cgroup_subsys_state *pos_css;
+
+ rcu_read_lock();
+
+ rcqm_info = css_to_cqm_info(rcss);
+
+ /* Enable or disable monitoring based on rmid.*/
+ cqm_enable_mon(rcqm_info, rmid);
+
+ pos_css = css_next_descendant_pre(rcss, rcss);
+ while (pos_css) {
+ ccqm_info = css_to_cqm_info(pos_css);
+
+ /*
+ * Monitoring is being enabled.
+ * Update the descendents to monitor for you, unless
+ * they were already monitoring for a descendent of yours.
+ */
+ if (rmid && (rcqm_info->level > ccqm_info->mfa->level))
+ ccqm_info->mfa = rcqm_info;
+
+ /*
+ * Monitoring is being disabled.
+ * Update the descendents who were monitoring for you
+ * to monitor for the ancestor you were monitoring.
+ */
+ if (!rmid && (ccqm_info->mfa == rcqm_info))
+ ccqm_info->mfa = rcqm_info->mfa;
+ pos_css = css_next_descendant_pre(pos_css, rcss);
+ }
+ rcu_read_unlock();
+}
+
+static int cqm_assign_rmid(struct perf_event *event, u32 *rmid)
+{
+#ifdef CONFIG_CGROUP_PERF
+ if (is_cgroup_event(event)) {
+ cqm_assign_hier_rmid(&event->cgrp->css, rmid);
+ }
+#endif
+ return 0;
+}
+
/*
* Find a group and setup RMID.
*
@@ -402,11 +462,14 @@ static int intel_cqm_setup_event(struct perf_event *event,
return 0;
}

+static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr);
+
static void intel_cqm_event_read(struct perf_event *event)
{
- unsigned long flags;
- u32 rmid;
- u64 val;
+ struct rmid_read rr = {
+ .evt_type = event->attr.config,
+ .value = ATOMIC64_INIT(0),
+ };

/*
* Task events are handled by intel_cqm_event_count().
@@ -414,26 +477,9 @@ static void intel_cqm_event_read(struct perf_event *event)
if (event->cpu == -1)
return;

- raw_spin_lock_irqsave(&cache_lock, flags);
- rmid = event->hw.cqm_rmid[pkg_id];
-
- if (!__rmid_valid(rmid))
- goto out;
-
- if (is_mbm_event(event->attr.config))
- val = rmid_read_mbm(rmid, event->attr.config);
- else
- val = __rmid_read(rmid);
-
- /*
- * Ignore this reading on error states and do not update the value.
- */
- if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
- goto out;
+ rr.rmid = ACCESS_ONCE(event->hw.cqm_rmid);

- local64_set(&event->count, val);
-out:
- raw_spin_unlock_irqrestore(&cache_lock, flags);
+ cqm_read_subtree(event, &rr);
}

static void __intel_cqm_event_count(void *info)
@@ -545,6 +591,55 @@ static void mbm_hrtimer_init(void)
}
}

+static void cqm_mask_call_local(struct rmid_read *rr)
+{
+ if (is_mbm_event(rr->evt_type))
+ __intel_mbm_event_count(rr);
+ else
+ __intel_cqm_event_count(rr);
+}
+
+static inline void
+ delta_local(struct perf_event *event, struct rmid_read *rr, u32 *rmid)
+{
+ atomic64_set(&rr->value, 0);
+ rr->rmid = ACCESS_ONCE(rmid);
+
+ cqm_mask_call_local(rr);
+ local64_add(atomic64_read(&rr->value), &event->count);
+}
+
+/*
+ * Since cgroup follows hierarchy, add the count of
+ * the descendents who were being monitored as well.
+ */
+static u64 cqm_read_subtree(struct perf_event *event, struct rmid_read *rr)
+{
+#ifdef CONFIG_CGROUP_PERF
+
+ struct cgroup_subsys_state *rcss, *pos_css;
+ struct cgrp_cqm_info *ccqm_info;
+
+ cqm_mask_call_local(rr);
+ local64_set(&event->count, atomic64_read(&(rr->value)));
+
+ if (is_task_event(event))
+ return __perf_event_count(event);
+
+ rcu_read_lock();
+ rcss = &event->cgrp->css;
+ css_for_each_descendant_pre(pos_css, rcss) {
+ ccqm_info = (css_to_cqm_info(pos_css));
+
+ /* Add the descendent 'monitored cgroup' counts */
+ if (pos_css != rcss && ccqm_info->mon_enabled)
+ delta_local(event, rr, ccqm_info->rmid);
+ }
+ rcu_read_unlock();
+#endif
+ return __perf_event_count(event);
+}
+
static u64 intel_cqm_event_count(struct perf_event *event)
{
struct rmid_read rr = {
@@ -603,7 +698,7 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
if (WARN_ON(!cqm_rmid))
return;

- if (cqm_rmid[pkg_id])
+ if (cqm_rmid == cqm_rootcginfo.rmid || cqm_rmid[pkg_id])
return;

raw_spin_lock_irqsave(&cache_lock, flags);
@@ -661,9 +756,11 @@ static int intel_cqm_event_add(struct perf_event *event, int mode)
__put_rmid(rmid[d], d);
}
kfree(event->hw.cqm_rmid);
+ cqm_assign_rmid(event, NULL);
list_del(&event->hw.cqm_groups_entry);
}
-static void intel_cqm_event_destroy(struct perf_event *event)
+
+static void intel_cqm_event_terminate(struct perf_event *event)
{
struct perf_event *group_other = NULL;
unsigned long flags;
@@ -917,6 +1014,7 @@ static int intel_cqm_event_init(struct perf_event *event)
.attr_groups = intel_cqm_attr_groups,
.task_ctx_nr = perf_sw_context,
.event_init = intel_cqm_event_init,
+ .event_terminate = intel_cqm_event_terminate,
.add = intel_cqm_event_add,
.del = intel_cqm_event_stop,
.start = intel_cqm_event_start,
@@ -924,12 +1022,67 @@ static int intel_cqm_event_init(struct perf_event *event)
.read = intel_cqm_event_read,
.count = intel_cqm_event_count,
};
+
#ifdef CONFIG_CGROUP_PERF
int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
struct cgroup_subsys_state *new_css)
-{}
+{
+ struct cgrp_cqm_info *cqm_info, *pcqm_info;
+ struct perf_cgroup *new_cgrp;
+
+ if (!parent_css) {
+ cqm_rootcginfo.level = 0;
+
+ cqm_rootcginfo.mon_enabled = true;
+ cqm_rootcginfo.cont_mon = true;
+ cqm_rootcginfo.mfa = NULL;
+ INIT_LIST_HEAD(&cqm_rootcginfo.tskmon_rlist);
+
+ if (new_css) {
+ new_cgrp = css_to_perf_cgroup(new_css);
+ new_cgrp->arch_info = &cqm_rootcginfo;
+ }
+ return 0;
+ }
+
+ mutex_lock(&cache_mutex);
+
+ new_cgrp = css_to_perf_cgroup(new_css);
+
+ cqm_info = kzalloc(sizeof(struct cgrp_cqm_info), GFP_KERNEL);
+ if (!cqm_info) {
+ mutex_unlock(&cache_mutex);
+ return -ENOMEM;
+ }
+
+ pcqm_info = (css_to_cqm_info(parent_css));
+ cqm_info->level = pcqm_info->level + 1;
+ cqm_info->rmid = pcqm_info->rmid;
+
+ cqm_info->cont_mon = false;
+ cqm_info->mon_enabled = false;
+ INIT_LIST_HEAD(&cqm_info->tskmon_rlist);
+ if (!pcqm_info->mfa)
+ cqm_info->mfa = pcqm_info;
+ else
+ cqm_info->mfa = pcqm_info->mfa;
+
+ new_cgrp->arch_info = cqm_info;
+ mutex_unlock(&cache_mutex);
+
+ return 0;
+}
+
void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
-{}
+{
+ struct perf_cgroup *cgrp = css_to_perf_cgroup(css);
+
+ mutex_lock(&cache_mutex);
+ kfree(cgrp_to_cqm_info(cgrp));
+ cgrp->arch_info = NULL;
+ mutex_unlock(&cache_mutex);
+}
+
void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
{}
int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
@@ -1053,6 +1206,12 @@ static int pkg_data_init_cpu(int cpu)
entry = __rmid_entry(0, curr_pkgid);
list_del(&entry->list);

+ cqm_rootcginfo.rmid = kzalloc(sizeof(u32) * cqm_socket_max, GFP_KERNEL);
+ if (!cqm_rootcginfo.rmid) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
return 0;
fail:
kfree(ccqm_rmid_ptrs);
diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index b31081b..e11ed5e 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -24,4 +24,68 @@ struct intel_pqr_state {

DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);

+/**
+ * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
+ * @cont_mon Continuous monitoring flag
+ * @mon_enabled Whether monitoring is enabled
+ * @level Level in the cgroup tree. Root is level 0.
+ * @rmid The rmids of the cgroup.
+ * @mfa 'Monitoring for ancestor' points to the cqm_info
+ * of the ancestor the cgroup is monitoring for. 'Monitoring for ancestor'
+ * means you will use an ancestors RMID at sched_in if you are
+ * not monitoring yourself.
+ *
+ * Due to the hierarchical nature of cgroups, every cgroup just
+ * monitors for the 'nearest monitored ancestor' at all times.
+ * Since root cgroup is always monitored, all descendents
+ * at boot time monitor for root and hence all mfa points to root except
+ * for root->mfa which is NULL.
+ * 1. RMID setup: When cgroup x start monitoring:
+ * for each descendent y, if y's mfa->level < x->level, then
+ * y->mfa = x. (Where level of root node = 0...)
+ * 2. sched_in: During sched_in for x
+ * if (x->mon_enabled) choose x->rmid
+ * else choose x->mfa->rmid.
+ * 3. read: for each descendent of cgroup x
+ * if (x->monitored) count += rmid_read(x->rmid).
+ * 4. evt_destroy: for each descendent y of x, if (y->mfa == x) then
+ * y->mfa = x->mfa. Meaning if any descendent was monitoring for x,
+ * set that descendent to monitor for the cgroup which x was monitoring for.
+ *
+ * @tskmon_rlist List of tasks being monitored in the cgroup
+ * When a task which belongs to a cgroup x is being monitored, it always uses
+ * its own task->rmid even if cgroup x is monitored during sched_in.
+ * To account for the counts of such tasks, cgroup keeps this list
+ * and parses it during read.
+ *
+ * Perf handles hierarchy for other events, but because RMIDs are per pkg
+ * this is handled here.
+*/
+struct cgrp_cqm_info {
+ bool cont_mon;
+ bool mon_enabled;
+ int level;
+ u32 *rmid;
+ struct cgrp_cqm_info *mfa;
+ struct list_head tskmon_rlist;
+};
+
+struct tsk_rmid_entry {
+ u32 *rmid;
+ struct list_head list;
+};
+
+#ifdef CONFIG_CGROUP_PERF
+
+# define css_to_perf_cgroup(css_) container_of(css_, struct perf_cgroup, css)
+# define cgrp_to_cqm_info(cgrp_) ((struct cgrp_cqm_info *)cgrp_->arch_info)
+# define css_to_cqm_info(css_) cgrp_to_cqm_info(css_to_perf_cgroup(css_))
+
+#else
+
+# define css_to_perf_cgroup(css_) NULL
+# define cgrp_to_cqm_info(cgrp_) NULL
+# define css_to_cqm_info(css_) NULL
+
+#endif
#endif /* _ASM_X86_INTEL_RDT_COMMON_H */
--
1.9.1

2016-12-16 23:16:16

by Shivappa Vikas

[permalink] [raw]
Subject: [PATCH 05/14] x86/cqm,perf/core: Cgroup support prepare

From: David Carrillo-Cisneros <[email protected]>

cgroup hierarchy monitoring is not currently supported. This patch
builds all the necessary data structures and cgroup APIs (alloc, free,
etc.) and the necessary quirks for supporting cgroup hierarchy
monitoring in later patches.

- Introduce an architecture-specific data structure arch_info in
perf_cgroup to keep track of RMIDs and cgroup hierarchical monitoring.
- perf sched_in calls all the cgroup ancestors when a cgroup is
scheduled in. This will not work with cqm as we have a common per-pkg
RMID associated with one task and hence cannot write different RMIDs
into the MSR for each event. The cqm driver sets the flag
PERF_EV_CAP_CGROUP_NO_RECURSION, which tells perf not to call all
ancestor cgroups for each event and to let the driver handle the
hierarchy monitoring for the cgroup.
- Introduce event_terminate, since event_destroy is called after the
cgrp is disassociated from the event; this is needed to support RMID
handling of the cgroup and lets cqm clean up its cgroup-specific
arch_info.
- Add the cgroup APIs for alloc, free, attach and can_attach.

The above framework will be used to build different cgroup features in
later patches.

Tests: Same as before. Cgroup monitoring still doesn't work, but this is
the prep work to get it working.

Patch modified/refactored by Vikas Shivappa
<[email protected]> to support recycling removal.

Signed-off-by: Vikas Shivappa <[email protected]>
---
arch/x86/events/intel/cqm.c | 19 ++++++++++++++++++-
arch/x86/include/asm/perf_event.h | 32 ++++++++++++++++++++++++++++++++
include/linux/perf_event.h | 36 ++++++++++++++++++++++++++++++++++++
kernel/events/core.c | 29 ++++++++++++++++++++++++++++-
4 files changed, 114 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index a0719af..536c8ad 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
INIT_LIST_HEAD(&event->hw.cqm_group_entry);
INIT_LIST_HEAD(&event->hw.cqm_groups_entry);

- event->destroy = intel_cqm_event_destroy;
+ /*
+ * The CQM driver handles cgroup recursion itself. Since only one
+ * RMID can be programmed at a time on each core, it is incompatible
+ * with the way generic code handles cgroup hierarchies.
+ */
+ event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;

mutex_lock(&cache_mutex);

@@ -918,6 +924,17 @@ static int intel_cqm_event_init(struct perf_event *event)
.read = intel_cqm_event_read,
.count = intel_cqm_event_count,
};
+#ifdef CONFIG_CGROUP_PERF
+int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
+ struct cgroup_subsys_state *new_css)
+{}
+void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
+{}
+void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
+{}
+int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
+{}
+#endif

static inline void cqm_pick_event_reader(int cpu)
{
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f353061..43c8e5e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -299,4 +299,36 @@ static inline void perf_check_microcode(void) { }

#define arch_perf_out_copy_user copy_from_user_nmi

+/*
+ * Hooks for architecture specific features of perf_event cgroup.
+ * Currently used by Intel's CQM.
+ */
+#ifdef CONFIG_INTEL_RDT_M
+#ifdef CONFIG_CGROUP_PERF
+
+#define perf_cgroup_arch_css_alloc perf_cgroup_arch_css_alloc
+
+int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
+ struct cgroup_subsys_state *new_css);
+
+#define perf_cgroup_arch_css_free perf_cgroup_arch_css_free
+
+void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
+
+#define perf_cgroup_arch_attach perf_cgroup_arch_attach
+
+void perf_cgroup_arch_attach(struct cgroup_taskset *tset);
+
+#define perf_cgroup_arch_can_attach perf_cgroup_arch_can_attach
+
+int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset);
+
+extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS \
+ .dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
+ .legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
+#endif
+
+#endif
#endif /* _ASM_X86_PERF_EVENT_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a8f4749..abeacb5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -300,6 +300,12 @@ struct pmu {
int (*event_init) (struct perf_event *event);

/*
+ * Terminate the event for this PMU. Optional complement for a
+ * successful event_init. Called before the event fields are torn down.
+ */
+ void (*event_terminate) (struct perf_event *event);
+
+ /*
* Notification that the event was mapped or unmapped. Called
* in the context of the mapping task.
*/
@@ -516,9 +522,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
* PERF_EV_CAP_SOFTWARE: Is a software event.
* PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
* from any CPU in the package where it is active.
+ * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
+ * cgroup scoping. It does not need to be enabled for all of its
+ * descendant cgroups.
*/
#define PERF_EV_CAP_SOFTWARE BIT(0)
#define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
+#define PERF_EV_CAP_CGROUP_NO_RECURSION BIT(2)

#define SWEVENT_HLIST_BITS 8
#define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
@@ -823,6 +833,8 @@ struct perf_cgroup_info {
};

struct perf_cgroup {
+ /* Architecture specific information. */
+ void *arch_info;
struct cgroup_subsys_state css;
struct perf_cgroup_info __percpu *info;
};
@@ -844,6 +856,7 @@ struct perf_cgroup {

#ifdef CONFIG_PERF_EVENTS

+extern int is_cgroup_event(struct perf_event *event);
extern void *perf_aux_output_begin(struct perf_output_handle *handle,
struct perf_event *event);
extern void perf_aux_output_end(struct perf_output_handle *handle,
@@ -1387,4 +1400,27 @@ ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
#define perf_event_exit_cpu NULL
#endif

+/*
+ * Hooks for architecture specific extensions for perf_cgroup.
+ */
+#ifndef perf_cgroup_arch_css_alloc
+#define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_free
+#define perf_cgroup_arch_css_free(css) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_attach
+#define perf_cgroup_arch_attach(tskset) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_can_attach
+#define perf_cgroup_arch_can_attach(tskset) 0
+#endif
+
+#ifndef PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#endif
+
#endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0e29213..a290c53 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -590,6 +590,9 @@ static inline u64 perf_event_clock(struct perf_event *event)
if (!cpuctx->cgrp)
return false;

+ if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
+ return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
+
/*
* Cgroup scoping is recursive. An event enabled for a cgroup is
* also enabled for all its descendant cgroups. If @cpuctx's
@@ -606,7 +609,7 @@ static inline void perf_detach_cgroup(struct perf_event *event)
event->cgrp = NULL;
}

-static inline int is_cgroup_event(struct perf_event *event)
+int is_cgroup_event(struct perf_event *event)
{
return event->cgrp != NULL;
}
@@ -4011,6 +4014,9 @@ static void _free_event(struct perf_event *event)
mutex_unlock(&event->mmap_mutex);
}

+ if (event->pmu->event_terminate)
+ event->pmu->event_terminate(event);
+
if (is_cgroup_event(event))
perf_detach_cgroup(event);

@@ -9236,6 +9242,8 @@ static void account_event(struct perf_event *event)
exclusive_event_destroy(event);

err_pmu:
+ if (event->pmu->event_terminate)
+ event->pmu->event_terminate(event);
if (event->destroy)
event->destroy(event);
module_put(pmu->module);
@@ -10738,6 +10746,7 @@ static int __init perf_event_sysfs_init(void)
perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
struct perf_cgroup *jc;
+ int ret;

jc = kzalloc(sizeof(*jc), GFP_KERNEL);
if (!jc)
@@ -10749,6 +10758,12 @@ static int __init perf_event_sysfs_init(void)
return ERR_PTR(-ENOMEM);
}

+ jc->arch_info = NULL;
+
+ ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
+ if (ret)
+ return ERR_PTR(ret);
+
return &jc->css;
}

@@ -10756,6 +10771,8 @@ static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
{
struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);

+ perf_cgroup_arch_css_free(css);
+
free_percpu(jc->info);
kfree(jc);
}
@@ -10776,11 +10793,21 @@ static void perf_cgroup_attach(struct cgroup_taskset *tset)

cgroup_taskset_for_each(task, css, tset)
task_function_call(task, __perf_cgroup_move, task);
+
+ perf_cgroup_arch_attach(tset);
+}
+
+static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+ return perf_cgroup_arch_can_attach(tset);
}

+
struct cgroup_subsys perf_event_cgrp_subsys = {
.css_alloc = perf_cgroup_css_alloc,
.css_free = perf_cgroup_css_free,
+ .can_attach = perf_cgroup_can_attach,
.attach = perf_cgroup_attach,
+ PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
};
#endif /* CONFIG_CGROUP_PERF */
--
1.9.1

2016-12-23 11:58:23

by David Carrillo-Cisneros

[permalink] [raw]
Subject: Re: [PATCH 11/14] x86/cqm: Add failure on open and read

On Fri, Dec 16, 2016 at 3:13 PM, Vikas Shivappa
<[email protected]> wrote:
> To provide reliable output to the user, cqm throws error when it does
> not have enough RMIDs to monitor depending upon the mode user choses.
> This also takes care to not overuse RMIDs. Default is LAZY mode.
>
> NOLAZY mode: This patch adds a file mon_mask in the perf_cgroup which
> indicates the packages which the user wants guaranteed monitoring. For
> such cgroup events RMIDs are assigned at event create and we fail if
> enough RMIDs are not present. This is basically a NOLAZY allocation of
> RMIDs. This mode can be used in real time scenarios where user is sure
> that tasks that are monitored are scheduled.
>
> LAZY mode: If user did not enable the NOLAZY mode, RMIDs are allocated
> only when tasks are actually scheduled. Upon failure to obtain RMIDs it
> indicates a failure in read. Typical use case for this mode could be to
> start monitoring cgroups which still donot have any tasks in them and
> such cgroups are part of large number of cgroups which are monitored -
> that way we donot overuse RMIDs.
>

The proposed interface is:
- a global boolean cqm_cont_monitoring.
- a per-package boolean in the bitfield cqm_mon_mask.

So, for each package there will be four states, yet one of them is
not meaningful:

cont_monitoring, cqm_mon_mask[p]: meaning
------------------------------------------
0, 0 : off
0, 1 : off but reserve a RMID that is not going to be used?
1, 0 : on with NOLAZY
1, 1 : on with LAZY

the case 0,1 is problematic.

How can new cases be added in the future? Another file? What's wrong
with having a
pkg0_flags;pkg1_flags;...;pkgn_flags
cont_monitoring file, which is more akin to the RDT Allocation format?
(There is a parser function and implementation for that format in v3
of my CMT series.)
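
To make the format concrete, below is a small userspace-style sketch of
parsing such a semicolon-separated per-package flags string. The
function name and the flag encoding are assumptions for illustration;
this is not the v3 parser itself.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "pkg0_flags;pkg1_flags;...;pkgn_flags" into one flag word per
 * package. Illustrative only; the flag meaning is left open.
 */
static int parse_pkg_flags(const char *buf, unsigned long *flags, int npkgs)
{
	char *dup = strdup(buf);
	char *save, *tok;
	int pkg = 0;

	if (!dup)
		return -1;
	for (tok = strtok_r(dup, ";", &save); tok && pkg < npkgs;
	     tok = strtok_r(NULL, ";", &save))
		flags[pkg++] = strtoul(tok, NULL, 0);
	free(dup);
	return pkg == npkgs ? 0 : -1;	/* expect one entry per package */
}

int main(void)
{
	unsigned long flags[4];

	if (!parse_pkg_flags("0;1;0;0", flags, 4))
		printf("package 1 flags: %lu\n", flags[1]);
	return 0;
}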



Below is a full discussion about how many per-package configuration states are
useful now and if/when RMID rotation is added.


There are two types of error sources introduced by not having a RMID
when one is needed:
- E_read : Introduced into the measurement when stealing a
RMID with non-zero occupancy.
- E_sched: Introduced when a thread runs but no RMID is available for it.

A user may have two tolerance levels to errors that determine if an
event can be read
or read should fail:
- NoTol : No tolerance to error at all. If there has been any type
of E_read or
E_sched in the past, read must give an error.
- SomeTol: Tolerate _some_ error. It can be defined in terms of
time, magnitude or both. As an example, in v3 of my CMT patches, I
assumed a user would tolerate an error that occurred more than an
arbitrarily chosen time in the past. The minimum criterion is that
there should at least be an RMID at the time of read.

The driver can follow two types of RMID allocation policies:
- NoLazy: reserve RMID as soon as user starts monitoring (when event
is created or
cont_monitoring is set). This policy introduces no error.
- Lazy: reserve RMID first time a task is sched in. May introduce E_sched
if no RMID available on sched in.

and three RMID deallocation policies:
- Fixed: RMID can never be stolen. This policy introduces no error
into the measurement.
- Reuse: RMID can be stolen when no scheduled thread is using it, even
if it has non-zero occupancy. This policy may introduce E_sched when no
RMID is available on sched_in after an incidence of reuse.
- Steal: RMID can be stolen any time. This policy introduces both E_sched and
E_read errors into the measurement (this is the so-called RMID rotation).

Therefore there are three possible risk levels:
- No Risk: possible with NoLazy & Fixed
- Risk of E_sched: possible with either NoLazy & Reuse or Lazy &
Fixed or Lazy & Reuse
- Risk of E_sched and E_read: possible with NoLazy & Steal or Lazy & Steal
Notes:
a) E_read only is impossible.
b) In "No Risk" an RMID must be allocated in advance and never
released, even if unused (e.g. a task may run only in one package
but we allocate an RMID in all of them).
c) For the E_sched risk, Lazy & Reuse give the highest RMID flexibility.
d) For the E_read and E_sched risk, NoLazy & Steal give the highest
RMID flexibility.


Combining all three criteria, the possible configuration modes that
make sense are:
1) No monitoring.
2) NoLazy & Fixed & NoTol. RMID is allocated when event is created
(or cont_monitoring is set).
No possible error. May waste RMIDs.
3) Lazy & Reusable & NoTol. RMIDs are allocated as needed and taken away
when unused. May fail to find an RMID if there is RMID contention; once
it fails, the event/cgroup must be in an error state.
4) Lazy & Reusable & SomeTol. Similar to (3), but the event/cgroup
recovers from the error state if a recovered RMID stays valid for long
enough.
5 and 6) Lazy allocation & Stealable, with and without Tol. The RMID can
be stolen even if non-empty or in use.

Q. Which modes are useful?

Stephane and I see a clear use for (2). Users of cont_monitoring look
to avoid error and may tolerate wasted RMIDs. It has the advantage
that it allows failing at event creation (or when cont_monitoring is
set). This is the same mode introduced with NOLAZY in cqm_mon_mask in
this patch.

Mode (3) can be viewed as an optimistic approach to RMID allocation
that allows more concurrent users than (2) when cache occupancy drops
quickly and/or tasks/cgroups manifest strong package locality. It
still guarantees exact measurements (within hw constraints) when a
read succeeds.

Mode (4) is more useful than 3 _if_ it can be assumed that the system
will replace enough cache lines
before the tolerance time expires (otherwise it reads just garbage).
Yet, it's not clear to me how often this assumption is valid.

Modes (5) and (6) require RMID rotation, so they wouldn't be part of
this patch series.


> +static ssize_t cqm_mon_mask_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + cpumask_var_t tmp_cpus, tmp_cpus1;
> + struct cgrp_cqm_info *cqm_info;
> + unsigned long flags;
> + int ret = 0;
> +
> + buf = strstrip(buf);
> +
> + if (!zalloc_cpumask_var(&tmp_cpus, GFP_KERNEL) ||
> + !zalloc_cpumask_var(&tmp_cpus1, GFP_KERNEL)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = cpulist_parse(buf, tmp_cpus);
> + if (ret)
> + goto out;
> +
> + if (cpumask_andnot(tmp_cpus1, tmp_cpus, &cqm_pkgmask)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + raw_spin_lock_irqsave(&cache_lock, flags);
> + cqm_info = css_to_cqm_info(of_css(of));
> + cpumask_copy(&cqm_info->mon_mask, tmp_cpus);
> + raw_spin_unlock_irqrestore(&cache_lock, flags);

So this only copies the mask so that it can be used for the next
cgroup event in intel_cqm_setup_event?
That defeats the purpose of a NON_LAZY cont_monitoring.

There is no need to create a new cgroup file only to provide a non-lazy event;
such a flag could be passed in perf_event_attr::pinned or a config field.

2016-12-23 12:32:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
> +Continuous monitoring
> +---------------------
> +A new file cont_monitoring is added to perf_cgroup which helps to enable
> +cqm continuous monitoring. Enabling this field would start monitoring of
> +the cgroup without perf being launched. This can be used for long term
> +light weight monitoring of tasks/cgroups.
> +
> +To enable continuous monitoring of cgroup p1.
> +#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> +
> +To disable continuous monitoring of cgroup p1.
> +#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> +
> +To read the counters at the end of monitoring perf can be used.
> +
> +LAZY and NOLAZY Monitoring
> +--------------------------
> +LAZY:
> +By default when monitoring is enabled, the RMIDs are not allocated
> +immediately and allocated lazily only at the first sched_in.
> +There are 2-4 RMIDs per logical processor on each package. So if a dual
> +package has 48 logical processors, there would be upto 192 RMIDs on each
> +package = total of 192x2 RMIDs.
> +There is a possibility that RMIDs can runout and in that case the read
> +reports an error since there was no RMID available to monitor for an
> +event.
> +
> +NOLAZY:
> +When user wants guaranteed monitoring, he can enable the 'monitoring
> +mask' which is basically used to specify the packages he wants to
> +monitor. The RMIDs are statically allocated at open and failure is
> +indicated if RMIDs are not available.
> +
> +To specify monitoring on package 0 and package 1:
> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
> +
> +An error is thrown if packages not online are specified.

I very much dislike both those for adding files to the perf cgroup.
Drivers should really not do that.

I absolutely hate the second because events already have affinity.

I can't see this happening.

2016-12-23 19:35:06

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation


Hello Peterz,

On Fri, 23 Dec 2016, Peter Zijlstra wrote:

> On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
>> +Continuous monitoring
>> +---------------------
>> +A new file cont_monitoring is added to perf_cgroup which helps to enable
>> +cqm continuous monitoring. Enabling this field would start monitoring of
>> +the cgroup without perf being launched. This can be used for long term
>> +light weight monitoring of tasks/cgroups.
>> +
>> +To enable continuous monitoring of cgroup p1.
>> +#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>> +
>> +To disable continuous monitoring of cgroup p1.
>> +#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>> +
>> +To read the counters at the end of monitoring perf can be used.
>> +
>> +LAZY and NOLAZY Monitoring
>> +--------------------------
>> +LAZY:
>> +By default when monitoring is enabled, the RMIDs are not allocated
>> +immediately and allocated lazily only at the first sched_in.
>> +There are 2-4 RMIDs per logical processor on each package. So if a dual
>> +package has 48 logical processors, there would be upto 192 RMIDs on each
>> +package = total of 192x2 RMIDs.
>> +There is a possibility that RMIDs can runout and in that case the read
>> +reports an error since there was no RMID available to monitor for an
>> +event.
>> +
>> +NOLAZY:
>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>> +mask' which is basically used to specify the packages he wants to
>> +monitor. The RMIDs are statically allocated at open and failure is
>> +indicated if RMIDs are not available.
>> +
>> +To specify monitoring on package 0 and package 1:
>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>> +
>> +An error is thrown if packages not online are specified.
>
> I very much dislike both those for adding files to the perf cgroup.
> Drivers should really not do that.

Is the continuous monitoring the issue or the interface (adding a file in
perf_cgroup)? I have not mentioned it in the documentation, but the continuous
monitoring / monitoring mask applies only to cgroups in this patch, and hence we
thought a good place for it is in the cgroup itself, because it's per cgroup.

For task events this won't apply, and we are thinking of providing a prctl-based
interface for the user to toggle continuous monitoring.

>
> I absolutely hate the second because events already have affinity.

This applies to continuous monitoring as well, when there are no events
associated. Meaning, if the monitoring mask is chosen and the user tries to
enable continuous monitoring using cgrp->cont_mon, all RMIDs are allocated
immediately. The mon_mask provides a way for the user to have guaranteed RMIDs
both for cgroups that have events and for continuous monitoring (no perf event
associated), assuming the user uses it when he knows he would definitely use
it; otherwise there is LAZY mode.

Again, this is cgroup specific, won't apply to task events, and is needed when
there are no events associated.

Thanks,
Vikas

>
> I can't see this happening.
>






2016-12-23 20:50:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

On Fri, Dec 23, 2016 at 11:35:03AM -0800, Shivappa Vikas wrote:
>
> Hello Peterz,
>
> On Fri, 23 Dec 2016, Peter Zijlstra wrote:
>
> >On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
> >>+Continuous monitoring
> >>+---------------------
> >>+A new file cont_monitoring is added to perf_cgroup which helps to enable
> >>+cqm continuous monitoring. Enabling this field would start monitoring of
> >>+the cgroup without perf being launched. This can be used for long term
> >>+light weight monitoring of tasks/cgroups.
> >>+
> >>+To enable continuous monitoring of cgroup p1.
> >>+#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> >>+
> >>+To disable continuous monitoring of cgroup p1.
> >>+#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
> >>+
> >>+To read the counters at the end of monitoring perf can be used.
> >>+
> >>+LAZY and NOLAZY Monitoring
> >>+--------------------------
> >>+LAZY:
> >>+By default when monitoring is enabled, the RMIDs are not allocated
> >>+immediately and allocated lazily only at the first sched_in.
> >>+There are 2-4 RMIDs per logical processor on each package. So if a dual
> >>+package has 48 logical processors, there would be upto 192 RMIDs on each
> >>+package = total of 192x2 RMIDs.
> >>+There is a possibility that RMIDs can runout and in that case the read
> >>+reports an error since there was no RMID available to monitor for an
> >>+event.
> >>+
> >>+NOLAZY:
> >>+When user wants guaranteed monitoring, he can enable the 'monitoring
> >>+mask' which is basically used to specify the packages he wants to
> >>+monitor. The RMIDs are statically allocated at open and failure is
> >>+indicated if RMIDs are not available.
> >>+
> >>+To specify monitoring on package 0 and package 1:
> >>+#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
> >>+
> >>+An error is thrown if packages not online are specified.
> >
> >I very much dislike both those for adding files to the perf cgroup.
> >Drivers should really not do that.
>
> Is the continuous monitoring the issue or the interface (adding a file in
> perf_cgroup) ? I have not mentioned in the documentaion but this continuous
> monitoring/ monitoring mask applies only to cgroup in this patch and hence
> we thought a good place for that is in the cgroup itself because its per
> cgroup.
>
> For task events , this wont apply and we are thinking of providing a prctl
> based interface for user to toggle the continous monitoring ..

More fail..

> >
> >I absolutely hate the second because events already have affinity.
>
> This applies to continuous monitoring as well when there are no events
> associated. Meaning if the monitoring mask is chosen and user tries to
> enable continuous monitoring using the cgrp->cont_mon - all RMIDs are
> allocated immediately. the mon_mask provides a way for the user to have
> guarenteed RMIDs for both that have events and for continoous monitoring(no
> perf event associated) (assuming user uses it when user knows he would
> definitely use it.. or else there is LAZY mode)
>
> Again this is cgroup specific and wont apply to task events and is needed
> when there are no events associated.

So no, the problem is that a driver introduces special ABI and behaviour
that radically departs from the regular behaviour.

Also, the 'whoops you ran out of RMIDs, please reboot' thing totally and
completely blows.

2016-12-23 21:41:55

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation



On Fri, 23 Dec 2016, Peter Zijlstra wrote:
>
> Also, the 'whoops you ran out of RMIDs, please reboot' thing totally and
> completely blows.

Well, this is really a hardware limitation. The user cannot monitor more events
on a package than the # of RMIDs at the *same time*. Patch 10/14 reuses the
RMIDs that are not monitored anymore. The user can monitor more events once he
stops monitoring some.

So we throw an error at read (LAZY mode) or open (NOLAZY mode).


>
>

2016-12-25 01:51:15

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation



On Fri, 23 Dec 2016, Peter Zijlstra wrote:

> On Fri, Dec 23, 2016 at 11:35:03AM -0800, Shivappa Vikas wrote:
>>
>> Hello Peterz,
>>
>> On Fri, 23 Dec 2016, Peter Zijlstra wrote:
>>
>>> On Fri, Dec 16, 2016 at 03:12:55PM -0800, Vikas Shivappa wrote:
>>>> +Continuous monitoring
>>>> +---------------------
>>>> +A new file cont_monitoring is added to perf_cgroup which helps to enable
>>>> +cqm continuous monitoring. Enabling this field would start monitoring of
>>>> +the cgroup without perf being launched. This can be used for long term
>>>> +light weight monitoring of tasks/cgroups.
>>>> +
>>>> +To enable continuous monitoring of cgroup p1.
>>>> +#echo 1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>>>> +
>>>> +To disable continuous monitoring of cgroup p1.
>>>> +#echo 0 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_cont_monitoring
>>>> +
>>>> +To read the counters at the end of monitoring perf can be used.
>>>> +
>>>> +LAZY and NOLAZY Monitoring
>>>> +--------------------------
>>>> +LAZY:
>>>> +By default when monitoring is enabled, the RMIDs are not allocated
>>>> +immediately and allocated lazily only at the first sched_in.
>>>> +There are 2-4 RMIDs per logical processor on each package. So if a dual
>>>> +package has 48 logical processors, there would be upto 192 RMIDs on each
>>>> +package = total of 192x2 RMIDs.
>>>> +There is a possibility that RMIDs can runout and in that case the read
>>>> +reports an error since there was no RMID available to monitor for an
>>>> +event.
>>>> +
>>>> +NOLAZY:
>>>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>>>> +mask' which is basically used to specify the packages he wants to
>>>> +monitor. The RMIDs are statically allocated at open and failure is
>>>> +indicated if RMIDs are not available.
>>>> +
>>>> +To specify monitoring on package 0 and package 1:
>>>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>>>> +
>>>> +An error is thrown if packages not online are specified.
>>>
>>> I very much dislike both those for adding files to the perf cgroup.
>>> Drivers should really not do that.
>>
>> Is the continuous monitoring the issue or the interface (adding a file in
>> perf_cgroup) ? I have not mentioned in the documentaion but this continuous
>> monitoring/ monitoring mask applies only to cgroup in this patch and hence
>> we thought a good place for that is in the cgroup itself because its per
>> cgroup.
>>
>> For task events , this wont apply and we are thinking of providing a prctl
>> based interface for user to toggle the continous monitoring ..
>
> More fail..
>
>>>
>>> I absolutely hate the second because events already have affinity.
>>
>> This applies to continuous monitoring as well when there are no events
>> associated. Meaning if the monitoring mask is chosen and user tries to
>> enable continuous monitoring using the cgrp->cont_mon - all RMIDs are
>> allocated immediately. the mon_mask provides a way for the user to have
>> guarenteed RMIDs for both that have events and for continoous monitoring(no
>> perf event associated) (assuming user uses it when user knows he would
>> definitely use it.. or else there is LAZY mode)
>>
>> Again this is cgroup specific and wont apply to task events and is needed
>> when there are no events associated.
>
> So no, the problem is that a driver introduces special ABI and behaviour
> that radically departs from the regular behaviour.

Ok, looks like the interface is the problem. Will try to fix this. We are just
trying to have a lightweight monitoring option so that it's reasonable to
monitor for a very long time (like the lifetime of a process, etc.), mainly to
not have all the perf scheduling overhead.
Maybe a perf_event_attr option is a more reasonable approach for the user to
choose the option (rather than some new interface like a prctl / cgroup file)?

Thanks,
Vikas

2016-12-27 07:13:17

by David Carrillo-Cisneros

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

>>>>> +LAZY and NOLAZY Monitoring
>>>>> +--------------------------
>>>>> +LAZY:
>>>>> +By default when monitoring is enabled, the RMIDs are not allocated
>>>>> +immediately and allocated lazily only at the first sched_in.
>>>>> +There are 2-4 RMIDs per logical processor on each package. So if a
>>>>> dual
>>>>> +package has 48 logical processors, there would be upto 192 RMIDs on
>>>>> each
>>>>> +package = total of 192x2 RMIDs.
>>>>> +There is a possibility that RMIDs can runout and in that case the read
>>>>> +reports an error since there was no RMID available to monitor for an
>>>>> +event.
>>>>> +
>>>>> +NOLAZY:
>>>>> +When user wants guaranteed monitoring, he can enable the 'monitoring
>>>>> +mask' which is basically used to specify the packages he wants to
>>>>> +monitor. The RMIDs are statically allocated at open and failure is
>>>>> +indicated if RMIDs are not available.
>>>>> +
>>>>> +To specify monitoring on package 0 and package 1:
>>>>> +#echo 0-1 > /sys/fs/cgroup/perf_event/p1/perf_event.cqm_mon_mask
>>>>> +
>>>>> +An error is thrown if packages not online are specified.
>>>>
>>>>
>>>> I very much dislike both those for adding files to the perf cgroup.
>>>> Drivers should really not do that.
>>>
>>>
>>> Is the continuous monitoring the issue or the interface (adding a file in
>>> perf_cgroup) ? I have not mentioned in the documentaion but this
>>> continuous
>>> monitoring/ monitoring mask applies only to cgroup in this patch and
>>> hence
>>> we thought a good place for that is in the cgroup itself because its per
>>> cgroup.
>>>
>>> For task events , this wont apply and we are thinking of providing a
>>> prctl
>>> based interface for user to toggle the continous monitoring ..
>>
>>
>> More fail..
>>
>>>>
>>>> I absolutely hate the second because events already have affinity.
>>>

The per-package NOLAZY flags are distinct from affinity. They modify
the behavior of something already running on that package. Besides
that, this is intended to work when there are no perf_events, and
perf_event's cpu field is already used in cgroup events.

>>>
>>> This applies to continuous monitoring as well when there are no events
>>> associated. Meaning if the monitoring mask is chosen and user tries to
>>> enable continuous monitoring using the cgrp->cont_mon - all RMIDs are
>>> allocated immediately. the mon_mask provides a way for the user to have
>>> guarenteed RMIDs for both that have events and for continoous
>>> monitoring(no
>>> perf event associated) (assuming user uses it when user knows he would
>>> definitely use it.. or else there is LAZY mode)
>>>
>>> Again this is cgroup specific and wont apply to task events and is needed
>>> when there are no events associated.
>>
>>
>> So no, the problem is that a driver introduces special ABI and behaviour
>> that radically departs from the regular behaviour.
>
>
> Ok , looks like the interface is the problem. Will try to fix this. We are
> just trying to have a light weight monitoring
> option so that its reasonable to monitor for a
> very long time (like lifetime of process etc). Mainly to not have all the
> perf scheduling overhead.
> May be a perf event attr option is a more reasonable approach for the user
> to choose the option ? (rather than some new interface like prctl / cgroup
> file..)

I don't see how a perf event attr option would work, since the goal of
continuous monitoring is to start CQM/CMT without a perf event.

An alternative is to add a single file to the cqm pmu directory. The
file contains which cgroups must be continuously monitored (optionally
with per-package flags):

$ cat /sys/devices/intel_cmt/cgroup_cont_monitoring
cgroup per-pkg flags
/ 0;1;0;0
g1 0;0;0;0
g1/g1_1 0:0;0;0
g2 0:1;0;0;0

to start continuous monitoring in a cgroup (flags optional, default to all 0's):
$ echo "g2/g2_1 0;1;0;0" > /sys/devices/intel_cmt/cgroup_cont_monitoring
to stop it:
$ echo "-g2/g2_1"

Note that the cgroup name is what perf_event_attr takes now, so it's
not that different from creating a perf event.


Another option is to create a directory per cgroup to monitor, so:
$ mkdir /sys/devices/intel_cmt/cgroup_cont_monitoring/g1
starts continuous monitoring in g1.

This approach is problematic, though, because the cont_monitoring
property is not hierarchical, i.e. a cgroup g1/g1_1 may need
cont_monitoring while g1 doesn't. Supporting this would require either
doing something funny with the cgroup name or adding extra files to
each folder and exposing all cgroups. None of these options seem good to
me.

So, my money is on a single file
"/sys/devices/intel_cmt/cgroup_cont_monitoring". Thoughts?

Thanks,
David

2016-12-27 20:00:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

Shivappa Vikas <[email protected]> writes:
>
> Ok , looks like the interface is the problem. Will try to fix
> this. We are just trying to have a light weight monitoring
> option so that its reasonable to monitor for a
> very long time (like lifetime of process etc). Mainly to not have all
> the perf scheduling overhead.

That seems like an odd reason to define a completely new user interface.
Is this to avoid one MSR write for an RMID change per context switch
into/out of a cgroup, or is it other code too?

Is there some number you can put to the overhead?
Or is there some other overhead other than the MSR write
you're concerned about?

Do you have an ftrace or better PT trace with the overhead before-after?

Perhaps some optimization could be done in the code to make it faster,
then the new interface wouldn't be needed.

FWIW there are some pending changes to context switch that will
eliminate at least one common MSR write [1]. If that was fixed
you could do the RMID MSR write "for free"

-Andi

[1] https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase

2016-12-27 20:21:49

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation



On Tue, 27 Dec 2016, Andi Kleen wrote:

> Shivappa Vikas <[email protected]> writes:
>>
>> Ok , looks like the interface is the problem. Will try to fix
>> this. We are just trying to have a light weight monitoring
>> option so that its reasonable to monitor for a
>> very long time (like lifetime of process etc). Mainly to not have all
>> the perf scheduling overhead.
>
> That seems like an odd reason to define a completely new user interface.
> This is to avoid one MSR write for a RMID change per context switch
> in/out cgroup or is it other code too?
>
> Is there some number you can put to the overhead?
> Or is there some other overhead other than the MSR write
> you're concerned about?

Yes, it seems like the interface of having a file is odd, as Peterz also thinks.

It's really the perf overhead we are trying to avoid.

The MSR writes (the driver/cqm overhead, not really perf) we try to optimize by
having a per-cpu cache, grouping the RMIDs, having a common write for
RMID/CLOSid, etc.

The perf overhead I was thinking of, at least, is during the context switch,
which is the more constant overhead (the event creation is just one time).

I was trying to see an alternative where:
1. The user specifies continuous monitoring with a perf attr at open.
2. The driver allocates the task/cgroup RMID and stores the RMID in the cgroup
or task_struct.
3. The driver turns off the event (hence no perf ctx switch overhead? All the
perf hook calls for start/stop/add are not needed - I was still finding out if
this route works, basically whether, if I turn off the event, there is minimal
overhead for the event and no start/stop/add calls for it).
4. But during switch_to the driver writes the RMID MSR, so we still monitor.
5. read -> calls the driver -> the driver just returns the count by reading the
RMID.
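
A rough sketch of what step 4 could look like at context switch is below. The
cont_mon_rmid field is purely hypothetical, the helpers are the ones from this
series, and this is only meant to illustrate the idea, not a proposed patch:

/* Assumes the RMID was stored by the driver at event open (steps 1-2). */
static void cont_mon_sched_in(struct task_struct *next)
{
	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
	struct cgrp_cqm_info *cqm_info = cqminfo_from_tsk(next);
	u32 rmid = next->cont_mon_rmid;		/* hypothetical field */

	if (!rmid && cqm_info)
		rmid = cqm_info->rmid[pkg_id];	/* per-package cgroup RMID */

	/* Step 4: write PQR_ASSOC only when the RMID actually changes. */
	if (rmid != state->rmid) {
		state->rmid = rmid;
		wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
	}
}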

>
> Do you have an ftrace or better PT trace with the overhead before-after?
>
> Perhaps some optimization could be done in the code to make it faster,
> then the new interface wouldn't be needed.
>
> FWIW there are some pending changes to context switch that will
> eliminate at least one common MSR write [1]. If that was fixed
> you could do the RMID MSR write "for free"

I see, thats good to know..

Thanks,
Vikas

>
> -Andi
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/fsgsbase
>
>

2016-12-27 21:34:42

by David Carrillo-Cisneros

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

On Tue, Dec 27, 2016 at 12:00 PM, Andi Kleen <[email protected]> wrote:
> Shivappa Vikas <[email protected]> writes:
>>
>> Ok , looks like the interface is the problem. Will try to fix
>> this. We are just trying to have a light weight monitoring
>> option so that its reasonable to monitor for a
>> very long time (like lifetime of process etc). Mainly to not have all
>> the perf scheduling overhead.
>
> That seems like an odd reason to define a completely new user interface.
> This is to avoid one MSR write for a RMID change per context switch
> in/out cgroup or is it other code too?
>
> Is there some number you can put to the overhead?

I obtained some timing by manually instrumenting the kernel in a Haswell EP.

When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
~1170ns

most of the time is spent in cgroup ctx switch (~1120ns).

When using continuous monitoring in CQM driver, the avg time to
find the rmid to write inside of pqr_context switch is ~16ns

Note that this excludes the MSR write. It's only the overhead of
finding the RMID
to write in PQR_ASSOC. Both paths call the same routine to find the
RMID, so there are
about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
of it comes from iterating over the pmu list.

> Or is there some other overhead other than the MSR write
> you're concerned about?

No, that problem is solved with the PQR software cache introduced in the series.


> Perhaps some optimization could be done in the code to make it faster,
> then the new interface wouldn't be needed.

There are some. One in my list is to create a list of pmus with at
least one cgroup event
and use it to iterate over in perf_cgroup_switch, instead of using the
"pmus" list.
The pmus list has grown a lot recently with the addition of all the uncore pmus.

Despite this optimization, it's unlikely that the whole sched_out +
sched_in gets that
close to the 15 ns of the non perf_event approach.

Please note that context switch time for llc_occupancy events has more
impact than for
other events because in order to obtain reliable measurements, the
RMID switch must
be active _all_ the time, not only while the event is read.

>
> FWIW there are some pending changes to context switch that will
> eliminate at least one common MSR write [1]. If that was fixed
> you could do the RMID MSR write "for free"

That may save the need for the PQR software cache in this series, but
won't speed up
the context switch.

Thanks,
David

2016-12-27 21:39:44

by David Carrillo-Cisneros

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

> The perf overhead i was thinking atleast was during the context switch which
> is the more constant overhead (the event creation is just one time).
>
> -I was trying to see an alternative where
> 1.user specifies the continuous monitor with perf-attr in open
> 2.driver allocates the task/cgroup RMID and stores the RMID in cgroup or
> task_struct
> 3.turns off the event. (hence no perf ctx switch overhead? (all the perf
> hook calls for start/stop/add we dont need any of those -
> i was still finding out if this route works basically if i turn off the
> event there is minimal overhead for the event and not start/stop/add calls
> for the event.)
> 4.but during switch_to driver writes the RMID MSR, so we still monitor.
> 5.read -> calls the driver -> driver just returns the count by reading the
> RMID.

This option breaks user expectations about an event: if an event is
closed, it's gone. It shouldn't leave any state behind.

Do you have thoughts about adding the one cgroup file to
the intel_cmt pmu directory?

Thanks,
David

2016-12-27 23:10:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
> ~1170ns
>
> most of the time is spend in cgroup ctx switch (~1120ns) .
>
> When using continuous monitoring in CQM driver, the avg time to
> find the rmid to write inside of pqr_context switch is ~16ns
>
> Note that this excludes the MSR write. It's only the overhead of
> finding the RMID
> to write in PQR_ASSOC. Both paths call the same routine to find the
> RMID, so there are
> about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
> of it comes from iterating over the pmu list.

Do Kan's pmu list patches help?

https://patchwork.kernel.org/patch/9420035/

>
> > Or is there some other overhead other than the MSR write
> > you're concerned about?
>
> No, that problem is solved with the PQR software cache introduced in the series.

So it's already fixed?

How much is the cost with your cache?

>
>
> > Perhaps some optimization could be done in the code to make it faster,
> > then the new interface wouldn't be needed.
>
> There are some. One in my list is to create a list of pmus with at
> least one cgroup event
> and use it to iterate over in perf_cgroup_switch, instead of using the
> "pmus" list.
> The pmus list has grown a lot recently with the addition of all the uncore pmus.

Kan's patches above already do that I believe.

>
> Despite this optimization, it's unlikely that the whole sched_out +
> sched_in gets that
> close to the 15 ns of the non perf_event approach.

It would be good to see how close we can get. I assume
there is more potential for optimizations and fast pathing.

-Andi

2016-12-28 01:23:49

by David Carrillo-Cisneros

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation

On Tue, Dec 27, 2016 at 3:10 PM, Andi Kleen <[email protected]> wrote:
> On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
>> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
>> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
>> ~1170ns
>>
>> most of the time is spend in cgroup ctx switch (~1120ns) .
>>
>> When using continuous monitoring in CQM driver, the avg time to
>> find the rmid to write inside of pqr_context switch is ~16ns
>>
>> Note that this excludes the MSR write. It's only the overhead of
>> finding the RMID
>> to write in PQR_ASSOC. Both paths call the same routine to find the
>> RMID, so there are
>> about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
>> of it comes from iterating over the pmu list.
>
> Do Kan's pmu list patches help?
>
> https://patchwork.kernel.org/patch/9420035/

I think these are independent problems. Kan's patches aim to reduce the
overhead of multiple events in the same task context. The overhead numbers
I posted measure only _one_ event in the cpu's context.

>
>>
>> > Or is there some other overhead other than the MSR write
>> > you're concerned about?
>>
>> No, that problem is solved with the PQR software cache introduced in the series.
>
> So it's already fixed?

Sort of; with the PQR sw cache there is only one write to the MSR, and it
happens only when either the RMID or the CLOSID actually changes.

>
> How much is the cost with your cache?

If there is no change in CLOSID or RMID, the hook and comparison take
about 60 ns. If there is a change, the write to the MSR + other overhead
is about 610 ns (dominated by the MSR write).
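
For reference, a minimal per-CPU sketch of that compare-before-write idea
(the type and function names are illustrative, not the exact code from the
series):

/*
 * PQR software cache sketch: remember the last RMID/CLOSID written per
 * CPU and touch MSR_IA32_PQR_ASSOC only when one of them changes.
 */
#include <linux/types.h>
#include <linux/percpu.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

struct pqr_state {
	u32 rmid;
	u32 closid;
};

static DEFINE_PER_CPU(struct pqr_state, pqr_state);

static inline void pqr_update(u32 rmid, u32 closid)
{
	struct pqr_state *state = this_cpu_ptr(&pqr_state);

	/* Fast path (~60 ns above): nothing changed, no MSR access at all. */
	if (state->rmid == rmid && state->closid == closid)
		return;

	/* Slow path (~610 ns above): one MSR write, never an MSR read. */
	state->rmid = rmid;
	state->closid = closid;
	wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
}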

>
>>
>>
>> > Perhaps some optimization could be done in the code to make it faster,
>> > then the new interface wouldn't be needed.
>>
>> There are some. One in my list is to create a list of pmus with at
>> least one cgroup event
>> and use it to iterate over in perf_cgroup_switch, instead of using the
>> "pmus" list.
>> The pmus list has grown a lot recently with the addition of all the uncore pmus.
>
> Kan's patches above already do that I believe.

See the previous answer.

>
>>
>> Despite this optimization, it's unlikely that the whole sched_out +
>> sched_in gets that
>> close to the 15 ns of the non perf_event approach.
>
> It would be good to see how close we can get. I assume
> there is more potential for optimizations and fast pathing.

I will work on the optimization I described earlier that avoids iterating
over all pmus on the cgroup switch. That should remove the bulk of the
overhead, but more work will probably still be needed to get close to the
15ns overhead.

Thanks,
David

2016-12-28 20:03:02

by Shivappa Vikas

[permalink] [raw]
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation



On Tue, 27 Dec 2016, David Carrillo-Cisneros wrote:

> On Tue, Dec 27, 2016 at 3:10 PM, Andi Kleen <[email protected]> wrote:
>> On Tue, Dec 27, 2016 at 01:33:46PM -0800, David Carrillo-Cisneros wrote:
>>> When using one intel_cmt/llc_occupancy/ cgroup perf_event in one CPU, the
>>> avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
>>> ~1170ns
>>>
>>> most of the time is spend in cgroup ctx switch (~1120ns) .
>>>
>>> When using continuous monitoring in CQM driver, the avg time to
>>> find the rmid to write inside of pqr_context switch is ~16ns
>>>
>>> Note that this excludes the MSR write. It's only the overhead of
>>> finding the RMID
>>> to write in PQR_ASSOC. Both paths call the same routine to find the
>>> RMID, so there are
>>> about 1100 ns of overhead in perf_cgroup_switch. By inspection I assume most
>>> of it comes from iterating over the pmu list.
>>
>> Do Kan's pmu list patches help?
>>
>> https://patchwork.kernel.org/patch/9420035/
>
> I think these are independent problems. Kan's patches aim to reduce the overhead
> of multiples events in the same task context. The overhead numbers I posted
> measure only _one_ event in the cpu's context.
>
>>
>>>
>>>> Or is there some other overhead other than the MSR write
>>>> you're concerned about?
>>>
>>> No, that problem is solved with the PQR software cache introduced in the series.
>>
>> So it's already fixed?
>
> Sort of, with PQR sw cache there is only one write to MSR and is only
> when either the
> RMID or the CLOSID actually changes.
>
>>
>> How much is the cost with your cache?
>
> If there is no change on CLOSID or RMID, the hook and comparison takes
> about 60 ns.
> If there is a change, the write to the MSR + other overhead is about
> 610 ns (dominated by the MSR write).

The MSR read and write we measured were close to 250 - 300 cycles each.
The issue was that even the read was as costly, which is why the caching
helps: it avoids all reads. Grouping RMIDs using cgroups and multiple
events etc. helps the cache because it increases the hit probability.

>
>>
>>>
>>>
>>>> Perhaps some optimization could be done in the code to make it faster,
>>>> then the new interface wouldn't be needed.
>>>
>>> There are some. One in my list is to create a list of pmus with at
>>> least one cgroup event
>>> and use it to iterate over in perf_cgroup_switch, instead of using the
>>> "pmus" list.
>>> The pmus list has grown a lot recently with the addition of all the uncore pmus.
>>
>> Kan's patches above already do that I believe.
>
> see previous answer.
>
>>
>>>
>>> Despite this optimization, it's unlikely that the whole sched_out +
>>> sched_in gets that
>>> close to the 15 ns of the non perf_event approach.
>>
>> It would be good to see how close we can get. I assume
>> there is more potential for optimizations and fast pathing.
>
> I will work on the optimization I described earlier that avoids iterating
> over all pmus on the cgroup switch. That should take the bulk of the
> overhead, but still more work will probably be needed to get close to the
> 15ns overhead.

This seems the best option as it's more generic, so we really don't need our
event-specific change or the added file interface, which wasn't liked by
PeterZ/Andi anyway.
Will remove/clean up the continuous monitoring parts and resend the series.

Thanks,
Vikas

>
> Thanks,
> David
>