From: Peter Zijlstra <[email protected]>
This is the second version of the RFC originally posted by Peter [1].
There have been various issues and limitations with the way perf uses
(task) contexts to track events. Most notable is the single hardware
PMU task context, which has resulted in a number of yucky things (both
proposed and merged).
Notably:
- HW breakpoint PMU
- ARM big.little PMU / Intel ADL PMU
- Intel Branch Monitoring PMU
- AMD IBS
Current design:
---------------
Currently we have per-task and per-CPU perf_event_contexts:
  task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
            ^                                 |    ^     |
            `---------------------------------'    |     `--> pmu
                                                   v           ^
                                              perf_event ------'
Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of
events for that PMU. The task-related perf_event_contexts have a
pointer back to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to
that PMU, and a group of events for that PMU.
The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.
Each perf_event is then associated with its PMU and one
perf_event_context.
Proposed design:
----------------
The new design proposed by this patch reduces this to a single task
context and a single CPU context, but adds some intermediate data
structures:
  task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
       ^                                 |   ^ ^
       `---------------------------------'   | |
                                             | |    perf_cpu_pmu_context
                                             | `----.    ^
                                             |      |    |
                                             |      v    v
                                             | ,--> perf_event_pmu_context
                                             | |         ^
                                             | |         |
                                             v v         v
                                        perf_event ---> pmu
With the new design, perf_event_context holds the events of all pmus in
the respective (pinned/flexible) rbtrees. This can be achieved by adding
the pmu to the rbtree key:
{cpu, pmu, cgroup_id, group_index}
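The ordering implied by such a key can be sketched in plain userspace C.
This is only an illustration of a lexicographic compare over the four
fields, not the kernel's perf_event_groups_cmp(); struct group_key, CMP3
and group_key_cmp() are hypothetical names, and the pmu field is reduced
to an integer standing in for the struct pmu pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the rbtree sort key. */
struct group_key {
	int cpu;
	uintptr_t pmu;		/* stands in for the struct pmu pointer */
	uint64_t cgroup_id;
	uint64_t group_index;
};

/* Three-way compare: -1 if a < b, 0 if equal, 1 if a > b. */
#define CMP3(a, b) (((a) > (b)) - ((a) < (b)))

/*
 * Lexicographic compare over {cpu, pmu, cgroup_id, group_index}.
 * Events that share cpu, pmu and cgroup_id end up adjacent, ordered
 * by group_index, so they form one contiguous subtree.
 */
static int group_key_cmp(const struct group_key *l, const struct group_key *r)
{
	int ret;

	assert(l && r);

	ret = CMP3(l->cpu, r->cpu);
	if (ret)
		return ret;

	ret = CMP3(l->pmu, r->pmu);
	if (ret)
		return ret;

	ret = CMP3(l->cgroup_id, r->cgroup_id);
	if (ret)
		return ret;

	return CMP3(l->group_index, r->group_index);
}
```

With this ordering, a lookup that fixes only a prefix of the key (for
example cpu and pmu) matches a contiguous range of events, which is what
makes per-pmu subtree walks possible.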
Each perf_event_context carries a list of perf_event_pmu_contexts, which
hold per-pmu-per-context state. For example, a perf_event_pmu_context
keeps track of the currently active events for that pmu, a pmu-specific
task_ctx_data, a flag that tells whether rotation is required, etc.
Similarly, perf_cpu_pmu_context holds per-pmu-per-cpu state such as the
hrtimer details that drive event rotation, a pointer to the
perf_event_pmu_context of the currently running task, and some other
ancillary information.
Each perf_event is associated with its pmu, perf_event_context and
perf_event_pmu_context.
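The relationships described above can be sketched with simplified
userspace structures. This is a rough model under stated assumptions,
not the kernel layout: all *_sketch types and the ctx_add_epc(),
ctx_find_epc() and event_attach() helpers are hypothetical, a singly
linked list replaces the kernel list_head, and the case where a software
event joins a hardware event group (and therefore uses another pmu's
context) is ignored.

```c
#include <assert.h>
#include <stddef.h>

struct ctx_sketch;

struct pmu_sketch {
	const char *name;
};

/* Per-pmu-per-context state, loosely modelled on perf_event_pmu_context. */
struct epc_sketch {
	struct pmu_sketch *pmu;
	struct ctx_sketch *ctx;
	struct epc_sketch *next;	/* link in ctx->pmu_ctx_list */
	unsigned int nr_events;
};

/* A single context per task or CPU, with one epc per pmu in use. */
struct ctx_sketch {
	struct epc_sketch *pmu_ctx_list;
	unsigned int nr_events;
};

/* An event points at its pmu, its context and its pmu context. */
struct event_sketch {
	struct pmu_sketch *pmu;
	struct ctx_sketch *ctx;
	struct epc_sketch *pmu_ctx;
};

/* Link a fresh epc for @pmu at the head of @ctx's list. */
static void ctx_add_epc(struct ctx_sketch *ctx, struct epc_sketch *epc,
			struct pmu_sketch *pmu)
{
	epc->pmu = pmu;
	epc->ctx = ctx;
	epc->nr_events = 0;
	epc->next = ctx->pmu_ctx_list;
	ctx->pmu_ctx_list = epc;
}

/* Walk the list and return the epc for @pmu, or NULL if there is none. */
static struct epc_sketch *ctx_find_epc(struct ctx_sketch *ctx,
				       const struct pmu_sketch *pmu)
{
	struct epc_sketch *epc;

	for (epc = ctx->pmu_ctx_list; epc; epc = epc->next)
		if (epc->pmu == pmu)
			return epc;
	return NULL;
}

/* Returns 0 on success, -1 if @ctx has no epc for the event's pmu. */
static int event_attach(struct ctx_sketch *ctx, struct event_sketch *event)
{
	struct epc_sketch *epc;

	assert(ctx && event);

	epc = ctx_find_epc(ctx, event->pmu);
	if (!epc)
		return -1;

	event->ctx = ctx;
	event->pmu_ctx = epc;
	epc->nr_events++;
	ctx->nr_events++;
	return 0;
}
```

The point of the split is that context-wide accounting stays in the one
context, while anything that only makes sense per pmu lives in the
per-pmu structure reachable from both the context and the event.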
Original RFC -> RFC v2:
-----------------------
In addition to porting the patch to the latest (v5.16-rc6) kernel, here
are some of the major changes between the two revisions:
- There have been quite a few fundamental changes since the original
  patch. Most notably, the rbtree key has changed from {cpu,group_index}
  to {cpu,cgroup_id,group_index}. Adding a pmu key in between, as
  proposed in the original patch, is not straightforward because it
  would break the cgroup-specific optimization. Hence we need to iterate
  over all pmu_ctx for a given ctx and call visit_groups_merge() one by
  one.
- Enabled cgroup support (CGROUP_PERF).
- Some changes to event multiplexing: with the new design, rotation
  happens at the cgroup subtree, unlike at the pmu subtree in the
  original patch.
Because of the additional complexity these changes bring in, I wanted to
get an initial review of the overall approach before starting to make it
upstream-ready. Hence this patch only gives an idea of the direction we
will head toward. There are many loose ends in the patch right now. For
example, I have not paid much attention to synchronization-related
aspects. Similarly, some of the issues marked (XXX) in the original
patch haven't been fixed.
A simple perf stat/record/top survives with the patch, but the machine
crashes on the first run of perf test (a stale cpc->task_epc causes the
crash). Lockdep is also screaming a lot :)
[1]: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
arch/powerpc/perf/core-book3s.c | 4 +-
arch/x86/events/core.c | 15 +-
arch/x86/events/intel/core.c | 12 +-
arch/x86/events/intel/ds.c | 4 +-
arch/x86/events/intel/lbr.c | 30 +-
arch/x86/events/perf_event.h | 16 +-
include/linux/perf_event.h | 105 +-
include/linux/sched.h | 2 +-
kernel/events/core.c | 1655 ++++++++++++++++---------------
9 files changed, 982 insertions(+), 861 deletions(-)
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 73e62e9b179b..fc5cdc6550d6 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -131,7 +131,7 @@ static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in) {}
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) {}
static inline void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *cpuhw) {}
static void pmao_restore_workaround(bool ebb) { }
#endif /* CONFIG_PPC32 */
@@ -450,7 +450,7 @@ static void power_pmu_bhrb_disable(struct perf_event *event)
/* Called from ctxsw to prevent one process's branch entries to
* mingle with the other process's entries during context switch.
*/
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
if (!ppmu->bhrb_nr)
return;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6dfa8ddaa60f..51ffb1e8de0a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2053,13 +2053,14 @@ void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
*/
void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
{
- struct perf_cpu_context *cpuctx;
+ /* XXX: Don't need this quirk anymore */
+ /*struct perf_cpu_context *cpuctx;
if (!pmu->pmu_cpu_context)
return;
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- cpuctx->ctx.pmu = pmu;
+ cpuctx->ctx.pmu = pmu;*/
}
static int __init init_hw_perf_events(void)
@@ -2630,15 +2631,15 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
NULL,
};
-static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void x86_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
- static_call_cond(x86_pmu_sched_task)(ctx, sched_in);
+ static_call_cond(x86_pmu_sched_task)(pmu_ctx, sched_in);
}
-static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next)
+static void x86_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc)
{
- static_call_cond(x86_pmu_swap_task_ctx)(prev, next);
+ static_call_cond(x86_pmu_swap_task_ctx)(prev_epc, next_epc);
}
void perf_check_microcode(void)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9a044438072b..de9ca948b042 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4477,17 +4477,17 @@ static void intel_pmu_cpu_dead(int cpu)
cpumask_clear_cpu(cpu, &hybrid_pmu(cpuc->pmu)->supported_cpus);
}
-static void intel_pmu_sched_task(struct perf_event_context *ctx,
+static void intel_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
bool sched_in)
{
- intel_pmu_pebs_sched_task(ctx, sched_in);
- intel_pmu_lbr_sched_task(ctx, sched_in);
+ intel_pmu_pebs_sched_task(pmu_ctx, sched_in);
+ intel_pmu_lbr_sched_task(pmu_ctx, sched_in);
}
-static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next)
+static void intel_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc)
{
- intel_pmu_lbr_swap_task_ctx(prev, next);
+ intel_pmu_lbr_swap_task_ctx(prev_epc, next_epc);
}
static int intel_pmu_check_period(struct perf_event *event, u64 value)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8647713276a7..67697c32fe92 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1004,7 +1004,7 @@ static inline bool pebs_needs_sched_cb(struct cpu_hw_events *cpuc)
return cpuc->n_pebs && (cpuc->n_pebs == cpuc->n_large_pebs);
}
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1112,7 +1112,7 @@ static void
pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
struct perf_event *event, bool add)
{
- struct pmu *pmu = event->ctx->pmu;
+ struct pmu *pmu = event->pmu;
/*
* Make sure we get updated with the first PEBS
* event. It will trigger also during removal, but
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9e6d6eaeb4cb..b708be7e4709 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -596,21 +596,21 @@ static void __intel_pmu_lbr_save(void *ctx)
cpuc->last_log_id = ++task_context_opt(ctx)->log_id;
}
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next)
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc)
{
void *prev_ctx_data, *next_ctx_data;
- swap(prev->task_ctx_data, next->task_ctx_data);
+ swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
/*
- * Architecture specific synchronization makes sense in
- * case both prev->task_ctx_data and next->task_ctx_data
+ * Architecture specific synchronization makes sense in case
+ * both prev_epc->task_ctx_data and next_epc->task_ctx_data
* pointers are allocated.
*/
- prev_ctx_data = next->task_ctx_data;
- next_ctx_data = prev->task_ctx_data;
+ prev_ctx_data = next_epc->task_ctx_data;
+ next_ctx_data = prev_epc->task_ctx_data;
if (!prev_ctx_data || !next_ctx_data)
return;
@@ -619,7 +619,7 @@ void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
task_context_opt(next_ctx_data)->lbr_callstack_users);
}
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
void *task_ctx;
@@ -632,7 +632,7 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
* the task was scheduled out, restore the stack. Otherwise flush
* the LBR stack.
*/
- task_ctx = ctx ? ctx->task_ctx_data : NULL;
+ task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;
if (task_ctx) {
if (sched_in)
__intel_pmu_lbr_restore(task_ctx);
@@ -668,8 +668,8 @@ void intel_pmu_lbr_add(struct perf_event *event)
cpuc->br_sel = event->hw.branch_reg.reg;
- if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data)
- task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users++;
+ if (branch_user_callstack(cpuc->br_sel) && event->pmu_ctx->task_ctx_data)
+ task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users++;
/*
* Request pmu::sched_task() callback, which will fire inside the
@@ -692,7 +692,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
*/
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users++;
- perf_sched_cb_inc(event->ctx->pmu);
+ perf_sched_cb_inc(event->pmu);
if (!cpuc->lbr_users++ && !event->total_time_running)
intel_pmu_lbr_reset();
}
@@ -745,8 +745,8 @@ void intel_pmu_lbr_del(struct perf_event *event)
return;
if (branch_user_callstack(cpuc->br_sel) &&
- event->ctx->task_ctx_data)
- task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users--;
+ event->pmu_ctx->task_ctx_data)
+ task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users--;
if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
cpuc->lbr_select = 0;
@@ -756,7 +756,7 @@ void intel_pmu_lbr_del(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
- perf_sched_cb_dec(event->ctx->pmu);
+ perf_sched_cb_dec(event->pmu);
}
static inline bool vlbr_exclude_host(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e3ac05c97b5e..fd937743b51a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -792,7 +792,7 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);
void (*check_microcode)(void);
- void (*sched_task)(struct perf_event_context *ctx,
+ void (*sched_task)(struct perf_event_pmu_context *pmu_ctx,
bool sched_in);
/*
@@ -869,12 +869,12 @@ struct x86_pmu {
int (*set_topdown_event_period)(struct perf_event *event);
/*
- * perf task context (i.e. struct perf_event_context::task_ctx_data)
+ * perf task context (i.e. struct perf_event_pmu_context::task_ctx_data)
* switch helper to bridge calls from perf/core to perf/x86.
* See struct pmu::swap_task_ctx() usage for examples;
*/
- void (*swap_task_ctx)(struct perf_event_context *prev,
- struct perf_event_context *next);
+ void (*swap_task_ctx)(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc);
/*
* AMD bits
@@ -1316,7 +1316,7 @@ void intel_pmu_pebs_enable_all(void);
void intel_pmu_pebs_disable_all(void);
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
void intel_pmu_auto_reload_read(struct perf_event *event);
@@ -1324,10 +1324,10 @@ void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
void intel_ds_init(void);
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next);
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc);
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
u64 lbr_from_signext_quirk_wr(u64 val);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9b60bb89d86a..c7d1f455de0d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -250,6 +250,7 @@ struct hw_perf_event {
};
struct perf_event;
+struct perf_event_pmu_context;
/*
* Common implementation detail of pmu::{start,commit,cancel}_txn
@@ -292,7 +293,7 @@ struct pmu {
int capabilities;
int __percpu *pmu_disable_count;
- struct perf_cpu_context __percpu *pmu_cpu_context;
+ struct perf_cpu_pmu_context __percpu *cpu_pmu_context;
atomic_t exclusive_cnt; /* < 0: cpu; > 0: tsk */
int task_ctx_nr;
int hrtimer_interval_ms;
@@ -427,7 +428,7 @@ struct pmu {
/*
* context-switches callback
*/
- void (*sched_task) (struct perf_event_context *ctx,
+ void (*sched_task) (struct perf_event_pmu_context *pmu_ctx,
bool sched_in);
/*
@@ -441,8 +442,8 @@ struct pmu {
* implementation and Perf core context switch handling callbacks for usage
* examples.
*/
- void (*swap_task_ctx) (struct perf_event_context *prev,
- struct perf_event_context *next);
+ void (*swap_task_ctx) (struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc);
/* optional */
/*
@@ -662,6 +663,11 @@ struct perf_event {
int group_caps;
struct perf_event *group_leader;
+ /*
+ * event->pmu always points to the pmu this event belongs to,
+ * unlike event->pmu_ctx->pmu, which can point to another pmu when
+ * events of different pmus are put in one group.
+ */
struct pmu *pmu;
void *pmu_private;
@@ -699,6 +705,12 @@ struct perf_event {
struct hw_perf_event hw;
struct perf_event_context *ctx;
+ /*
+ * event->pmu_ctx points to the perf_event_pmu_context the event is
+ * added to. For a sw event this can be the pmu_ctx of another pmu,
+ * when the sw event is added to a non-sw event group.
+ */
+ struct perf_event_pmu_context *pmu_ctx;
atomic_long_t refcount;
/*
@@ -786,19 +798,60 @@ struct perf_event {
#endif /* CONFIG_PERF_EVENTS */
};
+/*
+ * ,------------------------[1:n]---------------------.
+ * V V
+ * perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event
+ * ^ ^ | |
+ * `--------[1:n]---------' `-[n:1]-> pmu <-[1:n]-'
+ *
+ *
+ * XXX destroy epc when empty
+ * refcount, !rcu
+ *
+ * XXX epc locking
+ *
+ * event->pmu_ctx ctx->mutex && inactive
+ * ctx->pmu_ctx_list ctx->mutex && ctx->lock
+ *
+ */
+struct perf_event_pmu_context {
+ struct pmu *pmu;
+ struct perf_event_context *ctx;
+
+ struct list_head pmu_ctx_entry;
+
+ struct list_head pinned_active;
+ struct list_head flexible_active;
+
+ /* Used to avoid freeing per-cpu perf_event_pmu_context */
+ unsigned int embedded : 1;
+
+ unsigned int nr_events;
+ unsigned int nr_active;
+
+ atomic_t refcount; /* event <-> epc */
+
+ void *task_ctx_data; /* pmu specific data */
+ /*
+ * Set when nr_events != nr_active, except tolerant to events not
+ * necessary to be active due to scheduling constraints, such as cgroups.
+ */
+ int rotate_necessary;
+};
struct perf_event_groups {
struct rb_root tree;
u64 index;
};
+
/**
* struct perf_event_context - event context structure
*
* Used as a container for task events and CPU events as well:
*/
struct perf_event_context {
- struct pmu *pmu;
/*
* Protect the states of the events in the list,
* nr_active, and the list:
@@ -811,25 +864,20 @@ struct perf_event_context {
*/
struct mutex mutex;
- struct list_head active_ctx_list;
+ struct list_head pmu_ctx_list;
struct perf_event_groups pinned_groups;
struct perf_event_groups flexible_groups;
struct list_head event_list;
- struct list_head pinned_active;
- struct list_head flexible_active;
-
int nr_events;
int nr_active;
int is_active;
+
+ int nr_task_data;
int nr_stat;
int nr_freq;
int rotate_disable;
- /*
- * Set when nr_events != nr_active, except tolerant to events not
- * necessary to be active due to scheduling constraints, such as cgroups.
- */
- int rotate_necessary;
+
refcount_t refcount;
struct task_struct *task;
@@ -850,7 +898,6 @@ struct perf_event_context {
#ifdef CONFIG_CGROUP_PERF
int nr_cgroups; /* cgroup evts */
#endif
- void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
};
@@ -860,12 +907,13 @@ struct perf_event_context {
*/
#define PERF_NR_CONTEXTS 4
-/**
- * struct perf_event_cpu_context - per cpu event context structure
- */
-struct perf_cpu_context {
- struct perf_event_context ctx;
- struct perf_event_context *task_ctx;
+struct perf_cpu_pmu_context {
+ struct perf_event_pmu_context epc;
+ struct perf_event_pmu_context *task_epc;
+
+ struct list_head sched_cb_entry;
+ int sched_cb_usage;
+
int active_oncpu;
int exclusive;
@@ -873,16 +921,21 @@ struct perf_cpu_context {
struct hrtimer hrtimer;
ktime_t hrtimer_interval;
unsigned int hrtimer_active;
+};
+
+/**
+ * struct perf_event_cpu_context - per cpu event context structure
+ */
+struct perf_cpu_context {
+ struct perf_event_context ctx;
+ struct perf_event_context *task_ctx;
+ int online;
#ifdef CONFIG_CGROUP_PERF
struct perf_cgroup *cgrp;
struct list_head cgrp_cpuctx_entry;
#endif
- struct list_head sched_cb_entry;
- int sched_cb_usage;
-
- int online;
/*
* Per-CPU storage for iterators used in visit_groups_merge. The default
* storage is of size 2 to hold the CPU and any CPU event iterators.
@@ -1130,7 +1183,7 @@ static inline int is_software_event(struct perf_event *event)
*/
static inline int in_software_context(struct perf_event *event)
{
- return event->ctx->pmu->task_ctx_nr == perf_sw_context;
+ return event->pmu_ctx->pmu->task_ctx_nr == perf_sw_context;
}
static inline int is_exclusive_pmu(struct pmu *pmu)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1a927ddec64..17e8e1b04ded 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1221,7 +1221,7 @@ struct task_struct {
unsigned int futex_state;
#endif
#ifdef CONFIG_PERF_EVENTS
- struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
+ struct perf_event_context *perf_event_ctxp;
struct mutex perf_event_mutex;
struct list_head perf_event_list;
#endif
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f23ca260307f..cf95240c6db0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,12 +154,6 @@ static int cpu_function_call(int cpu, remote_function_f func, void *info)
return data.ret;
}
-static inline struct perf_cpu_context *
-__get_cpu_context(struct perf_event_context *ctx)
-{
- return this_cpu_ptr(ctx->pmu->pmu_cpu_context);
-}
-
static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx)
{
@@ -183,6 +177,8 @@ static bool is_kernel_event(struct perf_event *event)
return READ_ONCE(event->owner) == TASK_TOMBSTONE;
}
+static DEFINE_PER_CPU(struct perf_cpu_context, cpu_context);
+
/*
* On task ctx scheduling...
*
@@ -216,7 +212,7 @@ static int event_function(void *info)
struct event_function_struct *efs = info;
struct perf_event *event = efs->event;
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
int ret = 0;
@@ -313,7 +309,7 @@ static void event_function_call(struct perf_event *event, event_f func, void *da
static void event_function_local(struct perf_event *event, event_f func, void *data)
{
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct task_struct *task = READ_ONCE(ctx->task);
struct perf_event_context *task_ctx = NULL;
@@ -387,7 +383,6 @@ static DEFINE_MUTEX(perf_sched_mutex);
static atomic_t perf_sched_count;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(int, perf_sched_cb_usages);
static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
static atomic_t nr_mmap_events __read_mostly;
@@ -447,7 +442,7 @@ static void update_perf_cpu_limits(void)
WRITE_ONCE(perf_sample_allowed_ns, tmp);
}
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc);
int perf_proc_update_handler(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
@@ -570,13 +565,6 @@ void perf_sample_event_took(u64 sample_len_ns)
static atomic64_t perf_event_id;
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
-
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type,
- struct task_struct *task);
-
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);
@@ -674,13 +662,35 @@ perf_event_set_state(struct perf_event *event, enum perf_event_state state)
WRITE_ONCE(event->state, state);
}
+static void perf_ctx_disable(struct perf_event_context *ctx)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+ perf_pmu_disable(pmu_ctx->pmu);
+}
+
+static void perf_ctx_enable(struct perf_event_context *ctx)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+ perf_pmu_enable(pmu_ctx->pmu);
+}
+
+static void ctx_sched_out(struct perf_event_context *ctx,
+ enum event_type_t event_type);
+static void
+ctx_sched_in(struct perf_event_context *ctx,
+ enum event_type_t event_type,
+ struct task_struct *task);
+
#ifdef CONFIG_CGROUP_PERF
static inline bool
perf_cgroup_match(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
/* @event doesn't care about cgroup */
if (!event->cgrp)
@@ -789,6 +799,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
}
}
+/* XXX: No need of list now. Convert it to per-cpu variable */
static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
#define PERF_CGROUP_SWOUT 0x1 /* cgroup switch out every event */
@@ -817,10 +828,10 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- perf_pmu_disable(cpuctx->ctx.pmu);
+ perf_ctx_disable(&cpuctx->ctx);
if (mode & PERF_CGROUP_SWOUT) {
- cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+ ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
/*
* must not be done before ctxswout due
* to event_filter_match() in event_sched_out()
@@ -837,11 +848,10 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
* we pass the cpuctx->ctx to perf_cgroup_from_task()
* because cgorup events are only per-cpu
*/
- cpuctx->cgrp = perf_cgroup_from_task(task,
- &cpuctx->ctx);
- cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+ cpuctx->cgrp = perf_cgroup_from_task(task, &cpuctx->ctx);
+ ctx_sched_in(&cpuctx->ctx, EVENT_ALL, task);
}
- perf_pmu_enable(cpuctx->ctx.pmu);
+ perf_ctx_enable(&cpuctx->ctx);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
@@ -915,7 +925,7 @@ static int perf_cgroup_ensure_storage(struct perf_event *event,
heap_size++;
for_each_possible_cpu(cpu) {
- cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
if (heap_size <= cpuctx->heap_size)
continue;
@@ -1129,34 +1139,30 @@ perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *c
*/
static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
{
- struct perf_cpu_context *cpuctx;
+ struct perf_cpu_pmu_context *cpc;
bool rotations;
lockdep_assert_irqs_disabled();
- cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
- rotations = perf_rotate_context(cpuctx);
+ cpc = container_of(hr, struct perf_cpu_pmu_context, hrtimer);
+ rotations = perf_rotate_context(cpc);
- raw_spin_lock(&cpuctx->hrtimer_lock);
+ raw_spin_lock(&cpc->hrtimer_lock);
if (rotations)
- hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
+ hrtimer_forward_now(hr, cpc->hrtimer_interval);
else
- cpuctx->hrtimer_active = 0;
- raw_spin_unlock(&cpuctx->hrtimer_lock);
+ cpc->hrtimer_active = 0;
+ raw_spin_unlock(&cpc->hrtimer_lock);
return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
}
-static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
+static void __perf_mux_hrtimer_init(struct perf_cpu_pmu_context *cpc, int cpu)
{
- struct hrtimer *timer = &cpuctx->hrtimer;
- struct pmu *pmu = cpuctx->ctx.pmu;
+ struct hrtimer *timer = &cpc->hrtimer;
+ struct pmu *pmu = cpc->epc.pmu;
u64 interval;
- /* no multiplexing needed for SW PMU */
- if (pmu->task_ctx_nr == perf_sw_context)
- return;
-
/*
* check default is sane, if not set then force to
* default interval (1/tick)
@@ -1165,30 +1171,25 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
if (interval < 1)
interval = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
- cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
+ cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
- raw_spin_lock_init(&cpuctx->hrtimer_lock);
+ raw_spin_lock_init(&cpc->hrtimer_lock);
hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
timer->function = perf_mux_hrtimer_handler;
}
-static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
+static int perf_mux_hrtimer_restart(struct perf_cpu_pmu_context *cpc)
{
- struct hrtimer *timer = &cpuctx->hrtimer;
- struct pmu *pmu = cpuctx->ctx.pmu;
+ struct hrtimer *timer = &cpc->hrtimer;
unsigned long flags;
- /* not for SW PMU */
- if (pmu->task_ctx_nr == perf_sw_context)
- return 0;
-
- raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
- if (!cpuctx->hrtimer_active) {
- cpuctx->hrtimer_active = 1;
- hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+ raw_spin_lock_irqsave(&cpc->hrtimer_lock, flags);
+ if (!cpc->hrtimer_active) {
+ cpc->hrtimer_active = 1;
+ hrtimer_forward_now(timer, cpc->hrtimer_interval);
hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED_HARD);
}
- raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
+ raw_spin_unlock_irqrestore(&cpc->hrtimer_lock, flags);
return 0;
}
@@ -1207,32 +1208,9 @@ void perf_pmu_enable(struct pmu *pmu)
pmu->pmu_enable(pmu);
}
-static DEFINE_PER_CPU(struct list_head, active_ctx_list);
-
-/*
- * perf_event_ctx_activate(), perf_event_ctx_deactivate(), and
- * perf_event_task_tick() are fully serialized because they're strictly cpu
- * affine and perf_event_ctx{activate,deactivate} are called with IRQs
- * disabled, while perf_event_task_tick is called from IRQ context.
- */
-static void perf_event_ctx_activate(struct perf_event_context *ctx)
-{
- struct list_head *head = this_cpu_ptr(&active_ctx_list);
-
- lockdep_assert_irqs_disabled();
-
- WARN_ON(!list_empty(&ctx->active_ctx_list));
-
- list_add(&ctx->active_ctx_list, head);
-}
-
-static void perf_event_ctx_deactivate(struct perf_event_context *ctx)
+static void perf_assert_pmu_disabled(struct pmu *pmu)
{
- lockdep_assert_irqs_disabled();
-
- WARN_ON(list_empty(&ctx->active_ctx_list));
-
- list_del_init(&ctx->active_ctx_list);
+ WARN_ON_ONCE(*this_cpu_ptr(pmu->pmu_disable_count) == 0);
}
static void get_ctx(struct perf_event_context *ctx)
@@ -1259,7 +1237,6 @@ static void free_ctx(struct rcu_head *head)
struct perf_event_context *ctx;
ctx = container_of(head, struct perf_event_context, rcu_head);
- free_task_ctx_data(ctx->pmu, ctx->task_ctx_data);
kfree(ctx);
}
@@ -1444,7 +1421,7 @@ static u64 primary_event_id(struct perf_event *event)
* the context could get moved to another task.
*/
static struct perf_event_context *
-perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
+perf_lock_task_context(struct task_struct *task, unsigned long *flags)
{
struct perf_event_context *ctx;
@@ -1460,7 +1437,7 @@ perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
*/
local_irq_save(*flags);
rcu_read_lock();
- ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
+ ctx = rcu_dereference(task->perf_event_ctxp);
if (ctx) {
/*
* If this context is a clone of another, it might
@@ -1473,7 +1450,7 @@ perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
* can't get swapped on us any more.
*/
raw_spin_lock(&ctx->lock);
- if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
+ if (ctx != rcu_dereference(task->perf_event_ctxp)) {
raw_spin_unlock(&ctx->lock);
rcu_read_unlock();
local_irq_restore(*flags);
@@ -1500,12 +1477,12 @@ perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
* reference count so that the context can't get freed.
*/
static struct perf_event_context *
-perf_pin_task_context(struct task_struct *task, int ctxn)
+perf_pin_task_context(struct task_struct *task)
{
struct perf_event_context *ctx;
unsigned long flags;
- ctx = perf_lock_task_context(task, ctxn, &flags);
+ ctx = perf_lock_task_context(task, &flags);
if (ctx) {
++ctx->pin_count;
raw_spin_unlock_irqrestore(&ctx->lock, flags);
@@ -1614,14 +1591,22 @@ static inline struct cgroup *event_cgroup(const struct perf_event *event)
* which provides ordering when rotating groups for the same CPU.
*/
static __always_inline int
-perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
- const u64 left_group_index, const struct perf_event *right)
+perf_event_groups_cmp(const int left_cpu, const struct pmu *left_pmu,
+ const struct cgroup *left_cgroup, const u64 left_group_index,
+ const struct perf_event *right)
{
if (left_cpu < right->cpu)
return -1;
if (left_cpu > right->cpu)
return 1;
+ if (left_pmu) {
+ if (left_pmu < right->pmu_ctx->pmu)
+ return -1;
+ if (left_pmu > right->pmu_ctx->pmu)
+ return 1;
+ }
+
#ifdef CONFIG_CGROUP_PERF
{
const struct cgroup *right_cgroup = event_cgroup(right);
@@ -1664,12 +1649,13 @@ perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
static inline bool __group_less(struct rb_node *a, const struct rb_node *b)
{
struct perf_event *e = __node_2_pe(a);
- return perf_event_groups_cmp(e->cpu, event_cgroup(e), e->group_index,
- __node_2_pe(b)) < 0;
+ return perf_event_groups_cmp(e->cpu, e->pmu_ctx->pmu, event_cgroup(e),
+ e->group_index, __node_2_pe(b)) < 0;
}
struct __group_key {
int cpu;
+ struct pmu *pmu;
struct cgroup *cgroup;
};
@@ -1678,14 +1664,25 @@ static inline int __group_cmp(const void *key, const struct rb_node *node)
const struct __group_key *a = key;
const struct perf_event *b = __node_2_pe(node);
- /* partial/subtree match: @cpu, @cgroup; ignore: @group_index */
- return perf_event_groups_cmp(a->cpu, a->cgroup, b->group_index, b);
+ /* partial/subtree match: @cpu, @pmu, @cgroup; ignore: @group_index */
+ return perf_event_groups_cmp(a->cpu, a->pmu, a->cgroup, b->group_index, b);
+}
+
+static inline int
+__group_cmp_ignore_cgroup(const void *key, const struct rb_node *node)
+{
+ const struct __group_key *a = key;
+ const struct perf_event *b = __node_2_pe(node);
+
+ /* partial/subtree match: @cpu, @pmu; ignore: @cgroup, @group_index */
+ return perf_event_groups_cmp(a->cpu, a->pmu, event_cgroup(b),
+ b->group_index, b);
}
/*
- * Insert @event into @groups' tree; using {@event->cpu, ++@groups->index} for
- * key (see perf_event_groups_less). This places it last inside the CPU
- * subtree.
+ * Insert @event into @groups' tree; using
+ * {@event->cpu, @event->pmu_ctx->pmu, event_cgroup(@event), ++@groups->index}
+ * as key. This places it last inside the {cpu,pmu,cgroup} subtree.
*/
static void
perf_event_groups_insert(struct perf_event_groups *groups,
@@ -1735,14 +1732,15 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
}
/*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the {cpu,pmu,cgroup} subtree.
*/
static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu,
- struct cgroup *cgrp)
+ struct pmu *pmu, struct cgroup *cgrp)
{
struct __group_key key = {
.cpu = cpu,
+ .pmu = pmu,
.cgroup = cgrp,
};
struct rb_node *node;
@@ -1754,14 +1752,12 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
return NULL;
}
-/*
- * Like rb_entry_next_safe() for the @cpu subtree.
- */
static struct perf_event *
-perf_event_groups_next(struct perf_event *event)
+perf_event_groups_next(struct perf_event *event, struct pmu *pmu)
{
struct __group_key key = {
.cpu = event->cpu,
+ .pmu = pmu,
.cgroup = event_cgroup(event),
};
struct rb_node *next;
@@ -1815,6 +1811,7 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
perf_cgroup_event_enable(event, ctx);
ctx->generation++;
+ event->pmu_ctx->nr_events++;
}
/*
@@ -2020,6 +2017,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
}
ctx->generation++;
+ event->pmu_ctx->nr_events--;
}
static int
@@ -2036,13 +2034,11 @@ perf_aux_output_match(struct perf_event *event, struct perf_event *aux_event)
static void put_event(struct perf_event *event);
static void event_sched_out(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx);
static void perf_put_aux_event(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct perf_event *iter;
/*
@@ -2071,7 +2067,7 @@ static void perf_put_aux_event(struct perf_event *event)
* state so that we don't try to schedule it again. Note
* that perf_event_enable() will clear the ERROR status.
*/
- event_sched_out(iter, cpuctx, ctx);
+ event_sched_out(iter, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
}
}
@@ -2122,8 +2118,8 @@ static int perf_get_aux_event(struct perf_event *event,
static inline struct list_head *get_event_list(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
- return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
+ return event->attr.pinned ? &event->pmu_ctx->pinned_active :
+ &event->pmu_ctx->flexible_active;
}
/*
@@ -2134,10 +2130,7 @@ static inline struct list_head *get_event_list(struct perf_event *event)
*/
static inline void perf_remove_sibling_event(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, event->ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
}
@@ -2261,12 +2254,14 @@ event_filter_match(struct perf_event *event)
}
static void
-event_sched_out(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
{
+ struct perf_event_pmu_context *epc = event->pmu_ctx;
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
+ // XXX cpc serialization, probably per-cpu IRQ disabled
+
WARN_ON_ONCE(event->ctx != ctx);
lockdep_assert_held(&ctx->lock);
@@ -2293,38 +2288,34 @@ event_sched_out(struct perf_event *event,
perf_event_set_state(event, state);
if (!is_software_event(event))
- cpuctx->active_oncpu--;
- if (!--ctx->nr_active)
- perf_event_ctx_deactivate(ctx);
+ cpc->active_oncpu--;
+ ctx->nr_active--;
+ event->pmu_ctx->nr_active--;
if (event->attr.freq && event->attr.sample_freq)
ctx->nr_freq--;
- if (event->attr.exclusive || !cpuctx->active_oncpu)
- cpuctx->exclusive = 0;
+ if (event->attr.exclusive || !cpc->active_oncpu)
+ cpc->exclusive = 0;
perf_pmu_enable(event->pmu);
}
static void
-group_sched_out(struct perf_event *group_event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
{
struct perf_event *event;
if (group_event->state != PERF_EVENT_STATE_ACTIVE)
return;
- perf_pmu_disable(ctx->pmu);
+ perf_assert_pmu_disabled(group_event->pmu_ctx->pmu);
- event_sched_out(group_event, cpuctx, ctx);
+ event_sched_out(group_event, ctx);
/*
* Schedule out siblings (if any):
*/
for_each_sibling_event(event, group_event)
- event_sched_out(event, cpuctx, ctx);
-
- perf_pmu_enable(ctx->pmu);
+ event_sched_out(event, ctx);
}
#define DETACH_GROUP 0x01UL
@@ -2349,16 +2340,18 @@ __perf_remove_from_context(struct perf_event *event,
update_cgrp_time_from_cpuctx(cpuctx);
}
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, ctx);
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
+ if (!event->pmu_ctx->nr_events)
+ event->pmu_ctx->rotate_necessary = 0;
+
if (!ctx->nr_events && ctx->is_active) {
ctx->is_active = 0;
- ctx->rotate_necessary = 0;
if (ctx->task) {
WARN_ON_ONCE(cpuctx->task_ctx != ctx);
cpuctx->task_ctx = NULL;
@@ -2389,7 +2382,7 @@ static void perf_remove_from_context(struct perf_event *event, unsigned long fla
*/
raw_spin_lock_irq(&ctx->lock);
if (!ctx->is_active) {
- __perf_remove_from_context(event, __get_cpu_context(ctx),
+ __perf_remove_from_context(event, this_cpu_ptr(&cpu_context),
ctx, (void *)flags);
raw_spin_unlock_irq(&ctx->lock);
return;
@@ -2415,13 +2408,17 @@ static void __perf_event_disable(struct perf_event *event,
update_cgrp_time_from_event(event);
}
+ perf_pmu_disable(event->pmu_ctx->pmu);
+
if (event == event->group_leader)
- group_sched_out(event, cpuctx, ctx);
+ group_sched_out(event, ctx);
else
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_OFF);
perf_cgroup_event_disable(event, ctx);
+
+ perf_pmu_enable(event->pmu_ctx->pmu);
}
/*
@@ -2518,10 +2515,10 @@ static void perf_log_throttle(struct perf_event *event, int enable);
static void perf_log_itrace_start(struct perf_event *event);
static int
-event_sched_in(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
{
+ struct perf_event_pmu_context *epc = event->pmu_ctx;
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
int ret = 0;
WARN_ON_ONCE(event->ctx != ctx);
@@ -2564,14 +2561,14 @@ event_sched_in(struct perf_event *event,
}
if (!is_software_event(event))
- cpuctx->active_oncpu++;
- if (!ctx->nr_active++)
- perf_event_ctx_activate(ctx);
+ cpc->active_oncpu++;
+ ctx->nr_active++;
+ event->pmu_ctx->nr_active++;
if (event->attr.freq && event->attr.sample_freq)
ctx->nr_freq++;
if (event->attr.exclusive)
- cpuctx->exclusive = 1;
+ cpc->exclusive = 1;
out:
perf_pmu_enable(event->pmu);
@@ -2580,26 +2577,24 @@ event_sched_in(struct perf_event *event,
}
static int
-group_sched_in(struct perf_event *group_event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+group_sched_in(struct perf_event *group_event, struct perf_event_context *ctx)
{
struct perf_event *event, *partial_group = NULL;
- struct pmu *pmu = ctx->pmu;
+ struct pmu *pmu = group_event->pmu_ctx->pmu;
if (group_event->state == PERF_EVENT_STATE_OFF)
return 0;
pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
- if (event_sched_in(group_event, cpuctx, ctx))
+ if (event_sched_in(group_event, ctx))
goto error;
/*
* Schedule in siblings as one group (if any):
*/
for_each_sibling_event(event, group_event) {
- if (event_sched_in(event, cpuctx, ctx)) {
+ if (event_sched_in(event, ctx)) {
partial_group = event;
goto group_error;
}
@@ -2618,9 +2613,9 @@ group_sched_in(struct perf_event *group_event,
if (event == partial_group)
break;
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, ctx);
}
- event_sched_out(group_event, cpuctx, ctx);
+ event_sched_out(group_event, ctx);
error:
pmu->cancel_txn(pmu);
@@ -2630,10 +2625,11 @@ group_sched_in(struct perf_event *group_event,
/*
* Work out whether we can put this event group on the CPU now.
*/
-static int group_can_go_on(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
- int can_add_hw)
+static int group_can_go_on(struct perf_event *event, int can_add_hw)
{
+ struct perf_event_pmu_context *epc = event->pmu_ctx;
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+
/*
* Groups consisting entirely of software events can always go on.
*/
@@ -2643,7 +2639,7 @@ static int group_can_go_on(struct perf_event *event,
* If an exclusive group is already on, no other hardware
* events can go on.
*/
- if (cpuctx->exclusive)
+ if (cpc->exclusive)
return 0;
/*
* If this group is exclusive and there are already
@@ -2665,38 +2661,30 @@ static void add_event_to_ctx(struct perf_event *event,
perf_group_attach(event);
}
-static void ctx_sched_out(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
-static void
-ctx_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type,
- struct task_struct *task);
-
-static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx,
- enum event_type_t event_type)
+static void task_ctx_sched_out(struct perf_event_context *ctx,
+ enum event_type_t event_type)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+
if (!cpuctx->task_ctx)
return;
if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
return;
- ctx_sched_out(ctx, cpuctx, event_type);
+ ctx_sched_out(ctx, event_type);
}
static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx,
struct task_struct *task)
{
- cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+ ctx_sched_in(&cpuctx->ctx, EVENT_PINNED, task);
if (ctx)
- ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
- cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+ ctx_sched_in(ctx, EVENT_PINNED, task);
+ ctx_sched_in(&cpuctx->ctx, EVENT_FLEXIBLE, task);
if (ctx)
- ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+ ctx_sched_in(ctx, EVENT_FLEXIBLE, task);
}
/*
@@ -2718,7 +2706,6 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
struct perf_event_context *task_ctx,
enum event_type_t event_type)
{
- enum event_type_t ctx_event_type;
bool cpu_event = !!(event_type & EVENT_CPU);
/*
@@ -2728,11 +2715,13 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
if (event_type & EVENT_PINNED)
event_type |= EVENT_FLEXIBLE;
- ctx_event_type = event_type & EVENT_ALL;
+ event_type &= EVENT_ALL;
- perf_pmu_disable(cpuctx->ctx.pmu);
- if (task_ctx)
- task_ctx_sched_out(cpuctx, task_ctx, event_type);
+ perf_ctx_disable(&cpuctx->ctx);
+ if (task_ctx) {
+ perf_ctx_disable(task_ctx);
+ task_ctx_sched_out(task_ctx, event_type);
+ }
/*
* Decide which cpu ctx groups to schedule out based on the types
@@ -2742,17 +2731,20 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
* - otherwise, do nothing more.
*/
if (cpu_event)
- cpu_ctx_sched_out(cpuctx, ctx_event_type);
- else if (ctx_event_type & EVENT_PINNED)
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+ ctx_sched_out(&cpuctx->ctx, event_type);
+ else if (event_type & EVENT_PINNED)
+ ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
perf_event_sched_in(cpuctx, task_ctx, current);
- perf_pmu_enable(cpuctx->ctx.pmu);
+
+ perf_ctx_enable(&cpuctx->ctx);
+ if (task_ctx)
+ perf_ctx_enable(task_ctx);
}
void perf_pmu_resched(struct pmu *pmu)
{
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
perf_ctx_lock(cpuctx, task_ctx);
@@ -2770,7 +2762,7 @@ static int __perf_install_in_context(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
bool reprogram = true;
int ret = 0;
@@ -2812,7 +2804,7 @@ static int __perf_install_in_context(void *info)
#endif
if (reprogram) {
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
add_event_to_ctx(event, ctx);
ctx_resched(cpuctx, task_ctx, get_event_type(event));
} else {
@@ -2957,7 +2949,7 @@ static void __perf_event_enable(struct perf_event *event,
return;
if (ctx->is_active)
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
perf_cgroup_event_enable(event, ctx);
@@ -2966,7 +2958,7 @@ static void __perf_event_enable(struct perf_event *event,
return;
if (!event_filter_match(event)) {
- ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+ ctx_sched_in(ctx, EVENT_TIME, current);
return;
}
@@ -2975,7 +2967,7 @@ static void __perf_event_enable(struct perf_event *event,
* then don't put it on unless the group is on.
*/
if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
- ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+ ctx_sched_in(ctx, EVENT_TIME, current);
return;
}
@@ -3228,11 +3220,52 @@ static int perf_event_modify_attr(struct perf_event *event,
return err;
}
-static void ctx_sched_out(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
+static void __pmu_ctx_sched_out(struct perf_event_pmu_context *pmu_ctx,
+ enum event_type_t event_type)
{
+ struct perf_event_context *ctx = pmu_ctx->ctx;
struct perf_event *event, *tmp;
+ struct pmu *pmu = pmu_ctx->pmu;
+
+ if (ctx->task && !ctx->is_active) {
+ struct perf_cpu_pmu_context *cpc;
+
+ cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+ WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
+ cpc->task_epc = NULL;
+ }
+
+ if (!event_type)
+ return;
+
+ perf_pmu_disable(pmu);
+ if (event_type & EVENT_PINNED) {
+ list_for_each_entry_safe(event, tmp,
+ &pmu_ctx->pinned_active,
+ active_list)
+ group_sched_out(event, ctx);
+ }
+
+ if (event_type & EVENT_FLEXIBLE) {
+ list_for_each_entry_safe(event, tmp,
+ &pmu_ctx->flexible_active,
+ active_list)
+ group_sched_out(event, ctx);
+ /*
+ * Since we cleared EVENT_FLEXIBLE, also clear
+ * rotate_necessary, it will be reset by
+ * ctx_flexible_sched_in() when needed.
+ */
+ pmu_ctx->rotate_necessary = 0;
+ }
+ perf_pmu_enable(pmu);
+}
+
+static void
+ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
+{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_pmu_context *pmu_ctx;
int is_active = ctx->is_active;
lockdep_assert_held(&ctx->lock);
@@ -3278,24 +3311,8 @@ static void ctx_sched_out(struct perf_event_context *ctx,
if (!ctx->nr_active || !(is_active & EVENT_ALL))
return;
- perf_pmu_disable(ctx->pmu);
- if (is_active & EVENT_PINNED) {
- list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
- group_sched_out(event, cpuctx, ctx);
- }
-
- if (is_active & EVENT_FLEXIBLE) {
- list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
- group_sched_out(event, cpuctx, ctx);
-
- /*
- * Since we cleared EVENT_FLEXIBLE, also clear
- * rotate_necessary, is will be reset by
- * ctx_flexible_sched_in() when needed.
- */
- ctx->rotate_necessary = 0;
- }
- perf_pmu_enable(ctx->pmu);
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+ __pmu_ctx_sched_out(pmu_ctx, is_active);
}
/*
@@ -3400,26 +3417,65 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
}
}
-static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
- struct task_struct *next)
+static void perf_event_swap_task_ctx_data(struct perf_event_context *prev_ctx,
+ struct perf_event_context *next_ctx)
+{
+ struct perf_event_pmu_context *prev_epc, *next_epc;
+
+ if (!prev_ctx->nr_task_data)
+ return;
+
+ prev_epc = list_first_entry(&prev_ctx->pmu_ctx_list,
+ struct perf_event_pmu_context,
+ pmu_ctx_entry);
+ next_epc = list_first_entry(&next_ctx->pmu_ctx_list,
+ struct perf_event_pmu_context,
+ pmu_ctx_entry);
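+ for (; &prev_epc->pmu_ctx_entry != &prev_ctx->pmu_ctx_list &&
+ &next_epc->pmu_ctx_entry != &next_ctx->pmu_ctx_list;
+ prev_epc = list_next_entry(prev_epc, pmu_ctx_entry),
+ next_epc = list_next_entry(next_epc, pmu_ctx_entry)) {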
+ WARN_ON_ONCE(prev_epc->pmu != next_epc->pmu);
+
+ /*
+ * PMU specific parts of task perf context can require
+ * additional synchronization. As an example of such
+ * synchronization see implementation details of Intel
+ * LBR call stack data profiling;
+ */
+ if (prev_epc->pmu->swap_task_ctx)
+ prev_epc->pmu->swap_task_ctx(prev_epc, next_epc);
+ else
+ swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
+ }
+}
+
+static void perf_ctx_sched_task_cb(struct perf_event_context *ctx, bool sched_in)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+ struct perf_cpu_pmu_context *cpc;
+
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+
+ if (cpc->sched_cb_usage && pmu_ctx->pmu->sched_task)
+ pmu_ctx->pmu->sched_task(pmu_ctx, sched_in);
+ }
+}
+
+static void
+perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
{
- struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
+ struct perf_event_context *ctx = task->perf_event_ctxp;
struct perf_event_context *next_ctx;
struct perf_event_context *parent, *next_parent;
- struct perf_cpu_context *cpuctx;
int do_switch = 1;
- struct pmu *pmu;
if (likely(!ctx))
return;
- pmu = ctx->pmu;
- cpuctx = __get_cpu_context(ctx);
- if (!cpuctx->task_ctx)
- return;
-
rcu_read_lock();
- next_ctx = next->perf_event_ctxp[ctxn];
+ next_ctx = rcu_dereference(next->perf_event_ctxp);
if (!next_ctx)
goto unlock;
@@ -3447,23 +3503,12 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
WRITE_ONCE(ctx->task, next);
WRITE_ONCE(next_ctx->task, task);
- perf_pmu_disable(pmu);
-
- if (cpuctx->sched_cb_usage && pmu->sched_task)
- pmu->sched_task(ctx, false);
+ perf_ctx_disable(ctx);
- /*
- * PMU specific parts of task perf context can require
- * additional synchronization. As an example of such
- * synchronization see implementation details of Intel
- * LBR call stack data profiling;
- */
- if (pmu->swap_task_ctx)
- pmu->swap_task_ctx(ctx, next_ctx);
- else
- swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+ perf_ctx_sched_task_cb(ctx, false);
+ perf_event_swap_task_ctx_data(ctx, next_ctx);
- perf_pmu_enable(pmu);
+ perf_ctx_enable(ctx);
/*
* RCU_INIT_POINTER here is safe because we've not
@@ -3472,8 +3517,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
* since those values are always verified under
* ctx->lock which we're now holding.
*/
- RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], next_ctx);
- RCU_INIT_POINTER(next->perf_event_ctxp[ctxn], ctx);
+ RCU_INIT_POINTER(task->perf_event_ctxp, next_ctx);
+ RCU_INIT_POINTER(next->perf_event_ctxp, ctx);
do_switch = 0;
@@ -3487,37 +3532,39 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
if (do_switch) {
raw_spin_lock(&ctx->lock);
- perf_pmu_disable(pmu);
+ perf_ctx_disable(ctx);
- if (cpuctx->sched_cb_usage && pmu->sched_task)
- pmu->sched_task(ctx, false);
- task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+ perf_ctx_sched_task_cb(ctx, false);
+ task_ctx_sched_out(ctx, EVENT_ALL);
- perf_pmu_enable(pmu);
+ perf_ctx_enable(ctx);
raw_spin_unlock(&ctx->lock);
}
}
static DEFINE_PER_CPU(struct list_head, sched_cb_list);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
void perf_sched_cb_dec(struct pmu *pmu)
{
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
this_cpu_dec(perf_sched_cb_usages);
+ barrier();
- if (!--cpuctx->sched_cb_usage)
- list_del(&cpuctx->sched_cb_entry);
+ if (!--cpc->sched_cb_usage)
+ list_del(&cpc->sched_cb_entry);
}
void perf_sched_cb_inc(struct pmu *pmu)
{
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
- if (!cpuctx->sched_cb_usage++)
- list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+ if (!cpc->sched_cb_usage++)
+ list_add(&cpc->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+ barrier();
this_cpu_inc(perf_sched_cb_usages);
}
@@ -3529,19 +3576,21 @@ void perf_sched_cb_inc(struct pmu *pmu)
* PEBS requires this to provide PID/TID information. This requires we flush
* all queued PEBS records before we context switch to a new task.
*/
-static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in)
+static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc, bool sched_in)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct pmu *pmu;
- pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */
+ pmu = cpc->epc.pmu;
+ /* software PMUs will not have sched_task */
if (WARN_ON_ONCE(!pmu->sched_task))
return;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(pmu);
- pmu->sched_task(cpuctx->task_ctx, sched_in);
+ pmu->sched_task(cpc->task_epc, sched_in);
perf_pmu_enable(pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3551,26 +3600,20 @@ static void perf_pmu_sched_task(struct task_struct *prev,
struct task_struct *next,
bool sched_in)
{
- struct perf_cpu_context *cpuctx;
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_cpu_pmu_context *cpc;
- if (prev == next)
+ /* cpuctx->task_ctx will be handled in perf_event_context_sched_in/out */
+ if (prev == next || cpuctx->task_ctx)
return;
- list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
- /* will be handled in perf_event_context_sched_in/out */
- if (cpuctx->task_ctx)
- continue;
-
- __perf_pmu_sched_task(cpuctx, sched_in);
- }
+ list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
+ __perf_pmu_sched_task(cpc, sched_in);
}
static void perf_event_switch(struct task_struct *task,
struct task_struct *next_prev, bool sched_in);
-#define for_each_task_context_nr(ctxn) \
- for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
-
/*
* Called from scheduler to remove the events of the current task,
* with interrupts disabled.
@@ -3585,16 +3628,13 @@ static void perf_event_switch(struct task_struct *task,
void __perf_event_task_sched_out(struct task_struct *task,
struct task_struct *next)
{
- int ctxn;
-
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(task, next, false);
if (atomic_read(&nr_switch_events))
perf_event_switch(task, next, false);
- for_each_task_context_nr(ctxn)
- perf_event_context_sched_out(task, ctxn, next);
+ perf_event_context_sched_out(task, next);
/*
* if cgroup events exist on this CPU, then we need
@@ -3605,15 +3645,6 @@ void __perf_event_task_sched_out(struct task_struct *task,
perf_cgroup_sched_out(task, next);
}
-/*
- * Called with IRQs disabled
- */
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
-{
- ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
-}
-
static bool perf_less_group_idx(const void *l, const void *r)
{
const struct perf_event *le = *(const struct perf_event **)l;
@@ -3645,21 +3676,36 @@ static void __heap_add(struct min_heap *heap, struct perf_event *event)
}
}
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+static void __link_epc(struct perf_event_pmu_context *pmu_ctx)
+{
+ struct perf_cpu_pmu_context *cpc;
+
+ if (!pmu_ctx->ctx->task)
+ return;
+
+ cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+ WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
+ cpc->task_epc = pmu_ctx;
+}
+
+static noinline int visit_groups_merge(struct perf_event_context *ctx,
struct perf_event_groups *groups, int cpu,
+ struct pmu *pmu,
int (*func)(struct perf_event *, void *),
void *data)
{
#ifdef CONFIG_CGROUP_PERF
struct cgroup_subsys_state *css = NULL;
#endif
+ struct perf_cpu_context *cpuctx = NULL;
/* Space for per CPU and/or any CPU event iterators. */
struct perf_event *itrs[2];
struct min_heap event_heap;
struct perf_event **evt;
int ret;
- if (cpuctx) {
+ if (!ctx->task) {
+ cpuctx = this_cpu_ptr(&cpu_context);
event_heap = (struct min_heap){
.data = cpuctx->heap,
.nr = 0,
@@ -3679,17 +3725,28 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
.size = ARRAY_SIZE(itrs),
};
/* Events not within a CPU context may be on any CPU. */
- __heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
+ __heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
}
evt = event_heap.data;
- __heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+ __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
#ifdef CONFIG_CGROUP_PERF
for (; css; css = css->parent)
- __heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
+ __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
#endif
+ if (event_heap.nr) {
+ /*
+ * XXX: For now, visit_groups_merge() is never called with a NULL
+ * pmu pointer. But these functions need to be called once for
+ * each pmu if I implement the pmu=NULL optimization.
+ */
+ __link_epc((*evt)->pmu_ctx);
+ perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
+ }
+
+
min_heapify_all(&event_heap, &perf_min_heap);
while (event_heap.nr) {
@@ -3697,7 +3754,7 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
if (ret)
return ret;
- *evt = perf_event_groups_next(*evt);
+ *evt = perf_event_groups_next(*evt, pmu);
if (*evt)
min_heapify(&event_heap, 0, &perf_min_heap);
else
@@ -3733,7 +3790,6 @@ static inline void group_update_userpage(struct perf_event *group_event)
static int merge_sched_in(struct perf_event *event, void *data)
{
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
int *can_add_hw = data;
if (event->state <= PERF_EVENT_STATE_OFF)
@@ -3742,8 +3798,8 @@ static int merge_sched_in(struct perf_event *event, void *data)
if (!event_filter_match(event))
return 0;
- if (group_can_go_on(event, cpuctx, *can_add_hw)) {
- if (!group_sched_in(event, cpuctx, ctx))
+ if (group_can_go_on(event, *can_add_hw)) {
+ if (!group_sched_in(event, ctx))
list_add_tail(&event->active_list, get_event_list(event));
}
@@ -3753,8 +3809,11 @@ static int merge_sched_in(struct perf_event *event, void *data)
perf_cgroup_event_disable(event, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
} else {
- ctx->rotate_necessary = 1;
- perf_mux_hrtimer_restart(cpuctx);
+ struct perf_cpu_pmu_context *cpc;
+
+ event->pmu_ctx->rotate_necessary = 1;
+ cpc = this_cpu_ptr(event->pmu_ctx->pmu->cpu_pmu_context);
+ perf_mux_hrtimer_restart(cpc);
group_update_userpage(event);
}
}
@@ -3762,40 +3821,68 @@ static int merge_sched_in(struct perf_event *event, void *data)
return 0;
}
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx)
+static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
{
+ struct perf_event_pmu_context *pmu_ctx;
int can_add_hw = 1;
- if (ctx != &cpuctx->ctx)
- cpuctx = NULL;
-
- visit_groups_merge(cpuctx, &ctx->pinned_groups,
- smp_processor_id(),
- merge_sched_in, &can_add_hw);
+ if (pmu) {
+ visit_groups_merge(ctx, &ctx->pinned_groups,
+ smp_processor_id(), pmu,
+ merge_sched_in, &can_add_hw);
+ } else {
+ /*
+ * XXX: This can be optimized for per-task context by calling
+ * visit_groups_merge() only once with:
+ * 1) pmu=NULL
+ * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
+ * 3) Making can_add_hw a per-pmu variable
+ *
+ * Though, it cannot be optimized for per-cpu context because
+ * the per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
+ * consist of cgroup-subtrees, i.e. events of the same cgroup
+ * but different pmus are separated out into their respective
+ * pmu-subtrees.
+ */
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ can_add_hw = 1;
+ visit_groups_merge(ctx, &ctx->pinned_groups,
+ smp_processor_id(), pmu_ctx->pmu,
+ merge_sched_in, &can_add_hw);
+ }
+ }
}
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx)
+/* XXX .busy thingy from Peter's patch */
+static void ctx_flexible_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
{
+ struct perf_event_pmu_context *pmu_ctx;
int can_add_hw = 1;
- if (ctx != &cpuctx->ctx)
- cpuctx = NULL;
+ if (pmu) {
+ visit_groups_merge(ctx, &ctx->flexible_groups,
+ smp_processor_id(), pmu,
+ merge_sched_in, &can_add_hw);
+ } else {
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ can_add_hw = 1;
+ visit_groups_merge(ctx, &ctx->flexible_groups,
+ smp_processor_id(), pmu_ctx->pmu,
+ merge_sched_in, &can_add_hw);
+ }
+ }
+}
- visit_groups_merge(cpuctx, &ctx->flexible_groups,
- smp_processor_id(),
- merge_sched_in, &can_add_hw);
+static void __pmu_ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
+{
+ ctx_flexible_sched_in(ctx, pmu);
}
static void
-ctx_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type,
+ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type,
struct task_struct *task)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
int is_active = ctx->is_active;
u64 now;
@@ -3818,6 +3905,7 @@ ctx_sched_in(struct perf_event_context *ctx,
/* start ctx time */
now = perf_clock();
ctx->timestamp = now;
+ // XXX ctx->task =? task
perf_cgroup_set_timestamp(task, ctx);
}
@@ -3826,40 +3914,32 @@ ctx_sched_in(struct perf_event_context *ctx,
* in order to give them the best chance of going on.
*/
if (is_active & EVENT_PINNED)
- ctx_pinned_sched_in(ctx, cpuctx);
+ ctx_pinned_sched_in(ctx, NULL);
/* Then walk through the lower prio flexible groups */
if (is_active & EVENT_FLEXIBLE)
- ctx_flexible_sched_in(ctx, cpuctx);
+ ctx_flexible_sched_in(ctx, NULL);
}
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type,
- struct task_struct *task)
+static void perf_event_context_sched_in(struct task_struct *task)
{
- struct perf_event_context *ctx = &cpuctx->ctx;
-
- ctx_sched_in(ctx, cpuctx, event_type, task);
-}
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_context *ctx;
-static void perf_event_context_sched_in(struct perf_event_context *ctx,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
+ rcu_read_lock();
+ ctx = rcu_dereference(task->perf_event_ctxp);
+ if (!ctx)
+ goto rcu_unlock;
- cpuctx = __get_cpu_context(ctx);
+ if (cpuctx->task_ctx == ctx) {
+ perf_ctx_lock(cpuctx, ctx);
+ perf_ctx_disable(ctx);
- /*
- * HACK: for HETEROGENEOUS the task context might have switched to a
- * different PMU, force (re)set the context,
- */
- pmu = ctx->pmu = cpuctx->ctx.pmu;
+ perf_ctx_sched_task_cb(ctx, true);
- if (cpuctx->task_ctx == ctx) {
- if (cpuctx->sched_cb_usage)
- __perf_pmu_sched_task(cpuctx, true);
- return;
+ perf_ctx_enable(ctx);
+ perf_ctx_unlock(cpuctx, ctx);
+ goto rcu_unlock;
}
perf_ctx_lock(cpuctx, ctx);
@@ -3870,7 +3950,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
if (!ctx->nr_events)
goto unlock;
- perf_pmu_disable(pmu);
+ perf_ctx_disable(ctx);
/*
* We want to keep the following priority order:
* cpu pinned (that don't need to move), task pinned,
@@ -3879,17 +3959,24 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
* However, if task's ctx is not carrying any pinned
* events, no need to flip the cpuctx's events around.
*/
- if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+ if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
+ perf_ctx_disable(&cpuctx->ctx);
+ ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
+ }
+
perf_event_sched_in(cpuctx, ctx, task);
- if (cpuctx->sched_cb_usage && pmu->sched_task)
- pmu->sched_task(cpuctx->task_ctx, true);
+ perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
- perf_pmu_enable(pmu);
+ if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
+ perf_ctx_enable(&cpuctx->ctx);
+
+ perf_ctx_enable(ctx);
unlock:
perf_ctx_unlock(cpuctx, ctx);
+rcu_unlock:
+ rcu_read_unlock();
}
/*
@@ -3906,9 +3993,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
void __perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task)
{
- struct perf_event_context *ctx;
- int ctxn;
-
/*
* If cgroup events exist on this CPU, then we need to check if we have
* to switch in PMU state; cgroup event are system-wide mode only.
@@ -3919,13 +4003,7 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);
- for_each_task_context_nr(ctxn) {
- ctx = task->perf_event_ctxp[ctxn];
- if (likely(!ctx))
- continue;
-
- perf_event_context_sched_in(ctx, task);
- }
+ perf_event_context_sched_in(task);
if (atomic_read(&nr_switch_events))
perf_event_switch(task, prev, true);
@@ -4044,8 +4122,8 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
* events. At the same time, make sure, having freq events does not change
* the rate of unthrottling as that would introduce bias.
*/
-static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
- int needs_unthr)
+static void
+perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
{
struct perf_event *event;
struct hw_perf_event *hwc;
@@ -4057,16 +4135,16 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
* - context have events in frequency mode (needs freq adjust)
* - there are events to unthrottle on this cpu
*/
- if (!(ctx->nr_freq || needs_unthr))
+ if (!(ctx->nr_freq || unthrottle))
return;
raw_spin_lock(&ctx->lock);
- perf_pmu_disable(ctx->pmu);
list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->state != PERF_EVENT_STATE_ACTIVE)
continue;
+ // XXX use visit thingy to avoid the -1,cpu match
if (!event_filter_match(event))
continue;
@@ -4107,7 +4185,6 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
perf_pmu_enable(event->pmu);
}
- perf_pmu_enable(ctx->pmu);
raw_spin_unlock(&ctx->lock);
}
@@ -4129,72 +4206,111 @@ static void rotate_ctx(struct perf_event_context *ctx, struct perf_event *event)
/* pick an event from the flexible_groups to rotate */
static inline struct perf_event *
-ctx_event_to_rotate(struct perf_event_context *ctx)
+ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
{
struct perf_event *event;
+ struct rb_node *node;
+ struct rb_root *tree;
+ struct __group_key key = {
+ .pmu = pmu_ctx->pmu,
+ };
/* pick the first active flexible event */
- event = list_first_entry_or_null(&ctx->flexible_active,
+ event = list_first_entry_or_null(&pmu_ctx->flexible_active,
struct perf_event, active_list);
+ if (event)
+ goto out;
/* if no active flexible event, pick the first event */
- if (!event) {
- event = rb_entry_safe(rb_first(&ctx->flexible_groups.tree),
- typeof(*event), group_node);
+ tree = &pmu_ctx->ctx->flexible_groups.tree;
+
+ if (!pmu_ctx->ctx->task) {
+ key.cpu = smp_processor_id();
+
+ node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+ if (node)
+ event = __node_2_pe(node);
+ goto out;
+ }
+
+ key.cpu = -1;
+ node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+ if (node) {
+ event = __node_2_pe(node);
+ goto out;
}
+ key.cpu = smp_processor_id();
+ node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+ if (node)
+ event = __node_2_pe(node);
+
+out:
/*
* Unconditionally clear rotate_necessary; if ctx_flexible_sched_in()
* finds there are unschedulable events, it will set it again.
*/
- ctx->rotate_necessary = 0;
+ pmu_ctx->rotate_necessary = 0;
return event;
}
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
struct perf_event *cpu_event = NULL, *task_event = NULL;
struct perf_event_context *task_ctx = NULL;
int cpu_rotate, task_rotate;
+ struct pmu *pmu;
/*
* Since we run this from IRQ context, nobody can install new
* events, thus the event count values are stable.
*/
- cpu_rotate = cpuctx->ctx.rotate_necessary;
+ cpu_epc = &cpc->epc;
+ pmu = cpu_epc->pmu;
+ task_epc = cpc->task_epc;
+
+ cpu_rotate = cpu_epc->rotate_necessary;
task_ctx = cpuctx->task_ctx;
- task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
+ task_rotate = task_epc ? task_epc->rotate_necessary : 0;
if (!(cpu_rotate || task_rotate))
return false;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- perf_pmu_disable(cpuctx->ctx.pmu);
+ perf_pmu_disable(pmu);
if (task_rotate)
- task_event = ctx_event_to_rotate(task_ctx);
+ task_event = ctx_event_to_rotate(task_epc);
if (cpu_rotate)
- cpu_event = ctx_event_to_rotate(&cpuctx->ctx);
+ cpu_event = ctx_event_to_rotate(cpu_epc);
/*
* As per the order given at ctx_resched() first 'pop' task flexible
* and then, if needed CPU flexible.
*/
- if (task_event || (task_ctx && cpu_event))
- ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
- if (cpu_event)
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+ if (task_event || (task_epc && cpu_event)) {
+ update_context_time(task_epc->ctx);
+ __pmu_ctx_sched_out(task_epc, EVENT_FLEXIBLE);
+ }
- if (task_event)
- rotate_ctx(task_ctx, task_event);
- if (cpu_event)
+ if (cpu_event) {
+ update_context_time(&cpuctx->ctx);
+ __pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE);
rotate_ctx(&cpuctx->ctx, cpu_event);
+ __pmu_ctx_sched_in(&cpuctx->ctx, pmu);
+ }
- perf_event_sched_in(cpuctx, task_ctx, current);
+ if (task_event)
+ rotate_ctx(task_epc->ctx, task_event);
- perf_pmu_enable(cpuctx->ctx.pmu);
+ if (task_event || (task_epc && cpu_event))
+ __pmu_ctx_sched_in(task_epc->ctx, pmu);
+
+ perf_pmu_enable(pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
return true;
@@ -4202,8 +4318,8 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
void perf_event_task_tick(void)
{
- struct list_head *head = this_cpu_ptr(&active_ctx_list);
- struct perf_event_context *ctx, *tmp;
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_context *ctx;
int throttled;
lockdep_assert_irqs_disabled();
@@ -4212,8 +4328,13 @@ void perf_event_task_tick(void)
throttled = __this_cpu_xchg(perf_throttled_count, 0);
tick_dep_clear_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
- list_for_each_entry_safe(ctx, tmp, head, active_ctx_list)
- perf_adjust_freq_unthr_context(ctx, throttled);
+ perf_adjust_freq_unthr_context(&cpuctx->ctx, !!throttled);
+
+ rcu_read_lock();
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx)
+ perf_adjust_freq_unthr_context(ctx, !!throttled);
+ rcu_read_unlock();
}
static int event_enable_on_exec(struct perf_event *event,
@@ -4235,9 +4356,9 @@ static int event_enable_on_exec(struct perf_event *event,
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
*/
-static void perf_event_enable_on_exec(int ctxn)
+static void perf_event_enable_on_exec(struct perf_event_context *ctx)
{
- struct perf_event_context *ctx, *clone_ctx = NULL;
+ struct perf_event_context *clone_ctx = NULL;
enum event_type_t event_type = 0;
struct perf_cpu_context *cpuctx;
struct perf_event *event;
@@ -4245,13 +4366,16 @@ static void perf_event_enable_on_exec(int ctxn)
int enabled = 0;
local_irq_save(flags);
- ctx = current->perf_event_ctxp[ctxn];
- if (!ctx || !ctx->nr_events)
+ if (WARN_ON_ONCE(current->perf_event_ctxp != ctx))
+ goto out;
+
+ if (!ctx->nr_events)
goto out;
- cpuctx = __get_cpu_context(ctx);
+ cpuctx = this_cpu_ptr(&cpu_context);
perf_ctx_lock(cpuctx, ctx);
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
+
list_for_each_entry(event, &ctx->event_list, event_entry) {
enabled |= event_enable_on_exec(event, ctx);
event_type |= get_event_type(event);
@@ -4264,7 +4388,7 @@ static void perf_event_enable_on_exec(int ctxn)
clone_ctx = unclone_ctx(ctx);
ctx_resched(cpuctx, ctx, event_type);
} else {
- ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+ ctx_sched_in(ctx, EVENT_TIME, current);
}
perf_ctx_unlock(cpuctx, ctx);
@@ -4283,17 +4407,15 @@ static void perf_event_exit_event(struct perf_event *event,
* Removes all events from the current task that have been marked
* remove-on-exec, and feeds their values back to parent events.
*/
-static void perf_event_remove_on_exec(int ctxn)
+static void perf_event_remove_on_exec(struct perf_event_context *ctx)
{
- struct perf_event_context *ctx, *clone_ctx = NULL;
+ struct perf_event_context *clone_ctx = NULL;
struct perf_event *event, *next;
LIST_HEAD(free_list);
unsigned long flags;
bool modified = false;
- ctx = perf_pin_task_context(current, ctxn);
- if (!ctx)
- return;
+ perf_pin_task_context(current);
mutex_lock(&ctx->mutex);
@@ -4357,7 +4479,7 @@ static void __perf_event_read(void *info)
struct perf_read_data *data = info;
struct perf_event *sub, *event = data->event;
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct pmu *pmu = event->pmu;
/*
@@ -4572,17 +4694,25 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
{
raw_spin_lock_init(&ctx->lock);
mutex_init(&ctx->mutex);
- INIT_LIST_HEAD(&ctx->active_ctx_list);
+ INIT_LIST_HEAD(&ctx->pmu_ctx_list);
perf_event_groups_init(&ctx->pinned_groups);
perf_event_groups_init(&ctx->flexible_groups);
INIT_LIST_HEAD(&ctx->event_list);
- INIT_LIST_HEAD(&ctx->pinned_active);
- INIT_LIST_HEAD(&ctx->flexible_active);
refcount_set(&ctx->refcount, 1);
}
+static void
+__perf_init_event_pmu_context(struct perf_event_pmu_context *epc, struct pmu *pmu)
+{
+ epc->pmu = pmu;
+ INIT_LIST_HEAD(&epc->pmu_ctx_entry);
+ INIT_LIST_HEAD(&epc->pinned_active);
+ INIT_LIST_HEAD(&epc->flexible_active);
+ atomic_set(&epc->refcount, 1);
+}
+
static struct perf_event_context *
-alloc_perf_context(struct pmu *pmu, struct task_struct *task)
+alloc_perf_context(struct task_struct *task)
{
struct perf_event_context *ctx;
@@ -4593,7 +4723,6 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
__perf_event_init_context(ctx);
if (task)
ctx->task = get_task_struct(task);
- ctx->pmu = pmu;
return ctx;
}
@@ -4622,15 +4751,12 @@ find_lively_task_by_vpid(pid_t vpid)
* Returns a matching context with refcount and pincount.
*/
static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task,
- struct perf_event *event)
+find_get_context(struct task_struct *task, struct perf_event *event)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_cpu_context *cpuctx;
- void *task_ctx_data = NULL;
unsigned long flags;
- int ctxn, err;
- int cpu = event->cpu;
+ int err;
if (!task) {
/* Must be root to operate on a CPU event: */
@@ -4638,7 +4764,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
if (err)
return ERR_PTR(err);
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
+ cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
ctx = &cpuctx->ctx;
get_ctx(ctx);
raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -4649,43 +4775,22 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
}
err = -EINVAL;
- ctxn = pmu->task_ctx_nr;
- if (ctxn < 0)
- goto errout;
+retry:
+ ctx = perf_lock_task_context(task, &flags);
+ if (ctx) {
+ clone_ctx = unclone_ctx(ctx);
+ ++ctx->pin_count;
- if (event->attach_state & PERF_ATTACH_TASK_DATA) {
- task_ctx_data = alloc_task_ctx_data(pmu);
- if (!task_ctx_data) {
- err = -ENOMEM;
- goto errout;
- }
- }
-
-retry:
- ctx = perf_lock_task_context(task, ctxn, &flags);
- if (ctx) {
- clone_ctx = unclone_ctx(ctx);
- ++ctx->pin_count;
-
- if (task_ctx_data && !ctx->task_ctx_data) {
- ctx->task_ctx_data = task_ctx_data;
- task_ctx_data = NULL;
- }
raw_spin_unlock_irqrestore(&ctx->lock, flags);
if (clone_ctx)
put_ctx(clone_ctx);
} else {
- ctx = alloc_perf_context(pmu, task);
+ ctx = alloc_perf_context(task);
err = -ENOMEM;
if (!ctx)
goto errout;
- if (task_ctx_data) {
- ctx->task_ctx_data = task_ctx_data;
- task_ctx_data = NULL;
- }
-
err = 0;
mutex_lock(&task->perf_event_mutex);
/*
@@ -4694,12 +4799,12 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
*/
if (task->flags & PF_EXITING)
err = -ESRCH;
- else if (task->perf_event_ctxp[ctxn])
+ else if (task->perf_event_ctxp)
err = -EAGAIN;
else {
get_ctx(ctx);
++ctx->pin_count;
- rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+ rcu_assign_pointer(task->perf_event_ctxp, ctx);
}
mutex_unlock(&task->perf_event_mutex);
@@ -4712,14 +4817,117 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
}
}
- free_task_ctx_data(pmu, task_ctx_data);
return ctx;
errout:
- free_task_ctx_data(pmu, task_ctx_data);
return ERR_PTR(err);
}
+struct perf_event_pmu_context *
+find_get_pmu_context(struct pmu *pmu, struct perf_event_context *ctx,
+ struct perf_event *event)
+{
+ struct perf_event_pmu_context *new = NULL, *epc;
+ void *task_ctx_data = NULL;
+
+ if (!ctx->task) {
+ struct perf_cpu_pmu_context *cpc;
+
+ cpc = per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
+ epc = &cpc->epc;
+
+ if (!epc->ctx) {
+ atomic_set(&epc->refcount, 1);
+ epc->embedded = 1;
+ raw_spin_lock_irq(&ctx->lock);
+ list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+ epc->ctx = ctx;
+ raw_spin_unlock_irq(&ctx->lock);
+ } else {
+ WARN_ON_ONCE(epc->ctx != ctx);
+ atomic_inc(&epc->refcount);
+ }
+
+ return epc;
+ }
+
+ new = kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return ERR_PTR(-ENOMEM);
+
+ if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+ task_ctx_data = alloc_task_ctx_data(pmu);
+ if (!task_ctx_data) {
+ kfree(new);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+
+ __perf_init_event_pmu_context(new, pmu);
+
+ raw_spin_lock_irq(&ctx->lock);
+ list_for_each_entry(epc, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ if (epc->pmu == pmu) {
+ WARN_ON_ONCE(epc->ctx != ctx);
+ atomic_inc(&epc->refcount);
+ goto found_epc;
+ }
+ }
+
+ epc = new;
+ new = NULL;
+
+ list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+ epc->ctx = ctx;
+
+found_epc:
+ if (task_ctx_data && !epc->task_ctx_data) {
+ epc->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ ctx->nr_task_data++;
+ }
+ raw_spin_unlock_irq(&ctx->lock);
+
+ free_task_ctx_data(pmu, task_ctx_data);
+ kfree(new);
+
+ return epc;
+}
+
+static void get_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+ WARN_ON_ONCE(!atomic_inc_not_zero(&epc->refcount));
+}
+
+static void put_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+ unsigned long flags;
+
+ if (!atomic_dec_and_test(&epc->refcount))
+ return;
+
+ if (epc->ctx) {
+ struct perf_event_context *ctx = epc->ctx;
+
+ // XXX ctx->mutex
+
+ WARN_ON_ONCE(list_empty(&epc->pmu_ctx_entry));
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+ list_del_init(&epc->pmu_ctx_entry);
+ epc->ctx = NULL;
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+ }
+
+ WARN_ON_ONCE(!list_empty(&epc->pinned_active));
+ WARN_ON_ONCE(!list_empty(&epc->flexible_active));
+
+ if (epc->embedded)
+ return;
+
+ kfree(epc->task_ctx_data);
+ kfree(epc);
+}
+
static void perf_event_free_filter(struct perf_event *event);
static void free_event_rcu(struct rcu_head *head)
@@ -4988,6 +5196,9 @@ static void _free_event(struct perf_event *event)
if (event->hw.target)
put_task_struct(event->hw.target);
+ if (event->pmu_ctx)
+ put_pmu_ctx(event->pmu_ctx);
+
/*
* perf_event_free_task() relies on put_ctx() being 'last', in particular
* all task references must be cleaned up.
@@ -5518,7 +5729,7 @@ static void __perf_event_period(struct perf_event *event,
active = (event->state == PERF_EVENT_STATE_ACTIVE);
if (active) {
- perf_pmu_disable(ctx->pmu);
+ perf_pmu_disable(event->pmu);
/*
* We could be throttled; unthrottle now to avoid the tick
* trying to unthrottle while we already re-started the event.
@@ -5534,7 +5745,7 @@ static void __perf_event_period(struct perf_event *event,
if (active) {
event->pmu->start(event, PERF_EF_RELOAD);
- perf_pmu_enable(ctx->pmu);
+ perf_pmu_enable(event->pmu);
}
}
@@ -7617,7 +7828,6 @@ perf_iterate_sb(perf_iterate_f output, void *data,
struct perf_event_context *task_ctx)
{
struct perf_event_context *ctx;
- int ctxn;
rcu_read_lock();
preempt_disable();
@@ -7634,11 +7844,9 @@ perf_iterate_sb(perf_iterate_f output, void *data,
perf_iterate_sb_cpu(output, data);
- for_each_task_context_nr(ctxn) {
- ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx)
- perf_iterate_ctx(ctx, output, data, false);
- }
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx)
+ perf_iterate_ctx(ctx, output, data, false);
done:
preempt_enable();
rcu_read_unlock();
@@ -7680,20 +7888,15 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
void perf_event_exec(void)
{
struct perf_event_context *ctx;
- int ctxn;
-
- for_each_task_context_nr(ctxn) {
- perf_event_enable_on_exec(ctxn);
- perf_event_remove_on_exec(ctxn);
- rcu_read_lock();
- ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx) {
- perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
- NULL, true);
- }
- rcu_read_unlock();
+ rcu_read_lock();
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx) {
+ perf_event_enable_on_exec(ctx);
+ perf_event_remove_on_exec(ctx);
+ perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
}
+ rcu_read_unlock();
}
struct remote_output {
@@ -7733,8 +7936,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
static int __perf_pmu_output_stop(void *info)
{
struct perf_event *event = info;
- struct pmu *pmu = event->ctx->pmu;
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct remote_output ro = {
.rb = event->rb,
};
@@ -8523,7 +8725,6 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
static void perf_addr_filters_adjust(struct vm_area_struct *vma)
{
struct perf_event_context *ctx;
- int ctxn;
/*
* Data tracing isn't supported yet and as such there is no need
@@ -8533,13 +8734,9 @@ static void perf_addr_filters_adjust(struct vm_area_struct *vma)
return;
rcu_read_lock();
- for_each_task_context_nr(ctxn) {
- ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (!ctx)
- continue;
-
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx)
perf_iterate_ctx(ctx, __perf_addr_filters_adjust, vma, true);
- }
rcu_read_unlock();
}
@@ -9718,10 +9915,13 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
struct trace_entry *entry = record;
rcu_read_lock();
- ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
+ ctx = rcu_dereference(task->perf_event_ctxp);
if (!ctx)
goto unlock;
+ // XXX iterate groups instead, we should be able to
+ // find the subtree for the perf_tracepoint pmu and CPU.
+
list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->cpu != smp_processor_id())
continue;
@@ -10850,36 +11050,9 @@ static int perf_event_idx_default(struct perf_event *event)
return 0;
}
-/*
- * Ensures all contexts with the same task_ctx_nr have the same
- * pmu_cpu_context too.
- */
-static struct perf_cpu_context __percpu *find_pmu_context(int ctxn)
-{
- struct pmu *pmu;
-
- if (ctxn < 0)
- return NULL;
-
- list_for_each_entry(pmu, &pmus, entry) {
- if (pmu->task_ctx_nr == ctxn)
- return pmu->pmu_cpu_context;
- }
-
- return NULL;
-}
-
static void free_pmu_context(struct pmu *pmu)
{
- /*
- * Static contexts such as perf_sw_context have a global lifetime
- * and may be shared between different PMUs. Avoid freeing them
- * when a single PMU is going away.
- */
- if (pmu->task_ctx_nr > perf_invalid_context)
- return;
-
- free_percpu(pmu->pmu_cpu_context);
+ free_percpu(pmu->cpu_pmu_context);
}
/*
@@ -10943,12 +11116,12 @@ perf_event_mux_interval_ms_store(struct device *dev,
/* update all cpuctx for this PMU */
cpus_read_lock();
for_each_online_cpu(cpu) {
- struct perf_cpu_context *cpuctx;
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
+ struct perf_cpu_pmu_context *cpc;
+ cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+ cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
cpu_function_call(cpu,
- (remote_function_f)perf_mux_hrtimer_restart, cpuctx);
+ (remote_function_f)perf_mux_hrtimer_restart, cpc);
}
cpus_read_unlock();
mutex_unlock(&mux_interval_mutex);
@@ -11059,47 +11232,19 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
}
skip_type:
- if (pmu->task_ctx_nr == perf_hw_context) {
- static int hw_context_taken = 0;
-
- /*
- * Other than systems with heterogeneous CPUs, it never makes
- * sense for two PMUs to share perf_hw_context. PMUs which are
- * uncore must use perf_invalid_context.
- */
- if (WARN_ON_ONCE(hw_context_taken &&
- !(pmu->capabilities & PERF_PMU_CAP_HETEROGENEOUS_CPUS)))
- pmu->task_ctx_nr = perf_invalid_context;
-
- hw_context_taken = 1;
- }
-
- pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
- if (pmu->pmu_cpu_context)
- goto got_cpu_context;
-
ret = -ENOMEM;
- pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
- if (!pmu->pmu_cpu_context)
+ pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
+ if (!pmu->cpu_pmu_context)
goto free_dev;
for_each_possible_cpu(cpu) {
- struct perf_cpu_context *cpuctx;
+ struct perf_cpu_pmu_context *cpc;
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- __perf_event_init_context(&cpuctx->ctx);
- lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
- lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
- cpuctx->ctx.pmu = pmu;
- cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
-
- __perf_mux_hrtimer_init(cpuctx, cpu);
-
- cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
- cpuctx->heap = cpuctx->heap_default;
+ cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+ __perf_init_event_pmu_context(&cpc->epc, pmu);
+ __perf_mux_hrtimer_init(cpc, cpu);
}
-got_cpu_context:
if (!pmu->start_txn) {
if (pmu->pmu_enable) {
/*
@@ -11578,10 +11723,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
}
/*
- * Disallow uncore-cgroup events, they don't make sense as the cgroup will
- * be different on other CPUs in the uncore mask.
+ * Disallow uncore-task events. Similarly, disallow uncore-cgroup
+ * events (they don't make sense as the cgroup will be different
+ * on other CPUs in the uncore mask).
*/
- if (pmu->task_ctx_nr == perf_invalid_context && cgroup_fd != -1) {
+ if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
err = -EINVAL;
goto err_pmu;
}
@@ -11913,37 +12059,6 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
return 0;
}
-/*
- * Variation on perf_event_ctx_lock_nested(), except we take two context
- * mutexes.
- */
-static struct perf_event_context *
-__perf_event_ctx_lock_double(struct perf_event *group_leader,
- struct perf_event_context *ctx)
-{
- struct perf_event_context *gctx;
-
-again:
- rcu_read_lock();
- gctx = READ_ONCE(group_leader->ctx);
- if (!refcount_inc_not_zero(&gctx->refcount)) {
- rcu_read_unlock();
- goto again;
- }
- rcu_read_unlock();
-
- mutex_lock_double(&gctx->mutex, &ctx->mutex);
-
- if (group_leader->ctx != gctx) {
- mutex_unlock(&ctx->mutex);
- mutex_unlock(&gctx->mutex);
- put_ctx(gctx);
- goto again;
- }
-
- return gctx;
-}
-
static bool
perf_check_permission(struct perf_event_attr *attr, struct task_struct *task)
{
@@ -11989,9 +12104,10 @@ SYSCALL_DEFINE5(perf_event_open,
pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
{
struct perf_event *group_leader = NULL, *output_event = NULL;
+ struct perf_event_pmu_context *pmu_ctx;
struct perf_event *event, *sibling;
struct perf_event_attr attr;
- struct perf_event_context *ctx, *gctx;
+ struct perf_event_context *ctx;
struct file *event_file = NULL;
struct fd group = {NULL, 0};
struct task_struct *task = NULL;
@@ -12099,6 +12215,8 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_task;
}
+ // XXX premature; what if this is allowed, but we get moved to a PMU
+ // that doesn't have this.
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;
@@ -12121,42 +12239,37 @@ SYSCALL_DEFINE5(perf_event_open,
if (pmu->task_ctx_nr == perf_sw_context)
event->event_caps |= PERF_EV_CAP_SOFTWARE;
- if (group_leader) {
- if (is_software_event(event) &&
- !in_software_context(group_leader)) {
- /*
- * If the event is a sw event, but the group_leader
- * is on hw context.
- *
- * Allow the addition of software events to hw
- * groups, this is safe because software events
- * never fail to schedule.
- */
- pmu = group_leader->ctx->pmu;
- } else if (!is_software_event(event) &&
- is_software_event(group_leader) &&
- (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
- /*
- * In case the group is a pure software group, and we
- * try to add a hardware event, move the whole group to
- * the hardware context.
- */
- move_group = 1;
- }
- }
-
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(pmu, task, event);
+ ctx = find_get_context(task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_alloc;
}
- /*
- * Look up the group leader (we will attach this event to it):
- */
+ mutex_lock(&ctx->mutex);
+
+ if (ctx->task == TASK_TOMBSTONE) {
+ err = -ESRCH;
+ goto err_locked;
+ }
+
+ if (!task) {
+ /*
+ * Check if the @cpu we're creating an event for is online.
+ *
+ * We use the perf_cpu_context::ctx::mutex to serialize against
+ * the hotplug notifiers. See perf_event_{init,exit}_cpu().
+ */
+ struct perf_cpu_context *cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
+
+ if (!cpuctx->online) {
+ err = -ENODEV;
+ goto err_locked;
+ }
+ }
+
if (group_leader) {
err = -EINVAL;
@@ -12165,11 +12278,11 @@ SYSCALL_DEFINE5(perf_event_open,
* becoming part of another group-sibling):
*/
if (group_leader->group_leader != group_leader)
- goto err_context;
+ goto err_locked;
/* All events in a group should have the same clock */
if (group_leader->clock != event->clock)
- goto err_context;
+ goto err_locked;
/*
* Make sure we're both events for the same CPU;
@@ -12177,29 +12290,52 @@ SYSCALL_DEFINE5(perf_event_open,
* you can never concurrently schedule them anyhow.
*/
if (group_leader->cpu != event->cpu)
- goto err_context;
-
- /*
- * Make sure we're both on the same task, or both
- * per-CPU events.
- */
- if (group_leader->ctx->task != ctx->task)
- goto err_context;
+ goto err_locked;
/*
- * Do not allow to attach to a group in a different task
- * or CPU context. If we're moving SW events, we'll fix
- * this up later, so allow that.
+ * Make sure we're both on the same context; either task or cpu.
*/
- if (!move_group && group_leader->ctx != ctx)
- goto err_context;
+ if (group_leader->ctx != ctx)
+ goto err_locked;
/*
* Only a group leader can be exclusive or pinned
*/
if (attr.exclusive || attr.pinned)
- goto err_context;
+ goto err_locked;
+
+ if (is_software_event(event) &&
+ !in_software_context(group_leader)) {
+ /*
+ * If the event is a sw event, but the group_leader
+ * is on hw context.
+ *
+ * Allow the addition of software events to hw
+ * groups, this is safe because software events
+ * never fail to schedule.
+ */
+ pmu = group_leader->pmu_ctx->pmu;
+ } else if (!is_software_event(event) &&
+ is_software_event(group_leader) &&
+ (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
+ /*
+ * In case the group is a pure software group, and we
+ * try to add a hardware event, move the whole group to
+ * the hardware context.
+ */
+ move_group = 1;
+ }
+ }
+
+ /*
+ * Now that we're certain of the pmu, find the pmu_ctx.
+ */
+ pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+ if (IS_ERR(pmu_ctx)) {
+ err = PTR_ERR(pmu_ctx);
+ goto err_locked;
}
+ event->pmu_ctx = pmu_ctx;
if (output_event) {
err = perf_event_set_output(event, output_event);
@@ -12207,8 +12343,7 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}
- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
- f_flags);
+ event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
if (IS_ERR(event_file)) {
err = PTR_ERR(event_file);
event_file = NULL;
@@ -12231,77 +12366,14 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}
- if (move_group) {
- gctx = __perf_event_ctx_lock_double(group_leader, ctx);
-
- if (gctx->task == TASK_TOMBSTONE) {
- err = -ESRCH;
- goto err_locked;
- }
-
- /*
- * Check if we raced against another sys_perf_event_open() call
- * moving the software group underneath us.
- */
- if (!(group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
- /*
- * If someone moved the group out from under us, check
- * if this new event wound up on the same ctx, if so
- * its the regular !move_group case, otherwise fail.
- */
- if (gctx != ctx) {
- err = -EINVAL;
- goto err_locked;
- } else {
- perf_event_ctx_unlock(group_leader, gctx);
- move_group = 0;
- }
- }
-
- /*
- * Failure to create exclusive events returns -EBUSY.
- */
- err = -EBUSY;
- if (!exclusive_event_installable(group_leader, ctx))
- goto err_locked;
-
- for_each_sibling_event(sibling, group_leader) {
- if (!exclusive_event_installable(sibling, ctx))
- goto err_locked;
- }
- } else {
- mutex_lock(&ctx->mutex);
- }
-
- if (ctx->task == TASK_TOMBSTONE) {
- err = -ESRCH;
- goto err_locked;
- }
-
if (!perf_event_validate_size(event)) {
err = -E2BIG;
- goto err_locked;
- }
-
- if (!task) {
- /*
- * Check if the @cpu we're creating an event for is online.
- *
- * We use the perf_cpu_context::ctx::mutex to serialize against
- * the hotplug notifiers. See perf_event_{init,exit}_cpu().
- */
- struct perf_cpu_context *cpuctx =
- container_of(ctx, struct perf_cpu_context, ctx);
-
- if (!cpuctx->online) {
- err = -ENODEV;
- goto err_locked;
- }
+ goto err_cred;
}
if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader)) {
err = -EINVAL;
- goto err_locked;
+ goto err_cred;
}
/*
@@ -12310,7 +12382,7 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (!exclusive_event_installable(event, ctx)) {
err = -EBUSY;
- goto err_locked;
+ goto err_cred;
}
WARN_ON_ONCE(ctx->parent_ctx);
@@ -12321,24 +12393,14 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (move_group) {
- /*
- * See perf_event_ctx_lock() for comments on the details
- * of swizzling perf_event::ctx.
- */
perf_remove_from_context(group_leader, 0);
- put_ctx(gctx);
+ put_pmu_ctx(group_leader->pmu_ctx);
for_each_sibling_event(sibling, group_leader) {
perf_remove_from_context(sibling, 0);
- put_ctx(gctx);
+ put_pmu_ctx(sibling->pmu_ctx);
}
- /*
- * Wait for everybody to stop referencing the events through
- * the old lists, before installing it on new lists.
- */
- synchronize_rcu();
-
/*
* Install the group siblings before the group leader.
*
@@ -12350,9 +12412,10 @@ SYSCALL_DEFINE5(perf_event_open,
* reachable through the group lists.
*/
for_each_sibling_event(sibling, group_leader) {
+ sibling->pmu_ctx = pmu_ctx;
+ get_pmu_ctx(pmu_ctx);
perf_event__state_init(sibling);
perf_install_in_context(ctx, sibling, sibling->cpu);
- get_ctx(ctx);
}
/*
@@ -12360,9 +12423,10 @@ SYSCALL_DEFINE5(perf_event_open,
* event. What we want here is event in the initial
* startup state, ready to be add into new context.
*/
+ group_leader->pmu_ctx = pmu_ctx;
+ get_pmu_ctx(pmu_ctx);
perf_event__state_init(group_leader);
perf_install_in_context(ctx, group_leader, group_leader->cpu);
- get_ctx(ctx);
}
/*
@@ -12379,8 +12443,6 @@ SYSCALL_DEFINE5(perf_event_open,
perf_install_in_context(ctx, event, event->cpu);
perf_unpin_context(ctx);
- if (move_group)
- perf_event_ctx_unlock(group_leader, gctx);
mutex_unlock(&ctx->mutex);
if (task) {
@@ -12402,16 +12464,15 @@ SYSCALL_DEFINE5(perf_event_open,
fd_install(event_fd, event_file);
return event_fd;
-err_locked:
- if (move_group)
- perf_event_ctx_unlock(group_leader, gctx);
- mutex_unlock(&ctx->mutex);
err_cred:
if (task)
up_read(&task->signal->exec_update_lock);
err_file:
fput(event_file);
err_context:
+ /* event->pmu_ctx freed by free_event() */
+err_locked:
+ mutex_unlock(&ctx->mutex);
perf_unpin_context(ctx);
put_ctx(ctx);
err_alloc:
@@ -12446,8 +12507,10 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
perf_overflow_handler_t overflow_handler,
void *context)
{
+ struct perf_event_pmu_context *pmu_ctx;
struct perf_event_context *ctx;
struct perf_event *event;
+ struct pmu *pmu;
int err;
/*
@@ -12466,15 +12529,31 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
/* Mark owner so we could distinguish it from user events. */
event->owner = TASK_TOMBSTONE;
+ pmu = event->pmu;
+
+ if (pmu->task_ctx_nr < 0 && task) {
+ err = -EINVAL;
+ goto err_alloc;
+ }
+
+ if (pmu->task_ctx_nr == perf_sw_context)
+ event->event_caps |= PERF_EV_CAP_SOFTWARE;
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(event->pmu, task, event);
+ ctx = find_get_context(task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
- goto err_free;
+ goto err_alloc;
+ }
+
+ pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+ if (IS_ERR(pmu_ctx)) {
+ err = PTR_ERR(pmu_ctx);
+ goto err_ctx;
}
+ event->pmu_ctx = pmu_ctx;
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
@@ -12511,9 +12590,10 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
err_unlock:
mutex_unlock(&ctx->mutex);
+err_ctx:
perf_unpin_context(ctx);
put_ctx(ctx);
-err_free:
+err_alloc:
free_event(event);
err:
return ERR_PTR(err);
@@ -12522,6 +12602,7 @@ EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
{
+#if 0 // XXX buggered - cpu hotplug, who cares
struct perf_event_context *src_ctx;
struct perf_event_context *dst_ctx;
struct perf_event *event, *tmp;
@@ -12582,6 +12663,7 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
}
mutex_unlock(&dst_ctx->mutex);
mutex_unlock(&src_ctx->mutex);
+#endif
}
EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
@@ -12659,14 +12741,14 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
+static void perf_event_exit_task_context(struct task_struct *child)
{
struct perf_event_context *child_ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
WARN_ON_ONCE(child != current);
- child_ctx = perf_pin_task_context(child, ctxn);
+ child_ctx = perf_pin_task_context(child);
if (!child_ctx)
return;
@@ -12688,13 +12770,13 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* in.
*/
raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+ task_ctx_sched_out(child_ctx, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
* and mark the context dead.
*/
- RCU_INIT_POINTER(child->perf_event_ctxp[ctxn], NULL);
+ RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
put_ctx(child_ctx); /* cannot be last */
WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
put_task_struct(current); /* cannot be last */
@@ -12729,7 +12811,6 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
void perf_event_exit_task(struct task_struct *child)
{
struct perf_event *event, *tmp;
- int ctxn;
mutex_lock(&child->perf_event_mutex);
list_for_each_entry_safe(event, tmp, &child->perf_event_list,
@@ -12745,8 +12826,7 @@ void perf_event_exit_task(struct task_struct *child)
}
mutex_unlock(&child->perf_event_mutex);
- for_each_task_context_nr(ctxn)
- perf_event_exit_task_context(child, ctxn);
+ perf_event_exit_task_context(child);
/*
* The perf_event_exit_task_context calls perf_event_task
@@ -12789,56 +12869,51 @@ void perf_event_free_task(struct task_struct *task)
{
struct perf_event_context *ctx;
struct perf_event *event, *tmp;
- int ctxn;
- for_each_task_context_nr(ctxn) {
- ctx = task->perf_event_ctxp[ctxn];
- if (!ctx)
- continue;
+ ctx = rcu_dereference(task->perf_event_ctxp);
+ if (!ctx)
+ return;
- mutex_lock(&ctx->mutex);
- raw_spin_lock_irq(&ctx->lock);
- /*
- * Destroy the task <-> ctx relation and mark the context dead.
- *
- * This is important because even though the task hasn't been
- * exposed yet the context has been (through child_list).
- */
- RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], NULL);
- WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(task); /* cannot be last */
- raw_spin_unlock_irq(&ctx->lock);
+ mutex_lock(&ctx->mutex);
+ raw_spin_lock_irq(&ctx->lock);
+ /*
+ * Destroy the task <-> ctx relation and mark the context dead.
+ *
+ * This is important because even though the task hasn't been
+ * exposed yet the context has been (through child_list).
+ */
+ RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+ WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
+ put_task_struct(task); /* cannot be last */
+ raw_spin_unlock_irq(&ctx->lock);
- list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
- perf_free_event(event, ctx);
- mutex_unlock(&ctx->mutex);
+ list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
+ perf_free_event(event, ctx);
- /*
- * perf_event_release_kernel() could've stolen some of our
- * child events and still have them on its free_list. In that
- * case we must wait for these events to have been freed (in
- * particular all their references to this task must've been
- * dropped).
- *
- * Without this copy_process() will unconditionally free this
- * task (irrespective of its reference count) and
- * _free_event()'s put_task_struct(event->hw.target) will be a
- * use-after-free.
- *
- * Wait for all events to drop their context reference.
- */
- wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
- put_ctx(ctx); /* must be last */
- }
+ mutex_unlock(&ctx->mutex);
+
+ /*
+ * perf_event_release_kernel() could've stolen some of our
+ * child events and still have them on its free_list. In that
+ * case we must wait for these events to have been freed (in
+ * particular all their references to this task must've been
+ * dropped).
+ *
+ * Without this copy_process() will unconditionally free this
+ * task (irrespective of its reference count) and
+ * _free_event()'s put_task_struct(event->hw.target) will be a
+ * use-after-free.
+ *
+ * Wait for all events to drop their context reference.
+ */
+ wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
+ put_ctx(ctx); /* must be last */
}
void perf_event_delayed_put(struct task_struct *task)
{
- int ctxn;
-
- for_each_task_context_nr(ctxn)
- WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
+ WARN_ON_ONCE(task->perf_event_ctxp);
}
struct file *perf_event_get(unsigned int fd)
@@ -12888,6 +12963,7 @@ inherit_event(struct perf_event *parent_event,
struct perf_event_context *child_ctx)
{
enum perf_event_state parent_state = parent_event->state;
+ struct perf_event_pmu_context *pmu_ctx;
struct perf_event *child_event;
unsigned long flags;
@@ -12908,17 +12984,12 @@ inherit_event(struct perf_event *parent_event,
if (IS_ERR(child_event))
return child_event;
-
- if ((child_event->attach_state & PERF_ATTACH_TASK_DATA) &&
- !child_ctx->task_ctx_data) {
- struct pmu *pmu = child_event->pmu;
-
- child_ctx->task_ctx_data = alloc_task_ctx_data(pmu);
- if (!child_ctx->task_ctx_data) {
- free_event(child_event);
- return ERR_PTR(-ENOMEM);
- }
+ pmu_ctx = find_get_pmu_context(child_event->pmu, child_ctx, child_event);
+ if (IS_ERR(pmu_ctx)) {
+ free_event(child_event);
+ return ERR_CAST(pmu_ctx);
+ }
+ child_event->pmu_ctx = pmu_ctx;
/*
* is_orphaned_event() and list_add_tail(&parent_event->child_list)
@@ -13041,11 +13112,11 @@ static int inherit_group(struct perf_event *parent_event,
static int
inherit_task_group(struct perf_event *event, struct task_struct *parent,
struct perf_event_context *parent_ctx,
- struct task_struct *child, int ctxn,
+ struct task_struct *child,
u64 clone_flags, int *inherited_all)
{
- int ret;
struct perf_event_context *child_ctx;
+ int ret;
if (!event->attr.inherit ||
(event->attr.inherit_thread && !(clone_flags & CLONE_THREAD)) ||
@@ -13055,7 +13126,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
return 0;
}
- child_ctx = child->perf_event_ctxp[ctxn];
+ child_ctx = child->perf_event_ctxp;
if (!child_ctx) {
/*
* This is executed from the parent task context, so
@@ -13063,16 +13134,14 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
* First allocate and initialize a context for the
* child.
*/
- child_ctx = alloc_perf_context(parent_ctx->pmu, child);
+ child_ctx = alloc_perf_context(child);
if (!child_ctx)
return -ENOMEM;
- child->perf_event_ctxp[ctxn] = child_ctx;
+ child->perf_event_ctxp = child_ctx;
}
- ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
-
+ ret = inherit_group(event, parent, parent_ctx, child, child_ctx);
if (ret)
*inherited_all = 0;
@@ -13082,8 +13151,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
/*
* Initialize the perf_event context in task_struct
*/
-static int perf_event_init_context(struct task_struct *child, int ctxn,
- u64 clone_flags)
+static int perf_event_init_context(struct task_struct *child, u64 clone_flags)
{
struct perf_event_context *child_ctx, *parent_ctx;
struct perf_event_context *cloned_ctx;
@@ -13093,14 +13161,14 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
unsigned long flags;
int ret = 0;
- if (likely(!parent->perf_event_ctxp[ctxn]))
+ if (likely(!parent->perf_event_ctxp))
return 0;
/*
* If the parent's context is a clone, pin it so it won't get
* swapped under us.
*/
- parent_ctx = perf_pin_task_context(parent, ctxn);
+ parent_ctx = perf_pin_task_context(parent);
if (!parent_ctx)
return 0;
@@ -13123,8 +13191,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
*/
perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, clone_flags,
- &inherited_all);
+ child, clone_flags, &inherited_all);
if (ret)
goto out_unlock;
}
@@ -13140,8 +13207,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
perf_event_groups_for_each(event, &parent_ctx->flexible_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, clone_flags,
- &inherited_all);
+ child, clone_flags, &inherited_all);
if (ret)
goto out_unlock;
}
@@ -13149,7 +13215,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
raw_spin_lock_irqsave(&parent_ctx->lock, flags);
parent_ctx->rotate_disable = 0;
- child_ctx = child->perf_event_ctxp[ctxn];
+ child_ctx = child->perf_event_ctxp;
if (child_ctx && inherited_all) {
/*
@@ -13185,18 +13251,16 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
*/
int perf_event_init_task(struct task_struct *child, u64 clone_flags)
{
- int ctxn, ret;
+ int ret;
- memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+ child->perf_event_ctxp = NULL;
mutex_init(&child->perf_event_mutex);
INIT_LIST_HEAD(&child->perf_event_list);
- for_each_task_context_nr(ctxn) {
- ret = perf_event_init_context(child, ctxn, clone_flags);
- if (ret) {
- perf_event_free_task(child);
- return ret;
- }
+ ret = perf_event_init_context(child, clone_flags);
+ if (ret) {
+ perf_event_free_task(child);
+ return ret;
}
return 0;
@@ -13205,6 +13269,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
static void __init perf_event_init_all_cpus(void)
{
struct swevent_htable *swhash;
+ struct perf_cpu_context *cpuctx;
int cpu;
zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL);
@@ -13212,7 +13277,6 @@ static void __init perf_event_init_all_cpus(void)
for_each_possible_cpu(cpu) {
swhash = &per_cpu(swevent_htable, cpu);
mutex_init(&swhash->hlist_mutex);
- INIT_LIST_HEAD(&per_cpu(active_ctx_list, cpu));
INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
@@ -13221,6 +13285,14 @@ static void __init perf_event_init_all_cpus(void)
INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
#endif
INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
+
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
+ __perf_event_init_context(&cpuctx->ctx);
+ lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
+ lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
+ cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
+ cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
+ cpuctx->heap = cpuctx->heap_default;
}
}
@@ -13242,12 +13314,12 @@ static void perf_swevent_init_cpu(unsigned int cpu)
#if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC_CORE
static void __perf_event_exit_context(void *__info)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *ctx = __info;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct perf_event *event;
raw_spin_lock(&ctx->lock);
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
list_for_each_entry(event, &ctx->event_list, event_entry)
__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
raw_spin_unlock(&ctx->lock);
@@ -13257,18 +13329,16 @@ static void perf_event_exit_cpu_context(int cpu)
{
struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;
- struct pmu *pmu;
+ // XXX simplify cpuctx->online
mutex_lock(&pmus_lock);
- list_for_each_entry(pmu, &pmus, entry) {
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- ctx = &cpuctx->ctx;
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
+ ctx = &cpuctx->ctx;
- mutex_lock(&ctx->mutex);
- smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
- cpuctx->online = 0;
- mutex_unlock(&ctx->mutex);
- }
+ mutex_lock(&ctx->mutex);
+ smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
+ cpuctx->online = 0;
+ mutex_unlock(&ctx->mutex);
cpumask_clear_cpu(cpu, perf_online_mask);
mutex_unlock(&pmus_lock);
}
@@ -13282,20 +13352,17 @@ int perf_event_init_cpu(unsigned int cpu)
{
struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;
- struct pmu *pmu;
perf_swevent_init_cpu(cpu);
mutex_lock(&pmus_lock);
cpumask_set_cpu(cpu, perf_online_mask);
- list_for_each_entry(pmu, &pmus, entry) {
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- ctx = &cpuctx->ctx;
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
+ ctx = &cpuctx->ctx;
- mutex_lock(&ctx->mutex);
- cpuctx->online = 1;
- mutex_unlock(&ctx->mutex);
- }
+ mutex_lock(&ctx->mutex);
+ cpuctx->online = 1;
+ mutex_unlock(&ctx->mutex);
mutex_unlock(&pmus_lock);
return 0;
--
2.27.0
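
For readers following the patch above: each event now takes a reference on a per-PMU sub-context (find_get_pmu_context(), get_pmu_ctx(), put_pmu_ctx()) rather than on one context per PMU. The following is only a minimal userspace sketch of that lookup-or-create and reference-counting idea. Every name in it (struct pmu_ctx, struct event_ctx, the model_* helpers) is a hypothetical stand-in, it does no locking, and it reports allocation failure with NULL where the kernel helper appears to use ERR_PTR().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct pmu { int id; };

struct pmu_ctx {
	struct pmu *pmu;
	int refcount;
	struct pmu_ctx *next;	/* link in the owning context's list */
};

struct event_ctx {
	struct pmu_ctx *pmu_ctx_list;	/* one entry per PMU in use */
};

/* Return the context's entry for @pmu with a reference held, creating it on first use. */
static struct pmu_ctx *model_find_get_pmu_ctx(struct event_ctx *ctx, struct pmu *pmu)
{
	struct pmu_ctx *epc;

	for (epc = ctx->pmu_ctx_list; epc; epc = epc->next) {
		if (epc->pmu == pmu) {
			epc->refcount++;
			return epc;
		}
	}

	epc = calloc(1, sizeof(*epc));
	if (!epc)
		return NULL;	/* assumption: the kernel helper returns ERR_PTR() instead */
	epc->pmu = pmu;
	epc->refcount = 1;
	epc->next = ctx->pmu_ctx_list;
	ctx->pmu_ctx_list = epc;
	return epc;
}

/* Take an extra reference, e.g. for an event being installed. */
static void model_get_pmu_ctx(struct pmu_ctx *epc)
{
	epc->refcount++;
}

/* Drop one reference; unlink and free the entry when the count hits zero. */
static void model_put_pmu_ctx(struct event_ctx *ctx, struct pmu_ctx *epc)
{
	struct pmu_ctx **pp;

	if (--epc->refcount)
		return;
	for (pp = &ctx->pmu_ctx_list; *pp; pp = &(*pp)->next) {
		if (*pp == epc) {
			*pp = epc->next;
			break;
		}
	}
	free(epc);
}
```

In this model, two events on the same PMU share one entry, and the entry disappears only after every holder has dropped its reference. That mirrors the paired put_pmu_ctx()/get_pmu_ctx() calls in the move_group path.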
Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: f7cf7134e405062bf0f22c3ba5637241c4c4d06a ("[RFC v2] perf: Rewrite core context handling")
url: https://github.com/0day-ci/linux/commits/Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a9f4a6e92b3b319296fb078da2615f618f6cd80c
patch link: https://lore.kernel.org/lkml/[email protected]
in testcase: trinity
version: trinity-x86_64-608712d8-1_20220112
with following parameters:
runtime: 300s
group: group-03
test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+------------------------------------------------------+------------+------------+
| | a9f4a6e92b | f7cf7134e4 |
+------------------------------------------------------+------------+------------+
| WARNING:at_kernel/events/core.c:#__pmu_ctx_sched_out | 0 | 12 |
| RIP:__pmu_ctx_sched_out | 0 | 12 |
+------------------------------------------------------+------------+------------+
If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>
[ 61.980728][ T8152] WARNING: CPU: 0 PID: 8152 at kernel/events/core.c:3234 __pmu_ctx_sched_out (kernel/events/core.c:3234 (discriminator 1))
[ 61.983310][ T8152] Modules linked in: can_bcm can_raw can cn scsi_transport_iscsi ipmi_msghandler sr_mod cdrom sg ata_generic
[ 61.986280][ T8152] CPU: 0 PID: 8152 Comm: trinity-c7 Not tainted 5.16.0-rc1-00018-gf7cf7134e405 #1
[ 61.988767][ T8152] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 61.991025][ T8152] RIP: 0010:__pmu_ctx_sched_out (kernel/events/core.c:3234 (discriminator 1))
[ 61.992479][ T8152] Code: 89 fb 4c 8b 2f 49 83 bc 24 88 00 00 00 00 74 24 41 83 7c 24 70 00 75 1c 49 8b 45 48 65 48 03 05 79 a0 e9 7e 48 39 78 60 74 02 <0f> 0b 48 c7 40 60 00 00 00 00 4c 89 ef e8 50 fd ff ff 41 f6 c6 02
All code
========
0: 89 fb mov %edi,%ebx
2: 4c 8b 2f mov (%rdi),%r13
5: 49 83 bc 24 88 00 00 cmpq $0x0,0x88(%r12)
c: 00 00
e: 74 24 je 0x34
10: 41 83 7c 24 70 00 cmpl $0x0,0x70(%r12)
16: 75 1c jne 0x34
18: 49 8b 45 48 mov 0x48(%r13),%rax
1c: 65 48 03 05 79 a0 e9 add %gs:0x7ee9a079(%rip),%rax # 0x7ee9a09d
23: 7e
24: 48 39 78 60 cmp %rdi,0x60(%rax)
28: 74 02 je 0x2c
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 c7 40 60 00 00 00 movq $0x0,0x60(%rax)
33: 00
34: 4c 89 ef mov %r13,%rdi
37: e8 50 fd ff ff callq 0xfffffffffffffd8c
3c: 41 f6 c6 02 test $0x2,%r14b
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 c7 40 60 00 00 00 movq $0x0,0x60(%rax)
9: 00
a: 4c 89 ef mov %r13,%rdi
d: e8 50 fd ff ff callq 0xfffffffffffffd62
12: 41 f6 c6 02 test $0x2,%r14b
[ 61.997335][ T8152] RSP: 0018:ffffc9000218fc78 EFLAGS: 00010007
[ 61.998976][ T8152] RAX: ffff88842fc2f658 RBX: ffff88812b633000 RCX: 0000000000000001
[ 62.000996][ T8152] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff88812b633000
[ 62.003040][ T8152] RBP: ffff88812b633000 R08: ffff888175700b38 R09: ffff88816c54dc50
[ 62.005690][ T8152] R10: ffff8881767fdf00 R11: ffff88810cc004c0 R12: ffff888175700b00
[ 62.007781][ T8152] R13: ffffffff829f5580 R14: 0000000000000003 R15: ffff88812b37c600
[ 62.009979][ T8152] FS: 00007f72f7208740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
[ 62.012266][ T8152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 62.014072][ T8152] CR2: 00007f72f6ec5efc CR3: 00000001755b6000 CR4: 00000000000006f0
[ 62.016206][ T8152] Call Trace:
[ 62.017163][ T8152] <TASK>
[ 62.018032][ T8152] ctx_sched_out (kernel/events/core.c:3314 (discriminator 3))
[ 62.019167][ T8152] ctx_resched (kernel/events/core.c:2734)
[ 62.020303][ T8152] __perf_install_in_context (kernel/events/core.c:2809)
[ 62.021707][ T8152] ? sw_perf_event_destroy (kernel/events/core.c:72)
[ 62.023104][ T8152] remote_function (kernel/events/core.c:91 kernel/events/core.c:71)
[ 62.024448][ T8152] generic_exec_single (arch/x86/include/asm/irqflags.h:127 kernel/smp.c:520 kernel/smp.c:504)
[ 62.025714][ T8152] ? __alloc_file (fs/file_table.c:117)
[ 62.027233][ T8152] smp_call_function_single (kernel/smp.c:757)
[ 62.028756][ T8152] ? sw_perf_event_destroy (kernel/events/core.c:72)
[ 62.030203][ T8152] ? alloc_file (fs/file_table.c:200)
[ 62.031351][ T8152] task_function_call (kernel/events/core.c:121)
[ 62.032654][ T8152] ? ctx_resched (kernel/events/core.c:2762)
[ 62.033860][ T8152] perf_install_in_context (kernel/events/core.c:2910)
[ 62.035324][ T8152] __do_sys_perf_event_open (kernel/events/core.c:12491)
[ 62.036830][ T8152] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 62.038078][ T8152] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 62.039568][ T8152] RIP: 0033:0x7f72f731ff59
[ 62.040675][ T8152] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
All code
========
0: 00 c3 add %al,%bl
2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
9: 00 00 00
c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f41
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f17
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 62.045496][ T8152] RSP: 002b:00007ffddf82c708 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
[ 62.047784][ T8152] RAX: ffffffffffffffda RBX: 000000000000012a RCX: 00007f72f731ff59
[ 62.049967][ T8152] RDX: ffffffffffffffff RSI: 0000000000001fd8 RDI: 000055dad0ac6c00
[ 62.051954][ T8152] RBP: 000000000000012a R08: 0000000000000000 R09: 0000000037ecd53b
[ 62.053953][ T8152] R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002
[ 62.055966][ T8152] R13: 00007f72f5c7b058 R14: 00007f72f72086c0 R15: 00007f72f5c7b000
[ 62.057961][ T8152] </TASK>
[ 62.058735][ T8152] ---[ end trace d948954eb385fb99 ]---
[ 64.410009][ T2470] result_service: raw_upload, RESULT_MNT: /internal-lkp-server/result, RESULT_ROOT: /internal-lkp-server/result/trinity/group-03-300s/vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-kexec/gcc-9/f7cf7134e405062bf0f22c3ba5637241c4c4d06a/3, TMP_RESULT_ROOT: /tmp/lkp/result
[ 64.410031][ T2470]
[ 64.429418][ T2470] run-job /lkp/jobs/scheduled/vm-snb-123/trinity-group-03-300s-debian-10.4-x86_64-20200603.cgz-f7cf7134e405062bf0f22c3ba5637241c4c4d06a-20220115-27780-1ajh2ux-5.yaml
[ 64.429432][ T2470]
[ 66.886071][ T2470] /usr/bin/wget -q --timeout=1800 --tries=1 --local-encoding=UTF-8 http://internal-lkp-server:80/~lkp/cgi-bin/lkp-jobfile-append-var?job_file=/lkp/jobs/scheduled/vm-snb-123/trinity-group-03-300s-debian-10.4-x86_64-20200603.cgz-f7cf7134e405062bf0f22c3ba5637241c4c4d06a-20220115-27780-1ajh2ux-5.yaml&job_state=running -O /dev/null
[ 66.886087][ T2470]
[ 66.900236][ T2470] target ucode:
[ 66.900245][ T2470]
[ 66.907783][ T2470] Seeding trinity by 3185738815 based on vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-kexec
[ 66.907811][ T2470]
[ 75.752629][ T2470] 2022-01-15 08:08:38 chroot --userspec nobody:nogroup / trinity -q -q -l off -s 3185738815 -N 999999999 -c add_key -c bpf -c clock_adjtime -c clone3 -c copy_file_range -c creat -c epoll_create1 -c fadvise64_64 -c fchdir -c fdatasync -c geteuid16 -c getgroups -c gettimeofday -c inotify_init -c io_uring_register -c io_uring_setup -c ioctl -c kcmp -c keyctl -c lsetxattr -c lstat -c mkdirat -c mknod -c mknodat -c mmap2 -c mq_getsetattr -c munlockall -c olduname -c openat -c perf_event_open -c personality -c pidfd_open -c poll -c prctl -c preadv -c process_madvise -c process_mrelease -c ptrace -c quotactl -c readlinkat -c rt_sigtimedwait -c sched_getattr -c sched_rr_get_interval -c setgid -c setgroups -c setrlimit -c setsid -c signalfd4 -c statfs -c stime -c syncfs -c timer_create -c timer_delete -c timer_gettime -c timerfd_create -c uname -c utimensat -c vm86old -c vmsplice
[ 75.752696][ T2470]
[ 80.049523][ T2470] Trinity 2021.10 Dave Jones <[email protected]>
[ 80.049600][ T2470]
[ 80.056064][ T2470] shm:0x7f72f73f9000-0x7f7303ff5d00 (4 pages)
[ 80.056078][ T2470]
[ 81.317675][ T2470] [main] Marking syscall add_key (64bit:248 32bit:286) as to be enabled.
[ 81.317695][ T2470]
[ 81.326020][ T2470] [main] Marking syscall bpf (64bit:321 32bit:357) as to be enabled.
[ 81.326033][ T2470]
[ 81.762673][ T2470] [main] Marking syscall clock_adjtime (64bit:305 32bit:343) as to be enabled.
[ 81.762703][ T2470]
[ 81.769765][ T2470] [main] clone3 is marked as AVOID. Skipping
[ 81.769779][ T2470]
[ 81.775664][ T2470] [main] clone3 is marked as AVOID. Skipping
[ 81.775679][ T2470]
[ 81.783066][ T2470] [main] Marking syscall clone3 (64bit:435 32bit:428) as to be enabled.
[ 81.783089][ T2470]
[ 81.932838][ T2470] [main] Marking syscall copy_file_range (64bit:326 32bit:377) as to be enabled.
[ 81.932853][ T2470]
[ 81.941437][ T2470] [main] Marking syscall creat (64bit:85 32bit:8) as to be enabled.
[ 81.941458][ T2470]
[ 81.950070][ T2470] [main] Marking syscall epoll_create1 (64bit:291 32bit:329) as to be enabled.
[ 81.950086][ T2470]
[ 81.958583][ T2470] [main] Marking 32-bit syscall fadvise64_64 (272) as to be enabled.
[ 81.958597][ T2470]
[ 81.969906][ T2470] [main] Marking syscall fchdir (64bit:81 32bit:133) as to be enabled.
[ 81.969924][ T2470]
[ 81.978369][ T2470] [main] Marking syscall fdatasync (64bit:75 32bit:148) as to be enabled.
[ 81.978384][ T2470]
[ 81.986696][ T2470] [main] Marking 32-bit syscall geteuid16 (49) as to be enabled.
[ 81.986711][ T2470]
[ 81.995106][ T2470] [main] Marking syscall getgroups (64bit:115 32bit:205) as to be enabled.
[ 81.995123][ T2470]
[ 82.004068][ T2470] [main] Marking syscall gettimeofday (64bit:96 32bit:78) as to be enabled.
[ 82.004084][ T2470]
[ 82.013067][ T2470] [main] Marking syscall inotify_init (64bit:253 32bit:291) as to be enabled.
[ 82.013081][ T2470]
To reproduce:
# build kernel
cd linux
cp config-5.16.0-rc1-00018-gf7cf7134e405 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
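
A note on reading the report above: the decoded bytes around the trapping instruction compare a field of a per-CPU structure with the function's first argument (cmp %rdi,0x60(%rax)), hit ud2 when they differ, and then store zero to the same field. The sketch below is only a userspace model of that compare-then-clear sequence. The structure names are hypothetical. The field name task_epc is an assumption borrowed from the cover letter's mention of a stale cpc->task_epc, and nothing in the log confirms that offset 0x60 is that field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-PMU context; contents are irrelevant here. */
struct pmu_ctx_model { int unused; };

/* Hypothetical per-CPU, per-PMU state; only the field of interest is modelled. */
struct cpu_pmu_ctx_model {
	struct pmu_ctx_model *task_epc;	/* assumed name for the field at 0x60 */
};

/*
 * Model of the sequence in the disassembly: compare the cached pointer
 * with the context being scheduled out, then clear it unconditionally.
 * Returns true when the comparison fails, i.e. the path that reaches ud2.
 */
static bool model_clear_task_epc(struct cpu_pmu_ctx_model *cpc,
				 struct pmu_ctx_model *pmu_ctx)
{
	bool warn = cpc->task_epc != pmu_ctx;

	cpc->task_epc = NULL;
	return warn;
}
```

Under this reading, the warning fires whenever the cached pointer is stale or already cleared when the sched-out path runs, which would be consistent with the install path shown in the call trace.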
On 13-Jan-22 7:17 PM, Ravi Bangoria wrote:
> From: Peter Zijlstra <[email protected]>
>
> This is the 2nd version of RFC originally posted by Peter[1].
>
> There have been various issues and limitations with the way perf uses
> (task) contexts to track events. Most notable is the single hardware
> PMU task context, which has resulted in a number of yucky things (both
> proposed and merged).
>
> Notably:
> - HW breakpoint PMU
> - ARM big.little PMU / Intel ADL PMU
> - Intel Branch Monitoring PMU
> - AMD IBS
>
> Current design:
> ---------------
> Currently we have a per task and per cpu perf_event_contexts:
>
> task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
> ^ | ^ |
> `---------------------------------' | `--> pmu
> v ^
> perf_event ------'
>
> Each task has an array of pointers to a perf_event_context. Each
> perf_event_context has a direct relation to a PMU and a group of
> events for that PMU. The task related perf_event_context's have a
> pointer back to that task.
>
> Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
> includes a perf_event_context, which again has a direct relation to
> that PMU, and a group of events for that PMU.
>
> The perf_cpu_context also tracks which task context is currently
> associated with that CPU and includes a few other things like the
> hrtimer for rotation etc.
>
> Each perf_event is then associated with its PMU and one
> perf_event_context.
>
> Proposed design:
> ----------------
> The new design proposed by this patch reduces this to a single task
> context and a single CPU context, but adds some intermediate
> data structures:
>
> task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
> ^ | ^ ^
> `---------------------------------' | |
> | | perf_cpu_pmu_context
> | `----. ^
> | | |
> | v v
> | ,--> perf_event_pmu_context
> | | ^
> | | |
> v v v
> perf_event ---> pmu
>
> With the new design, perf_event_context will hold all pmu events in the
> respective (pinned/flexible) rbtrees. This can be achieved by adding
> pmu to the rbtree key:
>
> {cpu, pmu, cgroup_id, group_index}
>
> Each perf_event_context carries a list of perf_event_pmu_context, which
> is used to hold per-pmu-per-context state. For example, it keeps track
> of the currently active events for that pmu, a pmu-specific
> task_ctx_data, a flag to tell whether rotation is required, etc.
>
> Similarly perf_cpu_pmu_context is used to hold per-pmu-per-cpu state
> like hrtimer details to drive the event rotation, a pointer to
> perf_event_pmu_context of currently running task and some other
> ancillary information.
>
> Each perf_event is associated with its pmu, perf_event_context and
> perf_event_pmu_context.
>
> Original RFC -> RFC v2:
> -----------------------
> In addition to porting the patch to the latest (v5.16-rc6) kernel, here
> are some of the major changes between the two revisions:
>
> - There have been quite a few fundamental changes since the original
> patch. Most notably, the rbtree key has changed from {cpu,group_index}
> to {cpu,cgroup_id,group_index}. Adding a pmu key in between, as
> proposed in the original patch, is not straightforward as it would
> break the cgroup-specific optimization. Hence we need to iterate over
> all pmu_ctx for a given ctx and call visit_groups_merge() one by one.
> - Enabled cgroup support (CGROUP_PERF).
> - Some changes w.r.t. multiplexing events: with the new design the
> rotation happens at the cgroup subtree, unlike at the pmu subtree in
> the original patch.
>
> Because of the additional complexity the above changes bring in, I
> wanted to get an initial review of the overall approach before starting
> to make it upstream ready. Hence this patch just provides an idea of the
> direction we will head toward. There are many loose ends in the patch
> right now. For example, I've not paid much attention to
> synchronization-related aspects. Similarly, some of the issues marked
> (XXX) in the original patch haven't been fixed.
>
> A simple perf stat/record/top survives with the patch, but the machine
> crashes on the first run of perf test (a stale cpc->task_epc causes the
> crash). Lockdep is also screaming a lot :)
Hi Peter, can you please review this.
Thanks,
Ravi
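
As an aside on the rbtree key discussed in the quoted cover letter, the sketch below shows what a lexicographic comparison over {cpu, pmu, cgroup_id, group_index} could look like. It is a userspace illustration only. The flat struct group_key and the function group_key_cmp() are hypothetical, the pmu is reduced to an integer stand-in for a pointer, and the field order follows the "Proposed design" section. The RFC v2 notes explain why placing pmu ahead of cgroup_id is not straightforward in practice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flat key; the kernel derives these values from the event. */
struct group_key {
	int cpu;
	uintptr_t pmu;		/* stand-in for the struct pmu pointer */
	uint64_t cgroup_id;
	uint64_t group_index;
};

/*
 * Lexicographic compare: earlier fields dominate later ones.
 * Returns a negative, zero or positive value, like memcmp().
 */
static int group_key_cmp(const struct group_key *a, const struct group_key *b)
{
	if (a->cpu != b->cpu)
		return a->cpu < b->cpu ? -1 : 1;
	if (a->pmu != b->pmu)
		return a->pmu < b->pmu ? -1 : 1;
	if (a->cgroup_id != b->cgroup_id)
		return a->cgroup_id < b->cgroup_id ? -1 : 1;
	if (a->group_index != b->group_index)
		return a->group_index < b->group_index ? -1 : 1;
	return 0;
}
```

With such an ordering, all events for one {cpu, pmu} pair form a contiguous run in the tree, which is what would allow a single context to hold events from several PMUs.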
Right, so sorry for being incredibly tardy on this. Find below the
patch fwd ported to something recent.
I'll reply to this with fixes and comments.
---
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -132,7 +132,7 @@ static unsigned long ebb_switch_in(bool
static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in) {}
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) {}
static inline void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *cpuhw) {}
static void pmao_restore_workaround(bool ebb) { }
#endif /* CONFIG_PPC32 */
@@ -451,7 +451,7 @@ static void power_pmu_bhrb_disable(struc
/* Called from ctxsw to prevent one process's branch entries to
* mingle with the other process's entries during context switch.
*/
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
if (!ppmu->bhrb_nr)
return;
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -317,7 +317,7 @@ static void amd_brs_poison_buffer(void)
* On ctxswin, sched_in = true, called after the PMU has started
* On ctxswout, sched_in = false, called before the PMU is stopped
*/
-void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1248,11 +1248,11 @@ static ssize_t amd_event_sysfs_show(char
return x86_event_sysfs_show(page, config, event);
}
-static void amd_pmu_sched_task(struct perf_event_context *ctx,
+static void amd_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
bool sched_in)
{
if (sched_in && x86_pmu.lbr_nr)
- amd_pmu_brs_sched_task(ctx, sched_in);
+ amd_pmu_brs_sched_task(pmu_ctx, sched_in);
}
static u64 amd_pmu_limit_period(struct perf_event *event, u64 left)
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2067,13 +2067,14 @@ void x86_pmu_show_pmu_cap(int num_counte
*/
void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
{
- struct perf_cpu_context *cpuctx;
+ /* XXX: Don't need this quirk anymore */
+ /*struct perf_cpu_context *cpuctx;
if (!pmu->pmu_cpu_context)
return;
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- cpuctx->ctx.pmu = pmu;
+ cpuctx->ctx.pmu = pmu;*/
}
static int __init init_hw_perf_events(void)
@@ -2644,15 +2645,15 @@ static const struct attribute_group *x86
NULL,
};
-static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void x86_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
- static_call_cond(x86_pmu_sched_task)(ctx, sched_in);
+ static_call_cond(x86_pmu_sched_task)(pmu_ctx, sched_in);
}
-static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next)
+static void x86_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc)
{
- static_call_cond(x86_pmu_swap_task_ctx)(prev, next);
+ static_call_cond(x86_pmu_swap_task_ctx)(prev_epc, next_epc);
}
void perf_check_microcode(void)
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4545,17 +4545,17 @@ static void intel_pmu_cpu_dead(int cpu)
cpumask_clear_cpu(cpu, &hybrid_pmu(cpuc->pmu)->supported_cpus);
}
-static void intel_pmu_sched_task(struct perf_event_context *ctx,
+static void intel_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
bool sched_in)
{
- intel_pmu_pebs_sched_task(ctx, sched_in);
- intel_pmu_lbr_sched_task(ctx, sched_in);
+ intel_pmu_pebs_sched_task(pmu_ctx, sched_in);
+ intel_pmu_lbr_sched_task(pmu_ctx, sched_in);
}
-static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next)
+static void intel_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc)
{
- intel_pmu_lbr_swap_task_ctx(prev, next);
+ intel_pmu_lbr_swap_task_ctx(prev_epc, next_epc);
}
static int intel_pmu_check_period(struct perf_event *event, u64 value)
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1005,7 +1005,7 @@ static inline bool pebs_needs_sched_cb(s
return cpuc->n_pebs && (cpuc->n_pebs == cpuc->n_large_pebs);
}
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1113,7 +1113,7 @@ static void
pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
struct perf_event *event, bool add)
{
- struct pmu *pmu = event->ctx->pmu;
+ struct pmu *pmu = event->pmu;
/*
* Make sure we get updated with the first PEBS
* event. It will trigger also during removal, but
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -575,21 +575,21 @@ static void __intel_pmu_lbr_save(void *c
cpuc->last_log_id = ++task_context_opt(ctx)->log_id;
}
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next)
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc)
{
void *prev_ctx_data, *next_ctx_data;
- swap(prev->task_ctx_data, next->task_ctx_data);
+ swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
/*
- * Architecture specific synchronization makes sense in
- * case both prev->task_ctx_data and next->task_ctx_data
+ * Architecture specific synchronization makes sense in case
+ * both prev_epc->task_ctx_data and next_epc->task_ctx_data
* pointers are allocated.
*/
- prev_ctx_data = next->task_ctx_data;
- next_ctx_data = prev->task_ctx_data;
+ prev_ctx_data = next_epc->task_ctx_data;
+ next_ctx_data = prev_epc->task_ctx_data;
if (!prev_ctx_data || !next_ctx_data)
return;
@@ -598,7 +598,7 @@ void intel_pmu_lbr_swap_task_ctx(struct
task_context_opt(next_ctx_data)->lbr_callstack_users);
}
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
void *task_ctx;
@@ -611,7 +611,7 @@ void intel_pmu_lbr_sched_task(struct per
* the task was scheduled out, restore the stack. Otherwise flush
* the LBR stack.
*/
- task_ctx = ctx ? ctx->task_ctx_data : NULL;
+ task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;
if (task_ctx) {
if (sched_in)
__intel_pmu_lbr_restore(task_ctx);
@@ -647,8 +647,8 @@ void intel_pmu_lbr_add(struct perf_event
cpuc->br_sel = event->hw.branch_reg.reg;
- if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data)
- task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users++;
+ if (branch_user_callstack(cpuc->br_sel) && event->pmu_ctx->task_ctx_data)
+ task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users++;
/*
* Request pmu::sched_task() callback, which will fire inside the
@@ -671,7 +671,7 @@ void intel_pmu_lbr_add(struct perf_event
*/
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users++;
- perf_sched_cb_inc(event->ctx->pmu);
+ perf_sched_cb_inc(event->pmu);
if (!cpuc->lbr_users++ && !event->total_time_running)
intel_pmu_lbr_reset();
}
@@ -724,8 +724,8 @@ void intel_pmu_lbr_del(struct perf_event
return;
if (branch_user_callstack(cpuc->br_sel) &&
- event->ctx->task_ctx_data)
- task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users--;
+ event->pmu_ctx->task_ctx_data)
+ task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users--;
if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
cpuc->lbr_select = 0;
@@ -735,7 +735,7 @@ void intel_pmu_lbr_del(struct perf_event
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
- perf_sched_cb_dec(event->ctx->pmu);
+ perf_sched_cb_dec(event->pmu);
}
static inline bool vlbr_exclude_host(void)
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -798,7 +798,7 @@ struct x86_pmu {
void (*cpu_dead)(int cpu);
void (*check_microcode)(void);
- void (*sched_task)(struct perf_event_context *ctx,
+ void (*sched_task)(struct perf_event_pmu_context *pmu_ctx,
bool sched_in);
/*
@@ -880,12 +880,12 @@ struct x86_pmu {
int (*set_topdown_event_period)(struct perf_event *event);
/*
- * perf task context (i.e. struct perf_event_context::task_ctx_data)
+ * perf task context (i.e. struct perf_event_pmu_context::task_ctx_data)
* switch helper to bridge calls from perf/core to perf/x86.
* See struct pmu::swap_task_ctx() usage for examples;
*/
- void (*swap_task_ctx)(struct perf_event_context *prev,
- struct perf_event_context *next);
+ void (*swap_task_ctx)(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc);
/*
* AMD bits
@@ -1253,7 +1253,7 @@ static inline void amd_pmu_brs_del(struc
perf_sched_cb_dec(event->ctx->pmu);
}
-void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
#else
static inline int amd_brs_init(void)
{
@@ -1278,7 +1278,7 @@ static inline void amd_pmu_brs_del(struc
{
}
-static inline void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+static inline void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
{
}
@@ -1436,7 +1436,7 @@ void intel_pmu_pebs_enable_all(void);
void intel_pmu_pebs_disable_all(void);
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
void intel_pmu_auto_reload_read(struct perf_event *event);
@@ -1444,10 +1444,10 @@ void intel_pmu_store_pebs_lbrs(struct lb
void intel_ds_init(void);
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
- struct perf_event_context *next);
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc);
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
u64 lbr_from_signext_quirk_wr(u64 val);
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -262,6 +262,7 @@ struct hw_perf_event {
};
struct perf_event;
+struct perf_event_pmu_context;
/*
* Common implementation detail of pmu::{start,commit,cancel}_txn
@@ -304,7 +305,7 @@ struct pmu {
int capabilities;
int __percpu *pmu_disable_count;
- struct perf_cpu_context __percpu *pmu_cpu_context;
+ struct perf_cpu_pmu_context __percpu *cpu_pmu_context;
atomic_t exclusive_cnt; /* < 0: cpu; > 0: tsk */
int task_ctx_nr;
int hrtimer_interval_ms;
@@ -439,7 +440,7 @@ struct pmu {
/*
* context-switches callback
*/
- void (*sched_task) (struct perf_event_context *ctx,
+ void (*sched_task) (struct perf_event_pmu_context *pmu_ctx,
bool sched_in);
/*
@@ -453,8 +454,8 @@ struct pmu {
* implementation and Perf core context switch handling callbacks for usage
* examples.
*/
- void (*swap_task_ctx) (struct perf_event_context *prev,
- struct perf_event_context *next);
+ void (*swap_task_ctx) (struct perf_event_pmu_context *prev_epc,
+ struct perf_event_pmu_context *next_epc);
/* optional */
/*
@@ -675,6 +676,11 @@ struct perf_event {
int group_caps;
struct perf_event *group_leader;
+ /*
+ * event->pmu always points to the pmu this event belongs to,
+ * unlike event->pmu_ctx->pmu, which can point to a different pmu when
+ * a group mixes events from different pmus.
+ */
struct pmu *pmu;
void *pmu_private;
@@ -700,6 +706,12 @@ struct perf_event {
struct hw_perf_event hw;
struct perf_event_context *ctx;
+ /*
+ * event->pmu_ctx points to the perf_event_pmu_context to which the event
+ * is added. For a sw event this can be the pmu_ctx of another pmu, when
+ * the sw event is added to a non-sw event group.
+ */
+ struct perf_event_pmu_context *pmu_ctx;
atomic_long_t refcount;
/*
@@ -787,19 +799,60 @@ struct perf_event {
#endif /* CONFIG_PERF_EVENTS */
};
+/*
+ * ,------------------------[1:n]---------------------.
+ * V V
+ * perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event
+ * ^ ^ | |
+ * `--------[1:n]---------' `-[n:1]-> pmu <-[1:n]-'
+ *
+ *
+ * XXX destroy epc when empty
+ * refcount, !rcu
+ *
+ * XXX epc locking
+ *
+ * event->pmu_ctx ctx->mutex && inactive
+ * ctx->pmu_ctx_list ctx->mutex && ctx->lock
+ *
+ */
+struct perf_event_pmu_context {
+ struct pmu *pmu;
+ struct perf_event_context *ctx;
+
+ struct list_head pmu_ctx_entry;
+
+ struct list_head pinned_active;
+ struct list_head flexible_active;
+
+ /* Used to avoid freeing per-cpu perf_event_pmu_context */
+ unsigned int embedded : 1;
+
+ unsigned int nr_events;
+ unsigned int nr_active;
+
+ atomic_t refcount; /* event <-> epc */
+
+ void *task_ctx_data; /* pmu specific data */
+ /*
+ * Set when nr_events != nr_active, except tolerant to events not
+ * necessary to be active due to scheduling constraints, such as cgroups.
+ */
+ int rotate_necessary;
+};
struct perf_event_groups {
struct rb_root tree;
u64 index;
};
+
/**
* struct perf_event_context - event context structure
*
* Used as a container for task events and CPU events as well:
*/
struct perf_event_context {
- struct pmu *pmu;
/*
* Protect the states of the events in the list,
* nr_active, and the list:
@@ -812,26 +865,21 @@ struct perf_event_context {
*/
struct mutex mutex;
- struct list_head active_ctx_list;
+ struct list_head pmu_ctx_list;
struct perf_event_groups pinned_groups;
struct perf_event_groups flexible_groups;
struct list_head event_list;
- struct list_head pinned_active;
- struct list_head flexible_active;
-
int nr_events;
int nr_active;
int nr_user;
int is_active;
+
+ int nr_task_data;
int nr_stat;
int nr_freq;
int rotate_disable;
- /*
- * Set when nr_events != nr_active, except tolerant to events not
- * necessary to be active due to scheduling constraints, such as cgroups.
- */
- int rotate_necessary;
+
refcount_t refcount;
struct task_struct *task;
@@ -853,7 +901,6 @@ struct perf_event_context {
#ifdef CONFIG_CGROUP_PERF
int nr_cgroups; /* cgroup evts */
#endif
- void *task_ctx_data; /* pmu specific data */
struct rcu_head rcu_head;
};
@@ -863,12 +910,13 @@ struct perf_event_context {
*/
#define PERF_NR_CONTEXTS 4
-/**
- * struct perf_cpu_context - per cpu event context structure
- */
-struct perf_cpu_context {
- struct perf_event_context ctx;
- struct perf_event_context *task_ctx;
+struct perf_cpu_pmu_context {
+ struct perf_event_pmu_context epc;
+ struct perf_event_pmu_context *task_epc;
+
+ struct list_head sched_cb_entry;
+ int sched_cb_usage;
+
int active_oncpu;
int exclusive;
@@ -876,16 +924,21 @@ struct perf_cpu_context {
struct hrtimer hrtimer;
ktime_t hrtimer_interval;
unsigned int hrtimer_active;
+};
+
+/**
+ * struct perf_cpu_context - per cpu event context structure
+ */
+struct perf_cpu_context {
+ struct perf_event_context ctx;
+ struct perf_event_context *task_ctx;
+ int online;
#ifdef CONFIG_CGROUP_PERF
struct perf_cgroup *cgrp;
struct list_head cgrp_cpuctx_entry;
#endif
- struct list_head sched_cb_entry;
- int sched_cb_usage;
-
- int online;
/*
* Per-CPU storage for iterators used in visit_groups_merge. The default
* storage is of size 2 to hold the CPU and any CPU event iterators.
@@ -1151,7 +1204,7 @@ static inline int is_software_event(stru
*/
static inline int in_software_context(struct perf_event *event)
{
- return event->ctx->pmu->task_ctx_nr == perf_sw_context;
+ return event->pmu_ctx->pmu->task_ctx_nr == perf_sw_context;
}
static inline int is_exclusive_pmu(struct pmu *pmu)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,7 +1226,7 @@ struct task_struct {
unsigned int futex_state;
#endif
#ifdef CONFIG_PERF_EVENTS
- struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
+ struct perf_event_context *perf_event_ctxp;
struct mutex perf_event_mutex;
struct list_head perf_event_list;
#endif
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,12 +154,6 @@ static int cpu_function_call(int cpu, re
return data.ret;
}
-static inline struct perf_cpu_context *
-__get_cpu_context(struct perf_event_context *ctx)
-{
- return this_cpu_ptr(ctx->pmu->pmu_cpu_context);
-}
-
static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx)
{
@@ -183,6 +177,8 @@ static bool is_kernel_event(struct perf_
return READ_ONCE(event->owner) == TASK_TOMBSTONE;
}
+static DEFINE_PER_CPU(struct perf_cpu_context, cpu_context);
+
/*
* On task ctx scheduling...
*
@@ -216,7 +212,7 @@ static int event_function(void *info)
struct event_function_struct *efs = info;
struct perf_event *event = efs->event;
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
int ret = 0;
@@ -313,7 +309,7 @@ static void event_function_call(struct p
static void event_function_local(struct perf_event *event, event_f func, void *data)
{
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct task_struct *task = READ_ONCE(ctx->task);
struct perf_event_context *task_ctx = NULL;
@@ -387,7 +383,6 @@ static DEFINE_MUTEX(perf_sched_mutex);
static atomic_t perf_sched_count;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(int, perf_sched_cb_usages);
static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
static atomic_t nr_mmap_events __read_mostly;
@@ -447,7 +442,7 @@ static void update_perf_cpu_limits(void)
WRITE_ONCE(perf_sample_allowed_ns, tmp);
}
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc);
int perf_proc_update_handler(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
@@ -570,12 +565,6 @@ void perf_sample_event_took(u64 sample_l
static atomic64_t perf_event_id;
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
-
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
-
static void update_context_time(struct perf_event_context *ctx);
static u64 perf_event_time(struct perf_event *event);
@@ -690,13 +679,31 @@ do { \
___p; \
})
+static void perf_ctx_disable(struct perf_event_context *ctx)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+ perf_pmu_disable(pmu_ctx->pmu);
+}
+
+static void perf_ctx_enable(struct perf_event_context *ctx)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+ perf_pmu_enable(pmu_ctx->pmu);
+}
+
+static void ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type);
+static void ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type);
+
#ifdef CONFIG_CGROUP_PERF
static inline bool
perf_cgroup_match(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
/* @event doesn't care about cgroup */
if (!event->cgrp)
@@ -822,6 +829,7 @@ perf_cgroup_set_timestamp(struct perf_cp
}
}
+/* XXX: No need of list now. Convert it to per-cpu variable */
static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
/*
@@ -849,9 +857,9 @@ static void perf_cgroup_switch(struct ta
continue;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- perf_pmu_disable(cpuctx->ctx.pmu);
+ perf_ctx_disable(&cpuctx->ctx);
- cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+ ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
/*
* must not be done before ctxswout due
* to update_cgrp_time_from_cpuctx() in
@@ -863,9 +871,9 @@ static void perf_cgroup_switch(struct ta
* perf_cgroup_set_timestamp() in ctx_sched_in()
* to not have to pass task around
*/
- cpu_ctx_sched_in(cpuctx, EVENT_ALL);
+ ctx_sched_in(&cpuctx->ctx, EVENT_ALL);
- perf_pmu_enable(cpuctx->ctx.pmu);
+ perf_ctx_enable(&cpuctx->ctx);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
@@ -887,7 +895,7 @@ static int perf_cgroup_ensure_storage(st
heap_size++;
for_each_possible_cpu(cpu) {
- cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
if (heap_size <= cpuctx->heap_size)
continue;
@@ -1068,34 +1076,30 @@ static void perf_cgroup_switch(struct ta
*/
static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
{
- struct perf_cpu_context *cpuctx;
+ struct perf_cpu_pmu_context *cpc;
bool rotations;
lockdep_assert_irqs_disabled();
- cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
- rotations = perf_rotate_context(cpuctx);
+ cpc = container_of(hr, struct perf_cpu_pmu_context, hrtimer);
+ rotations = perf_rotate_context(cpc);
- raw_spin_lock(&cpuctx->hrtimer_lock);
+ raw_spin_lock(&cpc->hrtimer_lock);
if (rotations)
- hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
+ hrtimer_forward_now(hr, cpc->hrtimer_interval);
else
- cpuctx->hrtimer_active = 0;
- raw_spin_unlock(&cpuctx->hrtimer_lock);
+ cpc->hrtimer_active = 0;
+ raw_spin_unlock(&cpc->hrtimer_lock);
return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
}
-static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
+static void __perf_mux_hrtimer_init(struct perf_cpu_pmu_context *cpc, int cpu)
{
- struct hrtimer *timer = &cpuctx->hrtimer;
- struct pmu *pmu = cpuctx->ctx.pmu;
+ struct hrtimer *timer = &cpc->hrtimer;
+ struct pmu *pmu = cpc->epc.pmu;
u64 interval;
- /* no multiplexing needed for SW PMU */
- if (pmu->task_ctx_nr == perf_sw_context)
- return;
-
/*
* check default is sane, if not set then force to
* default interval (1/tick)
@@ -1104,30 +1108,25 @@ static void __perf_mux_hrtimer_init(stru
if (interval < 1)
interval = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
- cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
+ cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
- raw_spin_lock_init(&cpuctx->hrtimer_lock);
+ raw_spin_lock_init(&cpc->hrtimer_lock);
hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
timer->function = perf_mux_hrtimer_handler;
}
-static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
+static int perf_mux_hrtimer_restart(struct perf_cpu_pmu_context *cpc)
{
- struct hrtimer *timer = &cpuctx->hrtimer;
- struct pmu *pmu = cpuctx->ctx.pmu;
+ struct hrtimer *timer = &cpc->hrtimer;
unsigned long flags;
- /* not for SW PMU */
- if (pmu->task_ctx_nr == perf_sw_context)
- return 0;
-
- raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
- if (!cpuctx->hrtimer_active) {
- cpuctx->hrtimer_active = 1;
- hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+ raw_spin_lock_irqsave(&cpc->hrtimer_lock, flags);
+ if (!cpc->hrtimer_active) {
+ cpc->hrtimer_active = 1;
+ hrtimer_forward_now(timer, cpc->hrtimer_interval);
hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED_HARD);
}
- raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
+ raw_spin_unlock_irqrestore(&cpc->hrtimer_lock, flags);
return 0;
}
@@ -1146,32 +1145,9 @@ void perf_pmu_enable(struct pmu *pmu)
pmu->pmu_enable(pmu);
}
-static DEFINE_PER_CPU(struct list_head, active_ctx_list);
-
-/*
- * perf_event_ctx_activate(), perf_event_ctx_deactivate(), and
- * perf_event_task_tick() are fully serialized because they're strictly cpu
- * affine and perf_event_ctx{activate,deactivate} are called with IRQs
- * disabled, while perf_event_task_tick is called from IRQ context.
- */
-static void perf_event_ctx_activate(struct perf_event_context *ctx)
-{
- struct list_head *head = this_cpu_ptr(&active_ctx_list);
-
- lockdep_assert_irqs_disabled();
-
- WARN_ON(!list_empty(&ctx->active_ctx_list));
-
- list_add(&ctx->active_ctx_list, head);
-}
-
-static void perf_event_ctx_deactivate(struct perf_event_context *ctx)
+static void perf_assert_pmu_disabled(struct pmu *pmu)
{
- lockdep_assert_irqs_disabled();
-
- WARN_ON(list_empty(&ctx->active_ctx_list));
-
- list_del_init(&ctx->active_ctx_list);
+ WARN_ON_ONCE(*this_cpu_ptr(pmu->pmu_disable_count) == 0);
}
static void get_ctx(struct perf_event_context *ctx)
@@ -1198,7 +1174,6 @@ static void free_ctx(struct rcu_head *he
struct perf_event_context *ctx;
ctx = container_of(head, struct perf_event_context, rcu_head);
- free_task_ctx_data(ctx->pmu, ctx->task_ctx_data);
kfree(ctx);
}
@@ -1383,7 +1358,7 @@ static u64 primary_event_id(struct perf_
* the context could get moved to another task.
*/
static struct perf_event_context *
-perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
+perf_lock_task_context(struct task_struct *task, unsigned long *flags)
{
struct perf_event_context *ctx;
@@ -1399,7 +1374,7 @@ perf_lock_task_context(struct task_struc
*/
local_irq_save(*flags);
rcu_read_lock();
- ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
+ ctx = rcu_dereference(task->perf_event_ctxp);
if (ctx) {
/*
* If this context is a clone of another, it might
@@ -1412,7 +1387,7 @@ perf_lock_task_context(struct task_struc
* can't get swapped on us any more.
*/
raw_spin_lock(&ctx->lock);
- if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
+ if (ctx != rcu_dereference(task->perf_event_ctxp)) {
raw_spin_unlock(&ctx->lock);
rcu_read_unlock();
local_irq_restore(*flags);
@@ -1439,12 +1414,12 @@ perf_lock_task_context(struct task_struc
* reference count so that the context can't get freed.
*/
static struct perf_event_context *
-perf_pin_task_context(struct task_struct *task, int ctxn)
+perf_pin_task_context(struct task_struct *task)
{
struct perf_event_context *ctx;
unsigned long flags;
- ctx = perf_lock_task_context(task, ctxn, &flags);
+ ctx = perf_lock_task_context(task, &flags);
if (ctx) {
++ctx->pin_count;
raw_spin_unlock_irqrestore(&ctx->lock, flags);
@@ -1590,14 +1565,22 @@ static inline struct cgroup *event_cgrou
* which provides ordering when rotating groups for the same CPU.
*/
static __always_inline int
-perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
- const u64 left_group_index, const struct perf_event *right)
+perf_event_groups_cmp(const int left_cpu, const struct pmu *left_pmu,
+ const struct cgroup *left_cgroup, const u64 left_group_index,
+ const struct perf_event *right)
{
if (left_cpu < right->cpu)
return -1;
if (left_cpu > right->cpu)
return 1;
+ if (left_pmu) {
+ if (left_pmu < right->pmu_ctx->pmu)
+ return -1;
+ if (left_pmu > right->pmu_ctx->pmu)
+ return 1;
+ }
+
#ifdef CONFIG_CGROUP_PERF
{
const struct cgroup *right_cgroup = event_cgroup(right);
@@ -1640,12 +1623,13 @@ perf_event_groups_cmp(const int left_cpu
static inline bool __group_less(struct rb_node *a, const struct rb_node *b)
{
struct perf_event *e = __node_2_pe(a);
- return perf_event_groups_cmp(e->cpu, event_cgroup(e), e->group_index,
- __node_2_pe(b)) < 0;
+ return perf_event_groups_cmp(e->cpu, e->pmu_ctx->pmu, event_cgroup(e),
+ e->group_index, __node_2_pe(b)) < 0;
}
struct __group_key {
int cpu;
+ struct pmu *pmu;
struct cgroup *cgroup;
};
@@ -1654,14 +1638,25 @@ static inline int __group_cmp(const void
const struct __group_key *a = key;
const struct perf_event *b = __node_2_pe(node);
- /* partial/subtree match: @cpu, @cgroup; ignore: @group_index */
- return perf_event_groups_cmp(a->cpu, a->cgroup, b->group_index, b);
+ /* partial/subtree match: @cpu, @pmu, @cgroup; ignore: @group_index */
+ return perf_event_groups_cmp(a->cpu, a->pmu, a->cgroup, b->group_index, b);
+}
+
+static inline int
+__group_cmp_ignore_cgroup(const void *key, const struct rb_node *node)
+{
+ const struct __group_key *a = key;
+ const struct perf_event *b = __node_2_pe(node);
+
+ /* partial/subtree match: @cpu, @pmu, ignore: @cgroup, @group_index */
+ return perf_event_groups_cmp(a->cpu, a->pmu, event_cgroup(b),
+ b->group_index, b);
}
/*
- * Insert @event into @groups' tree; using {@event->cpu, ++@groups->index} for
- * key (see perf_event_groups_less). This places it last inside the CPU
- * subtree.
+ * Insert @event into @groups' tree; using
+ * {@event->cpu, @event->pmu_ctx->pmu, event_cgroup(@event), ++@groups->index}
+ * as key. This places it last inside the {cpu,pmu,cgroup} subtree.
*/
static void
perf_event_groups_insert(struct perf_event_groups *groups,
@@ -1711,14 +1706,15 @@ del_event_from_groups(struct perf_event
}
/*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the {cpu,pmu,cgroup} subtree.
*/
static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu,
- struct cgroup *cgrp)
+ struct pmu *pmu, struct cgroup *cgrp)
{
struct __group_key key = {
.cpu = cpu,
+ .pmu = pmu,
.cgroup = cgrp,
};
struct rb_node *node;
@@ -1730,14 +1726,12 @@ perf_event_groups_first(struct perf_even
return NULL;
}
-/*
- * Like rb_entry_next_safe() for the @cpu subtree.
- */
static struct perf_event *
-perf_event_groups_next(struct perf_event *event)
+perf_event_groups_next(struct perf_event *event, struct pmu *pmu)
{
struct __group_key key = {
.cpu = event->cpu,
+ .pmu = pmu,
.cgroup = event_cgroup(event),
};
struct rb_node *next;
@@ -1793,6 +1787,7 @@ list_add_event(struct perf_event *event,
perf_cgroup_event_enable(event, ctx);
ctx->generation++;
+ event->pmu_ctx->nr_events++;
}
/*
@@ -2000,6 +1995,7 @@ list_del_event(struct perf_event *event,
}
ctx->generation++;
+ event->pmu_ctx->nr_events--;
}
static int
@@ -2016,13 +2012,11 @@ perf_aux_output_match(struct perf_event
static void put_event(struct perf_event *event);
static void event_sched_out(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx);
static void perf_put_aux_event(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct perf_event *iter;
/*
@@ -2051,7 +2045,7 @@ static void perf_put_aux_event(struct pe
* state so that we don't try to schedule it again. Note
* that perf_event_enable() will clear the ERROR status.
*/
- event_sched_out(iter, cpuctx, ctx);
+ event_sched_out(iter, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
}
}
@@ -2102,8 +2096,8 @@ static int perf_get_aux_event(struct per
static inline struct list_head *get_event_list(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
- return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
+ return event->attr.pinned ? &event->pmu_ctx->pinned_active :
+ &event->pmu_ctx->flexible_active;
}
/*
@@ -2114,10 +2108,7 @@ static inline struct list_head *get_even
*/
static inline void perf_remove_sibling_event(struct perf_event *event)
{
- struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, event->ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
}
@@ -2241,12 +2232,14 @@ event_filter_match(struct perf_event *ev
}
static void
-event_sched_out(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
{
+ struct perf_event_pmu_context *epc = event->pmu_ctx;
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
+ // XXX cpc serialization, probably per-cpu IRQ disabled
+
WARN_ON_ONCE(event->ctx != ctx);
lockdep_assert_held(&ctx->lock);
@@ -2273,38 +2266,34 @@ event_sched_out(struct perf_event *event
perf_event_set_state(event, state);
if (!is_software_event(event))
- cpuctx->active_oncpu--;
- if (!--ctx->nr_active)
- perf_event_ctx_deactivate(ctx);
+ cpc->active_oncpu--;
+ ctx->nr_active--;
+ event->pmu_ctx->nr_active--;
if (event->attr.freq && event->attr.sample_freq)
ctx->nr_freq--;
- if (event->attr.exclusive || !cpuctx->active_oncpu)
- cpuctx->exclusive = 0;
+ if (event->attr.exclusive || !cpc->active_oncpu)
+ cpc->exclusive = 0;
perf_pmu_enable(event->pmu);
}
static void
-group_sched_out(struct perf_event *group_event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
{
struct perf_event *event;
if (group_event->state != PERF_EVENT_STATE_ACTIVE)
return;
- perf_pmu_disable(ctx->pmu);
+ perf_assert_pmu_disabled(group_event->pmu_ctx->pmu);
- event_sched_out(group_event, cpuctx, ctx);
+ event_sched_out(group_event, ctx);
/*
* Schedule out siblings (if any):
*/
for_each_sibling_event(event, group_event)
- event_sched_out(event, cpuctx, ctx);
-
- perf_pmu_enable(ctx->pmu);
+ event_sched_out(event, ctx);
}
#define DETACH_GROUP 0x01UL
@@ -2329,19 +2318,21 @@ __perf_remove_from_context(struct perf_e
update_cgrp_time_from_cpuctx(cpuctx, false);
}
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, ctx);
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
+ if (!event->pmu_ctx->nr_events)
+ event->pmu_ctx->rotate_necessary = 0;
+
if (!ctx->nr_events && ctx->is_active) {
if (ctx == &cpuctx->ctx)
update_cgrp_time_from_cpuctx(cpuctx, true);
ctx->is_active = 0;
- ctx->rotate_necessary = 0;
if (ctx->task) {
WARN_ON_ONCE(cpuctx->task_ctx != ctx);
cpuctx->task_ctx = NULL;
@@ -2376,7 +2367,7 @@ static void perf_remove_from_context(str
* cgrp_cpuctx_list.
*/
if (!ctx->is_active && !is_cgroup_event(event)) {
- __perf_remove_from_context(event, __get_cpu_context(ctx),
+ __perf_remove_from_context(event, this_cpu_ptr(&cpu_context),
ctx, (void *)flags);
raw_spin_unlock_irq(&ctx->lock);
return;
@@ -2402,13 +2393,17 @@ static void __perf_event_disable(struct
update_cgrp_time_from_event(event);
}
+ perf_pmu_disable(event->pmu_ctx->pmu);
+
if (event == event->group_leader)
- group_sched_out(event, cpuctx, ctx);
+ group_sched_out(event, ctx);
else
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_OFF);
perf_cgroup_event_disable(event, ctx);
+
+ perf_pmu_enable(event->pmu_ctx->pmu);
}
/*
@@ -2471,10 +2466,10 @@ static void perf_log_throttle(struct per
static void perf_log_itrace_start(struct perf_event *event);
static int
-event_sched_in(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
{
+ struct perf_event_pmu_context *epc = event->pmu_ctx;
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
int ret = 0;
WARN_ON_ONCE(event->ctx != ctx);
@@ -2515,14 +2510,14 @@ event_sched_in(struct perf_event *event,
}
if (!is_software_event(event))
- cpuctx->active_oncpu++;
- if (!ctx->nr_active++)
- perf_event_ctx_activate(ctx);
+ cpc->active_oncpu++;
+ ctx->nr_active++;
+ event->pmu_ctx->nr_active++;
if (event->attr.freq && event->attr.sample_freq)
ctx->nr_freq++;
if (event->attr.exclusive)
- cpuctx->exclusive = 1;
+ cpc->exclusive = 1;
out:
perf_pmu_enable(event->pmu);
@@ -2531,26 +2526,24 @@ event_sched_in(struct perf_event *event,
}
static int
-group_sched_in(struct perf_event *group_event,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+group_sched_in(struct perf_event *group_event, struct perf_event_context *ctx)
{
struct perf_event *event, *partial_group = NULL;
- struct pmu *pmu = ctx->pmu;
+ struct pmu *pmu = group_event->pmu_ctx->pmu;
if (group_event->state == PERF_EVENT_STATE_OFF)
return 0;
pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
- if (event_sched_in(group_event, cpuctx, ctx))
+ if (event_sched_in(group_event, ctx))
goto error;
/*
* Schedule in siblings as one group (if any):
*/
for_each_sibling_event(event, group_event) {
- if (event_sched_in(event, cpuctx, ctx)) {
+ if (event_sched_in(event, ctx)) {
partial_group = event;
goto group_error;
}
@@ -2569,9 +2562,9 @@ group_sched_in(struct perf_event *group_
if (event == partial_group)
break;
- event_sched_out(event, cpuctx, ctx);
+ event_sched_out(event, ctx);
}
- event_sched_out(group_event, cpuctx, ctx);
+ event_sched_out(group_event, ctx);
error:
pmu->cancel_txn(pmu);
@@ -2581,10 +2574,11 @@ group_sched_in(struct perf_event *group_
/*
* Work out whether we can put this event group on the CPU now.
*/
-static int group_can_go_on(struct perf_event *event,
- struct perf_cpu_context *cpuctx,
- int can_add_hw)
+static int group_can_go_on(struct perf_event *event, int can_add_hw)
{
+ struct perf_event_pmu_context *epc = event->pmu_ctx;
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+
/*
* Groups consisting entirely of software events can always go on.
*/
@@ -2594,7 +2588,7 @@ static int group_can_go_on(struct perf_e
* If an exclusive group is already on, no other hardware
* events can go on.
*/
- if (cpuctx->exclusive)
+ if (cpc->exclusive)
return 0;
/*
* If this group is exclusive and there are already
@@ -2616,36 +2610,29 @@ static void add_event_to_ctx(struct perf
perf_group_attach(event);
}
-static void ctx_sched_out(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
-static void
-ctx_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type);
-
-static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx,
- enum event_type_t event_type)
+static void task_ctx_sched_out(struct perf_event_context *ctx,
+ enum event_type_t event_type)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+
if (!cpuctx->task_ctx)
return;
if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
return;
- ctx_sched_out(ctx, cpuctx, event_type);
+ ctx_sched_out(ctx, event_type);
}
static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx)
{
- cpu_ctx_sched_in(cpuctx, EVENT_PINNED);
+ ctx_sched_in(&cpuctx->ctx, EVENT_PINNED);
if (ctx)
- ctx_sched_in(ctx, cpuctx, EVENT_PINNED);
- cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE);
+ ctx_sched_in(ctx, EVENT_PINNED);
+ ctx_sched_in(&cpuctx->ctx, EVENT_FLEXIBLE);
if (ctx)
- ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE);
+ ctx_sched_in(ctx, EVENT_FLEXIBLE);
}
/*
@@ -2667,7 +2654,6 @@ static void ctx_resched(struct perf_cpu_
struct perf_event_context *task_ctx,
enum event_type_t event_type)
{
- enum event_type_t ctx_event_type;
bool cpu_event = !!(event_type & EVENT_CPU);
/*
@@ -2677,11 +2663,13 @@ static void ctx_resched(struct perf_cpu_
if (event_type & EVENT_PINNED)
event_type |= EVENT_FLEXIBLE;
- ctx_event_type = event_type & EVENT_ALL;
+ event_type &= EVENT_ALL;
- perf_pmu_disable(cpuctx->ctx.pmu);
- if (task_ctx)
- task_ctx_sched_out(cpuctx, task_ctx, event_type);
+ perf_ctx_disable(&cpuctx->ctx);
+ if (task_ctx) {
+ perf_ctx_disable(task_ctx);
+ task_ctx_sched_out(task_ctx, event_type);
+ }
/*
* Decide which cpu ctx groups to schedule out based on the types
@@ -2691,17 +2679,20 @@ static void ctx_resched(struct perf_cpu_
* - otherwise, do nothing more.
*/
if (cpu_event)
- cpu_ctx_sched_out(cpuctx, ctx_event_type);
- else if (ctx_event_type & EVENT_PINNED)
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+ ctx_sched_out(&cpuctx->ctx, event_type);
+ else if (event_type & EVENT_PINNED)
+ ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
perf_event_sched_in(cpuctx, task_ctx);
- perf_pmu_enable(cpuctx->ctx.pmu);
+
+ perf_ctx_enable(&cpuctx->ctx);
+ if (task_ctx)
+ perf_ctx_enable(task_ctx);
}
void perf_pmu_resched(struct pmu *pmu)
{
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
perf_ctx_lock(cpuctx, task_ctx);
@@ -2719,7 +2710,7 @@ static int __perf_install_in_context(vo
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
bool reprogram = true;
int ret = 0;
@@ -2761,7 +2752,7 @@ static int __perf_install_in_context(vo
#endif
if (reprogram) {
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
add_event_to_ctx(event, ctx);
ctx_resched(cpuctx, task_ctx, get_event_type(event));
} else {
@@ -2909,7 +2900,7 @@ static void __perf_event_enable(struct p
return;
if (ctx->is_active)
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
perf_cgroup_event_enable(event, ctx);
@@ -2918,7 +2909,7 @@ static void __perf_event_enable(struct p
return;
if (!event_filter_match(event)) {
- ctx_sched_in(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_in(ctx, EVENT_TIME);
return;
}
@@ -2927,7 +2918,7 @@ static void __perf_event_enable(struct p
* then don't put it on unless the group is on.
*/
if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
- ctx_sched_in(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_in(ctx, EVENT_TIME);
return;
}
@@ -3196,11 +3187,52 @@ static int perf_event_modify_attr(struct
return err;
}
-static void ctx_sched_out(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
+static void __pmu_ctx_sched_out(struct perf_event_pmu_context *pmu_ctx,
+ enum event_type_t event_type)
{
+ struct perf_event_context *ctx = pmu_ctx->ctx;
struct perf_event *event, *tmp;
+ struct pmu *pmu = pmu_ctx->pmu;
+
+ if (ctx->task && !ctx->is_active) {
+ struct perf_cpu_pmu_context *cpc;
+
+ cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+ WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
+ cpc->task_epc = NULL;
+ }
+
+ if (!event_type)
+ return;
+
+ perf_pmu_disable(pmu);
+ if (event_type & EVENT_PINNED) {
+ list_for_each_entry_safe(event, tmp,
+ &pmu_ctx->pinned_active,
+ active_list)
+ group_sched_out(event, ctx);
+ }
+
+ if (event_type & EVENT_FLEXIBLE) {
+ list_for_each_entry_safe(event, tmp,
+ &pmu_ctx->flexible_active,
+ active_list)
+ group_sched_out(event, ctx);
+ /*
+ * Since we cleared EVENT_FLEXIBLE, also clear
+ * rotate_necessary, it will be reset by
+ * ctx_flexible_sched_in() when needed.
+ */
+ pmu_ctx->rotate_necessary = 0;
+ }
+ perf_pmu_enable(pmu);
+}
+
+static void
+ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
+{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_pmu_context *pmu_ctx;
int is_active = ctx->is_active;
lockdep_assert_held(&ctx->lock);
@@ -3251,24 +3283,8 @@ static void ctx_sched_out(struct perf_ev
if (!ctx->nr_active || !(is_active & EVENT_ALL))
return;
- perf_pmu_disable(ctx->pmu);
- if (is_active & EVENT_PINNED) {
- list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
- group_sched_out(event, cpuctx, ctx);
- }
-
- if (is_active & EVENT_FLEXIBLE) {
- list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
- group_sched_out(event, cpuctx, ctx);
-
- /*
- * Since we cleared EVENT_FLEXIBLE, also clear
- * rotate_necessary, is will be reset by
- * ctx_flexible_sched_in() when needed.
- */
- ctx->rotate_necessary = 0;
- }
- perf_pmu_enable(ctx->pmu);
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+ __pmu_ctx_sched_out(pmu_ctx, is_active);
}
/*
@@ -3373,26 +3389,65 @@ static void perf_event_sync_stat(struct
}
}
-static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
- struct task_struct *next)
+static void perf_event_swap_task_ctx_data(struct perf_event_context *prev_ctx,
+ struct perf_event_context *next_ctx)
+{
+ struct perf_event_pmu_context *prev_epc, *next_epc;
+
+ if (!prev_ctx->nr_task_data)
+ return;
+
+ prev_epc = list_first_entry(&prev_ctx->pmu_ctx_list,
+ struct perf_event_pmu_context,
+ pmu_ctx_entry);
+ next_epc = list_first_entry(&next_ctx->pmu_ctx_list,
+ struct perf_event_pmu_context,
+ pmu_ctx_entry);
+
+ while (&prev_epc->pmu_ctx_entry != &prev_ctx->pmu_ctx_list &&
+ &next_epc->pmu_ctx_entry != &next_ctx->pmu_ctx_list) {
+
+ WARN_ON_ONCE(prev_epc->pmu != next_epc->pmu);
+
+ /*
+ * PMU specific parts of task perf context can require
+ * additional synchronization. As an example of such
+ * synchronization see implementation details of Intel
+ * LBR call stack data profiling;
+ */
+ if (prev_epc->pmu->swap_task_ctx)
+ prev_epc->pmu->swap_task_ctx(prev_epc, next_epc);
+ else
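+ swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
+
+ prev_epc = list_next_entry(prev_epc, pmu_ctx_entry);
+ next_epc = list_next_entry(next_epc, pmu_ctx_entry);
+ }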
+}
+
+static void perf_ctx_sched_task_cb(struct perf_event_context *ctx, bool sched_in)
{
- struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
+ struct perf_event_pmu_context *pmu_ctx;
+ struct perf_cpu_pmu_context *cpc;
+
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+
+ if (cpc->sched_cb_usage && pmu_ctx->pmu->sched_task)
+ pmu_ctx->pmu->sched_task(pmu_ctx, sched_in);
+ }
+}
+
+static void
+perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
+{
+ struct perf_event_context *ctx = task->perf_event_ctxp;
struct perf_event_context *next_ctx;
struct perf_event_context *parent, *next_parent;
- struct perf_cpu_context *cpuctx;
int do_switch = 1;
- struct pmu *pmu;
if (likely(!ctx))
return;
- pmu = ctx->pmu;
- cpuctx = __get_cpu_context(ctx);
- if (!cpuctx->task_ctx)
- return;
-
rcu_read_lock();
- next_ctx = next->perf_event_ctxp[ctxn];
+ next_ctx = rcu_dereference(next->perf_event_ctxp);
if (!next_ctx)
goto unlock;
@@ -3420,23 +3475,12 @@ static void perf_event_context_sched_out
WRITE_ONCE(ctx->task, next);
WRITE_ONCE(next_ctx->task, task);
- perf_pmu_disable(pmu);
+ perf_ctx_disable(ctx);
- if (cpuctx->sched_cb_usage && pmu->sched_task)
- pmu->sched_task(ctx, false);
+ perf_ctx_sched_task_cb(ctx, false);
+ perf_event_swap_task_ctx_data(ctx, next_ctx);
- /*
- * PMU specific parts of task perf context can require
- * additional synchronization. As an example of such
- * synchronization see implementation details of Intel
- * LBR call stack data profiling;
- */
- if (pmu->swap_task_ctx)
- pmu->swap_task_ctx(ctx, next_ctx);
- else
- swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
-
- perf_pmu_enable(pmu);
+ perf_ctx_enable(ctx);
/*
* RCU_INIT_POINTER here is safe because we've not
@@ -3445,8 +3489,8 @@ static void perf_event_context_sched_out
* since those values are always verified under
* ctx->lock which we're now holding.
*/
- RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], next_ctx);
- RCU_INIT_POINTER(next->perf_event_ctxp[ctxn], ctx);
+ RCU_INIT_POINTER(task->perf_event_ctxp, next_ctx);
+ RCU_INIT_POINTER(next->perf_event_ctxp, ctx);
do_switch = 0;
@@ -3460,37 +3504,39 @@ static void perf_event_context_sched_out
if (do_switch) {
raw_spin_lock(&ctx->lock);
- perf_pmu_disable(pmu);
+ perf_ctx_disable(ctx);
- if (cpuctx->sched_cb_usage && pmu->sched_task)
- pmu->sched_task(ctx, false);
- task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+ perf_ctx_sched_task_cb(ctx, false);
+ task_ctx_sched_out(ctx, EVENT_ALL);
- perf_pmu_enable(pmu);
+ perf_ctx_enable(ctx);
raw_spin_unlock(&ctx->lock);
}
}
static DEFINE_PER_CPU(struct list_head, sched_cb_list);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
void perf_sched_cb_dec(struct pmu *pmu)
{
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
this_cpu_dec(perf_sched_cb_usages);
+ barrier();
- if (!--cpuctx->sched_cb_usage)
- list_del(&cpuctx->sched_cb_entry);
+ if (!--cpc->sched_cb_usage)
+ list_del(&cpc->sched_cb_entry);
}
void perf_sched_cb_inc(struct pmu *pmu)
{
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
- if (!cpuctx->sched_cb_usage++)
- list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+ if (!cpc->sched_cb_usage++)
+ list_add(&cpc->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+ barrier();
this_cpu_inc(perf_sched_cb_usages);
}
@@ -3502,19 +3548,21 @@ void perf_sched_cb_inc(struct pmu *pmu)
* PEBS requires this to provide PID/TID information. This requires we flush
* all queued PEBS records before we context switch to a new task.
*/
-static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in)
+static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc, bool sched_in)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct pmu *pmu;
- pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */
+ pmu = cpc->epc.pmu;
+ /* software PMUs will not have sched_task */
if (WARN_ON_ONCE(!pmu->sched_task))
return;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_pmu_disable(pmu);
- pmu->sched_task(cpuctx->task_ctx, sched_in);
+ pmu->sched_task(cpc->task_epc, sched_in);
perf_pmu_enable(pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3524,26 +3572,20 @@ static void perf_pmu_sched_task(struct t
struct task_struct *next,
bool sched_in)
{
- struct perf_cpu_context *cpuctx;
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_cpu_pmu_context *cpc;
- if (prev == next)
+ /* cpuctx->task_ctx will be handled in perf_event_context_sched_in/out */
+ if (prev == next || cpuctx->task_ctx)
return;
- list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
- /* will be handled in perf_event_context_sched_in/out */
- if (cpuctx->task_ctx)
- continue;
-
- __perf_pmu_sched_task(cpuctx, sched_in);
- }
+ list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
+ __perf_pmu_sched_task(cpc, sched_in);
}
static void perf_event_switch(struct task_struct *task,
struct task_struct *next_prev, bool sched_in);
-#define for_each_task_context_nr(ctxn) \
- for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
-
/*
* Called from scheduler to remove the events of the current task,
* with interrupts disabled.
@@ -3558,16 +3600,13 @@ static void perf_event_switch(struct tas
void __perf_event_task_sched_out(struct task_struct *task,
struct task_struct *next)
{
- int ctxn;
-
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(task, next, false);
if (atomic_read(&nr_switch_events))
perf_event_switch(task, next, false);
- for_each_task_context_nr(ctxn)
- perf_event_context_sched_out(task, ctxn, next);
+ perf_event_context_sched_out(task, next);
/*
* if cgroup events exist on this CPU, then we need
@@ -3578,15 +3617,6 @@ void __perf_event_task_sched_out(struct
perf_cgroup_switch(next);
}
-/*
- * Called with IRQs disabled
- */
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
-{
- ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
-}
-
static bool perf_less_group_idx(const void *l, const void *r)
{
const struct perf_event *le = *(const struct perf_event **)l;
@@ -3618,21 +3648,36 @@ static void __heap_add(struct min_heap *
}
}
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+static void __link_epc(struct perf_event_pmu_context *pmu_ctx)
+{
+ struct perf_cpu_pmu_context *cpc;
+
+ if (!pmu_ctx->ctx->task)
+ return;
+
+ cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+ WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
+ cpc->task_epc = pmu_ctx;
+}
+
+static noinline int visit_groups_merge(struct perf_event_context *ctx,
struct perf_event_groups *groups, int cpu,
+ struct pmu *pmu,
int (*func)(struct perf_event *, void *),
void *data)
{
#ifdef CONFIG_CGROUP_PERF
struct cgroup_subsys_state *css = NULL;
#endif
+ struct perf_cpu_context *cpuctx = NULL;
/* Space for per CPU and/or any CPU event iterators. */
struct perf_event *itrs[2];
struct min_heap event_heap;
struct perf_event **evt;
int ret;
- if (cpuctx) {
+ if (!ctx->task) {
+ cpuctx = this_cpu_ptr(&cpu_context);
event_heap = (struct min_heap){
.data = cpuctx->heap,
.nr = 0,
@@ -3652,17 +3697,28 @@ static noinline int visit_groups_merge(s
.size = ARRAY_SIZE(itrs),
};
/* Events not within a CPU context may be on any CPU. */
- __heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
+ __heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
}
evt = event_heap.data;
- __heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+ __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
#ifdef CONFIG_CGROUP_PERF
for (; css; css = css->parent)
- __heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
+ __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
#endif
+ if (event_heap.nr) {
+ /*
+ * XXX: For now, visit_groups_merge() is never called with a
+ * NULL pmu pointer. But these functions need to be called
+ * once for each pmu if I implement the pmu=NULL optimization.
+ */
+ __link_epc((*evt)->pmu_ctx);
+ perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
+ }
+
+
min_heapify_all(&event_heap, &perf_min_heap);
while (event_heap.nr) {
@@ -3670,7 +3726,7 @@ static noinline int visit_groups_merge(s
if (ret)
return ret;
- *evt = perf_event_groups_next(*evt);
+ *evt = perf_event_groups_next(*evt, pmu);
if (*evt)
min_heapify(&event_heap, 0, &perf_min_heap);
else
@@ -3712,7 +3768,6 @@ static inline void group_update_userpage
static int merge_sched_in(struct perf_event *event, void *data)
{
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
int *can_add_hw = data;
if (event->state <= PERF_EVENT_STATE_OFF)
@@ -3721,8 +3776,8 @@ static int merge_sched_in(struct perf_ev
if (!event_filter_match(event))
return 0;
- if (group_can_go_on(event, cpuctx, *can_add_hw)) {
- if (!group_sched_in(event, cpuctx, ctx))
+ if (group_can_go_on(event, *can_add_hw)) {
+ if (!group_sched_in(event, ctx))
list_add_tail(&event->active_list, get_event_list(event));
}
@@ -3732,8 +3787,11 @@ static int merge_sched_in(struct perf_ev
perf_cgroup_event_disable(event, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
} else {
- ctx->rotate_necessary = 1;
- perf_mux_hrtimer_restart(cpuctx);
+ struct perf_cpu_pmu_context *cpc;
+
+ event->pmu_ctx->rotate_necessary = 1;
+ cpc = this_cpu_ptr(event->pmu_ctx->pmu->cpu_pmu_context);
+ perf_mux_hrtimer_restart(cpc);
group_update_userpage(event);
}
}
@@ -3741,39 +3799,67 @@ static int merge_sched_in(struct perf_ev
return 0;
}
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx)
+static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
{
+ struct perf_event_pmu_context *pmu_ctx;
int can_add_hw = 1;
- if (ctx != &cpuctx->ctx)
- cpuctx = NULL;
-
- visit_groups_merge(cpuctx, &ctx->pinned_groups,
- smp_processor_id(),
- merge_sched_in, &can_add_hw);
+ if (pmu) {
+ visit_groups_merge(ctx, &ctx->pinned_groups,
+ smp_processor_id(), pmu,
+ merge_sched_in, &can_add_hw);
+ } else {
+ /*
+ * XXX: This can be optimized for per-task context by calling
+ * visit_groups_merge() only once with:
+ * 1) pmu=NULL
+ * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
+ * 3) Making can_add_hw a per-pmu variable
+ *
+ * Though, it can not be optimized for per-cpu context because
+ * per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
+ * consist of cgroup-subtrees, i.e. cgroup events of the same
+ * cgroup but different pmus are separated out into respective
+ * pmu-subtrees.
+ */
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ can_add_hw = 1;
+ visit_groups_merge(ctx, &ctx->pinned_groups,
+ smp_processor_id(), pmu_ctx->pmu,
+ merge_sched_in, &can_add_hw);
+ }
+ }
}
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx)
+/* XXX .busy thingy from Peter's patch */
+static void ctx_flexible_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
{
+ struct perf_event_pmu_context *pmu_ctx;
int can_add_hw = 1;
- if (ctx != &cpuctx->ctx)
- cpuctx = NULL;
+ if (pmu) {
+ visit_groups_merge(ctx, &ctx->flexible_groups,
+ smp_processor_id(), pmu,
+ merge_sched_in, &can_add_hw);
+ } else {
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ can_add_hw = 1;
+ visit_groups_merge(ctx, &ctx->flexible_groups,
+ smp_processor_id(), pmu_ctx->pmu,
+ merge_sched_in, &can_add_hw);
+ }
+ }
+}
- visit_groups_merge(cpuctx, &ctx->flexible_groups,
- smp_processor_id(),
- merge_sched_in, &can_add_hw);
+static void __pmu_ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
+{
+ ctx_flexible_sched_in(ctx, pmu);
}
static void
-ctx_sched_in(struct perf_event_context *ctx,
- struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
+ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
int is_active = ctx->is_active;
lockdep_assert_held(&ctx->lock);
@@ -3785,6 +3871,7 @@ ctx_sched_in(struct perf_event_context *
/* start ctx time */
__update_context_time(ctx, false);
perf_cgroup_set_timestamp(cpuctx);
+ // XXX ctx->task =? task
/*
* CPU-release for the below ->is_active store,
* see __load_acquire() in perf_event_time_now()
@@ -3807,39 +3894,32 @@ ctx_sched_in(struct perf_event_context *
* in order to give them the best chance of going on.
*/
if (is_active & EVENT_PINNED)
- ctx_pinned_sched_in(ctx, cpuctx);
+ ctx_pinned_sched_in(ctx, NULL);
/* Then walk through the lower prio flexible groups */
if (is_active & EVENT_FLEXIBLE)
- ctx_flexible_sched_in(ctx, cpuctx);
+ ctx_flexible_sched_in(ctx, NULL);
}
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
- enum event_type_t event_type)
+static void perf_event_context_sched_in(struct task_struct *task)
{
- struct perf_event_context *ctx = &cpuctx->ctx;
-
- ctx_sched_in(ctx, cpuctx, event_type);
-}
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_context *ctx;
-static void perf_event_context_sched_in(struct perf_event_context *ctx,
- struct task_struct *task)
-{
- struct perf_cpu_context *cpuctx;
- struct pmu *pmu;
+ rcu_read_lock();
+ ctx = rcu_dereference(task->perf_event_ctxp);
+ if (!ctx)
+ goto rcu_unlock;
- cpuctx = __get_cpu_context(ctx);
+ if (cpuctx->task_ctx == ctx) {
+ perf_ctx_lock(cpuctx, ctx);
+ perf_ctx_disable(ctx);
- /*
- * HACK: for HETEROGENEOUS the task context might have switched to a
- * different PMU, force (re)set the context,
- */
- pmu = ctx->pmu = cpuctx->ctx.pmu;
+ perf_ctx_sched_task_cb(ctx, true);
- if (cpuctx->task_ctx == ctx) {
- if (cpuctx->sched_cb_usage)
- __perf_pmu_sched_task(cpuctx, true);
- return;
+ perf_ctx_enable(ctx);
+ perf_ctx_unlock(cpuctx, ctx);
+ goto rcu_unlock;
}
perf_ctx_lock(cpuctx, ctx);
@@ -3850,7 +3930,7 @@ static void perf_event_context_sched_in(
if (!ctx->nr_events)
goto unlock;
- perf_pmu_disable(pmu);
+ perf_ctx_disable(ctx);
/*
* We want to keep the following priority order:
* cpu pinned (that don't need to move), task pinned,
@@ -3859,17 +3939,24 @@ static void perf_event_context_sched_in(
* However, if task's ctx is not carrying any pinned
* events, no need to flip the cpuctx's events around.
*/
- if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+ if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
+ perf_ctx_disable(&cpuctx->ctx);
+ ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
+ }
+
perf_event_sched_in(cpuctx, ctx);
- if (cpuctx->sched_cb_usage && pmu->sched_task)
- pmu->sched_task(cpuctx->task_ctx, true);
+ perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
- perf_pmu_enable(pmu);
+ if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
+ perf_ctx_enable(&cpuctx->ctx);
+
+ perf_ctx_enable(ctx);
unlock:
perf_ctx_unlock(cpuctx, ctx);
+rcu_unlock:
+ rcu_read_unlock();
}
/*
@@ -3886,16 +3973,7 @@ static void perf_event_context_sched_in(
void __perf_event_task_sched_in(struct task_struct *prev,
struct task_struct *task)
{
- struct perf_event_context *ctx;
- int ctxn;
-
- for_each_task_context_nr(ctxn) {
- ctx = task->perf_event_ctxp[ctxn];
- if (likely(!ctx))
- continue;
-
- perf_event_context_sched_in(ctx, task);
- }
+ perf_event_context_sched_in(task);
if (atomic_read(&nr_switch_events))
perf_event_switch(task, prev, true);
@@ -4014,8 +4092,8 @@ static void perf_adjust_period(struct pe
* events. At the same time, make sure, having freq events does not change
* the rate of unthrottling as that would introduce bias.
*/
-static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
- int needs_unthr)
+static void
+perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
{
struct perf_event *event;
struct hw_perf_event *hwc;
@@ -4027,16 +4105,16 @@ static void perf_adjust_freq_unthr_conte
* - context have events in frequency mode (needs freq adjust)
* - there are events to unthrottle on this cpu
*/
- if (!(ctx->nr_freq || needs_unthr))
+ if (!(ctx->nr_freq || unthrottle))
return;
raw_spin_lock(&ctx->lock);
- perf_pmu_disable(ctx->pmu);
list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->state != PERF_EVENT_STATE_ACTIVE)
continue;
+ // XXX use visit thingy to avoid the -1,cpu match
if (!event_filter_match(event))
continue;
@@ -4077,7 +4155,6 @@ static void perf_adjust_freq_unthr_conte
perf_pmu_enable(event->pmu);
}
- perf_pmu_enable(ctx->pmu);
raw_spin_unlock(&ctx->lock);
}
@@ -4099,72 +4176,111 @@ static void rotate_ctx(struct perf_event
/* pick an event from the flexible_groups to rotate */
static inline struct perf_event *
-ctx_event_to_rotate(struct perf_event_context *ctx)
+ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
{
struct perf_event *event;
+ struct rb_node *node;
+ struct rb_root *tree;
+ struct __group_key key = {
+ .pmu = pmu_ctx->pmu,
+ };
/* pick the first active flexible event */
- event = list_first_entry_or_null(&ctx->flexible_active,
+ event = list_first_entry_or_null(&pmu_ctx->flexible_active,
struct perf_event, active_list);
+ if (event)
+ goto out;
/* if no active flexible event, pick the first event */
- if (!event) {
- event = rb_entry_safe(rb_first(&ctx->flexible_groups.tree),
- typeof(*event), group_node);
+ tree = &pmu_ctx->ctx->flexible_groups.tree;
+
+ if (!pmu_ctx->ctx->task) {
+ key.cpu = smp_processor_id();
+
+ node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+ if (node)
+ event = __node_2_pe(node);
+ goto out;
}
+ key.cpu = -1;
+ node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+ if (node) {
+ event = __node_2_pe(node);
+ goto out;
+ }
+
+ key.cpu = smp_processor_id();
+ node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+ if (node)
+ event = __node_2_pe(node);
+
+out:
/*
* Unconditionally clear rotate_necessary; if ctx_flexible_sched_in()
* finds there are unschedulable events, it will set it again.
*/
- ctx->rotate_necessary = 0;
+ pmu_ctx->rotate_necessary = 0;
return event;
}
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
struct perf_event *cpu_event = NULL, *task_event = NULL;
struct perf_event_context *task_ctx = NULL;
int cpu_rotate, task_rotate;
+ struct pmu *pmu;
/*
* Since we run this from IRQ context, nobody can install new
* events, thus the event count values are stable.
*/
- cpu_rotate = cpuctx->ctx.rotate_necessary;
+ cpu_epc = &cpc->epc;
+ pmu = cpu_epc->pmu;
+ task_epc = cpc->task_epc;
+
+ cpu_rotate = cpu_epc->rotate_necessary;
task_ctx = cpuctx->task_ctx;
- task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
+ task_rotate = task_epc ? task_epc->rotate_necessary : 0;
if (!(cpu_rotate || task_rotate))
return false;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- perf_pmu_disable(cpuctx->ctx.pmu);
+ perf_pmu_disable(pmu);
if (task_rotate)
- task_event = ctx_event_to_rotate(task_ctx);
+ task_event = ctx_event_to_rotate(task_epc);
if (cpu_rotate)
- cpu_event = ctx_event_to_rotate(&cpuctx->ctx);
+ cpu_event = ctx_event_to_rotate(cpu_epc);
/*
* As per the order given at ctx_resched() first 'pop' task flexible
* and then, if needed CPU flexible.
*/
- if (task_event || (task_ctx && cpu_event))
- ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
- if (cpu_event)
- cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+ if (task_event || (task_epc && cpu_event)) {
+ update_context_time(task_epc->ctx);
+ __pmu_ctx_sched_out(task_epc, EVENT_FLEXIBLE);
+ }
- if (task_event)
- rotate_ctx(task_ctx, task_event);
- if (cpu_event)
+ if (cpu_event) {
+ update_context_time(&cpuctx->ctx);
+ __pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE);
rotate_ctx(&cpuctx->ctx, cpu_event);
+ __pmu_ctx_sched_in(&cpuctx->ctx, pmu);
+ }
- perf_event_sched_in(cpuctx, task_ctx);
+ if (task_event)
+ rotate_ctx(task_epc->ctx, task_event);
+
+ if (task_event || (task_epc && cpu_event))
+ __pmu_ctx_sched_in(task_epc->ctx, pmu);
- perf_pmu_enable(cpuctx->ctx.pmu);
+ perf_pmu_enable(pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
return true;
@@ -4172,8 +4288,8 @@ static bool perf_rotate_context(struct p
void perf_event_task_tick(void)
{
- struct list_head *head = this_cpu_ptr(&active_ctx_list);
- struct perf_event_context *ctx, *tmp;
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+ struct perf_event_context *ctx;
int throttled;
lockdep_assert_irqs_disabled();
@@ -4182,8 +4298,13 @@ void perf_event_task_tick(void)
throttled = __this_cpu_xchg(perf_throttled_count, 0);
tick_dep_clear_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
- list_for_each_entry_safe(ctx, tmp, head, active_ctx_list)
- perf_adjust_freq_unthr_context(ctx, throttled);
+ perf_adjust_freq_unthr_context(&cpuctx->ctx, !!throttled);
+
+ rcu_read_lock();
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx)
+ perf_adjust_freq_unthr_context(ctx, !!throttled);
+ rcu_read_unlock();
}
static int event_enable_on_exec(struct perf_event *event,
@@ -4205,9 +4326,9 @@ static int event_enable_on_exec(struct p
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
*/
-static void perf_event_enable_on_exec(int ctxn)
+static void perf_event_enable_on_exec(struct perf_event_context *ctx)
{
- struct perf_event_context *ctx, *clone_ctx = NULL;
+ struct perf_event_context *clone_ctx = NULL;
enum event_type_t event_type = 0;
struct perf_cpu_context *cpuctx;
struct perf_event *event;
@@ -4215,13 +4336,16 @@ static void perf_event_enable_on_exec(in
int enabled = 0;
local_irq_save(flags);
- ctx = current->perf_event_ctxp[ctxn];
- if (!ctx || !ctx->nr_events)
+ if (WARN_ON_ONCE(current->perf_event_ctxp != ctx))
+ goto out;
+
+ if (!ctx->nr_events)
goto out;
- cpuctx = __get_cpu_context(ctx);
+ cpuctx = this_cpu_ptr(&cpu_context);
perf_ctx_lock(cpuctx, ctx);
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
+
list_for_each_entry(event, &ctx->event_list, event_entry) {
enabled |= event_enable_on_exec(event, ctx);
event_type |= get_event_type(event);
@@ -4234,7 +4358,7 @@ static void perf_event_enable_on_exec(in
clone_ctx = unclone_ctx(ctx);
ctx_resched(cpuctx, ctx, event_type);
} else {
- ctx_sched_in(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_in(ctx, EVENT_TIME);
}
perf_ctx_unlock(cpuctx, ctx);
@@ -4253,16 +4377,14 @@ static void perf_event_exit_event(struct
* Removes all events from the current task that have been marked
* remove-on-exec, and feeds their values back to parent events.
*/
-static void perf_event_remove_on_exec(int ctxn)
+static void perf_event_remove_on_exec(struct perf_event_context *ctx)
{
- struct perf_event_context *ctx, *clone_ctx = NULL;
+ struct perf_event_context *clone_ctx = NULL;
struct perf_event *event, *next;
unsigned long flags;
bool modified = false;
- ctx = perf_pin_task_context(current, ctxn);
- if (!ctx)
- return;
+ perf_pin_task_context(current);
mutex_lock(&ctx->mutex);
@@ -4326,7 +4448,7 @@ static void __perf_event_read(void *info
struct perf_read_data *data = info;
struct perf_event *sub, *event = data->event;
struct perf_event_context *ctx = event->ctx;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct pmu *pmu = event->pmu;
/*
@@ -4552,17 +4674,25 @@ static void __perf_event_init_context(st
{
raw_spin_lock_init(&ctx->lock);
mutex_init(&ctx->mutex);
- INIT_LIST_HEAD(&ctx->active_ctx_list);
+ INIT_LIST_HEAD(&ctx->pmu_ctx_list);
perf_event_groups_init(&ctx->pinned_groups);
perf_event_groups_init(&ctx->flexible_groups);
INIT_LIST_HEAD(&ctx->event_list);
- INIT_LIST_HEAD(&ctx->pinned_active);
- INIT_LIST_HEAD(&ctx->flexible_active);
refcount_set(&ctx->refcount, 1);
}
+static void
+__perf_init_event_pmu_context(struct perf_event_pmu_context *epc, struct pmu *pmu)
+{
+ epc->pmu = pmu;
+ INIT_LIST_HEAD(&epc->pmu_ctx_entry);
+ INIT_LIST_HEAD(&epc->pinned_active);
+ INIT_LIST_HEAD(&epc->flexible_active);
+ atomic_set(&epc->refcount, 1);
+}
+
static struct perf_event_context *
-alloc_perf_context(struct pmu *pmu, struct task_struct *task)
+alloc_perf_context(struct task_struct *task)
{
struct perf_event_context *ctx;
@@ -4573,7 +4703,6 @@ alloc_perf_context(struct pmu *pmu, stru
__perf_event_init_context(ctx);
if (task)
ctx->task = get_task_struct(task);
- ctx->pmu = pmu;
return ctx;
}
@@ -4602,15 +4731,12 @@ find_lively_task_by_vpid(pid_t vpid)
* Returns a matching context with refcount and pincount.
*/
static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task,
- struct perf_event *event)
+find_get_context(struct task_struct *task, struct perf_event *event)
{
struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_cpu_context *cpuctx;
- void *task_ctx_data = NULL;
unsigned long flags;
- int ctxn, err;
- int cpu = event->cpu;
+ int err;
if (!task) {
/* Must be root to operate on a CPU event: */
@@ -4618,7 +4744,7 @@ find_get_context(struct pmu *pmu, struct
if (err)
return ERR_PTR(err);
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
+ cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
ctx = &cpuctx->ctx;
get_ctx(ctx);
raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -4629,43 +4755,22 @@ find_get_context(struct pmu *pmu, struct
}
err = -EINVAL;
- ctxn = pmu->task_ctx_nr;
- if (ctxn < 0)
- goto errout;
-
- if (event->attach_state & PERF_ATTACH_TASK_DATA) {
- task_ctx_data = alloc_task_ctx_data(pmu);
- if (!task_ctx_data) {
- err = -ENOMEM;
- goto errout;
- }
- }
-
retry:
- ctx = perf_lock_task_context(task, ctxn, &flags);
+ ctx = perf_lock_task_context(task, &flags);
if (ctx) {
clone_ctx = unclone_ctx(ctx);
++ctx->pin_count;
- if (task_ctx_data && !ctx->task_ctx_data) {
- ctx->task_ctx_data = task_ctx_data;
- task_ctx_data = NULL;
- }
raw_spin_unlock_irqrestore(&ctx->lock, flags);
if (clone_ctx)
put_ctx(clone_ctx);
} else {
- ctx = alloc_perf_context(pmu, task);
+ ctx = alloc_perf_context(task);
err = -ENOMEM;
if (!ctx)
goto errout;
- if (task_ctx_data) {
- ctx->task_ctx_data = task_ctx_data;
- task_ctx_data = NULL;
- }
-
err = 0;
mutex_lock(&task->perf_event_mutex);
/*
@@ -4674,12 +4779,12 @@ find_get_context(struct pmu *pmu, struct
*/
if (task->flags & PF_EXITING)
err = -ESRCH;
- else if (task->perf_event_ctxp[ctxn])
+ else if (task->perf_event_ctxp)
err = -EAGAIN;
else {
get_ctx(ctx);
++ctx->pin_count;
- rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+ rcu_assign_pointer(task->perf_event_ctxp, ctx);
}
mutex_unlock(&task->perf_event_mutex);
@@ -4692,14 +4797,117 @@ find_get_context(struct pmu *pmu, struct
}
}
- free_task_ctx_data(pmu, task_ctx_data);
return ctx;
errout:
- free_task_ctx_data(pmu, task_ctx_data);
return ERR_PTR(err);
}
+struct perf_event_pmu_context *
+find_get_pmu_context(struct pmu *pmu, struct perf_event_context *ctx,
+ struct perf_event *event)
+{
+ struct perf_event_pmu_context *new = NULL, *epc;
+ void *task_ctx_data = NULL;
+
+ if (!ctx->task) {
+ struct perf_cpu_pmu_context *cpc;
+
+ cpc = per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
+ epc = &cpc->epc;
+
+ if (!epc->ctx) {
+ atomic_set(&epc->refcount, 1);
+ epc->embedded = 1;
+ raw_spin_lock_irq(&ctx->lock);
+ list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+ epc->ctx = ctx;
+ raw_spin_unlock_irq(&ctx->lock);
+ } else {
+ WARN_ON_ONCE(epc->ctx != ctx);
+ atomic_inc(&epc->refcount);
+ }
+
+ return epc;
+ }
+
+ new = kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return ERR_PTR(-ENOMEM);
+
+ if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+ task_ctx_data = alloc_task_ctx_data(pmu);
+ if (!task_ctx_data) {
+ kfree(new);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+
+ __perf_init_event_pmu_context(new, pmu);
+
+ raw_spin_lock_irq(&ctx->lock);
+ list_for_each_entry(epc, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ if (epc->pmu == pmu) {
+ WARN_ON_ONCE(epc->ctx != ctx);
+ atomic_inc(&epc->refcount);
+ goto found_epc;
+ }
+ }
+
+ epc = new;
+ new = NULL;
+
+ list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+ epc->ctx = ctx;
+
+found_epc:
+ if (task_ctx_data && !epc->task_ctx_data) {
+ epc->task_ctx_data = task_ctx_data;
+ task_ctx_data = NULL;
+ ctx->nr_task_data++;
+ }
+ raw_spin_unlock_irq(&ctx->lock);
+
+ free_task_ctx_data(pmu, task_ctx_data);
+ kfree(new);
+
+ return epc;
+}
+
+static void get_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+ WARN_ON_ONCE(!atomic_inc_not_zero(&epc->refcount));
+}
+
+static void put_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+ unsigned long flags;
+
+ if (!atomic_dec_and_test(&epc->refcount))
+ return;
+
+ if (epc->ctx) {
+ struct perf_event_context *ctx = epc->ctx;
+
+ // XXX ctx->mutex
+
+ WARN_ON_ONCE(list_empty(&epc->pmu_ctx_entry));
+ raw_spin_lock_irqsave(&ctx->lock, flags);
+ list_del_init(&epc->pmu_ctx_entry);
+ epc->ctx = NULL;
+ raw_spin_unlock_irqrestore(&ctx->lock, flags);
+ }
+
+ WARN_ON_ONCE(!list_empty(&epc->pinned_active));
+ WARN_ON_ONCE(!list_empty(&epc->flexible_active));
+
+ if (epc->embedded)
+ return;
+
+ kfree(epc->task_ctx_data);
+ kfree(epc);
+}
+
static void perf_event_free_filter(struct perf_event *event);
static void free_event_rcu(struct rcu_head *head)
@@ -4968,6 +5176,9 @@ static void _free_event(struct perf_even
if (event->hw.target)
put_task_struct(event->hw.target);
+ if (event->pmu_ctx)
+ put_pmu_ctx(event->pmu_ctx);
+
/*
* perf_event_free_task() relies on put_ctx() being 'last', in particular
* all task references must be cleaned up.
@@ -5498,7 +5709,7 @@ static void __perf_event_period(struct p
active = (event->state == PERF_EVENT_STATE_ACTIVE);
if (active) {
- perf_pmu_disable(ctx->pmu);
+ perf_pmu_disable(event->pmu);
/*
* We could be throttled; unthrottle now to avoid the tick
* trying to unthrottle while we already re-started the event.
@@ -5514,7 +5725,7 @@ static void __perf_event_period(struct p
if (active) {
event->pmu->start(event, PERF_EF_RELOAD);
- perf_pmu_enable(ctx->pmu);
+ perf_pmu_enable(event->pmu);
}
}
@@ -7606,7 +7817,6 @@ perf_iterate_sb(perf_iterate_f output, v
struct perf_event_context *task_ctx)
{
struct perf_event_context *ctx;
- int ctxn;
rcu_read_lock();
preempt_disable();
@@ -7623,11 +7833,9 @@ perf_iterate_sb(perf_iterate_f output, v
perf_iterate_sb_cpu(output, data);
- for_each_task_context_nr(ctxn) {
- ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx)
- perf_iterate_ctx(ctx, output, data, false);
- }
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx)
+ perf_iterate_ctx(ctx, output, data, false);
done:
preempt_enable();
rcu_read_unlock();
@@ -7669,20 +7877,15 @@ static void perf_event_addr_filters_exec
void perf_event_exec(void)
{
struct perf_event_context *ctx;
- int ctxn;
-
- for_each_task_context_nr(ctxn) {
- perf_event_enable_on_exec(ctxn);
- perf_event_remove_on_exec(ctxn);
- rcu_read_lock();
- ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (ctx) {
- perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
- NULL, true);
- }
- rcu_read_unlock();
+ rcu_read_lock();
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx) {
+ perf_event_enable_on_exec(ctx);
+ perf_event_remove_on_exec(ctx);
+ perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
}
+ rcu_read_unlock();
}
struct remote_output {
@@ -7722,8 +7925,7 @@ static void __perf_event_output_stop(str
static int __perf_pmu_output_stop(void *info)
{
struct perf_event *event = info;
- struct pmu *pmu = event->ctx->pmu;
- struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct remote_output ro = {
.rb = event->rb,
};
@@ -8512,7 +8714,6 @@ static void __perf_addr_filters_adjust(s
static void perf_addr_filters_adjust(struct vm_area_struct *vma)
{
struct perf_event_context *ctx;
- int ctxn;
/*
* Data tracing isn't supported yet and as such there is no need
@@ -8522,13 +8723,9 @@ static void perf_addr_filters_adjust(str
return;
rcu_read_lock();
- for_each_task_context_nr(ctxn) {
- ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
- if (!ctx)
- continue;
-
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx)
perf_iterate_ctx(ctx, __perf_addr_filters_adjust, vma, true);
- }
rcu_read_unlock();
}
@@ -9737,10 +9934,13 @@ void perf_tp_event(u16 event_type, u64 c
struct trace_entry *entry = record;
rcu_read_lock();
- ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
+ ctx = rcu_dereference(task->perf_event_ctxp);
if (!ctx)
goto unlock;
+ // XXX iterate groups instead, we should be able to
+ // find the subtree for the perf_tracepoint pmu and CPU.
+
list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->cpu != smp_processor_id())
continue;
@@ -10873,36 +11073,9 @@ static int perf_event_idx_default(struct
return 0;
}
-/*
- * Ensures all contexts with the same task_ctx_nr have the same
- * pmu_cpu_context too.
- */
-static struct perf_cpu_context __percpu *find_pmu_context(int ctxn)
-{
- struct pmu *pmu;
-
- if (ctxn < 0)
- return NULL;
-
- list_for_each_entry(pmu, &pmus, entry) {
- if (pmu->task_ctx_nr == ctxn)
- return pmu->pmu_cpu_context;
- }
-
- return NULL;
-}
-
static void free_pmu_context(struct pmu *pmu)
{
- /*
- * Static contexts such as perf_sw_context have a global lifetime
- * and may be shared between different PMUs. Avoid freeing them
- * when a single PMU is going away.
- */
- if (pmu->task_ctx_nr > perf_invalid_context)
- return;
-
- free_percpu(pmu->pmu_cpu_context);
+ free_percpu(pmu->cpu_pmu_context);
}
/*
@@ -10966,12 +11139,12 @@ perf_event_mux_interval_ms_store(struct
/* update all cpuctx for this PMU */
cpus_read_lock();
for_each_online_cpu(cpu) {
- struct perf_cpu_context *cpuctx;
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
+ struct perf_cpu_pmu_context *cpc;
+ cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+ cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
cpu_function_call(cpu,
- (remote_function_f)perf_mux_hrtimer_restart, cpuctx);
+ (remote_function_f)perf_mux_hrtimer_restart, cpc);
}
cpus_read_unlock();
mutex_unlock(&mux_interval_mutex);
@@ -11082,47 +11255,19 @@ int perf_pmu_register(struct pmu *pmu, c
}
skip_type:
- if (pmu->task_ctx_nr == perf_hw_context) {
- static int hw_context_taken = 0;
-
- /*
- * Other than systems with heterogeneous CPUs, it never makes
- * sense for two PMUs to share perf_hw_context. PMUs which are
- * uncore must use perf_invalid_context.
- */
- if (WARN_ON_ONCE(hw_context_taken &&
- !(pmu->capabilities & PERF_PMU_CAP_HETEROGENEOUS_CPUS)))
- pmu->task_ctx_nr = perf_invalid_context;
-
- hw_context_taken = 1;
- }
-
- pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
- if (pmu->pmu_cpu_context)
- goto got_cpu_context;
-
ret = -ENOMEM;
- pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
- if (!pmu->pmu_cpu_context)
+ pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
+ if (!pmu->cpu_pmu_context)
goto free_dev;
for_each_possible_cpu(cpu) {
- struct perf_cpu_context *cpuctx;
-
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- __perf_event_init_context(&cpuctx->ctx);
- lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
- lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
- cpuctx->ctx.pmu = pmu;
- cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
-
- __perf_mux_hrtimer_init(cpuctx, cpu);
+ struct perf_cpu_pmu_context *cpc;
- cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
- cpuctx->heap = cpuctx->heap_default;
+ cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+ __perf_init_event_pmu_context(&cpc->epc, pmu);
+ __perf_mux_hrtimer_init(cpc, cpu);
}
-got_cpu_context:
if (!pmu->start_txn) {
if (pmu->pmu_enable) {
/*
@@ -11604,10 +11749,11 @@ perf_event_alloc(struct perf_event_attr
}
/*
- * Disallow uncore-cgroup events, they don't make sense as the cgroup will
- * be different on other CPUs in the uncore mask.
+ * Disallow uncore-task events. Similarly, disallow uncore-cgroup
+ * events (they don't make sense as the cgroup will be different
+ * on other CPUs in the uncore mask).
*/
- if (pmu->task_ctx_nr == perf_invalid_context && cgroup_fd != -1) {
+ if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
err = -EINVAL;
goto err_pmu;
}
@@ -11893,15 +12039,6 @@ perf_event_set_output(struct perf_event
return ret;
}
-static void mutex_lock_double(struct mutex *a, struct mutex *b)
-{
- if (b < a)
- swap(a, b);
-
- mutex_lock(a);
- mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
-}
-
static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
{
bool nmi_safe = false;
@@ -11939,37 +12076,6 @@ static int perf_event_set_clock(struct p
return 0;
}
-/*
- * Variation on perf_event_ctx_lock_nested(), except we take two context
- * mutexes.
- */
-static struct perf_event_context *
-__perf_event_ctx_lock_double(struct perf_event *group_leader,
- struct perf_event_context *ctx)
-{
- struct perf_event_context *gctx;
-
-again:
- rcu_read_lock();
- gctx = READ_ONCE(group_leader->ctx);
- if (!refcount_inc_not_zero(&gctx->refcount)) {
- rcu_read_unlock();
- goto again;
- }
- rcu_read_unlock();
-
- mutex_lock_double(&gctx->mutex, &ctx->mutex);
-
- if (group_leader->ctx != gctx) {
- mutex_unlock(&ctx->mutex);
- mutex_unlock(&gctx->mutex);
- put_ctx(gctx);
- goto again;
- }
-
- return gctx;
-}
-
static bool
perf_check_permission(struct perf_event_attr *attr, struct task_struct *task)
{
@@ -12015,9 +12121,10 @@ SYSCALL_DEFINE5(perf_event_open,
pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
{
struct perf_event *group_leader = NULL, *output_event = NULL;
+ struct perf_event_pmu_context *pmu_ctx;
struct perf_event *event, *sibling;
struct perf_event_attr attr;
- struct perf_event_context *ctx, *gctx;
+ struct perf_event_context *ctx;
struct file *event_file = NULL;
struct fd group = {NULL, 0};
struct task_struct *task = NULL;
@@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_task;
}
+ // XXX premature; what if this is allowed, but we get moved to a PMU
+ // that doesn't have this.
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;
@@ -12147,42 +12256,37 @@ SYSCALL_DEFINE5(perf_event_open,
if (pmu->task_ctx_nr == perf_sw_context)
event->event_caps |= PERF_EV_CAP_SOFTWARE;
- if (group_leader) {
- if (is_software_event(event) &&
- !in_software_context(group_leader)) {
- /*
- * If the event is a sw event, but the group_leader
- * is on hw context.
- *
- * Allow the addition of software events to hw
- * groups, this is safe because software events
- * never fail to schedule.
- */
- pmu = group_leader->ctx->pmu;
- } else if (!is_software_event(event) &&
- is_software_event(group_leader) &&
- (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
- /*
- * In case the group is a pure software group, and we
- * try to add a hardware event, move the whole group to
- * the hardware context.
- */
- move_group = 1;
- }
- }
-
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(pmu, task, event);
+ ctx = find_get_context(task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
goto err_alloc;
}
- /*
- * Look up the group leader (we will attach this event to it):
- */
+ mutex_lock(&ctx->mutex);
+
+ if (ctx->task == TASK_TOMBSTONE) {
+ err = -ESRCH;
+ goto err_locked;
+ }
+
+ if (!task) {
+ /*
+ * Check if the @cpu we're creating an event for is online.
+ *
+ * We use the perf_cpu_context::ctx::mutex to serialize against
+ * the hotplug notifiers. See perf_event_{init,exit}_cpu().
+ */
+ struct perf_cpu_context *cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
+
+ if (!cpuctx->online) {
+ err = -ENODEV;
+ goto err_locked;
+ }
+ }
+
if (group_leader) {
err = -EINVAL;
@@ -12191,11 +12295,11 @@ SYSCALL_DEFINE5(perf_event_open,
* becoming part of another group-sibling):
*/
if (group_leader->group_leader != group_leader)
- goto err_context;
+ goto err_locked;
/* All events in a group should have the same clock */
if (group_leader->clock != event->clock)
- goto err_context;
+ goto err_locked;
/*
* Make sure we're both events for the same CPU;
@@ -12203,41 +12307,60 @@ SYSCALL_DEFINE5(perf_event_open,
* you can never concurrently schedule them anyhow.
*/
if (group_leader->cpu != event->cpu)
- goto err_context;
-
- /*
- * Make sure we're both on the same task, or both
- * per-CPU events.
- */
- if (group_leader->ctx->task != ctx->task)
- goto err_context;
+ goto err_locked;
/*
- * Do not allow to attach to a group in a different task
- * or CPU context. If we're moving SW events, we'll fix
- * this up later, so allow that.
- *
- * Racy, not holding group_leader->ctx->mutex, see comment with
- * perf_event_ctx_lock().
+ * Make sure we're both on the same context; either task or cpu.
*/
- if (!move_group && group_leader->ctx != ctx)
- goto err_context;
+ if (group_leader->ctx != ctx)
+ goto err_locked;
/*
* Only a group leader can be exclusive or pinned
*/
if (attr.exclusive || attr.pinned)
- goto err_context;
+ goto err_locked;
+
+ if (is_software_event(event) &&
+ !in_software_context(group_leader)) {
+ /*
+ * If the event is a sw event, but the group_leader
+ * is on hw context.
+ *
+ * Allow the addition of software events to hw
+ * groups, this is safe because software events
+ * never fail to schedule.
+ */
+ pmu = group_leader->pmu_ctx->pmu;
+ } else if (!is_software_event(event) &&
+ is_software_event(group_leader) &&
+ (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
+ /*
+ * In case the group is a pure software group, and we
+ * try to add a hardware event, move the whole group to
+ * the hardware context.
+ */
+ move_group = 1;
+ }
}
+ /*
+ * Now that we're certain of the pmu; find the pmu_ctx.
+ */
+ pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+ if (IS_ERR(pmu_ctx)) {
+ err = PTR_ERR(pmu_ctx);
+ goto err_locked;
+ }
+ event->pmu_ctx = pmu_ctx;
+
if (output_event) {
err = perf_event_set_output(event, output_event);
if (err)
goto err_context;
}
- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
- f_flags);
+ event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
if (IS_ERR(event_file)) {
err = PTR_ERR(event_file);
event_file = NULL;
@@ -12260,59 +12383,6 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}
- if (move_group) {
- gctx = __perf_event_ctx_lock_double(group_leader, ctx);
-
- if (gctx->task == TASK_TOMBSTONE) {
- err = -ESRCH;
- goto err_locked;
- }
-
- /*
- * Check if we raced against another sys_perf_event_open() call
- * moving the software group underneath us.
- */
- if (!(group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
- /*
- * If someone moved the group out from under us, check
- * if this new event wound up on the same ctx, if so
- * its the regular !move_group case, otherwise fail.
- */
- if (gctx != ctx) {
- err = -EINVAL;
- goto err_locked;
- } else {
- perf_event_ctx_unlock(group_leader, gctx);
- move_group = 0;
- goto not_move_group;
- }
- }
-
- /*
- * Failure to create exclusive events returns -EBUSY.
- */
- err = -EBUSY;
- if (!exclusive_event_installable(group_leader, ctx))
- goto err_locked;
-
- for_each_sibling_event(sibling, group_leader) {
- if (!exclusive_event_installable(sibling, ctx))
- goto err_locked;
- }
- } else {
- mutex_lock(&ctx->mutex);
-
- /*
- * Now that we hold ctx->lock, (re)validate group_leader->ctx == ctx,
- * see the group_leader && !move_group test earlier.
- */
- if (group_leader && group_leader->ctx != ctx) {
- err = -EINVAL;
- goto err_locked;
- }
- }
-not_move_group:
-
if (ctx->task == TASK_TOMBSTONE) {
err = -ESRCH;
goto err_locked;
@@ -12350,7 +12420,7 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (!exclusive_event_installable(event, ctx)) {
err = -EBUSY;
- goto err_locked;
+ goto err_cred;
}
WARN_ON_ONCE(ctx->parent_ctx);
@@ -12361,25 +12431,15 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (move_group) {
- /*
- * See perf_event_ctx_lock() for comments on the details
- * of swizzling perf_event::ctx.
- */
perf_remove_from_context(group_leader, 0);
- put_ctx(gctx);
+ put_pmu_ctx(group_leader->pmu_ctx);
for_each_sibling_event(sibling, group_leader) {
perf_remove_from_context(sibling, 0);
- put_ctx(gctx);
+ put_pmu_ctx(sibling->pmu_ctx);
}
/*
- * Wait for everybody to stop referencing the events through
- * the old lists, before installing it on new lists.
- */
- synchronize_rcu();
-
- /*
* Install the group siblings before the group leader.
*
* Because a group leader will try and install the entire group
@@ -12390,9 +12450,10 @@ SYSCALL_DEFINE5(perf_event_open,
* reachable through the group lists.
*/
for_each_sibling_event(sibling, group_leader) {
+ sibling->pmu_ctx = pmu_ctx;
+ get_pmu_ctx(pmu_ctx);
perf_event__state_init(sibling);
perf_install_in_context(ctx, sibling, sibling->cpu);
- get_ctx(ctx);
}
/*
@@ -12400,9 +12461,10 @@ SYSCALL_DEFINE5(perf_event_open,
* event. What we want here is event in the initial
* startup state, ready to be add into new context.
*/
+ group_leader->pmu_ctx = pmu_ctx;
+ get_pmu_ctx(pmu_ctx);
perf_event__state_init(group_leader);
perf_install_in_context(ctx, group_leader, group_leader->cpu);
- get_ctx(ctx);
}
/*
@@ -12419,8 +12481,6 @@ SYSCALL_DEFINE5(perf_event_open,
perf_install_in_context(ctx, event, event->cpu);
perf_unpin_context(ctx);
- if (move_group)
- perf_event_ctx_unlock(group_leader, gctx);
mutex_unlock(&ctx->mutex);
if (task) {
@@ -12442,16 +12502,15 @@ SYSCALL_DEFINE5(perf_event_open,
fd_install(event_fd, event_file);
return event_fd;
-err_locked:
- if (move_group)
- perf_event_ctx_unlock(group_leader, gctx);
- mutex_unlock(&ctx->mutex);
err_cred:
if (task)
up_read(&task->signal->exec_update_lock);
err_file:
fput(event_file);
err_context:
+ /* event->pmu_ctx freed by free_event() */
+err_locked:
+ mutex_unlock(&ctx->mutex);
perf_unpin_context(ctx);
put_ctx(ctx);
err_alloc:
@@ -12486,8 +12545,10 @@ perf_event_create_kernel_counter(struct
perf_overflow_handler_t overflow_handler,
void *context)
{
+ struct perf_event_pmu_context *pmu_ctx;
struct perf_event_context *ctx;
struct perf_event *event;
+ struct pmu *pmu;
int err;
/*
@@ -12506,16 +12567,32 @@ perf_event_create_kernel_counter(struct
/* Mark owner so we could distinguish it from user events. */
event->owner = TASK_TOMBSTONE;
+ pmu = event->pmu;
+
+ if (pmu->task_ctx_nr < 0 && task) {
+ err = -EINVAL;
+ goto err_alloc;
+ }
+
+ if (pmu->task_ctx_nr == perf_sw_context)
+ event->event_caps |= PERF_EV_CAP_SOFTWARE;
/*
* Get the target context (task or percpu):
*/
- ctx = find_get_context(event->pmu, task, event);
+ ctx = find_get_context(task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
- goto err_free;
+ goto err_alloc;
}
+ pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+ if (IS_ERR(pmu_ctx)) {
+ err = PTR_ERR(pmu_ctx);
+ goto err_ctx;
+ }
+ event->pmu_ctx = pmu_ctx;
+
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);
if (ctx->task == TASK_TOMBSTONE) {
@@ -12551,9 +12628,10 @@ perf_event_create_kernel_counter(struct
err_unlock:
mutex_unlock(&ctx->mutex);
+err_ctx:
perf_unpin_context(ctx);
put_ctx(ctx);
-err_free:
+err_alloc:
free_event(event);
err:
return ERR_PTR(err);
@@ -12562,6 +12640,7 @@ EXPORT_SYMBOL_GPL(perf_event_create_kern
void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
{
+#if 0 // XXX buggered - cpu hotplug, who cares
struct perf_event_context *src_ctx;
struct perf_event_context *dst_ctx;
struct perf_event *event, *tmp;
@@ -12622,6 +12701,7 @@ void perf_pmu_migrate_context(struct pmu
}
mutex_unlock(&dst_ctx->mutex);
mutex_unlock(&src_ctx->mutex);
+#endif
}
EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
@@ -12699,14 +12779,14 @@ perf_event_exit_event(struct perf_event
perf_event_wakeup(event);
}
-static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
+static void perf_event_exit_task_context(struct task_struct *child)
{
struct perf_event_context *child_ctx, *clone_ctx = NULL;
struct perf_event *child_event, *next;
WARN_ON_ONCE(child != current);
- child_ctx = perf_pin_task_context(child, ctxn);
+ child_ctx = perf_pin_task_context(child);
if (!child_ctx)
return;
@@ -12728,13 +12808,13 @@ static void perf_event_exit_task_context
* in.
*/
raw_spin_lock_irq(&child_ctx->lock);
- task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+ task_ctx_sched_out(child_ctx, EVENT_ALL);
/*
* Now that the context is inactive, destroy the task <-> ctx relation
* and mark the context dead.
*/
- RCU_INIT_POINTER(child->perf_event_ctxp[ctxn], NULL);
+ RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
put_ctx(child_ctx); /* cannot be last */
WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
put_task_struct(current); /* cannot be last */
@@ -12769,7 +12849,6 @@ static void perf_event_exit_task_context
void perf_event_exit_task(struct task_struct *child)
{
struct perf_event *event, *tmp;
- int ctxn;
mutex_lock(&child->perf_event_mutex);
list_for_each_entry_safe(event, tmp, &child->perf_event_list,
@@ -12785,8 +12864,7 @@ void perf_event_exit_task(struct task_st
}
mutex_unlock(&child->perf_event_mutex);
- for_each_task_context_nr(ctxn)
- perf_event_exit_task_context(child, ctxn);
+ perf_event_exit_task_context(child);
/*
* The perf_event_exit_task_context calls perf_event_task
@@ -12829,56 +12907,51 @@ void perf_event_free_task(struct task_st
{
struct perf_event_context *ctx;
struct perf_event *event, *tmp;
- int ctxn;
- for_each_task_context_nr(ctxn) {
- ctx = task->perf_event_ctxp[ctxn];
- if (!ctx)
- continue;
+ ctx = rcu_access_pointer(task->perf_event_ctxp);
+ if (!ctx)
+ return;
- mutex_lock(&ctx->mutex);
- raw_spin_lock_irq(&ctx->lock);
- /*
- * Destroy the task <-> ctx relation and mark the context dead.
- *
- * This is important because even though the task hasn't been
- * exposed yet the context has been (through child_list).
- */
- RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], NULL);
- WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(task); /* cannot be last */
- raw_spin_unlock_irq(&ctx->lock);
+ mutex_lock(&ctx->mutex);
+ raw_spin_lock_irq(&ctx->lock);
+ /*
+ * Destroy the task <-> ctx relation and mark the context dead.
+ *
+ * This is important because even though the task hasn't been
+ * exposed yet the context has been (through child_list).
+ */
+ RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+ WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
+ put_task_struct(task); /* cannot be last */
+ raw_spin_unlock_irq(&ctx->lock);
- list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
- perf_free_event(event, ctx);
- mutex_unlock(&ctx->mutex);
+ list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
+ perf_free_event(event, ctx);
- /*
- * perf_event_release_kernel() could've stolen some of our
- * child events and still have them on its free_list. In that
- * case we must wait for these events to have been freed (in
- * particular all their references to this task must've been
- * dropped).
- *
- * Without this copy_process() will unconditionally free this
- * task (irrespective of its reference count) and
- * _free_event()'s put_task_struct(event->hw.target) will be a
- * use-after-free.
- *
- * Wait for all events to drop their context reference.
- */
- wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
- put_ctx(ctx); /* must be last */
- }
+ mutex_unlock(&ctx->mutex);
+
+ /*
+ * perf_event_release_kernel() could've stolen some of our
+ * child events and still have them on its free_list. In that
+ * case we must wait for these events to have been freed (in
+ * particular all their references to this task must've been
+ * dropped).
+ *
+ * Without this copy_process() will unconditionally free this
+ * task (irrespective of its reference count) and
+ * _free_event()'s put_task_struct(event->hw.target) will be a
+ * use-after-free.
+ *
+ * Wait for all events to drop their context reference.
+ */
+ wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
+ put_ctx(ctx); /* must be last */
}
void perf_event_delayed_put(struct task_struct *task)
{
- int ctxn;
-
- for_each_task_context_nr(ctxn)
- WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
+ WARN_ON_ONCE(task->perf_event_ctxp);
}
struct file *perf_event_get(unsigned int fd)
@@ -12928,6 +13001,7 @@ inherit_event(struct perf_event *parent_
struct perf_event_context *child_ctx)
{
enum perf_event_state parent_state = parent_event->state;
+ struct perf_event_pmu_context *pmu_ctx;
struct perf_event *child_event;
unsigned long flags;
@@ -12948,17 +13022,12 @@ inherit_event(struct perf_event *parent_
if (IS_ERR(child_event))
return child_event;
-
- if ((child_event->attach_state & PERF_ATTACH_TASK_DATA) &&
- !child_ctx->task_ctx_data) {
- struct pmu *pmu = child_event->pmu;
-
- child_ctx->task_ctx_data = alloc_task_ctx_data(pmu);
- if (!child_ctx->task_ctx_data) {
- free_event(child_event);
- return ERR_PTR(-ENOMEM);
- }
+ pmu_ctx = find_get_pmu_context(child_event->pmu, child_ctx, child_event);
+ if (!pmu_ctx) {
+ free_event(child_event);
+ return NULL;
}
+ child_event->pmu_ctx = pmu_ctx;
/*
* is_orphaned_event() and list_add_tail(&parent_event->child_list)
@@ -13081,11 +13150,11 @@ static int inherit_group(struct perf_eve
static int
inherit_task_group(struct perf_event *event, struct task_struct *parent,
struct perf_event_context *parent_ctx,
- struct task_struct *child, int ctxn,
+ struct task_struct *child,
u64 clone_flags, int *inherited_all)
{
- int ret;
struct perf_event_context *child_ctx;
+ int ret;
if (!event->attr.inherit ||
(event->attr.inherit_thread && !(clone_flags & CLONE_THREAD)) ||
@@ -13095,7 +13164,7 @@ inherit_task_group(struct perf_event *ev
return 0;
}
- child_ctx = child->perf_event_ctxp[ctxn];
+ child_ctx = child->perf_event_ctxp;
if (!child_ctx) {
/*
* This is executed from the parent task context, so
@@ -13103,16 +13172,14 @@ inherit_task_group(struct perf_event *ev
* First allocate and initialize a context for the
* child.
*/
- child_ctx = alloc_perf_context(parent_ctx->pmu, child);
+ child_ctx = alloc_perf_context(child);
if (!child_ctx)
return -ENOMEM;
- child->perf_event_ctxp[ctxn] = child_ctx;
+ child->perf_event_ctxp = child_ctx;
}
- ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
-
+ ret = inherit_group(event, parent, parent_ctx, child, child_ctx);
if (ret)
*inherited_all = 0;
@@ -13122,8 +13189,7 @@ inherit_task_group(struct perf_event *ev
/*
* Initialize the perf_event context in task_struct
*/
-static int perf_event_init_context(struct task_struct *child, int ctxn,
- u64 clone_flags)
+static int perf_event_init_context(struct task_struct *child, u64 clone_flags)
{
struct perf_event_context *child_ctx, *parent_ctx;
struct perf_event_context *cloned_ctx;
@@ -13133,14 +13199,14 @@ static int perf_event_init_context(struc
unsigned long flags;
int ret = 0;
- if (likely(!parent->perf_event_ctxp[ctxn]))
+ if (likely(!parent->perf_event_ctxp))
return 0;
/*
* If the parent's context is a clone, pin it so it won't get
* swapped under us.
*/
- parent_ctx = perf_pin_task_context(parent, ctxn);
+ parent_ctx = perf_pin_task_context(parent);
if (!parent_ctx)
return 0;
@@ -13163,8 +13229,7 @@ static int perf_event_init_context(struc
*/
perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, clone_flags,
- &inherited_all);
+ child, clone_flags, &inherited_all);
if (ret)
goto out_unlock;
}
@@ -13180,8 +13245,7 @@ static int perf_event_init_context(struc
perf_event_groups_for_each(event, &parent_ctx->flexible_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, clone_flags,
- &inherited_all);
+ child, clone_flags, &inherited_all);
if (ret)
goto out_unlock;
}
@@ -13189,7 +13253,7 @@ static int perf_event_init_context(struc
raw_spin_lock_irqsave(&parent_ctx->lock, flags);
parent_ctx->rotate_disable = 0;
- child_ctx = child->perf_event_ctxp[ctxn];
+ child_ctx = child->perf_event_ctxp;
if (child_ctx && inherited_all) {
/*
@@ -13225,18 +13289,16 @@ static int perf_event_init_context(struc
*/
int perf_event_init_task(struct task_struct *child, u64 clone_flags)
{
- int ctxn, ret;
+ int ret;
- memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+ child->perf_event_ctxp = NULL;
mutex_init(&child->perf_event_mutex);
INIT_LIST_HEAD(&child->perf_event_list);
- for_each_task_context_nr(ctxn) {
- ret = perf_event_init_context(child, ctxn, clone_flags);
- if (ret) {
- perf_event_free_task(child);
- return ret;
- }
+ ret = perf_event_init_context(child, clone_flags);
+ if (ret) {
+ perf_event_free_task(child);
+ return ret;
}
return 0;
@@ -13245,6 +13307,7 @@ int perf_event_init_task(struct task_str
static void __init perf_event_init_all_cpus(void)
{
struct swevent_htable *swhash;
+ struct perf_cpu_context *cpuctx;
int cpu;
zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL);
@@ -13252,7 +13315,6 @@ static void __init perf_event_init_all_c
for_each_possible_cpu(cpu) {
swhash = &per_cpu(swevent_htable, cpu);
mutex_init(&swhash->hlist_mutex);
- INIT_LIST_HEAD(&per_cpu(active_ctx_list, cpu));
INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
@@ -13261,6 +13323,14 @@ static void __init perf_event_init_all_c
INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
#endif
INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
+
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
+ __perf_event_init_context(&cpuctx->ctx);
+ lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
+ lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
+ cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
+ cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
+ cpuctx->heap = cpuctx->heap_default;
}
}
@@ -13282,12 +13352,12 @@ static void perf_swevent_init_cpu(unsign
#if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC_CORE
static void __perf_event_exit_context(void *__info)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_event_context *ctx = __info;
- struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct perf_event *event;
raw_spin_lock(&ctx->lock);
- ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+ ctx_sched_out(ctx, EVENT_TIME);
list_for_each_entry(event, &ctx->event_list, event_entry)
__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
raw_spin_unlock(&ctx->lock);
@@ -13297,18 +13367,16 @@ static void perf_event_exit_cpu_context(
{
struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;
- struct pmu *pmu;
+ // XXX simplify cpuctx->online
mutex_lock(&pmus_lock);
- list_for_each_entry(pmu, &pmus, entry) {
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- ctx = &cpuctx->ctx;
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
+ ctx = &cpuctx->ctx;
- mutex_lock(&ctx->mutex);
- smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
- cpuctx->online = 0;
- mutex_unlock(&ctx->mutex);
- }
+ mutex_lock(&ctx->mutex);
+ smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
+ cpuctx->online = 0;
+ mutex_unlock(&ctx->mutex);
cpumask_clear_cpu(cpu, perf_online_mask);
mutex_unlock(&pmus_lock);
}
@@ -13322,20 +13390,17 @@ int perf_event_init_cpu(unsigned int cpu
{
struct perf_cpu_context *cpuctx;
struct perf_event_context *ctx;
- struct pmu *pmu;
perf_swevent_init_cpu(cpu);
mutex_lock(&pmus_lock);
cpumask_set_cpu(cpu, perf_online_mask);
- list_for_each_entry(pmu, &pmus, entry) {
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- ctx = &cpuctx->ctx;
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
+ ctx = &cpuctx->ctx;
- mutex_lock(&ctx->mutex);
- cpuctx->online = 1;
- mutex_unlock(&ctx->mutex);
- }
+ mutex_lock(&ctx->mutex);
+ cpuctx->online = 1;
+ mutex_unlock(&ctx->mutex);
mutex_unlock(&pmus_lock);
return 0;
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
Another one of those lockdep splats:
> @@ -12147,42 +12256,37 @@ SYSCALL_DEFINE5(perf_event_open,
> if (pmu->task_ctx_nr == perf_sw_context)
> event->event_caps |= PERF_EV_CAP_SOFTWARE;
>
> - if (group_leader) {
> - if (is_software_event(event) &&
> - !in_software_context(group_leader)) {
> - /*
> - * If the event is a sw event, but the group_leader
> - * is on hw context.
> - *
> - * Allow the addition of software events to hw
> - * groups, this is safe because software events
> - * never fail to schedule.
> - */
> - pmu = group_leader->ctx->pmu;
> - } else if (!is_software_event(event) &&
> - is_software_event(group_leader) &&
> - (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
> - /*
> - * In case the group is a pure software group, and we
> - * try to add a hardware event, move the whole group to
> - * the hardware context.
> - */
> - move_group = 1;
> - }
> - }
> -
> /*
> * Get the target context (task or percpu):
> */
> - ctx = find_get_context(pmu, task, event);
> + ctx = find_get_context(task, event);
> if (IS_ERR(ctx)) {
> err = PTR_ERR(ctx);
> goto err_alloc;
> }
>
> - /*
> - * Look up the group leader (we will attach this event to it):
> - */
> + mutex_lock(&ctx->mutex);
> +
> + if (ctx->task == TASK_TOMBSTONE) {
> + err = -ESRCH;
> + goto err_locked;
> + }
> +
> + if (!task) {
> + /*
> + * Check if the @cpu we're creating an event for is online.
> + *
> + * We use the perf_cpu_context::ctx::mutex to serialize against
> + * the hotplug notifiers. See perf_event_{init,exit}_cpu().
> + */
> + struct perf_cpu_context *cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
> +
> + if (!cpuctx->online) {
> + err = -ENODEV;
> + goto err_locked;
> + }
> + }
> +
> if (group_leader) {
> err = -EINVAL;
>
Pulling up the ctx->mutex makes things simpler, but it also violates the
locking order vs exec_update_lock.
Pull that lock up as well...
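
For illustration only, here is a minimal userspace sketch of the ordering
rule (outer lock first, inner lock second, unwind in reverse through goto
labels). Everything in it is hypothetical: the rank values, lock_acquire(),
lock_release() and open_event_sketch() are stand-ins, not kernel API, and
the rank check is a toy version of what lockdep reports.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical ranks: a lock may only be taken while holding lower ranks. */
enum lock_rank { RANK_EXEC_UPDATE = 1, RANK_CTX_MUTEX = 2 };

static unsigned int held_mask;	/* one bit per held rank */
static int order_violations;	/* out-of-order acquisitions seen so far */

static void lock_acquire(enum lock_rank rank)
{
	/* A held lock of equal or higher rank means we are out of order. */
	if (held_mask >> rank)
		order_violations++;
	held_mask |= 1u << rank;
}

static void lock_release(enum lock_rank rank)
{
	assert(held_mask & (1u << rank));	/* must be held */
	held_mask &= ~(1u << rank);
}

/*
 * fail_step: 0 = succeed, 1 = fail with only the outer lock held,
 * 2 = fail with both locks held. The labels mirror err_locked/err_cred.
 */
static int open_event_sketch(int has_task, int fail_step)
{
	int err = 0;

	if (has_task) {
		lock_acquire(RANK_EXEC_UPDATE);
		if (fail_step == 1) {
			err = -EACCES;
			goto err_cred;
		}
	}

	lock_acquire(RANK_CTX_MUTEX);
	if (fail_step == 2) {
		err = -EINVAL;
		goto err_locked;
	}
	/* ... the event would be installed here ... */

err_locked:
	lock_release(RANK_CTX_MUTEX);
err_cred:
	if (has_task)
		lock_release(RANK_EXEC_UPDATE);
	return err;
}
```

Because the labels are laid out in reverse acquisition order, each failure
point jumps to the label that releases exactly the locks held at that point.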
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12254,13 +12254,29 @@ SYSCALL_DEFINE5(perf_event_open,
if (pmu->task_ctx_nr == perf_sw_context)
event->event_caps |= PERF_EV_CAP_SOFTWARE;
+ if (task) {
+ err = down_read_interruptible(&task->signal->exec_update_lock);
+ if (err)
+ goto err_alloc;
+
+ /*
+ * We must hold exec_update_lock across this and any potential
+ * perf_install_in_context() call for this new event to
+ * serialize against exec() altering our credentials (and the
+ * perf_event_exit_task() that could imply).
+ */
+ err = -EACCES;
+ if (!perf_check_permission(&attr, task))
+ goto err_cred;
+ }
+
/*
* Get the target context (task or percpu):
*/
ctx = find_get_context(task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
- goto err_alloc;
+ goto err_cred;
}
mutex_lock(&ctx->mutex);
@@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}
- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
- if (IS_ERR(event_file)) {
- err = PTR_ERR(event_file);
- event_file = NULL;
- goto err_context;
- }
-
- if (task) {
- err = down_read_interruptible(&task->signal->exec_update_lock);
- if (err)
- goto err_file;
-
- /*
- * We must hold exec_update_lock across this and any potential
- * perf_install_in_context() call for this new event to
- * serialize against exec() altering our credentials (and the
- * perf_event_exit_task() that could imply).
- */
- err = -EACCES;
- if (!perf_check_permission(&attr, task))
- goto err_cred;
- }
-
- if (ctx->task == TASK_TOMBSTONE) {
- err = -ESRCH;
- goto err_locked;
- }
-
if (!perf_event_validate_size(event)) {
err = -E2BIG;
- goto err_locked;
- }
-
- if (!task) {
- /*
- * Check if the @cpu we're creating an event for is online.
- *
- * We use the perf_cpu_context::ctx::mutex to serialize against
- * the hotplug notifiers. See perf_event_{init,exit}_cpu().
- */
- struct perf_cpu_context *cpuctx =
- container_of(ctx, struct perf_cpu_context, ctx);
-
- if (!cpuctx->online) {
- err = -ENODEV;
- goto err_locked;
- }
+ goto err_context;
}
if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader)) {
err = -EINVAL;
- goto err_locked;
+ goto err_context;
}
/*
@@ -12418,11 +12390,18 @@ SYSCALL_DEFINE5(perf_event_open,
*/
if (!exclusive_event_installable(event, ctx)) {
err = -EBUSY;
- goto err_cred;
+ goto err_context;
}
WARN_ON_ONCE(ctx->parent_ctx);
+ event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
+ if (IS_ERR(event_file)) {
+ err = PTR_ERR(event_file);
+ event_file = NULL;
+ goto err_context;
+ }
+
/*
* This is the point on no return; we cannot fail hereafter. This is
* where we start modifying current state.
@@ -12500,17 +12479,15 @@ SYSCALL_DEFINE5(perf_event_open,
fd_install(event_fd, event_file);
return event_fd;
-err_cred:
- if (task)
- up_read(&task->signal->exec_update_lock);
-err_file:
- fput(event_file);
err_context:
/* event->pmu_ctx freed by free_event() */
err_locked:
mutex_unlock(&ctx->mutex);
perf_unpin_context(ctx);
put_ctx(ctx);
+err_cred:
+ if (task)
+ up_read(&task->signal->exec_update_lock);
err_alloc:
/*
* If event_file is set, the fput() above will have called ->release()
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>
>
> Right, so sorry for being incredibly tardy on this. Find below the
> patch fwd ported to something recent.
>
> I'll reply to this with fixes and comments.
You write:
>> A simple perf stat/record/top survives with the patch but machine
>> crashes with first run of perf test (stale cpc->task_epc causing the
>> crash). Lockdep is also screaming a lot :)
> @@ -7669,20 +7877,15 @@ static void perf_event_addr_filters_exec
> void perf_event_exec(void)
> {
> struct perf_event_context *ctx;
> - int ctxn;
> -
> - for_each_task_context_nr(ctxn) {
> - perf_event_enable_on_exec(ctxn);
> - perf_event_remove_on_exec(ctxn);
>
> - rcu_read_lock();
> - ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> - if (ctx) {
> - perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
> - NULL, true);
> - }
> - rcu_read_unlock();
> + rcu_read_lock();
> + ctx = rcu_dereference(current->perf_event_ctxp);
> + if (ctx) {
> + perf_event_enable_on_exec(ctx);
> + perf_event_remove_on_exec(ctx);
> + perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
> }
> + rcu_read_unlock();
> }
>
> struct remote_output {
The above goes *bang* because perf_event_remove_on_exec() will take a
mutex, which isn't allowed under rcu_read_lock().
The below cures.
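
As a rough single-threaded model of the idea (not the kernel code): look the
context up inside a short read-side section, take a reference and a pin
there, leave the section, and only then do the work that may sleep. All
names below (ctx_sketch, pin_task_ctx(), mutex_lock_sketch(), ...) are made
up for the sketch, and plain ints stand in for real refcounts and RCU.

```c
#include <stddef.h>

struct ctx_sketch {
	int refcount;	/* object lifetime */
	int pin_count;	/* > 0 means "do not swap this context out" */
};

static struct ctx_sketch *task_ctx;	/* stand-in for current->perf_event_ctxp */
static int rcu_depth;			/* simulated rcu_read_lock() nesting */
static int sleep_in_rcu;		/* times we "slept" inside the read side */

/* Stand-in for a sleeping lock: records the bug if called under "RCU". */
static void mutex_lock_sketch(void)
{
	if (rcu_depth)
		sleep_in_rcu++;
}

/* Short lookup under the read side; returns a pinned, referenced ctx. */
static struct ctx_sketch *pin_task_ctx(void)
{
	struct ctx_sketch *ctx;

	rcu_depth++;
	ctx = task_ctx;
	if (ctx) {
		ctx->refcount++;
		ctx->pin_count++;
	}
	rcu_depth--;
	return ctx;
}

static void unpin_and_put(struct ctx_sketch *ctx)
{
	ctx->pin_count--;
	ctx->refcount--;
}

static void exec_sketch(void)
{
	struct ctx_sketch *ctx = pin_task_ctx();

	if (!ctx)
		return;

	mutex_lock_sketch();	/* fine: we already left the read side */
	unpin_and_put(ctx);
}
```

The pin plus reference is what keeps the context stable once the read-side
section has ended, which is why the sleeping work can move outside it.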
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4384,8 +4384,6 @@ static void perf_event_remove_on_exec(st
unsigned long flags;
bool modified = false;
- perf_pin_task_context(current);
-
mutex_lock(&ctx->mutex);
if (WARN_ON_ONCE(ctx->task != current))
@@ -4406,13 +4404,11 @@ static void perf_event_remove_on_exec(st
raw_spin_lock_irqsave(&ctx->lock, flags);
if (modified)
clone_ctx = unclone_ctx(ctx);
- --ctx->pin_count;
raw_spin_unlock_irqrestore(&ctx->lock, flags);
unlock:
mutex_unlock(&ctx->mutex);
- put_ctx(ctx);
if (clone_ctx)
put_ctx(clone_ctx);
}
@@ -7878,14 +7874,16 @@ void perf_event_exec(void)
{
struct perf_event_context *ctx;
- rcu_read_lock();
- ctx = rcu_dereference(current->perf_event_ctxp);
- if (ctx) {
- perf_event_enable_on_exec(ctx);
- perf_event_remove_on_exec(ctx);
- perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
- }
- rcu_read_unlock();
+ ctx = perf_pin_task_context(current);
+ if (!ctx)
+ return;
+
+ perf_event_enable_on_exec(ctx);
+ perf_event_remove_on_exec(ctx);
+ perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
+
+ perf_unpin_context(ctx);
+ put_ctx(ctx);
}
struct remote_output {
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> @@ -3652,17 +3697,28 @@ static noinline int visit_groups_merge(s
> .size = ARRAY_SIZE(itrs),
> };
> /* Events not within a CPU context may be on any CPU. */
> - __heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
> + __heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
> }
> evt = event_heap.data;
>
> - __heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
> + __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
>
> #ifdef CONFIG_CGROUP_PERF
> for (; css; css = css->parent)
> - __heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
> + __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
> #endif
>
> + if (event_heap.nr) {
> + /*
> + * XXX: For now, visit_groups_merge() gets called with pmu
> + * pointer never NULL. But these functions needs to be called
> + * once for each pmu if I implement pmu=NULL optimization.
> + */
> + __link_epc((*evt)->pmu_ctx);
> + perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
> + }
> +
> +
> min_heapify_all(&event_heap, &perf_min_heap);
>
> while (event_heap.nr) {
> @@ -3741,39 +3799,67 @@ static int merge_sched_in(struct perf_ev
> return 0;
> }
>
> -static void
> -ctx_pinned_sched_in(struct perf_event_context *ctx,
> - struct perf_cpu_context *cpuctx)
> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
> {
> + struct perf_event_pmu_context *pmu_ctx;
> int can_add_hw = 1;
>
> - if (ctx != &cpuctx->ctx)
> - cpuctx = NULL;
> -
> - visit_groups_merge(cpuctx, &ctx->pinned_groups,
> - smp_processor_id(),
> - merge_sched_in, &can_add_hw);
> + if (pmu) {
> + visit_groups_merge(ctx, &ctx->pinned_groups,
> + smp_processor_id(), pmu,
> + merge_sched_in, &can_add_hw);
> + } else {
> + /*
> + * XXX: This can be optimized for per-task context by calling
> + * visit_groups_merge() only once with:
> + * 1) pmu=NULL
> + * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
> + * 3) Making can_add_hw a per-pmu variable
> + *
> + * Though, it can not be optimized for per-cpu context because
> + * per-cpu rb-tree consist of pmu-subtrees and pmu-subtrees
> + * consist of cgroup-subtrees. i.e. cgroup events of same
> + * cgroup but different pmus are separated out into respective
> + * pmu-subtrees.
> + */
> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> + can_add_hw = 1;
> + visit_groups_merge(ctx, &ctx->pinned_groups,
> + smp_processor_id(), pmu_ctx->pmu,
> + merge_sched_in, &can_add_hw);
> + }
> + }
> }
I'm not sure I follow.. task context can have multiple PMUs just the
same as CPU context can, that's more or less the entire point of the
patch.
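
To make the point concrete, here is a small sketch of ordering by the
{cpu, pmu, cgroup_id, group_index} key, with a sorted array standing in for
the rbtree. The struct, the integer pmu_id and both helpers are assumptions
made for the example; the real code compares pmu pointers and walks an
rbtree.

```c
#include <stdlib.h>

struct group_key {
	int cpu;			/* -1 means "any CPU" */
	int pmu_id;			/* hypothetical small integer naming a PMU */
	unsigned long cgroup_id;
	unsigned long group_index;
};

/* Compare one field and return early if it decides the order. */
#define CMP_FIELD(l, r, f) \
	do { if ((l)->f != (r)->f) return (l)->f < (r)->f ? -1 : 1; } while (0)

/* Lexicographic order: cpu, then pmu, then cgroup, then group index. */
static int group_key_cmp(const void *a, const void *b)
{
	const struct group_key *l = a, *r = b;

	CMP_FIELD(l, r, cpu);
	CMP_FIELD(l, r, pmu_id);
	CMP_FIELD(l, r, cgroup_id);
	CMP_FIELD(l, r, group_index);
	return 0;
}

/*
 * Index of the first entry for (cpu, pmu_id) in a sorted array, or -1.
 * A linear scan keeps the sketch short; a tree would descend instead.
 */
static int groups_first(const struct group_key *keys, int n, int cpu, int pmu_id)
{
	int i;

	for (i = 0; i < n; i++)
		if (keys[i].cpu == cpu && keys[i].pmu_id == pmu_id)
			return i;
	return -1;
}
```

With this ordering, events of several PMUs sit side by side under both
cpu == -1 and a specific CPU, so a single context can carry more than one
PMU whether it is a task context or a CPU context.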
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
> goto err_task;
> }
>
> + // XXX premature; what if this is allowed, but we get moved to a PMU
> + // that doesn't have this.
> if (is_sampling_event(event)) {
> if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
> err = -EOPNOTSUPP;
No; this really should be against the event's native PMU. If the event
can't natively sample, it can't sample when placed in another group
either.
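
A tiny sketch of that rule, under the assumption that an event keeps a
pointer to its own PMU regardless of which group it ends up in. The types
and SKETCH_CAP_NO_INTERRUPT are placeholders for the example, not the
kernel's definitions.

```c
#include <errno.h>

#define SKETCH_CAP_NO_INTERRUPT 0x1u	/* hypothetical stand-in flag value */

struct pmu_sketch {
	unsigned int capabilities;
};

struct event_sketch {
	const struct pmu_sketch *pmu;	/* the event's own (native) PMU */
	unsigned long sample_period;	/* nonzero means "sampling event" */
};

/* Reject sampling if the event's native PMU cannot raise interrupts. */
static int check_sampling(const struct event_sketch *event)
{
	if (event->sample_period &&
	    (event->pmu->capabilities & SKETCH_CAP_NO_INTERRUPT))
		return -EOPNOTSUPP;
	return 0;
}
```

The check never looks at the group leader's PMU, so the answer is the same
before and after the event is attached to a group.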
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> @@ -3196,11 +3187,52 @@ static int perf_event_modify_attr(struct
> return err;
> }
>
> -static void ctx_sched_out(struct perf_event_context *ctx,
> - struct perf_cpu_context *cpuctx,
> - enum event_type_t event_type)
> +static void __pmu_ctx_sched_out(struct perf_event_pmu_context *pmu_ctx,
> + enum event_type_t event_type)
> {
> + struct perf_event_context *ctx = pmu_ctx->ctx;
> struct perf_event *event, *tmp;
> + struct pmu *pmu = pmu_ctx->pmu;
> +
> + if (ctx->task && !ctx->is_active) {
> + struct perf_cpu_pmu_context *cpc;
> +
> + cpc = this_cpu_ptr(pmu->cpu_pmu_context);
> + WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
> + cpc->task_epc = NULL;
> + }
> +
> + if (!event_type)
> + return;
> +
> + perf_pmu_disable(pmu);
> + if (event_type & EVENT_PINNED) {
> + list_for_each_entry_safe(event, tmp,
> + &pmu_ctx->pinned_active,
> + active_list)
> + group_sched_out(event, ctx);
> + }
> +
> + if (event_type & EVENT_FLEXIBLE) {
> + list_for_each_entry_safe(event, tmp,
> + &pmu_ctx->flexible_active,
> + active_list)
> + group_sched_out(event, ctx);
> + /*
> + * Since we cleared EVENT_FLEXIBLE, also clear
> + * rotate_necessary, is will be reset by
> + * ctx_flexible_sched_in() when needed.
> + */
> + pmu_ctx->rotate_necessary = 0;
> + }
> + perf_pmu_enable(pmu);
> +}
> +
> +static void
> +ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
> +{
> + struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
> + struct perf_event_pmu_context *pmu_ctx;
> int is_active = ctx->is_active;
>
> lockdep_assert_held(&ctx->lock);
> @@ -3251,24 +3283,8 @@ static void ctx_sched_out(struct perf_ev
> if (!ctx->nr_active || !(is_active & EVENT_ALL))
> return;
>
> - perf_pmu_disable(ctx->pmu);
> - if (is_active & EVENT_PINNED) {
> - list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
> - group_sched_out(event, cpuctx, ctx);
> - }
> -
> - if (is_active & EVENT_FLEXIBLE) {
> - list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
> - group_sched_out(event, cpuctx, ctx);
> -
> - /*
> - * Since we cleared EVENT_FLEXIBLE, also clear
> - * rotate_necessary, is will be reset by
> - * ctx_flexible_sched_in() when needed.
> - */
> - ctx->rotate_necessary = 0;
> - }
> - perf_pmu_enable(ctx->pmu);
> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
> + __pmu_ctx_sched_out(pmu_ctx, is_active);
> }
You mentioned trouble with cpc->task_epc; there's one rebase mistake
from you and an original bug from me.
You lost the last hunk, and I forgot to clear cpc->task_epc in
perf_remove_from_context().
With these fixes I can run: 'perf test' without things going
insta-splat.
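
The stale-pointer problem can be modelled in a few lines. This is only a
sketch with invented types (epc_sketch, cpc_sketch) and a warning counter in
place of WARN_ON_ONCE(); it shows the shape of the fix, which is to drop the
per-CPU cached pointer when the last event of a task's PMU context goes
away.

```c
#include <stddef.h>

struct epc_sketch {
	int nr_events;
	int rotate_necessary;
};

struct cpc_sketch {
	struct epc_sketch *task_epc;	/* cached per-CPU pointer */
};

static int warnings;	/* counts "cache points at someone else" cases */

static void remove_event(struct cpc_sketch *cpc, struct epc_sketch *epc,
			 int is_task_ctx)
{
	epc->nr_events--;
	if (epc->nr_events)
		return;

	/* Last event gone: nothing left to rotate. */
	epc->rotate_necessary = 0;

	if (is_task_ctx) {
		/* A non-NULL cache must point at us; anything else is a bug. */
		if (cpc->task_epc && cpc->task_epc != epc)
			warnings++;
		cpc->task_epc = NULL;	/* never leave a stale pointer behind */
	}
}
```

Without the final assignment the cache would keep pointing at a PMU context
that can be freed once its last event is gone.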
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2311,6 +2311,7 @@ __perf_remove_from_context(struct perf_e
struct perf_event_context *ctx,
void *info)
{
+ struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
unsigned long flags = (unsigned long)info;
if (ctx->is_active & EVENT_TIME) {
@@ -2325,8 +2326,17 @@ __perf_remove_from_context(struct perf_e
perf_child_detach(event);
list_del_event(event, ctx);
- if (!event->pmu_ctx->nr_events)
- event->pmu_ctx->rotate_necessary = 0;
+ if (!pmu_ctx->nr_events) {
+ pmu_ctx->rotate_necessary = 0;
+
+ if (ctx->task) {
+ struct perf_cpu_pmu_context *cpc;
+
+ cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+ WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
+ cpc->task_epc = NULL;
+ }
+ }
if (!ctx->nr_events && ctx->is_active) {
if (ctx == &cpuctx->ctx)
@@ -3198,7 +3208,7 @@ static void __pmu_ctx_sched_out(struct p
struct perf_cpu_pmu_context *cpc;
cpc = this_cpu_ptr(pmu->cpu_pmu_context);
- WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
+ WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
cpc->task_epc = NULL;
}
@@ -3280,9 +3290,6 @@ ctx_sched_out(struct perf_event_context
is_active ^= ctx->is_active; /* changed bits */
- if (!ctx->nr_active || !(is_active & EVENT_ALL))
- return;
-
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
__pmu_ctx_sched_out(pmu_ctx, is_active);
}
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> +/* XXX: No need of list now. Convert it to per-cpu variable */
> static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
Something like so I suppose...
---
include/linux/perf_event.h | 1
kernel/events/core.c | 70 ++++++++++++++-------------------------------
2 files changed, 22 insertions(+), 49 deletions(-)
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -936,7 +936,6 @@ struct perf_cpu_context {
#ifdef CONFIG_CGROUP_PERF
struct perf_cgroup *cgrp;
- struct list_head cgrp_cpuctx_entry;
#endif
/*
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -829,55 +829,41 @@ perf_cgroup_set_timestamp(struct perf_cp
}
}
-/* XXX: No need of list now. Convert it to per-cpu variable */
-static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
-
/*
* reschedule events based on the cgroup constraint of task.
*/
static void perf_cgroup_switch(struct task_struct *task)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
struct perf_cgroup *cgrp;
- struct perf_cpu_context *cpuctx, *tmp;
struct list_head *list;
unsigned long flags;
- /*
- * Disable interrupts and preemption to avoid this CPU's
- * cgrp_cpuctx_entry to change under us.
- */
- local_irq_save(flags);
-
cgrp = perf_cgroup_from_task(task, NULL);
- list = this_cpu_ptr(&cgrp_cpuctx_list);
- list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
- WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
- if (READ_ONCE(cpuctx->cgrp) == cgrp)
- continue;
-
- perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- perf_ctx_disable(&cpuctx->ctx);
+ WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
+ if (READ_ONCE(cpuctx->cgrp) == cgrp)
+ return;
- ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
- /*
- * must not be done before ctxswout due
- * to update_cgrp_time_from_cpuctx() in
- * ctx_sched_out()
- */
- cpuctx->cgrp = cgrp;
- /*
- * set cgrp before ctxsw in to allow
- * perf_cgroup_set_timestamp() in ctx_sched_in()
- * to not have to pass task around
- */
- ctx_sched_in(&cpuctx->ctx, EVENT_ALL);
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+ perf_ctx_disable(&cpuctx->ctx);
- perf_ctx_enable(&cpuctx->ctx);
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- }
+ ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
+ /*
+ * must not be done before ctxswout due
+ * to update_cgrp_time_from_cpuctx() in
+ * ctx_sched_out()
+ */
+ cpuctx->cgrp = cgrp;
+ /*
+ * set cgrp before ctxsw in to allow
+ * perf_cgroup_set_timestamp() in ctx_sched_in()
+ * to not have to pass task around
+ */
+ ctx_sched_in(&cpuctx->ctx, EVENT_ALL);
- local_irq_restore(flags);
+ perf_ctx_enable(&cpuctx->ctx);
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
static int perf_cgroup_ensure_storage(struct perf_event *event,
@@ -979,8 +965,6 @@ perf_cgroup_event_enable(struct perf_eve
return;
cpuctx->cgrp = perf_cgroup_from_task(current, ctx);
- list_add(&cpuctx->cgrp_cpuctx_entry,
- per_cpu_ptr(&cgrp_cpuctx_list, event->cpu));
}
static inline void
@@ -1001,7 +985,6 @@ perf_cgroup_event_disable(struct perf_ev
return;
cpuctx->cgrp = NULL;
- list_del(&cpuctx->cgrp_cpuctx_entry);
}
#else /* !CONFIG_CGROUP_PERF */
@@ -2372,11 +2355,7 @@ static void perf_remove_from_context(str
* event_function_call() user.
*/
raw_spin_lock_irq(&ctx->lock);
- /*
- * Cgroup events are per-cpu events, and must IPI because of
- * cgrp_cpuctx_list.
- */
- if (!ctx->is_active && !is_cgroup_event(event)) {
+ if (!ctx->is_active) {
__perf_remove_from_context(event, this_cpu_ptr(&cpu_context),
ctx, (void *)flags);
raw_spin_unlock_irq(&ctx->lock);
@@ -2807,8 +2786,6 @@ perf_install_in_context(struct perf_even
* perf_event_attr::disabled events will not run and can be initialized
* without IPI. Except when this is the first event for the context, in
* that case we need the magic of the IPI to set ctx->is_active.
- * Similarly, cgroup events for the context also needs the IPI to
- * manipulate the cgrp_cpuctx_list.
*
* The IOC_ENABLE that is sure to follow the creation of a disabled
* event will issue the IPI and reprogram the hardware.
@@ -13301,9 +13278,6 @@ static void __init perf_event_init_all_c
INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
-#ifdef CONFIG_CGROUP_PERF
- INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
-#endif
INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
cpuctx = per_cpu_ptr(&cpu_context, cpu);
On 13-Jun-22 8:05 PM, Peter Zijlstra wrote:
>
>
> Right, so sorry for being incredibly tardy on this. Find below the
> patch fwd ported to something recent.
>
> I'll reply to this with fixes and comments.
Thanks! I've resumed work on this, but I've lost all the context, so
it might take a while for me to reply to your comments. Please bear
with me if I'm a bit slow.
Thanks,
Ravi
On 13-Jun-22 8:25 PM, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
>> goto err_task;
>> }
>>
>> + // XXX premature; what if this is allowed, but we get moved to a PMU
>> + // that doesn't have this.
>> if (is_sampling_event(event)) {
>> if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
>> err = -EOPNOTSUPP;
>
> No; this really should be against the event's native PMU. If the event
> can't natively sample, it can't sample when placed in another group
> either.
Right. But IIUC, the question was whether there would be any issue if we
allow a perf_sw_context sampling event as group leader with a
perf_{hw|invalid}_context counting event as group member. I think not; it
should just work. And there could be real use cases for it, as you
described in one old thread[1].
TL;DR
Although I can't find any such PMU combination on AMD (not considering real
sw PMUs), I tried the opposite scenario:
Group leader: msr/tsc/ as counting event (perf_sw_context)
Group member: ibs_op/cnt_ctl=1/ as sampling event (perf_invalid_context)
And a simple test program seems to work fine:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <linux/perf_event.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#define PAGE_SIZE sysconf(_SC_PAGESIZE)
#define PERF_MMAP_DATA_PAGES 256
#define PERF_MMAP_DATA_SIZE (PERF_MMAP_DATA_PAGES * PAGE_SIZE)
#define PERF_MMAP_DATA_MASK (PERF_MMAP_DATA_SIZE - 1)
#define PERF_MMAP_TOTAL_PAGES (PERF_MMAP_DATA_PAGES + 1)
#define PERF_MMAP_TOTAL_SIZE (PERF_MMAP_TOTAL_PAGES * PAGE_SIZE)
#define rmb() asm volatile("lfence":::"memory")
struct perf_event {
int fd;
void *p;
};
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int fd = syscall(__NR_perf_event_open, attr, pid, cpu,
group_fd, flags);
if (fd < 0)
perror("perf_event_open() failed.");
return fd;
}
static void *perf_event_mmap(int fd)
{
void *p = mmap(NULL, PERF_MMAP_TOTAL_SIZE,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (p == MAP_FAILED)
perror("mmap() failed.");
return p;
}
static void
copy_event_data(void *src, unsigned long offset, void *dest, size_t size)
{
size_t chunk1_size, chunk2_size;
if ((offset + size) < PERF_MMAP_DATA_SIZE) {
memcpy(dest, src + offset, size);
} else {
chunk1_size = PERF_MMAP_DATA_SIZE - offset;
chunk2_size = size - chunk1_size;
memcpy(dest, src + offset, chunk1_size);
memcpy(dest + chunk1_size, src, chunk2_size);
}
}
static int mmap_read(struct perf_event_mmap_page *p, void *dest, size_t size)
{
void *base;
unsigned long data_tail, data_head;
/* Casting to (void *) is needed. */
base = (void *)p + PAGE_SIZE;
data_head = p->data_head;
rmb();
data_tail = p->data_tail;
if ((data_head - data_tail) < size)
return -1;
data_tail &= PERF_MMAP_DATA_MASK;
copy_event_data(base, data_tail, dest, size);
p->data_tail += size;
return 0;
}
static void mmap_skip(struct perf_event_mmap_page *p, size_t size)
{
unsigned long data_head = p->data_head;
rmb();
if ((p->data_tail + size) > data_head)
p->data_tail = data_head;
else
p->data_tail += size;
}
static void perf_read_event_details(struct perf_event_mmap_page *p)
{
struct perf_event_header hdr;
unsigned int pid, tid;
/*
* PERF_RECORD_SAMPLE:
* struct {
* struct perf_event_header hdr;
* u32 pid; // PERF_SAMPLE_TID
* u32 tid; // PERF_SAMPLE_TID
* };
*/
while(1) {
if (mmap_read(p, &hdr, sizeof(hdr)))
return;
if (hdr.type == PERF_RECORD_SAMPLE) {
if (mmap_read(p, &pid, sizeof(pid)))
perror("Error reading pid.");
if (mmap_read(p, &tid, sizeof(tid)))
perror("Error reading tid.");
printf("pid: %d, tid: %d\n", pid, tid);
} else {
mmap_skip(p, hdr.size - sizeof(hdr));
}
}
}
int main(int argc, char *argv[])
{
struct perf_event_attr attr;
struct perf_event events[2];
int i;
long long count1, count2;
memset(&attr, 0, sizeof(struct perf_event_attr));
attr.size = sizeof(struct perf_event_attr);
attr.type = 16; /* /sys/bus/event_source/devices/msr/type */
attr.config = 0x0; /* /sys/bus/event_source/devices/msr/events/tsc */
attr.disabled = 1;
events[0].fd = perf_event_open(&attr, -1, 0, -1, 0);
attr.type = 9; /* /sys/bus/event_source/devices/ibs_op/type */
attr.config = (0x1 << 19); /* /sys/bus/event_source/devices/ibs_op/format/cnt_ctl */
attr.disabled = 1;
/* perf_read_event_details() can parse PERF_SAMPLE_TID only */
attr.sample_type = PERF_SAMPLE_TID;
attr.sample_period = 10000000;
events[1].fd = perf_event_open(&attr, -1, 0, events[0].fd, 0);
events[1].p = perf_event_mmap(events[1].fd);
ioctl(events[0].fd, PERF_EVENT_IOC_RESET, 0);
ioctl(events[1].fd, PERF_EVENT_IOC_RESET, 0);
ioctl(events[0].fd, PERF_EVENT_IOC_ENABLE, 0);
ioctl(events[1].fd, PERF_EVENT_IOC_ENABLE, 0);
i = 5;
while(i--) {
sleep(1);
read(events[0].fd, &count1, sizeof(long long));
read(events[1].fd, &count2, sizeof(long long));
perf_read_event_details(events[1].p);
ioctl(events[0].fd, PERF_EVENT_IOC_RESET, 0);
ioctl(events[1].fd, PERF_EVENT_IOC_RESET, 0);
printf("%lld, %lld\n", count1, count2);
}
close(events[1].fd);
close(events[0].fd);
}
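The wrap-around handling in copy_event_data() and mmap_read() above can be tried in isolation. Below is a minimal sketch, assuming a tiny 8-byte ring in place of the real perf mmap area; RING_SIZE, ring_copy() and ring_read() are hypothetical names used only for illustration, and the memory barrier is left out because nothing here runs concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PERF_MMAP_DATA_SIZE; must be a power of two. */
#define RING_SIZE 8u
#define RING_MASK (RING_SIZE - 1)

/* Copy `size` bytes starting at `offset` in the ring, wrapping if needed. */
static void ring_copy(const unsigned char *ring, size_t offset,
                      unsigned char *dest, size_t size)
{
    assert(offset < RING_SIZE && size <= RING_SIZE);

    size_t first = RING_SIZE - offset; /* bytes available before the wrap */

    if (size <= first) {
        memcpy(dest, ring + offset, size);
    } else {
        memcpy(dest, ring + offset, first);
        memcpy(dest + first, ring, size - first);
    }
}

/*
 * Consume `size` bytes at the free-running position `*tail`.
 * Returns -1 without touching `*tail` if fewer bytes are available.
 */
static int ring_read(const unsigned char *ring, unsigned long head,
                     unsigned long *tail, unsigned char *dest, size_t size)
{
    if (head - *tail < size)
        return -1;

    ring_copy(ring, *tail & RING_MASK, dest, size);
    *tail += size;
    return 0;
}
```

As in the program above, head and tail are free-running counters and only the masked tail is used as a buffer offset, so a record that straddles the end of the buffer is reassembled from two chunks.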
Example run:
[term1~]$ taskset -c 0 top
[term2~]$ pgrep top
85747
[term2~]$ sudo ./perf-group-sample-count
1996319080, 0
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 0, tid: 0
1996510960, 150000000
1996325400, 0
1996348600, 0
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 85747, tid: 85747
pid: 0, tid: 0
1996341420, 130000000
Thanks,
Ravi
[1] https://lore.kernel.org/all/[email protected]
[...]
> static void
> -ctx_sched_in(struct perf_event_context *ctx,
> - struct perf_cpu_context *cpuctx,
> - enum event_type_t event_type,
> +ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type,
> struct task_struct *task)
> {
> + struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
> int is_active = ctx->is_active;
> u64 now;
>
> @@ -3818,6 +3905,7 @@ ctx_sched_in(struct perf_event_context *ctx,
> /* start ctx time */
> now = perf_clock();
> ctx->timestamp = now;
> + // XXX ctx->task =? task
Couldn't get this XXX, it's from your original patch. If you can recall, it
would be helpful.
> perf_cgroup_set_timestamp(task, ctx);
> }
Also, this hunk is under if (is_active ^ EVENT_TIME), which effectively is
(is_active != EVENT_TIME). I'm assuming it should be (is_active & EVENT_TIME)?
Thanks,
Ravi
On 27-Jun-22 9:48 AM, Ravi Bangoria wrote:
>
> On 13-Jun-22 8:05 PM, Peter Zijlstra wrote:
>>
>>
>> Right, so sorry for being incredibly tardy on this. Find below the
>> patch fwd ported to something recent.
>>
>> I'll reply to this with fixes and comments.
>
> Thanks! I've resumed on this but my mind has lost all the context so
> it might take a while for me to reply to your comments. Please bear
> with me if I'm bit slow.
Sorry, it took me a while to get started on it. Anyway, thanks for
providing the fixes. I applied those and ran some tests on an AMD Milan machine:
- Built-in perf tests ran fine without any issues.
- perf_event_tests reported one BUG_ON() and one WARN_ON(). I'll work on
fixing those.
- Ran perf fuzzer for almost a day. It reported one softlockup, but the system
recovered from it. Later it reported one hardlockup, but unfortunately
my config had HARDLOCKUP_PANIC set, so I couldn't confirm whether that
hardlockup was recoverable. Anyway, the system was running pretty much
fine until then.
- No lockdep warnings were observed in any of the tests.
I'll work on verifying the functionality changes.
Thanks,
Ravi
On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>
>> @@ -3652,17 +3697,28 @@ static noinline int visit_groups_merge(s
>> .size = ARRAY_SIZE(itrs),
>> };
>> /* Events not within a CPU context may be on any CPU. */
>> - __heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
>> + __heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
>> }
>> evt = event_heap.data;
>>
>> - __heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
>> + __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
>>
>> #ifdef CONFIG_CGROUP_PERF
>> for (; css; css = css->parent)
>> - __heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
>> + __heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
>> #endif
>>
>> + if (event_heap.nr) {
>> + /*
>> + * XXX: For now, visit_groups_merge() gets called with pmu
>> + * pointer never NULL. But these functions needs to be called
>> + * once for each pmu if I implement pmu=NULL optimization.
>> + */
>> + __link_epc((*evt)->pmu_ctx);
>> + perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
>> + }
>> +
>> +
>> min_heapify_all(&event_heap, &perf_min_heap);
>>
>> while (event_heap.nr) {
>
>> @@ -3741,39 +3799,67 @@ static int merge_sched_in(struct perf_ev
>> return 0;
>> }
>>
>> -static void
>> -ctx_pinned_sched_in(struct perf_event_context *ctx,
>> - struct perf_cpu_context *cpuctx)
>> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
>> {
>> + struct perf_event_pmu_context *pmu_ctx;
>> int can_add_hw = 1;
>>
>> - if (ctx != &cpuctx->ctx)
>> - cpuctx = NULL;
>> -
>> - visit_groups_merge(cpuctx, &ctx->pinned_groups,
>> - smp_processor_id(),
>> - merge_sched_in, &can_add_hw);
>> + if (pmu) {
>> + visit_groups_merge(ctx, &ctx->pinned_groups,
>> + smp_processor_id(), pmu,
>> + merge_sched_in, &can_add_hw);
>> + } else {
>> + /*
>> + * XXX: This can be optimized for per-task context by calling
>> + * visit_groups_merge() only once with:
>> + * 1) pmu=NULL
>> + * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
>> + * 3) Making can_add_hw a per-pmu variable
>> + *
>> + * Though, it can not be opimized for per-cpu context because
>> + * per-cpu rb-tree consist of pmu-subtrees and pmu-subtrees
>> + * consist of cgroup-subtrees. i.e. a cgroup events of same
>> + * cgroup but different pmus are seperated out into respective
>> + * pmu-subtrees.
>> + */
>> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> + can_add_hw = 1;
>> + visit_groups_merge(ctx, &ctx->pinned_groups,
>> + smp_processor_id(), pmu_ctx->pmu,
>> + merge_sched_in, &can_add_hw);
>> + }
>> + }
>> }
>
> I'm not sure I follow.. task context can have multiple PMUs just the
> same as CPU context can, that's more or less the entire point of the
> patch.
The current rbtree key is {cpu, cgroup_id, group_idx}. However, the effective
key for a task-specific context is {cpu, group_idx} because cgroup_id is always
0, and the effective key for a cpu-specific context is {cgroup_id, group_idx}
because cpu is the same for the entire rbtree.
With the new design, the rbtree key will be {cpu, pmu, cgroup_id, group_idx}.
But as explained above, the effective key for a task-specific context will be
{cpu, pmu, group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(),
same as you did in the very first RFC[1]. (This may make things more
complicated though, because we might also need to increase the heap size to
accommodate all pmu events in a single heap. The current heap size is 2 for a
task-specific context, which is sufficient if we iterate over all pmus.)
The same optimization won't work for a cpu-specific context because its
effective key would be {pmu, cgroup_id, group_idx}, i.e. each pmu subtree is
made up of cgroup subtrees.
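To make the subtree argument concrete, here is a minimal sketch of a lexicographic comparison over {cpu, pmu, cgroup_id, group_idx}. It is not the kernel's perf_event_groups_cmp(); struct group_key and group_key_cmp() are hypothetical, the PMU is compared by address, and the cgroup is reduced to a plain integer id.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the sort key described above. */
struct group_key {
    int         cpu;        /* -1 means "any CPU" */
    const void *pmu;        /* PMU identity, compared by address */
    uint64_t    cgroup_id;  /* 0 when no cgroup is attached */
    uint64_t    group_idx;  /* insertion order within the tree */
};

static int cmp_u64(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}

/* Lexicographic order: cpu first, then pmu, then cgroup_id, then group_idx. */
static int group_key_cmp(const struct group_key *a, const struct group_key *b)
{
    int c;

    assert(a && b);

    if (a->cpu != b->cpu)
        return a->cpu < b->cpu ? -1 : 1;

    c = cmp_u64((uintptr_t)a->pmu, (uintptr_t)b->pmu);
    if (c)
        return c;

    c = cmp_u64(a->cgroup_id, b->cgroup_id);
    if (c)
        return c;

    return cmp_u64(a->group_idx, b->group_idx);
}
```

Because pmu is compared before cgroup_id, all events of one PMU on a given cpu sort together, and events of the same cgroup but different PMUs end up in different pmu subtrees. When cgroup_id is always 0, as in a task context, the third comparison never decides anything and the key degenerates to {cpu, pmu, group_idx}.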
Please correct me if my understanding is wrong.
Thanks,
Ravi
[1]:
https://lore.kernel.org/lkml/[email protected]
> pulling up the ctx->mutex makes things simpler, but also violates the
> locking order vs exec_update_lock.
>
> Pull that lock up as well...
I'm not able to apply this patch as is but I get the idea. Few
questions below...
>
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -12254,13 +12254,29 @@ SYSCALL_DEFINE5(perf_event_open,
> if (pmu->task_ctx_nr == perf_sw_context)
> event->event_caps |= PERF_EV_CAP_SOFTWARE;
>
> + if (task) {
> + err = down_read_interruptible(&task->signal->exec_update_lock);
> + if (err)
> + goto err_alloc;
> +
> + /*
> + * We must hold exec_update_lock across this and any potential
> + * perf_install_in_context() call for this new event to
> + * serialize against exec() altering our credentials (and the
> + * perf_event_exit_task() that could imply).
> + */
> + err = -EACCES;
> + if (!perf_check_permission(&attr, task))
> + goto err_cred;
> + }
> +
> /*
> * Get the target context (task or percpu):
> */
> ctx = find_get_context(task, event);
> if (IS_ERR(ctx)) {
> err = PTR_ERR(ctx);
> - goto err_alloc;
> + goto err_cred;
> }
>
> mutex_lock(&ctx->mutex);
> @@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
> goto err_context;
> }
>
> - event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
> - if (IS_ERR(event_file)) {
> - err = PTR_ERR(event_file);
> - event_file = NULL;
> - goto err_context;
> - }
> -
> - if (task) {
> - err = down_read_interruptible(&task->signal->exec_update_lock);
> - if (err)
> - goto err_file;
> -
> - /*
> - * We must hold exec_update_lock across this and any potential
> - * perf_install_in_context() call for this new event to
> - * serialize against exec() altering our credentials (and the
> - * perf_event_exit_task() that could imply).
> - */
> - err = -EACCES;
> - if (!perf_check_permission(&attr, task))
> - goto err_cred;
> - }
> -
> - if (ctx->task == TASK_TOMBSTONE) {
> - err = -ESRCH;
> - goto err_locked;
> - }
I think we need to keep (ctx->task == TASK_TOMBSTONE) check?
> -
> if (!perf_event_validate_size(event)) {
> err = -E2BIG;
> - goto err_locked;
> - }
> -
> - if (!task) {
> - /*
> - * Check if the @cpu we're creating an event for is online.
> - *
> - * We use the perf_cpu_context::ctx::mutex to serialize against
> - * the hotplug notifiers. See perf_event_{init,exit}_cpu().
> - */
> - struct perf_cpu_context *cpuctx =
> - container_of(ctx, struct perf_cpu_context, ctx);
> -
> - if (!cpuctx->online) {
> - err = -ENODEV;
> - goto err_locked;
> - }
> + goto err_context;
Why did you remove this hunk? We should confirm whether cpu is online or not
before creating event. No?
Thanks,
Ravi
[...]
> /*
> @@ -2718,7 +2706,6 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> struct perf_event_context *task_ctx,
> enum event_type_t event_type)
> {
> - enum event_type_t ctx_event_type;
> bool cpu_event = !!(event_type & EVENT_CPU);
>
> /*
> @@ -2728,11 +2715,13 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> if (event_type & EVENT_PINNED)
> event_type |= EVENT_FLEXIBLE;
>
> - ctx_event_type = event_type & EVENT_ALL;
> + event_type &= EVENT_ALL;
>
> - perf_pmu_disable(cpuctx->ctx.pmu);
> - if (task_ctx)
> - task_ctx_sched_out(cpuctx, task_ctx, event_type);
> + perf_ctx_disable(&cpuctx->ctx);
> + if (task_ctx) {
> + perf_ctx_disable(task_ctx);
> + task_ctx_sched_out(task_ctx, event_type);
> + }
>
> /*
> * Decide which cpu ctx groups to schedule out based on the types
> @@ -2742,17 +2731,20 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> * - otherwise, do nothing more.
> */
> if (cpu_event)
> - cpu_ctx_sched_out(cpuctx, ctx_event_type);
> - else if (ctx_event_type & EVENT_PINNED)
> - cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> + ctx_sched_out(&cpuctx->ctx, event_type);
> + else if (event_type & EVENT_PINNED)
> + ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
>
> perf_event_sched_in(cpuctx, task_ctx, current);
> - perf_pmu_enable(cpuctx->ctx.pmu);
> +
> + perf_ctx_enable(&cpuctx->ctx);
> + if (task_ctx)
> + perf_ctx_enable(task_ctx);
> }
ctx_resched() reschedules the entire perf_event_context when adding a new
event to the context or enabling an existing event in the context. We can
probably optimize it by rescheduling only the affected pmu_ctx.
Thanks,
Ravi
[...]
> You mentioned trouble with cpc->task_epc, there's one rebase mistake
> from you and an original bug from me.
>
> You lost the last hunk, I forgot to clear cpc on
> perf_remove_from_context().
>
> With these fixes I can run: 'perf test' without things going
> insta-splat.
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2311,6 +2311,7 @@ __perf_remove_from_context(struct perf_e
> struct perf_event_context *ctx,
> void *info)
> {
> + struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
> unsigned long flags = (unsigned long)info;
>
> if (ctx->is_active & EVENT_TIME) {
> @@ -2325,8 +2326,17 @@ __perf_remove_from_context(struct perf_e
> perf_child_detach(event);
> list_del_event(event, ctx);
>
> - if (!event->pmu_ctx->nr_events)
> - event->pmu_ctx->rotate_necessary = 0;
> + if (!pmu_ctx->nr_events) {
> + pmu_ctx->rotate_necessary = 0;
> +
> + if (ctx->task) {
IIUC, this should also check for ctx->is_active? i.e.
if (ctx->task && ctx->is_active) {
...
> + struct perf_cpu_pmu_context *cpc;
> +
> + cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
> + WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
> + cpc->task_epc = NULL;
> + }
> + }
Thanks,
Ravi
> @@ -915,7 +925,7 @@ static int perf_cgroup_ensure_storage(struct perf_event *event,
> heap_size++;
>
> for_each_possible_cpu(cpu) {
> - cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
> + cpuctx = this_cpu_ptr(&cpu_context);
This should be fixed as well:
s/this_cpu_ptr(&cpu_context)/per_cpu_ptr(&cpu_context, cpu)/
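The effect of that mix-up can be shown with a small user-space model. This is only a sketch: the per-CPU variable is modelled as a plain array, current_cpu stands in for whichever CPU the code happens to run on, and all names below (cpu_ctx, this_cpu_ctx(), per_cpu_ctx(), grow_all_*()) are hypothetical.

```c
#include <assert.h>

#define NR_CPUS 4

/* Hypothetical model of a per-CPU context holding a heap size. */
struct cpu_ctx {
    int heap_size;
};

static struct cpu_ctx cpu_context[NR_CPUS];
static int current_cpu; /* stand-in for the CPU executing the code */

/* Models this_cpu_ptr(): always the local CPU's instance. */
static struct cpu_ctx *this_cpu_ctx(void)
{
    return &cpu_context[current_cpu];
}

/* Models per_cpu_ptr(): the instance belonging to the given CPU. */
static struct cpu_ctx *per_cpu_ctx(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return &cpu_context[cpu];
}

/* Buggy loop: iterates over every CPU but only ever touches the local one. */
static void grow_all_buggy(int heap_size)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        struct cpu_ctx *ctx = this_cpu_ctx();

        if (heap_size > ctx->heap_size)
            ctx->heap_size = heap_size;
    }
}

/* Fixed loop: each iteration updates the context of the CPU it names. */
static void grow_all_fixed(int heap_size)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        struct cpu_ctx *ctx = per_cpu_ctx(cpu);

        if (heap_size > ctx->heap_size)
            ctx->heap_size = heap_size;
    }
}
```

With the buggy variant, only the context of the CPU running the loop is grown and every other CPU keeps its old heap size, which is why the loop over all possible CPUs needs the per_cpu_ptr() form.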
Thanks,
Ravi
On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
>
> > pulling up the ctx->mutex makes things simpler, but also violates the
> > locking order vs exec_update_lock.
> >
> > Pull that lock up as well...
>
> I'm not able to apply this patch as is but I get the idea. Few
> questions below...
I was just about to rebase the 'series' to current, let me do that and
get back to you on the specifics.
On Mon, Aug 22, 2022 at 05:29:11PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> >
> > > pulling up the ctx->mutex makes things simpler, but also violates the
> > > locking order vs exec_update_lock.
> > >
> > > Pull that lock up as well...
> >
> > I'm not able to apply this patch as is but I get the idea. Few
> > questions below...
>
> I was just about to rebase the 'series' to current, let me do that and
> get back to you on the specifics.
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/wip.rewrite
I need to go make dinner, but I'll try and remember how it all was
supposed to work later this evening when the loonies are in bed.
On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> > @@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
> > goto err_context;
> > }
> >
> > - event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
> > - if (IS_ERR(event_file)) {
> > - err = PTR_ERR(event_file);
> > - event_file = NULL;
> > - goto err_context;
> > - }
> > -
> > - if (task) {
> > - err = down_read_interruptible(&task->signal->exec_update_lock);
> > - if (err)
> > - goto err_file;
> > -
> > - /*
> > - * We must hold exec_update_lock across this and any potential
> > - * perf_install_in_context() call for this new event to
> > - * serialize against exec() altering our credentials (and the
> > - * perf_event_exit_task() that could imply).
> > - */
> > - err = -EACCES;
> > - if (!perf_check_permission(&attr, task))
> > - goto err_cred;
> > - }
> > -
> > - if (ctx->task == TASK_TOMBSTONE) {
> > - err = -ESRCH;
> > - goto err_locked;
> > - }
>
> I think we need to keep (ctx->task == TASK_TOMBSTONE) check?
I think so too; in fact the code I have still has it, perhaps it was
there right before this patch?
> > -
> > if (!perf_event_validate_size(event)) {
> > err = -E2BIG;
> > - goto err_locked;
> > - }
> > -
> > - if (!task) {
> > - /*
> > - * Check if the @cpu we're creating an event for is online.
> > - *
> > - * We use the perf_cpu_context::ctx::mutex to serialize against
> > - * the hotplug notifiers. See perf_event_{init,exit}_cpu().
> > - */
> > - struct perf_cpu_context *cpuctx =
> > - container_of(ctx, struct perf_cpu_context, ctx);
> > -
> > - if (!cpuctx->online) {
> > - err = -ENODEV;
> > - goto err_locked;
> > - }
> > + goto err_context;
>
> Why did you remove this hunk? We should confirm whether cpu is online or not
> before creating event. No?
Idem.
Perhaps it is best if we look at the end result of all these patches
combined and then I'll fold the lot if we're in agreement and then we
can forget about these intermediate steps.
On Tue, Aug 02, 2022 at 11:40:34AM +0530, Ravi Bangoria wrote:
> On 13-Jun-22 8:25 PM, Peter Zijlstra wrote:
> > On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> >> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
> >> goto err_task;
> >> }
> >>
> >> + // XXX premature; what if this is allowed, but we get moved to a PMU
> >> + // that doesn't have this.
> >> if (is_sampling_event(event)) {
> >> if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
> >> err = -EOPNOTSUPP;
> >
> > No; this really should be against the event's native PMU. If the event
> > can't natively sample, it can't sample when placed in another group
> > either.
>
> Right. But IIUC, the question was, would there be any issue if we allow
> grouping of perf_sw_context sampling event as group leader and
> perf_{hw|invalid}_context counting event as group member. I think no. It
> should just work fine. And, there could be real usecases of it as you
> described in one old thread[1].
Like you I need to bend my brain around this again, but I'm not seeing a
contradiction. The use-case from [1] is a software sampler with a bunch
of non-sampling uncore events.
The uncore events aren't sampling, they are simply read by the software
event (SAMPLE_READ). And moving the sampling software event to the
non-sample capable uncore PMU shouldn't matter.
That is; the code as it stands here seems right, we should check
is_sampling_event() against an event's native pmu->capabilities.
Or am I misunderstanding things?
On 22-Aug-22 9:13 PM, Peter Zijlstra wrote:
> On Mon, Aug 22, 2022 at 05:29:11PM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
>>>
>>>> pulling up the ctx->mutex makes things simpler, but also violates the
>>>> locking order vs exec_update_lock.
>>>>
>>>> Pull that lock up as well...
>>>
>>> I'm not able to apply this patch as is but I get the idea. Few
>>> questions below...
>>
>> I was just about to rebase the 'series' to current, let me do that and
>> get back to you on the specifics.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/wip.rewrite
An additional set of changes on top of this tree is required to build and boot,
at least on my AMD machine:
---
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ccd231ea6a4e..94fb65d7b291 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1248,7 +1248,7 @@ static inline void amd_pmu_brs_add(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- perf_sched_cb_inc(event->ctx->pmu);
+ perf_sched_cb_inc(event->pmu_ctx->pmu);
cpuc->lbr_users++;
/*
* No need to reset BRS because it is reset
@@ -1263,7 +1263,7 @@ static inline void amd_pmu_brs_del(struct perf_event *event)
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
- perf_sched_cb_dec(event->ctx->pmu);
+ perf_sched_cb_dec(event->pmu_ctx->pmu);
}
void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 31ae032d6783..086e37fa32be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -843,7 +843,7 @@ static void perf_cgroup_switch(struct task_struct *task)
WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
if (READ_ONCE(cpuctx->cgrp) == cgrp)
- continue;
+ return;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
perf_ctx_disable(&cpuctx->ctx);
@@ -881,7 +881,7 @@ static int perf_cgroup_ensure_storage(struct perf_event *event,
heap_size++;
for_each_possible_cpu(cpu) {
- cpuctx = this_cpu_ptr(&cpu_context);
+ cpuctx = per_cpu_ptr(&cpu_context, cpu);
if (heap_size <= cpuctx->heap_size)
continue;
@@ -2315,7 +2315,7 @@ __perf_remove_from_context(struct perf_event *event,
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
- if (ctx->task) {
+ if (ctx->task && ctx->is_active) {
struct perf_cpu_pmu_context *cpc;
cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
@@ -11972,6 +11972,15 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
goto out;
}
+static void mutex_lock_double(struct mutex *a, struct mutex *b)
+{
+ if (b < a)
+ swap(a, b);
+
+ mutex_lock(a);
+ mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
+}
+
static int
perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
{
---
With this, I can run 'perf test' and perf_event_tests without any error in
dmesg. I'll run perf fuzzer over night and see if it reports any issue.
Thanks,
Ravi
> With this, I can run 'perf test' and perf_event_tests without any error in
> dmesg. I'll run perf fuzzer over night and see if it reports any issue.
I hit a kernel crash with the fuzzer. I have yet to debug it. Here is the trace:
BUG: kernel NULL pointer dereference, address: 0000000000000198
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 48 PID: 0 Comm: swapper/48 Not tainted 6.0.0-rc1-perf-event-context-peter-queue+ #153
Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.7.3 03/31/2022
RIP: 0010:x86_pmu_enable_event+0x3c/0x120
Code: a0 63 82 e8 26 7c cd 00 65 8b 05 4f 29 01 7f 85 c0 75 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 84 a0 63 82 e8 04 7c cd 00 <8b> 8b 98 01 00 00 65 48 8b 2d 2e 3a 01 7f 85 c9 0f 85 9a 00 00 00
RSP: 0018:ffffc900004e7d78 EFLAGS: 00010002
RAX: 0000000000000030 RBX: 0000000000000000 RCX: 00000000c0010200
RDX: 0000000000000000 RSI: ffffffff8263a084 RDI: ffffffff825d5466
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000006 R11: ffffc900004e7ba0 R12: ffff88bff6019c60
R13: ffff88bff6019e60 R14: ffffffff82c35220 R15: ffffc9003ca83d38
FS: 0000000000000000(0000) GS:ffff88bff6000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000198 CR3: 000000407be26003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
<TASK>
amd_pmu_enable_all+0x68/0xb0
ctx_resched+0xd9/0x150
event_function+0xb8/0x130
? hrtimer_start_range_ns+0x141/0x4a0
? perf_duration_warn+0x30/0x30
remote_function+0x4d/0x60
__flush_smp_call_function_queue+0xc4/0x500
flush_smp_call_function_queue+0x11d/0x1b0
do_idle+0x18f/0x2d0
cpu_startup_entry+0x19/0x20
start_secondary+0x121/0x160
secondary_startup_64_no_verify+0xe5/0xeb
</TASK>
Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c n$netlink intel_rapl_msr intel_rapl_common kvm_amd kvm ipmi_ssif wmi_bmof irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sp5100_tco rapl pcspkr acpi_ipmi ccp k10temp i2c_piix4 wmi ipmi_si acpi_power_meter vfat fat ext4 mbcache
g200 i2c_algo_bit drm_shmem_helper drm_kms_helper sg syscopyarea nvme sysfillrect sysimgblt fb_sys_fops nvme_core ahci libahci t10_pi drm crc32c_intel tg3 crc64_rocksoft libata crc64 megaraid_sas ipmi_devintf ipmi_msghandl$r fuse
CR2: 0000000000000198
---[ end trace 0000000000000000 ]---
RIP: 0010:x86_pmu_enable_event+0x3c/0x120
Code: a0 63 82 e8 26 7c cd 00 65 8b 05 4f 29 01 7f 85 c0 75 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 84 a0 63 82 e8 04 7c cd 00 <8b> 8b 98 01 00 00 65 48 8b 2d 2e 3a 01 7f 85 c9 0f 85 9a 00 00 00
RSP: 0018:ffffc900004e7d78 EFLAGS: 00010002
RAX: 0000000000000030 RBX: 0000000000000000 RCX: 00000000c0010200
RDX: 0000000000000000 RSI: ffffffff8263a084 RDI: ffffffff825d5466
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000006 R11: ffffc900004e7ba0 R12: ffff88bff6019c60
R13: ffff88bff6019e60 R14: ffffffff82c35220 R15: ffffc9003ca83d38
FS: 0000000000000000(0000) GS:ffff88bff6000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000198 CR3: 000000407be26003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---
On 22-Aug-22 10:14 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:40:34AM +0530, Ravi Bangoria wrote:
>> On 13-Jun-22 8:25 PM, Peter Zijlstra wrote:
>>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>>>> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
>>>> goto err_task;
>>>> }
>>>>
>>>> + // XXX premature; what if this is allowed, but we get moved to a PMU
>>>> + // that doesn't have this.
>>>> if (is_sampling_event(event)) {
>>>> if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
>>>> err = -EOPNOTSUPP;
>>>
>>> No; this really should be against the event's native PMU. If the event
>>> can't natively sample, it can't sample when placed in another group
>>> either.
>>
>> Right. But IIUC, the question was, would there be any issue if we allow
>> grouping of perf_sw_context sampling event as group leader and
>> perf_{hw|invalid}_context counting event as group member. I think no. It
>> should just work fine. And, there could be real usecases of it as you
>> described in one old thread[1].
>
> Like you I need to bend my brain around this again, but I'm not seeing a
> contradiction. The use-case from [1] is a software sampler with a bunch
> of non-sampling uncore events.
>
> The uncore events aren't sampling, they are simply read by the software
> event (SAMPLE_READ). And moving the sampling software event to the
> non-sample capable uncore PMU shouldn't matter.
Ok.
> That is; the code as it stands here seems right, we should check
> is_sampling_event() against an event's native pmu->capabilities.
>
> Or am I misunderstanding things?
No, that's correct. We must use event's native pmu to check capabilities.
I'll remove this comment from code.
Thanks,
Ravi
On 22-Aug-22 10:22 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
>
>>> @@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
>>> goto err_context;
>>> }
>>>
>>> - event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
>>> - if (IS_ERR(event_file)) {
>>> - err = PTR_ERR(event_file);
>>> - event_file = NULL;
>>> - goto err_context;
>>> - }
>>> -
>>> - if (task) {
>>> - err = down_read_interruptible(&task->signal->exec_update_lock);
>>> - if (err)
>>> - goto err_file;
>>> -
>>> - /*
>>> - * We must hold exec_update_lock across this and any potential
>>> - * perf_install_in_context() call for this new event to
>>> - * serialize against exec() altering our credentials (and the
>>> - * perf_event_exit_task() that could imply).
>>> - */
>>> - err = -EACCES;
>>> - if (!perf_check_permission(&attr, task))
>>> - goto err_cred;
>>> - }
>>> -
>>> - if (ctx->task == TASK_TOMBSTONE) {
>>> - err = -ESRCH;
>>> - goto err_locked;
>>> - }
>>
>> I think we need to keep (ctx->task == TASK_TOMBSTONE) check?
>
> I think so too; in fact the code I have still has it, perhaps it was
> there right before this patch?
>
>>> -
>>> if (!perf_event_validate_size(event)) {
>>> err = -E2BIG;
>>> - goto err_locked;
>>> - }
>>> -
>>> - if (!task) {
>>> - /*
>>> - * Check if the @cpu we're creating an event for is online.
>>> - *
>>> - * We use the perf_cpu_context::ctx::mutex to serialize against
>>> - * the hotplug notifiers. See perf_event_{init,exit}_cpu().
>>> - */
>>> - struct perf_cpu_context *cpuctx =
>>> - container_of(ctx, struct perf_cpu_context, ctx);
>>> -
>>> - if (!cpuctx->online) {
>>> - err = -ENODEV;
>>> - goto err_locked;
>>> - }
>>> + goto err_context;
>>
>> Why did you remove this hunk? We should confirm whether cpu is online or not
>> before creating event. No?
>
> Idem.
>
> Perhaps it is best if we look at the end result of all these patches
> combined and then I'll fold the lot if we're in agreement and then we
> can forget about these intermediate steps.
Let me accumulate all these changes, rebase to v6.0-rc2 and send RFC v3.
Thanks,
Ravi
On Mon, Aug 22, 2022 at 10:07:45PM +0530, Ravi Bangoria wrote:
> On 22-Aug-22 9:13 PM, Peter Zijlstra wrote:
> > On Mon, Aug 22, 2022 at 05:29:11PM +0200, Peter Zijlstra wrote:
> >> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> >>>
> >>>> pulling up the ctx->mutex makes things simpler, but also violates the
> >>>> locking order vs exec_update_lock.
> >>>>
> >>>> Pull that lock up as well...
> >>>
> >>> I'm not able to apply this patch as is but I get the idea. Few
> >>> questions below...
> >>
> >> I was just about to rebase the 'series' to current, let me do that and
> >> get back to you on the specifics.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/wip.rewrite
>
> Additional set of changes on top of this tree is required to build and boot,
> at least on my AMD machine:
Right; clearly I didn't even test-build that thing... your changes look
fine and I added them on top; the above tree should be updated.
On Tue, Aug 02, 2022 at 11:43:03AM +0530, Ravi Bangoria wrote:
> [...]
>
> > /*
> > @@ -2718,7 +2706,6 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> > struct perf_event_context *task_ctx,
> > enum event_type_t event_type)
> > {
> > - enum event_type_t ctx_event_type;
> > bool cpu_event = !!(event_type & EVENT_CPU);
> >
> > /*
> > @@ -2728,11 +2715,13 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> > if (event_type & EVENT_PINNED)
> > event_type |= EVENT_FLEXIBLE;
> >
> > - ctx_event_type = event_type & EVENT_ALL;
> > + event_type &= EVENT_ALL;
> >
> > - perf_pmu_disable(cpuctx->ctx.pmu);
> > - if (task_ctx)
> > - task_ctx_sched_out(cpuctx, task_ctx, event_type);
> > + perf_ctx_disable(&cpuctx->ctx);
> > + if (task_ctx) {
> > + perf_ctx_disable(task_ctx);
> > + task_ctx_sched_out(task_ctx, event_type);
> > + }
> >
> > /*
> > * Decide which cpu ctx groups to schedule out based on the types
> > @@ -2742,17 +2731,20 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> > * - otherwise, do nothing more.
> > */
> > if (cpu_event)
> > - cpu_ctx_sched_out(cpuctx, ctx_event_type);
> > - else if (ctx_event_type & EVENT_PINNED)
> > - cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> > + ctx_sched_out(&cpuctx->ctx, event_type);
> > + else if (event_type & EVENT_PINNED)
> > + ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
> >
> > perf_event_sched_in(cpuctx, task_ctx, current);
> > - perf_pmu_enable(cpuctx->ctx.pmu);
> > +
> > + perf_ctx_enable(&cpuctx->ctx);
> > + if (task_ctx)
> > + perf_ctx_enable(task_ctx);
> > }
>
> ctx_resched() reschedule entire perf_event_context while adding new event
> to the context or enabling existing event in the context. We can probably
> optimize it by rescheduling only affected pmu_ctx.
Yes, it would probably make sense to add a pmu argument there and limit
the rescheduling where possible.
On Tue, Aug 02, 2022 at 11:47:24AM +0530, Ravi Bangoria wrote:
> [...]
>
> > static void
> > -ctx_sched_in(struct perf_event_context *ctx,
> > - struct perf_cpu_context *cpuctx,
> > - enum event_type_t event_type,
> > +ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type,
> > struct task_struct *task)
> > {
> > + struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
> > int is_active = ctx->is_active;
> > u64 now;
> >
> > @@ -3818,6 +3905,7 @@ ctx_sched_in(struct perf_event_context *ctx,
> > /* start ctx time */
> > now = perf_clock();
> > ctx->timestamp = now;
> > + // XXX ctx->task =? task
>
> I couldn't work out this XXX; it's from your original patch. If you can recall
> what it meant, that would be helpful.
No memories at all; but looking at it, it seems to worry whether ctx->task
is up-to-date. In this context the only thing that relies on the task is
the cgroup, for which we update the timestamp in the next statement:
> > perf_cgroup_set_timestamp(task, ctx);
I suppose I should really write less cryptic notes; then again, I never
imagined this would take that many years to complete :/
> > }
>
> Also, this hunk is under if (is_active ^ EVENT_TIME), which effectively is
> (is_active != EVENT_TIME). I'm assuming it should be (is_active & EVENT_TIME)?
So that code is identical to what it currently is upstream; but yes that
looks somewhat dodgy.
So the code itself (does as the comment says) starts time. This should
only be done if EVENT_TIME is not set. That is, I'm thinking it should
be something like:
!(is_active & EVENT_TIME)
which happens to be the same as:
is_active ^ EVENT_TIME
under the assumption is_active contains no other bits -- which I don't
think is a valid assumption.
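A small userspace sketch of the difference between the two conditions (not
kernel code). The flag values are assumed to mirror enum event_type_t, and the
helper names start_time_xor() and start_time_mask() are made up for
illustration:

```c
#include <stdbool.h>

/* Assumed flag values, modelled on enum event_type_t; illustration only. */
enum {
	EVENT_FLEXIBLE = 0x1,
	EVENT_PINNED   = 0x2,
	EVENT_TIME     = 0x4,
};

/* The condition as written today: true whenever is_active != EVENT_TIME. */
static bool start_time_xor(int is_active)
{
	return (is_active ^ EVENT_TIME) != 0;
}

/* The suggested condition: true only while EVENT_TIME is not yet set. */
static bool start_time_mask(int is_active)
{
	return !(is_active & EVENT_TIME);
}
```

The two agree for is_active == 0 and is_active == EVENT_TIME. They differ as
soon as another bit is set: for EVENT_TIME | EVENT_PINNED the XOR form is still
true, while the mask form is false.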
On Tue, Aug 02, 2022 at 11:46:32AM +0530, Ravi Bangoria wrote:
> On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
> > On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> >> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
> >> {
> >> + struct perf_event_pmu_context *pmu_ctx;
> >> int can_add_hw = 1;
> >>
> >> - if (ctx != &cpuctx->ctx)
> >> - cpuctx = NULL;
> >> -
> >> - visit_groups_merge(cpuctx, &ctx->pinned_groups,
> >> - smp_processor_id(),
> >> - merge_sched_in, &can_add_hw);
> >> + if (pmu) {
> >> + visit_groups_merge(ctx, &ctx->pinned_groups,
> >> + smp_processor_id(), pmu,
> >> + merge_sched_in, &can_add_hw);
> >> + } else {
> >> + /*
> >> + * XXX: This can be optimized for per-task context by calling
> >> + * visit_groups_merge() only once with:
> >> + * 1) pmu=NULL
> >> + * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
> >> + * 3) Making can_add_hw a per-pmu variable
> >> + *
> >> + * Though, it can not be optimized for per-cpu context because
> >> + * the per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
> >> + * consist of cgroup-subtrees, i.e. events of the same
> >> + * cgroup but different pmus are separated out into their respective
> >> + * pmu-subtrees.
> >> + */
> >> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> >> + can_add_hw = 1;
> >> + visit_groups_merge(ctx, &ctx->pinned_groups,
> >> + smp_processor_id(), pmu_ctx->pmu,
> >> + merge_sched_in, &can_add_hw);
> >> + }
> >> + }
> >> }
> >
> > I'm not sure I follow.. task context can have multiple PMUs just the
> > same as CPU context can, that's more or less the entire point of the
> > patch.
>
> Current rbtree key is {cpu, cgroup_id, group_idx}. However, effective key for
> task specific context is {cpu, group_idx} because cgroup_id is always 0. And
> effective key for cpu specific context is {cgroup_id, group_idx} because cpu
> is same for entire rbtree.
>
> With the new design, the rbtree key will be {cpu, pmu, cgroup_id, group_idx}. But as
> explained above, effective key for task specific context will be {cpu, pmu,
> group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(), same as you
> did in the very first RFC[1]. (This may make things more complicated though
> because we might also need to increase heap size to accommodate all pmu events
> in single heap. Current heap size is 2 for task specific context, which is
> sufficient if we iterate over all pmus).
>
> The same optimization won't work for cpu specific context because its effective
> key would be {pmu, cgroup_id, group_idx} i.e. each pmu subtree is made up of
> cgroup subtrees.
Agreed, new order is: {cpu, pmu, cgroup_id, group_idx}
Event scheduling looks at the {cpu, pmu, cgroup_id} subtree to find the
leftmost group_idx event to schedule next.
However, since cgroup events are per-cpu events, per-task events will
always have cgroup=NULL. Resulting in the subtrees:
{-1, pmu, NULL} and {cpu, pmu, NULL}
Which is what the code does, it iterates ctx->pmu_ctx_list to find all
@pmu values and then for each does the schedule dance.
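A userspace sketch of that ordering, using a sorted array in place of the
rbtree. struct group_key, group_key_cmp() and subtree_first() are hypothetical
names for illustration; the kernel compares fields of struct perf_event and
uses a struct pmu pointer rather than an integer:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flattened key; not the kernel's data structure. */
struct group_key {
	int cpu;			/* -1 means "any CPU" */
	int pmu;			/* stand-in for the struct pmu pointer */
	int cgroup_id;			/* 0 for per-task events */
	unsigned long group_idx;
};

/* Lexicographic order over {cpu, pmu, cgroup_id, group_idx}; qsort() compatible. */
static int group_key_cmp(const void *a, const void *b)
{
	const struct group_key *l = a, *r = b;

	if (l->cpu != r->cpu)
		return l->cpu < r->cpu ? -1 : 1;
	if (l->pmu != r->pmu)
		return l->pmu < r->pmu ? -1 : 1;
	if (l->cgroup_id != r->cgroup_id)
		return l->cgroup_id < r->cgroup_id ? -1 : 1;
	if (l->group_idx != r->group_idx)
		return l->group_idx < r->group_idx ? -1 : 1;
	return 0;
}

/*
 * Return the index of the leftmost entry of the {cpu, pmu, cgroup_id}
 * subtree in an array sorted with group_key_cmp(), or -1 if it is empty.
 */
static int subtree_first(const struct group_key *keys, size_t n,
			 int cpu, int pmu, int cgroup_id)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {		/* lower bound on the 3-field prefix */
		size_t mid = lo + (hi - lo) / 2;
		const struct group_key *k = &keys[mid];
		int less = k->cpu != cpu ? k->cpu < cpu :
			   k->pmu != pmu ? k->pmu < pmu :
			   k->cgroup_id < cgroup_id;

		if (less)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo < n && keys[lo].cpu == cpu && keys[lo].pmu == pmu &&
	    keys[lo].cgroup_id == cgroup_id)
		return (int)lo;
	return -1;
}
```

For a per-task context, a caller would look up {-1, pmu, 0} and {cpu, pmu, 0}
once per pmu found on the pmu_ctx_list, then advance through group_idx within
each subtree.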
Now, I suppose making that:
{-1, NULL, NULL}, {cpu, NULL, NULL}
could work, but wouldn't iterating the tree be more expensive than
just finding the sub-trees as we do now?
You also talk about extending the heap, which I read like
doing the heap-merge over:
{-1, pmu0, NULL}, {-1, pmu1, NULL}, ...
{cpu, pmu0, NULL}, ...
But that doesn't make sense, the schedule dance is per-pmu.
Or am I just still not getting it?
>> Also, this hunk is under if (is_active ^ EVENT_TIME), which effectively is
>> (is_active != EVENT_TIME). I'm assuming it should be (is_active & EVENT_TIME)?
>
> So that code is identical to what it currently is upstream; but yes that
> looks somewhat dodgy.
>
> So the code itself (does as the comment says) starts time.
Got it.
> This should only be done if EVENT_TIME is not set.
Does that mean context time should be started only when the context is getting
scheduled, i.e. when ctx->is_active is 0?
> That is, I'm thinking it should be something like:
>
> !(is_active & EVENT_TIME)
>
> which happens to be the same as:
>
> is_active ^ EVENT_TIME
>
> under the assumption is_active contains no other bits -- which I don't
> think is a valid assumption.
Correct, we can't assume that. There are cases where we call
ctx_sched_out(EVENT_TIME) followed by ctx_sched_in(EVENT_TIME) while PINNED /
FLEXIBLE are also set in ctx->is_active, for example perf_event_enable_on_exec().
In such cases, we will not advance ctx->time. Example:
child()
{
...
execv();
}
main()
{
pid = fork();
attr.enable_on_exec = 0;
fd0 = perf_event_open(&attr, pid, -1, -1, 0);
...
wait(NULL);
}
Here execv() will cause a call to ctx_sched_in() --> __update_context_time()
with adv=false. I think that's fine; ctx->time will be advanced sometime
later anyway.
Sorry, I've not spent enough time with this timekeeping code. Please let
me know if I'm talking nonsense.
Thanks,
Ravi
On 23-Aug-22 2:27 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:46:32AM +0530, Ravi Bangoria wrote:
>> On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
>>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>
>>>> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
>>>> {
>>>> + struct perf_event_pmu_context *pmu_ctx;
>>>> int can_add_hw = 1;
>>>>
>>>> - if (ctx != &cpuctx->ctx)
>>>> - cpuctx = NULL;
>>>> -
>>>> - visit_groups_merge(cpuctx, &ctx->pinned_groups,
>>>> - smp_processor_id(),
>>>> - merge_sched_in, &can_add_hw);
>>>> + if (pmu) {
>>>> + visit_groups_merge(ctx, &ctx->pinned_groups,
>>>> + smp_processor_id(), pmu,
>>>> + merge_sched_in, &can_add_hw);
>>>> + } else {
>>>> + /*
>>>> + * XXX: This can be optimized for per-task context by calling
>>>> + * visit_groups_merge() only once with:
>>>> + * 1) pmu=NULL
>>>> + * 2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
>>>> + * 3) Making can_add_hw a per-pmu variable
>>>> + *
>>>> + * Though, it can not be optimized for per-cpu context because
>>>> + * the per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
>>>> + * consist of cgroup-subtrees, i.e. events of the same
>>>> + * cgroup but different pmus are separated out into their respective
>>>> + * pmu-subtrees.
>>>> + */
>>>> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>>> + can_add_hw = 1;
>>>> + visit_groups_merge(ctx, &ctx->pinned_groups,
>>>> + smp_processor_id(), pmu_ctx->pmu,
>>>> + merge_sched_in, &can_add_hw);
>>>> + }
>>>> + }
>>>> }
>>>
>>> I'm not sure I follow.. task context can have multiple PMUs just the
>>> same as CPU context can, that's more or less the entire point of the
>>> patch.
>>
>> Current rbtree key is {cpu, cgroup_id, group_idx}. However, effective key for
>> task specific context is {cpu, group_idx} because cgroup_id is always 0. And
>> effective key for cpu specific context is {cgroup_id, group_idx} because cpu
>> is same for entire rbtree.
>>
>> With the new design, the rbtree key will be {cpu, pmu, cgroup_id, group_idx}. But as
>> explained above, effective key for task specific context will be {cpu, pmu,
>> group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(), same as you
>> did in the very first RFC[1]. (This may make things more complicated though
>> because we might also need to increase heap size to accommodate all pmu events
>> in single heap. Current heap size is 2 for task specific context, which is
>> sufficient if we iterate over all pmus).
>>
>> The same optimization won't work for cpu specific context because its effective
>> key would be {pmu, cgroup_id, group_idx} i.e. each pmu subtree is made up of
>> cgroup subtrees.
>
> Agreed, new order is: {cpu, pmu, cgroup_id, group_idx}
>
> Event scheduling looks at the {cpu, pmu, cgroup_id} subtree to find the
> leftmost group_idx event to schedule next.
>
> However, since cgroup events are per-cpu events, per-task events will
> always have cgroup=NULL. Resulting in the subtrees:
>
> {-1, pmu, NULL} and {cpu, pmu, NULL}
>
> Which is what the code does, it iterates ctx->pmu_ctx_list to find all
> @pmu values and then for each does the schedule dance.
>
> Now, I suppose making that:
>
> {-1, NULL, NULL}, {cpu, NULL, NULL}
>
> could work, but wouldn't iterating the tree be more expensive than
> just finding the sub-trees as we do now?
pmu=NULL can be used while scheduling the entire context. We can just traverse
all pmu events of both cpu subtrees.
>
> You also talk about extending the heap, which I read like
> doing the heap-merge over:
>
> {-1, pmu0, NULL}, {-1, pmu1, NULL}, ...
> {cpu, pmu0, NULL}, ...
>
> But that doesn't make sense, the schedule dance is per-pmu.
>
> Or am I just still not getting it?
Ok. Let's not complicate the design. We can go with the current approach of
iterating over all pmus in the first phase and think about optimizing it
later.
Thanks,
Ravi
On Wed, Aug 24, 2022 at 10:37:36AM +0530, Ravi Bangoria wrote:
> > Now, I suppose making that:
> >
> > {-1, NULL, NULL}, {cpu, NULL, NULL}
> >
> > could work, but wouldn't iterating the tree be more expensive than
> > just finding the sub-trees as we do now?
>
> pmu=NULL can be used while scheduling the entire context. We can just traverse
> all pmu events of both cpu subtrees.
But imagine the case where we have 50 events for a PMU that can only
schedule 8. Then we have to iterate 42 events for naught instead of
directly jumping to the next PMU.
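A back-of-the-envelope sketch of that cost, under two loudly stated
assumptions: every event needs exactly one counter, and a per-PMU walk may
stop at the first event that fails to fit. struct pmu_load and both helpers
are hypothetical, not kernel code:

```c
#include <stddef.h>

/* Hypothetical model: nr_events queued on one PMU that has nr_counters slots. */
struct pmu_load {
	unsigned int nr_events;
	unsigned int nr_counters;
};

/* Flat walk over the whole tree: every event that cannot fit is still visited. */
static unsigned int wasted_visits_flat(const struct pmu_load *p, size_t n)
{
	unsigned int wasted = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i].nr_events > p[i].nr_counters)
			wasted += p[i].nr_events - p[i].nr_counters;
	return wasted;
}

/* Per-PMU walk: one failed attempt, then jump straight to the next PMU. */
static unsigned int wasted_visits_per_pmu(const struct pmu_load *p, size_t n)
{
	unsigned int wasted = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i].nr_events > p[i].nr_counters)
			wasted += 1;
	return wasted;
}
```

With one PMU holding 50 events for 8 counters, the flat walk makes 42 futile
visits in this model, while the per-PMU walk makes one.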
On 24-Aug-22 12:57 PM, Peter Zijlstra wrote:
> On Wed, Aug 24, 2022 at 10:37:36AM +0530, Ravi Bangoria wrote:
>
>>> Now, I suppose making that:
>>>
>>> {-1, NULL, NULL}, {cpu, NULL, NULL}
>>>
>>> could work, but wouldn't iterating the tree be more expensive than
>>> just finding the sub-trees as we do now?
>>
>> pmu=NULL can be used while scheduling the entire context. We can just traverse
>> all pmu events of both cpu subtrees.
>
> But imagine the case where we have 50 events for a PMU that can only
> schedule 8. Then we have to iterate 42 events for naught instead of
> directly jumping to the next PMU.
Yes, that needs to be handled. And, IIRC, you proposed maintaining a list
of the leftmost event from each pmu subtree.
Thanks,
Ravi
On Fri, Jun 17, 2022 at 03:36:51PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> > +/* XXX: No need of list now. Convert it to per-cpu variable */
> > static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>
> Something like so I suppose...
>
I need this on top to avoid a splat on perf_cgroup_attach()
---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0cd81a3ef374..c6b64a48dea6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13536,9 +13536,12 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
static int __perf_cgroup_move(void *info)
{
struct task_struct *task = info;
- rcu_read_lock();
- perf_cgroup_switch(task);
- rcu_read_unlock();
+
+ preempt_disable();
+ if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
+ perf_cgroup_switch(task);
+ preempt_enable();
+
return 0;
}
On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
> {
> - struct perf_cpu_context *cpuctx;
> + /* XXX: Don't need this quirk anymore */
> + /*struct perf_cpu_context *cpuctx;
>
> if (!pmu->pmu_cpu_context)
> return;
>
> cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
> - cpuctx->ctx.pmu = pmu;
> + cpuctx->ctx.pmu = pmu;*/
> }
Confirmed; my ADL seems to work fine without all that.
---
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index fd043cd0e3c9..7a2d12ad6d1f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2059,24 +2059,6 @@ void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
pr_info("... event mask: %016Lx\n", intel_ctrl);
}
-/*
- * The generic code is not hybrid friendly. The hybrid_pmu->pmu
- * of the first registered PMU is unconditionally assigned to
- * each possible cpuctx->ctx.pmu.
- * Update the correct hybrid PMU to the cpuctx->ctx.pmu.
- */
-void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
-{
- /* XXX: Don't need this quirk anymore */
- /*struct perf_cpu_context *cpuctx;
-
- if (!pmu->pmu_cpu_context)
- return;
-
- cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
- cpuctx->ctx.pmu = pmu;*/
-}
-
static int __init init_hw_perf_events(void)
{
struct x86_pmu_quirk *quirk;
@@ -2197,9 +2179,6 @@ static int __init init_hw_perf_events(void)
(hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1);
if (err)
break;
-
- if (cpu_type == hybrid_pmu->cpu_type)
- x86_pmu_update_cpu_context(&hybrid_pmu->pmu, raw_smp_processor_id());
}
if (i < x86_pmu.num_hybrid_pmus) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8a72e6fe27a5..768771e5e4e9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4508,8 +4508,6 @@ static bool init_hybrid_pmu(int cpu)
cpumask_set_cpu(cpu, &pmu->supported_cpus);
cpuc->pmu = &pmu->pmu;
- x86_pmu_update_cpu_context(&pmu->pmu, cpu);
-
return true;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 94fb65d7b291..9c835ecb232e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1175,8 +1175,6 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
u64 intel_ctrl);
-void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu);
-
extern struct event_constraint emptyconstraint;
extern struct event_constraint unconstrained;
On Wed, Aug 24, 2022 at 02:15:13PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> > void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
> > {
> > - struct perf_cpu_context *cpuctx;
> > + /* XXX: Don't need this quirk anymore */
> > + /*struct perf_cpu_context *cpuctx;
> >
> > if (!pmu->pmu_cpu_context)
> > return;
> >
> > cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
> > - cpuctx->ctx.pmu = pmu;
> > + cpuctx->ctx.pmu = pmu;*/
> > }
>
> Confirmed; my ADL seems to work fine without all that.
Additionally; this doesn't insta crash.
---
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index cb69ff1e6138..016072a89f8f 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -1019,10 +1019,10 @@ static int armv8pmu_set_event_filter(struct hw_perf_event *event,
return 0;
}
-static int armv8pmu_filter_match(struct perf_event *event)
+static bool armv8pmu_filter(struct pmu *pmu, int cpu)
{
- unsigned long evtype = event->hw.config_base & ARMV8_PMU_EVTYPE_EVENT;
- return evtype != ARMV8_PMUV3_PERFCTR_CHAIN;
+ struct arm_pmu *armpmu = to_arm_pmu(pmu);
+ return !cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus);
}
static void armv8pmu_reset(void *info)
@@ -1253,7 +1253,7 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->stop = armv8pmu_stop;
cpu_pmu->reset = armv8pmu_reset;
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
- cpu_pmu->filter_match = armv8pmu_filter_match;
+ cpu_pmu->filter = armv8pmu_filter;
cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7a2d12ad6d1f..a8f1e38c66a7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -86,6 +86,8 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_swap_task_ctx, *x86_pmu.swap_task_ctx);
DEFINE_STATIC_CALL_NULL(x86_pmu_drain_pebs, *x86_pmu.drain_pebs);
DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_aliases, *x86_pmu.pebs_aliases);
+DEFINE_STATIC_CALL_NULL(x86_pmu_filter, *x86_pmu.filter);
+
/*
* This one is magic, it will get called even when PMU init fails (because
* there is no PMU), in which case it should simply return NULL.
@@ -2038,6 +2040,7 @@ static void x86_pmu_static_call_update(void)
static_call_update(x86_pmu_pebs_aliases, x86_pmu.pebs_aliases);
static_call_update(x86_pmu_guest_get_msrs, x86_pmu.guest_get_msrs);
+ static_call_update(x86_pmu_filter, x86_pmu.filter);
}
static void _x86_pmu_read(struct perf_event *event)
@@ -2668,12 +2671,13 @@ static int x86_pmu_aux_output_match(struct perf_event *event)
return 0;
}
-static int x86_pmu_filter_match(struct perf_event *event)
+static bool x86_pmu_filter(struct pmu *pmu, int cpu)
{
- if (x86_pmu.filter_match)
- return x86_pmu.filter_match(event);
+ bool ret = false;
- return 1;
+ static_call_cond(x86_pmu_filter)(pmu, cpu, &ret);
+
+ return ret;
}
static struct pmu pmu = {
@@ -2704,7 +2708,7 @@ static struct pmu pmu = {
.aux_output_match = x86_pmu_aux_output_match,
- .filter_match = x86_pmu_filter_match,
+ .filter = x86_pmu_filter,
};
void arch_perf_update_userpage(struct perf_event *event,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 768771e5e4e9..40cebd9b90a1 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4675,12 +4675,11 @@ static int intel_pmu_aux_output_match(struct perf_event *event)
return is_intel_pt_event(event);
}
-static int intel_pmu_filter_match(struct perf_event *event)
+static void intel_pmu_filter(struct pmu *pmu, int cpu, bool *ret)
{
- struct x86_hybrid_pmu *pmu = hybrid_pmu(event->pmu);
- unsigned int cpu = smp_processor_id();
+ struct x86_hybrid_pmu *hpmu = hybrid_pmu(pmu);
- return cpumask_test_cpu(cpu, &pmu->supported_cpus);
+ *ret = !cpumask_test_cpu(cpu, &hpmu->supported_cpus);
}
PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
@@ -6348,7 +6347,7 @@ __init int intel_pmu_init(void)
x86_pmu.update_topdown_event = adl_update_topdown_event;
x86_pmu.set_topdown_event_period = adl_set_topdown_event_period;
- x86_pmu.filter_match = intel_pmu_filter_match;
+ x86_pmu.filter = intel_pmu_filter;
x86_pmu.get_event_constraints = adl_get_event_constraints;
x86_pmu.hw_config = adl_hw_config;
x86_pmu.limit_period = spr_limit_period;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9c835ecb232e..b3ff55fc5794 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -924,7 +924,7 @@ struct x86_pmu {
int (*aux_output_match) (struct perf_event *event);
- int (*filter_match)(struct perf_event *event);
+ void (*filter)(struct pmu *pmu, int cpu, bool *ret);
/*
* Hybrid support
*
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 0407a38b470a..0f9519874fde 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -99,7 +99,7 @@ struct arm_pmu {
void (*stop)(struct arm_pmu *);
void (*reset)(void *);
int (*map_event)(struct perf_event *event);
- int (*filter_match)(struct perf_event *event);
+ bool (*filter)(struct pmu *pmu, int cpu);
int num_events;
bool secure_access; /* 32-bit ARM only */
#define ARMV8_PMUV3_MAX_COMMON_EVENTS 0x40
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7847818e5397..4be3aaae89be 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -519,9 +519,10 @@ struct pmu {
/* optional */
/*
- * Filter events for PMU-specific reasons.
+ * Skip programming this PMU on the given CPU. Typically needed for
+ * big.LITTLE things.
*/
- int (*filter_match) (struct perf_event *event); /* optional */
+ bool (*filter) (struct pmu *pmu, int cpu); /* optional */
/*
* Check period value for PERF_EVENT_IOC_PERIOD ioctl.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6b64a48dea6..180842ba8473 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2181,38 +2181,11 @@ static bool is_orphaned_event(struct perf_event *event)
return event->state == PERF_EVENT_STATE_DEAD;
}
-static inline int __pmu_filter_match(struct perf_event *event)
-{
- struct pmu *pmu = event->pmu;
- return pmu->filter_match ? pmu->filter_match(event) : 1;
-}
-
-/*
- * Check whether we should attempt to schedule an event group based on
- * PMU-specific filtering. An event group can consist of HW and SW events,
- * potentially with a SW leader, so we must check all the filters, to
- * determine whether a group is schedulable:
- */
-static inline int pmu_filter_match(struct perf_event *event)
-{
- struct perf_event *sibling;
-
- if (!__pmu_filter_match(event))
- return 0;
-
- for_each_sibling_event(sibling, event) {
- if (!__pmu_filter_match(sibling))
- return 0;
- }
-
- return 1;
-}
-
static inline int
event_filter_match(struct perf_event *event)
{
return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
- perf_cgroup_match(event) && pmu_filter_match(event);
+ perf_cgroup_match(event);
}
static void
@@ -3661,6 +3634,9 @@ static noinline int visit_groups_merge(struct perf_event_context *ctx,
struct perf_event **evt;
int ret;
+ if (pmu->filter && pmu->filter(pmu, cpu))
+ return 0;
+
if (!ctx->task) {
cpuctx = this_cpu_ptr(&cpu_context);
event_heap = (struct min_heap){
> -static inline int __pmu_filter_match(struct perf_event *event)
> -{
> - struct pmu *pmu = event->pmu;
> - return pmu->filter_match ? pmu->filter_match(event) : 1;
> -}
> -
> -/*
> - * Check whether we should attempt to schedule an event group based on
> - * PMU-specific filtering. An event group can consist of HW and SW events,
> - * potentially with a SW leader, so we must check all the filters, to
> - * determine whether a group is schedulable:
> - */
> -static inline int pmu_filter_match(struct perf_event *event)
> -{
> - struct perf_event *sibling;
> -
> - if (!__pmu_filter_match(event))
> - return 0;
> -
> - for_each_sibling_event(sibling, event) {
> - if (!__pmu_filter_match(sibling))
> - return 0;
> - }
> -
> - return 1;
> -}
> -
> static inline int
> event_filter_match(struct perf_event *event)
> {
> return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
> - perf_cgroup_match(event) && pmu_filter_match(event);
> + perf_cgroup_match(event);
There are many callers of event_filter_match() which might not end up calling
visit_groups_merge(). I hope this is an intentional change?
> }
>
> static void
> @@ -3661,6 +3634,9 @@ static noinline int visit_groups_merge(struct perf_event_context *ctx,
> struct perf_event **evt;
> int ret;
>
> + if (pmu->filter && pmu->filter(pmu, cpu))
> + return 0;
> +
> if (!ctx->task) {
> cpuctx = this_cpu_ptr(&cpu_context);
> event_heap = (struct min_heap){
Thanks,
Ravi
On Thu, Aug 25, 2022 at 11:09:05AM +0530, Ravi Bangoria wrote:
> > -static inline int __pmu_filter_match(struct perf_event *event)
> > -{
> > - struct pmu *pmu = event->pmu;
> > - return pmu->filter_match ? pmu->filter_match(event) : 1;
> > -}
> > -
> > -/*
> > - * Check whether we should attempt to schedule an event group based on
> > - * PMU-specific filtering. An event group can consist of HW and SW events,
> > - * potentially with a SW leader, so we must check all the filters, to
> > - * determine whether a group is schedulable:
> > - */
> > -static inline int pmu_filter_match(struct perf_event *event)
> > -{
> > - struct perf_event *sibling;
> > -
> > - if (!__pmu_filter_match(event))
> > - return 0;
> > -
> > - for_each_sibling_event(sibling, event) {
> > - if (!__pmu_filter_match(sibling))
> > - return 0;
> > - }
> > -
> > - return 1;
> > -}
> > -
> > static inline int
> > event_filter_match(struct perf_event *event)
> > {
> > return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
> > - perf_cgroup_match(event) && pmu_filter_match(event);
> > + perf_cgroup_match(event);
>
> There are many callers of event_filter_match() which might not end up calling
> visit_groups_merge(). I hope this is an intentional change?
I thought I did, but let's go through them again.
event_filter_match() is called from:
- __perf_event_enable(); here we'll end up in ctx_sched_in() which
will dutifully skip the pmu in question.
(fwiw, this is one of those sites where ctx_sched_{out,in}() could do
with a @pmu argument.)
- merge_sched_in(); this is after the new callsite in
visit_groups_merge().
- perf_adjust_freq_unthrottle_context(); if the pmu was skipped in
visit_groups_merge() then ->state != ACTIVE and we'll bail out.
- perf_iterate_ctx() / perf_iterate_sb_cpu(); these are for generating
side-band events, and arguably not delivering them when running on
the 'wrong' CPU wasn't right to begin with.
So I think we're good. Hmm?
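A userspace sketch of the change in sense between the old per-event
filter_match() and the new per-PMU filter(): the old hook returned nonzero for
"may schedule", the new one returns true for "skip". struct toy_pmu and the
toy_*() helpers are hypothetical, and a 64-bit mask stands in for the
supported_cpus cpumask (so cpu must be below 64 here):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PMU with a supported-CPU mask (bit n = CPU n). */
struct toy_pmu {
	uint64_t supported_cpus;
};

/* Old sense: nonzero means "this event may be scheduled on this CPU". */
static int toy_filter_match(const struct toy_pmu *pmu, int cpu)
{
	return (int)((pmu->supported_cpus >> cpu) & 1);
}

/* New sense: true means "skip this PMU on this CPU". */
static bool toy_filter(const struct toy_pmu *pmu, int cpu)
{
	return !((pmu->supported_cpus >> cpu) & 1);
}

/* Models the early return: a filtered PMU contributes no scheduled events. */
static int toy_visit_groups(const struct toy_pmu *pmu, int cpu, int nr_events)
{
	if (toy_filter(pmu, cpu))
		return 0;
	return nr_events;
}
```

Because the check now happens once per PMU before any event is visited, the
per-event callers listed above never see an event from a filtered PMU in the
ACTIVE state.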
On 24-Aug-22 8:29 PM, Peter Zijlstra wrote:
> On Wed, Aug 24, 2022 at 02:15:13PM +0200, Peter Zijlstra wrote:
>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>>> void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
>>> {
>>> - struct perf_cpu_context *cpuctx;
>>> + /* XXX: Don't need this quirk anymore */
>>> + /*struct perf_cpu_context *cpuctx;
>>>
>>> if (!pmu->pmu_cpu_context)
>>> return;
>>>
>>> cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
>>> - cpuctx->ctx.pmu = pmu;
>>> + cpuctx->ctx.pmu = pmu;*/
>>> }
>>
>> Confirmed; my ADL seems to work fine without all that.
>
> Additionally; this doesn't insta crash.
While collating this I came across armv8pmu_start() which does:
struct perf_event_context *task_ctx =
this_cpu_ptr(cpu_pmu->pmu.pmu_cpu_context)->task_ctx;
if (sysctl_perf_user_access && task_ctx && task_ctx->nr_user)
Not sure why it does not lock task_ctx. Should it be changed to
something like below? Untested:
---
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 016072a89f8f..747415a5f2b2 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -806,10 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event)
static void armv8pmu_start(struct arm_pmu *cpu_pmu)
{
- struct perf_event_context *task_ctx =
- this_cpu_ptr(cpu_pmu->pmu.pmu_cpu_context)->task_ctx;
+ struct perf_event_context *ctx;
+ int nr_user = 0;
+
+ rcu_read_lock();
+ ctx = rcu_dereference(current->perf_event_ctxp);
+ if (ctx) {
+ raw_spin_lock(&ctx->lock);
+ nr_user = ctx->nr_user;
+ raw_spin_unlock(&ctx->lock);
+ }
+ rcu_read_unlock();
- if (sysctl_perf_user_access && task_ctx && task_ctx->nr_user)
+ if (sysctl_perf_user_access && nr_user)
armv8pmu_enable_user_access(cpu_pmu);
else
armv8pmu_disable_user_access();
---
Thanks,
Ravi
On 23-Aug-22 9:50 AM, Ravi Bangoria wrote:
>
>> With this, I can run 'perf test' and perf_event_tests without any error in
>> dmesg. I'll run the perf fuzzer overnight and see if it reports any issues.
>
> I hit a kernel crash with the fuzzer. I have yet to debug it. Here is the trace:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000198
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 48 PID: 0 Comm: swapper/48 Not tainted 6.0.0-rc1-perf-event-context-peter-queue+ #153
> Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.7.3 03/31/2022
> RIP: 0010:x86_pmu_enable_event+0x3c/0x120
I was able to reproduce this with a vanilla v6.0-rc2 kernel.
Thanks,
Ravi
> With this, I can run 'perf test' and perf_event_tests without any error in
> dmesg. I'll run the perf fuzzer overnight and see if it reports any issues.
I also ran the fuzzer on an Intel machine over the weekend. I see only one WARN_ON()
hit. Otherwise the system is running normally. FWIW, I was running the fuzzer as a normal
user with perf_event_paranoid=0.
WARNING: CPU: 3 PID: 2840537 at arch/x86/events/core.c:1606 x86_pmu_stop+0xd0/0x100
Modules linked in: ipmi_ssif intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel wmi_bmof kvm rapl intel_cstate input_leds ee1004 joydev mei_me mei intel_pch_thermal ie31200_edac acpi_ipmi wmi ipmi_si mac_hid acpi_pad acpi_power_meter acpi_tad tcp_westwood sch_fq_codel dm_multipath scsi_dh_rdac bonding scsi_dh_emc tls scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear hid_generic uas usbhid cdc_ether hid usb_storage usbnet mii i915 ast drm_vram_helper drm_ttm_helper i2c_algo_bit drm_buddy drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd i2c_i801 drm i40e cryptd
i2c_smbus ahci xhci_pci libahci xhci_pci_renesas video pinctrl_cannonlake
CPU: 3 PID: 2840537 Comm: perf_fuzzer Not tainted 6.0.0-rc2+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4I-NL, BIOS L2.09C 09/23/2020
RIP: 0010:x86_pmu_stop+0xd0/0x100
Code: c8 01 41 89 84 24 d8 01 00 00 eb 9f 4c 89 e7 e8 76 fe ff ff 5b 41 83 8c 24 d8 01 00 00 02 41 5c 41 5d 41 5e 5d c3 cc cc cc cc <0f> 0b eb d1 4c 89 f6 48 c7 c7 00 86 03 b1 e8 cd 18 76 00 e9 48 ff
RSP: 0000:ffffbda8c818fbd0 EFLAGS: 00010002
RAX: 0000000000000003 RBX: ffff97b71de19c60 RCX: 0000000000000188
RDX: 0000000000000000 RSI: 00000000001382d0 RDI: 0000000000000188
RBP: ffffbda8c818fbf0 R08: ffffffffb1039100 R09: 0000000000000005
R10: ffff97b71de1a388 R11: 0000000000000004 R12: ffff97b069c19d40
R13: 0000000000000004 R14: 0000000000000002 R15: ffff97b71de00000
FS: 00007fbf787c6740(0000) GS:ffff97b71de00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563033ec5010 CR3: 00000001ab91c002 CR4: 00000000003707e0
DR0: 0000000000000000 DR1: 000000000000ffff DR2: 0000000081008000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
<TASK>
x86_pmu_del+0x8e/0x2d0
? debug_smp_processor_id+0x17/0x20
event_sched_out+0x10b/0x2b0
? x86_pmu_del+0x5c/0x2d0
merge_sched_in+0x39f/0x410
visit_groups_merge.constprop.0.isra.0+0x207/0x670
ctx_flexible_sched_in+0xb8/0xd0
ctx_sched_in+0x10a/0x290
ctx_resched+0x97/0x100
__perf_event_enable+0x21b/0x310
event_function+0xb3/0x120
? perf_duration_warn+0x30/0x30
remote_function+0x52/0x70
__flush_smp_call_function_queue+0xc4/0x510
generic_smp_call_function_single_interrupt+0x1a/0xb0
__sysvec_call_function_single+0x48/0x1f0
sysvec_call_function_single+0x56/0xd0
asm_sysvec_call_function_single+0x1b/0x20
RIP: 0033:0x563033ec501b
Code: 0f 1e fa 48 89 d1 31 c0 48 89 f2 89 fe bf 41 01 00 00 e9 48 f7 fe ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 31 c9 b9 1f a1 07 00 <ff> c9 75 fc 31 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3
RSP: 002b:00007ffd7b16aaf8 EFLAGS: 00000202
RAX: 0000000000002303 RBX: 0000000000000000 RCX: 000000000006c1b6
RDX: 00007fbf787c6a00 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffd7b16ab10 R08: 0000000000000000 R09: 00007fbf787c6740
R10: 00007fbf7880d0c8 R11: 0000000000000246 R12: 00007ffd7b16cf28
R13: 0000563033eb527a R14: 0000563033ed1b68 R15: 00007fbf7880c040
</TASK>
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffaf0bfef8>] copy_process+0xa38/0x1f80
softirqs last enabled at (0): [<ffffffffaf0bfef8>] copy_process+0xa38/0x1f80
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace 0000000000000000 ]---
On Mon, Aug 29, 2022 at 09:30:50AM +0530, Ravi Bangoria wrote:
> > With this, I can run 'perf test' and perf_event_tests without any error in
> > dmesg. I'll run the perf fuzzer overnight and see if it reports any issues.
>
> I also ran the fuzzer on an Intel machine over the weekend. I see only one WARN_ON()
> hit. Otherwise the system is running normally. FWIW, I was running the fuzzer as a normal
> user with perf_event_paranoid=0.
>
> WARNING: CPU: 3 PID: 2840537 at arch/x86/events/core.c:1606 x86_pmu_stop+0xd0/0x100
That's the WARN about PERF_HES_STOPPED already being set.
> Call Trace:
> <TASK>
> x86_pmu_del+0x8e/0x2d0
> ? debug_smp_processor_id+0x17/0x20
> event_sched_out+0x10b/0x2b0
> ? x86_pmu_del+0x5c/0x2d0
> merge_sched_in+0x39f/0x410
And this callchain suggests this is the group_error path.
I can't immediately spot a fail there, but I'll try and stare at it
some.