This series is preparation for supporting perf user counter access on arm64.
Originally, the arm64 implementation was just a copy of the x86 version, but
Will did not like the mm_context state tracking nor the IPIs (mm_cpumask
doesn't work for arm64). The hook into switch_mm and the IPIs feel like
working around limitations in the perf core (aka the platform problem).
So this series aims to address that by having all of the RDPMC state
tracked and controlled via the perf core.
Alternatively, I could avoid all the x86 changes here with a new PMU
callback (.set_user_access()?) and plumb that into the perf core context
and mmap code. However, it's better in the long run if there's a common
implementation.
So far, I've only tested the perf core changes with the arm64 version of
the code, which is similar to the x86 version here. I'm hoping for some
quick feedback on the direction and whether I've missed any use cases.
Rob
Rob Herring (3):
x86: perf: Move RDPMC event flag to a common definition
perf/x86: Control RDPMC access from .enable() hook
perf/x86: Call mmap event callbacks on event's CPU
arch/x86/events/core.c | 113 +++++++++++++++--------------
arch/x86/events/perf_event.h | 2 +-
arch/x86/include/asm/mmu.h | 1 -
arch/x86/include/asm/mmu_context.h | 6 --
arch/x86/include/asm/perf_event.h | 1 -
arch/x86/mm/tlb.c | 29 +-------
include/linux/perf_event.h | 9 ++-
kernel/events/core.c | 56 +++++++++++---
8 files changed, 113 insertions(+), 104 deletions(-)
--
2.27.0
In preparation to enable user counter access on arm64 and to move some
of the user access handling to perf core, create a common event flag for
user counter access and convert x86 to use it.
Since the architecture-specific flags start at the LSB, the common flags
start at the MSB.
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Rob Herring <[email protected]>
---
arch/x86/events/core.c | 10 +++++-----
arch/x86/events/perf_event.h | 2 +-
include/linux/perf_event.h | 2 ++
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..03f87fd4c017 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2471,7 +2471,7 @@ static int x86_pmu_event_init(struct perf_event *event)
if (READ_ONCE(x86_pmu.attr_rdpmc) &&
!(event->hw.flags & PERF_X86_EVENT_LARGE_PEBS))
- event->hw.flags |= PERF_X86_EVENT_RDPMC_ALLOWED;
+ event->hw.flags |= PERF_EVENT_FLAG_USER_READ_CNT;
return err;
}
@@ -2503,7 +2503,7 @@ void perf_clear_dirty_counters(void)
static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
{
- if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
+ if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
return;
/*
@@ -2524,7 +2524,7 @@ static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
{
- if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
+ if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
return;
if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
@@ -2535,7 +2535,7 @@ static int x86_pmu_event_idx(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- if (!(hwc->flags & PERF_X86_EVENT_RDPMC_ALLOWED))
+ if (!(hwc->flags & PERF_EVENT_FLAG_USER_READ_CNT))
return 0;
if (is_metric_idx(hwc->idx))
@@ -2718,7 +2718,7 @@ void arch_perf_update_userpage(struct perf_event *event,
userpg->cap_user_time = 0;
userpg->cap_user_time_zero = 0;
userpg->cap_user_rdpmc =
- !!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED);
+ !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
userpg->pmc_width = x86_pmu.cntval_bits;
if (!using_native_sched_clock() || !sched_clock_stable())
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2bf1c7ea2758..84d803c5cc87 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -73,7 +73,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_PEBS_NA_HSW 0x0010 /* haswell style datala, unknown */
#define PERF_X86_EVENT_EXCL 0x0020 /* HT exclusivity on counter */
#define PERF_X86_EVENT_DYNAMIC 0x0040 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED 0x0080 /* grant rdpmc permission */
+
#define PERF_X86_EVENT_EXCL_ACCT 0x0100 /* accounted EXCL event */
#define PERF_X86_EVENT_AUTO_RELOAD 0x0200 /* use PEBS auto-reload */
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d510ad750ed..f5815448ca9b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -142,6 +142,8 @@ struct hw_perf_event {
int event_base_rdpmc;
int idx;
int last_cpu;
+
+#define PERF_EVENT_FLAG_USER_READ_CNT 0x80000000
int flags;
struct hw_perf_event_extra extra_reg;
--
2.27.0
Mark suggested that mmapping of events should be treated like other
event changes in that the PMU callbacks run on the CPU the event is on.
Given that the only implementation of .event_(un)mapped() (on x86) ends up
running on each CPU, this makes sense.
Since the .event_(un)mapped() callbacks are now called on multiple CPUs,
the tracking of enabling RDPMC is moved to the context in the perf core.
This allows removing perf_rdpmc_allowed, and for arm64 to share the same
user access tracking.
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Rob Herring <[email protected]>
---
Note that the intent here is to only call event_mapped() on the first mmap
and event_unmapped() on the last unmap, using event->mmap_count. I'm not
sure if there's some flaw with that idea.
arch/x86/events/core.c | 38 +++++++-------------------
arch/x86/include/asm/mmu.h | 1 -
include/linux/perf_event.h | 7 +++--
kernel/events/core.c | 56 ++++++++++++++++++++++++++++++++------
4 files changed, 61 insertions(+), 41 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5c1703206ef5..c755190c3970 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -753,9 +753,13 @@ static void perf_clear_dirty_counters(struct cpu_hw_events *cpuc)
static void x86_pmu_set_user_access(struct cpu_hw_events *cpuc)
{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(cpuc->pmu->pmu_cpu_context);
+ struct perf_event_context *task_ctx = cpuctx->task_ctx;
+
if (static_branch_unlikely(&rdpmc_always_available_key) ||
(!static_branch_unlikely(&rdpmc_never_available_key) &&
- atomic_read(&(this_cpu_read(cpu_tlbstate.loaded_mm)->context.perf_rdpmc_allowed)))) {
+ (atomic_read(&cpuctx->ctx.nr_user) ||
+ (task_ctx && atomic_read(&task_ctx->nr_user))))) {
/*
* Clear the existing dirty counters to
* prevent the leak for an RDPMC task.
@@ -2522,34 +2526,12 @@ static void x86_pmu_set_user_access_ipi(void *unused)
x86_pmu_set_user_access(this_cpu_ptr(&cpu_hw_events));
}
-static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
-{
- if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
- return;
-
- /*
- * This function relies on not being called concurrently in two
- * tasks in the same mm. Otherwise one task could observe
- * perf_rdpmc_allowed > 1 and return all the way back to
- * userspace with CR4.PCE clear while another task is still
- * doing on_each_cpu_mask() to propagate CR4.PCE.
- *
- * For now, this can't happen because all callers hold mmap_lock
- * for write. If this changes, we'll need a different solution.
- */
- mmap_assert_write_locked(mm);
-
- if (atomic_inc_return(&mm->context.perf_rdpmc_allowed) == 1)
- on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
-}
-
-static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+static void x86_pmu_event_map_changed(struct perf_event *event)
{
- if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ if (atomic_read(&event->ctx->nr_user) != 1)
return;
- if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
- on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
+ x86_pmu_set_user_access_ipi(NULL);
}
static int x86_pmu_event_idx(struct perf_event *event)
@@ -2707,8 +2689,8 @@ static struct pmu pmu = {
.event_init = x86_pmu_event_init,
- .event_mapped = x86_pmu_event_mapped,
- .event_unmapped = x86_pmu_event_unmapped,
+ .event_mapped = x86_pmu_event_map_changed,
+ .event_unmapped = x86_pmu_event_map_changed,
.add = x86_pmu_add,
.del = x86_pmu_del,
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..bd27fc666024 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -46,7 +46,6 @@ typedef struct {
void __user *vdso; /* vdso base address */
const struct vdso_image *vdso_image; /* vdso image in use */
- atomic_t perf_rdpmc_allowed; /* nonzero if rdpmc is allowed */
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
/*
* One bit per protection key says whether userspace can
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5815448ca9b..23944f9386b3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -329,10 +329,10 @@ struct pmu {
/*
* Notification that the event was mapped or unmapped. Called
- * in the context of the mapping task.
+ * on each CPU the event is on.
*/
- void (*event_mapped) (struct perf_event *event, struct mm_struct *mm); /* optional */
- void (*event_unmapped) (struct perf_event *event, struct mm_struct *mm); /* optional */
+ void (*event_mapped) (struct perf_event *event); /* optional */
+ void (*event_unmapped) (struct perf_event *event); /* optional */
/*
* Flags for ->add()/->del()/ ->start()/->stop(). There are
@@ -824,6 +824,7 @@ struct perf_event_context {
int nr_stat;
int nr_freq;
int rotate_disable;
+ atomic_t nr_user;
/*
* Set when nr_events != nr_active, except tolerant to events not
* necessary to be active due to scheduling constraints, such as cgroups.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 464917096e73..26c3fb962e4a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5988,6 +5988,48 @@ void ring_buffer_put(struct perf_buffer *rb)
call_rcu(&rb->rcu_head, rb_free_rcu);
}
+static void __perf_event_mapped(struct perf_event *event,
+ struct perf_cpu_context *cpuctx,
+ struct perf_event_context *ctx,
+ void *info)
+{
+ event->pmu->event_mapped(event);
+}
+
+static void perf_event_mapped(struct perf_event *event)
+{
+ struct perf_event_context *ctx;
+
+ if (!event->pmu->event_mapped || !(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ return;
+
+ ctx = perf_event_ctx_lock(event);
+ event_function_call(event, __perf_event_mapped, NULL);
+ atomic_dec(&ctx->nr_user);
+ perf_event_ctx_unlock(event, ctx);
+}
+
+static void __perf_event_unmapped(struct perf_event *event,
+ struct perf_cpu_context *cpuctx,
+ struct perf_event_context *ctx,
+ void *info)
+{
+ event->pmu->event_unmapped(event);
+}
+
+static void perf_event_unmapped(struct perf_event *event)
+{
+ struct perf_event_context *ctx;
+
+ if (!event->pmu->event_unmapped || !(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
+ return;
+
+ ctx = perf_event_ctx_lock(event);
+ atomic_inc(&ctx->nr_user);
+ event_function_call(event, __perf_event_unmapped, NULL);
+ perf_event_ctx_unlock(event, ctx);
+}
+
static void perf_mmap_open(struct vm_area_struct *vma)
{
struct perf_event *event = vma->vm_file->private_data;
@@ -5997,9 +6039,6 @@ static void perf_mmap_open(struct vm_area_struct *vma)
if (vma->vm_pgoff)
atomic_inc(&event->rb->aux_mmap_count);
-
- if (event->pmu->event_mapped)
- event->pmu->event_mapped(event, vma->vm_mm);
}
static void perf_pmu_output_stop(struct perf_event *event);
@@ -6021,9 +6060,6 @@ static void perf_mmap_close(struct vm_area_struct *vma)
unsigned long size = perf_data_size(rb);
bool detach_rest = false;
- if (event->pmu->event_unmapped)
- event->pmu->event_unmapped(event, vma->vm_mm);
-
/*
* rb->aux_mmap_count will always drop before rb->mmap_count and
* event->mmap_count, so it is ok to use event->mmap_mutex to
@@ -6056,6 +6092,8 @@ static void perf_mmap_close(struct vm_area_struct *vma)
if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
goto out_put;
+ perf_event_unmapped(event);
+
ring_buffer_attach(event, NULL);
mutex_unlock(&event->mmap_mutex);
@@ -6330,6 +6368,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
atomic_dec(&rb->mmap_count);
}
aux_unlock:
+ if (atomic_read(&event->mmap_count) == 1)
+ perf_event_mapped(event);
+
mutex_unlock(&event->mmap_mutex);
/*
@@ -6339,9 +6380,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_ops = &perf_mmap_vmops;
- if (event->pmu->event_mapped)
- event->pmu->event_mapped(event, vma->vm_mm);
-
return ret;
}
--
2.27.0
Rather than controlling RDPMC access behind the scenes from switch_mm(),
move RDPMC access controls to the PMU .enable() hook. The .enable() hook
is called whenever the perf CPU or task context changes which is when
the RDPMC access may need to change.
This is the first step in moving the RDPMC state tracking out of the mm
context to the perf context.
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: [email protected]
Cc: "H. Peter Anvin" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: [email protected]
Signed-off-by: Rob Herring <[email protected]>
---
Not sure, but I think the set_attr_rdpmc() IPI needs to hold the perf
ctx lock?
arch/x86/events/core.c | 75 +++++++++++++++++++-----------
arch/x86/include/asm/mmu_context.h | 6 ---
arch/x86/include/asm/perf_event.h | 1 -
arch/x86/mm/tlb.c | 29 +-----------
4 files changed, 49 insertions(+), 62 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 03f87fd4c017..5c1703206ef5 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -52,8 +52,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
.pmu = &pmu,
};
-DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
-DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
+static DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
+static DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
/*
@@ -727,11 +727,52 @@ static void x86_pmu_disable(struct pmu *pmu)
static_call(x86_pmu_disable_all)();
}
+static void perf_clear_dirty_counters(struct cpu_hw_events *cpuc)
+{
+ int i;
+
+ /* Don't need to clear the assigned counter. */
+ for (i = 0; i < cpuc->n_events; i++)
+ __clear_bit(cpuc->assign[i], cpuc->dirty);
+
+ if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
+ return;
+
+ for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
+ /* Metrics and fake events don't have corresponding HW counters. */
+ if (is_metric_idx(i) || (i == INTEL_PMC_IDX_FIXED_VLBR))
+ continue;
+ else if (i >= INTEL_PMC_IDX_FIXED)
+ wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0);
+ else
+ wrmsrl(x86_pmu_event_addr(i), 0);
+ }
+
+ bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
+}
+
+static void x86_pmu_set_user_access(struct cpu_hw_events *cpuc)
+{
+ if (static_branch_unlikely(&rdpmc_always_available_key) ||
+ (!static_branch_unlikely(&rdpmc_never_available_key) &&
+ atomic_read(&(this_cpu_read(cpu_tlbstate.loaded_mm)->context.perf_rdpmc_allowed)))) {
+ /*
+ * Clear the existing dirty counters to
+ * prevent the leak for an RDPMC task.
+ */
+ perf_clear_dirty_counters(cpuc);
+ cr4_set_bits_irqsoff(X86_CR4_PCE);
+ } else
+ cr4_clear_bits_irqsoff(X86_CR4_PCE);
+}
+
void x86_pmu_enable_all(int added)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
int idx;
+ x86_pmu_set_user_access(cpuc);
+
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
struct hw_perf_event *hwc = &cpuc->events[idx]->hw;
@@ -2476,29 +2517,9 @@ static int x86_pmu_event_init(struct perf_event *event)
return err;
}
-void perf_clear_dirty_counters(void)
+static void x86_pmu_set_user_access_ipi(void *unused)
{
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- int i;
-
- /* Don't need to clear the assigned counter. */
- for (i = 0; i < cpuc->n_events; i++)
- __clear_bit(cpuc->assign[i], cpuc->dirty);
-
- if (bitmap_empty(cpuc->dirty, X86_PMC_IDX_MAX))
- return;
-
- for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
- /* Metrics and fake events don't have corresponding HW counters. */
- if (is_metric_idx(i) || (i == INTEL_PMC_IDX_FIXED_VLBR))
- continue;
- else if (i >= INTEL_PMC_IDX_FIXED)
- wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + (i - INTEL_PMC_IDX_FIXED), 0);
- else
- wrmsrl(x86_pmu_event_addr(i), 0);
- }
-
- bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
+ x86_pmu_set_user_access(this_cpu_ptr(&cpu_hw_events));
}
static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
@@ -2519,7 +2540,7 @@ static void x86_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
mmap_assert_write_locked(mm);
if (atomic_inc_return(&mm->context.perf_rdpmc_allowed) == 1)
- on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
+ on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
}
static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
@@ -2528,7 +2549,7 @@ static void x86_pmu_event_unmapped(struct perf_event *event, struct mm_struct *m
return;
if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
- on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
+ on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
}
static int x86_pmu_event_idx(struct perf_event *event)
@@ -2584,7 +2605,7 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
else if (x86_pmu.attr_rdpmc == 2)
static_branch_dec(&rdpmc_always_available_key);
- on_each_cpu(cr4_update_pce, NULL, 1);
+ on_each_cpu(x86_pmu_set_user_access_ipi, NULL, 1);
x86_pmu.attr_rdpmc = val;
}
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..1cbb32ac245e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -22,12 +22,6 @@ static inline void paravirt_activate_mm(struct mm_struct *prev,
}
#endif /* !CONFIG_PARAVIRT_XXL */
-#ifdef CONFIG_PERF_EVENTS
-DECLARE_STATIC_KEY_FALSE(rdpmc_never_available_key);
-DECLARE_STATIC_KEY_FALSE(rdpmc_always_available_key);
-void cr4_update_pce(void *ignored);
-#endif
-
#ifdef CONFIG_MODIFY_LDT_SYSCALL
/*
* ldt_structs can be allocated, used, and freed, but they are never
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8fc1b5003713..544f41a179fb 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -478,7 +478,6 @@ struct x86_pmu_lbr {
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
-extern void perf_clear_dirty_counters(void);
extern int x86_perf_rdpmc_index(struct perf_event *event);
#else
static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index cfe6b1e85fa6..060a3de78380 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -400,31 +400,6 @@ static void cond_ibpb(struct task_struct *next)
}
}
-#ifdef CONFIG_PERF_EVENTS
-static inline void cr4_update_pce_mm(struct mm_struct *mm)
-{
- if (static_branch_unlikely(&rdpmc_always_available_key) ||
- (!static_branch_unlikely(&rdpmc_never_available_key) &&
- atomic_read(&mm->context.perf_rdpmc_allowed))) {
- /*
- * Clear the existing dirty counters to
- * prevent the leak for an RDPMC task.
- */
- perf_clear_dirty_counters();
- cr4_set_bits_irqsoff(X86_CR4_PCE);
- } else
- cr4_clear_bits_irqsoff(X86_CR4_PCE);
-}
-
-void cr4_update_pce(void *ignored)
-{
- cr4_update_pce_mm(this_cpu_read(cpu_tlbstate.loaded_mm));
-}
-
-#else
-static inline void cr4_update_pce_mm(struct mm_struct *mm) { }
-#endif
-
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
@@ -581,10 +556,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
- if (next != real_prev) {
- cr4_update_pce_mm(next);
+ if (next != real_prev)
switch_ldt(real_prev, next);
- }
}
/*
--
2.27.0
On 7/28/21 4:02 PM, Rob Herring wrote:
> Rather than controlling RDPMC access behind the scenes from switch_mm(),
> move RDPMC access controls to the PMU .enable() hook. The .enable() hook
> is called whenever the perf CPU or task context changes which is when
> the RDPMC access may need to change.
>
> This is the first step in moving the RDPMC state tracking out of the mm
> context to the perf context.
Is this series supposed to be a user-visible change or not? I'm confused.
If you intend to have an entire mm have access to RDPMC if an event is
mapped, then surely access needs to be context switched for the whole
mm. If you intend to only have the thread to which the event is bound
have access, then the only reason I see to use IPIs is to revoke access
on munmap from the wrong thread. But even that latter case could be
handled with a more targeted approach, not a broadcast to all of mm_cpumask.
Can you clarify what the overall intent is and what this particular
patch is trying to do?
>
> if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
> - on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
> + on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
Here you do something for the whole mm.
> - on_each_cpu(cr4_update_pce, NULL, 1);
> + on_each_cpu(x86_pmu_set_user_access_ipi, NULL, 1);
Here too.
> void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> struct task_struct *tsk)
> {
> @@ -581,10 +556,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> this_cpu_write(cpu_tlbstate.loaded_mm, next);
> this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
>
> - if (next != real_prev) {
> - cr4_update_pce_mm(next);
> + if (next != real_prev)
> switch_ldt(real_prev, next);
> - }
> }
But if you remove this, then what handles context switching?
>
> /*
> --
> 2.27.0
>
On Thu, Aug 12, 2021 at 11:50 AM Andy Lutomirski <[email protected]> wrote:
>
> On 7/28/21 4:02 PM, Rob Herring wrote:
> > Rather than controlling RDPMC access behind the scenes from switch_mm(),
> > move RDPMC access controls to the PMU .enable() hook. The .enable() hook
> > is called whenever the perf CPU or task context changes which is when
> > the RDPMC access may need to change.
> >
> > This is the first step in moving the RDPMC state tracking out of the mm
> > context to the perf context.
>
> Is this series supposed to be a user-visible change or not? I'm confused.
It should not be user-visible. Or at least not user-visible for what
any user would notice. If an event is not part of the perf context on
another thread sharing the mm, does that thread need rdpmc access? No
access would be a user-visible change, but I struggle with how that's
a useful scenario?
> If you intend to have an entire mm have access to RDPMC if an event is
> mapped, then surely access needs to be context switched for the whole
> mm. If you intend to only have the thread to which the event is bound
> have access, then the only reason I see to use IPIs is to revoke access
> on munmap from the wrong thread. But even that latter case could be
> handled with a more targeted approach, not a broadcast to all of mm_cpumask.
Right, that's what patch 3 does. When we mmap/munmap an event, then
the perf core invokes the callback only on the active contexts in
which the event resides.
> Can you clarify what the overall intent is and what this particular
> patch is trying to do?
>
> >
> > if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
> > - on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
> > + on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
>
> Here you do something for the whole mm.
It's just a rename of the callback though.
>
> > - on_each_cpu(cr4_update_pce, NULL, 1);
> > + on_each_cpu(x86_pmu_set_user_access_ipi, NULL, 1);
>
> Here too.
It's not just the whole mm here as changing sysfs rdpmc permission is
a global state.
> > void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > struct task_struct *tsk)
> > {
> > @@ -581,10 +556,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > this_cpu_write(cpu_tlbstate.loaded_mm, next);
> > this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
> >
> > - if (next != real_prev) {
> > - cr4_update_pce_mm(next);
> > + if (next != real_prev)
> > switch_ldt(real_prev, next);
> > - }
> > }
>
> But if you remove this, then what handles context switching?
perf. On a context switch, perf is going to context switch the events
and set the access based on whether an event in the context is mmapped.
Though I guess if rdpmc needs to be enabled without any events opened,
this is not going to work. So maybe I need to keep the
rdpmc_always_available_key and rdpmc_never_available_key cases here.
The always available case is something we specifically don't want to
support for arm64. I'm trying to start with access more locked down,
rather than trying to lock it down after the fact as x86 is doing.
Rob
On Thu, Aug 12, 2021, at 11:16 AM, Rob Herring wrote:
> On Thu, Aug 12, 2021 at 11:50 AM Andy Lutomirski <[email protected]> wrote:
> >
> > On 7/28/21 4:02 PM, Rob Herring wrote:
> > > Rather than controlling RDPMC access behind the scenes from switch_mm(),
> > > move RDPMC access controls to the PMU .enable() hook. The .enable() hook
> > > is called whenever the perf CPU or task context changes which is when
> > > the RDPMC access may need to change.
> > >
> > > This is the first step in moving the RDPMC state tracking out of the mm
> > > context to the perf context.
> >
> > Is this series supposed to be a user-visible change or not? I'm confused.
>
> It should not be user-visible. Or at least not user-visible for what
> any user would notice. If an event is not part of the perf context on
> another thread sharing the mm, does that thread need rdpmc access? No
> access would be a user-visible change, but I struggle with how that's
> a useful scenario?
>
This is what I mean by user-visible -- it changes semantics in a way that a user program could detect. I'm not saying it's a problem, but I do think you need to document the new semantics.
> > If you intend to have an entire mm have access to RDPMC if an event is
> > mapped, then surely access needs to be context switched for the whole
> > mm. If you intend to only have the thread to which the event is bound
> > have access, then the only reason I see to use IPIs is to revoke access
> > on munmap from the wrong thread. But even that latter case could be
> > handled with a more targeted approach, not a broadcast to all of mm_cpumask.
>
> Right, that's what patch 3 does. When we mmap/munmap an event, then
> the perf core invokes the callback only on the active contexts in
> which the event resides.
>
> > Can you clarify what the overall intent is and what this particular
> > patch is trying to do?
> >
> > >
> > > if (atomic_dec_and_test(&mm->context.perf_rdpmc_allowed))
> > > - on_each_cpu_mask(mm_cpumask(mm), cr4_update_pce, NULL, 1);
> > > + on_each_cpu_mask(mm_cpumask(mm), x86_pmu_set_user_access_ipi, NULL, 1);
> >
> > Here you do something for the whole mm.
>
> It's just a rename of the callback though.
>
> >
> > > - on_each_cpu(cr4_update_pce, NULL, 1);
> > > + on_each_cpu(x86_pmu_set_user_access_ipi, NULL, 1);
> >
> > Here too.
>
> It's not just the whole mm here as changing sysfs rdpmc permission is
> a global state.
Whoops, missed that.
>
> > > void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > > struct task_struct *tsk)
> > > {
> > > @@ -581,10 +556,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > > this_cpu_write(cpu_tlbstate.loaded_mm, next);
> > > this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
> > >
> > > - if (next != real_prev) {
> > > - cr4_update_pce_mm(next);
> > > + if (next != real_prev)
> > > switch_ldt(real_prev, next);
> > > - }
> > > }
> >
> > But if you remove this, then what handles context switching?
>
> perf. On a context switch, perf is going to context switch the events
> and set the access based on whether an event in the context is mmapped.
> Though I guess if rdpmc needs to be enabled without any events opened,
> this is not going to work. So maybe I need to keep the
> rdpmc_always_available_key and rdpmc_never_available_key cases here.
You seem to have a weird combination of per-thread and per-mm stuff going on in this patch, though. Maybe it's all reasonable after patch 3 is applied, but this patch is very hard to review in its current state.
On Thu, Aug 26, 2021 at 1:13 PM Andy Lutomirski <[email protected]> wrote:
>
>
>
> On Thu, Aug 12, 2021, at 11:16 AM, Rob Herring wrote:
> > On Thu, Aug 12, 2021 at 11:50 AM Andy Lutomirski <[email protected]> wrote:
> > >
> > > On 7/28/21 4:02 PM, Rob Herring wrote:
> > > > Rather than controlling RDPMC access behind the scenes from switch_mm(),
> > > > move RDPMC access controls to the PMU .enable() hook. The .enable() hook
> > > > is called whenever the perf CPU or task context changes which is when
> > > > the RDPMC access may need to change.
> > > >
> > > > This is the first step in moving the RDPMC state tracking out of the mm
> > > > context to the perf context.
> > >
> > > Is this series supposed to be a user-visible change or not? I'm confused.
> >
> > It should not be user-visible. Or at least not user-visible for what
> > any user would notice. If an event is not part of the perf context on
> > another thread sharing the mm, does that thread need rdpmc access? No
> > access would be a user-visible change, but I struggle with how that's
> > a useful scenario?
> >
>
> This is what I mean by user-visible -- it changes semantics in a way that a user program could detect. I'm not saying it's a problem, but I do think you need to document the new semantics.
After testing some scenarios and finding perf_event_tests[1], this
series isn't going to work for x86 unless rdpmc is restricted to task
events only or allowed to segfault on CPU events when read on the
wrong CPU rather than just returning garbage. It's been discussed
before here[2].
Ultimately, I'm just trying to define the behavior for arm64 where we
don't have an existing ABI to maintain and don't have to recreate the
mistakes of x86 rdpmc ABI. Tying the access to mmap is messy. As we
explicitly request user access on perf_event_open(), I think it may be
better to just enable access when the event's context is active and
ignore mmap(). Maybe you have an opinion there since you added the
mmap() part?
Rob
[1] https://github.com/deater/perf_event_tests
[2] https://lore.kernel.org/lkml/alpine.DEB.2.21.1901101229010.3358@macbook-air/
On Thu, Aug 26, 2021, at 12:09 PM, Rob Herring wrote:
> On Thu, Aug 26, 2021 at 1:13 PM Andy Lutomirski <[email protected]> wrote:
> >
> >
> >
> > On Thu, Aug 12, 2021, at 11:16 AM, Rob Herring wrote:
> > > On Thu, Aug 12, 2021 at 11:50 AM Andy Lutomirski <[email protected]> wrote:
> > > >
> > > > On 7/28/21 4:02 PM, Rob Herring wrote:
> > > > > Rather than controlling RDPMC access behind the scenes from switch_mm(),
> > > > > move RDPMC access controls to the PMU .enable() hook. The .enable() hook
> > > > > is called whenever the perf CPU or task context changes which is when
> > > > > the RDPMC access may need to change.
> > > > >
> > > > > This is the first step in moving the RDPMC state tracking out of the mm
> > > > > context to the perf context.
> > > >
> > > > Is this series supposed to be a user-visible change or not? I'm confused.
> > >
> > > It should not be user-visible. Or at least not user-visible for what
> > > any user would notice. If an event is not part of the perf context on
> > > another thread sharing the mm, does that thread need rdpmc access? No
> > > access would be a user-visible change, but I struggle with how that's
> > > a useful scenario?
> > >
> >
> > This is what I mean by user-visible -- it changes semantics in a way that a user program could detect. I'm not saying it's a problem, but I do think you need to document the new semantics.
>
> After testing some scenarios and finding perf_event_tests[1], this
> series isn't going to work for x86 unless rdpmc is restricted to task
> events only or allowed to segfault on CPU events when read on the
> wrong CPU rather than just returning garbage. It's been discussed
> before here[2].
>
> Ultimately, I'm just trying to define the behavior for arm64 where we
> don't have an existing ABI to maintain and don't have to recreate the
> mistakes of x86 rdpmc ABI. Tying the access to mmap is messy. As we
> explicitly request user access on perf_event_open(), I think it may be
> better to just enable access when the event's context is active and
> ignore mmap(). Maybe you have an opinion there since you added the
> mmap() part?
That makes sense to me. The mmap() part was always a giant kludge.
There is fundamentally a race, at least if rseq isn’t used: if you check that you’re on the right CPU, do RDPMC, and throw out the result if you were on the wrong CPU (determined by looking at the mmap), you still would very much prefer not to fault.
Maybe rseq or a vDSO helper is the right solution for ARM.
>
> Rob
>
> [1] https://github.com/deater/perf_event_tests
> [2] https://lore.kernel.org/lkml/alpine.DEB.2.21.1901101229010.3358@macbook-air/
>
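For reference, the documented way userspace reads a counter through the
mmapped page is a seqlock'd loop that uses RDPMC only when pc->index is
non-zero and falls back to the read() syscall otherwise. A rough sketch
follows (x86-specific RDPMC helper since that is what is being tested in
this thread; helper names are illustrative and error handling is omitted),
based on the pattern documented in include/uapi/linux/perf_event.h:

#include <stdint.h>
#include <unistd.h>
#include <linux/perf_event.h>

static inline uint64_t rdpmc(uint32_t counter)
{
        uint32_t low, high;

        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
        return low | ((uint64_t)high << 32);
}

/* Read one counter; fd and pc come from perf_event_open() + mmap(). */
static uint64_t read_count(int fd, struct perf_event_mmap_page *pc)
{
        uint64_t count;
        uint32_t seq, idx;

        do {
                seq = pc->lock;
                __sync_synchronize();

                idx = pc->index;
                count = pc->offset;
                if (pc->cap_user_rdpmc && idx) {
                        /* pc->index is the hardware counter number plus one */
                        uint64_t raw = rdpmc(idx - 1);
                        int shift = 64 - pc->pmc_width;

                        /* sign-extend the raw counter value before adding */
                        count += (uint64_t)(((int64_t)(raw << shift)) >> shift);
                } else {
                        /* no user access here; fall back to the syscall */
                        if (read(fd, &count, sizeof(count)) != sizeof(count))
                                count = 0;
                        return count;
                }

                __sync_synchronize();
        } while (pc->lock != seq);

        return count;
}

Note that nothing in this loop can tell whether the counter actually
belongs to the CPU the thread is currently running on, which is exactly
the race being discussed above.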
On Fri, 27 Aug 2021, Andy Lutomirski wrote:
> On Thu, Aug 26, 2021, at 12:09 PM, Rob Herring wrote:
> > After testing some scenarios and finding perf_event_tests[1], this
> > series isn't going to work for x86 unless rdpmc is restricted to task
> > events only or allowed to segfault on CPU events when read on the
> > wrong CPU rather than just returning garbage. It's been discussed
> > before here[2].
> >
> > Ultimately, I'm just trying to define the behavior for arm64 where we
> > don't have an existing ABI to maintain and don't have to recreate the
> > mistakes of x86 rdpmc ABI. Tying the access to mmap is messy. As we
> > explicitly request user access on perf_event_open(), I think it may be
> > better to just enable access when the event's context is active and
> > ignore mmap(). Maybe you have an opinion there since you added the
> > mmap() part?
>
> That makes sense to me. The mmap() part was always a giant kludge.
>
> There is fundamentally a race, at least if rseq isn’t used: if you check
> that you’re on the right CPU, do RDPMC, and throw out the result if you
> were on the wrong CPU (determined by looking at the mmap), you still
> would very much prefer not to fault.
>
> Maybe rseq or a vDSO helper is the right solution for ARM.
as the author of those perf_event tests for rdpmc, I have to say if ARM
comes up with a cleaner implementation I'd be glad to have x86 transition
to something better.
The rdpmc code is a huge mess and has all kinds of corner cases. I'm not
sure anyone besides the PAPI library tries to use it, and while it's a
nice performance improvement to use rdpmc it is really hard to get things
working right.
As PAPI developers, we actually have run into the issue where the CPU
switches and we were reporting the wrong results. Also if I recall (it's
been a while) we were having issues where the setup lets you attach to a
process on another CPU for monitoring using the rdpmc interface and it
returns results even though I think that will rarely ever work in
practice.
Vince
On Sun, Aug 29, 2021 at 11:05:55PM -0400, Vince Weaver wrote:
> as the author of those perf_event tests for rdpmc, I have to say if ARM
> comes up with a cleaner implementation I'd be glad to have x86 transition
> to something better.
>
> The rdpmc code is a huge mess and has all kinds of corner cases. I'm not
> sure anyone besides the PAPI library tries to use it, and while it's a
> nice performance improvement to use rdpmc it is really hard to get things
> working right.
>
> As PAPI developers, we actually have run into the issue where the CPU
> switches and we were reporting the wrong results. Also if I recall (it's
> been a while) we were having issues where the setup lets you attach to a
> process on another CPU for monitoring using the rdpmc interface and it
> returns results even though I think that will rarely ever work in
> practice.
There's just not much we can do to validate the usage, fundamentally at
RDPMC time we're not running any kernel code, so we can't validate the
conditions under which we're called.
I suppose one way would be to create a mode where RDPMC is disabled but
emulated -- which completely voids the reason for using RDPMC in the
first place (performance), but would allow us to validate the usage.
Fundamentally, we must call RDPMC only for events that are currently
active on *this* CPU. Currently we rely on userspace to DTRT and if it
doesn't we have no way of knowing and it gets to keep the pieces.
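One way userspace can try to do the right thing for an event bound to a
specific CPU is to bracket the read with CPU checks and discard any read
that raced with a migration. This is only a sketch: sched_getcpu() stands
in for rseq, read_count() refers to the mmap read helper sketched earlier
in the thread, and a migration away and back between the two checks can
still go unnoticed, which is part of why rseq keeps coming up.

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <linux/perf_event.h>

uint64_t read_count(int fd, struct perf_event_mmap_page *pc); /* earlier sketch */

/* Read an event bound to 'cpu', retrying if the thread migrated mid-read. */
static int read_on_cpu(int fd, struct perf_event_mmap_page *pc, int cpu,
                       uint64_t *val)
{
        int tries = 100;        /* arbitrary bound, illustration only */

        while (tries--) {
                if (sched_getcpu() != cpu)
                        return -1;      /* wrong CPU: caller must use read() */
                *val = read_count(fd, pc);
                if (sched_getcpu() == cpu)
                        return 0;       /* no visible migration during the read */
        }
        return -1;
}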
On Mon, 30 Aug 2021, Peter Zijlstra wrote:
> There's just not much we can do to validate the usage, fundamentally at
> RDPMC time we're not running any kernel code, so we can't validate the
> conditions under which we're called.
>
> I suppose one way would be to create a mode where RDPMC is disabled but
> emulated -- which completely voids the reason for using RDPMC in the
> first place (performance), but would allow us to validate the usage.
>
> Fundamentally, we must call RDPMC only for events that are currently
> active on *this* CPU. Currently we rely on userspace to DTRT and if it
> doesn't we have no way of knowing and it gets to keep the pieces.
yes, though it would be nice for cases where things will never work (such
as process-attach? I think even if pinned to the same CPU that won't
work?) Maybe somehow the mmap page could be set in a way to indicate we
should fall back to the syscall. Maybe set pc->index to an invalid value
so we can use the existing syscall fallback code.
We could force every userspace program to know all the unsupported cases
but it seems like it could be easier and less failure-prone to centralize
this in the kernel.
I was looking into maybe creating a patch for this but the magic perf
mmap page implementation is complex enough that I'm not sure I'm qualified
to mess with it.
Vince
On Sun, Aug 29, 2021 at 10:06 PM Vince Weaver <[email protected]> wrote:
>
> On Fri, 27 Aug 2021, Andy Lutomirski wrote:
>
> > On Thu, Aug 26, 2021, at 12:09 PM, Rob Herring wrote:
>
> > > After testing some scenarios and finding perf_event_tests[1], this
> > > series isn't going to work for x86 unless rdpmc is restricted to task
> > > events only or allowed to segfault on CPU events when read on the
> > > wrong CPU rather than just returning garbage. It's been discussed
> > > before here[2].
> > >
> > > Ultimately, I'm just trying to define the behavior for arm64 where we
> > > don't have an existing ABI to maintain and don't have to recreate the
> > > mistakes of x86 rdpmc ABI. Tying the access to mmap is messy. As we
> > > explicitly request user access on perf_event_open(), I think it may be
> > > better to just enable access when the event's context is active and
> > > ignore mmap(). Maybe you have an opinion there since you added the
> > > mmap() part?
> >
> > That makes sense to me. The mmap() part was always a giant kludge.
> >
> > There is fundamentally a race, at least if rseq isn’t used: if you check
> > that you’re on the right CPU, do RDPMC, and throw out the result if you
> > were on the wrong CPU (determined by looking at the mmap), you still
> > would very much prefer not to fault.
> >
> > Maybe rseq or a vDSO helper is the right solution for ARM.
There was a version using rseq[1]. AIUI, that would solve the problem of
reading from the wrong CPU. I don't think using rseq would change the
kernel implementation other than whether we enable events on specific
CPUs.
> as the author of those perf_event tests for rdpmc, I have to say if ARM
> comes up with a cleaner implementation I'd be glad to have x86 transition
> to something better.
Thanks for chiming in.
My plan is to be more restrictive in terms of what works, and to fail or
disable user access for what's not supported. Unless I hear that events on
specific CPUs are really important, that means only monitoring of a
thread across all (for big.LITTLE, all homogeneous) CPUs is supported.
That doesn't require a better/cleaner interface. It just means cpu
must be -1 for perf_event_open() if you want rdpmc. The difference on
Arm is just that we can enforce/indicate that.
We could also enable CPU events, but abort if read on the wrong CPU.
The user in that case either has to control the thread affinity or
possibly use rseq.
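To make the cpu == -1 model above concrete, a minimal sketch of the
userspace side follows (generic perf_event_open() only; nothing
arm64-specific is shown since the exact way user access is requested is
still being settled in this series, and error handling is trimmed):

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a counter for the calling thread on any CPU and map the user page. */
static int open_self_counter(struct perf_event_mmap_page **pc)
{
        struct perf_event_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.exclude_kernel = 1;
        /* an arm64-specific attr bit to request user access would go here (TBD) */

        /* pid = 0, cpu = -1: this thread, wherever it runs */
        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0)
                return -1;

        *pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
        if (*pc == MAP_FAILED) {
                close(fd);
                return -1;
        }
        return fd;
}

Under the plan above, an event opened with a specific cpu would presumably
either be refused user access or report !cap_user_rdpmc, depending on
where the restriction ends up being enforced.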
> The rdpmc code is a huge mess and has all kinds of corner cases. I'm not
> sure anyone besides the PAPI library tries to use it, and while it's a
> nice performance improvement to use rdpmc it is really hard to get things
> working right.
Yes, I've been reading through the bugs you reported and the related
tests. I just wish I'd found them sooner...
> As a PAPI developer we actually have run into the issue where the CPU
> switches and we were reporting the wrong results. Also if I recall (it's
> been a while) we were having issues where the setup lets you attach to a
> process on another CPU for monitoring using the rdpmc interface and it
> returns results even though I think that will rarely ever work in
> practice.
Returning the wrong results is obviously bad for the user, but making
that "work" also complicates the kernel implementation.
Rob
[1] https://x-lore.kernel.org/all/[email protected]/
On Mon, Aug 30, 2021 at 3:21 PM Vince Weaver <[email protected]> wrote:
>
> On Mon, 30 Aug 2021, Peter Zijlstra wrote:
>
> > There's just not much we can do to validate the usage, fundamentally at
> > RDPMC time we're not running any kernel code, so we can't validate the
> > conditions under which we're called.
> >
> > I suppose one way would be to create a mode where RDPMC is disabled but
> > emulated -- which completely voids the reason for using RDPMC in the
> > first place (performance), but would allow us to validate the usage.
> >
> > Fundamentally, we must call RDPMC only for events that are currently
> > active on *this* CPU. Currently we rely on userspace to DTRT and if it
> > doesn't we have no way of knowing and it gets to keep the pieces.
>
> yes, though it would be nice for cases where things will never work (such
> as process-attach? I think even if pinned to the same CPU that won't
> work?) Maybe somehow the mmap page could be set in a way to indicate we
> should fall back to the syscall. Maybe set pc->index to an invalid value
> so we can use the existing syscall fallback code.
>
> We could force every userspace program to know all the unsupported cases
> but it seems like it could be easier and less failure-prone to centralize
> this in the kernel.
>
> I was looking into maybe creating a patch for this but the magic perf
> mmap page implementation is complex enough that I'm not sure I'm qualified
> to mess with it.
There's now an implementation in libperf[1]. perf_evsel__read() will
use it[2] and fall back to a read() call if necessary (but will still
happily give you wrong values if reading on the wrong CPU).
Rob
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n302
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/evsel.c#n305
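For anyone who wants to try that path, a rough sketch of using libperf for
a self-monitoring read via the mmapped page is below. It is modeled on the
libperf counting example; exact function signatures may differ between
libperf versions, error handling is abbreviated, and the program links
with -lperf.

#include <stdint.h>
#include <linux/perf_event.h>
#include <perf/evsel.h>
#include <perf/threadmap.h>

/* Count instructions for the calling thread, reading via the mmapped page. */
static int count_instructions(uint64_t *val)
{
        struct perf_event_attr attr = {
                .type           = PERF_TYPE_HARDWARE,
                .config         = PERF_COUNT_HW_INSTRUCTIONS,
                .exclude_kernel = 1,
        };
        struct perf_counts_values counts;
        struct perf_thread_map *threads;
        struct perf_evsel *evsel;
        int err = -1;

        threads = perf_thread_map__new_dummy();
        perf_thread_map__set_pid(threads, 0, 0);        /* pid 0: this thread */

        evsel = perf_evsel__new(&attr);
        if (!evsel)
                goto out_threads;
        if (perf_evsel__open(evsel, NULL, threads))
                goto out_evsel;
        if (perf_evsel__mmap(evsel, 0))                 /* map the user page */
                goto out_close;

        /* ... code to measure goes here ... */

        if (!perf_evsel__read(evsel, 0, 0, &counts)) {  /* user read if usable */
                *val = counts.val;
                err = 0;
        }
out_close:
        perf_evsel__close(evsel);
out_evsel:
        perf_evsel__delete(evsel);
out_threads:
        perf_thread_map__put(threads);
        return err;
}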