2022-06-28 10:19:55

by Marco Elver

Subject: [PATCH v2 00/13] perf/hw_breakpoint: Optimize for thousands of tasks

The hw_breakpoint subsystem's code has seen little change in over 10
years. In that time, systems with hundreds of CPUs have become common,
and the perf subsystem has seen many improvements: using breakpoints on
thousands of concurrent tasks should be a supported use case.

The breakpoint constraints accounting algorithm is the major bottleneck
in doing so:

1. toggle_bp_slot() and fetch_bp_busy_slots() are O(#cpus * #tasks):
Both iterate through all CPUs and call task_bp_pinned(), which is
O(#tasks).

2. Everything is serialized on a global mutex, 'nr_bp_mutex'.

The series starts with the simpler optimizations and finishes with the
more complex ones:

1. We first optimize task_bp_pinned() to only take O(1) on average.

2. Rework synchronization to allow concurrency when checking and
updating breakpoint constraints for tasks.

3. Eliminate the O(#cpus) loops in the CPU-independent case.

Along the way, smaller micro-optimizations and cleanups are done where
they seemed obvious when staring at the code (though these are likely
insignificant on their own).

The result is (on a system with 256 CPUs) that we go from:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
[ ^ more aggressive benchmark parameters took too long ]
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 236.418 [sec]
|
| 123134.794271 usecs/op
| 7880626.833333 usecs/op/cpu

... to the following with all optimizations:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| Total time: 0.067 [sec]
|
| 35.292187 usecs/op
| 2258.700000 usecs/op/cpu

On the test system used, that's an effective speedup of ~3490x per op.
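(For reference, that factor is simply the ratio of the per-op times
above: 123134.794271 / 35.292187 ~= 3489, i.e. ~3490x.)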

This is on par with the theoretical ideal performance of optimizations
in hw_breakpoint.c (measured with constraints accounting disabled), and
only 12% slower than having no breakpoints at all.

Changelog
---------

v2:
* Add KUnit test suite.
* Remove struct bp_busy_slots and simplify functions.
* Add "powerpc/hw_breakpoint: Avoid relying on caller synchronization".
* Add "locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked()".
* Use percpu-rwsem instead of rwlock.
* Use task_struct::perf_event_mutex instead of sharded mutex.
* Drop v1 "perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent".
* Add "perf/hw_breakpoint: Introduce bp_slots_histogram".
* Add "perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets".
* Add "perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets".
* Apply Acked-by/Reviewed-by given in v1 for unchanged patches.
==> Speedup of ~3490x (vs. ~3315x in v1).

v1: https://lore.kernel.org/all/[email protected]/

Marco Elver (13):
perf/hw_breakpoint: Add KUnit test for constraints accounting
perf/hw_breakpoint: Clean up headers
perf/hw_breakpoint: Optimize list of per-task breakpoints
perf/hw_breakpoint: Mark data __ro_after_init
perf/hw_breakpoint: Optimize constant number of breakpoint slots
perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
perf/hw_breakpoint: Remove useless code related to flexible
breakpoints
powerpc/hw_breakpoint: Avoid relying on caller synchronization
locking/percpu-rwsem: Add percpu_is_write_locked() and
percpu_is_read_locked()
perf/hw_breakpoint: Reduce contention with large number of tasks
perf/hw_breakpoint: Introduce bp_slots_histogram
perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent
task targets
perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task
targets

arch/powerpc/kernel/hw_breakpoint.c | 53 ++-
arch/sh/include/asm/hw_breakpoint.h | 5 +-
arch/x86/include/asm/hw_breakpoint.h | 5 +-
include/linux/hw_breakpoint.h | 1 -
include/linux/percpu-rwsem.h | 6 +
include/linux/perf_event.h | 3 +-
kernel/events/Makefile | 1 +
kernel/events/hw_breakpoint.c | 594 ++++++++++++++++++++-------
kernel/events/hw_breakpoint_test.c | 321 +++++++++++++++
kernel/locking/percpu-rwsem.c | 6 +
lib/Kconfig.debug | 10 +
11 files changed, 826 insertions(+), 179 deletions(-)
create mode 100644 kernel/events/hw_breakpoint_test.c

--
2.37.0.rc0.161.g10f37bed90-goog


2022-06-28 10:26:10

by Marco Elver

Subject: [PATCH v2 02/13] perf/hw_breakpoint: Clean up headers

Clean up headers:

- Remove unused <linux/kallsyms.h>

- Remove unused <linux/kprobes.h>

- Remove unused <linux/module.h>

- Remove unused <linux/smp.h>

- Add <linux/export.h> for EXPORT_SYMBOL_GPL().

- Add <linux/mutex.h> for mutex.

- Sort alphabetically.

- Move <linux/hw_breakpoint.h> to top to test it compiles on its own.

Signed-off-by: Marco Elver <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
---
v2:
* Move to start of series.
---
kernel/events/hw_breakpoint.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index f32320ac02fd..1b013968b395 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -17,23 +17,22 @@
* This file contains the arch-independent routines.
*/

+#include <linux/hw_breakpoint.h>
+
+#include <linux/bug.h>
+#include <linux/cpu.h>
+#include <linux/export.h>
+#include <linux/init.h>
#include <linux/irqflags.h>
-#include <linux/kallsyms.h>
-#include <linux/notifier.h>
-#include <linux/kprobes.h>
#include <linux/kdebug.h>
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
#include <linux/percpu.h>
#include <linux/sched.h>
-#include <linux/init.h>
#include <linux/slab.h>
-#include <linux/list.h>
-#include <linux/cpu.h>
-#include <linux/smp.h>
-#include <linux/bug.h>

-#include <linux/hw_breakpoint.h>
/*
* Constraints data
*/
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 10:28:51

by Marco Elver

Subject: [PATCH v2 06/13] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable

Due to being a __weak function, hw_breakpoint_weight() will cause the
compiler to always emit a call to it. This generates unnecessarily bad
code (register spills etc.) for no good reason; in fact it appears in
profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:

...
0.70% [kernel] [k] hw_breakpoint_weight
...

While this is only a small percentage, no architecture defines its own
hw_breakpoint_weight(), nor are there users outside hw_breakpoint.c,
which makes its current __weak definition a poor choice.

Change hw_breakpoint_weight()'s definition to follow a similar protocol
to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
hw_breakpoint_weight(), we'll use it instead.

The result is that it is inlined and no longer shows up in profiles.
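
For illustration, an architecture could then provide its own definition
from <asm/hw_breakpoint.h> along the following lines (hypothetical
sketch only, not part of this series; the #define mirrors the
hw_breakpoint_slots() protocol so the generic fallback is skipped):

	/* Hypothetical override in <asm/hw_breakpoint.h>, for illustration. */
	static inline int hw_breakpoint_weight(struct perf_event *bp)
	{
		/* e.g. charge wide watchpoints two slots on an imaginary arch */
		return bp->attr.bp_len > 8 ? 2 : 1;
	}
	#define hw_breakpoint_weight hw_breakpoint_weight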

Signed-off-by: Marco Elver <[email protected]>
---
include/linux/hw_breakpoint.h | 1 -
kernel/events/hw_breakpoint.c | 4 +++-
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
index 78dd7035d1e5..9fa3547acd87 100644
--- a/include/linux/hw_breakpoint.h
+++ b/include/linux/hw_breakpoint.h
@@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
extern int dbg_release_bp_slot(struct perf_event *bp);
extern int reserve_bp_slot(struct perf_event *bp);
extern void release_bp_slot(struct perf_event *bp);
-int hw_breakpoint_weight(struct perf_event *bp);
int arch_reserve_bp_slot(struct perf_event *bp);
void arch_release_bp_slot(struct perf_event *bp);
void arch_unregister_hw_breakpoint(struct perf_event *bp);
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index a089302ddf59..a124786e3ade 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -124,10 +124,12 @@ static __init int init_breakpoint_slots(void)
}
#endif

-__weak int hw_breakpoint_weight(struct perf_event *bp)
+#ifndef hw_breakpoint_weight
+static inline int hw_breakpoint_weight(struct perf_event *bp)
{
return 1;
}
+#endif

static inline enum bp_type_idx find_slot_idx(u64 bp_type)
{
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 10:29:11

by Marco Elver

Subject: [PATCH v2 12/13] perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets

Running the perf benchmark with (note: more aggressive parameters vs.
preceding changes, but same 256 CPUs host):

| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.989 [sec]
|
| 38.854160 usecs/op
| 4973.332500 usecs/op/cpu

20.43% [kernel] [k] queued_spin_lock_slowpath
18.75% [kernel] [k] osq_lock
16.98% [kernel] [k] rhashtable_jhash2
8.34% [kernel] [k] task_bp_pinned
4.23% [kernel] [k] smp_cfm_core_cond
3.65% [kernel] [k] bcmp
2.83% [kernel] [k] toggle_bp_slot
1.87% [kernel] [k] find_next_bit
1.49% [kernel] [k] __reserve_bp_slot

We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().

Obtaining max_bp_pinned_slots() for CPU-independent task targets is
currently O(#cpus): it calls task_bp_pinned() for each CPU, even if the
result of task_bp_pinned() is CPU-independent.

The loop in max_bp_pinned_slots() wants to compute the maximum slots
across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
obtaining the maximum CPU-pinned slots across all CPUs and adding
task_bp_pinned() to it.

To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
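
The bp_slots_histogram type itself is introduced by the earlier patch
"perf/hw_breakpoint: Introduce bp_slots_histogram" (not quoted in this
excerpt). Roughly, and as its uses below suggest, it is a per-type array
of atomic buckets where bucket n counts entries currently occupying n+1
slots, so the maximum number of slots in use is the highest non-empty
bucket plus one. A simplified sketch (the real code additionally handles
the constant-slots case, uses relaxed atomics, and WARNs on underflow):

	struct bp_slots_histogram {
		atomic_t *count;	/* count[n]: entries using n+1 slots */
	};

	/* Move an entry from 'old' used slots to 'old + val' used slots. */
	static void bp_slots_histogram_add(struct bp_slots_histogram *hist,
					   int old, int val)
	{
		if (old > 0)
			atomic_dec(&hist->count[old - 1]);
		if (old + val > 0)
			atomic_inc(&hist->count[old + val - 1]);
	}

	/* Maximum slots in use: highest non-empty bucket, plus one. */
	static int
	bp_slots_histogram_max(struct bp_slots_histogram *hist, enum bp_type_idx type)
	{
		for (int i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
			if (atomic_read(&hist->count[i]) > 0)
				return i + 1;
		}
		return 0;
	}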

After this optimization:

| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.930 [sec]
|
| 37.697832 usecs/op
| 4825.322500 usecs/op/cpu

19.13% [kernel] [k] queued_spin_lock_slowpath
18.21% [kernel] [k] rhashtable_jhash2
15.46% [kernel] [k] osq_lock
6.27% [kernel] [k] toggle_bp_slot
5.91% [kernel] [k] task_bp_pinned
5.05% [kernel] [k] smp_cfm_core_cond
1.78% [kernel] [k] update_sg_lb_stats
1.36% [kernel] [k] llist_reverse_order
1.34% [kernel] [k] find_next_bit
1.19% [kernel] [k] bcmp

This suggests that the time spent in task_bp_pinned() has been reduced.
However, we're still hashing too much, which will be addressed in the
subsequent change.

Signed-off-by: Marco Elver <[email protected]>
---
v2:
* New patch.
---
kernel/events/hw_breakpoint.c | 45 +++++++++++++++++++++++++++++++----
1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 18886f115abc..b5180a2ccfbf 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -64,6 +64,9 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
return per_cpu_ptr(bp_cpuinfo + type, cpu);
}

+/* Number of pinned CPU breakpoints globally. */
+static struct bp_slots_histogram cpu_pinned[TYPE_MAX];
+
/* Keep track of the breakpoints attached to tasks */
static struct rhltable task_bps_ht;
static const struct rhashtable_params task_bps_ht_params = {
@@ -194,6 +197,10 @@ static __init int init_breakpoint_slots(void)
goto err;
}
}
+ for (i = 0; i < TYPE_MAX; i++) {
+ if (!bp_slots_histogram_alloc(&cpu_pinned[i], i))
+ goto err;
+ }

return 0;
err:
@@ -203,6 +210,8 @@ static __init int init_breakpoint_slots(void)
if (err_cpu == cpu)
break;
}
+ for (i = 0; i < TYPE_MAX; i++)
+ bp_slots_histogram_free(&cpu_pinned[i]);

return -ENOMEM;
}
@@ -270,6 +279,9 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
/*
* Count the number of breakpoints of the same type and same task.
* The given event must be not on the list.
+ *
+ * If @cpu is -1, but the result of task_bp_pinned() is not CPU-independent,
+ * returns a negative value.
*/
static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
{
@@ -288,9 +300,18 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
goto out;

rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
- if (find_slot_idx(iter->attr.bp_type) == type &&
- (iter->cpu < 0 || cpu == iter->cpu))
- count += hw_breakpoint_weight(iter);
+ if (find_slot_idx(iter->attr.bp_type) != type)
+ continue;
+
+ if (iter->cpu >= 0) {
+ if (cpu == -1) {
+ count = -1;
+ goto out;
+ } else if (cpu != iter->cpu)
+ continue;
+ }
+
+ count += hw_breakpoint_weight(iter);
}

out:
@@ -316,6 +337,19 @@ max_bp_pinned_slots(struct perf_event *bp, enum bp_type_idx type)
int pinned_slots = 0;
int cpu;

+ if (bp->hw.target && bp->cpu < 0) {
+ int max_pinned = task_bp_pinned(-1, bp, type);
+
+ if (max_pinned >= 0) {
+ /*
+ * Fast path: task_bp_pinned() is CPU-independent and
+ * returns the same value for any CPU.
+ */
+ max_pinned += bp_slots_histogram_max(&cpu_pinned[type], type);
+ return max_pinned;
+ }
+ }
+
for_each_cpu(cpu, cpumask) {
struct bp_cpuinfo *info = get_bp_info(cpu, type);
int nr;
@@ -366,8 +400,11 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,

/* Pinned counter cpu profiling */
if (!bp->hw.target) {
+ struct bp_cpuinfo *info = get_bp_info(bp->cpu, type);
+
lockdep_assert_held_write(&bp_cpuinfo_sem);
- get_bp_info(bp->cpu, type)->cpu_pinned += weight;
+ bp_slots_histogram_add(&cpu_pinned[type], info->cpu_pinned, weight);
+ info->cpu_pinned += weight;
return 0;
}

--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 10:43:34

by Marco Elver

Subject: [PATCH v2 08/13] powerpc/hw_breakpoint: Avoid relying on caller synchronization

Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
implementation have relied on nr_bp_mutex serializing access to them.

Before overhauling synchronization of kernel/events/hw_breakpoint.c,
introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively,
thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint.

Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
---
v2:
* New patch.
---
arch/powerpc/kernel/hw_breakpoint.c | 53 ++++++++++++++++++++++-------
1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index 2669f80b3a49..8db1a15d7acb 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -15,6 +15,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/smp.h>
+#include <linux/spinlock.h>
#include <linux/debugfs.h>
#include <linux/init.h>

@@ -129,7 +130,14 @@ struct breakpoint {
bool ptrace_bp;
};

+/*
+ * While kernel/events/hw_breakpoint.c does its own synchronization, we cannot
+ * rely on it safely synchronizing internals here; however, we can rely on it
+ * not requesting more breakpoints than available.
+ */
+static DEFINE_SPINLOCK(cpu_bps_lock);
static DEFINE_PER_CPU(struct breakpoint *, cpu_bps[HBP_NUM_MAX]);
+static DEFINE_SPINLOCK(task_bps_lock);
static LIST_HEAD(task_bps);

static struct breakpoint *alloc_breakpoint(struct perf_event *bp)
@@ -174,7 +182,9 @@ static int task_bps_add(struct perf_event *bp)
if (IS_ERR(tmp))
return PTR_ERR(tmp);

+ spin_lock(&task_bps_lock);
list_add(&tmp->list, &task_bps);
+ spin_unlock(&task_bps_lock);
return 0;
}

@@ -182,6 +192,7 @@ static void task_bps_remove(struct perf_event *bp)
{
struct list_head *pos, *q;

+ spin_lock(&task_bps_lock);
list_for_each_safe(pos, q, &task_bps) {
struct breakpoint *tmp = list_entry(pos, struct breakpoint, list);

@@ -191,6 +202,7 @@ static void task_bps_remove(struct perf_event *bp)
break;
}
}
+ spin_unlock(&task_bps_lock);
}

/*
@@ -200,12 +212,17 @@ static void task_bps_remove(struct perf_event *bp)
static bool all_task_bps_check(struct perf_event *bp)
{
struct breakpoint *tmp;
+ bool ret = false;

+ spin_lock(&task_bps_lock);
list_for_each_entry(tmp, &task_bps, list) {
- if (!can_co_exist(tmp, bp))
- return true;
+ if (!can_co_exist(tmp, bp)) {
+ ret = true;
+ break;
+ }
}
- return false;
+ spin_unlock(&task_bps_lock);
+ return ret;
}

/*
@@ -215,13 +232,18 @@ static bool all_task_bps_check(struct perf_event *bp)
static bool same_task_bps_check(struct perf_event *bp)
{
struct breakpoint *tmp;
+ bool ret = false;

+ spin_lock(&task_bps_lock);
list_for_each_entry(tmp, &task_bps, list) {
if (tmp->bp->hw.target == bp->hw.target &&
- !can_co_exist(tmp, bp))
- return true;
+ !can_co_exist(tmp, bp)) {
+ ret = true;
+ break;
+ }
}
- return false;
+ spin_unlock(&task_bps_lock);
+ return ret;
}

static int cpu_bps_add(struct perf_event *bp)
@@ -234,6 +256,7 @@ static int cpu_bps_add(struct perf_event *bp)
if (IS_ERR(tmp))
return PTR_ERR(tmp);

+ spin_lock(&cpu_bps_lock);
cpu_bp = per_cpu_ptr(cpu_bps, bp->cpu);
for (i = 0; i < nr_wp_slots(); i++) {
if (!cpu_bp[i]) {
@@ -241,6 +264,7 @@ static int cpu_bps_add(struct perf_event *bp)
break;
}
}
+ spin_unlock(&cpu_bps_lock);
return 0;
}

@@ -249,6 +273,7 @@ static void cpu_bps_remove(struct perf_event *bp)
struct breakpoint **cpu_bp;
int i = 0;

+ spin_lock(&cpu_bps_lock);
cpu_bp = per_cpu_ptr(cpu_bps, bp->cpu);
for (i = 0; i < nr_wp_slots(); i++) {
if (!cpu_bp[i])
@@ -260,19 +285,25 @@ static void cpu_bps_remove(struct perf_event *bp)
break;
}
}
+ spin_unlock(&cpu_bps_lock);
}

static bool cpu_bps_check(int cpu, struct perf_event *bp)
{
struct breakpoint **cpu_bp;
+ bool ret = false;
int i;

+ spin_lock(&cpu_bps_lock);
cpu_bp = per_cpu_ptr(cpu_bps, cpu);
for (i = 0; i < nr_wp_slots(); i++) {
- if (cpu_bp[i] && !can_co_exist(cpu_bp[i], bp))
- return true;
+ if (cpu_bp[i] && !can_co_exist(cpu_bp[i], bp)) {
+ ret = true;
+ break;
+ }
}
- return false;
+ spin_unlock(&cpu_bps_lock);
+ return ret;
}

static bool all_cpu_bps_check(struct perf_event *bp)
@@ -286,10 +317,6 @@ static bool all_cpu_bps_check(struct perf_event *bp)
return false;
}

-/*
- * We don't use any locks to serialize accesses to cpu_bps or task_bps
- * because are already inside nr_bp_mutex.
- */
int arch_reserve_bp_slot(struct perf_event *bp)
{
int ret;
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 10:44:24

by Marco Elver

Subject: [PATCH v2 04/13] perf/hw_breakpoint: Mark data __ro_after_init

Mark read-only data after initialization as __ro_after_init.

While we are here, turn 'constraints_initialized' into a bool.

Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
---
kernel/events/hw_breakpoint.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index add1b9c59631..270be965f829 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -46,7 +46,7 @@ struct bp_cpuinfo {
};

static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
-static int nr_slots[TYPE_MAX];
+static int nr_slots[TYPE_MAX] __ro_after_init;

static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
{
@@ -62,7 +62,7 @@ static const struct rhashtable_params task_bps_ht_params = {
.automatic_shrinking = true,
};

-static int constraints_initialized;
+static bool constraints_initialized __ro_after_init;

/* Gather the number of total pinned and un-pinned bp in a cpuset */
struct bp_busy_slots {
@@ -710,7 +710,7 @@ int __init init_hw_breakpoint(void)
if (ret)
goto err;

- constraints_initialized = 1;
+ constraints_initialized = true;

perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);

--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 10:47:28

by Marco Elver

Subject: [PATCH v2 05/13] perf/hw_breakpoint: Optimize constant number of breakpoint slots

Optimize internal hw_breakpoint state if the architecture's number of
breakpoint slots is constant. This avoids several kmalloc() calls and
the potential failures if those allocations fail, as well as subtly
improving code generation and cache locality.

The protocol is that if an architecture defines hw_breakpoint_slots via
the preprocessor, it must be constant and the same for all types.

Signed-off-by: Marco Elver <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
---
arch/sh/include/asm/hw_breakpoint.h | 5 +-
arch/x86/include/asm/hw_breakpoint.h | 5 +-
kernel/events/hw_breakpoint.c | 92 ++++++++++++++++++----------
3 files changed, 62 insertions(+), 40 deletions(-)

diff --git a/arch/sh/include/asm/hw_breakpoint.h b/arch/sh/include/asm/hw_breakpoint.h
index 199d17b765f2..361a0f57bdeb 100644
--- a/arch/sh/include/asm/hw_breakpoint.h
+++ b/arch/sh/include/asm/hw_breakpoint.h
@@ -48,10 +48,7 @@ struct pmu;
/* Maximum number of UBC channels */
#define HBP_NUM 2

-static inline int hw_breakpoint_slots(int type)
-{
- return HBP_NUM;
-}
+#define hw_breakpoint_slots(type) (HBP_NUM)

/* arch/sh/kernel/hw_breakpoint.c */
extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
diff --git a/arch/x86/include/asm/hw_breakpoint.h b/arch/x86/include/asm/hw_breakpoint.h
index a1f0e90d0818..0bc931cd0698 100644
--- a/arch/x86/include/asm/hw_breakpoint.h
+++ b/arch/x86/include/asm/hw_breakpoint.h
@@ -44,10 +44,7 @@ struct arch_hw_breakpoint {
/* Total number of available HW breakpoint registers */
#define HBP_NUM 4

-static inline int hw_breakpoint_slots(int type)
-{
- return HBP_NUM;
-}
+#define hw_breakpoint_slots(type) (HBP_NUM)

struct perf_event_attr;
struct perf_event;
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 270be965f829..a089302ddf59 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -40,13 +40,16 @@ struct bp_cpuinfo {
/* Number of pinned cpu breakpoints in a cpu */
unsigned int cpu_pinned;
/* tsk_pinned[n] is the number of tasks having n+1 breakpoints */
+#ifdef hw_breakpoint_slots
+ unsigned int tsk_pinned[hw_breakpoint_slots(0)];
+#else
unsigned int *tsk_pinned;
+#endif
/* Number of non-pinned cpu/task breakpoints in a cpu */
unsigned int flexible; /* XXX: placeholder, see fetch_this_slot() */
};

static DEFINE_PER_CPU(struct bp_cpuinfo, bp_cpuinfo[TYPE_MAX]);
-static int nr_slots[TYPE_MAX] __ro_after_init;

static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
{
@@ -73,6 +76,54 @@ struct bp_busy_slots {
/* Serialize accesses to the above constraints */
static DEFINE_MUTEX(nr_bp_mutex);

+#ifdef hw_breakpoint_slots
+/*
+ * Number of breakpoint slots is constant, and the same for all types.
+ */
+static_assert(hw_breakpoint_slots(TYPE_INST) == hw_breakpoint_slots(TYPE_DATA));
+static inline int hw_breakpoint_slots_cached(int type) { return hw_breakpoint_slots(type); }
+static inline int init_breakpoint_slots(void) { return 0; }
+#else
+/*
+ * Dynamic number of breakpoint slots.
+ */
+static int __nr_bp_slots[TYPE_MAX] __ro_after_init;
+
+static inline int hw_breakpoint_slots_cached(int type)
+{
+ return __nr_bp_slots[type];
+}
+
+static __init int init_breakpoint_slots(void)
+{
+ int i, cpu, err_cpu;
+
+ for (i = 0; i < TYPE_MAX; i++)
+ __nr_bp_slots[i] = hw_breakpoint_slots(i);
+
+ for_each_possible_cpu(cpu) {
+ for (i = 0; i < TYPE_MAX; i++) {
+ struct bp_cpuinfo *info = get_bp_info(cpu, i);
+
+ info->tsk_pinned = kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL);
+ if (!info->tsk_pinned)
+ goto err;
+ }
+ }
+
+ return 0;
+err:
+ for_each_possible_cpu(err_cpu) {
+ for (i = 0; i < TYPE_MAX; i++)
+ kfree(get_bp_info(err_cpu, i)->tsk_pinned);
+ if (err_cpu == cpu)
+ break;
+ }
+
+ return -ENOMEM;
+}
+#endif
+
__weak int hw_breakpoint_weight(struct perf_event *bp)
{
return 1;
@@ -95,7 +146,7 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
unsigned int *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
int i;

- for (i = nr_slots[type] - 1; i >= 0; i--) {
+ for (i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
if (tsk_pinned[i] > 0)
return i + 1;
}
@@ -312,7 +363,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
fetch_this_slot(&slots, weight);

/* Flexible counters need to keep at least one slot */
- if (slots.pinned + (!!slots.flexible) > nr_slots[type])
+ if (slots.pinned + (!!slots.flexible) > hw_breakpoint_slots_cached(type))
return -ENOSPC;

ret = arch_reserve_bp_slot(bp);
@@ -687,42 +738,19 @@ static struct pmu perf_breakpoint = {

int __init init_hw_breakpoint(void)
{
- int cpu, err_cpu;
- int i, ret;
-
- for (i = 0; i < TYPE_MAX; i++)
- nr_slots[i] = hw_breakpoint_slots(i);
-
- for_each_possible_cpu(cpu) {
- for (i = 0; i < TYPE_MAX; i++) {
- struct bp_cpuinfo *info = get_bp_info(cpu, i);
-
- info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
- GFP_KERNEL);
- if (!info->tsk_pinned) {
- ret = -ENOMEM;
- goto err;
- }
- }
- }
+ int ret;

ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
if (ret)
- goto err;
+ return ret;
+
+ ret = init_breakpoint_slots();
+ if (ret)
+ return ret;

constraints_initialized = true;

perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);

return register_die_notifier(&hw_breakpoint_exceptions_nb);
-
-err:
- for_each_possible_cpu(err_cpu) {
- for (i = 0; i < TYPE_MAX; i++)
- kfree(get_bp_info(err_cpu, i)->tsk_pinned);
- if (err_cpu == cpu)
- break;
- }
-
- return ret;
}
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 10:49:07

by Marco Elver

Subject: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets

We can still see that a majority of the time is spent hashing task pointers:

...
16.98% [kernel] [k] rhashtable_jhash2
...

Doing the bookkeeping in toggle_bp_slot() is currently O(#cpus),
calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
CPU-independent. The reason for this is to update the per-CPU
'tsk_pinned' histogram.

To optimize the CPU-independent case to O(1), keep a separate
CPU-independent 'tsk_pinned_all' histogram.

The major source of complexity is the transitions between "all
CPU-independent task breakpoints" and "mixed CPU-independent and
CPU-dependent task breakpoints". The code comments list all cases that
require handling.
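
As a concrete (hypothetical) walk-through of the "mixed" direction, case
2.a in the code comments below: a task holds two CPU-independent
breakpoints, so only the global 'tsk_pinned_all' histogram counts it, in
bucket 1 ("2 slots used"). When a third breakpoint pinned to CPU 3 is
enabled, the task is first added to every CPU's 'tsk_pinned' histogram
at bucket 1, then CPU 3's entry is moved from bucket 1 to bucket 2 to
account for the new breakpoint, and finally the task is removed from the
global histogram, since its breakpoints are no longer all
CPU-independent.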

After this optimization:

| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| Total time: 1.758 [sec]
|
| 34.336621 usecs/op
| 4395.087500 usecs/op/cpu

38.08% [kernel] [k] queued_spin_lock_slowpath
10.81% [kernel] [k] smp_cfm_core_cond
3.01% [kernel] [k] update_sg_lb_stats
2.58% [kernel] [k] osq_lock
2.57% [kernel] [k] llist_reverse_order
1.45% [kernel] [k] find_next_bit
1.21% [kernel] [k] flush_tlb_func_common
1.01% [kernel] [k] arch_install_hw_breakpoint

This shows that the time spent hashing keys has become insignificant.

With the given benchmark parameters, that's an improvement of 12%
compared with the old O(#cpus) version.

And finally, using the less aggressive parameters from the preceding
changes, we now observe:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| Total time: 0.067 [sec]
|
| 35.292187 usecs/op
| 2258.700000 usecs/op/cpu

This is an improvement of 12% compared to the version without the
histogram optimizations (baseline is 40 usecs/op). We are now on par
with the theoretical ideal (constraints accounting disabled), and only
12% slower than having no breakpoints at all.

Signed-off-by: Marco Elver <[email protected]>
---
v2:
* New patch.
---
kernel/events/hw_breakpoint.c | 152 +++++++++++++++++++++++++++-------
1 file changed, 121 insertions(+), 31 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index b5180a2ccfbf..31b24e42f2b5 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -66,6 +66,8 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)

/* Number of pinned CPU breakpoints globally. */
static struct bp_slots_histogram cpu_pinned[TYPE_MAX];
+/* Number of pinned CPU-independent task breakpoints. */
+static struct bp_slots_histogram tsk_pinned_all[TYPE_MAX];

/* Keep track of the breakpoints attached to tasks */
static struct rhltable task_bps_ht;
@@ -200,6 +202,8 @@ static __init int init_breakpoint_slots(void)
for (i = 0; i < TYPE_MAX; i++) {
if (!bp_slots_histogram_alloc(&cpu_pinned[i], i))
goto err;
+ if (!bp_slots_histogram_alloc(&tsk_pinned_all[i], i))
+ goto err;
}

return 0;
@@ -210,8 +214,10 @@ static __init int init_breakpoint_slots(void)
if (err_cpu == cpu)
break;
}
- for (i = 0; i < TYPE_MAX; i++)
+ for (i = 0; i < TYPE_MAX; i++) {
bp_slots_histogram_free(&cpu_pinned[i]);
+ bp_slots_histogram_free(&tsk_pinned_all[i]);
+ }

return -ENOMEM;
}
@@ -245,6 +251,26 @@ bp_slots_histogram_max(struct bp_slots_histogram *hist, enum bp_type_idx type)
return 0;
}

+static int
+bp_slots_histogram_max_merge(struct bp_slots_histogram *hist1, struct bp_slots_histogram *hist2,
+ enum bp_type_idx type)
+{
+ for (int i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
+ const int count1 = atomic_read(&hist1->count[i]);
+ const int count2 = atomic_read(&hist2->count[i]);
+
+ /* Catch unexpected writers; we want a stable snapshot. */
+ ASSERT_EXCLUSIVE_WRITER(hist1->count[i]);
+ ASSERT_EXCLUSIVE_WRITER(hist2->count[i]);
+ if (count1 + count2 > 0)
+ return i + 1;
+ WARN(count1 < 0, "inconsistent breakpoint slots histogram");
+ WARN(count2 < 0, "inconsistent breakpoint slots histogram");
+ }
+
+ return 0;
+}
+
#ifndef hw_breakpoint_weight
static inline int hw_breakpoint_weight(struct perf_event *bp)
{
@@ -273,7 +299,7 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
* toggle_bp_task_slot() to tsk_pinned, and we get a stable snapshot.
*/
lockdep_assert_held_write(&bp_cpuinfo_sem);
- return bp_slots_histogram_max(tsk_pinned, type);
+ return bp_slots_histogram_max_merge(tsk_pinned, &tsk_pinned_all[type], type);
}

/*
@@ -366,40 +392,22 @@ max_bp_pinned_slots(struct perf_event *bp, enum bp_type_idx type)
return pinned_slots;
}

-/*
- * Add a pinned breakpoint for the given task in our constraint table
- */
-static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
- enum bp_type_idx type, int weight)
-{
- struct bp_slots_histogram *tsk_pinned = &get_bp_info(cpu, type)->tsk_pinned;
-
- /*
- * If bp->hw.target, tsk_pinned is only modified, but not used
- * otherwise. We can permit concurrent updates as long as there are no
- * other uses: having acquired bp_cpuinfo_sem as a reader allows
- * concurrent updates here. Uses of tsk_pinned will require acquiring
- * bp_cpuinfo_sem as a writer to stabilize tsk_pinned's value.
- */
- lockdep_assert_held_read(&bp_cpuinfo_sem);
- bp_slots_histogram_add(tsk_pinned, task_bp_pinned(cpu, bp, type), weight);
-}
-
/*
* Add/remove the given breakpoint in our constraint table
*/
static int
-toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
- int weight)
+toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type, int weight)
{
- const struct cpumask *cpumask = cpumask_of_bp(bp);
- int cpu;
+ int cpu, next_tsk_pinned;

if (!enable)
weight = -weight;

- /* Pinned counter cpu profiling */
if (!bp->hw.target) {
+ /*
+ * Update the pinned CPU slots, in per-CPU bp_cpuinfo and in the
+ * global histogram.
+ */
struct bp_cpuinfo *info = get_bp_info(bp->cpu, type);

lockdep_assert_held_write(&bp_cpuinfo_sem);
@@ -408,9 +416,91 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
return 0;
}

- /* Pinned counter task profiling */
- for_each_cpu(cpu, cpumask)
- toggle_bp_task_slot(bp, cpu, type, weight);
+ /*
+ * If bp->hw.target, tsk_pinned is only modified, but not used
+ * otherwise. We can permit concurrent updates as long as there are no
+ * other uses: having acquired bp_cpuinfo_sem as a reader allows
+ * concurrent updates here. Uses of tsk_pinned will require acquiring
+ * bp_cpuinfo_sem as a writer to stabilize tsk_pinned's value.
+ */
+ lockdep_assert_held_read(&bp_cpuinfo_sem);
+
+ /*
+ * Update the pinned task slots, in per-CPU bp_cpuinfo and in the global
+ * histogram. We need to take care of 5 cases:
+ *
+ * 1. This breakpoint targets all CPUs (cpu < 0), and there may only
+ * exist other task breakpoints targeting all CPUs. In this case we
+ * can simply update the global slots histogram.
+ *
+ * 2. This breakpoint targets a specific CPU (cpu >= 0), but there may
+ * only exist other task breakpoints targeting all CPUs.
+ *
+ * a. On enable: remove the existing breakpoints from the global
+ * slots histogram and use the per-CPU histogram.
+ *
+ * b. On disable: re-insert the existing breakpoints into the global
+ * slots histogram and remove from per-CPU histogram.
+ *
+ * 3. Some other existing task breakpoints target specific CPUs. Only
+ * update the per-CPU slots histogram.
+ */
+
+ if (!enable) {
+ /*
+ * Remove before updating histograms so we can determine if this
+ * was the last task breakpoint for a specific CPU.
+ */
+ int ret = rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
+
+ if (ret)
+ return ret;
+ }
+ /*
+ * Note: If !enable, next_tsk_pinned will not count the to-be-removed breakpoint.
+ */
+ next_tsk_pinned = task_bp_pinned(-1, bp, type);
+
+ if (next_tsk_pinned >= 0) {
+ if (bp->cpu < 0) { /* Case 1: fast path */
+ if (!enable)
+ next_tsk_pinned += hw_breakpoint_weight(bp);
+ bp_slots_histogram_add(&tsk_pinned_all[type], next_tsk_pinned, weight);
+ } else if (enable) { /* Case 2.a: slow path */
+ /* Add existing to per-CPU histograms. */
+ for_each_possible_cpu(cpu) {
+ bp_slots_histogram_add(&get_bp_info(cpu, type)->tsk_pinned,
+ 0, next_tsk_pinned);
+ }
+ /* Add this first CPU-pinned task breakpoint. */
+ bp_slots_histogram_add(&get_bp_info(bp->cpu, type)->tsk_pinned,
+ next_tsk_pinned, weight);
+ /* Rebalance global task pinned histogram. */
+ bp_slots_histogram_add(&tsk_pinned_all[type], next_tsk_pinned,
+ -next_tsk_pinned);
+ } else { /* Case 2.b: slow path */
+ /* Remove this last CPU-pinned task breakpoint. */
+ bp_slots_histogram_add(&get_bp_info(bp->cpu, type)->tsk_pinned,
+ next_tsk_pinned + hw_breakpoint_weight(bp), weight);
+ /* Remove all from per-CPU histograms. */
+ for_each_possible_cpu(cpu) {
+ bp_slots_histogram_add(&get_bp_info(cpu, type)->tsk_pinned,
+ next_tsk_pinned, -next_tsk_pinned);
+ }
+ /* Rebalance global task pinned histogram. */
+ bp_slots_histogram_add(&tsk_pinned_all[type], 0, next_tsk_pinned);
+ }
+ } else { /* Case 3: slow path */
+ const struct cpumask *cpumask = cpumask_of_bp(bp);
+
+ for_each_cpu(cpu, cpumask) {
+ next_tsk_pinned = task_bp_pinned(cpu, bp, type);
+ if (!enable)
+ next_tsk_pinned += hw_breakpoint_weight(bp);
+ bp_slots_histogram_add(&get_bp_info(cpu, type)->tsk_pinned,
+ next_tsk_pinned, weight);
+ }
+ }

/*
* Readers want a stable snapshot of the per-task breakpoint list.
@@ -419,8 +509,8 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,

if (enable)
return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
- else
- return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
+
+ return 0;
}

__weak int arch_reserve_bp_slot(struct perf_event *bp)
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-28 11:17:20

by Marco Elver

Subject: Re: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets

On Tue, 28 Jun 2022 at 11:59, Marco Elver <[email protected]> wrote:
[...]
> + /*
> + * Update the pinned task slots, in per-CPU bp_cpuinfo and in the global
> + * histogram. We need to take care of 5 cases:

This is a typo: "5 cases" -> "4 cases".

2022-06-28 13:33:27

by Dmitry Vyukov

Subject: Re: [PATCH v2 08/13] powerpc/hw_breakpoint: Avoid relying on caller synchronization

On Tue, 28 Jun 2022 at 11:59, Marco Elver <[email protected]> wrote:
>
> Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
> implementation have relied on nr_bp_mutex serializing access to them.
>
> Before overhauling synchronization of kernel/events/hw_breakpoint.c,
> introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively,
> thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint.
>
> Reported-by: Dmitry Vyukov <[email protected]>
> Signed-off-by: Marco Elver <[email protected]>

Acked-by: Dmitry Vyukov <[email protected]>

> ---
> v2:
> * New patch.
> ---
> arch/powerpc/kernel/hw_breakpoint.c | 53 ++++++++++++++++++++++-------
> 1 file changed, 40 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
> index 2669f80b3a49..8db1a15d7acb 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -15,6 +15,7 @@
> #include <linux/kernel.h>
> #include <linux/sched.h>
> #include <linux/smp.h>
> +#include <linux/spinlock.h>
> #include <linux/debugfs.h>
> #include <linux/init.h>
>
> @@ -129,7 +130,14 @@ struct breakpoint {
> bool ptrace_bp;
> };
>
> +/*
> + * While kernel/events/hw_breakpoint.c does its own synchronization, we cannot
> + * rely on it safely synchronizing internals here; however, we can rely on it
> + * not requesting more breakpoints than available.
> + */
> +static DEFINE_SPINLOCK(cpu_bps_lock);
> static DEFINE_PER_CPU(struct breakpoint *, cpu_bps[HBP_NUM_MAX]);
> +static DEFINE_SPINLOCK(task_bps_lock);
> static LIST_HEAD(task_bps);
>
> static struct breakpoint *alloc_breakpoint(struct perf_event *bp)
> @@ -174,7 +182,9 @@ static int task_bps_add(struct perf_event *bp)
> if (IS_ERR(tmp))
> return PTR_ERR(tmp);
>
> + spin_lock(&task_bps_lock);
> list_add(&tmp->list, &task_bps);
> + spin_unlock(&task_bps_lock);
> return 0;
> }
>
> @@ -182,6 +192,7 @@ static void task_bps_remove(struct perf_event *bp)
> {
> struct list_head *pos, *q;
>
> + spin_lock(&task_bps_lock);
> list_for_each_safe(pos, q, &task_bps) {
> struct breakpoint *tmp = list_entry(pos, struct breakpoint, list);
>
> @@ -191,6 +202,7 @@ static void task_bps_remove(struct perf_event *bp)
> break;
> }
> }
> + spin_unlock(&task_bps_lock);
> }
>
> /*
> @@ -200,12 +212,17 @@ static void task_bps_remove(struct perf_event *bp)
> static bool all_task_bps_check(struct perf_event *bp)
> {
> struct breakpoint *tmp;
> + bool ret = false;
>
> + spin_lock(&task_bps_lock);
> list_for_each_entry(tmp, &task_bps, list) {
> - if (!can_co_exist(tmp, bp))
> - return true;
> + if (!can_co_exist(tmp, bp)) {
> + ret = true;
> + break;
> + }
> }
> - return false;
> + spin_unlock(&task_bps_lock);
> + return ret;
> }
>
> /*
> @@ -215,13 +232,18 @@ static bool all_task_bps_check(struct perf_event *bp)
> static bool same_task_bps_check(struct perf_event *bp)
> {
> struct breakpoint *tmp;
> + bool ret = false;
>
> + spin_lock(&task_bps_lock);
> list_for_each_entry(tmp, &task_bps, list) {
> if (tmp->bp->hw.target == bp->hw.target &&
> - !can_co_exist(tmp, bp))
> - return true;
> + !can_co_exist(tmp, bp)) {
> + ret = true;
> + break;
> + }
> }
> - return false;
> + spin_unlock(&task_bps_lock);
> + return ret;
> }
>
> static int cpu_bps_add(struct perf_event *bp)
> @@ -234,6 +256,7 @@ static int cpu_bps_add(struct perf_event *bp)
> if (IS_ERR(tmp))
> return PTR_ERR(tmp);
>
> + spin_lock(&cpu_bps_lock);
> cpu_bp = per_cpu_ptr(cpu_bps, bp->cpu);
> for (i = 0; i < nr_wp_slots(); i++) {
> if (!cpu_bp[i]) {
> @@ -241,6 +264,7 @@ static int cpu_bps_add(struct perf_event *bp)
> break;
> }
> }
> + spin_unlock(&cpu_bps_lock);
> return 0;
> }
>
> @@ -249,6 +273,7 @@ static void cpu_bps_remove(struct perf_event *bp)
> struct breakpoint **cpu_bp;
> int i = 0;
>
> + spin_lock(&cpu_bps_lock);
> cpu_bp = per_cpu_ptr(cpu_bps, bp->cpu);
> for (i = 0; i < nr_wp_slots(); i++) {
> if (!cpu_bp[i])
> @@ -260,19 +285,25 @@ static void cpu_bps_remove(struct perf_event *bp)
> break;
> }
> }
> + spin_unlock(&cpu_bps_lock);
> }
>
> static bool cpu_bps_check(int cpu, struct perf_event *bp)
> {
> struct breakpoint **cpu_bp;
> + bool ret = false;
> int i;
>
> + spin_lock(&cpu_bps_lock);
> cpu_bp = per_cpu_ptr(cpu_bps, cpu);
> for (i = 0; i < nr_wp_slots(); i++) {
> - if (cpu_bp[i] && !can_co_exist(cpu_bp[i], bp))
> - return true;
> + if (cpu_bp[i] && !can_co_exist(cpu_bp[i], bp)) {
> + ret = true;
> + break;
> + }
> }
> - return false;
> + spin_unlock(&cpu_bps_lock);
> + return ret;
> }
>
> static bool all_cpu_bps_check(struct perf_event *bp)
> @@ -286,10 +317,6 @@ static bool all_cpu_bps_check(struct perf_event *bp)
> return false;
> }
>
> -/*
> - * We don't use any locks to serialize accesses to cpu_bps or task_bps
> - * because are already inside nr_bp_mutex.
> - */
> int arch_reserve_bp_slot(struct perf_event *bp)
> {
> int ret;
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-28 13:57:33

by Dmitry Vyukov

Subject: Re: [PATCH v2 06/13] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable

On Tue, 28 Jun 2022 at 11:59, Marco Elver <[email protected]> wrote:
>
> Due to being a __weak function, hw_breakpoint_weight() will cause the
> compiler to always emit a call to it. This generates unnecessarily bad
> code (register spills etc.) for no good reason; in fact it appears in
> profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
>
> ...
> 0.70% [kernel] [k] hw_breakpoint_weight
> ...
>
> While a small percentage, no architecture defines its own
> hw_breakpoint_weight() nor are there users outside hw_breakpoint.c,
> which makes the fact it is currently __weak a poor choice.
>
> Change hw_breakpoint_weight()'s definition to follow a similar protocol
> to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
> hw_breakpoint_weight(), we'll use it instead.
>
> The result is that it is inlined and no longer shows up in profiles.
>
> Signed-off-by: Marco Elver <[email protected]>

Reviewed-by: Dmitry Vyukov <[email protected]>

> ---
> include/linux/hw_breakpoint.h | 1 -
> kernel/events/hw_breakpoint.c | 4 +++-
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
> index 78dd7035d1e5..9fa3547acd87 100644
> --- a/include/linux/hw_breakpoint.h
> +++ b/include/linux/hw_breakpoint.h
> @@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
> extern int dbg_release_bp_slot(struct perf_event *bp);
> extern int reserve_bp_slot(struct perf_event *bp);
> extern void release_bp_slot(struct perf_event *bp);
> -int hw_breakpoint_weight(struct perf_event *bp);
> int arch_reserve_bp_slot(struct perf_event *bp);
> void arch_release_bp_slot(struct perf_event *bp);
> void arch_unregister_hw_breakpoint(struct perf_event *bp);
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index a089302ddf59..a124786e3ade 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -124,10 +124,12 @@ static __init int init_breakpoint_slots(void)
> }
> #endif
>
> -__weak int hw_breakpoint_weight(struct perf_event *bp)
> +#ifndef hw_breakpoint_weight
> +static inline int hw_breakpoint_weight(struct perf_event *bp)
> {
> return 1;
> }
> +#endif
>
> static inline enum bp_type_idx find_slot_idx(u64 bp_type)
> {
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-28 15:53:50

by Dmitry Vyukov

Subject: Re: [PATCH v2 12/13] perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets

On Tue, 28 Jun 2022 at 11:59, Marco Elver <[email protected]> wrote:
>
> Running the perf benchmark with (note: more aggressive parameters vs.
> preceding changes, but same 256 CPUs host):
>
> | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> | Total time: 1.989 [sec]
> |
> | 38.854160 usecs/op
> | 4973.332500 usecs/op/cpu
>
> 20.43% [kernel] [k] queued_spin_lock_slowpath
> 18.75% [kernel] [k] osq_lock
> 16.98% [kernel] [k] rhashtable_jhash2
> 8.34% [kernel] [k] task_bp_pinned
> 4.23% [kernel] [k] smp_cfm_core_cond
> 3.65% [kernel] [k] bcmp
> 2.83% [kernel] [k] toggle_bp_slot
> 1.87% [kernel] [k] find_next_bit
> 1.49% [kernel] [k] __reserve_bp_slot
>
> We can see that a majority of the time is now spent hashing task
> pointers to index into task_bps_ht in task_bp_pinned().
>
> Obtaining the max_bp_pinned_slots() for CPU-independent task targets
> currently is O(#cpus), and calls task_bp_pinned() for each CPU, even if
> the result of task_bp_pinned() is CPU-independent.
>
> The loop in max_bp_pinned_slots() wants to compute the maximum slots
> across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
> obtaining the max slots across all CPUs and adding task_bp_pinned().
>
> To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
>
> After this optimization:
>
> | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> | Total time: 1.930 [sec]
> |
> | 37.697832 usecs/op
> | 4825.322500 usecs/op/cpu
>
> 19.13% [kernel] [k] queued_spin_lock_slowpath
> 18.21% [kernel] [k] rhashtable_jhash2
> 15.46% [kernel] [k] osq_lock
> 6.27% [kernel] [k] toggle_bp_slot
> 5.91% [kernel] [k] task_bp_pinned
> 5.05% [kernel] [k] smp_cfm_core_cond
> 1.78% [kernel] [k] update_sg_lb_stats
> 1.36% [kernel] [k] llist_reverse_order
> 1.34% [kernel] [k] find_next_bit
> 1.19% [kernel] [k] bcmp
>
> Suggesting that time spent in task_bp_pinned() has been reduced.
> However, we're still hashing too much, which will be addressed in the
> subsequent change.
>
> Signed-off-by: Marco Elver <[email protected]>

Reviewed-by: Dmitry Vyukov <[email protected]>

> ---
> v2:
> * New patch.
> ---
> kernel/events/hw_breakpoint.c | 45 +++++++++++++++++++++++++++++++----
> 1 file changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 18886f115abc..b5180a2ccfbf 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -64,6 +64,9 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
> return per_cpu_ptr(bp_cpuinfo + type, cpu);
> }
>
> +/* Number of pinned CPU breakpoints globally. */
> +static struct bp_slots_histogram cpu_pinned[TYPE_MAX];
> +
> /* Keep track of the breakpoints attached to tasks */
> static struct rhltable task_bps_ht;
> static const struct rhashtable_params task_bps_ht_params = {
> @@ -194,6 +197,10 @@ static __init int init_breakpoint_slots(void)
> goto err;
> }
> }
> + for (i = 0; i < TYPE_MAX; i++) {
> + if (!bp_slots_histogram_alloc(&cpu_pinned[i], i))
> + goto err;
> + }
>
> return 0;
> err:
> @@ -203,6 +210,8 @@ static __init int init_breakpoint_slots(void)
> if (err_cpu == cpu)
> break;
> }
> + for (i = 0; i < TYPE_MAX; i++)
> + bp_slots_histogram_free(&cpu_pinned[i]);
>
> return -ENOMEM;
> }
> @@ -270,6 +279,9 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
> /*
> * Count the number of breakpoints of the same type and same task.
> * The given event must be not on the list.
> + *
> + * If @cpu is -1, but the result of task_bp_pinned() is not CPU-independent,
> + * returns a negative value.
> */
> static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
> {
> @@ -288,9 +300,18 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
> goto out;
>
> rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
> - if (find_slot_idx(iter->attr.bp_type) == type &&
> - (iter->cpu < 0 || cpu == iter->cpu))
> - count += hw_breakpoint_weight(iter);
> + if (find_slot_idx(iter->attr.bp_type) != type)
> + continue;
> +
> + if (iter->cpu >= 0) {
> + if (cpu == -1) {
> + count = -1;
> + goto out;
> + } else if (cpu != iter->cpu)
> + continue;
> + }
> +
> + count += hw_breakpoint_weight(iter);
> }
>
> out:
> @@ -316,6 +337,19 @@ max_bp_pinned_slots(struct perf_event *bp, enum bp_type_idx type)
> int pinned_slots = 0;
> int cpu;
>
> + if (bp->hw.target && bp->cpu < 0) {
> + int max_pinned = task_bp_pinned(-1, bp, type);
> +
> + if (max_pinned >= 0) {
> + /*
> + * Fast path: task_bp_pinned() is CPU-independent and
> + * returns the same value for any CPU.
> + */
> + max_pinned += bp_slots_histogram_max(&cpu_pinned[type], type);
> + return max_pinned;
> + }
> + }
> +
> for_each_cpu(cpu, cpumask) {
> struct bp_cpuinfo *info = get_bp_info(cpu, type);
> int nr;
> @@ -366,8 +400,11 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>
> /* Pinned counter cpu profiling */
> if (!bp->hw.target) {
> + struct bp_cpuinfo *info = get_bp_info(bp->cpu, type);
> +
> lockdep_assert_held_write(&bp_cpuinfo_sem);
> - get_bp_info(bp->cpu, type)->cpu_pinned += weight;
> + bp_slots_histogram_add(&cpu_pinned[type], info->cpu_pinned, weight);
> + info->cpu_pinned += weight;
> return 0;
> }
>
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-28 16:09:09

by Marco Elver

Subject: Re: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets

On Tue, 28 Jun 2022 at 17:45, Dmitry Vyukov <[email protected]> wrote:
>
> On Tue, 28 Jun 2022 at 11:59, Marco Elver <[email protected]> wrote:
> >
> > We can still see that a majority of the time is spent hashing task pointers:
> >
> > ...
> > 16.98% [kernel] [k] rhashtable_jhash2
> > ...
> >
> > Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
> > calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
> > CPU-independent. The reason for this is to update the per-CPU
> > 'tsk_pinned' histogram.
> >
> > To optimize the CPU-independent case to O(1), keep a separate
> > CPU-independent 'tsk_pinned_all' histogram.
> >
> > The major source of complexity are transitions between "all
> > CPU-independent task breakpoints" and "mixed CPU-independent and
> > CPU-dependent task breakpoints". The code comments list all cases that
> > require handling.
> >
> > After this optimization:
> >
> > | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > | Total time: 1.758 [sec]
> > |
> > | 34.336621 usecs/op
> > | 4395.087500 usecs/op/cpu
> >
> > 38.08% [kernel] [k] queued_spin_lock_slowpath
> > 10.81% [kernel] [k] smp_cfm_core_cond
> > 3.01% [kernel] [k] update_sg_lb_stats
> > 2.58% [kernel] [k] osq_lock
> > 2.57% [kernel] [k] llist_reverse_order
> > 1.45% [kernel] [k] find_next_bit
> > 1.21% [kernel] [k] flush_tlb_func_common
> > 1.01% [kernel] [k] arch_install_hw_breakpoint
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, that's an improvement of 12%
> > compared with the old O(#cpus) version.
> >
> > And finally, using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | Total time: 0.067 [sec]
> > |
> > | 35.292187 usecs/op
> > | 2258.700000 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without the histogram
> > optimizations (baseline is 40 usecs/op). This is now on par with the
> > theoretical ideal (constraints disabled), and only 12% slower than no
> > breakpoints at all.
> >
> > Signed-off-by: Marco Elver <[email protected]>
>
> Reviewed-by: Dmitry Vyukov <[email protected]>
>
> I don't see any bugs. But the code is quite complex. Does it make
> sense to add some asserts to the histogram type? E.g. counters don't
> underflow, weight is not negative (e.g. accidentally added -1 returned
> from task_bp_pinned()). Not sure if it will be enough to catch all
> types of bugs, though.
> Could kunit tests check that histograms are all 0's at the end?
>
> I am not just about the current code (which may be correct), but also
> future modifications to this code.

I'll think of some more options.

bp_slots_histogram_max*() already has asserts (WARN about underflow;
some with KCSAN help).

The main thing I did to raise my own confidence in the code is inject
bugs and see if the KUnit test catches it. If it didn't I extended the
tests. I'll do that some more maybe.

2022-06-28 16:24:08

by Dmitry Vyukov

Subject: Re: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets

On Tue, 28 Jun 2022 at 11:59, Marco Elver <[email protected]> wrote:
>
> We can still see that a majority of the time is spent hashing task pointers:
>
> ...
> 16.98% [kernel] [k] rhashtable_jhash2
> ...
>
> Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
> calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
> CPU-independent. The reason for this is to update the per-CPU
> 'tsk_pinned' histogram.
>
> To optimize the CPU-independent case to O(1), keep a separate
> CPU-independent 'tsk_pinned_all' histogram.
>
> The major source of complexity are transitions between "all
> CPU-independent task breakpoints" and "mixed CPU-independent and
> CPU-dependent task breakpoints". The code comments list all cases that
> require handling.
>
> After this optimization:
>
> | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> | Total time: 1.758 [sec]
> |
> | 34.336621 usecs/op
> | 4395.087500 usecs/op/cpu
>
> 38.08% [kernel] [k] queued_spin_lock_slowpath
> 10.81% [kernel] [k] smp_cfm_core_cond
> 3.01% [kernel] [k] update_sg_lb_stats
> 2.58% [kernel] [k] osq_lock
> 2.57% [kernel] [k] llist_reverse_order
> 1.45% [kernel] [k] find_next_bit
> 1.21% [kernel] [k] flush_tlb_func_common
> 1.01% [kernel] [k] arch_install_hw_breakpoint
>
> Showing that the time spent hashing keys has become insignificant.
>
> With the given benchmark parameters, that's an improvement of 12%
> compared with the old O(#cpus) version.
>
> And finally, using the less aggressive parameters from the preceding
> changes, we now observe:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | Total time: 0.067 [sec]
> |
> | 35.292187 usecs/op
> | 2258.700000 usecs/op/cpu
>
> Which is an improvement of 12% compared to without the histogram
> optimizations (baseline is 40 usecs/op). This is now on par with the
> theoretical ideal (constraints disabled), and only 12% slower than no
> breakpoints at all.
>
> Signed-off-by: Marco Elver <[email protected]>

Reviewed-by: Dmitry Vyukov <[email protected]>

I don't see any bugs. But the code is quite complex. Does it make
sense to add some asserts to the histogram type? E.g. counters don't
underflow, weight is not negative (e.g. accidentally added -1 returned
from task_bp_pinned()). Not sure if it will be enough to catch all
types of bugs, though.
Could kunit tests check that histograms are all 0's at the end?

I am not just talking about the current code (which may be correct), but
also about future modifications to this code.


> ---
> v2:
> * New patch.
> ---
> kernel/events/hw_breakpoint.c | 152 +++++++++++++++++++++++++++-------
> 1 file changed, 121 insertions(+), 31 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index b5180a2ccfbf..31b24e42f2b5 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -66,6 +66,8 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
>
> /* Number of pinned CPU breakpoints globally. */
> static struct bp_slots_histogram cpu_pinned[TYPE_MAX];
> +/* Number of pinned CPU-independent task breakpoints. */
> +static struct bp_slots_histogram tsk_pinned_all[TYPE_MAX];
>
> /* Keep track of the breakpoints attached to tasks */
> static struct rhltable task_bps_ht;
> @@ -200,6 +202,8 @@ static __init int init_breakpoint_slots(void)
> for (i = 0; i < TYPE_MAX; i++) {
> if (!bp_slots_histogram_alloc(&cpu_pinned[i], i))
> goto err;
> + if (!bp_slots_histogram_alloc(&tsk_pinned_all[i], i))
> + goto err;
> }
>
> return 0;
> @@ -210,8 +214,10 @@ static __init int init_breakpoint_slots(void)
> if (err_cpu == cpu)
> break;
> }
> - for (i = 0; i < TYPE_MAX; i++)
> + for (i = 0; i < TYPE_MAX; i++) {
> bp_slots_histogram_free(&cpu_pinned[i]);
> + bp_slots_histogram_free(&tsk_pinned_all[i]);
> + }
>
> return -ENOMEM;
> }
> @@ -245,6 +251,26 @@ bp_slots_histogram_max(struct bp_slots_histogram *hist, enum bp_type_idx type)
> return 0;
> }
>
> +static int
> +bp_slots_histogram_max_merge(struct bp_slots_histogram *hist1, struct bp_slots_histogram *hist2,
> + enum bp_type_idx type)
> +{
> + for (int i = hw_breakpoint_slots_cached(type) - 1; i >= 0; i--) {
> + const int count1 = atomic_read(&hist1->count[i]);
> + const int count2 = atomic_read(&hist2->count[i]);
> +
> + /* Catch unexpected writers; we want a stable snapshot. */
> + ASSERT_EXCLUSIVE_WRITER(hist1->count[i]);
> + ASSERT_EXCLUSIVE_WRITER(hist2->count[i]);
> + if (count1 + count2 > 0)
> + return i + 1;
> + WARN(count1 < 0, "inconsistent breakpoint slots histogram");
> + WARN(count2 < 0, "inconsistent breakpoint slots histogram");
> + }
> +
> + return 0;
> +}
> +
> #ifndef hw_breakpoint_weight
> static inline int hw_breakpoint_weight(struct perf_event *bp)
> {
> @@ -273,7 +299,7 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
> * toggle_bp_task_slot() to tsk_pinned, and we get a stable snapshot.
> */
> lockdep_assert_held_write(&bp_cpuinfo_sem);
> - return bp_slots_histogram_max(tsk_pinned, type);
> + return bp_slots_histogram_max_merge(tsk_pinned, &tsk_pinned_all[type], type);
> }
>
> /*
> @@ -366,40 +392,22 @@ max_bp_pinned_slots(struct perf_event *bp, enum bp_type_idx type)
> return pinned_slots;
> }
>
> -/*
> - * Add a pinned breakpoint for the given task in our constraint table
> - */
> -static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
> - enum bp_type_idx type, int weight)
> -{
> - struct bp_slots_histogram *tsk_pinned = &get_bp_info(cpu, type)->tsk_pinned;
> -
> - /*
> - * If bp->hw.target, tsk_pinned is only modified, but not used
> - * otherwise. We can permit concurrent updates as long as there are no
> - * other uses: having acquired bp_cpuinfo_sem as a reader allows
> - * concurrent updates here. Uses of tsk_pinned will require acquiring
> - * bp_cpuinfo_sem as a writer to stabilize tsk_pinned's value.
> - */
> - lockdep_assert_held_read(&bp_cpuinfo_sem);
> - bp_slots_histogram_add(tsk_pinned, task_bp_pinned(cpu, bp, type), weight);
> -}
> -
> /*
> * Add/remove the given breakpoint in our constraint table
> */
> static int
> -toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> - int weight)
> +toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type, int weight)
> {
> - const struct cpumask *cpumask = cpumask_of_bp(bp);
> - int cpu;
> + int cpu, next_tsk_pinned;
>
> if (!enable)
> weight = -weight;
>
> - /* Pinned counter cpu profiling */
> if (!bp->hw.target) {
> + /*
> + * Update the pinned CPU slots, in per-CPU bp_cpuinfo and in the
> + * global histogram.
> + */
> struct bp_cpuinfo *info = get_bp_info(bp->cpu, type);
>
> lockdep_assert_held_write(&bp_cpuinfo_sem);
> @@ -408,9 +416,91 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> return 0;
> }
>
> - /* Pinned counter task profiling */
> - for_each_cpu(cpu, cpumask)
> - toggle_bp_task_slot(bp, cpu, type, weight);
> + /*
> + * If bp->hw.target, tsk_pinned is only modified, but not used
> + * otherwise. We can permit concurrent updates as long as there are no
> + * other uses: having acquired bp_cpuinfo_sem as a reader allows
> + * concurrent updates here. Uses of tsk_pinned will require acquiring
> + * bp_cpuinfo_sem as a writer to stabilize tsk_pinned's value.
> + */
> + lockdep_assert_held_read(&bp_cpuinfo_sem);
> +
> + /*
> + * Update the pinned task slots, in per-CPU bp_cpuinfo and in the global
> + * histogram. We need to take care of 4 cases:
> + *
> + * 1. This breakpoint targets all CPUs (cpu < 0), and there may only
> + * exist other task breakpoints targeting all CPUs. In this case we
> + * can simply update the global slots histogram.
> + *
> + * 2. This breakpoint targets a specific CPU (cpu >= 0), but there may
> + * only exist other task breakpoints targeting all CPUs.
> + *
> + * a. On enable: remove the existing breakpoints from the global
> + * slots histogram and use the per-CPU histogram.
> + *
> + * b. On disable: re-insert the existing breakpoints into the global
> + * slots histogram and remove from per-CPU histogram.
> + *
> + * 3. Some other existing task breakpoints target specific CPUs. Only
> + * update the per-CPU slots histogram.
> + */
> +
> + if (!enable) {
> + /*
> + * Remove before updating histograms so we can determine if this
> + * was the last task breakpoint for a specific CPU.
> + */
> + int ret = rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> +
> + if (ret)
> + return ret;
> + }
> + /*
> + * Note: If !enable, next_tsk_pinned will not count the to-be-removed breakpoint.
> + */
> + next_tsk_pinned = task_bp_pinned(-1, bp, type);
> +
> + if (next_tsk_pinned >= 0) {
> + if (bp->cpu < 0) { /* Case 1: fast path */
> + if (!enable)
> + next_tsk_pinned += hw_breakpoint_weight(bp);
> + bp_slots_histogram_add(&tsk_pinned_all[type], next_tsk_pinned, weight);
> + } else if (enable) { /* Case 2.a: slow path */
> + /* Add existing to per-CPU histograms. */
> + for_each_possible_cpu(cpu) {
> + bp_slots_histogram_add(&get_bp_info(cpu, type)->tsk_pinned,
> + 0, next_tsk_pinned);
> + }
> + /* Add this first CPU-pinned task breakpoint. */
> + bp_slots_histogram_add(&get_bp_info(bp->cpu, type)->tsk_pinned,
> + next_tsk_pinned, weight);
> + /* Rebalance global task pinned histogram. */
> + bp_slots_histogram_add(&tsk_pinned_all[type], next_tsk_pinned,
> + -next_tsk_pinned);
> + } else { /* Case 2.b: slow path */
> + /* Remove this last CPU-pinned task breakpoint. */
> + bp_slots_histogram_add(&get_bp_info(bp->cpu, type)->tsk_pinned,
> + next_tsk_pinned + hw_breakpoint_weight(bp), weight);
> + /* Remove all from per-CPU histograms. */
> + for_each_possible_cpu(cpu) {
> + bp_slots_histogram_add(&get_bp_info(cpu, type)->tsk_pinned,
> + next_tsk_pinned, -next_tsk_pinned);
> + }
> + /* Rebalance global task pinned histogram. */
> + bp_slots_histogram_add(&tsk_pinned_all[type], 0, next_tsk_pinned);
> + }
> + } else { /* Case 3: slow path */
> + const struct cpumask *cpumask = cpumask_of_bp(bp);
> +
> + for_each_cpu(cpu, cpumask) {
> + next_tsk_pinned = task_bp_pinned(cpu, bp, type);
> + if (!enable)
> + next_tsk_pinned += hw_breakpoint_weight(bp);
> + bp_slots_histogram_add(&get_bp_info(cpu, type)->tsk_pinned,
> + next_tsk_pinned, weight);
> + }
> + }
>
> /*
> * Readers want a stable snapshot of the per-task breakpoint list.
> @@ -419,8 +509,8 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>
> if (enable)
> return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> - else
> - return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> +
> + return 0;
> }
>
> __weak int arch_reserve_bp_slot(struct perf_event *bp)
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-07-01 09:04:26

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v2 08/13] powerpc/hw_breakpoint: Avoid relying on caller synchronization

Hi Marco,

On 28/06/2022 at 11:58, Marco Elver wrote:
> Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
> implementation have relied on nr_bp_mutex serializing access to them.
>
> Before overhauling synchronization of kernel/events/hw_breakpoint.c,
> introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively,
> thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint.

We have a still-open old issue in our database related to
hw_breakpoint, and I was wondering whether it could have any link with
the changes you are doing and whether you could handle it at the same
time.

https://github.com/linuxppc/issues/issues/38

Maybe it is completely unrelated, but as this patch of your series only
touches powerpc and the issue says that powerpc is the only one to do
that, I thought it might be worth a heads-up.

Thanks
Christophe

>
> Reported-by: Dmitry Vyukov <[email protected]>
> Signed-off-by: Marco Elver <[email protected]>
> ---
> v2:
> * New patch.
> ---
> arch/powerpc/kernel/hw_breakpoint.c | 53 ++++++++++++++++++++++-------
> 1 file changed, 40 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
> index 2669f80b3a49..8db1a15d7acb 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -15,6 +15,7 @@
> #include <linux/kernel.h>
> #include <linux/sched.h>
> #include <linux/smp.h>
> +#include <linux/spinlock.h>
> #include <linux/debugfs.h>
> #include <linux/init.h>
>
> @@ -129,7 +130,14 @@ struct breakpoint {
> bool ptrace_bp;
> };
>
> +/*
> + * While kernel/events/hw_breakpoint.c does its own synchronization, we cannot
> + * rely on it safely synchronizing internals here; however, we can rely on it
> + * not requesting more breakpoints than available.
> + */
> +static DEFINE_SPINLOCK(cpu_bps_lock);
> static DEFINE_PER_CPU(struct breakpoint *, cpu_bps[HBP_NUM_MAX]);
> +static DEFINE_SPINLOCK(task_bps_lock);
> static LIST_HEAD(task_bps);
>
> static struct breakpoint *alloc_breakpoint(struct perf_event *bp)
> @@ -174,7 +182,9 @@ static int task_bps_add(struct perf_event *bp)
> if (IS_ERR(tmp))
> return PTR_ERR(tmp);
>
> + spin_lock(&task_bps_lock);
> list_add(&tmp->list, &task_bps);
> + spin_unlock(&task_bps_lock);
> return 0;
> }
>
> @@ -182,6 +192,7 @@ static void task_bps_remove(struct perf_event *bp)
> {
> struct list_head *pos, *q;
>
> + spin_lock(&task_bps_lock);
> list_for_each_safe(pos, q, &task_bps) {
> struct breakpoint *tmp = list_entry(pos, struct breakpoint, list);
>
> @@ -191,6 +202,7 @@ static void task_bps_remove(struct perf_event *bp)
> break;
> }
> }
> + spin_unlock(&task_bps_lock);
> }
>
> /*
> @@ -200,12 +212,17 @@ static void task_bps_remove(struct perf_event *bp)
> static bool all_task_bps_check(struct perf_event *bp)
> {
> struct breakpoint *tmp;
> + bool ret = false;
>
> + spin_lock(&task_bps_lock);
> list_for_each_entry(tmp, &task_bps, list) {
> - if (!can_co_exist(tmp, bp))
> - return true;
> + if (!can_co_exist(tmp, bp)) {
> + ret = true;
> + break;
> + }
> }
> - return false;
> + spin_unlock(&task_bps_lock);
> + return ret;
> }
>
> /*
> @@ -215,13 +232,18 @@ static bool all_task_bps_check(struct perf_event *bp)
> static bool same_task_bps_check(struct perf_event *bp)
> {
> struct breakpoint *tmp;
> + bool ret = false;
>
> + spin_lock(&task_bps_lock);
> list_for_each_entry(tmp, &task_bps, list) {
> if (tmp->bp->hw.target == bp->hw.target &&
> - !can_co_exist(tmp, bp))
> - return true;
> + !can_co_exist(tmp, bp)) {
> + ret = true;
> + break;
> + }
> }
> - return false;
> + spin_unlock(&task_bps_lock);
> + return ret;
> }
>
> static int cpu_bps_add(struct perf_event *bp)
> @@ -234,6 +256,7 @@ static int cpu_bps_add(struct perf_event *bp)
> if (IS_ERR(tmp))
> return PTR_ERR(tmp);
>
> + spin_lock(&cpu_bps_lock);
> cpu_bp = per_cpu_ptr(cpu_bps, bp->cpu);
> for (i = 0; i < nr_wp_slots(); i++) {
> if (!cpu_bp[i]) {
> @@ -241,6 +264,7 @@ static int cpu_bps_add(struct perf_event *bp)
> break;
> }
> }
> + spin_unlock(&cpu_bps_lock);
> return 0;
> }
>
> @@ -249,6 +273,7 @@ static void cpu_bps_remove(struct perf_event *bp)
> struct breakpoint **cpu_bp;
> int i = 0;
>
> + spin_lock(&cpu_bps_lock);
> cpu_bp = per_cpu_ptr(cpu_bps, bp->cpu);
> for (i = 0; i < nr_wp_slots(); i++) {
> if (!cpu_bp[i])
> @@ -260,19 +285,25 @@ static void cpu_bps_remove(struct perf_event *bp)
> break;
> }
> }
> + spin_unlock(&cpu_bps_lock);
> }
>
> static bool cpu_bps_check(int cpu, struct perf_event *bp)
> {
> struct breakpoint **cpu_bp;
> + bool ret = false;
> int i;
>
> + spin_lock(&cpu_bps_lock);
> cpu_bp = per_cpu_ptr(cpu_bps, cpu);
> for (i = 0; i < nr_wp_slots(); i++) {
> - if (cpu_bp[i] && !can_co_exist(cpu_bp[i], bp))
> - return true;
> + if (cpu_bp[i] && !can_co_exist(cpu_bp[i], bp)) {
> + ret = true;
> + break;
> + }
> }
> - return false;
> + spin_unlock(&cpu_bps_lock);
> + return ret;
> }
>
> static bool all_cpu_bps_check(struct perf_event *bp)
> @@ -286,10 +317,6 @@ static bool all_cpu_bps_check(struct perf_event *bp)
> return false;
> }
>
> -/*
> - * We don't use any locks to serialize accesses to cpu_bps or task_bps
> - * because are already inside nr_bp_mutex.
> - */
> int arch_reserve_bp_slot(struct perf_event *bp)
> {
> int ret;

2022-07-01 10:19:50

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v2 08/13] powerpc/hw_breakpoint: Avoid relying on caller synchronization

On Fri, 1 Jul 2022 at 10:54, Christophe Leroy
<[email protected]> wrote:
>
> Hi Marco,
>
> On 28/06/2022 at 11:58, Marco Elver wrote:
> > Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
> > implementation have relied on nr_bp_mutex serializing access to them.
> >
> > Before overhauling synchronization of kernel/events/hw_breakpoint.c,
> > introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively,
> > thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint.
>
> We have a still-open old issue in our database related to
> hw_breakpoint, and I was wondering whether it could have any link with
> the changes you are doing and whether you could handle it at the same
> time.
>
> https://github.com/linuxppc/issues/issues/38
>
> Maybe it is completely unrelated, but as this patch of your series only
> touches powerpc and the issue says that powerpc is the only one to do
> that, I thought it might be worth a heads-up.

I see the powerpc issue as unrelated to the optimizations in this
series; perhaps fixing the powerpc issue would also make it more
optimal. But all I saw is that powerpc just so happens to rely on
the nr_bp_mutex, which is going away.

This series would become even more complex if I decided to add a
powerpc rework on top (notwithstanding the fact that I don't have any
ppc hardware at my disposal either). A separate series/patch seems much
more appropriate.

Thanks,
-- Marco

2022-07-01 10:52:58

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v2 08/13] powerpc/hw_breakpoint: Avoid relying on caller synchronization



On 01/07/2022 at 11:41, Marco Elver wrote:
> On Fri, 1 Jul 2022 at 10:54, Christophe Leroy
> <[email protected]> wrote:
>>
>> Hi Marco,
>>
>> On 28/06/2022 at 11:58, Marco Elver wrote:
>>> Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
>>> implementation have relied on nr_bp_mutex serializing access to them.
>>>
>>> Before overhauling synchronization of kernel/events/hw_breakpoint.c,
>>> introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively,
>>> thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint.
>>
>> We have a still-open old issue in our database related to
>> hw_breakpoint, and I was wondering whether it could have any link with
>> the changes you are doing and whether you could handle it at the same
>> time.
>>
>> https://github.com/linuxppc/issues/issues/38
>>
>> Maybe it is completely unrelated, but as this patch of your series only
>> touches powerpc and the issue says that powerpc is the only one to do
>> that, I thought it might be worth a heads-up.
>
> I see the powerpc issue as unrelated to the optimizations in this
> series; perhaps fixing the powerpc issue would also make it more
> optimal. But all I saw is that powerpc just so happens to rely on
> the nr_bp_mutex, which is going away.
>
> This series would become even more complex if I decided to add a
> powerpc rework on top (notwithstanding the fact that I don't have any
> ppc hardware at my disposal either). A separate series/patch seems much
> more appropriate.
>

Fair enough. Thanks for answering and clarifying.

Christophe