2022-07-04 15:24:37

by Marco Elver

[permalink] [raw]
Subject: [PATCH v3 00/14] perf/hw_breakpoint: Optimize for thousands of tasks

The hw_breakpoint subsystem's code has seen little change in over 10
years. In that time, systems with >100s of CPUs have become common,
along with improvements to the perf subsystem: using breakpoints on
thousands of concurrent tasks should be a supported usecase.

The breakpoint constraints accounting algorithm is the major bottleneck
in doing so:

1. toggle_bp_slot() and fetch_bp_busy_slots() are O(#cpus * #tasks):
Both iterate through all CPUs and call task_bp_pinned(), which is
O(#tasks).

2. Everything is serialized on a global mutex, 'nr_bp_mutex'.

The series progresses with the simpler optimizations and finishes with
the more complex optimizations:

1. We first optimize task_bp_pinned() to only take O(1) on average.

2. Rework synchronization to allow concurrency when checking and
updating breakpoint constraints for tasks.

3. Eliminate the O(#cpus) loops in the CPU-independent case.

Along the way, smaller micro-optimizations and cleanups are done as they
seemed obvious when staring at the code (but likely insignificant).

The result is (on a system with 256 CPUs) that we go from:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
[ ^ more aggressive benchmark parameters took too long ]
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 236.418 [sec]
|
| 123134.794271 usecs/op
| 7880626.833333 usecs/op/cpu

... to the following with all optimizations:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.067 [sec]
|
| 35.292187 usecs/op
| 2258.700000 usecs/op/cpu

On the used test system, that's an effective speedup of ~3490x per op.

Which is on par with the theoretical ideal performance through
optimizations in hw_breakpoint.c (constraints accounting disabled), and
only 12% slower than no breakpoints at all.

Changelog
---------

v3:
* Fix typos.
* Introduce hw_breakpoint_is_used() for the test.
* Add WARN_ON in bp_blots_histogram_add().
* Don't use raw_smp_processor_id() in test.
* Apply Acked-by/Reviewed-by given in v2 for mostly unchanged patches.

v2: https://lkml.kernel.org/r/[email protected]
* Add KUnit test suite.
* Remove struct bp_busy_slots and simplify functions.
* Add "powerpc/hw_breakpoint: Avoid relying on caller synchronization".
* Add "locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked()".
* Use percpu-rwsem instead of rwlock.
* Use task_struct::perf_event_mutex instead of sharded mutex.
* Drop v1 "perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent".
* Add "perf/hw_breakpoint: Introduce bp_slots_histogram".
* Add "perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets".
* Add "perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets".
* Apply Acked-by/Reviewed-by given in v1 for unchanged patches.
==> Speedup of ~3490x (vs. ~3315x in v1).

v1: https://lore.kernel.org/all/[email protected]/

Marco Elver (14):
perf/hw_breakpoint: Add KUnit test for constraints accounting
perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test
perf/hw_breakpoint: Clean up headers
perf/hw_breakpoint: Optimize list of per-task breakpoints
perf/hw_breakpoint: Mark data __ro_after_init
perf/hw_breakpoint: Optimize constant number of breakpoint slots
perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
perf/hw_breakpoint: Remove useless code related to flexible
breakpoints
powerpc/hw_breakpoint: Avoid relying on caller synchronization
locking/percpu-rwsem: Add percpu_is_write_locked() and
percpu_is_read_locked()
perf/hw_breakpoint: Reduce contention with large number of tasks
perf/hw_breakpoint: Introduce bp_slots_histogram
perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent
task targets
perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task
targets

arch/powerpc/kernel/hw_breakpoint.c | 53 ++-
arch/sh/include/asm/hw_breakpoint.h | 5 +-
arch/x86/include/asm/hw_breakpoint.h | 5 +-
include/linux/hw_breakpoint.h | 4 +-
include/linux/percpu-rwsem.h | 6 +
include/linux/perf_event.h | 3 +-
kernel/events/Makefile | 1 +
kernel/events/hw_breakpoint.c | 638 ++++++++++++++++++++-------
kernel/events/hw_breakpoint_test.c | 333 ++++++++++++++
kernel/locking/percpu-rwsem.c | 6 +
lib/Kconfig.debug | 10 +
11 files changed, 885 insertions(+), 179 deletions(-)
create mode 100644 kernel/events/hw_breakpoint_test.c

--
2.37.0.rc0.161.g10f37bed90-goog


2022-07-04 15:39:23

by Marco Elver

[permalink] [raw]
Subject: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

Add KUnit test for hw_breakpoint constraints accounting, with various
interesting mixes of breakpoint targets (some care was taken to catch
interesting corner cases via bug-injection).

The test cannot be built as a module because it requires access to
hw_breakpoint_slots(), which is not inlinable or exported on all
architectures.

Signed-off-by: Marco Elver <[email protected]>
---
v3:
* Don't use raw_smp_processor_id().

v2:
* New patch.
---
kernel/events/Makefile | 1 +
kernel/events/hw_breakpoint_test.c | 323 +++++++++++++++++++++++++++++
lib/Kconfig.debug | 10 +
3 files changed, 334 insertions(+)
create mode 100644 kernel/events/hw_breakpoint_test.c

diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 8591c180b52b..91a62f566743 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,4 +2,5 @@
obj-y := core.o ring_buffer.o callchain.o

obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
+obj-$(CONFIG_HW_BREAKPOINT_KUNIT_TEST) += hw_breakpoint_test.o
obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/hw_breakpoint_test.c b/kernel/events/hw_breakpoint_test.c
new file mode 100644
index 000000000000..433c5c45e2a5
--- /dev/null
+++ b/kernel/events/hw_breakpoint_test.c
@@ -0,0 +1,323 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit test for hw_breakpoint constraints accounting logic.
+ *
+ * Copyright (C) 2022, Google LLC.
+ */
+
+#include <kunit/test.h>
+#include <linux/cpumask.h>
+#include <linux/hw_breakpoint.h>
+#include <linux/kthread.h>
+#include <linux/perf_event.h>
+#include <asm/hw_breakpoint.h>
+
+#define TEST_REQUIRES_BP_SLOTS(test, slots) \
+ do { \
+ if ((slots) > get_test_bp_slots()) { \
+ kunit_skip((test), "Requires breakpoint slots: %d > %d", slots, \
+ get_test_bp_slots()); \
+ } \
+ } while (0)
+
+#define TEST_EXPECT_NOSPC(expr) KUNIT_EXPECT_EQ(test, -ENOSPC, PTR_ERR(expr))
+
+#define MAX_TEST_BREAKPOINTS 512
+
+static char break_vars[MAX_TEST_BREAKPOINTS];
+static struct perf_event *test_bps[MAX_TEST_BREAKPOINTS];
+static struct task_struct *__other_task;
+
+static struct perf_event *register_test_bp(int cpu, struct task_struct *tsk, int idx)
+{
+ struct perf_event_attr attr = {};
+
+ if (WARN_ON(idx < 0 || idx >= MAX_TEST_BREAKPOINTS))
+ return NULL;
+
+ hw_breakpoint_init(&attr);
+ attr.bp_addr = (unsigned long)&break_vars[idx];
+ attr.bp_len = HW_BREAKPOINT_LEN_1;
+ attr.bp_type = HW_BREAKPOINT_RW;
+ return perf_event_create_kernel_counter(&attr, cpu, tsk, NULL, NULL);
+}
+
+static void unregister_test_bp(struct perf_event **bp)
+{
+ if (WARN_ON(IS_ERR(*bp)))
+ return;
+ if (WARN_ON(!*bp))
+ return;
+ unregister_hw_breakpoint(*bp);
+ *bp = NULL;
+}
+
+static int get_test_bp_slots(void)
+{
+ static int slots;
+
+ if (!slots)
+ slots = hw_breakpoint_slots(TYPE_DATA);
+
+ return slots;
+}
+
+static void fill_one_bp_slot(struct kunit *test, int *id, int cpu, struct task_struct *tsk)
+{
+ struct perf_event *bp = register_test_bp(cpu, tsk, *id);
+
+ KUNIT_ASSERT_NOT_NULL(test, bp);
+ KUNIT_ASSERT_FALSE(test, IS_ERR(bp));
+ KUNIT_ASSERT_NULL(test, test_bps[*id]);
+ test_bps[(*id)++] = bp;
+}
+
+/*
+ * Fills up the given @cpu/@tsk with breakpoints, only leaving @skip slots free.
+ *
+ * Returns true if this can be called again, continuing at @id.
+ */
+static bool fill_bp_slots(struct kunit *test, int *id, int cpu, struct task_struct *tsk, int skip)
+{
+ for (int i = 0; i < get_test_bp_slots() - skip; ++i)
+ fill_one_bp_slot(test, id, cpu, tsk);
+
+ return *id + get_test_bp_slots() <= MAX_TEST_BREAKPOINTS;
+}
+
+static int dummy_kthread(void *arg)
+{
+ return 0;
+}
+
+static struct task_struct *get_other_task(struct kunit *test)
+{
+ struct task_struct *tsk;
+
+ if (__other_task)
+ return __other_task;
+
+ tsk = kthread_create(dummy_kthread, NULL, "hw_breakpoint_dummy_task");
+ KUNIT_ASSERT_FALSE(test, IS_ERR(tsk));
+ __other_task = tsk;
+ return __other_task;
+}
+
+static int get_test_cpu(int num)
+{
+ int cpu;
+
+ WARN_ON(num < 0);
+
+ for_each_online_cpu(cpu) {
+ if (num-- <= 0)
+ break;
+ }
+
+ return cpu;
+}
+
+/* ===== Test cases ===== */
+
+static void test_one_cpu(struct kunit *test)
+{
+ int idx = 0;
+
+ fill_bp_slots(test, &idx, get_test_cpu(0), NULL, 0);
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+}
+
+static void test_many_cpus(struct kunit *test)
+{
+ int idx = 0;
+ int cpu;
+
+ /* Test that CPUs are independent. */
+ for_each_online_cpu(cpu) {
+ bool do_continue = fill_bp_slots(test, &idx, cpu, NULL, 0);
+
+ TEST_EXPECT_NOSPC(register_test_bp(cpu, NULL, idx));
+ if (!do_continue)
+ break;
+ }
+}
+
+static void test_one_task_on_all_cpus(struct kunit *test)
+{
+ int idx = 0;
+
+ fill_bp_slots(test, &idx, -1, current, 0);
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+ /* Remove one and adding back CPU-target should work. */
+ unregister_test_bp(&test_bps[0]);
+ fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
+}
+
+static void test_two_tasks_on_all_cpus(struct kunit *test)
+{
+ int idx = 0;
+
+ /* Test that tasks are independent. */
+ fill_bp_slots(test, &idx, -1, current, 0);
+ fill_bp_slots(test, &idx, -1, get_other_task(test), 0);
+
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(-1, get_other_task(test), idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), get_other_task(test), idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+ /* Remove one from first task and adding back CPU-target should not work. */
+ unregister_test_bp(&test_bps[0]);
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+}
+
+static void test_one_task_on_one_cpu(struct kunit *test)
+{
+ int idx = 0;
+
+ fill_bp_slots(test, &idx, get_test_cpu(0), current, 0);
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+ /*
+ * Remove one and adding back CPU-target should work; this case is
+ * special vs. above because the task's constraints are CPU-dependent.
+ */
+ unregister_test_bp(&test_bps[0]);
+ fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
+}
+
+static void test_one_task_mixed(struct kunit *test)
+{
+ int idx = 0;
+
+ TEST_REQUIRES_BP_SLOTS(test, 3);
+
+ fill_one_bp_slot(test, &idx, get_test_cpu(0), current);
+ fill_bp_slots(test, &idx, -1, current, 1);
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+
+ /* Transition from CPU-dependent pinned count to CPU-independent. */
+ unregister_test_bp(&test_bps[0]);
+ unregister_test_bp(&test_bps[1]);
+ fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
+ fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+}
+
+static void test_two_tasks_on_one_cpu(struct kunit *test)
+{
+ int idx = 0;
+
+ fill_bp_slots(test, &idx, get_test_cpu(0), current, 0);
+ fill_bp_slots(test, &idx, get_test_cpu(0), get_other_task(test), 0);
+
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(-1, get_other_task(test), idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), get_other_task(test), idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+ /* Can still create breakpoints on some other CPU. */
+ fill_bp_slots(test, &idx, get_test_cpu(1), NULL, 0);
+}
+
+static void test_two_tasks_on_one_all_cpus(struct kunit *test)
+{
+ int idx = 0;
+
+ fill_bp_slots(test, &idx, get_test_cpu(0), current, 0);
+ fill_bp_slots(test, &idx, -1, get_other_task(test), 0);
+
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(-1, get_other_task(test), idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), get_other_task(test), idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+ /* Cannot create breakpoints on some other CPU either. */
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(1), NULL, idx));
+}
+
+static void test_task_on_all_and_one_cpu(struct kunit *test)
+{
+ int tsk_on_cpu_idx, cpu_idx;
+ int idx = 0;
+
+ TEST_REQUIRES_BP_SLOTS(test, 3);
+
+ fill_bp_slots(test, &idx, -1, current, 2);
+ /* Transitioning from only all CPU breakpoints to mixed. */
+ tsk_on_cpu_idx = idx;
+ fill_one_bp_slot(test, &idx, get_test_cpu(0), current);
+ fill_one_bp_slot(test, &idx, -1, current);
+
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+
+ /* We should still be able to use up another CPU's slots. */
+ cpu_idx = idx;
+ fill_one_bp_slot(test, &idx, get_test_cpu(1), NULL);
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(1), NULL, idx));
+
+ /* Transitioning back to task target on all CPUs. */
+ unregister_test_bp(&test_bps[tsk_on_cpu_idx]);
+ /* Still have a CPU target breakpoint in get_test_cpu(1). */
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ /* Remove it and try again. */
+ unregister_test_bp(&test_bps[cpu_idx]);
+ fill_one_bp_slot(test, &idx, -1, current);
+
+ TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
+ TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(1), NULL, idx));
+}
+
+static struct kunit_case hw_breakpoint_test_cases[] = {
+ KUNIT_CASE(test_one_cpu),
+ KUNIT_CASE(test_many_cpus),
+ KUNIT_CASE(test_one_task_on_all_cpus),
+ KUNIT_CASE(test_two_tasks_on_all_cpus),
+ KUNIT_CASE(test_one_task_on_one_cpu),
+ KUNIT_CASE(test_one_task_mixed),
+ KUNIT_CASE(test_two_tasks_on_one_cpu),
+ KUNIT_CASE(test_two_tasks_on_one_all_cpus),
+ KUNIT_CASE(test_task_on_all_and_one_cpu),
+ {},
+};
+
+static int test_init(struct kunit *test)
+{
+ /* Most test cases want 2 distinct CPUs. */
+ return num_online_cpus() < 2 ? -EINVAL : 0;
+}
+
+static void test_exit(struct kunit *test)
+{
+ for (int i = 0; i < MAX_TEST_BREAKPOINTS; ++i) {
+ if (test_bps[i])
+ unregister_test_bp(&test_bps[i]);
+ }
+
+ if (__other_task) {
+ kthread_stop(__other_task);
+ __other_task = NULL;
+ }
+}
+
+static struct kunit_suite hw_breakpoint_test_suite = {
+ .name = "hw_breakpoint",
+ .test_cases = hw_breakpoint_test_cases,
+ .init = test_init,
+ .exit = test_exit,
+};
+
+kunit_test_suites(&hw_breakpoint_test_suite);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Marco Elver <[email protected]>");
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2e24db4bff19..4c87a6edf046 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2513,6 +2513,16 @@ config STACKINIT_KUNIT_TEST
CONFIG_GCC_PLUGIN_STRUCTLEAK, CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF,
or CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL.

+config HW_BREAKPOINT_KUNIT_TEST
+ bool "Test hw_breakpoint constraints accounting" if !KUNIT_ALL_TESTS
+ depends on HAVE_HW_BREAKPOINT
+ depends on KUNIT=y
+ default KUNIT_ALL_TESTS
+ help
+ Tests for hw_breakpoint constraints accounting.
+
+ If unsure, say N.
+
config TEST_UDELAY
tristate "udelay test driver"
help
--
2.37.0.rc0.161.g10f37bed90-goog

2022-07-12 14:02:53

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf/hw_breakpoint: Optimize for thousands of tasks

On Mon, 4 Jul 2022 at 17:05, Marco Elver <[email protected]> wrote:
>
> The hw_breakpoint subsystem's code has seen little change in over 10
> years. In that time, systems with >100s of CPUs have become common,
> along with improvements to the perf subsystem: using breakpoints on
> thousands of concurrent tasks should be a supported usecase.
[...]
> Marco Elver (14):
> perf/hw_breakpoint: Add KUnit test for constraints accounting
> perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test
> perf/hw_breakpoint: Clean up headers
> perf/hw_breakpoint: Optimize list of per-task breakpoints
> perf/hw_breakpoint: Mark data __ro_after_init
> perf/hw_breakpoint: Optimize constant number of breakpoint slots
> perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
> perf/hw_breakpoint: Remove useless code related to flexible
> breakpoints
> powerpc/hw_breakpoint: Avoid relying on caller synchronization
> locking/percpu-rwsem: Add percpu_is_write_locked() and
> percpu_is_read_locked()
> perf/hw_breakpoint: Reduce contention with large number of tasks
> perf/hw_breakpoint: Introduce bp_slots_histogram
> perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent
> task targets
> perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task
> targets
[...]

This is ready from our side, and given the silence, assume it's ready
to pick up and/or have a maintainer take a look. Since this is mostly
kernel/events, would -tip/perf/core be appropriate?

Thanks,
-- Marco

2022-07-20 16:22:33

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf/hw_breakpoint: Optimize for thousands of tasks

On Tue, Jul 12, 2022 at 6:41 AM Marco Elver <[email protected]> wrote:
>
> On Mon, 4 Jul 2022 at 17:05, Marco Elver <[email protected]> wrote:
> >
> > The hw_breakpoint subsystem's code has seen little change in over 10
> > years. In that time, systems with >100s of CPUs have become common,
> > along with improvements to the perf subsystem: using breakpoints on
> > thousands of concurrent tasks should be a supported usecase.
> [...]
> > Marco Elver (14):
> > perf/hw_breakpoint: Add KUnit test for constraints accounting
> > perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test
> > perf/hw_breakpoint: Clean up headers
> > perf/hw_breakpoint: Optimize list of per-task breakpoints
> > perf/hw_breakpoint: Mark data __ro_after_init
> > perf/hw_breakpoint: Optimize constant number of breakpoint slots
> > perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
> > perf/hw_breakpoint: Remove useless code related to flexible
> > breakpoints
> > powerpc/hw_breakpoint: Avoid relying on caller synchronization
> > locking/percpu-rwsem: Add percpu_is_write_locked() and
> > percpu_is_read_locked()
> > perf/hw_breakpoint: Reduce contention with large number of tasks
> > perf/hw_breakpoint: Introduce bp_slots_histogram
> > perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent
> > task targets
> > perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task
> > targets
> [...]
>
> This is ready from our side, and given the silence, assume it's ready
> to pick up and/or have a maintainer take a look. Since this is mostly
> kernel/events, would -tip/perf/core be appropriate?

These are awesome improvements, I've added my acked-by to every
change. I hope we can pull these changes, as you say, into tip.git
perf/core and get them into 5.20.

Thanks,
Ian

> Thanks,
> -- Marco

2022-07-21 16:57:04

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

Hi Marco,

[adding Will]

On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> Add KUnit test for hw_breakpoint constraints accounting, with various
> interesting mixes of breakpoint targets (some care was taken to catch
> interesting corner cases via bug-injection).
>
> The test cannot be built as a module because it requires access to
> hw_breakpoint_slots(), which is not inlinable or exported on all
> architectures.
>
> Signed-off-by: Marco Elver <[email protected]>

As mentioned on IRC, I'm seeing these tests fail on arm64 when applied atop
v5.19-rc7:

| TAP version 14
| 1..1
| # Subtest: hw_breakpoint
| 1..9
| ok 1 - test_one_cpu
| ok 2 - test_many_cpus
| # test_one_task_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 3 - test_one_task_on_all_cpus
| # test_two_tasks_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 4 - test_two_tasks_on_all_cpus
| # test_one_task_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 5 - test_one_task_on_one_cpu
| # test_one_task_mixed: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 6 - test_one_task_mixed
| # test_two_tasks_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 7 - test_two_tasks_on_one_cpu
| # test_two_tasks_on_one_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 8 - test_two_tasks_on_one_all_cpus
| # test_task_on_all_and_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
| Expected IS_ERR(bp) to be false, but is true
| not ok 9 - test_task_on_all_and_one_cpu
| # hw_breakpoint: pass:2 fail:7 skip:0 total:9
| # Totals: pass:2 fail:7 skip:0 total:9

... which seems to be becasue arm64 currently forbids per-task
breakpoints/watchpoints in hw_breakpoint_arch_parse(), where we have:

/*
* Disallow per-task kernel breakpoints since these would
* complicate the stepping code.
*/
if (hw->ctrl.privilege == AARCH64_BREAKPOINT_EL1 && bp->hw.target)
return -EINVAL;

... which has been the case since day one in commit:

478fcb2cdb2351dc ("arm64: Debugging support")

I'm not immediately sure what would be necessary to support per-task kernel
breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
invasive.

Mark.

> ---
> v3:
> * Don't use raw_smp_processor_id().
>
> v2:
> * New patch.
> ---
> kernel/events/Makefile | 1 +
> kernel/events/hw_breakpoint_test.c | 323 +++++++++++++++++++++++++++++
> lib/Kconfig.debug | 10 +
> 3 files changed, 334 insertions(+)
> create mode 100644 kernel/events/hw_breakpoint_test.c
>
> diff --git a/kernel/events/Makefile b/kernel/events/Makefile
> index 8591c180b52b..91a62f566743 100644
> --- a/kernel/events/Makefile
> +++ b/kernel/events/Makefile
> @@ -2,4 +2,5 @@
> obj-y := core.o ring_buffer.o callchain.o
>
> obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
> +obj-$(CONFIG_HW_BREAKPOINT_KUNIT_TEST) += hw_breakpoint_test.o
> obj-$(CONFIG_UPROBES) += uprobes.o
> diff --git a/kernel/events/hw_breakpoint_test.c b/kernel/events/hw_breakpoint_test.c
> new file mode 100644
> index 000000000000..433c5c45e2a5
> --- /dev/null
> +++ b/kernel/events/hw_breakpoint_test.c
> @@ -0,0 +1,323 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit test for hw_breakpoint constraints accounting logic.
> + *
> + * Copyright (C) 2022, Google LLC.
> + */
> +
> +#include <kunit/test.h>
> +#include <linux/cpumask.h>
> +#include <linux/hw_breakpoint.h>
> +#include <linux/kthread.h>
> +#include <linux/perf_event.h>
> +#include <asm/hw_breakpoint.h>
> +
> +#define TEST_REQUIRES_BP_SLOTS(test, slots) \
> + do { \
> + if ((slots) > get_test_bp_slots()) { \
> + kunit_skip((test), "Requires breakpoint slots: %d > %d", slots, \
> + get_test_bp_slots()); \
> + } \
> + } while (0)
> +
> +#define TEST_EXPECT_NOSPC(expr) KUNIT_EXPECT_EQ(test, -ENOSPC, PTR_ERR(expr))
> +
> +#define MAX_TEST_BREAKPOINTS 512
> +
> +static char break_vars[MAX_TEST_BREAKPOINTS];
> +static struct perf_event *test_bps[MAX_TEST_BREAKPOINTS];
> +static struct task_struct *__other_task;
> +
> +static struct perf_event *register_test_bp(int cpu, struct task_struct *tsk, int idx)
> +{
> + struct perf_event_attr attr = {};
> +
> + if (WARN_ON(idx < 0 || idx >= MAX_TEST_BREAKPOINTS))
> + return NULL;
> +
> + hw_breakpoint_init(&attr);
> + attr.bp_addr = (unsigned long)&break_vars[idx];
> + attr.bp_len = HW_BREAKPOINT_LEN_1;
> + attr.bp_type = HW_BREAKPOINT_RW;
> + return perf_event_create_kernel_counter(&attr, cpu, tsk, NULL, NULL);
> +}
> +
> +static void unregister_test_bp(struct perf_event **bp)
> +{
> + if (WARN_ON(IS_ERR(*bp)))
> + return;
> + if (WARN_ON(!*bp))
> + return;
> + unregister_hw_breakpoint(*bp);
> + *bp = NULL;
> +}
> +
> +static int get_test_bp_slots(void)
> +{
> + static int slots;
> +
> + if (!slots)
> + slots = hw_breakpoint_slots(TYPE_DATA);
> +
> + return slots;
> +}
> +
> +static void fill_one_bp_slot(struct kunit *test, int *id, int cpu, struct task_struct *tsk)
> +{
> + struct perf_event *bp = register_test_bp(cpu, tsk, *id);
> +
> + KUNIT_ASSERT_NOT_NULL(test, bp);
> + KUNIT_ASSERT_FALSE(test, IS_ERR(bp));
> + KUNIT_ASSERT_NULL(test, test_bps[*id]);
> + test_bps[(*id)++] = bp;
> +}
> +
> +/*
> + * Fills up the given @cpu/@tsk with breakpoints, only leaving @skip slots free.
> + *
> + * Returns true if this can be called again, continuing at @id.
> + */
> +static bool fill_bp_slots(struct kunit *test, int *id, int cpu, struct task_struct *tsk, int skip)
> +{
> + for (int i = 0; i < get_test_bp_slots() - skip; ++i)
> + fill_one_bp_slot(test, id, cpu, tsk);
> +
> + return *id + get_test_bp_slots() <= MAX_TEST_BREAKPOINTS;
> +}
> +
> +static int dummy_kthread(void *arg)
> +{
> + return 0;
> +}
> +
> +static struct task_struct *get_other_task(struct kunit *test)
> +{
> + struct task_struct *tsk;
> +
> + if (__other_task)
> + return __other_task;
> +
> + tsk = kthread_create(dummy_kthread, NULL, "hw_breakpoint_dummy_task");
> + KUNIT_ASSERT_FALSE(test, IS_ERR(tsk));
> + __other_task = tsk;
> + return __other_task;
> +}
> +
> +static int get_test_cpu(int num)
> +{
> + int cpu;
> +
> + WARN_ON(num < 0);
> +
> + for_each_online_cpu(cpu) {
> + if (num-- <= 0)
> + break;
> + }
> +
> + return cpu;
> +}
> +
> +/* ===== Test cases ===== */
> +
> +static void test_one_cpu(struct kunit *test)
> +{
> + int idx = 0;
> +
> + fill_bp_slots(test, &idx, get_test_cpu(0), NULL, 0);
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> +}
> +
> +static void test_many_cpus(struct kunit *test)
> +{
> + int idx = 0;
> + int cpu;
> +
> + /* Test that CPUs are independent. */
> + for_each_online_cpu(cpu) {
> + bool do_continue = fill_bp_slots(test, &idx, cpu, NULL, 0);
> +
> + TEST_EXPECT_NOSPC(register_test_bp(cpu, NULL, idx));
> + if (!do_continue)
> + break;
> + }
> +}
> +
> +static void test_one_task_on_all_cpus(struct kunit *test)
> +{
> + int idx = 0;
> +
> + fill_bp_slots(test, &idx, -1, current, 0);
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> + /* Remove one and adding back CPU-target should work. */
> + unregister_test_bp(&test_bps[0]);
> + fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
> +}
> +
> +static void test_two_tasks_on_all_cpus(struct kunit *test)
> +{
> + int idx = 0;
> +
> + /* Test that tasks are independent. */
> + fill_bp_slots(test, &idx, -1, current, 0);
> + fill_bp_slots(test, &idx, -1, get_other_task(test), 0);
> +
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(-1, get_other_task(test), idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), get_other_task(test), idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> + /* Remove one from first task and adding back CPU-target should not work. */
> + unregister_test_bp(&test_bps[0]);
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> +}
> +
> +static void test_one_task_on_one_cpu(struct kunit *test)
> +{
> + int idx = 0;
> +
> + fill_bp_slots(test, &idx, get_test_cpu(0), current, 0);
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> + /*
> + * Remove one and adding back CPU-target should work; this case is
> + * special vs. above because the task's constraints are CPU-dependent.
> + */
> + unregister_test_bp(&test_bps[0]);
> + fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
> +}
> +
> +static void test_one_task_mixed(struct kunit *test)
> +{
> + int idx = 0;
> +
> + TEST_REQUIRES_BP_SLOTS(test, 3);
> +
> + fill_one_bp_slot(test, &idx, get_test_cpu(0), current);
> + fill_bp_slots(test, &idx, -1, current, 1);
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> +
> + /* Transition from CPU-dependent pinned count to CPU-independent. */
> + unregister_test_bp(&test_bps[0]);
> + unregister_test_bp(&test_bps[1]);
> + fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
> + fill_one_bp_slot(test, &idx, get_test_cpu(0), NULL);
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> +}
> +
> +static void test_two_tasks_on_one_cpu(struct kunit *test)
> +{
> + int idx = 0;
> +
> + fill_bp_slots(test, &idx, get_test_cpu(0), current, 0);
> + fill_bp_slots(test, &idx, get_test_cpu(0), get_other_task(test), 0);
> +
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(-1, get_other_task(test), idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), get_other_task(test), idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> + /* Can still create breakpoints on some other CPU. */
> + fill_bp_slots(test, &idx, get_test_cpu(1), NULL, 0);
> +}
> +
> +static void test_two_tasks_on_one_all_cpus(struct kunit *test)
> +{
> + int idx = 0;
> +
> + fill_bp_slots(test, &idx, get_test_cpu(0), current, 0);
> + fill_bp_slots(test, &idx, -1, get_other_task(test), 0);
> +
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(-1, get_other_task(test), idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), get_other_task(test), idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> + /* Cannot create breakpoints on some other CPU either. */
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(1), NULL, idx));
> +}
> +
> +static void test_task_on_all_and_one_cpu(struct kunit *test)
> +{
> + int tsk_on_cpu_idx, cpu_idx;
> + int idx = 0;
> +
> + TEST_REQUIRES_BP_SLOTS(test, 3);
> +
> + fill_bp_slots(test, &idx, -1, current, 2);
> + /* Transitioning from only all CPU breakpoints to mixed. */
> + tsk_on_cpu_idx = idx;
> + fill_one_bp_slot(test, &idx, get_test_cpu(0), current);
> + fill_one_bp_slot(test, &idx, -1, current);
> +
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> +
> + /* We should still be able to use up another CPU's slots. */
> + cpu_idx = idx;
> + fill_one_bp_slot(test, &idx, get_test_cpu(1), NULL);
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(1), NULL, idx));
> +
> + /* Transitioning back to task target on all CPUs. */
> + unregister_test_bp(&test_bps[tsk_on_cpu_idx]);
> + /* Still have a CPU target breakpoint in get_test_cpu(1). */
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + /* Remove it and try again. */
> + unregister_test_bp(&test_bps[cpu_idx]);
> + fill_one_bp_slot(test, &idx, -1, current);
> +
> + TEST_EXPECT_NOSPC(register_test_bp(-1, current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), current, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(0), NULL, idx));
> + TEST_EXPECT_NOSPC(register_test_bp(get_test_cpu(1), NULL, idx));
> +}
> +
> +static struct kunit_case hw_breakpoint_test_cases[] = {
> + KUNIT_CASE(test_one_cpu),
> + KUNIT_CASE(test_many_cpus),
> + KUNIT_CASE(test_one_task_on_all_cpus),
> + KUNIT_CASE(test_two_tasks_on_all_cpus),
> + KUNIT_CASE(test_one_task_on_one_cpu),
> + KUNIT_CASE(test_one_task_mixed),
> + KUNIT_CASE(test_two_tasks_on_one_cpu),
> + KUNIT_CASE(test_two_tasks_on_one_all_cpus),
> + KUNIT_CASE(test_task_on_all_and_one_cpu),
> + {},
> +};
> +
> +static int test_init(struct kunit *test)
> +{
> + /* Most test cases want 2 distinct CPUs. */
> + return num_online_cpus() < 2 ? -EINVAL : 0;
> +}
> +
> +static void test_exit(struct kunit *test)
> +{
> + for (int i = 0; i < MAX_TEST_BREAKPOINTS; ++i) {
> + if (test_bps[i])
> + unregister_test_bp(&test_bps[i]);
> + }
> +
> + if (__other_task) {
> + kthread_stop(__other_task);
> + __other_task = NULL;
> + }
> +}
> +
> +static struct kunit_suite hw_breakpoint_test_suite = {
> + .name = "hw_breakpoint",
> + .test_cases = hw_breakpoint_test_cases,
> + .init = test_init,
> + .exit = test_exit,
> +};
> +
> +kunit_test_suites(&hw_breakpoint_test_suite);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Marco Elver <[email protected]>");
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 2e24db4bff19..4c87a6edf046 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2513,6 +2513,16 @@ config STACKINIT_KUNIT_TEST
> CONFIG_GCC_PLUGIN_STRUCTLEAK, CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF,
> or CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL.
>
> +config HW_BREAKPOINT_KUNIT_TEST
> + bool "Test hw_breakpoint constraints accounting" if !KUNIT_ALL_TESTS
> + depends on HAVE_HW_BREAKPOINT
> + depends on KUNIT=y
> + default KUNIT_ALL_TESTS
> + help
> + Tests for hw_breakpoint constraints accounting.
> +
> + If unsure, say N.
> +
> config TEST_UDELAY
> tristate "udelay test driver"
> help
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-07-22 09:40:57

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Thu, Jul 21, 2022 at 05:22:07PM +0100, Mark Rutland wrote:
> Hi Marco,
>
> [adding Will]
>
> On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > Add KUnit test for hw_breakpoint constraints accounting, with various
> > interesting mixes of breakpoint targets (some care was taken to catch
> > interesting corner cases via bug-injection).
> >
> > The test cannot be built as a module because it requires access to
> > hw_breakpoint_slots(), which is not inlinable or exported on all
> > architectures.
> >
> > Signed-off-by: Marco Elver <[email protected]>
>
> As mentioned on IRC, I'm seeing these tests fail on arm64 when applied atop
> v5.19-rc7:
>
> | TAP version 14
> | 1..1
> | # Subtest: hw_breakpoint
> | 1..9
> | ok 1 - test_one_cpu
> | ok 2 - test_many_cpus
> | # test_one_task_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 3 - test_one_task_on_all_cpus
> | # test_two_tasks_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 4 - test_two_tasks_on_all_cpus
> | # test_one_task_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 5 - test_one_task_on_one_cpu
> | # test_one_task_mixed: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 6 - test_one_task_mixed
> | # test_two_tasks_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 7 - test_two_tasks_on_one_cpu
> | # test_two_tasks_on_one_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 8 - test_two_tasks_on_one_all_cpus
> | # test_task_on_all_and_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 9 - test_task_on_all_and_one_cpu
> | # hw_breakpoint: pass:2 fail:7 skip:0 total:9
> | # Totals: pass:2 fail:7 skip:0 total:9
>
> ... which seems to be becasue arm64 currently forbids per-task
> breakpoints/watchpoints in hw_breakpoint_arch_parse(), where we have:
>
> /*
> * Disallow per-task kernel breakpoints since these would
> * complicate the stepping code.
> */
> if (hw->ctrl.privilege == AARCH64_BREAKPOINT_EL1 && bp->hw.target)
> return -EINVAL;
>
> ... which has been the case since day one in commit:
>
> 478fcb2cdb2351dc ("arm64: Debugging support")
>
> I'm not immediately sure what would be necessary to support per-task kernel
> breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> invasive.

I would actually like to remove HW_BREAKPOINT completely for arm64 as it
doesn't really work and causes problems for other interfaces such as ptrace
and kgdb.

Will

2022-07-22 10:18:35

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Fri, 22 Jul 2022 at 11:10, Will Deacon <[email protected]> wrote:
> > [adding Will]
> >
> > On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > > Add KUnit test for hw_breakpoint constraints accounting, with various
> > > interesting mixes of breakpoint targets (some care was taken to catch
> > > interesting corner cases via bug-injection).
> > >
> > > The test cannot be built as a module because it requires access to
> > > hw_breakpoint_slots(), which is not inlinable or exported on all
> > > architectures.
> > >
> > > Signed-off-by: Marco Elver <[email protected]>
> >
> > As mentioned on IRC, I'm seeing these tests fail on arm64 when applied atop
> > v5.19-rc7:
> >
> > | TAP version 14
> > | 1..1
> > | # Subtest: hw_breakpoint
> > | 1..9
> > | ok 1 - test_one_cpu
> > | ok 2 - test_many_cpus
> > | # test_one_task_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 3 - test_one_task_on_all_cpus
> > | # test_two_tasks_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 4 - test_two_tasks_on_all_cpus
> > | # test_one_task_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 5 - test_one_task_on_one_cpu
> > | # test_one_task_mixed: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 6 - test_one_task_mixed
> > | # test_two_tasks_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 7 - test_two_tasks_on_one_cpu
> > | # test_two_tasks_on_one_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 8 - test_two_tasks_on_one_all_cpus
> > | # test_task_on_all_and_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > | Expected IS_ERR(bp) to be false, but is true
> > | not ok 9 - test_task_on_all_and_one_cpu
> > | # hw_breakpoint: pass:2 fail:7 skip:0 total:9
> > | # Totals: pass:2 fail:7 skip:0 total:9
> >
> > ... which seems to be becasue arm64 currently forbids per-task
> > breakpoints/watchpoints in hw_breakpoint_arch_parse(), where we have:
> >
> > /*
> > * Disallow per-task kernel breakpoints since these would
> > * complicate the stepping code.
> > */
> > if (hw->ctrl.privilege == AARCH64_BREAKPOINT_EL1 && bp->hw.target)
> > return -EINVAL;
> >
> > ... which has been the case since day one in commit:
> >
> > 478fcb2cdb2351dc ("arm64: Debugging support")
> >
> > I'm not immediately sure what would be necessary to support per-task kernel
> > breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> > invasive.
>
> I would actually like to remove HW_BREAKPOINT completely for arm64 as it
> doesn't really work and causes problems for other interfaces such as ptrace
> and kgdb.

Will it be a localized removal of code that will be easy to revert in
future? Or will it touch lots of code here and there?
Let's say we come up with a very important use case for HW_BREAKPOINT
and will need to make it work on arm64 as well in future.

2022-07-22 10:52:07

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Fri, 22 Jul 2022 at 12:11, Will Deacon <[email protected]> wrote:
> > > > [adding Will]
> > > >
> > > > On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > > > > Add KUnit test for hw_breakpoint constraints accounting, with various
> > > > > interesting mixes of breakpoint targets (some care was taken to catch
> > > > > interesting corner cases via bug-injection).
> > > > >
> > > > > The test cannot be built as a module because it requires access to
> > > > > hw_breakpoint_slots(), which is not inlinable or exported on all
> > > > > architectures.
> > > > >
> > > > > Signed-off-by: Marco Elver <[email protected]>
> > > >
> > > > As mentioned on IRC, I'm seeing these tests fail on arm64 when applied atop
> > > > v5.19-rc7:
> > > >
> > > > | TAP version 14
> > > > | 1..1
> > > > | # Subtest: hw_breakpoint
> > > > | 1..9
> > > > | ok 1 - test_one_cpu
> > > > | ok 2 - test_many_cpus
> > > > | # test_one_task_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 3 - test_one_task_on_all_cpus
> > > > | # test_two_tasks_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 4 - test_two_tasks_on_all_cpus
> > > > | # test_one_task_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 5 - test_one_task_on_one_cpu
> > > > | # test_one_task_mixed: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 6 - test_one_task_mixed
> > > > | # test_two_tasks_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 7 - test_two_tasks_on_one_cpu
> > > > | # test_two_tasks_on_one_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 8 - test_two_tasks_on_one_all_cpus
> > > > | # test_task_on_all_and_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > > | Expected IS_ERR(bp) to be false, but is true
> > > > | not ok 9 - test_task_on_all_and_one_cpu
> > > > | # hw_breakpoint: pass:2 fail:7 skip:0 total:9
> > > > | # Totals: pass:2 fail:7 skip:0 total:9
> > > >
> > > > ... which seems to be becasue arm64 currently forbids per-task
> > > > breakpoints/watchpoints in hw_breakpoint_arch_parse(), where we have:
> > > >
> > > > /*
> > > > * Disallow per-task kernel breakpoints since these would
> > > > * complicate the stepping code.
> > > > */
> > > > if (hw->ctrl.privilege == AARCH64_BREAKPOINT_EL1 && bp->hw.target)
> > > > return -EINVAL;
> > > >
> > > > ... which has been the case since day one in commit:
> > > >
> > > > 478fcb2cdb2351dc ("arm64: Debugging support")
> > > >
> > > > I'm not immediately sure what would be necessary to support per-task kernel
> > > > breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> > > > invasive.
> > >
> > > I would actually like to remove HW_BREAKPOINT completely for arm64 as it
> > > doesn't really work and causes problems for other interfaces such as ptrace
> > > and kgdb.
> >
> > Will it be a localized removal of code that will be easy to revert in
> > future? Or will it touch lots of code here and there?
> > Let's say we come up with a very important use case for HW_BREAKPOINT
> > and will need to make it work on arm64 as well in future.
>
> My (rough) plan is to implement a lower-level abstraction for handling the
> underlying hardware resources, so we can layer consumers on top of that
> instead of funneling through hw_breakpoint. So if we figure out how to make
> bits of hw_breakpoint work on arm64, then it should just go on top.
>
> The main pain point for hw_breakpoint is kernel-side {break,watch}points
> and I think there are open design questions about how they should work
> on arm64, particularly when considering the interaction with user
> watchpoints triggering on uaccess routines and the possibility of hitting
> a kernel watchpoint in irq context.

I see. Our main interest would be break/watchpoints on user addresses
firing from both user-space and kernel (uaccess), so at least on irqs.

2022-07-22 11:07:39

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Fri, Jul 22, 2022 at 11:20:25AM +0200, Dmitry Vyukov wrote:
> On Fri, 22 Jul 2022 at 11:10, Will Deacon <[email protected]> wrote:
> > > [adding Will]
> > >
> > > On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > > > Add KUnit test for hw_breakpoint constraints accounting, with various
> > > > interesting mixes of breakpoint targets (some care was taken to catch
> > > > interesting corner cases via bug-injection).
> > > >
> > > > The test cannot be built as a module because it requires access to
> > > > hw_breakpoint_slots(), which is not inlinable or exported on all
> > > > architectures.
> > > >
> > > > Signed-off-by: Marco Elver <[email protected]>
> > >
> > > As mentioned on IRC, I'm seeing these tests fail on arm64 when applied atop
> > > v5.19-rc7:
> > >
> > > | TAP version 14
> > > | 1..1
> > > | # Subtest: hw_breakpoint
> > > | 1..9
> > > | ok 1 - test_one_cpu
> > > | ok 2 - test_many_cpus
> > > | # test_one_task_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 3 - test_one_task_on_all_cpus
> > > | # test_two_tasks_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 4 - test_two_tasks_on_all_cpus
> > > | # test_one_task_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 5 - test_one_task_on_one_cpu
> > > | # test_one_task_mixed: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 6 - test_one_task_mixed
> > > | # test_two_tasks_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 7 - test_two_tasks_on_one_cpu
> > > | # test_two_tasks_on_one_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 8 - test_two_tasks_on_one_all_cpus
> > > | # test_task_on_all_and_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> > > | Expected IS_ERR(bp) to be false, but is true
> > > | not ok 9 - test_task_on_all_and_one_cpu
> > > | # hw_breakpoint: pass:2 fail:7 skip:0 total:9
> > > | # Totals: pass:2 fail:7 skip:0 total:9
> > >
> > > ... which seems to be becasue arm64 currently forbids per-task
> > > breakpoints/watchpoints in hw_breakpoint_arch_parse(), where we have:
> > >
> > > /*
> > > * Disallow per-task kernel breakpoints since these would
> > > * complicate the stepping code.
> > > */
> > > if (hw->ctrl.privilege == AARCH64_BREAKPOINT_EL1 && bp->hw.target)
> > > return -EINVAL;
> > >
> > > ... which has been the case since day one in commit:
> > >
> > > 478fcb2cdb2351dc ("arm64: Debugging support")
> > >
> > > I'm not immediately sure what would be necessary to support per-task kernel
> > > breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> > > invasive.
> >
> > I would actually like to remove HW_BREAKPOINT completely for arm64 as it
> > doesn't really work and causes problems for other interfaces such as ptrace
> > and kgdb.
>
> Will it be a localized removal of code that will be easy to revert in
> future? Or will it touch lots of code here and there?
> Let's say we come up with a very important use case for HW_BREAKPOINT
> and will need to make it work on arm64 as well in future.

My (rough) plan is to implement a lower-level abstraction for handling the
underlying hardware resources, so we can layer consumers on top of that
instead of funneling through hw_breakpoint. So if we figure out how to make
bits of hw_breakpoint work on arm64, then it should just go on top.

The main pain point for hw_breakpoint is kernel-side {break,watch}points
and I think there are open design questions about how they should work
on arm64, particularly when considering the interaction with user
watchpoints triggering on uaccess routines and the possibility of hitting
a kernel watchpoint in irq context.

Will

2022-07-22 11:44:02

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Fri, Jul 22, 2022 at 12:31:45PM +0200, Dmitry Vyukov wrote:
> On Fri, 22 Jul 2022 at 12:11, Will Deacon <[email protected]> wrote:
> > > > > On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > > > > I'm not immediately sure what would be necessary to support per-task kernel
> > > > > breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> > > > > invasive.
> > > >
> > > > I would actually like to remove HW_BREAKPOINT completely for arm64 as it
> > > > doesn't really work and causes problems for other interfaces such as ptrace
> > > > and kgdb.
> > >
> > > Will it be a localized removal of code that will be easy to revert in
> > > future? Or will it touch lots of code here and there?
> > > Let's say we come up with a very important use case for HW_BREAKPOINT
> > > and will need to make it work on arm64 as well in future.
> >
> > My (rough) plan is to implement a lower-level abstraction for handling the
> > underlying hardware resources, so we can layer consumers on top of that
> > instead of funneling through hw_breakpoint. So if we figure out how to make
> > bits of hw_breakpoint work on arm64, then it should just go on top.
> >
> > The main pain point for hw_breakpoint is kernel-side {break,watch}points
> > and I think there are open design questions about how they should work
> > on arm64, particularly when considering the interaction with user
> > watchpoints triggering on uaccess routines and the possibility of hitting
> > a kernel watchpoint in irq context.
>
> I see. Our main interest would be break/watchpoints on user addresses
> firing from both user-space and kernel (uaccess), so at least on irqs.

Interesting. Do other architectures report watchpoint hits on user
addresses from kernel uaccess? It feels like this might be surprising to
some users, and it opens up questions about accesses using different virtual
aliases (e.g. via GUP) or from other entities as well (e.g. firmware,
KVM guests, DMA).

Will

2022-07-22 14:17:01

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Fri, 22 Jul 2022 at 13:03, Will Deacon <[email protected]> wrote:
> > > > > > On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > > > > > I'm not immediately sure what would be necessary to support per-task kernel
> > > > > > breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> > > > > > invasive.
> > > > >
> > > > > I would actually like to remove HW_BREAKPOINT completely for arm64 as it
> > > > > doesn't really work and causes problems for other interfaces such as ptrace
> > > > > and kgdb.
> > > >
> > > > Will it be a localized removal of code that will be easy to revert in
> > > > future? Or will it touch lots of code here and there?
> > > > Let's say we come up with a very important use case for HW_BREAKPOINT
> > > > and will need to make it work on arm64 as well in future.
> > >
> > > My (rough) plan is to implement a lower-level abstraction for handling the
> > > underlying hardware resources, so we can layer consumers on top of that
> > > instead of funneling through hw_breakpoint. So if we figure out how to make
> > > bits of hw_breakpoint work on arm64, then it should just go on top.
> > >
> > > The main pain point for hw_breakpoint is kernel-side {break,watch}points
> > > and I think there are open design questions about how they should work
> > > on arm64, particularly when considering the interaction with user
> > > watchpoints triggering on uaccess routines and the possibility of hitting
> > > a kernel watchpoint in irq context.
> >
> > I see. Our main interest would be break/watchpoints on user addresses
> > firing from both user-space and kernel (uaccess), so at least on irqs.
>
> Interesting. Do other architectures report watchpoint hits on user
> addresses from kernel uaccess? It feels like this might be surprising to
> some users, and it opens up questions about accesses using different virtual
> aliases (e.g. via GUP) or from other entities as well (e.g. firmware,
> KVM guests, DMA).

x86 supports this.
There is that attr.exclude_kernel flag that requires special permissions:
https://elixir.bootlin.com/linux/v5.19-rc7/source/kernel/events/core.c#L12061
https://elixir.bootlin.com/linux/v5.19-rc7/source/kernel/events/core.c#L9323
But if I understand correctly, it only filters out delivery, the HW
breakpoint fires even if attr.exclude_kernel is set.

We also wanted to relax this permission check somewhat:
https://lore.kernel.org/all/[email protected]/

Yes, if the kernel maps the page at a different virtual address, then
the breakpoint won't fire I think.
Don't know what are the issues with firmware/KVM.

2022-07-25 11:45:36

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] perf/hw_breakpoint: Add KUnit test for constraints accounting

On Thu, 21 Jul 2022 at 18:22, Mark Rutland <[email protected]> wrote:
>
> Hi Marco,
>
> [adding Will]
>
> On Mon, Jul 04, 2022 at 05:05:01PM +0200, Marco Elver wrote:
> > Add KUnit test for hw_breakpoint constraints accounting, with various
> > interesting mixes of breakpoint targets (some care was taken to catch
> > interesting corner cases via bug-injection).
> >
> > The test cannot be built as a module because it requires access to
> > hw_breakpoint_slots(), which is not inlinable or exported on all
> > architectures.
> >
> > Signed-off-by: Marco Elver <[email protected]>
>
> As mentioned on IRC, I'm seeing these tests fail on arm64 when applied atop
> v5.19-rc7:
>
> | TAP version 14
> | 1..1
> | # Subtest: hw_breakpoint
> | 1..9
> | ok 1 - test_one_cpu
> | ok 2 - test_many_cpus
> | # test_one_task_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 3 - test_one_task_on_all_cpus
> | # test_two_tasks_on_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 4 - test_two_tasks_on_all_cpus
> | # test_one_task_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 5 - test_one_task_on_one_cpu
> | # test_one_task_mixed: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 6 - test_one_task_mixed
> | # test_two_tasks_on_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 7 - test_two_tasks_on_one_cpu
> | # test_two_tasks_on_one_all_cpus: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 8 - test_two_tasks_on_one_all_cpus
> | # test_task_on_all_and_one_cpu: ASSERTION FAILED at kernel/events/hw_breakpoint_test.c:70
> | Expected IS_ERR(bp) to be false, but is true
> | not ok 9 - test_task_on_all_and_one_cpu
> | # hw_breakpoint: pass:2 fail:7 skip:0 total:9
> | # Totals: pass:2 fail:7 skip:0 total:9
>
> ... which seems to be becasue arm64 currently forbids per-task
> breakpoints/watchpoints in hw_breakpoint_arch_parse(), where we have:
>
> /*
> * Disallow per-task kernel breakpoints since these would
> * complicate the stepping code.
> */
> if (hw->ctrl.privilege == AARCH64_BREAKPOINT_EL1 && bp->hw.target)
> return -EINVAL;
>
> ... which has been the case since day one in commit:
>
> 478fcb2cdb2351dc ("arm64: Debugging support")
>
> I'm not immediately sure what would be necessary to support per-task kernel
> breakpoints, but given a lot of that state is currently per-cpu, I imagine it's
> invasive.

Thanks for investigating - so the test is working as intended. ;-)

However it's a shame that arm64's support is limited. And what Will
said about possible removal/rework of arm64 hw_breakpoint support
doesn't sound too reassuring.

We will definitely want to revisit arm64's hw_breakpoint support in future.

Thanks,
-- Marco

2022-08-16 15:10:18

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] perf/hw_breakpoint: Optimize for thousands of tasks

On Wed, 20 Jul 2022 at 17:47, Ian Rogers <[email protected]> wrote:
> On Tue, Jul 12, 2022 at 6:41 AM Marco Elver <[email protected]> wrote:
> > On Mon, 4 Jul 2022 at 17:05, Marco Elver <[email protected]> wrote:
> > > The hw_breakpoint subsystem's code has seen little change in over 10
> > > years. In that time, systems with >100s of CPUs have become common,
> > > along with improvements to the perf subsystem: using breakpoints on
> > > thousands of concurrent tasks should be a supported usecase.
> > [...]
> > > Marco Elver (14):
> > > perf/hw_breakpoint: Add KUnit test for constraints accounting
> > > perf/hw_breakpoint: Provide hw_breakpoint_is_used() and use in test
> > > perf/hw_breakpoint: Clean up headers
> > > perf/hw_breakpoint: Optimize list of per-task breakpoints
> > > perf/hw_breakpoint: Mark data __ro_after_init
> > > perf/hw_breakpoint: Optimize constant number of breakpoint slots
> > > perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
> > > perf/hw_breakpoint: Remove useless code related to flexible
> > > breakpoints
> > > powerpc/hw_breakpoint: Avoid relying on caller synchronization
> > > locking/percpu-rwsem: Add percpu_is_write_locked() and
> > > percpu_is_read_locked()
> > > perf/hw_breakpoint: Reduce contention with large number of tasks
> > > perf/hw_breakpoint: Introduce bp_slots_histogram
> > > perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent
> > > task targets
> > > perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task
> > > targets
> > [...]
> >
> > This is ready from our side, and given the silence, assume it's ready
> > to pick up and/or have a maintainer take a look. Since this is mostly
> > kernel/events, would -tip/perf/core be appropriate?
>
> These are awesome improvements, I've added my acked-by to every
> change. I hope we can pull these changes, as you say, into tip.git
> perf/core and get them into 5.20.

These still apply cleanly to 6.0-rc1 and the test passes, but let me
know if I shall send a rebased version.

Thanks
-- Marco