2022-06-09 12:17:59

by Marco Elver

Subject: [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks

The hw_breakpoint subsystem's code has seen little change in over 10
years. In that time, systems with hundreds of CPUs have become common,
along with improvements to the perf subsystem: using breakpoints on
thousands of concurrent tasks should be a supported use case.

The breakpoint constraints accounting algorithm is the major bottleneck
in doing so:

1. task_bp_pinned() has been O(#tasks), and called twice for each CPU.

2. Everything is serialized on a global mutex, 'nr_bp_mutex'.

This series first optimizes task_bp_pinned() to only take O(1) on
average, and then reworks synchronization to allow concurrency when
checking and updating breakpoint constraints for tasks. Along the way,
smaller micro-optimizations and cleanups are done where they seemed
obvious when staring at the code (though their impact is likely
insignificant).
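
To illustrate point 1, before this series task_bp_pinned() boils down
to a walk over one global list of all task breakpoints (this is the
code that patch 1/8 removes):

	struct task_struct *tsk = bp->hw.target;

	list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
		if (iter->hw.target == tsk &&
		    find_slot_idx(iter->attr.bp_type) == type &&
		    (iter->cpu < 0 || cpu == iter->cpu))
			count += hw_breakpoint_weight(iter);
	}

With thousands of tasks, and this walk being repeated for every CPU on
every breakpoint update while holding a global mutex, the cost adds up
quickly.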

The result is (on a system with 256 CPUs) that we go from:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
[ ^ more aggressive benchmark parameters took too long ]
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 236.418 [sec]
|
| 123134.794271 usecs/op
| 7880626.833333 usecs/op/cpu

... to -- with all optimizations:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.071 [sec]
|
| 37.134896 usecs/op
| 2376.633333 usecs/op/cpu

On this test system, that is an effective speedup of ~3315x per op
(123134.79 / 37.13 usecs/op).

This is close to the theoretical ideal performance achievable through
optimizations in hw_breakpoint.c alone -- for reference, with
constraints accounting disabled:

| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.067 [sec]
|
| 35.286458 usecs/op
| 2258.333333 usecs/op/cpu

At this point, the current implementation is only ~5% slower than the
theoretical ideal (37.13 vs. 35.29 usecs/op). However, given that
constraints accounting cannot realistically be disabled, this is
likely as far as we can push it.

Marco Elver (8):
perf/hw_breakpoint: Optimize list of per-task breakpoints
perf/hw_breakpoint: Mark data __ro_after_init
perf/hw_breakpoint: Optimize constant number of breakpoint slots
perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
perf/hw_breakpoint: Remove useless code related to flexible
breakpoints
perf/hw_breakpoint: Reduce contention with large number of tasks
perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
perf/hw_breakpoint: Clean up headers

arch/sh/include/asm/hw_breakpoint.h | 5 +-
arch/x86/include/asm/hw_breakpoint.h | 5 +-
include/linux/hw_breakpoint.h | 1 -
include/linux/perf_event.h | 3 +-
kernel/events/hw_breakpoint.c | 374 +++++++++++++++++++--------
5 files changed, 276 insertions(+), 112 deletions(-)

--
2.36.1.255.ge46751e96f-goog


2022-06-09 12:27:00

by Marco Elver

Subject: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent

Running the perf benchmark with the following parameters (note: more
aggressive than for the preceding changes, but same host with 256
CPUs):

| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
| Total time: 1.953 [sec]
|
| 38.146289 usecs/op
| 4882.725000 usecs/op/cpu

16.29% [kernel] [k] rhashtable_jhash2
16.19% [kernel] [k] osq_lock
14.22% [kernel] [k] queued_spin_lock_slowpath
8.58% [kernel] [k] task_bp_pinned
8.30% [kernel] [k] mutex_spin_on_owner
4.03% [kernel] [k] smp_cfm_core_cond
2.97% [kernel] [k] toggle_bp_slot
2.94% [kernel] [k] bcmp

We can see that a majority of the time is now spent hashing task
pointers to index into task_bps_ht in task_bp_pinned().

However, if task_bp_pinned()'s computation is independent of any CPU,
i.e. all matching breakpoints have `iter->cpu < 0`, the result of each
per-CPU invocation is identical, and the redundant recomputation only
worsens with increasing CPU count.

Instead, identify if every call to task_bp_pinned() is CPU-independent,
and cache the result. Use the cached result instead of a call to
task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
if the cached result can be used.

After this optimization:

21.96% [kernel] [k] queued_spin_lock_slowpath
16.39% [kernel] [k] osq_lock
9.82% [kernel] [k] toggle_bp_slot
9.81% [kernel] [k] find_next_bit
4.93% [kernel] [k] mutex_spin_on_owner
4.71% [kernel] [k] smp_cfm_core_cond
4.30% [kernel] [k] __reserve_bp_slot
2.65% [kernel] [k] cpumask_next

Showing that the time spent hashing keys has become insignificant.

With the given benchmark parameters, however, we see no statistically
significant improvement in performance on the test system with 256 CPUs.
This is very likely due to the benchmark parameters being too aggressive
and contention elsewhere becoming dominant.

Indeed, when using the less aggressive parameters from the preceding
changes, we now observe:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.071 [sec]
|
| 37.134896 usecs/op
| 2376.633333 usecs/op/cpu

This is an improvement of 12% compared to the same benchmark without
this optimization (baseline is 42 usecs/op). It is now only 5% slower
than the theoretical ideal (constraints disabled), and 18% slower than
no breakpoints at all.

[ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
more consistent with other functions (which have bp first, before the
cpu argument). ]

Signed-off-by: Marco Elver <[email protected]>
---
kernel/events/hw_breakpoint.c | 71 +++++++++++++++++++++++++----------
1 file changed, 52 insertions(+), 19 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 08c9ed0626e4..3b33a4075104 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -242,11 +242,22 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
* Count the number of breakpoints of the same type and same task.
* The given event must be not on the list.
*/
-static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
+struct task_bp_pinned {
+ /*
+ * If @cpu_independent is true, we can avoid calling __task_bp_pinned()
+ * for each CPU, since @count will be the same for each invocation.
+ */
+ bool cpu_independent;
+ int count;
+ struct perf_event *bp;
+ enum bp_type_idx type;
+};
+static struct task_bp_pinned
+__task_bp_pinned(struct perf_event *bp, int cpu, enum bp_type_idx type)
{
+ struct task_bp_pinned ret = {true, 0, bp, type};
struct rhlist_head *head, *pos;
struct perf_event *iter;
- int count = 0;

/*
* We need a stable snapshot of the per-task breakpoint list.
@@ -259,14 +270,33 @@ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
goto out;

rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
- if (find_slot_idx(iter->attr.bp_type) == type &&
- (iter->cpu < 0 || cpu == iter->cpu))
- count += hw_breakpoint_weight(iter);
+ if (find_slot_idx(iter->attr.bp_type) == type) {
+ if (iter->cpu >= 0) {
+ ret.cpu_independent = false;
+ if (cpu != iter->cpu)
+ continue;
+ }
+ ret.count += hw_breakpoint_weight(iter);
+ }
}

out:
rcu_read_unlock();
- return count;
+ return ret;
+}
+
+static int
+task_bp_pinned(struct perf_event *bp, int cpu, enum bp_type_idx type,
+ struct task_bp_pinned *cached_tbp_pinned)
+{
+ if (cached_tbp_pinned->cpu_independent) {
+ assert_bp_constraints_lock_held(bp);
+ if (!WARN_ON(cached_tbp_pinned->bp != bp || cached_tbp_pinned->type != type))
+ return cached_tbp_pinned->count;
+ }
+
+ *cached_tbp_pinned = __task_bp_pinned(bp, cpu, type);
+ return cached_tbp_pinned->count;
}

static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
@@ -281,8 +311,8 @@ static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
* a given cpu (cpu > -1) or in all of them (cpu = -1).
*/
static void
-fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
- enum bp_type_idx type)
+fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp, enum bp_type_idx type,
+ struct task_bp_pinned *cached_tbp_pinned)
{
const struct cpumask *cpumask = cpumask_of_bp(bp);
int cpu;
@@ -295,7 +325,7 @@ fetch_bp_busy_slots(struct bp_busy_slots *slots, struct perf_event *bp,
if (!bp->hw.target)
nr += max_task_bp_pinned(cpu, type);
else
- nr += task_bp_pinned(cpu, bp, type);
+ nr += task_bp_pinned(bp, cpu, type, cached_tbp_pinned);

if (nr > slots->pinned)
slots->pinned = nr;
@@ -314,10 +344,11 @@ fetch_this_slot(struct bp_busy_slots *slots, int weight)
}

/*
- * Add a pinned breakpoint for the given task in our constraint table
+ * Add a pinned breakpoint for the given task in our constraint table.
*/
-static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
- enum bp_type_idx type, int weight)
+static void
+toggle_bp_task_slot(struct perf_event *bp, int cpu, enum bp_type_idx type, int weight,
+ struct task_bp_pinned *cached_tbp_pinned)
{
atomic_t *tsk_pinned = get_bp_info(cpu, type)->tsk_pinned;
int old_idx, new_idx;
@@ -331,7 +362,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
*/
lockdep_assert_held_read(&bp_cpuinfo_lock);

- old_idx = task_bp_pinned(cpu, bp, type) - 1;
+ old_idx = task_bp_pinned(bp, cpu, type, cached_tbp_pinned) - 1;
new_idx = old_idx + weight;

if (old_idx >= 0)
@@ -341,11 +372,11 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
}

/*
- * Add/remove the given breakpoint in our constraint table
+ * Add/remove the given breakpoint in our constraint table.
*/
static int
toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
- int weight)
+ int weight, struct task_bp_pinned *cached_tbp_pinned)
{
const struct cpumask *cpumask = cpumask_of_bp(bp);
int cpu;
@@ -362,7 +393,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,

/* Pinned counter task profiling */
for_each_cpu(cpu, cpumask)
- toggle_bp_task_slot(bp, cpu, type, weight);
+ toggle_bp_task_slot(bp, cpu, type, weight, cached_tbp_pinned);

/*
* Readers want a stable snapshot of the per-task breakpoint list.
@@ -439,6 +470,7 @@ __weak void arch_unregister_hw_breakpoint(struct perf_event *bp)
*/
static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
{
+ struct task_bp_pinned cached_tbp_pinned = {};
struct bp_busy_slots slots = {0};
enum bp_type_idx type;
int weight;
@@ -456,7 +488,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
type = find_slot_idx(bp_type);
weight = hw_breakpoint_weight(bp);

- fetch_bp_busy_slots(&slots, bp, type);
+ fetch_bp_busy_slots(&slots, bp, type, &cached_tbp_pinned);
/*
* Simulate the addition of this breakpoint to the constraints
* and see the result.
@@ -471,7 +503,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
if (ret)
return ret;

- return toggle_bp_slot(bp, true, type, weight);
+ return toggle_bp_slot(bp, true, type, weight, &cached_tbp_pinned);
}

int reserve_bp_slot(struct perf_event *bp)
@@ -485,6 +517,7 @@ int reserve_bp_slot(struct perf_event *bp)

static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
{
+ struct task_bp_pinned cached_tbp_pinned = {};
enum bp_type_idx type;
int weight;

@@ -492,7 +525,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)

type = find_slot_idx(bp_type);
weight = hw_breakpoint_weight(bp);
- WARN_ON(toggle_bp_slot(bp, false, type, weight));
+ WARN_ON(toggle_bp_slot(bp, false, type, weight, &cached_tbp_pinned));
}

void release_bp_slot(struct perf_event *bp)
--
2.36.1.255.ge46751e96f-goog

2022-06-09 12:35:43

by Dmitry Vyukov

Subject: Re: [PATCH 0/8] perf/hw_breakpoint: Optimize for thousands of tasks

On Thu, 9 Jun 2022 at 13:30, Marco Elver <[email protected]> wrote:
>
> The hw_breakpoint subsystem's code has seen little change in over 10
> years. In that time, systems with >100s of CPUs have become common,
> along with improvements to the perf subsystem: using breakpoints on
> thousands of concurrent tasks should be a supported usecase.
>
> The breakpoint constraints accounting algorithm is the major bottleneck
> in doing so:
>
> 1. task_bp_pinned() has been O(#tasks), and called twice for each CPU.
>
> 2. Everything is serialized on a global mutex, 'nr_bp_mutex'.
>
> This series first optimizes task_bp_pinned() to only take O(1) on
> average, and then reworks synchronization to allow concurrency when
> checking and updating breakpoint constraints for tasks. Along the way,
> smaller micro-optimizations and cleanups are done as they seemed obvious
> when staring at the code (but likely insignificant).
>
> The result is (on a system with 256 CPUs) that we go from:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> [ ^ more aggressive benchmark parameters took too long ]
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 236.418 [sec]
> |
> | 123134.794271 usecs/op
> | 7880626.833333 usecs/op/cpu
>
> ... to -- with all optimizations:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 0.071 [sec]
> |
> | 37.134896 usecs/op
> | 2376.633333 usecs/op/cpu
>
> On the used test system, that's an effective speedup of ~3315x per op.

Awesome!

> Which is close to the theoretical ideal performance through
> optimizations in hw_breakpoint.c -- for reference, constraints
> accounting disabled:
>
> | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 0.067 [sec]
> |
> | 35.286458 usecs/op
> | 2258.333333 usecs/op/cpu
>
> At this point, the current implementation is only ~5% slower than the
> theoretical ideal. However, given constraints accounting cannot
> realistically be disabled, this is likely as far as we can push it.
>
> Marco Elver (8):
> perf/hw_breakpoint: Optimize list of per-task breakpoints
> perf/hw_breakpoint: Mark data __ro_after_init
> perf/hw_breakpoint: Optimize constant number of breakpoint slots
> perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
> perf/hw_breakpoint: Remove useless code related to flexible
> breakpoints
> perf/hw_breakpoint: Reduce contention with large number of tasks
> perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
> perf/hw_breakpoint: Clean up headers
>
> arch/sh/include/asm/hw_breakpoint.h | 5 +-
> arch/x86/include/asm/hw_breakpoint.h | 5 +-
> include/linux/hw_breakpoint.h | 1 -
> include/linux/perf_event.h | 3 +-
> kernel/events/hw_breakpoint.c | 374 +++++++++++++++++++--------
> 5 files changed, 276 insertions(+), 112 deletions(-)
>
> --
> 2.36.1.255.ge46751e96f-goog

2022-06-09 12:47:43

by Marco Elver

Subject: [PATCH 4/8] perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable

Because hw_breakpoint_weight() is a __weak function, the compiler
always emits a real call to it. This generates unnecessarily bad code
(register spills etc.) for no good reason; in fact, it shows up in
profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:

...
0.70% [kernel] [k] hw_breakpoint_weight
...

While only a small percentage, no architecture defines its own
hw_breakpoint_weight(), nor are there any users outside
hw_breakpoint.c, which makes its current __weak definition a poor
choice.

Change hw_breakpoint_weight()'s definition to follow a similar protocol
to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
hw_breakpoint_weight(), we'll use it instead.

The result is that it is inlined and no longer shows up in profiles.
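
For illustration, an architecture that did want its own weight could,
under this protocol, provide something like the following in its
<asm/hw_breakpoint.h> (hypothetical sketch -- no architecture
currently does this):

	#define hw_breakpoint_weight hw_breakpoint_weight
	static inline int hw_breakpoint_weight(struct perf_event *bp)
	{
		return 1;	/* or an arch-specific per-breakpoint weight */
	}

The #define is what the #ifndef check in hw_breakpoint.c keys off of.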

Signed-off-by: Marco Elver <[email protected]>
---
include/linux/hw_breakpoint.h | 1 -
kernel/events/hw_breakpoint.c | 4 +++-
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
index 78dd7035d1e5..9fa3547acd87 100644
--- a/include/linux/hw_breakpoint.h
+++ b/include/linux/hw_breakpoint.h
@@ -79,7 +79,6 @@ extern int dbg_reserve_bp_slot(struct perf_event *bp);
extern int dbg_release_bp_slot(struct perf_event *bp);
extern int reserve_bp_slot(struct perf_event *bp);
extern void release_bp_slot(struct perf_event *bp);
-int hw_breakpoint_weight(struct perf_event *bp);
int arch_reserve_bp_slot(struct perf_event *bp);
void arch_release_bp_slot(struct perf_event *bp);
void arch_unregister_hw_breakpoint(struct perf_event *bp);
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 8e939723f27d..5f40c8dfa042 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -125,10 +125,12 @@ static __init int init_breakpoint_slots(void)
}
#endif

-__weak int hw_breakpoint_weight(struct perf_event *bp)
+#ifndef hw_breakpoint_weight
+static inline int hw_breakpoint_weight(struct perf_event *bp)
{
return 1;
}
+#endif

static inline enum bp_type_idx find_slot_idx(u64 bp_type)
{
--
2.36.1.255.ge46751e96f-goog

2022-06-09 12:50:30

by Marco Elver

Subject: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On a machine with 256 CPUs, running the recently added perf breakpoint
benchmark results in:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 236.418 [sec]
|
| 123134.794271 usecs/op
| 7880626.833333 usecs/op/cpu

The benchmark tests inherited breakpoint perf events across many
threads.

Looking at a perf profile, we can see that the majority of the time is
spent in various hw_breakpoint.c functions, all of which execute within
'nr_bp_mutex' critical sections, which in turn results in contention on
that mutex as well:

37.27% [kernel] [k] osq_lock
34.92% [kernel] [k] mutex_spin_on_owner
12.15% [kernel] [k] toggle_bp_slot
11.90% [kernel] [k] __reserve_bp_slot

The culprit here is task_bp_pinned(), which has a runtime complexity of
O(#tasks) due to storing all task breakpoints in the same list and
iterating through that list looking for a matching task. Clearly, this
does not scale to thousands of tasks.

While one option would be to embed a breakpoint list node in
task_struct, this would only further bloat task_struct for infrequently
used data.

Instead, make use of the "rhashtable" variant "rhltable", which stores
multiple items with the same key in a list. This results in an average
runtime complexity of O(1) for task_bp_pinned().
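
For readers unfamiliar with rhltable, the access pattern this patch
relies on is roughly the following (illustrative sketch only -- the
actual keying on the breakpoint's target task is in the diff below):

	#include <linux/rhashtable.h>

	/* Object with an embedded rhlist_head, hashed by @key. */
	struct obj {
		unsigned long key;
		struct rhlist_head node;
	};

	static struct rhltable ht;	/* rhltable_init(&ht, &ht_params) once */
	static const struct rhashtable_params ht_params = {
		.head_offset = offsetof(struct obj, node),
		.key_offset  = offsetof(struct obj, key),
		.key_len     = sizeof_field(struct obj, key),
	};

	/*
	 * Entries are added/removed with rhltable_insert()/rhltable_remove();
	 * a lookup returns the chain of *all* entries sharing the key.
	 */
	static int count_entries_with_key(unsigned long key)
	{
		struct rhlist_head *head, *pos;
		struct obj *iter;
		int count = 0;

		rcu_read_lock();
		head = rhltable_lookup(&ht, &key, ht_params);
		if (head) {
			rhl_for_each_entry_rcu(iter, pos, head, node)
				count++;
		}
		rcu_read_unlock();
		return count;
	}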

With the optimization, the benchmark shows:

| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
| Total time: 0.208 [sec]
|
| 108.422396 usecs/op
| 6939.033333 usecs/op/cpu

On this particular setup, that's a speedup of ~1135x.

Signed-off-by: Marco Elver <[email protected]>
---
include/linux/perf_event.h | 3 +-
kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 01231f1d976c..e27360436dc6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
};

#ifdef CONFIG_HAVE_HW_BREAKPOINT
+#include <linux/rhashtable-types.h>
#include <asm/hw_breakpoint.h>
#endif

@@ -178,7 +179,7 @@ struct hw_perf_event {
* creation and event initalization.
*/
struct arch_hw_breakpoint info;
- struct list_head bp_list;
+ struct rhlist_head bp_list;
};
#endif
struct { /* amd_iommu */
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index f32320ac02fd..25c94c6e918d 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -28,7 +28,7 @@
#include <linux/sched.h>
#include <linux/init.h>
#include <linux/slab.h>
-#include <linux/list.h>
+#include <linux/rhashtable.h>
#include <linux/cpu.h>
#include <linux/smp.h>
#include <linux/bug.h>
@@ -55,7 +55,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
}

/* Keep track of the breakpoints attached to tasks */
-static LIST_HEAD(bp_task_head);
+static struct rhltable task_bps_ht;
+static const struct rhashtable_params task_bps_ht_params = {
+ .head_offset = offsetof(struct hw_perf_event, bp_list),
+ .key_offset = offsetof(struct hw_perf_event, target),
+ .key_len = sizeof_field(struct hw_perf_event, target),
+ .automatic_shrinking = true,
+};

static int constraints_initialized;

@@ -104,17 +110,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
*/
static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
{
- struct task_struct *tsk = bp->hw.target;
+ struct rhlist_head *head, *pos;
struct perf_event *iter;
int count = 0;

- list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
- if (iter->hw.target == tsk &&
- find_slot_idx(iter->attr.bp_type) == type &&
+ rcu_read_lock();
+ head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
+ if (!head)
+ goto out;
+
+ rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
+ if (find_slot_idx(iter->attr.bp_type) == type &&
(iter->cpu < 0 || cpu == iter->cpu))
count += hw_breakpoint_weight(iter);
}

+out:
+ rcu_read_unlock();
return count;
}

@@ -187,7 +199,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
/*
* Add/remove the given breakpoint in our constraint table
*/
-static void
+static int
toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
int weight)
{
@@ -200,7 +212,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
/* Pinned counter cpu profiling */
if (!bp->hw.target) {
get_bp_info(bp->cpu, type)->cpu_pinned += weight;
- return;
+ return 0;
}

/* Pinned counter task profiling */
@@ -208,9 +220,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
toggle_bp_task_slot(bp, cpu, type, weight);

if (enable)
- list_add_tail(&bp->hw.bp_list, &bp_task_head);
+ return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
else
- list_del(&bp->hw.bp_list);
+ return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
}

__weak int arch_reserve_bp_slot(struct perf_event *bp)
@@ -308,9 +320,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
if (ret)
return ret;

- toggle_bp_slot(bp, true, type, weight);
-
- return 0;
+ return toggle_bp_slot(bp, true, type, weight);
}

int reserve_bp_slot(struct perf_event *bp)
@@ -335,7 +345,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)

type = find_slot_idx(bp_type);
weight = hw_breakpoint_weight(bp);
- toggle_bp_slot(bp, false, type, weight);
+ WARN_ON(toggle_bp_slot(bp, false, type, weight));
}

void release_bp_slot(struct perf_event *bp)
@@ -679,7 +689,7 @@ static struct pmu perf_breakpoint = {
int __init init_hw_breakpoint(void)
{
int cpu, err_cpu;
- int i;
+ int i, ret;

for (i = 0; i < TYPE_MAX; i++)
nr_slots[i] = hw_breakpoint_slots(i);
@@ -690,18 +700,24 @@ int __init init_hw_breakpoint(void)

info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
GFP_KERNEL);
- if (!info->tsk_pinned)
- goto err_alloc;
+ if (!info->tsk_pinned) {
+ ret = -ENOMEM;
+ goto err;
+ }
}
}

+ ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
+ if (ret)
+ goto err;
+
constraints_initialized = 1;

perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);

return register_die_notifier(&hw_breakpoint_exceptions_nb);

- err_alloc:
+err:
for_each_possible_cpu(err_cpu) {
for (i = 0; i < TYPE_MAX; i++)
kfree(get_bp_info(err_cpu, i)->tsk_pinned);
@@ -709,7 +725,5 @@ int __init init_hw_breakpoint(void)
break;
}

- return -ENOMEM;
+ return ret;
}
-
-
--
2.36.1.255.ge46751e96f-goog

2022-06-09 12:56:16

by Marco Elver

Subject: [PATCH 8/8] perf/hw_breakpoint: Clean up headers

Clean up headers:

- Remove unused <linux/kallsyms.h>

- Remove unused <linux/kprobes.h>

- Remove unused <linux/module.h>

- Remove unused <linux/smp.h>

- Add <linux/export.h> for EXPORT_SYMBOL_GPL().

- Sort alphabetically.

- Move <linux/hw_breakpoint.h> to top to test it compiles on its own.

Signed-off-by: Marco Elver <[email protected]>
---
kernel/events/hw_breakpoint.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 3b33a4075104..e9aa7f2c031a 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -17,26 +17,24 @@
* This file contains the arch-independent routines.
*/

+#include <linux/hw_breakpoint.h>
+
#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/cpu.h>
+#include <linux/export.h>
+#include <linux/init.h>
#include <linux/irqflags.h>
-#include <linux/kallsyms.h>
-#include <linux/notifier.h>
-#include <linux/kprobes.h>
#include <linux/kdebug.h>
#include <linux/kernel.h>
-#include <linux/module.h>
#include <linux/mutex.h>
+#include <linux/notifier.h>
#include <linux/percpu.h>
+#include <linux/rhashtable.h>
#include <linux/sched.h>
-#include <linux/spinlock.h>
-#include <linux/init.h>
#include <linux/slab.h>
-#include <linux/rhashtable.h>
-#include <linux/cpu.h>
-#include <linux/smp.h>
-#include <linux/bug.h>
+#include <linux/spinlock.h>

-#include <linux/hw_breakpoint.h>
/*
* Constraints data
*/
--
2.36.1.255.ge46751e96f-goog

2022-06-09 13:02:54

by Dmitry Vyukov

Subject: Re: [PATCH 8/8] perf/hw_breakpoint: Clean up headers

On Thu, 9 Jun 2022 at 13:31, Marco Elver <[email protected]> wrote:
>
> Clean up headers:
>
> - Remove unused <linux/kallsyms.h>
>
> - Remove unused <linux/kprobes.h>
>
> - Remove unused <linux/module.h>
>
> - Remove unused <linux/smp.h>
>
> - Add <linux/export.h> for EXPORT_SYMBOL_GPL().
>
> - Sort alphabetically.
>
> - Move <linux/hw_breakpoint.h> to top to test it compiles on its own.
>
> Signed-off-by: Marco Elver <[email protected]>

Acked-by: Dmitry Vyukov <[email protected]>

> ---
> kernel/events/hw_breakpoint.c | 20 +++++++++-----------
> 1 file changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index 3b33a4075104..e9aa7f2c031a 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -17,26 +17,24 @@
> * This file contains the arch-independent routines.
> */
>
> +#include <linux/hw_breakpoint.h>
> +
> #include <linux/atomic.h>
> +#include <linux/bug.h>
> +#include <linux/cpu.h>
> +#include <linux/export.h>
> +#include <linux/init.h>
> #include <linux/irqflags.h>
> -#include <linux/kallsyms.h>
> -#include <linux/notifier.h>
> -#include <linux/kprobes.h>
> #include <linux/kdebug.h>
> #include <linux/kernel.h>
> -#include <linux/module.h>
> #include <linux/mutex.h>
> +#include <linux/notifier.h>
> #include <linux/percpu.h>
> +#include <linux/rhashtable.h>
> #include <linux/sched.h>
> -#include <linux/spinlock.h>
> -#include <linux/init.h>
> #include <linux/slab.h>
> -#include <linux/rhashtable.h>
> -#include <linux/cpu.h>
> -#include <linux/smp.h>
> -#include <linux/bug.h>
> +#include <linux/spinlock.h>
>
> -#include <linux/hw_breakpoint.h>
> /*
> * Constraints data
> */
> --
> 2.36.1.255.ge46751e96f-goog
>

2022-06-09 14:53:32

by Dmitry Vyukov

Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On Thu, 9 Jun 2022 at 13:31, Marco Elver <[email protected]> wrote:
>
> On a machine with 256 CPUs, running the recently added perf breakpoint
> benchmark results in:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 236.418 [sec]
> |
> | 123134.794271 usecs/op
> | 7880626.833333 usecs/op/cpu
>
> The benchmark tests inherited breakpoint perf events across many
> threads.
>
> Looking at a perf profile, we can see that the majority of the time is
> spent in various hw_breakpoint.c functions, which execute within the
> 'nr_bp_mutex' critical sections which then results in contention on that
> mutex as well:
>
> 37.27% [kernel] [k] osq_lock
> 34.92% [kernel] [k] mutex_spin_on_owner
> 12.15% [kernel] [k] toggle_bp_slot
> 11.90% [kernel] [k] __reserve_bp_slot
>
> The culprit here is task_bp_pinned(), which has a runtime complexity of
> O(#tasks) due to storing all task breakpoints in the same list and
> iterating through that list looking for a matching task. Clearly, this
> does not scale to thousands of tasks.
>
> While one option would be to make task_struct a breakpoint list node,
> this would only further bloat task_struct for infrequently used data.

task_struct already has:

#ifdef CONFIG_PERF_EVENTS
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
struct mutex perf_event_mutex;
struct list_head perf_event_list;
#endif

Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
And possibly perf_event_list instead of task_bps_ht? It will contain
other perf_event types, so we will need to test type as well, but on
the positive side, we don't need any management of the separate
container.




> Instead, make use of the "rhashtable" variant "rhltable" which stores
> multiple items with the same key in a list. This results in average
> runtime complexity of O(1) for task_bp_pinned().
>
> With the optimization, the benchmark shows:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 0.208 [sec]
> |
> | 108.422396 usecs/op
> | 6939.033333 usecs/op/cpu
>
> On this particular setup that's a speedup of ~1135x.
>
> Signed-off-by: Marco Elver <[email protected]>
> ---
> include/linux/perf_event.h | 3 +-
> kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
> 2 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 01231f1d976c..e27360436dc6 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
> };
>
> #ifdef CONFIG_HAVE_HW_BREAKPOINT
> +#include <linux/rhashtable-types.h>
> #include <asm/hw_breakpoint.h>
> #endif
>
> @@ -178,7 +179,7 @@ struct hw_perf_event {
> * creation and event initalization.
> */
> struct arch_hw_breakpoint info;
> - struct list_head bp_list;
> + struct rhlist_head bp_list;
> };
> #endif
> struct { /* amd_iommu */
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index f32320ac02fd..25c94c6e918d 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -28,7 +28,7 @@
> #include <linux/sched.h>
> #include <linux/init.h>
> #include <linux/slab.h>
> -#include <linux/list.h>
> +#include <linux/rhashtable.h>
> #include <linux/cpu.h>
> #include <linux/smp.h>
> #include <linux/bug.h>
> @@ -55,7 +55,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
> }
>
> /* Keep track of the breakpoints attached to tasks */
> -static LIST_HEAD(bp_task_head);
> +static struct rhltable task_bps_ht;
> +static const struct rhashtable_params task_bps_ht_params = {
> + .head_offset = offsetof(struct hw_perf_event, bp_list),
> + .key_offset = offsetof(struct hw_perf_event, target),
> + .key_len = sizeof_field(struct hw_perf_event, target),
> + .automatic_shrinking = true,
> +};
>
> static int constraints_initialized;
>
> @@ -104,17 +110,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
> */
> static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
> {
> - struct task_struct *tsk = bp->hw.target;
> + struct rhlist_head *head, *pos;
> struct perf_event *iter;
> int count = 0;
>
> - list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
> - if (iter->hw.target == tsk &&
> - find_slot_idx(iter->attr.bp_type) == type &&
> + rcu_read_lock();
> + head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> + if (!head)
> + goto out;
> +
> + rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
> + if (find_slot_idx(iter->attr.bp_type) == type &&
> (iter->cpu < 0 || cpu == iter->cpu))
> count += hw_breakpoint_weight(iter);
> }
>
> +out:
> + rcu_read_unlock();
> return count;
> }
>
> @@ -187,7 +199,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
> /*
> * Add/remove the given breakpoint in our constraint table
> */
> -static void
> +static int
> toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> int weight)
> {
> @@ -200,7 +212,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> /* Pinned counter cpu profiling */
> if (!bp->hw.target) {
> get_bp_info(bp->cpu, type)->cpu_pinned += weight;
> - return;
> + return 0;
> }
>
> /* Pinned counter task profiling */
> @@ -208,9 +220,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
> toggle_bp_task_slot(bp, cpu, type, weight);
>
> if (enable)
> - list_add_tail(&bp->hw.bp_list, &bp_task_head);
> + return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> else
> - list_del(&bp->hw.bp_list);
> + return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
> }
>
> __weak int arch_reserve_bp_slot(struct perf_event *bp)
> @@ -308,9 +320,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
> if (ret)
> return ret;
>
> - toggle_bp_slot(bp, true, type, weight);
> -
> - return 0;
> + return toggle_bp_slot(bp, true, type, weight);
> }
>
> int reserve_bp_slot(struct perf_event *bp)
> @@ -335,7 +345,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
>
> type = find_slot_idx(bp_type);
> weight = hw_breakpoint_weight(bp);
> - toggle_bp_slot(bp, false, type, weight);
> + WARN_ON(toggle_bp_slot(bp, false, type, weight));
> }
>
> void release_bp_slot(struct perf_event *bp)
> @@ -679,7 +689,7 @@ static struct pmu perf_breakpoint = {
> int __init init_hw_breakpoint(void)
> {
> int cpu, err_cpu;
> - int i;
> + int i, ret;
>
> for (i = 0; i < TYPE_MAX; i++)
> nr_slots[i] = hw_breakpoint_slots(i);
> @@ -690,18 +700,24 @@ int __init init_hw_breakpoint(void)
>
> info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
> GFP_KERNEL);
> - if (!info->tsk_pinned)
> - goto err_alloc;
> + if (!info->tsk_pinned) {
> + ret = -ENOMEM;
> + goto err;
> + }
> }
> }
>
> + ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
> + if (ret)
> + goto err;
> +
> constraints_initialized = 1;
>
> perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
>
> return register_die_notifier(&hw_breakpoint_exceptions_nb);
>
> - err_alloc:
> +err:
> for_each_possible_cpu(err_cpu) {
> for (i = 0; i < TYPE_MAX; i++)
> kfree(get_bp_info(err_cpu, i)->tsk_pinned);
> @@ -709,7 +725,5 @@ int __init init_hw_breakpoint(void)
> break;
> }
>
> - return -ENOMEM;
> + return ret;
> }
> -
> -
> --
> 2.36.1.255.ge46751e96f-goog
>

2022-06-09 15:10:16

by Marco Elver

Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On Thu, 9 Jun 2022 at 16:29, Dmitry Vyukov <[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <[email protected]> wrote:
> >
> > On a machine with 256 CPUs, running the recently added perf breakpoint
> > benchmark results in:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > | Total time: 236.418 [sec]
> > |
> > | 123134.794271 usecs/op
> > | 7880626.833333 usecs/op/cpu
> >
> > The benchmark tests inherited breakpoint perf events across many
> > threads.
> >
> > Looking at a perf profile, we can see that the majority of the time is
> > spent in various hw_breakpoint.c functions, which execute within the
> > 'nr_bp_mutex' critical sections which then results in contention on that
> > mutex as well:
> >
> > 37.27% [kernel] [k] osq_lock
> > 34.92% [kernel] [k] mutex_spin_on_owner
> > 12.15% [kernel] [k] toggle_bp_slot
> > 11.90% [kernel] [k] __reserve_bp_slot
> >
> > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > O(#tasks) due to storing all task breakpoints in the same list and
> > iterating through that list looking for a matching task. Clearly, this
> > does not scale to thousands of tasks.
> >
> > While one option would be to make task_struct a breakpoint list node,
> > this would only further bloat task_struct for infrequently used data.
>
> task_struct already has:
>
> #ifdef CONFIG_PERF_EVENTS
> struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> struct mutex perf_event_mutex;
> struct list_head perf_event_list;
> #endif
>
> Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> And possibly perf_event_list instead of task_bps_ht? It will contain
> other perf_event types, so we will need to test type as well, but on
> the positive side, we don't need any management of the separate
> container.

Hmm, yes, I looked at that but then decided against messing with the
perf/core internals. The main issue I have with using perf_event_mutex
is that we might interfere with perf/core's locking rules as well as
interfere with other concurrent perf event additions. Using
perf_event_list is very likely a no-go because it requires reworking
perf/core as well.

I can already hear Peter shouting, but maybe I'm wrong. :-)

2022-06-09 15:46:23

by Dmitry Vyukov

Subject: Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent

On Thu, 9 Jun 2022 at 13:31, Marco Elver <[email protected]> wrote:
>
> Running the perf benchmark with (note: more aggressive parameters vs.
> preceding changes, but same host with 256 CPUs):
>
> | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> | Total time: 1.953 [sec]
> |
> | 38.146289 usecs/op
> | 4882.725000 usecs/op/cpu
>
> 16.29% [kernel] [k] rhashtable_jhash2
> 16.19% [kernel] [k] osq_lock
> 14.22% [kernel] [k] queued_spin_lock_slowpath
> 8.58% [kernel] [k] task_bp_pinned
> 8.30% [kernel] [k] mutex_spin_on_owner
> 4.03% [kernel] [k] smp_cfm_core_cond
> 2.97% [kernel] [k] toggle_bp_slot
> 2.94% [kernel] [k] bcmp
>
> We can see that a majority of the time is now spent hashing task
> pointers to index into task_bps_ht in task_bp_pinned().
>
> However, if task_bp_pinned()'s computation is independent of any CPU,
> i.e. always `iter->cpu < 0`, the result for each invocation will be
> identical. With increasing CPU-count, this problem worsens.
>
> Instead, identify if every call to task_bp_pinned() is CPU-independent,
> and cache the result. Use the cached result instead of a call to
> task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> if the cached result can be used.
>
> After this optimization:
>
> 21.96% [kernel] [k] queued_spin_lock_slowpath
> 16.39% [kernel] [k] osq_lock
> 9.82% [kernel] [k] toggle_bp_slot
> 9.81% [kernel] [k] find_next_bit
> 4.93% [kernel] [k] mutex_spin_on_owner
> 4.71% [kernel] [k] smp_cfm_core_cond
> 4.30% [kernel] [k] __reserve_bp_slot
> 2.65% [kernel] [k] cpumask_next
>
> Showing that the time spent hashing keys has become insignificant.
>
> With the given benchmark parameters, however, we see no statistically
> significant improvement in performance on the test system with 256 CPUs.
> This is very likely due to the benchmark parameters being too aggressive
> and contention elsewhere becoming dominant.
>
> Indeed, when using the less aggressive parameters from the preceding
> changes, we now observe:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 0.071 [sec]
> |
> | 37.134896 usecs/op
> | 2376.633333 usecs/op/cpu
>
> Which is an improvement of 12% compared to without this optimization
> (baseline is 42 usecs/op). This is now only 5% slower than the
> theoretical ideal (constraints disabled), and 18% slower than no
> breakpoints at all.
>
> [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> more consistent with other functions (which have bp first, before the
> cpu argument). ]

There are 3 main cases:
1. Per-cpu bp.
2. Per-task and per-cpu bp.
3. Per-task bp (on all cpus)
right?
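
[ In terms of the fields hw_breakpoint.c checks, these cases map
roughly to:

	/* 1: bp->hw.target == NULL && bp->cpu >= 0 */
	/* 2: bp->hw.target != NULL && bp->cpu >= 0 */
	/* 3: bp->hw.target != NULL && bp->cpu <  0 */
]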

For case 1 we still seem to do lots of unnecessary work in
fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
only the CPU's cpu_pinned, so that's the only CPU we need to
fetch/check.

For case 2 we also do lots of unnecessary work, again we also need to
check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
atomic_dec/inc on all other CPUs (they dec/inc the same variable).

Case 3 is the only one when we need to check all CPUs and
cached_tbp_pinned may be useful.
But I wonder if we could instead add a per-task
has_per_cpu_breakpoints flag. Then if the flag is set, we check all
CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
then we could optimize the code even more by making it O(1) instead of
O(N). Namely, we add global tsk_pinned for tasks that don't have
per-cpu breakpoints, and we update only that tsk_pinned instead of
iterating over all CPUs.
I think this will require adding cpu_pinned as well (similar to
tsk_pinned but aggregated over all CPUs).
Then the fast path capacity check can become just:

if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
	if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
	    hw_breakpoint_weight(bp) > nr_slots[type])
		return -ENOSPC;
}

Does it make any sense?

2022-06-09 16:57:34

by Dmitry Vyukov

Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On Thu, 9 Jun 2022 at 16:56, Marco Elver <[email protected]> wrote:
> > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > benchmark results in:
> > >
> > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > | Total time: 236.418 [sec]
> > > |
> > > | 123134.794271 usecs/op
> > > | 7880626.833333 usecs/op/cpu
> > >
> > > The benchmark tests inherited breakpoint perf events across many
> > > threads.
> > >
> > > Looking at a perf profile, we can see that the majority of the time is
> > > spent in various hw_breakpoint.c functions, which execute within the
> > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > mutex as well:
> > >
> > > 37.27% [kernel] [k] osq_lock
> > > 34.92% [kernel] [k] mutex_spin_on_owner
> > > 12.15% [kernel] [k] toggle_bp_slot
> > > 11.90% [kernel] [k] __reserve_bp_slot
> > >
> > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > O(#tasks) due to storing all task breakpoints in the same list and
> > > iterating through that list looking for a matching task. Clearly, this
> > > does not scale to thousands of tasks.
> > >
> > > While one option would be to make task_struct a breakpoint list node,
> > > this would only further bloat task_struct for infrequently used data.
> >
> > task_struct already has:
> >
> > #ifdef CONFIG_PERF_EVENTS
> > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > struct mutex perf_event_mutex;
> > struct list_head perf_event_list;
> > #endif
> >
> > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > And possibly perf_event_list instead of task_bps_ht? It will contain
> > other perf_event types, so we will need to test type as well, but on
> > the positive side, we don't need any management of the separate
> > container.
>
> Hmm, yes, I looked at that but then decided against messing the
> perf/core internals. The main issue I have with using perf_event_mutex
> is that we might interfere with perf/core's locking rules as well as
> interfere with other concurrent perf event additions. Using
> perf_event_list is very likely a no-go because it requires reworking
> perf/core as well.
>
> I can already hear Peter shouting, but maybe I'm wrong. :-)

Let's wait for Peter to shout then :)
A significant part of this change is having per-task data w/o having
per-task data.

The current perf-related data in task_struct is already multiple words
and it's also not used in lots of production cases.
Maybe we could have something like:

struct perf_task_data* lazily_allocated_perf_data;

that's lazily allocated on first use instead of the current
perf_event_ctxp/perf_event_mutex/perf_event_list.
This way we could both reduce task_struct size when perf is not used
and have more perf-related data (incl. breakpoints) when it's used.
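
For concreteness, that might look something like the following (purely
a hypothetical sketch; names and fields are illustrative):

	struct perf_task_data {
		struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
		struct mutex		perf_event_mutex;
		struct list_head	perf_event_list;
		/* ... room for hw_breakpoint bookkeeping ... */
	};

	/* In task_struct, replacing the three fields above: */
	struct perf_task_data *lazily_allocated_perf_data;	/* NULL until first use */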

2022-06-09 19:40:37

by Marco Elver

Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On Thu, 9 Jun 2022 at 18:53, Dmitry Vyukov <[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 16:56, Marco Elver <[email protected]> wrote:
> > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > benchmark results in:
> > > >
> > > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > > | # Running 'breakpoint/thread' benchmark:
> > > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > > | Total time: 236.418 [sec]
> > > > |
> > > > | 123134.794271 usecs/op
> > > > | 7880626.833333 usecs/op/cpu
> > > >
> > > > The benchmark tests inherited breakpoint perf events across many
> > > > threads.
> > > >
> > > > Looking at a perf profile, we can see that the majority of the time is
> > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > > mutex as well:
> > > >
> > > > 37.27% [kernel] [k] osq_lock
> > > > 34.92% [kernel] [k] mutex_spin_on_owner
> > > > 12.15% [kernel] [k] toggle_bp_slot
> > > > 11.90% [kernel] [k] __reserve_bp_slot
> > > >
> > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > iterating through that list looking for a matching task. Clearly, this
> > > > does not scale to thousands of tasks.
> > > >
> > > > While one option would be to make task_struct a breakpoint list node,
> > > > this would only further bloat task_struct for infrequently used data.
> > >
> > > task_struct already has:
> > >
> > > #ifdef CONFIG_PERF_EVENTS
> > > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > struct mutex perf_event_mutex;
> > > struct list_head perf_event_list;
> > > #endif
> > >
> > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > other perf_event types, so we will need to test type as well, but on
> > > the positive side, we don't need any management of the separate
> > > container.
> >
> > Hmm, yes, I looked at that but then decided against messing the
> > perf/core internals. The main issue I have with using perf_event_mutex
> > is that we might interfere with perf/core's locking rules as well as
> > interfere with other concurrent perf event additions. Using
> > perf_event_list is very likely a no-go because it requires reworking
> > perf/core as well.
> >
> > I can already hear Peter shouting, but maybe I'm wrong. :-)
>
> Let's wait for Peter to shout then :)
> A significant part of this change is having per-task data w/o having
> per-task data.
>
> The current perf-related data in task_struct is already multiple words
> and it's also not used in lots of production cases.
> Maybe we could have something like:
>
> struct perf_task_data* lazily_allocated_perf_data;
>
> that's lazily allocated on first use instead of the current
> perf_event_ctxp/perf_event_mutex/perf_event_list.
> This way we could both reduce task_size when perf is not used and have
> more perf-related data (incl breakpoints) when it's used.

I don't mind either option, so keeping task_struct bloat in mind, we have:

1. rhashtable option, no changes to task_struct.

2. add the breakpoint mutex + list to task_struct.

3. add something like hw_breakpoint_task_data* and allocate lazily.

4. (your proposal) move all of perf data into a new struct (+add
hw_breakpoint things in there) that is lazily allocated.

I don't think perf is that infrequently used, and I can't estimate
performance impact, so I don't like #4 too much personally. My
preferred compromise would be #3, but at the same time I'd rather not
bloat task_struct even with 8 extra infrequently used bytes. Am I too
paranoid?

Preferences?

2022-06-10 09:11:28

by Dmitry Vyukov

Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On Thu, 9 Jun 2022 at 20:37, Marco Elver <[email protected]> wrote:
> > /On Thu, 9 Jun 2022 at 16:56, Marco Elver <[email protected]> wrote:
> > > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > > benchmark results in:
> > > > >
> > > > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > > > | # Running 'breakpoint/thread' benchmark:
> > > > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > > > | Total time: 236.418 [sec]
> > > > > |
> > > > > | 123134.794271 usecs/op
> > > > > | 7880626.833333 usecs/op/cpu
> > > > >
> > > > > The benchmark tests inherited breakpoint perf events across many
> > > > > threads.
> > > > >
> > > > > Looking at a perf profile, we can see that the majority of the time is
> > > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > > > mutex as well:
> > > > >
> > > > > 37.27% [kernel] [k] osq_lock
> > > > > 34.92% [kernel] [k] mutex_spin_on_owner
> > > > > 12.15% [kernel] [k] toggle_bp_slot
> > > > > 11.90% [kernel] [k] __reserve_bp_slot
> > > > >
> > > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > > iterating through that list looking for a matching task. Clearly, this
> > > > > does not scale to thousands of tasks.
> > > > >
> > > > > While one option would be to make task_struct a breakpoint list node,
> > > > > this would only further bloat task_struct for infrequently used data.
> > > >
> > > > task_struct already has:
> > > >
> > > > #ifdef CONFIG_PERF_EVENTS
> > > > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > > struct mutex perf_event_mutex;
> > > > struct list_head perf_event_list;
> > > > #endif
> > > >
> > > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > > other perf_event types, so we will need to test type as well, but on
> > > > the positive side, we don't need any management of the separate
> > > > container.
> > >
> > > Hmm, yes, I looked at that but then decided against messing the
> > > perf/core internals. The main issue I have with using perf_event_mutex
> > > is that we might interfere with perf/core's locking rules as well as
> > > interfere with other concurrent perf event additions. Using
> > > perf_event_list is very likely a no-go because it requires reworking
> > > perf/core as well.
> > >
> > > I can already hear Peter shouting, but maybe I'm wrong. :-)
> >
> > Let's wait for Peter to shout then :)
> > A significant part of this change is having per-task data w/o having
> > per-task data.
> >
> > The current perf-related data in task_struct is already multiple words
> > and it's also not used in lots of production cases.
> > Maybe we could have something like:
> >
> > struct perf_task_data* lazily_allocated_perf_data;
> >
> > that's lazily allocated on first use instead of the current
> > perf_event_ctxp/perf_event_mutex/perf_event_list.
> > This way we could both reduce task_size when perf is not used and have
> > more perf-related data (incl breakpoints) when it's used.
>
> I don't mind either option, so keeping task_struct bloat in mind, we have:
>
> 1. rhashtable option, no changes to task_struct.
>
> 2. add the breakpoint mutex + list to task_struct.
>
> 3. add something like hw_breakpoint_task_data* and allocate lazily.
>
> 4. (your proposal) move all of perf data into a new struct (+add
> hw_breakpoint things in there) that is lazily allocated.
>
> I don't think perf is that infrequently used, and I can't estimate
> performance impact, so I don't like #4 too much personally. My
> preferred compromise would be #3, but at the same time I'd rather not
> bloat task_struct even with 8 extra infrequently used bytes. Am I too
> paranoid?
>
> Preferences?


There is also this "could eventually get its own" comment:

static struct pmu perf_breakpoint = {
	.task_ctx_nr = perf_sw_context, /* could eventually get its own */
https://elixir.bootlin.com/linux/v5.19-rc1/source/kernel/events/hw_breakpoint.c#L669

If it gets its own, then it also gets a perf_event_context pointer in
task_struct:
https://elixir.bootlin.com/linux/v5.19-rc1/source/include/linux/sched.h#L1229
And perf_event_context has its own mutex and lots of other stuff.
But I don't know what other implications it has.

2022-06-10 09:11:39

by Marco Elver

Subject: Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent

On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
<[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <[email protected]> wrote:
> >
> > Running the perf benchmark with (note: more aggressive parameters vs.
> > preceding changes, but same host with 256 CPUs):
> >
> > | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> > | Total time: 1.953 [sec]
> > |
> > | 38.146289 usecs/op
> > | 4882.725000 usecs/op/cpu
> >
> > 16.29% [kernel] [k] rhashtable_jhash2
> > 16.19% [kernel] [k] osq_lock
> > 14.22% [kernel] [k] queued_spin_lock_slowpath
> > 8.58% [kernel] [k] task_bp_pinned
> > 8.30% [kernel] [k] mutex_spin_on_owner
> > 4.03% [kernel] [k] smp_cfm_core_cond
> > 2.97% [kernel] [k] toggle_bp_slot
> > 2.94% [kernel] [k] bcmp
> >
> > We can see that a majority of the time is now spent hashing task
> > pointers to index into task_bps_ht in task_bp_pinned().
> >
> > However, if task_bp_pinned()'s computation is independent of any CPU,
> > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > identical. With increasing CPU-count, this problem worsens.
> >
> > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > and cache the result. Use the cached result instead of a call to
> > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > if the cached result can be used.
> >
> > After this optimization:
> >
> > 21.96% [kernel] [k] queued_spin_lock_slowpath
> > 16.39% [kernel] [k] osq_lock
> > 9.82% [kernel] [k] toggle_bp_slot
> > 9.81% [kernel] [k] find_next_bit
> > 4.93% [kernel] [k] mutex_spin_on_owner
> > 4.71% [kernel] [k] smp_cfm_core_cond
> > 4.30% [kernel] [k] __reserve_bp_slot
> > 2.65% [kernel] [k] cpumask_next
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, however, we see no statistically
> > significant improvement in performance on the test system with 256 CPUs.
> > This is very likely due to the benchmark parameters being too aggressive
> > and contention elsewhere becoming dominant.
> >
> > Indeed, when using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > | Total time: 0.071 [sec]
> > |
> > | 37.134896 usecs/op
> > | 2376.633333 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without this optimization
> > (baseline is 42 usecs/op). This is now only 5% slower than the
> > theoretical ideal (constraints disabled), and 18% slower than no
> > breakpoints at all.
> >
> > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> > more consistent with other functions (which have bp first, before the
> > cpu argument). ]
>
> There are 3 main cases:
> 1. Per-cpu bp.

Yes, CPU-target breakpoint on just 1 CPU.

> 2. Per-task and per-cpu bp.

Task-target breakpoint but pinned to 1 CPU.

> 3. Per-task bp (on all cpus)

Task-target breakpoint that can run on all CPUs.

> right?
>
> For case 1 we still seem to do lots of unnecessary work in
> fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> only the CPU's cpu_pinned, so that's the only CPU we need to
> fetch/check.

It'll just do 1 iteration, because cpumask_of_bp() will return a mask
with just the event's target CPU in it.
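
For reference, a paraphrased sketch of that helper (the exact body in
kernel/events/hw_breakpoint.c may differ slightly across versions):

static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
{
        /* CPU-target or CPU-pinned breakpoint: only the target CPU matters. */
        if (bp->cpu >= 0)
                return cpumask_of(bp->cpu);
        /* Task-target breakpoint not bound to a CPU: all possible CPUs. */
        return cpu_possible_mask;
}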

> For case 2 we also do lots of unnecessary work, again we also need to
> check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
> atomic_dec/inc on all other CPUs (they dec/inc the same variable).

Same as above, just 1 iteration because cpumask_of_bp() does the right
thing. cached_tbp_pinned may still be used if all existing task
breakpoints are CPU-independent (i.e. cpu==-1; granted, doing
task_bp_pinned() once or twice probably is irrelevant in this case).

> Case 3 is the only one when we need to check all CPUs and
> cached_tbp_pinned may be useful.
> But I wonder if we could instead add a per-task
> has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> then we could optimize the code even more by making it O(1) instead of
> O(N).

> Namely, we add global tsk_pinned for tasks that don't have
> per-cpu breakpoints, and we update only that tsk_pinned instead of
> iterating over all CPUs.

That seems reasonable.

> I think this will require adding cpu_pinned as well (similar to
> tsk_pinned but aggregated over all CPUs).

> Then the fast path capacity check can become just:
>
> if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
> if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
> hw_breakpoint_weight(bp) > nr_slots[type])
> return -ENOSPC;
> }
>
> Does it make any sense?

Yes, I think this might work. I'll see if I can make it work.
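
For concreteness, a rough sketch of the proposed fast path; note that
has_per_cpu_breakpoints, max_cpu_bp_pinned() and the aggregated counters
are hypothetical names taken from the proposal above, not existing kernel
symbols:

static int reserve_bp_slot_fast(struct perf_event *bp, enum bp_type_idx type)
{
        /* Task-target bp, not pinned to a CPU, task has no per-CPU bps. */
        if (!bp->hw.target || bp->cpu >= 0 ||
            bp->hw.target->has_per_cpu_breakpoints)
                return -EAGAIN; /* fall back to the per-CPU slow path */

        /*
         * With globally aggregated counters, both terms below become O(1)
         * lookups instead of per-CPU iterations.
         */
        if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /* cpu */, bp, type) +
            hw_breakpoint_weight(bp) > nr_slots[type])
                return -ENOSPC;

        return 0;
}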

2022-06-10 09:32:30

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent

On Fri, 10 Jun 2022 at 10:25, Marco Elver <[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
> <[email protected]> wrote:
> >
> > On Thu, 9 Jun 2022 at 13:31, Marco Elver <[email protected]> wrote:
> > >
> > > Running the perf benchmark with (note: more aggressive parameters vs.
> > > preceding changes, but same host with 256 CPUs):
> > >
> > > | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> > > | Total time: 1.953 [sec]
> > > |
> > > | 38.146289 usecs/op
> > > | 4882.725000 usecs/op/cpu
> > >
> > > 16.29% [kernel] [k] rhashtable_jhash2
> > > 16.19% [kernel] [k] osq_lock
> > > 14.22% [kernel] [k] queued_spin_lock_slowpath
> > > 8.58% [kernel] [k] task_bp_pinned
> > > 8.30% [kernel] [k] mutex_spin_on_owner
> > > 4.03% [kernel] [k] smp_cfm_core_cond
> > > 2.97% [kernel] [k] toggle_bp_slot
> > > 2.94% [kernel] [k] bcmp
> > >
> > > We can see that a majority of the time is now spent hashing task
> > > pointers to index into task_bps_ht in task_bp_pinned().
> > >
> > > However, if task_bp_pinned()'s computation is independent of any CPU,
> > > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > > identical. With increasing CPU-count, this problem worsens.
> > >
> > > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > > and cache the result. Use the cached result instead of a call to
> > > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > > if the cached result can be used.
> > >
> > > After this optimization:
> > >
> > > 21.96% [kernel] [k] queued_spin_lock_slowpath
> > > 16.39% [kernel] [k] osq_lock
> > > 9.82% [kernel] [k] toggle_bp_slot
> > > 9.81% [kernel] [k] find_next_bit
> > > 4.93% [kernel] [k] mutex_spin_on_owner
> > > 4.71% [kernel] [k] smp_cfm_core_cond
> > > 4.30% [kernel] [k] __reserve_bp_slot
> > > 2.65% [kernel] [k] cpumask_next
> > >
> > > Showing that the time spent hashing keys has become insignificant.
> > >
> > > With the given benchmark parameters, however, we see no statistically
> > > significant improvement in performance on the test system with 256 CPUs.
> > > This is very likely due to the benchmark parameters being too aggressive
> > > and contention elsewhere becoming dominant.
> > >
> > > Indeed, when using the less aggressive parameters from the preceding
> > > changes, we now observe:
> > >
> > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > | Total time: 0.071 [sec]
> > > |
> > > | 37.134896 usecs/op
> > > | 2376.633333 usecs/op/cpu
> > >
> > > Which is an improvement of 12% compared to without this optimization
> > > (baseline is 42 usecs/op). This is now only 5% slower than the
> > > theoretical ideal (constraints disabled), and 18% slower than no
> > > breakpoints at all.
> > >
> > > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> > > more consistent with other functions (which have bp first, before the
> > > cpu argument). ]
> >
> > There are 3 main cases:
> > 1. Per-cpu bp.
>
> Yes, CPU-target breakpoint on just 1 CPU.
>
> > 2. Per-task and per-cpu bp.
>
> Task-target breakpoint but pinned to 1 CPU.
>
> > 3. Per-task bp (on all cpus)
>
> Task-target breakpoint that can run on all CPUs.
>
> > right?
> >
> > For case 1 we still seem to do lots of unnecessary work in
> > fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> > only the CPU's cpu_pinned, so that's the only CPU we need to
> > fetch/check.
>
> It'll just do 1 iteration, because cpumask_of_bp() will return a mask
> with just the event's target CPU in it.

You are right. I missed the use of cpumask_of_bp().

> > For case 2 we also do lots of unnecessary work, again we also need to
> > check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
> > atomic_dec/inc on all other CPUs (they dec/inc the same variable).
>
> Same as above, just 1 iteration because cpumask_of_bp() does the right
> thing. cached_tbp_pinned may still be used if all existing task
> breakpoints are CPU-independent (i.e. cpu==-1; granted, doing
> task_bp_pinned() once or twice probably is irrelevant in this case).
>
> > Case 3 is the only one when we need to check all CPUs and
> > cached_tbp_pinned may be useful.
> > But I wonder if we could instead add a per-task
> > has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> > CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> > then we could optimize the code even more by making it O(1) instead of
> > O(N).
>
> > Namely, we add global tsk_pinned for tasks that don't have
> > per-cpu breakpoints, and we update only that tsk_pinned instead of
> > iterating over all CPUs.
>
> That seems reasonable.
>
> > I think this will require adding cpu_pinned as well (similar to
> > tsk_pinned but aggregated over all CPUs).
>
> > Then the fast path capacity check can become just:
> >
> > if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
> > if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
> > hw_breakpoint_weight(bp) > nr_slots[type])
> > return -ENOSPC;
> > }
> >
> > Does it make any sense?
>
> Yes, I think this might work. I'll see if I can make it work.

Actually!
This is somewhat orthogonal to the optimizations you are doing, but
the most interesting case for us is inherited events. And it seems
that an inherited event can't possibly overflow the capacity.
Inherited events are a subset of the parent's events, all parent events
have already passed validation, and the child can't have its own new
events at the point where the inherited events are created.
So couldn't we somehow detect that reserve_bp_slot() is called from
inherit_event() and skip fetch_bp_busy_slots() altogether? Maybe that
can be detected by looking at bp->attr.inherit and presence of parent
context? Capacity validation may be kept as a debug-only check.
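
A hypothetical sketch of what that detection could look like; whether
attr.inherit plus the presence of a parent event is a sufficient and safe
condition is exactly the open question here:

static bool bp_is_inherited_copy(struct perf_event *bp)
{
        /* Inherited child events carry attr.inherit and point at a parent. */
        return bp->attr.inherit && bp->parent;
}

/*
 * In __reserve_bp_slot(), before fetch_bp_busy_slots():
 *
 *      if (bp_is_inherited_copy(bp)) {
 *              toggle_bp_slot(bp, true, type, weight); // account the slot only
 *              return 0;                               // skip the capacity check
 *      }
 */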

2022-06-10 10:10:24

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints

On Fri, 10 Jun 2022 at 11:04, Dmitry Vyukov <[email protected]> wrote:
>
> On Thu, 9 Jun 2022 at 20:37, Marco Elver <[email protected]> wrote:
> > > On Thu, 9 Jun 2022 at 16:56, Marco Elver <[email protected]> wrote:
> > > > > > On a machine with 256 CPUs, running the recently added perf breakpoint
> > > > > > benchmark results in:
> > > > > >
> > > > > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > > > > | # Running 'breakpoint/thread' benchmark:
> > > > > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > > > > | Total time: 236.418 [sec]
> > > > > > |
> > > > > > | 123134.794271 usecs/op
> > > > > > | 7880626.833333 usecs/op/cpu
> > > > > >
> > > > > > The benchmark tests inherited breakpoint perf events across many
> > > > > > threads.
> > > > > >
> > > > > > Looking at a perf profile, we can see that the majority of the time is
> > > > > > spent in various hw_breakpoint.c functions, which execute within the
> > > > > > 'nr_bp_mutex' critical sections which then results in contention on that
> > > > > > mutex as well:
> > > > > >
> > > > > > 37.27% [kernel] [k] osq_lock
> > > > > > 34.92% [kernel] [k] mutex_spin_on_owner
> > > > > > 12.15% [kernel] [k] toggle_bp_slot
> > > > > > 11.90% [kernel] [k] __reserve_bp_slot
> > > > > >
> > > > > > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > > > > > O(#tasks) due to storing all task breakpoints in the same list and
> > > > > > iterating through that list looking for a matching task. Clearly, this
> > > > > > does not scale to thousands of tasks.
> > > > > >
> > > > > > While one option would be to make task_struct a breakpoint list node,
> > > > > > this would only further bloat task_struct for infrequently used data.
> > > > >
> > > > > task_struct already has:
> > > > >
> > > > > #ifdef CONFIG_PERF_EVENTS
> > > > > struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
> > > > > struct mutex perf_event_mutex;
> > > > > struct list_head perf_event_list;
> > > > > #endif
> > > > >
> > > > > Wonder if it's possible to use perf_event_mutex instead of the task_sharded_mtx?
> > > > > And possibly perf_event_list instead of task_bps_ht? It will contain
> > > > > other perf_event types, so we will need to test type as well, but on
> > > > > the positive side, we don't need any management of the separate
> > > > > container.
> > > >
> > > > Hmm, yes, I looked at that but then decided against messing the
> > > > perf/core internals. The main issue I have with using perf_event_mutex
> > > > is that we might interfere with perf/core's locking rules as well as
> > > > interfere with other concurrent perf event additions. Using
> > > > perf_event_list is very likely a no-go because it requires reworking
> > > > perf/core as well.
> > > >
> > > > I can already hear Peter shouting, but maybe I'm wrong. :-)
> > >
> > > Let's wait for Peter to shout then :)
> > > A significant part of this change is having per-task data w/o having
> > > per-task data.
> > >
> > > The current perf-related data in task_struct is already multiple words
> > > and it's also not used in lots of production cases.
> > > Maybe we could have something like:
> > >
> > > struct perf_task_data* lazily_allocated_perf_data;
> > >
> > > that's lazily allocated on first use instead of the current
> > > perf_event_ctxp/perf_event_mutex/perf_event_list.
> > > This way we could both reduce task_size when perf is not used and have
> > > more perf-related data (incl breakpoints) when it's used.
> >
> > I don't mind either option, so keeping task_struct bloat in mind, we have:
> >
> > 1. rhashtable option, no changes to task_struct.
> >
> > 2. add the breakpoint mutex + list to task_struct.
> >
> > 3. add something like hw_breakpoint_task_data* and allocate lazily.
> >
> > 4. (your proposal) move all of perf data into a new struct (+add
> > hw_breakpoint things in there) that is lazily allocated.
> >
> > I don't think perf is that infrequently used, and I can't estimate
> > performance impact, so I don't like #4 too much personally. My
> > preferred compromise would be #3, but at the same time I'd rather not
> > bloat task_struct even with 8 extra infrequently used bytes. Am I too
> > paranoid?
> >
> > Preferences?
>
>
> There is also this "could eventually get its own" comment:
>
> static struct pmu perf_breakpoint = {
> .task_ctx_nr = perf_sw_context, /* could eventually get its own */
> https://elixir.bootlin.com/linux/v5.19-rc1/source/kernel/events/hw_breakpoint.c#L669
>
> If it gets its own, then it also gets a perf_event_context pointer in
> task_struct:
> https://elixir.bootlin.com/linux/v5.19-rc1/source/include/linux/sched.h#L1229
> And perf_event_context has its own mutex and lots of other stuff.
> But I don't know what other implications it has.

Relying on perf events to be the only way that instantiates
breakpoints does not work, because hw_breakpoint is also used by
ptrace independently.

On the whole, adding lazily allocated data to task_struct is not as
simple as the rhashtable option (we'd need to take care of fork and exit
and make sure the lazily allocated data lives long enough, etc.). I
question the added complexity vs. the benefit, when using the rhashtable
avoids all that. Once I get rid of the O(#cpu) loops, the rhashtable no
longer shows up in profiles either, so further optimization here wouldn't
buy us much in terms of performance.
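
For reference, the rhashtable option amounts to roughly the following
(names follow patch 1/8 of this series, but treat it as a sketch rather
than the exact upstream code):

static struct rhltable task_bps_ht;
static const struct rhashtable_params task_bps_ht_params = {
        .head_offset = offsetof(struct hw_perf_event, bp_list),
        .key_offset = offsetof(struct hw_perf_event, target),
        .key_len = sizeof_field(struct hw_perf_event, target),
        .automatic_shrinking = true,
};

/*
 * task_bp_pinned() then does a single rhltable_lookup() keyed by
 * bp->hw.target and walks only that task's breakpoints, instead of
 * scanning a global list of all task breakpoints.
 */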

If the main issue is the mutex, I suppose we can find a hole in
task_struct and stick it there (there's a massive 32-byte hole above
task_struct::stats).

Was the mutex the only benefit?