From: Dmitry Vyukov
Date: Thu, 9 Jun 2022 16:29:02 +0200
Subject: Re: [PATCH 1/8] perf/hw_breakpoint: Optimize list of per-task breakpoints
To: Marco Elver
Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
 Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin, Jiri Olsa,
 Namhyung Kim, linux-perf-users@vger.kernel.org, x86@kernel.org,
 linux-sh@vger.kernel.org, kasan-dev@googlegroups.com,
 linux-kernel@vger.kernel.org
In-Reply-To: <20220609113046.780504-2-elver@google.com>
References: <20220609113046.780504-1-elver@google.com>
 <20220609113046.780504-2-elver@google.com>

On Thu, 9 Jun 2022 at 13:31, Marco Elver wrote:
>
> On a machine with 256 CPUs, running the recently added perf breakpoint
> benchmark results in:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 236.418 [sec]
> |
> |  123134.794271 usecs/op
> | 7880626.833333 usecs/op/cpu
>
> The benchmark tests inherited breakpoint perf events across many
> threads.
>
> Looking at a perf profile, we can see that the majority of the time is
> spent in various hw_breakpoint.c functions, which execute within the
> 'nr_bp_mutex' critical sections, which then results in contention on
> that mutex as well:
>
>     37.27%  [kernel]  [k] osq_lock
>     34.92%  [kernel]  [k] mutex_spin_on_owner
>     12.15%  [kernel]  [k] toggle_bp_slot
>     11.90%  [kernel]  [k] __reserve_bp_slot
>
> The culprit here is task_bp_pinned(), which has a runtime complexity of
> O(#tasks) due to storing all task breakpoints in the same list and
> iterating through that list looking for a matching task. Clearly, this
> does not scale to thousands of tasks.
>
> While one option would be to make task_struct a breakpoint list node,
> this would only further bloat task_struct for infrequently used data.

task_struct already has:

#ifdef CONFIG_PERF_EVENTS
	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
	struct mutex			perf_event_mutex;
	struct list_head		perf_event_list;
#endif

I wonder if it's possible to use perf_event_mutex instead of the
task_sharded_mtx, and perf_event_list instead of task_bps_ht. That list
will contain other perf_event types, so we would need to test the event
type as well, but on the positive side we wouldn't need any management
of a separate container.
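To make that concrete, here is a rough sketch of how task_bp_pinned()
could walk the existing list instead. This is illustrative only: that
the breakpoint events are reachable via owner_entry, and that the
caller can serialize against tsk->perf_event_mutex, are assumptions
that would need checking against the perf core.

/*
 * Hypothetical task_bp_pinned() reusing tsk->perf_event_list rather
 * than a dedicated container. The list holds perf events of all
 * types, so filter for breakpoint events before counting.
 */
static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
{
	struct task_struct *tsk = bp->hw.target;
	struct perf_event *iter;
	int count = 0;

	list_for_each_entry(iter, &tsk->perf_event_list, owner_entry) {
		if (iter->attr.type == PERF_TYPE_BREAKPOINT &&
		    find_slot_idx(iter->attr.bp_type) == type &&
		    (iter->cpu < 0 || cpu == iter->cpu))
			count += hw_breakpoint_weight(iter);
	}

	return count;
}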
> Instead, make use of the "rhashtable" variant "rhltable" which stores
> multiple items with the same key in a list. This results in average
> runtime complexity of O(1) for task_bp_pinned().
>
> With the optimization, the benchmark shows:
>
> | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> | # Running 'breakpoint/thread' benchmark:
> | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> | Total time: 0.208 [sec]
> |
> |  108.422396 usecs/op
> | 6939.033333 usecs/op/cpu
>
> On this particular setup that's a speedup of ~1135x.
>
> Signed-off-by: Marco Elver
> ---
>  include/linux/perf_event.h    |  3 +-
>  kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
>  2 files changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 01231f1d976c..e27360436dc6 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
>  };
>
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> +#include <linux/rhashtable-types.h>
>  #include <asm/hw_breakpoint.h>
>  #endif
>
> @@ -178,7 +179,7 @@ struct hw_perf_event {
>  			 * creation and event initalization.
>  			 */
>  			struct arch_hw_breakpoint	info;
> -			struct list_head		bp_list;
> +			struct rhlist_head		bp_list;
>  		};
>  #endif
>  		struct { /* amd_iommu */
>
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index f32320ac02fd..25c94c6e918d 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -28,7 +28,7 @@
>  #include <linux/...>
>  #include <linux/...>
>  #include <linux/...>
> -#include <linux/list.h>
> +#include <linux/rhashtable.h>
>  #include <linux/...>
>  #include <linux/...>
>  #include <linux/...>
> @@ -55,7 +55,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type)
>  }
>
>  /* Keep track of the breakpoints attached to tasks */
> -static LIST_HEAD(bp_task_head);
> +static struct rhltable task_bps_ht;
> +static const struct rhashtable_params task_bps_ht_params = {
> +	.head_offset = offsetof(struct hw_perf_event, bp_list),
> +	.key_offset = offsetof(struct hw_perf_event, target),
> +	.key_len = sizeof_field(struct hw_perf_event, target),
> +	.automatic_shrinking = true,
> +};
>
>  static int constraints_initialized;
>
> @@ -104,17 +110,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type)
>   */
>  static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type)
>  {
> -	struct task_struct *tsk = bp->hw.target;
> +	struct rhlist_head *head, *pos;
>  	struct perf_event *iter;
>  	int count = 0;
>
> -	list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
> -		if (iter->hw.target == tsk &&
> -		    find_slot_idx(iter->attr.bp_type) == type &&
> +	rcu_read_lock();
> +	head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params);
> +	if (!head)
> +		goto out;
> +
> +	rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
> +		if (find_slot_idx(iter->attr.bp_type) == type &&
>  		    (iter->cpu < 0 || cpu == iter->cpu))
>  			count += hw_breakpoint_weight(iter);
>  	}
>
> +out:
> +	rcu_read_unlock();
>  	return count;
>  }
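As an aside for readers new to this API: rhltable_lookup() returns the
head of the per-key list, and rhl_for_each_entry_rcu() walks every
entry sharing that key under RCU, which is what gives task_bp_pinned()
its O(1) average lookup. A minimal sketch of the pattern, with made-up
types that are not from the patch:

#include <linux/rhashtable.h>

/* Toy types to show the rhltable pattern; illustrative only. */
struct item {
	int			key;	/* hash key; duplicates allowed */
	struct rhlist_head	node;	/* links items sharing a key */
};

static const struct rhashtable_params item_params = {
	.head_offset	     = offsetof(struct item, node),
	.key_offset	     = offsetof(struct item, key),
	.key_len	     = sizeof_field(struct item, key),
	.automatic_shrinking = true,
};

/* Count all items stored under @key, analogous to task_bp_pinned(). */
static int count_items(struct rhltable *ht, int key)
{
	struct rhlist_head *head, *pos;
	struct item *it;
	int n = 0;

	rcu_read_lock();
	head = rhltable_lookup(ht, &key, item_params);
	if (head) {
		rhl_for_each_entry_rcu(it, pos, head, node)
			n++;
	}
	rcu_read_unlock();
	return n;
}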
>
> @@ -187,7 +199,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu,
>  /*
>   * Add/remove the given breakpoint in our constraint table
>   */
> -static void
> +static int
>  toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>  	       int weight)
>  {
> @@ -200,7 +212,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>  	/* Pinned counter cpu profiling */
>  	if (!bp->hw.target) {
>  		get_bp_info(bp->cpu, type)->cpu_pinned += weight;
> -		return;
> +		return 0;
>  	}
>
>  	/* Pinned counter task profiling */
> @@ -208,9 +220,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type,
>  	toggle_bp_task_slot(bp, cpu, type, weight);
>
>  	if (enable)
> -		list_add_tail(&bp->hw.bp_list, &bp_task_head);
> +		return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>  	else
> -		list_del(&bp->hw.bp_list);
> +		return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params);
>  }
>
>  __weak int arch_reserve_bp_slot(struct perf_event *bp)
> @@ -308,9 +320,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type)
>  	if (ret)
>  		return ret;
>
> -	toggle_bp_slot(bp, true, type, weight);
> -
> -	return 0;
> +	return toggle_bp_slot(bp, true, type, weight);
>  }
>
>  int reserve_bp_slot(struct perf_event *bp)
> @@ -335,7 +345,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type)
>
>  	type = find_slot_idx(bp_type);
>  	weight = hw_breakpoint_weight(bp);
> -	toggle_bp_slot(bp, false, type, weight);
> +	WARN_ON(toggle_bp_slot(bp, false, type, weight));
>  }
>
>  void release_bp_slot(struct perf_event *bp)
> @@ -679,7 +689,7 @@ static struct pmu perf_breakpoint = {
>  int __init init_hw_breakpoint(void)
>  {
>  	int cpu, err_cpu;
> -	int i;
> +	int i, ret;
>
>  	for (i = 0; i < TYPE_MAX; i++)
>  		nr_slots[i] = hw_breakpoint_slots(i);
> @@ -690,18 +700,24 @@ int __init init_hw_breakpoint(void)
>
>  			info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int),
>  						   GFP_KERNEL);
> -			if (!info->tsk_pinned)
> -				goto err_alloc;
> +			if (!info->tsk_pinned) {
> +				ret = -ENOMEM;
> +				goto err;
> +			}
>  		}
>  	}
>
> +	ret = rhltable_init(&task_bps_ht, &task_bps_ht_params);
> +	if (ret)
> +		goto err;
> +
>  	constraints_initialized = 1;
>
>  	perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT);
>
>  	return register_die_notifier(&hw_breakpoint_exceptions_nb);
>
> - err_alloc:
> +err:
>  	for_each_possible_cpu(err_cpu) {
>  		for (i = 0; i < TYPE_MAX; i++)
>  			kfree(get_bp_info(err_cpu, i)->tsk_pinned);
> @@ -709,7 +725,5 @@ int __init init_hw_breakpoint(void)
>  			break;
>  	}
>
> -	return -ENOMEM;
> +	return ret;
>  }
> -
> -
> --
> 2.36.1.255.ge46751e96f-goog
>
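One note on the container choice, since it may not be obvious from the
diff: a plain rhashtable allows at most one entry per key, while
rhltable chains all entries that share a key, which is exactly what
"several breakpoints per task" needs. It also explains why
toggle_bp_slot() now returns int: insertion can fail (e.g. with
-ENOMEM while growing the table). Reusing the toy types from the
sketch above (illustrative only, not from the patch):

/*
 * Two items with the same key can coexist in an rhltable; a plain
 * rhashtable would reject the second insert.
 */
static int add_two(struct rhltable *ht, struct item *a, struct item *b)
{
	int err;

	a->key = 42;
	b->key = 42;	/* same key as 'a': fine for rhltable */

	err = rhltable_insert(ht, &a->node, item_params);
	if (err)
		return err;

	err = rhltable_insert(ht, &b->node, item_params);
	if (err)
		rhltable_remove(ht, &a->node, item_params);	/* unwind */
	return err;
}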