Received: by 2002:a05:6358:45e:b0:b5:b6eb:e1f9 with SMTP id 30csp450836rwe; Thu, 1 Sep 2022 02:14:06 -0700 (PDT) X-Google-Smtp-Source: AA6agR6oudRDYjt3dD48SBwsi1UdzlmRIKGjj7tFaUtamznh/HbNC1GjIA1mH6aHsgLHFaSLQzyZ X-Received: by 2002:a05:6a00:1a14:b0:52d:5fee:d46b with SMTP id g20-20020a056a001a1400b0052d5feed46bmr30289705pfv.82.1662023646083; Thu, 01 Sep 2022 02:14:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1662023646; cv=none; d=google.com; s=arc-20160816; b=UYROj/+zcuisQJzuVdB7LiWGMjn/r2g4+YMFbUbQKzkx1ao+lUbnsiIKkHDtQU9hbO A2W/YI+soD1LMx3Y+lgVUyacfuyenEJlEMqldDv9csxqmokTvpvXoLacxKHN7Bro3QMh uEGTG71ICDyzVIrjcM/3xazyCv7lHcQsu3AsGiqqsCnnOXGyLaQ+nDM0zVu5wxJBMFg7 Mxfi+fmMWsTmEQWZr1gyN4+zgFZ5b0wMVa1r5zn+zk1fY0VA94WSSUMoaUo/DFVCC7oZ DxwSolVZK90JPHd/3GgxB6Qj1Ra83b81ap0N/ALCQvQzLTBMgx9i/c66mFEeykidwWhq 0StA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:robot-unsubscribe :robot-id:message-id:mime-version:references:in-reply-to:cc:subject :to:reply-to:sender:from:dkim-signature:dkim-signature:date; bh=Z5heLwNvpbPU/qcM1bJuGEeJZW47rYs8ZLDh2pyte2Q=; b=U7ZHo162rg5Z+IJ/6ndqgP6nAdKCY6nceHgQerNhSpYNVpjhftOhqF57BruG1Ves/F zj5ZpF8xJF75wFkSuNYt2/Qbma7/JtQxUkETkBRPg0R7Cd/eph0XOaWYA104ql4meb/Q vb4G0LVwTGJ8ZQzDBBk8gAmcb3COfIjgA6+k+ZO9oWjmidHZB/o//XNK5n1i5qLh8dOs RGC3IPBzgB1Y9ZWh1FXUxJssSOSHdeEcEUn18YGORqIl3CILMVLcfG20cIrRpRGfSoDz XsWZidCtXLYwvL3NKtzRN/oqwjuNYcy7v1yynJs5bZ+Q9532xgy13fvtORl6aBlH0s3T Jdag== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=BYWCu7WA; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id e6-20020a056a001a8600b0053725628628si18357236pfv.234.2022.09.01.02.13.54; Thu, 01 Sep 2022 02:14:06 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=BYWCu7WA; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234233AbiIAINq (ORCPT + 99 others); Thu, 1 Sep 2022 04:13:46 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45678 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234005AbiIAIM6 (ORCPT ); Thu, 1 Sep 2022 04:12:58 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AA92B12BF66; Thu, 1 Sep 2022 01:12:54 -0700 (PDT) Date: Thu, 01 Sep 2022 08:12:50 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1662019971; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Z5heLwNvpbPU/qcM1bJuGEeJZW47rYs8ZLDh2pyte2Q=; b=BYWCu7WAfs1UhsDgax3p20Dl6mdHgUBQD3Mn5fvM6TalxQtuY8GYfkplmV6aZNcZpRWfg4 yhyp74D6zha7hpZWO8TQH7SQHP/n2kktUSsmf9Mwioqd1yH6AwDoCjvYNF52N5Uxi94xB3 yXRWIw4g1Ru5n7EfGdqP8rd6fAKn+/srff4xaeoQ2mbR2tp+n7DSK+IxERDriTjMqifr3M fKuxEA61qvKkU7i8OIJkn6hBC5PjTtG/D0rmBvHZtmp09Jzj0upSAFc5KrNtgT9k4zXWi1 mbC6G6wWgPHpAMeTuFtMmvpxIOPQhcWBMIc+Zbv7xfFYfoVzgAOxJAwU++6VKg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1662019971; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Z5heLwNvpbPU/qcM1bJuGEeJZW47rYs8ZLDh2pyte2Q=; b=oHhZIGJz4GFfPP66uLxowXUaExS/kkQHfF6Ou5usB4+XMLLRmI77y+k3hRF4/JYI6Q2C/b CqODAolJyN+8w9Dw== From: "tip-bot2 for Marco Elver" Sender: tip-bot2@linutronix.de Reply-to: linux-kernel@vger.kernel.org To: linux-tip-commits@vger.kernel.org Subject: [tip: perf/core] perf/hw_breakpoint: Optimize list of per-task breakpoints Cc: Marco Elver , "Peter Zijlstra (Intel)" , Dmitry Vyukov , Ian Rogers , x86@kernel.org, linux-kernel@vger.kernel.org In-Reply-To: <20220829124719.675715-5-elver@google.com> References: <20220829124719.675715-5-elver@google.com> MIME-Version: 1.0 Message-ID: <166201997029.401.15207678182540811754.tip-bot2@tip-bot2> Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The following commit has been merged into the perf/core branch of tip: Commit-ID: 0370dc314df35579b751d1b77c9169f071444962 Gitweb: https://git.kernel.org/tip/0370dc314df35579b751d1b77c9169f071444962 Author: Marco Elver AuthorDate: Mon, 29 Aug 2022 14:47:09 +02:00 Committer: Peter Zijlstra CommitterDate: Tue, 30 Aug 2022 10:56:21 +02:00 perf/hw_breakpoint: Optimize list of per-task breakpoints On a machine with 256 CPUs, running the recently added perf breakpoint benchmark results in: | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 236.418 [sec] | | 123134.794271 usecs/op | 7880626.833333 usecs/op/cpu The benchmark tests inherited breakpoint perf events across many threads. Looking at a perf profile, we can see that the majority of the time is spent in various hw_breakpoint.c functions, which execute within the 'nr_bp_mutex' critical sections which then results in contention on that mutex as well: 37.27% [kernel] [k] osq_lock 34.92% [kernel] [k] mutex_spin_on_owner 12.15% [kernel] [k] toggle_bp_slot 11.90% [kernel] [k] __reserve_bp_slot The culprit here is task_bp_pinned(), which has a runtime complexity of O(#tasks) due to storing all task breakpoints in the same list and iterating through that list looking for a matching task. Clearly, this does not scale to thousands of tasks. Instead, make use of the "rhashtable" variant "rhltable" which stores multiple items with the same key in a list. This results in average runtime complexity of O(1) for task_bp_pinned(). With the optimization, the benchmark shows: | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.208 [sec] | | 108.422396 usecs/op | 6939.033333 usecs/op/cpu On this particular setup that's a speedup of ~1135x. While one option would be to make task_struct a breakpoint list node, this would only further bloat task_struct for infrequently used data. Furthermore, after all optimizations in this series, there's no evidence it would result in better performance: later optimizations make the time spent looking up entries in the hash table negligible (we'll reach the theoretical ideal performance i.e. no constraints). Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dmitry Vyukov Acked-by: Ian Rogers Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com --- include/linux/perf_event.h | 3 +- kernel/events/hw_breakpoint.c | 56 +++++++++++++++++++++------------- 2 files changed, 37 insertions(+), 22 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index ae30c61..1999408 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks { }; #ifdef CONFIG_HAVE_HW_BREAKPOINT +#include #include #endif @@ -178,7 +179,7 @@ struct hw_perf_event { * creation and event initalization. */ struct arch_hw_breakpoint info; - struct list_head bp_list; + struct rhlist_head bp_list; }; #endif struct { /* amd_iommu */ diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c index 6076c63..6d09edc 100644 --- a/kernel/events/hw_breakpoint.c +++ b/kernel/events/hw_breakpoint.c @@ -26,10 +26,10 @@ #include #include #include -#include #include #include #include +#include #include #include @@ -54,7 +54,13 @@ static struct bp_cpuinfo *get_bp_info(int cpu, enum bp_type_idx type) } /* Keep track of the breakpoints attached to tasks */ -static LIST_HEAD(bp_task_head); +static struct rhltable task_bps_ht; +static const struct rhashtable_params task_bps_ht_params = { + .head_offset = offsetof(struct hw_perf_event, bp_list), + .key_offset = offsetof(struct hw_perf_event, target), + .key_len = sizeof_field(struct hw_perf_event, target), + .automatic_shrinking = true, +}; static int constraints_initialized; @@ -103,17 +109,23 @@ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type) */ static int task_bp_pinned(int cpu, struct perf_event *bp, enum bp_type_idx type) { - struct task_struct *tsk = bp->hw.target; + struct rhlist_head *head, *pos; struct perf_event *iter; int count = 0; - list_for_each_entry(iter, &bp_task_head, hw.bp_list) { - if (iter->hw.target == tsk && - find_slot_idx(iter->attr.bp_type) == type && + rcu_read_lock(); + head = rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params); + if (!head) + goto out; + + rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) { + if (find_slot_idx(iter->attr.bp_type) == type && (iter->cpu < 0 || cpu == iter->cpu)) count += hw_breakpoint_weight(iter); } +out: + rcu_read_unlock(); return count; } @@ -186,7 +198,7 @@ static void toggle_bp_task_slot(struct perf_event *bp, int cpu, /* * Add/remove the given breakpoint in our constraint table */ -static void +static int toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type, int weight) { @@ -199,7 +211,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type, /* Pinned counter cpu profiling */ if (!bp->hw.target) { get_bp_info(bp->cpu, type)->cpu_pinned += weight; - return; + return 0; } /* Pinned counter task profiling */ @@ -207,9 +219,9 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum bp_type_idx type, toggle_bp_task_slot(bp, cpu, type, weight); if (enable) - list_add_tail(&bp->hw.bp_list, &bp_task_head); + return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params); else - list_del(&bp->hw.bp_list); + return rhltable_remove(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params); } __weak int arch_reserve_bp_slot(struct perf_event *bp) @@ -307,9 +319,7 @@ static int __reserve_bp_slot(struct perf_event *bp, u64 bp_type) if (ret) return ret; - toggle_bp_slot(bp, true, type, weight); - - return 0; + return toggle_bp_slot(bp, true, type, weight); } int reserve_bp_slot(struct perf_event *bp) @@ -334,7 +344,7 @@ static void __release_bp_slot(struct perf_event *bp, u64 bp_type) type = find_slot_idx(bp_type); weight = hw_breakpoint_weight(bp); - toggle_bp_slot(bp, false, type, weight); + WARN_ON(toggle_bp_slot(bp, false, type, weight)); } void release_bp_slot(struct perf_event *bp) @@ -707,7 +717,7 @@ static struct pmu perf_breakpoint = { int __init init_hw_breakpoint(void) { int cpu, err_cpu; - int i; + int i, ret; for (i = 0; i < TYPE_MAX; i++) nr_slots[i] = hw_breakpoint_slots(i); @@ -718,18 +728,24 @@ int __init init_hw_breakpoint(void) info->tsk_pinned = kcalloc(nr_slots[i], sizeof(int), GFP_KERNEL); - if (!info->tsk_pinned) - goto err_alloc; + if (!info->tsk_pinned) { + ret = -ENOMEM; + goto err; + } } } + ret = rhltable_init(&task_bps_ht, &task_bps_ht_params); + if (ret) + goto err; + constraints_initialized = 1; perf_pmu_register(&perf_breakpoint, "breakpoint", PERF_TYPE_BREAKPOINT); return register_die_notifier(&hw_breakpoint_exceptions_nb); - err_alloc: +err: for_each_possible_cpu(err_cpu) { for (i = 0; i < TYPE_MAX; i++) kfree(get_bp_info(err_cpu, i)->tsk_pinned); @@ -737,7 +753,5 @@ int __init init_hw_breakpoint(void) break; } - return -ENOMEM; + return ret; } - -