From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Chen Yu, Tim Chen, K Prateek Nayak,
    "Gautham R. Shenoy", x86@kernel.org
Subject: [RFC PATCH] sched/fair: Introduce WAKEUP_BIAS_PREV_IDLE to reduce migrations
Date: Thu, 12 Oct 2023 16:36:26 -0400
Message-Id: <20231012203626.1298944-1-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature to reduce the
task migration rate.
For scenarios where the system is under-utilized (CPUs are partly
idle), eliminate frequent task migrations from almost idle CPUs to
completely idle CPUs by introducing a bias towards the previous CPU
in select_idle_sibling() if that CPU is idle or almost idle. Use 1% of
the previously used CPU's capacity as the "almost idle" CPU utilization
cutoff.

For scenarios where the system is fully or over-utilized (CPUs are
almost never idle), favor the previous CPU (rather than the target CPU)
when all CPUs are busy, to minimize migrations. (Suggested by Chen Yu.)

The following benchmarks are performed on a v6.5.5 kernel with
mitigations=off.

This speeds up the following hackbench workload on a 192-core AMD EPYC
9654 96-Core Processor machine (2 sockets):

  hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 49s to 31s (37% speedup).

The number of migrations is reduced significantly (-90%) with this
patch, which may explain the speedup:

  Baseline:     118M cpu-migrations   (9.286 K/sec)
  With patch:     5M cpu-migrations   (0.580 K/sec)

As a consequence, the stalled-cycles-backend ratio is reduced:

  Baseline:     8.16% backend cycles idle
  With patch:   6.85% backend cycles idle

Interestingly, the context switch rate increases with the patch, but
this does not appear to be an issue performance-wise:

  Baseline:     454M context-switches (35.677 K/sec)
  With patch:   670M context-switches (70.805 K/sec)

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000x cpu_relax() within do_idle() had similar
benefits. This patch achieves a comparable effect without the
busy-waiting by allowing select_task_rq to favor an almost idle
previously used CPU based on that CPU's utilization. The 1% cpu_util
threshold for an almost idle CPU has been identified empirically using
the hackbench workload.

Feedback is welcome. I am especially interested to learn whether this
patch has positive or detrimental effects on the performance of other
workloads.
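For reference, the numbers above are in the format printed by perf
stat. Assuming perf and hackbench are available, an invocation along
the following lines (illustrative; not necessarily the exact event list
used for the measurements above) produces this kind of output:

  perf stat -e cpu-migrations,context-switches,stalled-cycles-backend -- \
          hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

The bias can also be disabled and re-enabled at runtime through the
usual scheduler features debugfs interface, which makes before/after
comparisons possible on a single kernel build (assuming a
CONFIG_SCHED_DEBUG kernel with debugfs mounted at /sys/kernel/debug):

  echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features   # disable
  echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features      # re-enable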
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Chen Yu
Cc: Tim Chen
Cc: K Prateek Nayak
Cc: Gautham R. Shenoy
Cc: x86@kernel.org
---
 kernel/sched/fair.c     | 45 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  6 ++++++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d9c2482c5a3..70bffe3b6bd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7113,6 +7113,23 @@ static inline bool asym_fits_cpu(unsigned long util,
 	return true;
 }
 
+static unsigned long cpu_util_without(int cpu, struct task_struct *p);
+
+/*
+ * A runqueue is considered almost idle if:
+ *
+ *   cpu_util_without(cpu, p) / 1024 <= 1% * capacity_of(cpu)
+ *
+ * This inequality is transformed as follows to minimize arithmetic:
+ *
+ *   cpu_util_without(cpu, p) <= 10 * capacity_of(cpu)
+ */
+static bool
+almost_idle_cpu(int cpu, struct task_struct *p)
+{
+	return cpu_util_without(cpu, p) <= 10 * capacity_of(cpu);
+}
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -7139,18 +7156,33 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
+	/*
+	 * With the WAKEUP_BIAS_PREV_IDLE feature, if the previous CPU
+	 * is cache affine and almost idle, prefer the previous CPU to
+	 * the target CPU to inhibit costly task migration.
+	 */
+	if (sched_feat(WAKEUP_BIAS_PREV_IDLE) &&
+	    (prev == target || cpus_share_cache(prev, target)) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev) || almost_idle_cpu(prev, p)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, prev))
+		return prev;
+
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
 	/*
-	 * If the previous CPU is cache affine and idle, don't be stupid:
+	 * Without the WAKEUP_BIAS_PREV_IDLE feature, use the previous
+	 * CPU if it is cache affine and idle if the target cpu is not
+	 * idle.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE) &&
+	    prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
+
 
 	/*
 	 * Allow a per-cpu kthread to stack with the wakee if the
 	 * kworker thread and the tasks previous CPUs are the same.
@@ -7217,6 +7249,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * With the WAKEUP_BIAS_PREV_IDLE feature, if the previous CPU
+	 * is cache affine, prefer the previous CPU when all CPUs are
+	 * busy to inhibit migration.
+	 */
+	if (sched_feat(WAKEUP_BIAS_PREV_IDLE) &&
+	    prev != target && cpus_share_cache(prev, target))
+		return prev;
+
 	return target;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..1ba67d177fe0 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -37,6 +37,12 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
+/*
+ * Bias runqueue selection towards the previous runqueue if it is almost
+ * idle or if all CPUs are busy.
+ */
+SCHED_FEAT(WAKEUP_BIAS_PREV_IDLE, true)
+
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
 SCHED_FEAT(DOUBLE_TICK, false)
-- 
2.39.2