From: Hongyan Xia
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
    pierre.gondois@arm.com, linux-kernel@vger.kernel.org, Hongyan Xia
Subject: [RFC PATCH v3 0/6] Uclamp sum aggregation
Date: Tue, 7 May 2024 13:50:23 +0100

The current uclamp implementation, max aggregation, has several
drawbacks. This series gives an alternative implementation that
addresses the problems and shows other advantages, mostly:

1. Simplicity. Sum aggregation implements uclamp with less than half
   the code of max aggregation.
2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
   either in benchmark scores or, more importantly, in energy
   efficiency.
3. Works on its own. No changes in cpufreq or other sub-systems are
   needed.
4. Low overhead. No bucket operations and no need to tweak the number
   of buckets to balance between overhead and uclamp granularity.

The key idea of sum aggregation is fairly simple. Each task has a
util_avg_bias, which is obtained by:

    util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;

If a CPU has N tasks, p1, p2, p3... pN, then we sum the biases up and
obtain a rq total bias:

    rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN;

Then we use the biased rq utilization rq_util + rq_bias to select OPP
and to schedule tasks.

PATCH BREAKDOWN:

Patch 1/6 reverts a patch that accommodates uclamp_max tasks under max
aggregation. This patch is not needed and creates other problems for
sum aggregation. It has been discussed elsewhere that this patch will
be improved, so the revert may not be needed in the future.

Patches 2 and 3 implement sum aggregation.

Patches 4 and 5 remove max aggregation.

Patch 6 applies PELT decay on negative util_avg_bias. This improves
energy efficiency and task placement, but is not strictly necessary.

TESTING:

Two notebooks are shared at

https://nbviewer.org/github/honxia02/notebooks/blob/bb97afd74f49d4b8add8b28ad4378ea337c695a8/whitebox/max.ipynb
https://nbviewer.org/github/honxia02/notebooks/blob/bb97afd74f49d4b8add8b28ad4378ea337c695a8/whitebox/sum-offset.ipynb

The experiments in the notebooks were run on an Arm Juno r2 board.
CPU0-3 are little cores with a capacity of 383. CPU4-5 are big cores.
The rt-app profiles used for these experiments are included in the
notebooks.

Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.

The scheduling decisions are plotted in Out[11].
Both max and sum aggregation understand the UCLAMP_MAX hint and
schedule all 4 tasks on the little cluster. Max aggregation sometimes
schedules 2 tasks on 1 CPU, and this is the reason why sum aggregation
reverts the 1st commit. However, the reverted patch may be improved
and this revert may not be needed in the future.

Scenario 2: Scheduling 2 tasks with UCLAMP_MIN and UCLAMP_MAX at a
value slightly above the capacity of the little CPU.

Results are in Out[17]. The purpose is to use UCLAMP_MIN to place
tasks on the big core. Both max and sum aggregation handle this
correctly.

Scenario 3: Task A is a task with a small utilization pinned to CPU4.
Task B is an always-running task pinned to CPU5, but with UCLAMP_MAX
capped at 300. After a while, task A is re-pinned to CPU5, joining B.

Results are in Out[23]. Max aggregation sees a frequency spike at
239.75s. When zoomed in, one can see square-wave-like utilization
values because of A periodically going to sleep. When A wakes up, its
default UCLAMP_MAX of 1024 will uncap B and reach the highest CPU
frequency. When A sleeps, B's UCLAMP_MAX will be in effect and will
reduce rq utilization. This happens repeatedly, hence the square wave.
In contrast, sum aggregation sees a normal increase in utilization
when A joins B, without any square-wave behavior.

Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to
the little PD (CPU0-3). 4 identical tasks pinned to the big PD
(CPU4-5). After a while, remove the CPU pinning of the 4 tasks on the
big PD.

Results are in Out[29]. After unpinning, max aggregation moves all 8
tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each
on CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
after unpinning, which is the desired balanced task placement. As in
Scenario 1, the situation may not be as bad once the improvement of
the reverted patch comes out in the future.

Scenario 5: Scheduling 8 tasks with UCLAMP_MAX of 110.

Results are in Out[35] and Out[36].
Sum aggregation clearly yields substantially better scheduling
decisions here. This tests roughly the same thing as Scenario 4.

EVALUATION:

We backport the patches to kernel v6.1 on Pixel 6 and run Android
benchmarks.

Speedometer:

We run Speedometer 2.0 to test ADPF/uclamp effectiveness. Because sum
aggregation does not circumvent the 20% OPP margin, we reduce uclamp
values to 80% to be fair.

--------------------------------------------------------------------
| config    | score | % vs max | CPU power (mW) | % vs max |
| max       | 161.4 |          |         2358.9 |          |
| sum_0.8   | 166.0 |    +2.85 |         2485.0 |    +5.35 |
| sum_tuned | 162.6 |    +0.74 |         2332.0 |    -1.14 |
--------------------------------------------------------------------

We see a consistently higher score and higher average power
consumption. Note that a higher score also means a reduction in
run-time, so the total energy increase for sum_0.8 is only 1.88%.

We then reduce uclamp values so that the Speedometer score is roughly
the same. If we do so, then sum aggregation actually gives lower
average power and total energy consumption than max aggregation.

UIBench:

----------------------------------------------------------------------
| config    | jank percentage | % vs max | CPU power (mW) | % vs max |
| max       |          0.375% |          |         122.75 |          |
| sum_0.8   |          0.440% |   +17.33 |         116.35 |    -5.21 |
| sum_tuned |          0.220% |   -41.33 |         119.35 |    -2.77 |
----------------------------------------------------------------------

UIBench on Pixel 6 by default already has a low enough jank
percentage. Moving to sum aggregation gives a higher jank percentage
and lower power consumption. We then tune the hardcoded uclamp values
in the Android image to take advantage of the power budget, and can
achieve more than 41% jank reduction while still operating with less
power consumption than max aggregation.
This result does not suggest that sum aggregation greatly outperforms
max aggregation, because the jank percentage is already very low.
Instead, it suggests that hardcoded uclamp values in the system (like
in init scripts) need to be changed to perform well under sum
aggregation. If tuned well, sum aggregation generally shows better
effectiveness, or the same effectiveness but with less power
consumption.

---
Changed in v3:
- Addresses the biggest concern from multiple people, that PELT and
  uclamp need to be separate. The new design is significantly simpler
  than the previous revision and separates util_avg_uclamp into the
  original util_avg (which this series doesn't touch at all) and the
  util_avg_bias component.
- Keep the tri-state return value of util_fits_cpu().
- Keep both the unclamped and clamped util_est, so that we use the
  right one depending on the caller in frequency or energy
  calculations.

Hongyan Xia (6):
  Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
  sched/uclamp: Track a new util_avg_bias signal
  sched/fair: Use util biases for utilization and frequency
  sched/uclamp: Remove all uclamp bucket logic
  sched/uclamp: Simplify uclamp_eff_value()
  Propagate negative bias

 include/linux/sched.h            |   8 +-
 init/Kconfig                     |  32 ---
 kernel/sched/core.c              | 321 ++----------------------
 kernel/sched/cpufreq_schedutil.c |  12 +-
 kernel/sched/debug.c             |   2 +-
 kernel/sched/fair.c              | 411 ++++++++++++++++---------------
 kernel/sched/pelt.c              |  39 +++
 kernel/sched/rt.c                |   4 -
 kernel/sched/sched.h             | 129 +++------
 9 files changed, 319 insertions(+), 639 deletions(-)

-- 
2.34.1
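
Illustration (not part of the series): the key idea stated in the
cover letter, util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max)
- util_avg, summed over the tasks on a rq, can be sketched in plain
user-space C as below. This is only a rough sketch under stated
assumptions. The struct task_util type and all function names are
hypothetical and do not correspond to kernel structures or to the code
in the patches. For simplicity rq_util is taken to be the plain sum of
the tasks' util_avg, which is an assumption of this example and not a
claim about how the kernel tracks it.

```c
#include <stddef.h>

/* Hypothetical per-task view; field names follow the cover letter,
 * not the kernel's task_struct. */
struct task_util {
	long util_avg;   /* unclamped utilization of the task */
	long uclamp_min; /* UCLAMP_MIN request */
	long uclamp_max; /* UCLAMP_MAX request */
};

/* Plain clamp helper for this sketch. */
static long clamp_long(long v, long lo, long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg */
static long task_util_avg_bias(const struct task_util *p)
{
	return clamp_long(p->util_avg, p->uclamp_min, p->uclamp_max) -
	       p->util_avg;
}

/* rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN */
static long rq_util_avg_bias(const struct task_util *tasks, size_t n)
{
	long bias = 0;

	for (size_t i = 0; i < n; i++)
		bias += task_util_avg_bias(&tasks[i]);
	return bias;
}

/* Biased rq utilization: rq_util + rq_bias.
 * Assumption: rq_util is the sum of the tasks' util_avg. */
static long rq_biased_util(const struct task_util *tasks, size_t n)
{
	long rq_util = 0;

	for (size_t i = 0; i < n; i++)
		rq_util += tasks[i].util_avg;
	return rq_util + rq_util_avg_bias(tasks, n);
}
```

With made-up numbers loosely modelled on Scenario 3, a small task A
(util_avg 50, default clamps 0..1024) has a bias of 0 and a busy task
B (util_avg 800, UCLAMP_MAX 300) has a bias of -500, so the biased rq
utilization is 850 - 500 = 350. A's default UCLAMP_MAX of 1024 does
not uncap B in this scheme, because each task's bias depends only on
its own clamp values.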