From: Morten Rasmussen
To: peterz@infradead.org, mingo@redhat.com
Cc: vincent.guittot@linaro.org, Dietmar Eggemann, yuyang.du@intel.com,
    preeti@linux.vnet.ibm.com, mturquette@linaro.org, rjw@rjwysocki.net,
    Juri Lelli, sgurrappadi@nvidia.com, pang.xunlei@zte.com.cn,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    morten.rasmussen@arm.com
Subject: [RFCv4 PATCH 00/34] sched: Energy cost model for energy-aware scheduling
Date: Tue, 12 May 2015 20:38:35 +0100
Message-Id: <1431459549-18343-1-git-send-email-morten.rasmussen@arm.com>

Several techniques for saving energy through various scheduler
modifications have been proposed in the past; however, most of them have
not been universally beneficial for all use-cases and platforms. For
example, consolidating tasks on fewer cpus is an effective way to save
energy on some platforms, while it might make things worse on others.

This proposal, which is inspired by the Ksummit workshop discussions in
2013 [1], takes a different approach by using a (relatively) simple
platform energy cost model to guide scheduling decisions. Given
platform-specific cost data, the model can estimate the energy
implications of scheduling decisions. So instead of blindly applying
scheduling techniques that may or may not work for the current use-case,
the scheduler can make informed energy-aware decisions. We believe this
approach provides a methodology that can be adapted to any platform,
including heterogeneous systems such as ARM big.LITTLE. The model
considers cpus only, i.e. no peripherals, GPU or memory. Model data
includes power consumption at each P-state and C-state.

This is an RFC and there are some loose ends that have not been addressed
here or in the code yet. The model and its infrastructure are in place in
the scheduler and are being used for load-balancing decisions. The energy
model data is hardcoded and there are some limitations still to be
addressed. However, the main idea is presented here: the use of an energy
model for scheduling decisions.

RFCv4 is a consolidation of the latest energy model related patches and
the patches adding scale-invariance to the CFS per-entity load-tracking
(PELT), as well as fixes for a few issues that have emerged as we use
PELT more extensively for load-balancing. The patches are based on
tip/sched/core.

Many of the changes since RFCv3 address issues pointed out during the
review of v3 by Peter, Sai, and Xunlei. However, there are still a few
issues that need fixing.

Energy-aware scheduling now strictly follows the 'tipping point' policy
(with one minor exception). That is, when the system is deemed
over-utilized (above the 'tipping point'), all balancing decisions are
made the normal way, based on priority-scaled load and spreading of
tasks. When below the tipping point, energy-aware scheduling decisions
are active. The rationale is that below the tipping point we can safely
shuffle tasks around without harming throughput.
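To illustrate the policy, here is a rough, self-contained sketch of a
system-wide over-utilization check. It is illustrative only and not the
actual patch code; the ~80% margin, the struct layout and all helper
names are assumptions made for this example.

	/*
	 * Illustrative sketch only - not the actual patch code.
	 * A cpu is considered over-utilized when its utilization exceeds
	 * a fixed margin of its capacity; the system is above the tipping
	 * point if any cpu is over-utilized.
	 */
	#define SCHED_CAPACITY_SCALE	1024
	#define CAPACITY_MARGIN		1280	/* util * 1280 > cap * 1024
						   <=> util > ~80% of cap */

	struct example_cpu {
		unsigned long util;	/* current utilization (0..capacity) */
		unsigned long capacity;	/* capacity at the highest frequency */
	};

	static int cpu_overutilized(const struct example_cpu *cpu)
	{
		return cpu->util * CAPACITY_MARGIN >
		       cpu->capacity * SCHED_CAPACITY_SCALE;
	}

	/* Above the tipping point if any cpu is over-utilized. */
	static int system_overutilized(const struct example_cpu *cpus,
				       int nr_cpus)
	{
		int i;

		for (i = 0; i < nr_cpus; i++)
			if (cpu_overutilized(&cpus[i]))
				return 1;

		return 0;
	}

In the patches the indicator is consulted by both the load-balance and
the wake-up paths, as described in the change list below.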
The focus is more on putting tasks on the right cpus at wake-up and less
on periodic/idle/nohz_idle balancing, as the latter is less likely to get
a chance to balance tasks when below the tipping point, where tasks are
smaller and not always running/runnable. This has simplified the code a
bit.

The patch set now consists of two main parts, but also contains
independent fixes that will be reposted separately later. The capacity
rework [2] that was included in RFCv3 has been merged in v4.1-rc1, and
[3] has been reworked. The latter is the first part of this patch set.

Patch 01-12: sched: frequency and cpu invariant per-entity load-tracking
             and other load-tracking bits.
Patch 13-34: sched: Energy cost model and energy-aware scheduling
             features.

Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:

sysbench:   Single task running for 3 seconds.
rt-app [4]: mp3 playback use-case model
rt-app [4]: 5 ~[6,13,19,25,31,38,44,50]% periodic (2ms) tasks

Note: % is relative to the capacity of the fastest cpu at the highest
frequency, i.e. the busier ones do not fit on little cpus. A newer
version of rt-app was used which supports a better but slightly different
way of modelling the periodic tasks. Numbers are therefore _not_
comparable to the RFCv3 numbers.

Average numbers for 20 runs per test (ARM TC2).

                Energy
                Mainline        EAS             noEAS
sysbench        100             251*            227*
rt-app mp3      100              63             111
rt-app 6%       100              42             102
rt-app 13%      100              58             101
rt-app 19%      100              87             101
rt-app 25%      100              94             104
rt-app 31%      100              93             104
rt-app 38%      100             114             117
rt-app 44%      100             115             118
rt-app 50%      100             125             126

The higher-load rt-app runs show significant variation in the energy
numbers for mainline, as it schedules tasks randomly due to its lack of
proper compute capacity awareness - tasks may be scheduled on LITTLE cpus
despite being too big for them.

Early test results for ARM (64-bit) Juno (2xA57+4xA53) with cpufreq
enabled:

Average numbers for 20 runs per test (ARM Juno).

                Energy
                Mainline        EAS             noEAS
sysbench        100             219             196
rt-app mp3      100              82             120
rt-app 6%       100              65             108
rt-app 13%      100              75             102
rt-app 19%      100              86             104
rt-app 25%      100              84             105
rt-app 31%      100              87             111
rt-app 38%      100             136             132
rt-app 44%      100             141             141
rt-app 50%      100             146             142

* Sensitive to task placement on big.LITTLE. Mainline may put it on
  either cpu due to its lack of compute capacity awareness, while EAS
  consistently puts heavy tasks on big cpus. The EAS energy increase
  came with a 2.06x (TC2)/1.70x (Juno) _increase_ in performance
  (throughput) vs Mainline.

[1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013
    (search for 'cost')
[2] https://lkml.org/lkml/2015/1/15/136
[3] https://lkml.org/lkml/2014/12/2/328
[4] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen
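As background before the change list: the cost-model data mentioned at
the top (power consumption at each P-state and C-state) can be thought of
as small per-cpu-group tables provided by the platform. Below is a rough,
self-contained sketch of how such tables could be turned into a crude
energy estimate. It is illustrative only and not the actual patch code;
all structure layouts, field names and the simple busy/idle split are
assumptions made for this example.

	/* Illustrative sketch only - not the actual patch code. */
	struct cap_state {
		unsigned long cap;	/* compute capacity at this P-state */
		unsigned long power;	/* busy power at this P-state */
	};

	struct idle_state_cost {
		unsigned long power;	/* power in this C-state */
	};

	struct energy_data {
		int nr_cap_states;
		const struct cap_state *cap_states;	/* lowest to highest */
		int nr_idle_states;
		const struct idle_state_cost *idle_states;
	};

	/*
	 * Estimate energy (relative units) for running at utilization
	 * 'util' (0..1024) in the lowest P-state with sufficient capacity,
	 * spending the remaining time in idle-state 'idle_idx'.
	 */
	static unsigned long estimate_energy(const struct energy_data *ed,
					     unsigned long util, int idle_idx)
	{
		const struct cap_state *cs =
			&ed->cap_states[ed->nr_cap_states - 1];
		unsigned long busy, idle;
		int i;

		/* Pick the lowest P-state that still has enough capacity. */
		for (i = 0; i < ed->nr_cap_states; i++) {
			if (ed->cap_states[i].cap >= util) {
				cs = &ed->cap_states[i];
				break;
			}
		}

		/* Weight busy power by the busy fraction, idle power by
		 * the rest. */
		busy = cs->power * util / 1024;
		idle = ed->idle_states[idle_idx].power * (1024 - util) / 1024;

		return busy + idle;
	}

The model in the patches operates on sched_groups (see the energy data
structure and sched_group_energy patches below); the sketch only shows
the kind of data the platform has to supply and how it translates into an
energy figure.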
Changes:

RFCv4:

(0) Reordering of the whole patch-set:
    01-02: Frequency-invariant PELT
    03-08: CPU-invariant PELT
    09-10: Track blocked usage
    11-12: PELT fixes for forked and dying tasks
    13-18: Energy model data structures
    19-21: Energy model helper functions
    22-24: Energy calculation functions
    25-26: Tipping point and max cpu capacity
    27-29: Idle-state index for energy model
    30-34: Energy-aware scheduling

(1) Rework frequency and cpu invariance arch support.
    - Remove weak arch functions and replace them with #defines and
      cpufreq notifiers.

(2) Changed PELT initialization and immediate removal of dead tasks from
    PELT rq signals.

(3) Scheduler energy data setup.
    - Clean-up of allocation and initialization of energy data
      structures.

(4) Fix issue in sched_group_energy() not using the correct capacity
    index.

(5) Rework energy-aware load balancing code.
    - Introduce a system-wide over-utilization indicator/tipping point.
    - Restrict periodic/idle/nohz_idle load balance to the detection of
      over-utilization scenarios.
    - Use conventional load-balance path when above the tipping point and
      bail out when below.
    - Made energy-aware wake-up conditional on the tipping point (only
      when below) and added capacity awareness to wake-ups when above.

RFCv3: https://lkml.org/lkml/2015/2/4/537

Dietmar Eggemann (12):
  sched: Make load tracking frequency scale-invariant
  arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
  sched: Make usage tracking cpu scale-invariant
  arm: Cpu invariant scheduler load-tracking support
  sched: Get rid of scaling usage by cpu_capacity_orig
  sched: Introduce energy data structures
  sched: Allocate and initialize energy data structures
  arm: topology: Define TC2 energy and provide it to the scheduler
  sched: Store system-wide maximum cpu capacity in root domain
  sched: Determine the current sched_group idle-state
  sched: Consider a not over-utilized energy-aware system as balanced
  sched: Enable idle balance to pull single task towards cpu with higher
    capacity

Morten Rasmussen (22):
  arm: Frequency invariant scheduler load-tracking support
  sched: Convert arch_scale_cpu_capacity() from weak function to #define
  arm: Update arch_scale_cpu_capacity() to reflect change to define
  sched: Track blocked utilization contributions
  sched: Include blocked utilization in usage tracking
  sched: Remove blocked load and utilization contributions of dying tasks
  sched: Initialize CFS task load and usage before placing task on rq
  sched: Documentation for scheduler energy cost model
  sched: Make energy awareness a sched feature
  sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
  sched: Compute cpu capacity available at current frequency
  sched: Relocated get_cpu_usage() and change return type
  sched: Highest energy aware balancing sched_domain level pointer
  sched: Calculate energy consumption of sched_group
  sched: Extend sched_group_energy to test load-balancing decisions
  sched: Estimate energy impact of scheduling decisions
  sched: Add over-utilization/tipping point indicator
  sched, cpuidle: Track cpuidle state index in the scheduler
  sched: Count number of shallower idle-states in struct
    sched_group_energy
  sched: Add cpu capacity awareness to wakeup balancing
  sched: Energy-aware wake-up task placement
  sched: Disable energy-unfriendly nohz kicks

 Documentation/scheduler/sched-energy.txt   | 363 +++++++++++++++++
 arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts |   5 +
 arch/arm/include/asm/topology.h            |  11 +
 arch/arm/kernel/smp.c                      |  56 ++-
 arch/arm/kernel/topology.c                 | 204 +++++++---
 include/linux/sched.h                      |  22 +
 kernel/sched/core.c                        | 139 ++++++-
 kernel/sched/fair.c                        | 634 +++++++++++++++++++++++++----
 kernel/sched/features.h                    |  11 +-
 kernel/sched/idle.c                        |   2 +
 kernel/sched/sched.h                       |  81 +++-
 11 files changed, 1391 insertions(+), 137 deletions(-)
 create mode 100644 Documentation/scheduler/sched-energy.txt

--
1.9.1