Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S942515AbcJ1HLD (ORCPT ); Fri, 28 Oct 2016 03:11:03 -0400 Received: from smtp.codeaurora.org ([198.145.29.96]:44312 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751058AbcJ1HLA (ORCPT ); Fri, 28 Oct 2016 03:11:00 -0400 DMARC-Filter: OpenDMARC Filter v1.3.1 smtp.codeaurora.org 5730D6164F Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=pass smtp.mailfrom=markivx@codeaurora.org From: Vikram Mulukutla To: linux-kernel@vger.kernel.org Cc: Peter Zijlstra , Ingo Molnar , Srivatsa Vaddagiri , Steve Muckle , Olav Haugan , Syed Rameez Mustafa , Joonwoo Park , Pavankumar Kondeti , Saravana Kannan , Bryan Huntsman , Juri Lelli , Morten Rasmussen , Dietmar Eggemann , Chris Redpath , Robin Randhawa , Patrick Bellasi , Todd Kjos , Srinath Sridharan , Andres Oportus , Leo Yan , Vincent Guittot , Vikram Mulukutla Subject: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Date: Fri, 28 Oct 2016 00:10:39 -0700 Message-Id: <1477638642-17428-1-git-send-email-markivx@codeaurora.org> X-Mailer: git-send-email 1.9.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 19164 Lines: 298 We propose Window-Assisted Load Tracking (WALT) as an alternative or additional load tracking scheme in lieu of or along with PELT, one that in our estimation better tracks task demand and CPU utilization especially for use cases on mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide better perf-per-watt numbers on asymmetric CPU (frequency and/or IPC) implementations, (specifically on Qualcomm Snapgdragon chipsets running Android) and its metrics have been used to guide task placement and p-state selection (load balancing in CFS still uses PELT statistics). WALT is now present in the android-common kernel as well. This RFC is not an attempt to supplant PELT statistics; this is more about starting a discussion; perhaps one that may result in changes to PELT to address the requirements listed below. Background __________ 1) Classifying a task as "heavy" faster helps in mobile world use-cases such as scrolling a UI or browsing a web page where tasks exhibit sporadically heavy load, allowing for instance, faster migration to a more capable CPU (assuming the scheduler uses task util and CPU capacity as inputs during task placement decision making, e.g EAS). Reclassification of a light task as heavy is also important - for example, a rendering thread may change its demand depending on what is being shown on-screen. We believe that quickly associating a task with its necessary resources requires a more dynamic demand or utilization signal than provided by PELT, which is subject to the geometric series and does not differentiate between task ramp-up and ramp-down. E.g, with a 32ms halflife, PELT takes ~74ms to reach 80% nice-0-load_avg/util_avg for a continuously executing 100% task, and ~139ms to reach 95%. 2) Historical task demand (even if a task sleeps for say 300ms due to network delay) is required to restore both CPU and frequency resources to meet performance throughput as well. PELT with a 32ms 'half-life' takes just ~213ms to decay utilization/load (effectively) zero, forgetting that task's history, requiring the task to re-execute its workload to gain that load_avg/util_avg again. 3) There is a need to quickly ramp up CPU frequency as a result of heavy CPU utilization, and to reduce frequency whenever possible to save power. Consider as an example a single 100% thread running continuously on a single core using schedutil with a 10ms transition latency. On an i7 8-core SMP machine with SMT, PELT takes 60-70ms to ramp to FMAX from FMIN. 4) PELT's blocked util decay implies that the underlying cpufreq governor would have to delay dropping frequency for longer than necessary (especially for mobile world usecases). PELT with a 32ms 'half-life' takes ~108ms to reduce the runqueue utilization/nice-0-load from max to 10%. Blocked load/util tracking is of course a conservative approach recognizing that load might return to the runqueue, but one that we feel is inflexible when it comes to certain real world use cases. WALT vs PELT ____________ WALT retains the per-entity properties of PELT; demand and utilization are accounted for at the task level, and a CPU rq's utilization contribution is the sum of its children's contributions. Wallclock time, however, is divided into a series of windows - to be clear about what WALT tracks exactly, we introduce the terms task demand and CPU busy-time here and show the benefits versus PELT that apply to the four points mentioned above: 1) Task Demand - The contribution of a task's running and runnable time to a window, averaged over N sliding windows. Note that a window has to be *completed* before its contribution is considered. A task's demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows. This metric can be roughly thought of as an unweighted load_avg without blocked load. WALT can take as little as one window of execution to report max task demand, although with freq and capacity invariance, this tends to be about 3 windows with a good frequency governor (so that's about 30ms with a 10ms window). This allows faster and more accurate task placement decisions. 2) WALT "forgets" blocked time entirely, i.e. the sliding windows exist only when the task is on the runqueue or running. This allows rapid reclassification of the same task as heavy after a short sleep. Thus a task does not need to re-execute to gain its demand once more and can be migrated up to a big CPU immediately. 3) CPU busy time - The sum of execution times of all tasks in the most recently completed window. For the same machine described above and with schedutil, WALT with a 10ms window takes around 30-50ms to ramp to FMAX, and with a 20ms window takes between 60-100ms (note that it is assumed that schedutil has a transition latency equal to or greater than the window size). 4) WALT "forgets" cpu utilization as soon as tasks are taken off of the runqueue, and thus the cpufreq governor can choose to drop frequency after just one window, potentially saving power as well as allowing the governor to choose a hysteresis setting as necessary based on more accurate utilization data. RFC patch _________ We realize that WALT has been commercialized along with several other out-of-tree changes to the Linux scheduler, and evaluating the code on top of all those changes is not trivial. Therefore, this RFC is an attempt to provide a simple, limited implementation of WALT - only the CPU utilization/frequency guidance portion - that can be run on the X86 architecture as well. It provides the basic building blocks in terms of the windowed view of time and how WALT hooks into various scheduling points. The WALT CPU utilization metric is fed to the schedutil governor which then sets the appropriate p-state. The intent here is to show that we don't harm server or desktop workload performance while still providing potential power savings. The full WALT implementation can be browsed through at: https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/walt.c This RFC patch has been tested on live X86 machines with the following sanity and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick Bellasi for initial code reviews): (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics, with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2 baseline. WALT window size was 10ms. Only deltas above 3% are considered non-noise.Power measured with Intel RAPL counters) +--------------------+----------------+----------------+----------+------------+ | | Schedutil+WALT | Schedutil+PELT | Ondemand | Delta | +--------------------+----------------+----------------+----------+------------+ | Chrome power test | | | | WALT saves | | 1080p 60fps video | 922 Joules | 967 J | 1011 J | 4.6% | +--------------------+--------------- +----------------+----------+------------+ | Firefox power test | | | | WALT saves | | 1080p 60fps video | 1079 Joules | 1210 J | 1248 J | 10.82% | +--------------------+--------------- +----------------+----------+------------+ | perf sched bench | | | | | | messaging | | | | | | -g 50 -l 5000 | 16.94s | 16.9326s | 16.9318s | No diff | +--------------------+----------------+----------------+-------- -+------------+ | phoronix apache | 26177 | | | | | benchmark | requests/s | 26088 | 26170 | No diff | +--------------------+----------------+----------------+----------+------------+ | phoronix pgbench | 295 | | | | | database r/w test | transactions/s | 300 | 302 | No diff | +--------------------+----------------+----------------+----------+------------+ | phoronix | | | | | | uniqingine-valley | 49fps | 49fps | 49fps | No diff | +--------------------+----------------+----------------+----------+------------+ +-----------------+-------+--------+--------+-------------+-----------+--------+ | cpufreq-bench (from tools/power/cpupower/bench/) | | measures % of work done against the performance govneror | +-----------------+-------+--------+--------+-------------+-----------+--------+ | Schedutil + WALT| round | load | sleep | performance | schedutil | % | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 0 | 50000 | 50000 | 50255 | 74574 | 67.389 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 1 | 100000 | 100000 | 99931 | 125724 | 80.367 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 2 | 150000 | 150000 | 150496 | 175244 | 85.878 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 3 | 200000 | 200000 | 200772 | 224903 | 89.270 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 4 | 250000 | 250000 | 250716 | 275955 | 90.854 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | | | | | | | +-----------------+-------+--------+--------+-------------+-----------+--------+ | Schedutil + PELT| round | load | sleep | performance | schedutil | % | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 0 | 50000 | 50000 | 50055 | 65408 | 76.528 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 1 | 100000 | 100000 | 100494 | 131842 | 76.223 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 2 | 150000 | 150000 | 150423 | 182352 | 82.490 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 3 | 200000 | 200000 | 200605 | 234065 | 85.705 | +-----------------+-------+--------+--------+-------------+-----------+--------+ | | 4 | 250000 | 250000 | 251724 | 284186 | 88.577 | +-----------------+-------+--------+--------+-------------+-----------+--------+ ARM data with EAS (Energy Aware Scheduler) and sched-freq _________________________________________________________ The following is data collected on a Snapdragon 820 device. The WALT implementation in this kernel is used by the EAS scheduler coupled with the sched-freq governor and is a complete one with both task demand and cpu utilization tracking (as opposed to the limited patch in this RFC). Power is measured at battery. EAS uses cfs_rq->avg.util_avg and se->avg.util_avg when using PELT and rq->prev_runnable_sum and p->ravg.demand when using WALT metrics. The "Jankbench" and power data are real world usecases where WALT provides the most benefit, while the benchmarks are often used as a metric to compare OEM devices. WALT uses a 10ms window in all of the following data. +---------------------------------+---------+----------+----------------------+ | ARM data (ARMv8 based Snapdragon 820) | | Two clusters with 2 Kryo CPUs each, 1.8Ghz and 2.4GHz FMAX each. | | Running Android with EAS + WALT and EAS + PELT, with the sched-freq govneror| +---------------------------------+---------+----------+----------------------+ | | WALT | PELT | | +---------------------------------+---------+----------+----------------------+ | Geekbench 3 Single Core | No difference | +---------------------------------+---------+----------+----------------------+ | Geekbench 3 Multicore | No difference | +---------------------------------+---------+----------+----------------------+ | Geekbench 4 Single Core | No difference | +---------------------------------+---------+----------+----------------------+ | Geekbench 4 Multicore | No difference | +---------------------------------+---------+----------+----------------------+ | MT Linpack Native | 525 | 483 | WALT is 8.6% better | +---------------------------------+---------+----------+----------------------+ | MT Linpack Java | 682 | 710 | PELT is 4.1% better | +---------------------------------+---------+----------+----------------------+ | Threadbench | 200 | 103 | WALT is 94.1% better | +---------------------------------+---------+----------+----------------------+ | sysbench | 230 | 259 | PELT is 12% better | +---------------------------------+---------+----------+----------------------+ | Parsec | 287 | 220 | WALT is 30% better | +---------------------------------+---------+----------+----------------------+ | IPC (Android Binder) | 158 | 167 | PELT is 5.6% better | +---------------------------------+---------+----------+----------------------+ | | | | | +---------------------------------+---------+----------+----------------------+ | Jankbench - Xth frame percentile indicates X % of frames rendered in Y ms | | or less. Lower is better. | | This test scrolls text and graphics on a mobile screen in Android. | | (list view, imagelist view subtests) | +---------------------------------+---------+----------+----------------------+ | | WALT | PELT | | +---------------------------------+---------+----------+----------------------+ | 90th percentile | 6.32ms | 6.50ms | WALT is 3% better | +---------------------------------+---------+----------+----------------------+ | 95th percentile | 7.20ms | 7.50s | WALT is 4% better | +---------------------------------+---------+----------+----------------------+ | 99th percentile | 8.50ms | 8.53ms | No diff | +---------------------------------+---------+----------+----------------------+ | Power (avg, inst. current) | 115.1mA | 116.8mA | No difference | +---------------------------------+---------+----------+----------------------+ | | | | | +---------------------------------+---------+----------+----------------------+ | Jankbench - Xth frame percentile indicates X % of frames rendered in Y ms | | or less. Lower is better. | | This test scrolls text and graphics on a mobile screen in Android. | | (high bit-rate text render, edit text input, overdraw subtests) | +---------------------------------+---------+----------+----------------------+ | 90th percentile | 17.45ms | 18.23ms | WALT is 4.2% better | +---------------------------------+---------+----------+----------------------+ | 95th percentile | 19.20ms | 19.68ms | No diff | +---------------------------------+---------+----------+----------------------+ | 99th percentile | 38.4ms | 41.44ms | WALT is 7.33% better | +---------------------------------+---------+----------+----------------------+ | Power (avg, inst. current) | 378.0mA | 380.0mA | No difference | +---------------------------------+---------+----------+----------------------+ | | | | | +---------------------------------+---------+----------+----------------------+ | Video 1080p (big buck bunny) 60 seconds of playback, sampling rate 1000/s | +---------------------------------+---------+----------+----------------------+ | Avg Instantaneous Current | 103mA | 114mA | WALT is 9.6% better | +---------------------------------+---------+----------+----------------------+ | | | | | +---------------------------------+---------+----------+----------------------+ | Video 2160p (big buck bunny) 60 seconds of playback, sampling rate 1000/s | +---------------------------------+---------+----------+----------------------+ | Avg Instantaneous Current | 160mA | 164.5mA | WALT is 3% better | +---------------------------------+---------+----------+----------------------+ | | | | | +---------------------------------+---------+----------+----------------------+ This patch is based on v4.8.2 Srivatsa Vaddagiri (3): sched: Introduce structures necessary for WALT sched: Introduce Window-Assisted CPU utilization Tracking sched: Introduce WALT hooks into core and scheduling classes include/linux/sched.h | 35 ++++++ include/linux/sched/sysctl.h | 1 + init/Kconfig | 9 ++ kernel/sched/Makefile | 1 + kernel/sched/core.c | 29 ++++- kernel/sched/cputime.c | 10 +- kernel/sched/deadline.c | 7 ++ kernel/sched/debug.c | 10 ++ kernel/sched/fair.c | 11 +- kernel/sched/rt.c | 6 + kernel/sched/sched.h | 12 ++ kernel/sched/walt.c | 433 +++++++++++++++++++++++++++++++++++++++++++ kernel/sched/walt.h | 69 ++++++++++ kernel/sysctl.c | 11 ++ 14 files changed, 638 insertions(+), 6 deletions(-) Some references: Steve Muckle's 2014 presentation on Qualcomm's HMP scheduler (with WALT): http://www.slideshare.net/linaroorg/lcu14-406-a-quick-take-on-energyaware-scheduling https://www.youtube.com/watch?v=2xb0vOV-E6E 2016 presentation of Qualcomm data on WALT (was called WinLT then!): http://www.slideshare.net/linaroorg/bkk16208-eas https://www.youtube.com/watch?v=g4srUao5ahg Qualcomm data on WALT vs PELT CPU utilization on X86: https://www.youtube.com/watch?v=hc0aufpfWvY Steve Muckle's presentation including WALT+schedutil data on a HiKey Board: http://www.slideshare.net/linaroorg/las16307-benchmarking-schedutil-in-android https://www.youtube.com/watch?v=Wxu5ApDRYH4 The WALT wiki page https://wiki.codeaurora.org/xwiki/bin/QKERNEL/Window+Assisted+Load+Tracking/ -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project