From: Patrick Bellasi <patrick.bellasi@arm.com>
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Steve Muckle <steve.muckle@linaro.org>, Leo Yan <leo.yan@linaro.org>,
        Viresh Kumar <viresh.kumar@linaro.org>,
        "Rafael J . Wysocki" <rjw@rjwysocki.net>, Todd Kjos <tkjos@google.com>,
        Srinath Sridharan <srinathsr@google.com>,
        Andres Oportus <andresoportus@google.com>,
        Juri Lelli <juri.lelli@arm.com>,
        Morten Rasmussen <morten.rasmussen@arm.com>,
        Dietmar Eggemann <dietmar.eggemann@arm.com>,
        Chris Redpath <chris.redpath@arm.com>,
        Robin Randhawa <robin.randhawa@arm.com>,
        Patrick Bellasi <patrick.bellasi@arm.com>
Subject: [RFC v2 0/8] SchedTune: central, scheduler-driven, power-performance control
Date: Thu, 27 Oct 2016 18:41:00 +0100
Message-Id: <20161027174108.31139-1-patrick.bellasi@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 9110
Lines: 205


This RFC is an update to the initial SchedTune proposal [1] for a central
scheduler-driven power-performance control.
The posting is being made ahead of the LPC to facilitate discussions there.

The initial proposal was refined, eventually merged into the AOSP, and it
currently finds good use in production mobile devices [*]. This series is a
scaled down version of the complete solution that aims to restart discussions.

The focus is on a suitable user-space <-> kernel space interface for tuning the
scheduler’s behavior at run-time. Specifically, the intention is to highlight
how the proposed interface can be used by the scheduler to bias the selection
of the CPU's operating frequency depending on information injected from
userspace.


Patch Set Organization
======================

The concept of a simple power-performance tunable that is wholly
scheduler-centric is implemented by patches [01-04].
This is where we introduce a ‘global task boosting’ knob which is integrated
with schedutil to allow the scheduler to bias OPP selection. These first 4
patches allow schedutil to be tuned dynamically, up to the point where it
behaves like the existing ‘performance’ governor.
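
To make the boost-to-OPP path concrete, here is a minimal sketch; it is not the
code in this series. It assumes that a boost value is converted into a margin
proportional to the spare capacity above the current utilization, and that
schedutil requests a frequency of roughly 1.25 * max_freq * util / max. All
function names are hypothetical.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024	/* full-scale utilization */

/*
 * ASSUMPTION: the margin is a boost_pct share of the headroom between the
 * current utilization and full capacity. Hypothetical helper.
 */
unsigned long boost_margin(unsigned long util, unsigned int boost_pct)
{
	assert(util <= SCHED_CAPACITY_SCALE && boost_pct <= 100);
	return (SCHED_CAPACITY_SCALE - util) * boost_pct / 100;
}

/* Utilization as seen by the frequency selection once boosted. */
unsigned long boosted_util(unsigned long util, unsigned int boost_pct)
{
	return util + boost_margin(util, boost_pct);
}

/*
 * ASSUMPTION: schedutil-style mapping with 25% headroom, clamped to the
 * maximum frequency of the CPU.
 */
unsigned long next_freq_khz(unsigned long max_freq_khz, unsigned long util)
{
	unsigned long freq;

	freq = (max_freq_khz + (max_freq_khz >> 2)) * util / SCHED_CAPACITY_SCALE;
	return freq > max_freq_khz ? max_freq_khz : freq;
}
```

Under these assumptions a 100% boost always yields a boosted utilization of
1024, so the maximum frequency is always requested, which matches the
‘performance’ governor behavior described above.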

Patches [05-07] extend the basic mechanism to use different boost values for
different tasks. This allows informed runtimes (e.g. Android and ChromeOS) to
feed the scheduler with information related to their knowledge about the
specific demand of different tasks and/or use-cases.
Thanks to SchedTune’s defined interface, the scheduler is now able to collect
simple yet powerful information about tasks: how much the user cares about
their performance.
Although it can be argued that something similar is already provided by the
existing concept of task priority, we believe that the proposed interface is
much more generic and can be further extended to support both OPP selection and
task placement, thus leading in the future to a more comprehensive energy-aware
scheduler-driven solution.
These patches enable schedutil to service interactive workloads like touch
screen interaction. Thus far, only out-of-tree cpufreq governors like the
Interactive governor were able to service such use cases.
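
As an illustration of how per-task boost values could be aggregated into one
per-CPU value, the sketch below assumes that tasks are grouped into boost
groups and that a CPU is boosted by the highest value among the groups that
currently have RUNNABLE tasks on it. The structure and function names are
hypothetical and do not reflect the actual data structures in the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU view of one boost group. */
struct boost_group {
	int boost_pct;		/* boost value assigned from user space */
	unsigned int runnable;	/* RUNNABLE tasks of this group on this CPU */
};

/*
 * ASSUMPTION: the CPU boost is the maximum boost among groups that have at
 * least one RUNNABLE task on the CPU; an idle CPU is not boosted.
 */
int cpu_boost(const struct boost_group *groups, size_t n)
{
	int max_boost = 0;
	int found = 0;

	assert(groups != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (groups[i].runnable == 0)
			continue;	/* group is not active on this CPU */
		if (!found || groups[i].boost_pct > max_boost) {
			max_boost = groups[i].boost_pct;
			found = 1;
		}
	}
	return max_boost;
}
```

With this policy a single boosted interactive task is enough to raise the
frequency of the CPU it runs on, and the boost is dropped as soon as that task
is no longer RUNNABLE there.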

The last patch in the series introduces the concept of ‘negative boosting’.
Negative boosting is beneficial for mobile devices in scenarios where it is
desired to intentionally reduce the performance of a task by running it at a
lower frequency than the one selected by schedutil.
For certain tasks, like compute-intensive background operations or
memory-bound tasks, negative boosting can have measurable energy-saving
benefits.
In these cases, a negative SchedTune value biases schedutil towards the
selection of a lower OPP. Importantly, this can be achieved using the same
SchedTune interface.
This patch allows schedutil to be tuned dynamically, up to the point where it
effectively replaces the ‘powersave’ governor.
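
A possible way to extend a boost margin to negative values is sketched below.
It assumes that a positive boost adds a share of the spare capacity, while a
negative boost removes a share of the utilization itself. This is only an
illustration under that assumption; the function names are hypothetical.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024	/* full-scale utilization */

/*
 * ASSUMPTION: positive boosts scale the headroom above the signal, negative
 * boosts scale the signal itself. The returned margin may be negative.
 */
long signed_boost_margin(long util, int boost_pct)
{
	assert(util >= 0 && util <= SCHED_CAPACITY_SCALE);
	assert(boost_pct >= -100 && boost_pct <= 100);

	if (boost_pct >= 0)
		return (SCHED_CAPACITY_SCALE - util) * boost_pct / 100;
	return util * boost_pct / 100;
}

/* Utilization reported to the frequency selection, within [0, 1024]. */
long boosted_util_signed(long util, int boost_pct)
{
	return util + signed_boost_margin(util, boost_pct);
}
```

Under this assumption a -100% boost reports a utilization of 0 regardless of
the real demand, so the lowest OPP is requested, similar to what the
‘powersave’ governor does.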

The patches are based on tip/sched/core:
   a225023 - sched/core: Explain sleep/wakeup in a better way

For testing purposes an integration branch, providing the required dependencies
as well as a set of debugging tracepoints, is available here:

   git://www.linux-arm.com/linux-pb eas/stune/rfcv2


Test results
============

Extensive testing of the proposed solution has already been done as SchedTune
is shipping on a production mobile device, with benefits observed for key
use-cases (e.g. improved responsiveness and performance of key workloads).

The following focused synthetic tests are used to show functional benefits and
report overheads. All these tests have been performed on a HiKey board, an
octa-core (ARM Cortex-A53 @1.2GHz) SMP platform, running a Debian image on a
mainline kernel and using schedutil configured with a 1ms rate limit value.


Performance boosting validation
-------------------------------

The functional validation of the boost mechanism has been performed considering
a ramp task generated using the rt-app provided by the LISA testing suite [2].

The ramp is configured as a 16ms periodic task which increases its utilization
by 5% every second, starting from 5% up to 60%. The task is pinned to run on a
single CPU and executed with different boost values:
  0%, 15%, 30%, 60% and -100%.
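
For reference, the ramp described above can be expanded into its individual
phases. The sketch below assumes that each step lasts exactly one second and
that utilization is the ratio between run time and period. The macro and
function names are hypothetical and are not rt-app configuration keys.

```c
#include <assert.h>

#define RAMP_PERIOD_US	16000	/* 16ms activation period */
#define RAMP_START_PCT	5	/* utilization of the first phase */
#define RAMP_END_PCT	60	/* utilization of the last phase */
#define RAMP_STEP_PCT	5	/* increment applied every second */

/* Number of one-second phases in the ramp. */
int ramp_phases(void)
{
	return (RAMP_END_PCT - RAMP_START_PCT) / RAMP_STEP_PCT + 1;
}

/* Target utilization, in percent, of a given phase (0-based). */
int ramp_util_pct(int phase)
{
	assert(phase >= 0 && phase < ramp_phases());
	return RAMP_START_PCT + phase * RAMP_STEP_PCT;
}

/* Run time per activation, in microseconds, for a given phase. */
int ramp_runtime_us(int phase)
{
	return RAMP_PERIOD_US * ramp_util_pct(phase) / 100;
}
```

This gives 12 phases, from 0.8ms of work per 16ms period in the first phase to
9.6ms in the last one.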

The following table reports:
 - the value used to boost the task in each experiment
 - the rt-app’s reported performance index:
      PerfIndex Avg (the higher the better)
    which expresses the average time left from completion of a task
    activation (i.e. a fixed amount of work) until its next activation
 - the CPU average frequency (FreqAvg)
 - the actual boost measured for the PerfIndex and FreqAvg

      Boost     PerfIndex        Actual   FreqAvg   Actual
  value [%]      Avg      Std     Boost     [MHz]    Boost
          0     0.53     0.12        0%       606       0%
         15     0.61     0.07       17%       658       9%
         30     0.68     0.07       26%       739      22%
         60     0.71     0.05       40%       852      41%
       -100   -98.84   120.00      -2K%       363     -36%

For positive boost values, SchedTune can improve the performance of a task
(i.e. its time to completion) by a quantity which is proportional to the boost
value. This is reported by the increasingly higher values of the PerfIndex Avg
as well as the average frequencies used to execute the task.

For negative boost values the performance is progressively reduced. In the
reported case of -100% boost, we verified that the system spends most of its
time at one of the lowest OPPs (thus providing a behavior similar to the
powersave governor) while still running at higher OPPs when other (not
negatively boosted) tasks need to run. That’s why the reported average
frequency (363MHz) is slightly higher than the minimum OPP (208MHz).

A graphical representation of the task’s behavior at different boost values
and the corresponding CPU frequencies is available here:
   https://gist.github.com/derkling/8be0a8ac365c935b3df585cb24afec6c


Impact on scheduler performance
-------------------------------

Performance impact has been evaluated using the hackbench test provided by perf
with this command line:

     perf bench sched messaging --thread --group 25 --loop 1000

Reported completion times (CTime) in seconds are averages over 10 runs:


                    |           |     SchedTune (per-task) boost value |
                    | Schedutil |         0% |        10% |        90% |
  ------------------+-----------+------------+------------+------------+
  CTime         [s] |     12.93 |      13.08 |      13.32 |      13.27 |
   vs Schedutil [%] |           |       1.1% |       3.0% |       2.7% |


SchedTune currently introduces overheads when used on saturated systems, such
as the one produced by running the hackbench test. This is possibly due to the
locking scheme currently in use, which can be further optimized.

On the other hand, the SchedTune extension is mainly useful for lightly loaded
systems (mobile devices, laptops, etc.) where the additional overhead has been
verified to be compensated by the performance benefits due to (for example) a
faster task completion. Some of these benefits are reported in the previous
section.


ChangeLog
=========

Changes since v1:
 - Rebase on tip/sched/core:
      a225023 - sched/core: Explain sleep/wakeup in a better way
 - Integrated with schedutil (in replacement of SchedFreq)
 - Improved task accounting for correct boostgroup activation
 - Added support for negative boosting
 - Extensively tested on production-grade devices


Credits
=======
[*] This work has been supported by an extensive collaborative effort between
    ARM, Linaro and Google, targeting production devices.


References
==========

[1] https://lkml.org/lkml/2015/8/19/419
[2] https://github.com/ARM-software/lisa


Patrick Bellasi (8):
  sched/tune: add detailed documentation
  sched/tune: add sysctl interface to define a boost value
  sched/fair: add function to convert boost value into "margin"
  sched/fair: add boosted CPU usage
  sched/tune: add initial support for CGroups based boosting
  sched/tune: compute and keep track of per CPU boost value
  sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value
  sched/{fair,tune}: add support for negative boosting

 Documentation/scheduler/sched-tune.txt | 426 +++++++++++++++++++++++++
 include/linux/cgroup_subsys.h          |   4 +
 include/linux/sched/sysctl.h           |  16 +
 init/Kconfig                           |  73 +++++
 kernel/exit.c                          |   5 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/cpufreq_schedutil.c       |   4 +-
 kernel/sched/fair.c                    | 119 +++++++
 kernel/sched/sched.h                   |   2 +
 kernel/sched/tune.c                    | 561 +++++++++++++++++++++++++++++++++
 kernel/sched/tune.h                    |  40 +++
 kernel/sysctl.c                        |  16 +
 12 files changed, 1265 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/scheduler/sched-tune.txt
 create mode 100644 kernel/sched/tune.c
 create mode 100644 kernel/sched/tune.h

-- 
2.10.1