From: Patrick Bellasi
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, "Rafael J. Wysocki",
    Paul Turner, Vincent Guittot, John Stultz, Morten Rasmussen,
    Dietmar Eggemann, Juri Lelli, Tim Murray, Todd Kjos,
    Andres Oportus, Joel Fernandes, Viresh Kumar
Subject: [RFCv4 0/6] Add utilization clamping to the CPU controller
Date: Thu, 24 Aug 2017 19:08:51 +0100
Message-Id: <20170824180857.32103-1-patrick.bellasi@arm.com>

Was:
 - RFCv3: Add capacity capping support to the CPU controller
 - RFCv2: SchedTune: central, scheduler-driven, power-performance control

This is a respin of the series implementing support for per-task boosting
and capping of CPU frequency. This new version addresses most of the
comments collected since the last posting on LKML [1] and from the
discussions at the OSPM Summit [2].

What follows is a short description of the main changes since the previous
posting [1].

.:: Concept: "capacity clamping" replaced by "utilization clamping"

The previous implementation was expressed in terms of "capacity clamping",
which generated some confusion, mainly because in mainline the capacity is
currently defined as a "constant property" of a CPU. The email in [3]
summarizes the confusion generated by the previous proposal.

As Peter pointed out, the goal of this proposal is to "affect" the util_avg
metric, i.e. the CPU utilization, and the way that signal is used, for
example by schedutil. Thus, both from a conceptual and an implementation
standpoint, it makes much more sense to talk about "utilization clamping".

In this new proposal, the two new attributes added to the CPU controller
allow defining the minimum and maximum utilization which should be
considered for the set of tasks in a group. These utilization clamp values
can be used, for example, to either "boost" or "cap" the actual frequency
selected by schedutil when one of these tasks is RUNNABLE on a CPU. A
proper aggregation mechanism is also provided to handle the cases where
tasks with different utilization clamp values are co-scheduled on the same
CPU.

.:: Implementation: rb-trees replaced by reference counting

The previous implementation used a couple of rb-trees to aggregate the
different clamp values of tasks co-scheduled on the same CPU. Although
simple from the coding standpoint, Peter pointed out that this solution
added non-negligible overheads to the fast path (i.e. task
enqueue/dequeue), especially on highly loaded systems.

This new implementation is based on a much more lightweight mechanism
using reference counting. The new solution just requires {in,de}crementing
an integer counter each time a task is {en,de}queued. The most expensive
operation is now a sequential scan of a small, per-CPU array of integers,
which is also defined to easily fit into a single cache line. A minimal
sketch of this mechanism is shown below.
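For illustration only, here is a minimal C sketch of the reference
counting idea; all names and sizes (uclamp_cpu, uclamp_cpu_{get,put},
UCLAMP_GROUPS) are invented for this example and do not necessarily match
the actual patches:

   /*
    * Illustrative sketch only: names and sizes are made up here and
    * do not necessarily match the patch series.
    */
   #define UCLAMP_GROUPS  5   /* small, sized to fit a single cache line */

   struct uclamp_cpu {
           int group_count[UCLAMP_GROUPS]; /* RUNNABLE tasks per clamp group */
           int group_value[UCLAMP_GROUPS]; /* clamp value of each group */
           int value;                      /* max clamp among active groups */
   };

   /* Fast path: task enqueue */
   static void uclamp_cpu_get(struct uclamp_cpu *uc, int group_id)
   {
           uc->group_count[group_id]++;
           /* A (possibly) higher clamp value just became active */
           if (uc->group_value[group_id] > uc->value)
                   uc->value = uc->group_value[group_id];
   }

   /* Fast path: task dequeue */
   static void uclamp_cpu_put(struct uclamp_cpu *uc, int group_id)
   {
           int value = 0, i;

           uc->group_count[group_id]--;
           /* The current max may have gone inactive: rescan the small array */
           for (i = 0; i < UCLAMP_GROUPS; i++)
                   if (uc->group_count[i] && uc->group_value[i] > value)
                           value = uc->group_value[i];
           uc->value = value;
   }

The point of the sketch is that the enqueue side is a counter increment
plus one compare, while the dequeue side is at worst a scan of a
cache-line-sized array, instead of an rb-tree update on both paths.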
Scheduler performance overheads have been measured using the performance
governor to run 20 iterations of:

   perf bench sched messaging --pipe --thread --group 2 --loop 5000

on a Juno R2 board (4xA53, 2xA72). With this new implementation we cannot
measure any noticeable impact compared to the same benchmark running on
tip/sched/core (as in 9c8783201). For the record, the previous
implementation showed ~1.5% overhead in the same test.

.:: Other comments: use-cases description

People had concerns about use-cases. In a previous posting [4] I
summarized the main use cases we are targeting with this proposal.
Further discussion went on at OSPM, outside of the official tracks, and I
got the feeling that people (at least Peter and Rafael) seem to recognize
the value of supporting both boosting and capping of CPU frequencies,
based on the currently active tasks.

The main use cases discussed were (refer to [4] for further details):

- boosting: better interactive response for small tasks which affect the
  user experience. Consider for example the case of a small control
  thread for an external accelerator (e.g. GPU, DSP, other devices). Here
  the scheduler does not have a complete view of the task's bandwidth
  requirements and, since it's a small task, schedutil will keep selecting
  a lower frequency, thus increasing the overall time required to complete
  its activations.

- capping: increased energy efficiency for background tasks not directly
  affecting the user experience. Since running at a lower frequency is in
  general more energy efficient, when completion time is not a main goal,
  capping the maximum frequency to be used by certain (possibly big) tasks
  can have positive effects on both power dissipation and energy
  consumption. Moreover, this support also allows making RT tasks more
  energy friendly on mobile systems, whenever running them at the maximum
  frequency is not strictly required.

.:: Other comments: usage of CGroups as a main interface

The current implementation is based on cgroups but does not strictly
depend on that API. We do not propose a different main interface simply
because, so far, all the use-cases we have at hand can take advantage of
a cgroups API (notably the Android run-time). Should a different API be
needed, the current implementation can easily be extended to hook its
internals to it. However, we believe it's not worth adding the maintenance
burden of an additional API until there is a real demand for it.

.:: Patches organization

The first three patches of this series introduce util_{min,max} tracking
in the core scheduler, as an extension of the CPU controller. The fourth
patch is dedicated to the synchronization between the cgroup interface
(slow-path) and the core scheduler (fast-path). The last two patches
integrate the utilization clamping support with schedutil, for FAIR tasks
as well as RT/DL tasks. A sketch of how schedutil could consume the clamp
values follows.
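As an illustration of the schedutil integration, here is a minimal sketch
of what the clamping step could look like; uclamp_value() is a
hypothetical helper returning the aggregated clamp value for a CPU, and
is not necessarily the interface used by the patches:

   #include <linux/kernel.h>   /* for the clamp() macro */

   enum uclamp_id { UCLAMP_MIN, UCLAMP_MAX };

   /* Hypothetical helper: aggregated clamp value for @cpu */
   unsigned long uclamp_value(int cpu, enum uclamp_id clamp_id);

   /*
    * Clamp the utilization seen by the frequency selection logic:
    * util_min "boosts" small tasks, util_max "caps" background ones.
    */
   static unsigned long uclamp_util(int cpu, unsigned long util)
   {
           unsigned long min_util = uclamp_value(cpu, UCLAMP_MIN);
           unsigned long max_util = uclamp_value(cpu, UCLAMP_MAX);

           return clamp(util, min_util, max_util);
   }

The net effect is that schedutil selects a frequency as if the CPU
utilization were never below util_min nor above util_max while the
clamped tasks are RUNNABLE.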
A detailed validation and analysis of the proposed features is available
in this notebook:

   https://gist.github.com/7f9170e613dea25fe248e14157e6cb23

Cheers,
Patrick

.:: References

[1] https://lkml.org/lkml/2017/2/28/355
[2] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
    video:  http://youtu.be/6MC1jbYbQTo
[3] https://lkml.org/lkml/2017/4/11/670
[4] https://lkml.org/lkml/2017/3/20/688

Patrick Bellasi (6):
  sched/core: add utilization clamping to CPU controller
  sched/core: map cpu's task groups to clamp groups
  sched/core: reference count active tasks's clamp groups
  sched/core: sync task_group's with CPU's clamp groups
  cpufreq: schedutil: add util clamp for FAIR tasks
  cpufreq: schedutil: add util clamp for RT/DL tasks

 include/linux/sched.h            |  12 +
 init/Kconfig                     |  36 ++
 kernel/sched/core.c              | 706 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |  49 ++-
 kernel/sched/sched.h             | 199 +++++++++++
 5 files changed, 998 insertions(+), 4 deletions(-)

--
2.14.1