2015-08-19 18:47:43

by Patrick Bellasi

Subject: [RFC PATCH 00/14] sched: Central, scheduler-driven, power-performance control

The topic of a single simple power-performance tunable, that is wholly
scheduler centric with defined and predictable properties, has come up on
several occasions in the past [1,2]. With techniques such as scheduler driven
DVFS [4], we now have a good framework for implementing such a tunable.

This RFC introduces the foundation to add a single, central tunable 'knob' to
the scheduler. The main goal of this RFC is to present an initial proposal for
a possible solution and to trigger a discussion on how the ideas here
may be extended for integration with Energy Aware Scheduling [5].

Patch set organization
======================

The following patches implement the tunable knob stacked on top of sched-DVFS.
The knob extends the functionality provided by sched-DVFS to support task
performance boosting. The knob is expressed as a simple user-space facing
interface that allows the tuning of system wide scheduler behaviors ranging
from energy efficiency at one end through to full performance boosting at the
other end.

The tunable can be used globally such that it affects all tasks. It can also be
used for a select set of tasks via a new CGroup controller.

The content of this RFC consists of three main parts:

Patches 01-07: sched:
Juri's patches on sched-DVFS, which have been updated to
address review comments from the last EAS RFCv5 posting.

Patches 08-11: sched, schedtune:
A new proposal for "global task boosting"

Patches 12-14: sched, schedtune:
An extension, based on CGroups, to support per-task boosting

These patches are based on tip/sched/core and depend on:

1. patches to "compute capacity invariant load/utilization tracking"
recently posted by Morten Rasmussen [3]

2. patches for "scheduler-driven cpu frequency selection"
which add the new sched-DVFS CPUFreq governor
initially posted by Mike Turquette [4]
and recently updated in the series [5] posted by Morten Rasmussen

For testing purposes an integration branch providing all the required
dependencies as well as the patches of this RFC is available here:

git://www.linux-arm.com/linux-power eas/stune/rfcv1


Test results
============

Tests have been performed on an ARM Juno board, booted using only the LITTLE
cluster (4x ARM64 CortexA53 @ 850 MHz) to mimic a standard SMP platform.

Impact on scheduler performance
-------------------------------

Performance impact has been evaluated using the hackbench test provided by
perf with this command line:

perf bench sched messaging --thread --group 25 --loop 1000

Reported completion times (CTime) in seconds are averages over 10 runs

                 |           |            |     SchedTune
                 |  Ondemand | sched-DVFS | Global   PerTask
-----------------+-----------+------------+--------------------
CTime [s]        |      50.9 |       50.3 |   51.1      51.3
vs Ondemand [%]  |      0.00 |      -1.19 |   0.34      0.84
-----------------+-----------+------------+--------------------
Energy           |           |            |
vs Ondemand [%]  |      0.00 |      -0.80 |   1.16      1.45
-----------------+-----------+------------+--------------------

Overall considerations are:

1) sched-DVFS is quite well positioned compared to the Ondemand
governor with respect to both performance and energy consumption

2) SchedTune is always worse than the Ondemand governor because the
current implementation lacks optimizations for working under
saturated conditions

The SchedTune extension is useful only on a lightly loaded system.
On the other hand, when the system is saturated, the SchedTune support
should be automatically disabled. This automatic disabling is currently
being implemented and will be posted in the next revision of this RFC.


Performance/energy impacts of task boosting
-------------------------------------------

We considered a set of rt-app [6] generated workloads.
All the tests are executed using:
- 4 threads (to match the number of available CPUs)
- each thread has a 2ms period
- duty-cycle (at highest OPP) is (6,13,19,25,31,38 and 44)%
- each workload runs for 60s

The energy metric (EnergyDiff) is based on energy counters available on the
Juno platform and it reports the energy consumption for the complete execution
of the workload.

The performance evaluation is based on data obtained by rt-app [6] using the
same metric introduced with the EAS RFCv5 [5].

The following table reports the percentage variation on each metric.
Each variation compares:
base : workload run using the sched-DVFS governor but without boosting
testNN : workload run using the sched-DVFS governor with a NN boost value
configured for just the tasks of the workload,
i.e. using per-task boosting

Reported numbers are averages over 10 runs for each test configuration.
Numbers in (parentheses) are references for the comments below the table.


Test Id : Comparison        | EnergyDiff [%] | PerfIndex [%]  |
----------------------------+----------------+----------------+
Test_43 : test05 vs base    | (1)      -0.24 | (4)      -1.22 |
Test_43 : test10 vs base    |          -0.25 |          -0.82 |
Test_43 : test30 vs base    |          -0.22 |          -0.62 |
Test_43 : test80 vs base    |          22.72 |          10.40 |
----------------------------+----------------+----------------+
Test_44 : test05 vs base    | (1)      -0.37 |           1.43 |
Test_44 : test10 vs base    |          -0.30 |           0.70 |
Test_44 : test30 vs base    |           0.52 |           1.57 |
Test_44 : test80 vs base    |          21.08 |          17.36 |
----------------------------+----------------+----------------+
Test_45 : test05 vs base    | (1)      -0.17 |           1.00 |
Test_45 : test10 vs base    |          -0.12 |          -0.22 |
Test_45 : test30 vs base    |           4.15 |           8.25 |
Test_45 : test80 vs base    |          21.84 |          22.38 |
----------------------------+----------------+----------------+
Test_46 : test05 vs base    | (1)      -0.09 |          -0.48 |
Test_46 : test10 vs base    |          -0.02 |          -1.06 |
Test_46 : test30 vs base    |           4.36 |          13.01 |
Test_46 : test80 vs base    | (2)      21.15 | (3)      29.58 |
----------------------------+----------------+----------------+
Test_47 : test05 vs base    |           0.11 |           1.15 |
Test_47 : test10 vs base    |           0.58 |           1.99 |
Test_47 : test30 vs base    |           5.44 |           8.54 |
Test_47 : test80 vs base    | (2)      22.47 | (3)      30.88 |
----------------------------+----------------+----------------+
Test_48 : test05 vs base    |           4.23 |           5.00 |
Test_48 : test10 vs base    |           7.32 |          16.88 |
Test_48 : test30 vs base    |          14.75 |          28.72 |
Test_48 : test80 vs base    | (2)      29.11 | (3)      42.30 |
----------------------------+----------------+----------------+
Test_49 : test05 vs base    |           0.21 |           1.15 |
Test_49 : test10 vs base    |           0.50 |           2.47 |
Test_49 : test30 vs base    |           6.60 |          11.51 |
Test_49 : test80 vs base    | (2)      18.22 | (3)      27.45 |


Comments on Results
===================

The goal of the proposed task boosting strategy is to speed up task
completion by running tasks at a higher Operating Performance Point (OPP)
than the lowest OPP required by the specific workload.

Here are some considerations on reported results:

a) Low intensity workloads present a small decrease in energy
consumption (1), probably due to a race-to-idle effect when running
at lower OPP. Otherwise, in general we observe an increase in
energy consumption which is monotonic and proportional to the
configured boost value.

b) Higher boost values (2) are subject to 20-30% more energy
consumption, which is however compensated by an expected improvement
in the performance metric.

c) The PerfIndex is in general aligned with the magnitude of the boost
value. The more we boost the workload, the sooner task activations
complete and thus the better the PerfIndex metric (3).

d) On really small workloads, when the boosting value is relatively small (4),
the overhead introduced by SchedTune is not compensated by the possibility
of selecting a higher OPP.
This aspect is part of the SchedTune optimization that we will target in
the next posting.


Conclusions
===========

The proposed patch set provides a simple and effective tunable knob which
allows boosting the performance of low-intensity tasks. This tunable works by
biasing sched-DVFS in the selection of the operating frequency.
This allows trading increased energy consumption for faster task
completion times.
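
As a rough illustration of this biasing, the Signal Proportional Compensation
strategy described in the documentation patch (08/14) can be sketched in plain
C as below. This is a user-space sketch, not the kernel code: the helper names
spc_margin() and spc_boosted() are hypothetical, SCHED_LOAD_SCALE is assumed to
be 1024, and the division by 100 assumes the boost value is a percentage in
the range [0..100].

```c
#include <assert.h>

/* Assumption: full-scale value of the load tracking signals. */
#define SCHED_LOAD_SCALE 1024UL

/*
 * Hypothetical helper: margin proportional to the complement of the
 * signal, i.e. to the distance between the signal and its maximum.
 */
static unsigned long spc_margin(unsigned long signal, unsigned int boost_pct)
{
	assert(signal <= SCHED_LOAD_SCALE);
	assert(boost_pct <= 100);

	return (SCHED_LOAD_SCALE - signal) * boost_pct / 100;
}

/* Hypothetical helper: boosted_signal := signal + margin */
static unsigned long spc_boosted(unsigned long signal, unsigned int boost_pct)
{
	return signal + spc_margin(signal, boost_pct);
}
```

Under these assumptions a signal of 256 boosted by 50% becomes 640, a 0% boost
leaves the signal unchanged and a 100% boost always yields SCHED_LOAD_SCALE.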

This patch set provides just the foundation bits which focus on OPP
selection. A further extension of this patch set is under development
to target the integration with the Energy Aware Scheduler (EAS) [5] by
biasing CPU selection.

This will complete the boosting knob semantics by providing a single
knob which can be used:
a) to tune sched-DVFS to mimic (dynamically and on a per-task basis) the
behaviors of other governors (e.g. ondemand, performance and interactive)
b) to tune EAS to be more energy-efficient or performance boosting oriented


References
==========

[1] Remove stale power aware scheduling remnants and dysfunctional knobs
http://lkml.org/lkml/2012/5/18/91
[2] Power-efficient scheduling design
http://lwn.net/Articles/552889
[3] Compute capacity invariant load/utilization tracking
http://lkml.org/lkml/2015/8/14/296
[4] Scheduler-driven CPU frequency selection (RFCv3)
http://lkml.org/lkml/2015/6/26/620
[5] Energy cost model for energy-aware scheduling (RFCv5)
https://lkml.org/lkml/2015/7/7/754
[6] Extended RT-App to report Time to Completion
https://github.com/scheduler-tools/rt-app.git exp/eas_v5


Juri Lelli (7):
sched/cpufreq_sched: use static key for cpu frequency selection
sched/fair: add triggers for OPP change requests
sched/{core,fair}: trigger OPP change request on fork()
sched/{fair,cpufreq_sched}: add reset_capacity interface
sched/fair: jump to max OPP when crossing UP threshold
sched/cpufreq_sched: modify pcpu_capacity handling
sched/fair: cpufreq_sched triggers for load balancing

Patrick Bellasi (7):
sched/tune: add detailed documentation
sched/tune: add sysctl interface to define a boost value
sched/fair: add function to convert boost value into "margin"
sched/fair: add boosted CPU usage
sched/tune: add initial support for CGroups based boosting
sched/tune: compute and keep track of per CPU boost value
sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value

Documentation/scheduler/sched-tune.txt | 367 +++++++++++++++++++++++++++++
include/linux/cgroup_subsys.h | 4 +
include/linux/sched/sysctl.h | 16 ++
init/Kconfig | 43 ++++
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/cpufreq_sched.c | 28 ++-
kernel/sched/fair.c | 168 +++++++++++++-
kernel/sched/sched.h | 10 +
kernel/sched/tune.c | 411 +++++++++++++++++++++++++++++++++
kernel/sched/tune.h | 23 ++
kernel/sysctl.c | 15 ++
12 files changed, 1082 insertions(+), 6 deletions(-)
create mode 100644 Documentation/scheduler/sched-tune.txt
create mode 100644 kernel/sched/tune.c
create mode 100644 kernel/sched/tune.h

--
2.5.0


2015-08-19 18:51:42

by Patrick Bellasi

Subject: [RFC PATCH 01/14] sched/cpufreq_sched: use static key for cpu frequency selection

From: Juri Lelli <[email protected]>

Introduce a static key to only affect scheduler hot paths when sched
governor is enabled.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/cpufreq_sched.c | 14 ++++++++++++++
kernel/sched/fair.c | 2 ++
kernel/sched/sched.h | 6 ++++++
3 files changed, 22 insertions(+)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 5020f24..2968f3a 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ out:
return;
}

+static inline void set_sched_energy_freq(void)
+{
+ if (!sched_energy_freq())
+ static_key_slow_inc(&__sched_energy_freq);
+}
+
+static inline void clear_sched_energy_freq(void)
+{
+ if (sched_energy_freq())
+ static_key_slow_dec(&__sched_energy_freq);
+}
+
static int cpufreq_sched_start(struct cpufreq_policy *policy)
{
struct gov_data *gd;
@@ -243,6 +255,7 @@ static int cpufreq_sched_start(struct cpufreq_policy *policy)

policy->governor_data = gd;
gd->policy = policy;
+ set_sched_energy_freq();
return 0;

err:
@@ -254,6 +267,7 @@ static int cpufreq_sched_stop(struct cpufreq_policy *policy)
{
struct gov_data *gd = policy->governor_data;

+ clear_sched_energy_freq();
if (cpufreq_driver_might_sleep()) {
kthread_stop(gd->task);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a04b074..b35d90b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4075,6 +4075,8 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5af84b..07ab036 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1415,6 +1415,12 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
}
#endif

+extern struct static_key __sched_energy_freq;
+static inline bool sched_energy_freq(void)
+{
+ return static_key_false(&__sched_energy_freq);
+}
+
#ifdef CONFIG_CPU_FREQ_GOV_SCHED
void cpufreq_sched_set_cap(int cpu, unsigned long util);
#else
--
2.5.0

2015-08-19 18:47:46

by Patrick Bellasi

Subject: [RFC PATCH 02/14] sched/fair: add triggers for OPP change requests

From: Juri Lelli <[email protected]>

Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose. Only trigger a freq request if we are effectively waking up
or going to sleep. Filter out load balancing related calls to reduce the
number of triggers.
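
To make the request computation concrete, here is a stand-alone sketch of the
arithmetic performed by update_capacity_of() in the diff below. The function
name requested_capacity() is hypothetical and the kernel's cpu_util() and
capacity_orig_of() are replaced by plain arguments; only the capacity_margin
value (1280) is taken from the patch.

```c
#include <assert.h>

/* Value used by the patch: 1280/1024 scales the utilization by 1.25. */
static const unsigned long capacity_margin = 1280;

/*
 * Hypothetical user-space version of the request computation:
 * utilization plus head room, relative to the CPU's original capacity.
 */
static unsigned long requested_capacity(unsigned long util,
					unsigned long capacity_orig)
{
	assert(capacity_orig > 0);

	return util * capacity_margin / capacity_orig;
}
```

For example, a utilization of 400 on a CPU with an original capacity of 1024
gives a request of 500, while a fully utilized CPU (1024) gives 1280.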

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b35d90b..ebf86b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4075,6 +4075,26 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+/*
+ * ~20% capacity margin that we add to every capacity change
+ * request to provide some head room if task utilization further
+ * increases.
+ */
+static unsigned int capacity_margin = 1280;
+static unsigned long capacity_orig_of(int cpu);
+static int cpu_util(int cpu);
+
+static void update_capacity_of(int cpu)
+{
+ unsigned long req_cap;
+
+ if (!sched_energy_freq())
+ return;
+
+ req_cap = cpu_util(cpu) * capacity_margin / capacity_orig_of(cpu);
+ cpufreq_sched_set_cap(cpu, req_cap);
+}
+
struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;

/*
@@ -4087,6 +4107,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_new = !(flags & ENQUEUE_WAKEUP);

for_each_sched_entity(se) {
if (se->on_rq)
@@ -4118,9 +4139,22 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}

- if (!se)
+ if (!se) {
add_nr_running(rq, 1);

+ /*
+ * We want to potentially trigger a freq switch request only for
+ * tasks that are waking up; this is because we get here also during
+ * load balancing, but in these cases it seems wise to trigger
+ * as single request after load balancing is done.
+ *
+ * XXX: how about fork()? Do we need a special flag/something
+ * to tell if we are here after a fork() (wakeup_task_new)?
+ *
+ */
+ if (!task_new)
+ update_capacity_of(cpu_of(rq));
+ }
hrtick_update(rq);
}

@@ -4178,9 +4212,18 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}

- if (!se)
+ if (!se) {
sub_nr_running(rq, 1);

+ /*
+ * We want to potentially trigger a freq switch request only for
+ * tasks that are going to sleep; this is because we get here also
+ * during load balancing, but in these cases it seems wise to trigger
+ * as single request after load balancing is done.
+ */
+ if (task_sleep)
+ update_capacity_of(cpu_of(rq));
+ }
hrtick_update(rq);
}

--
2.5.0

2015-08-19 18:51:11

by Patrick Bellasi

Subject: [RFC PATCH 03/14] sched/{core,fair}: trigger OPP change request on fork()

From: Juri Lelli <[email protected]>

Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.

However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.

This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 9 +++------
kernel/sched/sched.h | 1 +
3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43952c7..e901340 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2360,7 +2360,7 @@ void wake_up_new_task(struct task_struct *p)
#endif

rq = __task_rq_lock(p);
- activate_task(rq, p, 0);
+ activate_task(rq, p, ENQUEUE_WAKEUP_NEW);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebf86b4..a75ea07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4107,7 +4107,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
- int task_new = !(flags & ENQUEUE_WAKEUP);
+ int task_new = flags & ENQUEUE_WAKEUP_NEW;
+ int task_wakeup = flags & ENQUEUE_WAKEUP;

for_each_sched_entity(se) {
if (se->on_rq)
@@ -4147,12 +4148,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* tasks that are waking up; this is because we get here also during
* load balancing, but in these cases it seems wise to trigger
* as single request after load balancing is done.
- *
- * XXX: how about fork()? Do we need a special flag/something
- * to tell if we are here after a fork() (wakeup_task_new)?
- *
*/
- if (!task_new)
+ if (task_new || task_wakeup)
update_capacity_of(cpu_of(rq));
}
hrtick_update(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 07ab036..1f0b433 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1164,6 +1164,7 @@ static const u32 prio_to_wmult[40] = {
#define ENQUEUE_WAKING 0
#endif
#define ENQUEUE_REPLENISH 8
+#define ENQUEUE_WAKEUP_NEW 16

#define DEQUEUE_SLEEP 1

--
2.5.0

2015-08-19 18:47:50

by Patrick Bellasi

Subject: [RFC PATCH 04/14] sched/{fair,cpufreq_sched}: add reset_capacity interface

From: Juri Lelli <[email protected]>

When a CPU is going idle it is pointless to ask for an OPP update as we
would wake up another task only to ask for the same capacity we are already
running at (utilization gets moved to blocked_utilization). We thus add
the cpufreq_sched_reset_cap() interface to just reset our current capacity
request without triggering any real update. At wakeup we will use the
decayed utilization to select an appropriate OPP.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/cpufreq_sched.c | 12 ++++++++++++
kernel/sched/fair.c | 8 ++++++--
kernel/sched/sched.h | 3 +++
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index 2968f3a..e6b4a22 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -203,6 +203,18 @@ out:
return;
}

+/**
+ * cpufreq_sched_reset_capacity - interface to scheduler for resetting capacity
+ * requests
+ * @cpu: cpu whose capacity request has to be reset
+ *
+ * This _wont trigger_ any capacity update.
+ */
+void cpufreq_sched_reset_cap(int cpu)
+{
+ per_cpu(pcpu_capacity, cpu) = 0;
+}
+
static inline void set_sched_energy_freq(void)
{
if (!sched_energy_freq())
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a75ea07..2961e29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4218,8 +4218,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* during load balancing, but in these cases it seems wise to trigger
* as single request after load balancing is done.
*/
- if (task_sleep)
- update_capacity_of(cpu_of(rq));
+ if (task_sleep) {
+ if (rq->cfs.nr_running)
+ update_capacity_of(cpu_of(rq));
+ else if (sched_energy_freq())
+ cpufreq_sched_reset_cap(cpu_of(rq));
+ }
}
hrtick_update(rq);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f0b433..ad9293b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1424,9 +1424,12 @@ static inline bool sched_energy_freq(void)

#ifdef CONFIG_CPU_FREQ_GOV_SCHED
void cpufreq_sched_set_cap(int cpu, unsigned long util);
+void cpufreq_sched_reset_cap(int cpu);
#else
static inline void cpufreq_sched_set_cap(int cpu, unsigned long util)
{ }
+static inline void cpufreq_sched_reset_cap(int cpu)
+{ }
#endif

static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
--
2.5.0

2015-08-19 18:50:51

by Patrick Bellasi

Subject: [RFC PATCH 05/14] sched/fair: jump to max OPP when crossing UP threshold

From: Juri Lelli <[email protected]>

Since the true utilization of a long running task is not detectable while
it is running and might be bigger than the current cpu capacity, create the
maximum cpu capacity head room by requesting the maximum cpu capacity once
the cpu usage plus the capacity margin exceeds the current capacity. This
is also done to try to harm the performance of a task the least.
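
The check added to task_tick_fair() in the diff below can be read as
"utilization above 80% of the current capacity", since 1024/1280 = 0.8. A
stand-alone sketch of that condition follows. It assumes SCHED_LOAD_SCALE is
1024, passes the kernel's cpu_util() and capacity values as plain arguments,
and uses a hypothetical helper name, should_jump_to_max().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: full-scale value of the load tracking signals. */
#define SCHED_LOAD_SCALE 1024UL

/* Value introduced by patch 02/14. */
static const unsigned long capacity_margin = 1280;

/*
 * Hypothetical user-space version of the UP threshold check: request the
 * maximum capacity only if we are not already there and the utilization
 * plus margin exceeds the capacity of the current OPP.
 */
static bool should_jump_to_max(unsigned long util,
			       unsigned long capacity_curr,
			       unsigned long capacity_orig)
{
	assert(capacity_curr <= capacity_orig);

	return capacity_curr < capacity_orig &&
	       capacity_curr * SCHED_LOAD_SCALE < util * capacity_margin;
}
```

With a current capacity of 512 out of 1024, a utilization of 400 stays below
the threshold (409.6) while a utilization of 410 crosses it. A CPU already
running at its original capacity never triggers the request.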

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2961e29..6197b3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7864,6 +7864,24 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

if (numabalancing_enabled)
task_tick_numa(rq, curr);
+
+ /*
+ * To make free room for a task that is building up its "real"
+ * utilization and to harm its performance the least, request a
+ * jump to max OPP as soon as get_cpu_usage() crosses the UP
+ * threshold. The UP threshold is built relative to the current
+ * capacity (OPP), by using capacity_margin.
+ */
+ if (sched_energy_freq()) {
+ int cpu = cpu_of(rq);
+ unsigned long capacity_orig = capacity_orig_of(cpu);
+ unsigned long capacity_curr = capacity_curr_of(cpu);
+
+ if (capacity_curr < capacity_orig &&
+ (capacity_curr * SCHED_LOAD_SCALE) <
+ (cpu_util(cpu) * capacity_margin))
+ cpufreq_sched_set_cap(cpu, capacity_orig);
+ }
}

/*
--
2.5.0

2015-08-19 18:47:53

by Patrick Bellasi

Subject: [RFC PATCH 06/14] sched/cpufreq_sched: modify pcpu_capacity handling

From: Juri Lelli <[email protected]>

Use the cpu argument of cpufreq_sched_set_cap() to handle per_cpu writes,
as it can be called remotely (e.g., from load balancing code).

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Acked-by: Michael Turquette <[email protected]>
---
kernel/sched/cpufreq_sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
index e6b4a22..27f2cec 100644
--- a/kernel/sched/cpufreq_sched.c
+++ b/kernel/sched/cpufreq_sched.c
@@ -151,7 +151,7 @@ void cpufreq_sched_set_cap(int cpu, unsigned long capacity)
unsigned long capacity_max = 0;

/* update per-cpu capacity request */
- __this_cpu_write(pcpu_capacity, capacity);
+ per_cpu(pcpu_capacity, cpu) = capacity;

policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
--
2.5.0

2015-08-19 18:50:08

by Patrick Bellasi

Subject: [RFC PATCH 07/14] sched/fair: cpufreq_sched triggers for load balancing

From: Juri Lelli <[email protected]>

As we don't trigger freq changes from {en,de}queue_task_fair() during load
balancing, we need to do so explicitly on load balancing paths.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/fair.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6197b3b..955dfe1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7030,6 +7030,11 @@ more_balance:
* ld_moved - cumulative load moved across iterations
*/
cur_ld_moved = detach_tasks(&env);
+ /*
+ * We want to potentially lower env.src_cpu's OPP.
+ */
+ if (cur_ld_moved)
+ update_capacity_of(env.src_cpu);

/*
* We've detached some tasks from busiest_rq. Every
@@ -7044,6 +7049,10 @@ more_balance:
if (cur_ld_moved) {
attach_tasks(&env);
ld_moved += cur_ld_moved;
+ /*
+ * We want to potentially raise env.dst_cpu's OPP.
+ */
+ update_capacity_of(env.dst_cpu);
}

local_irq_restore(flags);
@@ -7398,8 +7407,13 @@ static int active_load_balance_cpu_stop(void *data)
schedstat_inc(sd, alb_count);

p = detach_one_task(&env);
- if (p)
+ if (p) {
schedstat_inc(sd, alb_pushed);
+ /*
+ * We want to potentially lower env.src_cpu's OPP.
+ */
+ update_capacity_of(env.src_cpu);
+ }
else
schedstat_inc(sd, alb_failed);
}
@@ -7408,8 +7422,13 @@ out_unlock:
busiest_rq->active_balance = 0;
raw_spin_unlock(&busiest_rq->lock);

- if (p)
+ if (p) {
attach_one_task(target_rq, p);
+ /*
+ * We want to potentially raise target_cpu's OPP.
+ */
+ update_capacity_of(target_cpu);
+ }

local_irq_enable();

--
2.5.0

2015-08-19 18:47:56

by Patrick Bellasi

Subject: [RFC PATCH 08/14] sched/tune: add detailed documentation

The topic of a single simple power-performance tunable, that is wholly
scheduler centric, and has well defined and predictable properties has
come up on several occasions in the past. With techniques such as a
scheduler driven DVFS, we now have a good framework for implementing
such a tunable.

This patch provides a detailed description of the motivations and design
decisions behind the implementation of SchedTune.

cc: Jonathan Corbet <[email protected]>
cc: [email protected]
Signed-off-by: Patrick Bellasi <[email protected]>
---
Documentation/scheduler/sched-tune.txt | 367 +++++++++++++++++++++++++++++++++
1 file changed, 367 insertions(+)
create mode 100644 Documentation/scheduler/sched-tune.txt

diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..cb795e6
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,367 @@
+ Central, scheduler-driven, power-performance control
+ (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single simple power-performance tunable, that is wholly
+scheduler centric, and has well defined and predictable properties has come up
+on several occasions in the past [1,2]. With techniques such as a scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+This document describes the overall ideas behind its design and implementation.
+
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Question and Answers
+ - What about "auto" mode?
+ - What about boosting on a congested system?
+ - How CPUs are boosted when we have tasks with multiple boost values?
+7. References
+
+
+1. Motivation
+=============
+
+Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
+scheduler to select the optimal DVFS operating point (OPP) for running a task
+allocated to a CPU. The introduction of sched-DVFS enables running workloads at
+the most energy efficient OPPs.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one that is actually required
+by its CPU bandwidth demand.
+
+This last requirement is especially important if we consider that one of the
+main goals of the sched-DVFS component is to replace all currently available
+CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
+driven governors we currently have, it is already more responsive at selecting
+the optimal OPP to run tasks allocated to a CPU. However, just tracking the
+actual task load demand may not be enough from a performance standpoint. For
+example, it is not possible to get behaviors similar to those provided by the
+"performance" and "interactive" CPUFreq governors.
+
+This document describes an implementation of a tunable, stacked on top of the
+sched-DVFS which extends its functionality to support task performance
+boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates). For
+example, if we consider a simple periodic task which executes the same workload
+for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
+that task must complete each of its activations in less than 5[s].
+
+A previous attempt [5] to introduce such a boosting feature has not been
+successful mainly because of the complexity of the proposed solution. The
+approach described in this document exposes a single simple interface to
+user-space. This single tunable knob allows the tuning of system wide
+scheduler behaviours ranging from energy efficiency at one end through to
+incremental performance boosting at the other end. This first tunable affects
+all tasks. However, a more advanced extension of the concept is also provided
+which uses CGroups to boost the performance of only selected tasks while using
+the energy efficient default for all others.
+
+The rest of this document introduces in more detail the proposed solution
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface with a single power-performance
+tunable:
+
+ /proc/sys/kernel/sched_cfs_boost
+
+This permits expressing a boost value as an integer in the range [0..100].
+
+A value of 0 (default) configures the CFS scheduler for maximum energy
+efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
+required to satisfy their workload demand.
+A value of 100 configures the scheduler for maximum performance, which translates
+to the selection of the maximum OPP on that CPU.
+
+The range between 0 and 100 can be set to satisfy other scenarios suitably. For
+example to satisfy interactive response or depending on other system events
+(battery level etc).
+
+A CGroup based extension is also provided, which permits further user-space
+defined task classification to tune the scheduler for different goals depending
+on the specific nature of the task, e.g. background vs interactive vs
+low-priority.
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
+Performance Point (OPP) selection.
+Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
+the operating frequency of that CPU to better match the workload demand. The
+selection of the actual OPP being activated is influenced by the global boost
+value, or the boost value for the task CGroup when in use.
+
+This simple biasing approach leverages existing frameworks, which means minimal
+modifications to the scheduler, and yet it makes it possible to achieve a range
+of different behaviours, all from a single simple tunable knob.
+The only new concept introduced is that of signal boosting.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few load tracking signals
+which basically track the CPU bandwidth requirements for tasks and the capacity
+of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
+some of these load tracking signals to make a task or RQ appear more demanding
+than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer". However,
+independently from the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+ margin := boosting_strategy(sched_cfs_boost, signal)
+ boosted_signal := signal + margin
+
+Different boosting strategies were identified and analyzed before selecting the
+one found to be most effective.
+
+Signal Proportional Compensation (SPC)
+--------------------------------------
+
+In this boosting strategy the sched_cfs_boost value is used to compute a
+margin which is proportional to the complement of the original signal.
+When a signal has a maximum possible value, its complement is defined as
+the delta from the actual value and its possible maximum.
+
+Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
+the maximum possible value, the margin becomes:
+
+ margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each value in the range of sched_cfs_boost inflates the signal in question by
+ a quantity proportional to the gap between the signal and its maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+- 0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+- 50% boosting: run at the half-way OPP between minimum and maximum
+
+Which means that, at 50% boosting, a task will be scheduled to run half-way
+between the performance its workload requires and the maximum theoretically
+achievable performance on the specific target platform.
+
+A graphical representation of an SPC boosted signal is shown in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a 50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+   ^
+   |  SCHED_LOAD_SCALE
+   +-----------------------------------------------------------------+
+   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+   |
+   |                                             boosted_signal
+   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
+   |
+   |                                            original signal
+   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+   |                                          |
+   |bbbbbbbbbbbbbbbbbb                        |
+   |                                          |
+   |                                          |
+   |                                          |
+   |                  +-----------------------+
+   |                  |
+   |                  |
+   |                  |
+   |------------------+
+   |
+   |
+   +----------------------------------------------------------------------->
+
+The plot above shows a ramped load signal (labelled 'original signal') and its
+boosted equivalent. For each step of the original signal the boosted signal
+corresponding to a 50% boost is midway from the original signal and the upper
+bound. Boosting by 100% generates a boosted signal which is always saturated to
+the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new load
+signals. Instead, it provides an API to tune existing signals. This tuning is
+done on demand and only in scheduler code paths where it is sensible to do so.
+The new API calls are defined to return either the default signal or a boosted
+one, depending on the value of sched_cfs_boost. This is a clean and non-invasive
+modification of the existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
+(i.e. CFS run-queue) to appear more used than it actually is.
+
+Thus, with the sched_cfs_boost enabled we have the following main functions to
+get the current utilization of a CPU:
+
+ cpu_util()
+ boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to the first but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where sched-DVFS needs to
+decide the OPP to run a CPU at.
+For example, this allows selecting the highest OPP for a CPU which has
+the boost value set to 100%.
+
+
+5. Per task group boosting
+==========================
+
+The availability of a single knob which is used to boost all tasks in the
+system is certainly a simple solution but it quite likely doesn't fit many
+utilization scenarios, especially in the mobile device space.
+
+For example, on battery powered devices there usually are many background
+services which are long running and need energy efficient scheduling. On the
+other hand, some applications are more performance sensitive and require an
+interactive response and/or maximum performance, regardless of the energy cost.
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", can be enabled, which allows
+defining and configuring task groups with different boosting values.
+Tasks that require special performance can be put into separate CGroups.
+The value of the boost associated with the tasks in this group can be specified
+using a single knob exposed by the CGroup controller:
+
+ schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+ 1) It is only possible to create hierarchies of depth 1
+
+ The root control group defines the system-wide boost value to be applied
+ by default to all tasks. Its direct subgroups are named "boost groups" and
+ they define the boost value for a specific set of tasks.
+ Further nested subgroups are not allowed since they do not have a sensible
+ meaning from a user-space standpoint.
+
+ 2) It is possible to define only a limited number of "boost groups"
+
+ This number is defined at compile time and by default configured to 16.
+ This is a design decision motivated by two main reasons:
+ a) In a real system we do not expect utilization scenarios with more than a few
+ boost groups. For example, a reasonable collection of groups could be
+ just "background", "interactive" and "performance".
+ b) It simplifies the implementation considerably, especially for the code
+ which has to compute the per CPU boosting once there are multiple
+ RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios identified
+so far. It provides a simple interface which can be used to manage the
+power-performance of all tasks or only selected tasks.
+Moreover, this interface can be easily integrated by user-space run-times (e.g.
+Android, ChromeOS) to implement a QoS solution for task boosting based on tasks
+classification, which has been a long standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CGROUP_SCHEDTUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+ root@linaro-nano:~# cat /proc/cgroups
+ #subsys_name hierarchy num_cgroups enabled
+ cpuset 0 1 1
+ cpu 0 1 1
+ schedtune 0 1 1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+ root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+ root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
+ root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Setup the system-wide boost value (Optional)
+
+ If not configured the root control group has a 0% boost value, which
+ basically disables boosting for all tasks in the system thus running in
+ an energy-efficient mode.
+
+ root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
+
+5. Create task groups and configure their specific boost value (Optional)
+
+ For example, here we create a "performance" boost group configured to boost
+ all its tasks to 100%
+
+ root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
+ root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+6. Move tasks into the boost group
+
+ For example, the following moves the task with PID $TASKPID (and all its
+ threads) into the "performance" boost group.
+
+ root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the threads of the $TASKPID task to run,
+when needed, at the highest OPP in the most capable CPU of the system.
+
+
+6. Questions and Answers
+========================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
+with some suitable user-space element. This element could use the exposed
+system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. Once sched-DVFS selects the OPP to run a CPU at, the CPU utilization
+is boosted with a value which is the maximum of the boost values of the
+currently RUNNABLE tasks in its RQ.
+
+This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
+
+
+7. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] http://lkml.org/lkml/2015/6/26/620
+
--
2.5.0

2015-08-19 18:49:43

by Patrick Bellasi

Subject: [RFC PATCH 09/14] sched/tune: add sysctl interface to define a boost value

The current (CFS) scheduler implementation does not allow boosting task
performance by running tasks at a higher OPP than the minimum required
to meet their workload demands.

To support task performance boosting, the scheduler should provide a
"knob" that tunes how much the system is optimised for energy efficiency
vs performance.

This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
/proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
- 0% boost requests operation in "standard" mode, scheduling tasks at
the minimum capacities required by the workload demand
- 100% boost requests maximum task performance, "regardless" of the
incurred energy consumption

A boost value in between these two boundaries is used to bias the
power/performance trade-off: the higher the boost value, the more the
scheduler is biased toward performance boosting instead of energy
efficiency.

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
include/linux/sched/sysctl.h | 16 ++++++++++++++++
init/Kconfig | 26 ++++++++++++++++++++++++++
kernel/sched/Makefile | 1 +
kernel/sched/tune.c | 17 +++++++++++++++++
kernel/sysctl.c | 11 +++++++++++
5 files changed, 71 insertions(+)
create mode 100644 kernel/sched/tune.c

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..4479e48 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -77,6 +77,22 @@ extern int sysctl_sched_rt_runtime;
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif

+#ifdef CONFIG_SCHED_TUNE
+extern unsigned int sysctl_sched_cfs_boost;
+int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length,
+ loff_t *ppos);
+static inline unsigned int get_sysctl_sched_cfs_boost(void)
+{
+ return sysctl_sched_cfs_boost;
+}
+#else
+static inline unsigned int get_sysctl_sched_cfs_boost(void)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern unsigned int sysctl_sched_autogroup_enabled;
#endif
diff --git a/init/Kconfig b/init/Kconfig
index af09b4f..7fa3419 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1220,6 +1220,32 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.

+config SCHED_TUNE
+ bool "Boosting for CFS tasks (EXPERIMENTAL)"
+ help
+ This option enables the system-wide support for task boosting.
+ When this support is enabled a new sysctl interface is exposed to
+ userspace via:
+ /proc/sys/kernel/sched_cfs_boost
+ which allows setting a system-wide boost value in range [0..100].
+
+ The current boosting strategy is implemented in such a way that:
+ - a 0% boost value requests operation in "standard" mode, by
+ scheduling all tasks at the minimum capacities required by their
+ workload demand
+ - a 100% boost value requests maximum task performance,
+ "regardless" of the incurred energy consumption
+
+ A boost value in between these two boundaries is used to bias the
+ power/performance trade-off, the higher the boost value the more the
+ scheduler is biased toward performance boosting instead of energy
+ efficiency.
+
+ Since this support exposes a single system-wide knob, the specified
+ boost value is applied to all (CFS) tasks in the system.
+
+ If unsure, say N.
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 90ed832..f804ef3 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -18,5 +18,6 @@ obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_SCHED_TUNE) += tune.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
new file mode 100644
index 0000000..4c44b1a
--- /dev/null
+++ b/kernel/sched/tune.c
@@ -0,0 +1,17 @@
+#include "sched.h"
+
+unsigned int sysctl_sched_cfs_boost __read_mostly;
+
+int
+sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (ret || !write)
+ return ret;
+
+ return 0;
+}
+
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 19b62b5..2b4673e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -433,6 +433,17 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SCHED_TUNE
+ {
+ .procname = "sched_cfs_boost",
+ .data = &sysctl_sched_cfs_boost,
+ .maxlen = sizeof(sysctl_sched_cfs_boost),
+ .mode = 0644,
+ .proc_handler = &sysctl_sched_cfs_boost_handler,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
2.5.0

2015-08-19 18:49:20

by Patrick Bellasi

Subject: [RFC PATCH 10/14] sched/fair: add function to convert boost value into "margin"

The basic idea of the boost knob is to "artificially inflate" a signal
to make a task or logical CPU appear more demanding than it actually
is. Independently from the specific signal, a consistent and possibly
simple semantic for the concept of "signal boosting" must define:
1. how we translate the boost percentage into a "margin" value to be added
to the original signal to inflate it
2. what a boost value means from a user-space perspective

This patch provides the implementation of a possible boost semantic,
named "Signal Proportional Compensation" (SPC), where the boost
percentage (BP) is used to compute a margin (M) which is proportional to
the complement of the original signal (OS):
M = BP * (SCHED_LOAD_SCALE - OS)
The computed margin is then added to the OS to obtain the Boosted Signal (BS):
BS = OS + M

The proposed boost semantic has these main features:
- each signal gets a boost which is proportional to its delta with respect
to the maximum available capacity in the system (i.e. SCHED_LOAD_SCALE)
- a 100% boost has a clear meaning from a user-space perspective,
since it simply means running (possibly) "all" tasks at the max OPP
- each boosting value improves the task performance by a quantity
which is proportional to the gap between its current performance and the
maximum achievable performance on that system
Thus this semantic enforces the following behaviour:

50% boosting means running half-way between the current and the
maximum performance which a task could achieve on that system

This patch provides the code to implement a fast integer division to
convert a boost percentage (BP) value into a margin (M).

NOTE: this code is suitable for all signals operating in range
[0..SCHED_LOAD_SCALE]
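
As an illustration only, the SPC arithmetic described above can be
mirrored in a small user-space sketch. It assumes SCHED_LOAD_SCALE is
1024, and spc_margin() is a hypothetical name chosen for this example;
it is not the kernel function itself.

```c
/* Assumption: SCHED_LOAD_SCALE is 1024, its usual value. */
#define SCHED_LOAD_SCALE 1024UL

/*
 * Hypothetical user-space mirror of the SPC margin computation.
 * The caller must keep signal within [0..SCHED_LOAD_SCALE].
 */
static unsigned long spc_margin(unsigned long signal, unsigned long boost)
{
	unsigned long long margin;

	/* M = B * (SCHED_LOAD_SCALE - S), with B still expressed in percent */
	margin = SCHED_LOAD_SCALE - signal;
	margin *= boost;

	/* Divide by 100 without a division: 1311 / 2^17 is close to 1 / 100 */
	margin *= 1311;
	margin >>= 17;

	return (unsigned long)margin;
}
```

With a signal of 512, a 50% boost gives a margin of 256 and therefore a
boosted signal of 768, half-way between the original value and 1024. A
100% boost gives a margin of 512, which saturates the signal at 1024.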

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 955dfe1..15fde75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,6 +4730,44 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}

+#ifdef CONFIG_SCHED_TUNE
+
+static unsigned long
+schedtune_margin(unsigned long signal, unsigned long boost)
+{
+ unsigned long long margin = 0;
+
+ /*
+ * Signal proportional compensation (SPC)
+ *
+ * The Boost (B) value is used to compute a Margin (M) which is
+ * proportional to the complement of the original Signal (S):
+ * M = B * (SCHED_LOAD_SCALE - S)
+ * The obtained M could be used by the caller to "boost" S.
+ */
+ margin = SCHED_LOAD_SCALE - signal;
+ margin *= boost;
+
+ /*
+ * Fast integer division by constant:
+ * Constant : (C) = 100
+ * Precision : 0.1% (P) = 0.1
+ * Reference : C * 100 / P (R) = 100000
+ *
+ * Thus:
+ * Shift bits : ceil(log(R,2)) (S) = 17
+ * Mult const : round(2^S/C) (M) = 1311
+ *
+ *
+ * */
+ margin *= 1311;
+ margin >>= 17;
+
+ return margin;
+}
+
+#endif /* CONFIG_SCHED_TUNE */
+
/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
--
2.5.0

2015-08-19 18:47:58

by Patrick Bellasi

Subject: [RFC PATCH 11/14] sched/fair: add boosted CPU usage

The CPU usage signal is used by the scheduler as an estimation of the
overall bandwidth currently allocated on a CPU. When SchedDVFS is in
use, this signal affects the selection of the operating point (OPP)
required to accommodate all the workload allocated on a CPU.
A convenient and minimally intrusive way to boost the performance of
tasks running on a CPU is to boost the CPU usage signal each time it is
used to select an OPP.

This patch introduces a new function:
boosted_cpu_util(cpu)
to return a boosted value for the usage of a specified CPU.
The margin added to the original usage is:
1. computed based on the "boosting strategy" in use
2. proportional to the system-wide boost value defined via the provided
user-space interface

The boosted signal is used by SchedDVFS (transparently) each time it
needs an estimation of the capacity required for a CPU.
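
To make the effect on the capacity request concrete, here is a minimal
user-space sketch of the computation performed in update_capacity_of()
after this patch. It assumes SCHED_LOAD_SCALE is 1024 and takes the
capacity_margin value of 1280 from the diff; boosted_util() and
required_capacity() are hypothetical names that take the utilization
and boost as plain arguments instead of reading them from a CPU.

```c
/* Assumptions: SCHED_LOAD_SCALE is 1024; capacity_margin is 1280. */
#define SCHED_LOAD_SCALE 1024UL
#define CAPACITY_MARGIN  1280UL

/* Same arithmetic as schedtune_margin(): boost% of the headroom above util. */
static unsigned long spc_margin(unsigned long util, unsigned long boost)
{
	unsigned long long margin = SCHED_LOAD_SCALE - util;

	margin *= boost;
	margin *= 1311;		/* 1311 / 2^17 is close to 1 / 100 */
	margin >>= 17;
	return (unsigned long)margin;
}

/* Hypothetical stand-in for boosted_cpu_util(): util plus its SPC margin. */
static unsigned long boosted_util(unsigned long util, unsigned long boost)
{
	if (boost == 0)
		return util;
	return util + spc_margin(util, boost);
}

/* Capacity request, scaled by the margin and the CPU's original capacity. */
static unsigned long required_capacity(unsigned long util, unsigned long boost,
				       unsigned long capacity_orig)
{
	return boosted_util(util, boost) * CAPACITY_MARGIN / capacity_orig;
}
```

For a CPU with an original capacity of 1024 and a utilization of 512,
the request grows from 640 without boosting to 960 at 50% boost and to
1280 at 100% boost.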

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15fde75..633fcab4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4083,6 +4083,7 @@ static inline void hrtick_update(struct rq *rq)
static unsigned int capacity_margin = 1280;
static unsigned long capacity_orig_of(int cpu);
static int cpu_util(int cpu);
+static inline unsigned long boosted_cpu_util(int cpu);

static void update_capacity_of(int cpu)
{
@@ -4091,7 +4092,8 @@ static void update_capacity_of(int cpu)
if (!sched_energy_freq())
return;

- req_cap = cpu_util(cpu) * capacity_margin / capacity_orig_of(cpu);
+ req_cap = boosted_cpu_util(cpu);
+ req_cap = req_cap * capacity_margin / capacity_orig_of(cpu);
cpufreq_sched_set_cap(cpu, req_cap);
}

@@ -4766,8 +4768,36 @@ schedtune_margin(unsigned long signal, unsigned long boost)
return margin;
}

+static inline unsigned int
+schedtune_cpu_margin(unsigned long util)
+{
+ unsigned int boost = get_sysctl_sched_cfs_boost();
+
+ if (boost == 0)
+ return 0;
+
+ return schedtune_margin(util, boost);
+}
+
+#else /* CONFIG_SCHED_TUNE */
+
+static inline unsigned int
+schedtune_cpu_margin(unsigned long util)
+{
+ return 0;
+}
+
#endif /* CONFIG_SCHED_TUNE */

+static inline unsigned long
+boosted_cpu_util(int cpu)
+{
+ unsigned long util = cpu_util(cpu);
+ unsigned long margin = schedtune_cpu_margin(util);
+
+ return util + margin;
+}
+
/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
--
2.5.0

2015-08-19 18:48:04

by Patrick Bellasi

Subject: [RFC PATCH 12/14] sched/tune: add initial support for CGroups based boosting

To support task performance boosting, the usage of a single knob has the
advantage of being a simple solution, both from the implementation and the
usability standpoint. However, on a real system it can be difficult to
identify a single value for the knob which fits the needs of multiple
different tasks. For example, some kernel threads and/or user-space
background services are better managed the "standard" way, while we
still want to be able to boost the performance of specific workloads.

In order to improve the flexibility of the task boosting mechanism, this
patch is the first of a small series which extends the previous
implementation to introduce "per task group" support.
This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added, which allows configuring
different boost values for different groups of tasks.
To keep the implementation simple but still effective for a boosting
strategy, the new controller:
1. allows only a two layer hierarchy
2. supports only a limited number of boost groups

A two layer hierarchy allows placing each task either:
a) in the root control group
thus being subject to a system-wide boosting value
b) in a child of the root group
thus being subject to the specific boost value defined by that
"boost group"

The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system it is likely useful to have only
a few classes of tasks which deserve different treatment.
For example, background vs foreground or interactive vs low-priority.
As an additional benefit, a limited number of boost groups also allows
a simpler implementation, especially for the code required to
compute the boost value for CPUs which have runnable tasks belonging to
different boost groups.
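
The fixed-size bookkeeping behind the limited number of boost groups
can be sketched in user space as follows. This is only an illustration:
struct boost_group, boostgroup_claim() and boostgroup_release() are
hypothetical names, and the real controller does this work inside its
css_alloc and css_free callbacks. Note that slot 0 is reserved for the
root group, so with 16 slots at most 15 child groups can exist.

```c
#include <stddef.h>

#define BOOSTGROUPS_COUNT 16

/* Hypothetical stand-in for struct schedtune: only the fields used here. */
struct boost_group {
	int idx;
	int boost;
};

static struct boost_group root_group;

/* Slot 0 is reserved for the root group; the others start out free. */
static struct boost_group *allocated_group[BOOSTGROUPS_COUNT] = {
	&root_group,
};

/* Claim the first free slot for bg; return its index, or -1 if all are taken. */
static int boostgroup_claim(struct boost_group *bg)
{
	int idx;

	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
		if (!allocated_group[idx]) {
			allocated_group[idx] = bg;
			bg->idx = idx;
			return idx;
		}
	}
	return -1;
}

/* Release the slot held by bg so that a later group can reuse it. */
static void boostgroup_release(struct boost_group *bg)
{
	allocated_group[bg->idx] = NULL;
}
```

A linear scan is adequate here because the array is small and group
creation is expected to be rare.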

cc: Tejun Heo <[email protected]>
cc: Li Zefan <[email protected]>
cc: Johannes Weiner <[email protected]>
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
include/linux/cgroup_subsys.h | 4 +
init/Kconfig | 17 ++++
kernel/sched/tune.c | 200 ++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 4 +
4 files changed, 225 insertions(+)

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..23befa0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -15,6 +15,10 @@ SUBSYS(cpu)
SUBSYS(cpuacct)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_SCHEDTUNE)
+SUBSYS(schedtune)
+#endif
+
#if IS_ENABLED(CONFIG_BLK_CGROUP)
SUBSYS(blkio)
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 7fa3419..4555e97 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -982,6 +982,23 @@ config CGROUP_CPUACCT
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a cgroup.

+config CGROUP_SCHEDTUNE
+ bool "CFS tasks boosting cgroup subsystem (EXPERIMENTAL)"
+ depends on SCHED_TUNE
+ help
+ This option provides the "schedtune" controller which improves the
+ flexibility of the task boosting mechanism by introducing the support
+ to define "per task" boost values.
+
+ This new controller:
+ 1. allows only a two-layer hierarchy, where the root defines the
+ system-wide boost value and each of its direct children defines a
+ different "class of tasks" to be boosted with a different value
+ 2. supports up to 16 different task classes, each of which can be
+ configured with a different boost value
+
+ Say N if unsure.
+
config PAGE_COUNTER
bool

diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 4c44b1a..a26295c 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -1,7 +1,207 @@
+#include <linux/cgroup.h>
+#include <linux/err.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+
#include "sched.h"

unsigned int sysctl_sched_cfs_boost __read_mostly;

+#ifdef CONFIG_CGROUP_SCHEDTUNE
+
+/*
+ * EAS scheduler tunables for task groups.
+ */
+
+/* SchedTune tunables for a group of tasks */
+struct schedtune {
+ /* SchedTune CGroup subsystem */
+ struct cgroup_subsys_state css;
+
+ /* Boost group allocated ID */
+ int idx;
+
+ /* Boost value for tasks on that SchedTune CGroup */
+ int boost;
+
+};
+
+static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct schedtune, css) : NULL;
+}
+
+static inline struct schedtune *task_schedtune(struct task_struct *tsk)
+{
+ return css_st(task_css(tsk, schedtune_cgrp_id));
+}
+
+static inline struct schedtune *parent_st(struct schedtune *st)
+{
+ return css_st(st->css.parent);
+}
+
+/*
+ * SchedTune root control group
+ * The root control group is used to define a system-wide boosting tuning,
+ * which is applied to all tasks in the system.
+ * Task specific boost tuning could be specified by creating and
+ * configuring a child control group under the root one.
+ * By default, system-wide boosting is disabled, i.e. no boosting is applied
+ * to tasks which are not into a child control group.
+ */
+static struct schedtune
+root_schedtune = {
+ .boost = 0,
+};
+
+/*
+ * Maximum number of boost groups to support
+ * When per-task boosting is used we still allow only limited number of
+ * boost groups for two main reasons:
+ * 1. on a real system we usually have only few classes of workloads which
+ * make sense to boost with different values (e.g. background vs foreground
+ * tasks, interactive vs low-priority tasks)
+ * 2. a limited number allows for a simpler and more memory/time efficient
+ * implementation especially for the computation of the per-CPU boost
+ * value
+ */
+#define BOOSTGROUPS_COUNT 16
+
+/* Array of configured boostgroups */
+static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
+ &root_schedtune,
+ NULL,
+};
+
+static u64
+boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct schedtune *st = css_st(css);
+
+ return st->boost;
+}
+
+static int
+boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
+ u64 boost)
+{
+ struct schedtune *st = css_st(css);
+
+ if (boost > 100) /* boost is a u64, so it cannot be negative */
+ return -EINVAL;
+
+ st->boost = boost;
+ if (css == &root_schedtune.css)
+ sysctl_sched_cfs_boost = boost;
+
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "boost",
+ .read_u64 = boost_read,
+ .write_u64 = boost_write,
+ },
+ { } /* terminate */
+};
+
+static int
+schedtune_boostgroup_init(struct schedtune *st)
+{
+ /* Keep track of allocated boost groups */
+ allocated_group[st->idx] = st;
+
+ return 0;
+}
+
+static int
+schedtune_init(void)
+{
+ struct boost_groups *bg;
+ int cpu;
+
+ /* Initialize the per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ memset(bg, 0, sizeof(struct boost_groups));
+ }
+
+ pr_info(" schedtune configured to support %d boost groups\n",
+ BOOSTGROUPS_COUNT);
+ return 0;
+}
+
+static struct cgroup_subsys_state *
+schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct schedtune *st;
+ int idx;
+
+ if (!parent_css) {
+ schedtune_init();
+ return &root_schedtune.css;
+ }
+
+ /* Allow only single level hierarchies */
+ if (parent_css != &root_schedtune.css) {
+ pr_err("Nested SchedTune boosting groups not allowed\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Allow only a limited number of boosting groups */
+ for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx)
+ if (!allocated_group[idx])
+ break;
+ if (idx == BOOSTGROUPS_COUNT) {
+ pr_err("Trying to create more than %d SchedTune boosting groups\n",
+ BOOSTGROUPS_COUNT);
+ return ERR_PTR(-ENOSPC);
+ }
+
+ st = kzalloc(sizeof(*st), GFP_KERNEL);
+ if (!st)
+ goto out;
+
+ /* Initialize per CPUs boost group support */
+ st->idx = idx;
+ if (schedtune_boostgroup_init(st))
+ goto release;
+
+ return &st->css;
+
+release:
+ kfree(st);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+static void
+schedtune_boostgroup_release(struct schedtune *st)
+{
+ /* Keep track of allocated boost groups */
+ allocated_group[st->idx] = NULL;
+}
+
+static void
+schedtune_css_free(struct cgroup_subsys_state *css)
+{
+ struct schedtune *st = css_st(css);
+
+ schedtune_boostgroup_release(st);
+ kfree(st);
+}
+
+struct cgroup_subsys schedtune_cgrp_subsys = {
+ .css_alloc = schedtune_css_alloc,
+ .css_free = schedtune_css_free,
+ .legacy_cftypes = files,
+ .early_init = 1,
+};
+
+#endif /* CONFIG_CGROUP_SCHEDTUNE */
+
int
sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2b4673e..d42162c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -438,7 +438,11 @@ static struct ctl_table kern_table[] = {
.procname = "sched_cfs_boost",
.data = &sysctl_sched_cfs_boost,
.maxlen = sizeof(sysctl_sched_cfs_boost),
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+ .mode = 0444,
+#else
.mode = 0644,
+#endif
.proc_handler = &sysctl_sched_cfs_boost_handler,
.extra1 = &zero,
.extra2 = &one_hundred,
--
2.5.0

2015-08-19 18:48:42

by Patrick Bellasi

Subject: [RFC PATCH 13/14] sched/tune: compute and keep track of per CPU boost value

When per task boosting is enabled, we could have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU but each one with a
different boost value.
For example, we could have a scenario like this:

Task    SchedTune CGroup    Boost Value
T1      root                 0
T2      low-priority        10
T3      interactive         90

In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the required boost values for all the tasks
currently scheduled on this CPU.

A suitable aggregation function is the one which tracks the MAX boost
value for all the tasks RUNNABLE on a CPU. This approach always
satisfies the most boost demanding task while at the same time:
a) boosting all the concurrently scheduled tasks, thus reducing
potential co-scheduling side-effects on demanding tasks
b) reducing the number of frequency switches requested from SchedDVFS,
thus being more friendly to architectures with slow frequency
switching times

Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on that CPU.
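
As an illustration only, the MAX aggregation described above can be
sketched in plain user-space C. The struct name, the group count of 4
and the two example helpers are assumptions made for this sketch, not
kernel code; the group indexes follow the example table (0 = root,
1 = low-priority, 2 = interactive).

```c
#include <assert.h>

#define BOOSTGROUPS_COUNT 4	/* assumed size, for illustration only */

/* Hypothetical per-CPU view of one boost group */
struct boost_group_sketch {
	unsigned boost;		/* boost value of the group */
	unsigned tasks;		/* RUNNABLE tasks of the group on this CPU */
};

/* MAX aggregation: the root group (index 0) always counts, the other
 * groups count only while they have RUNNABLE tasks on the CPU. */
unsigned cpu_boost_max(const struct boost_group_sketch *g)
{
	unsigned boost_max;
	int idx;

	assert(g);
	boost_max = g[0].boost;
	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
		if (g[idx].tasks == 0)
			continue;
		if (g[idx].boost > boost_max)
			boost_max = g[idx].boost;
	}
	return boost_max;
}

/* T1 (root, 0), T2 (low-priority, 10) and T3 (interactive, 90) RUNNABLE */
unsigned example_all_runnable(void)
{
	struct boost_group_sketch g[BOOSTGROUPS_COUNT] = {
		{ 0, 1 }, { 10, 1 }, { 90, 1 }, { 0, 0 },
	};

	return cpu_boost_max(g);
}

/* Same groups, but T3 is no longer RUNNABLE on this CPU */
unsigned example_t3_sleeping(void)
{
	struct boost_group_sketch g[BOOSTGROUPS_COUNT] = {
		{ 0, 1 }, { 10, 1 }, { 90, 0 }, { 0, 0 },
	};

	return cpu_boost_max(g);
}
```

With all three tasks RUNNABLE the CPU is boosted by 90; once T3 leaves
the CPU the aggregated value drops to 10.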

This patch introduces the required support to keep track of the boost
groups currently affecting CPUs. Since the number of boost groups is
limited, a small, memory-efficient per-cpu array of boost group values
(cpu_boost_groups) is used. schedtune_boostgroup_update() refreshes the
entry of every CPU, but only when the boost value of a schedtune CGroup
is written. This is expected to be a rare operation, perhaps done just
once at system boot time.
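
A minimal user-space sketch of the update path for a single CPU, under
stated assumptions: the sketch_* names and the group count of 4 are
hypothetical, and the per-cpu iteration done by the real
schedtune_boostgroup_update() is left out. It shows the intent of
recomputing the maximum only when a write could have changed it.

```c
#include <assert.h>

#define BOOSTGROUPS_COUNT 4	/* assumed size, for illustration only */

/* Hypothetical stand-in for one CPU's entry of cpu_boost_groups */
struct cpu_boost_sketch {
	unsigned boost_max;
	struct {
		unsigned boost;
		unsigned tasks;
	} group[BOOSTGROUPS_COUNT];
};

/* Full recomputation of the maximum over root and active groups */
void sketch_recompute_max(struct cpu_boost_sketch *bg)
{
	unsigned boost_max = bg->group[0].boost;
	int idx;

	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
		if (bg->group[idx].tasks && bg->group[idx].boost > boost_max)
			boost_max = bg->group[idx].boost;
	}
	bg->boost_max = boost_max;
}

/* Write a new boost value for group idx on this CPU */
void sketch_boost_write(struct cpu_boost_sketch *bg, int idx, unsigned boost)
{
	unsigned cur_max, old;

	assert(idx >= 0 && idx < BOOSTGROUPS_COUNT);
	cur_max = bg->boost_max;
	old = bg->group[idx].boost;
	bg->group[idx].boost = boost;

	/* An active group raised above the current max: no scan needed */
	if (boost > cur_max && bg->group[idx].tasks) {
		bg->boost_max = boost;
		return;
	}

	/* The group that defined the max was lowered: rescan */
	if (cur_max == old && old > boost)
		sketch_recompute_max(bg);
}
```

In this sketch, writing a boost for a group with no RUNNABLE tasks on
the CPU leaves boost_max untouched, while lowering the group that
currently defines the maximum triggers a single rescan.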

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/tune.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)

diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index a26295c..3223ef3 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -1,5 +1,6 @@
#include <linux/cgroup.h>
#include <linux/err.h>
+#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/slab.h>

@@ -74,6 +75,89 @@ static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
NULL,
};

+/* SchedTune boost groups
+ * Keep track of all the boost groups which impact a CPU, for example when a
+ * CPU has two RUNNABLE tasks belonging to two different boost groups and thus
+ * likely with different boost values.
+ * Since on each system we expect only a limited number of boost groups, here
+ * we use a simple array to keep track of the metrics required to compute the
+ * maximum per-CPU boosting value.
+ */
+struct boost_groups {
+ /* Maximum boost value for all RUNNABLE tasks on a CPU */
+ unsigned boost_max;
+ struct {
+ /* The boost for tasks on that boost group */
+ unsigned boost;
+ /* Count of RUNNABLE tasks on that boost group */
+ unsigned tasks;
+ } group[BOOSTGROUPS_COUNT];
+};
+
+/* Boost groups affecting each CPU in the system */
+DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
+
+static void
+schedtune_cpu_update(int cpu)
+{
+ struct boost_groups *bg;
+ unsigned boost_max;
+ int idx;
+
+ bg = &per_cpu(cpu_boost_groups, cpu);
+
+ /* The root boost group is always active */
+ boost_max = bg->group[0].boost;
+ for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
+ /*
+ * A boost group affects a CPU only if it has
+ * RUNNABLE tasks on that CPU
+ */
+ if (bg->group[idx].tasks == 0)
+ continue;
+ boost_max = max(boost_max, bg->group[idx].boost);
+ }
+
+ bg->boost_max = boost_max;
+}
+
+static int
+schedtune_boostgroup_update(int idx, int boost)
+{
+ struct boost_groups *bg;
+ int cur_boost_max;
+ int old_boost;
+ int cpu;
+
+ /* Update per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+
+ /*
+ * Keep track of current boost values to compute the per CPU
+ * maximum only when it has been affected by the new value of
+ * the updated boost group
+ */
+ cur_boost_max = bg->boost_max;
+ old_boost = bg->group[idx].boost;
+
+ /* Update the boost value of this boost group */
+ bg->group[idx].boost = boost;
+
+ /* Check if this update increases the current max */
+ if (boost > cur_boost_max && bg->group[idx].tasks) {
+ bg->boost_max = boost;
+ continue;
+ }
+
+ /* Check if this update has decreased current max */
+ if (cur_boost_max == old_boost && old_boost > boost)
+ schedtune_cpu_update(cpu);
+ }
+
+ return 0;
+}
+
static u64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
@@ -95,6 +179,9 @@ boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
if (css == &root_schedtune.css)
sysctl_sched_cfs_boost = boost;

+ /* Update CPU boost */
+ schedtune_boostgroup_update(st->idx, st->boost);
+
return 0;
}

@@ -110,9 +197,19 @@ static struct cftype files[] = {
static int
schedtune_boostgroup_init(struct schedtune *st)
{
+ struct boost_groups *bg;
+ int cpu;
+
/* Keep track of allocated boost groups */
allocated_group[st->idx] = st;

+ /* Initialize the per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ bg->group[st->idx].boost = 0;
+ bg->group[st->idx].tasks = 0;
+ }
+
return 0;
}

@@ -180,6 +277,9 @@ out:
static void
schedtune_boostgroup_release(struct schedtune *st)
{
+ /* Reset this boost group */
+ schedtune_boostgroup_update(st->idx, 0);
+
/* Keep track of allocated boost groups */
allocated_group[st->idx] = NULL;
}
--
2.5.0

2015-08-19 18:48:17

by Patrick Bellasi

Subject: [RFC PATCH 14/14] sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value

When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.

This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued/dequeued to/from
a CPU, its boost group is used to increment/decrement a per-cpu counter
of the RUNNABLE tasks of that group on that CPU.
Only when the number of runnable tasks for a specific boost group
becomes 1 or 0 does the corresponding boost group change its effect on
that CPU, specifically:
a) boost_group::tasks == 1: this boost group starts to impact the CPU
b) boost_group::tasks == 0: this boost group stops impacting the CPU
In each of these two conditions the aggregation function:
schedtune_cpu_update(cpu)
could be required to run in order to identify the new maximum boost
value required for the CPU.

The proposed patch minimizes the number of times the aggregation
function is executed while still providing the required support to
always boost a CPU to the maximum boost value required by all its
currently RUNNABLE tasks.
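
The counting scheme can be sketched in user-space C as follows. This is
an illustration under stated assumptions, not the kernel code: the
sketch_* names, the group count of 4 and the recomputes counter (added
only to make the number of aggregation runs visible) are hypothetical.

```c
#include <assert.h>

#define BOOSTGROUPS_COUNT 4	/* assumed size, for illustration only */

/* Hypothetical stand-in for one CPU's entry of cpu_boost_groups */
struct cpu_boost_sketch {
	unsigned boost_max;
	unsigned recomputes;	/* times the aggregation ran (sketch only) */
	struct {
		unsigned boost;
		unsigned tasks;
	} group[BOOSTGROUPS_COUNT];
};

/* Aggregation: max over the root group and the active groups */
void sketch_cpu_update(struct cpu_boost_sketch *bg)
{
	unsigned boost_max = bg->group[0].boost;
	int idx;

	for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
		if (bg->group[idx].tasks && bg->group[idx].boost > boost_max)
			boost_max = bg->group[idx].boost;
	}
	bg->boost_max = boost_max;
	bg->recomputes++;
}

/* delta is +1 on enqueue and -1 on dequeue */
void sketch_tasks_update(struct cpu_boost_sketch *bg, int idx, int delta)
{
	unsigned tasks;

	assert(idx >= 0 && idx < BOOSTGROUPS_COUNT);
	tasks = bg->group[idx].tasks;

	/* Update the count while never letting it go below zero */
	if (delta < 0) {
		unsigned dec = (unsigned)(-delta);

		tasks = tasks > dec ? tasks - dec : 0;
	} else {
		tasks += (unsigned)delta;
	}
	bg->group[idx].tasks = tasks;

	/* Aggregate only when the count lands on 1 or 0 */
	if (tasks == 1 || tasks == 0)
		sketch_cpu_update(bg);
}
```

Note that with this condition the aggregation also runs when a count
drops from 2 to 1, although the set of active groups is unchanged in
that case; enqueues that take the count from 1 to 2 and beyond skip it.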

cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/fair.c | 17 +++++++---
kernel/sched/tune.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/tune.h | 23 +++++++++++++
3 files changed, 130 insertions(+), 4 deletions(-)
create mode 100644 kernel/sched/tune.h

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 633fcab4..98470c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -34,6 +34,7 @@
#include <trace/events/sched.h>

#include "sched.h"
+#include "tune.h"

/*
* Targeted preemption latency for CPU-bound tasks:
@@ -4145,6 +4146,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);

+ schedtune_enqueue_task(p, cpu_of(rq));
+
/*
* We want to potentially trigger a freq switch request only for
* tasks that are waking up; this is because we get here also during
@@ -4213,6 +4216,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

if (!se) {
sub_nr_running(rq, 1);
+ schedtune_dequeue_task(p, cpu_of(rq));

/*
* We want to potentially trigger a freq switch request only for
@@ -4769,10 +4773,15 @@ schedtune_margin(unsigned long signal, unsigned long boost)
}

static inline unsigned int
-schedtune_cpu_margin(unsigned long util)
+schedtune_cpu_margin(unsigned long util, int cpu)
{
- unsigned int boost = get_sysctl_sched_cfs_boost();
+ unsigned int boost;

+#ifdef CONFIG_CGROUP_SCHEDTUNE
+ boost = schedtune_cpu_boost(cpu);
+#else
+ boost = get_sysctl_sched_cfs_boost();
+#endif
if (boost == 0)
return 0;

@@ -4782,7 +4791,7 @@ schedtune_cpu_margin(unsigned long util)
#else /* CONFIG_SCHED_TUNE */

static inline unsigned int
-schedtune_cpu_margin(unsigned long util)
+schedtune_cpu_margin(unsigned long util, int cpu)
{
return 0;
}
@@ -4793,7 +4802,7 @@ static inline unsigned long
boosted_cpu_util(int cpu)
{
unsigned long util = cpu_util(cpu);
- unsigned long margin = schedtune_cpu_margin(util);
+ unsigned long margin = schedtune_cpu_margin(util, cpu);

return util + margin;
}
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 3223ef3..3838106 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -2,6 +2,7 @@
#include <linux/err.h>
#include <linux/percpu.h>
#include <linux/printk.h>
+#include <linux/rcupdate.h>
#include <linux/slab.h>

#include "sched.h"
@@ -158,6 +159,87 @@ schedtune_boostgroup_update(int idx, int boost)
return 0;
}

+static inline void
+schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count)
+{
+ struct boost_groups *bg;
+ int tasks;
+
+ bg = &per_cpu(cpu_boost_groups, cpu);
+
+ /* Update boosted tasks count while preventing it from going negative */
+ if (task_count < 0 && bg->group[idx].tasks <= -task_count)
+ bg->group[idx].tasks = 0;
+ else
+ bg->group[idx].tasks += task_count;
+
+ /* Boost group activation or deactivation on that RQ */
+ tasks = bg->group[idx].tasks;
+ if (tasks == 1 || tasks == 0)
+ schedtune_cpu_update(cpu);
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_enqueue_task(struct task_struct *p, int cpu)
+{
+ struct schedtune *st;
+ int idx;
+
+ /*
+ * When a task is marked PF_EXITING by do_exit() it's going to be
+ * dequeued and enqueued multiple times in the exit path.
+ * Thus we avoid any further update, since we do not want to change
+ * CPU boosting while the task is exiting.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /* Get task boost group */
+ rcu_read_lock();
+ st = task_schedtune(p);
+ idx = st->idx;
+ rcu_read_unlock();
+
+ schedtune_tasks_update(p, cpu, idx, 1);
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_dequeue_task(struct task_struct *p, int cpu)
+{
+ struct schedtune *st;
+ int idx;
+
+ /*
+ * When a task is marked PF_EXITING by do_exit() it's going to be
+ * dequeued and enqueued multiple times in the exit path.
+ * Thus we avoid any further update, since we do not want to change
+ * CPU boosting while the task is exiting.
+ * The last dequeue will be done by cgroup exit() callback.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /* Get task boost group */
+ rcu_read_lock();
+ st = task_schedtune(p);
+ idx = st->idx;
+ rcu_read_unlock();
+
+ schedtune_tasks_update(p, cpu, idx, -1);
+}
+
+int schedtune_cpu_boost(int cpu)
+{
+ struct boost_groups *bg;
+
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ return bg->boost_max;
+}
+
static u64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
@@ -293,9 +375,21 @@ schedtune_css_free(struct cgroup_subsys_state *css)
kfree(st);
}

+static void
+schedtune_exit(struct cgroup_subsys_state *css,
+ struct cgroup_subsys_state *old_css,
+ struct task_struct *tsk)
+{
+ struct schedtune *old_st = css_st(old_css);
+ int cpu = task_cpu(tsk);
+
+ schedtune_tasks_update(tsk, cpu, old_st->idx, -1);
+}
+
struct cgroup_subsys schedtune_cgrp_subsys = {
.css_alloc = schedtune_css_alloc,
.css_free = schedtune_css_free,
+ .exit = schedtune_exit,
.legacy_cftypes = files,
.early_init = 1,
};
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h
new file mode 100644
index 0000000..4519028
--- /dev/null
+++ b/kernel/sched/tune.h
@@ -0,0 +1,23 @@
+
+#ifdef CONFIG_SCHED_TUNE
+
+#ifdef CONFIG_CGROUP_SCHEDTUNE
+
+extern int schedtune_cpu_boost(int cpu);
+
+extern void schedtune_enqueue_task(struct task_struct *p, int cpu);
+extern void schedtune_dequeue_task(struct task_struct *p, int cpu);
+
+#else /* CONFIG_CGROUP_SCHEDTUNE */
+
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+
+#endif /* CONFIG_CGROUP_SCHEDTUNE */
+
+#else /* CONFIG_SCHED_TUNE */
+
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+
+#endif /* CONFIG_SCHED_TUNE */
--
2.5.0