2018-05-25 13:16:14

by Vincent Guittot

Subject: [PATCH v5 00/10] track CPU utilization

This patchset initially tracked only the utilization of the RT rq. During
the OSPM summit, the opportunity to extend it in order to get an estimate of
the utilization of the whole CPU was discussed.

- Patches 1-3 correspond to the content of patchset v4 and add utilization
tracking for rt_rq.

When both cfs and rt tasks compete to run on a CPU, we can see some frequency
drops with the schedutil governor. In such a case, the cfs_rq's utilization no
longer reflects the utilization of cfs tasks but only the remaining part that
is not used by rt tasks. We should monitor the stolen utilization and take it
into account when selecting the OPP. This patchset doesn't change the OPP
selection policy for RT tasks, only for CFS tasks.
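
To illustrate the effect, here is a simplified sketch (not the exact kernel
code, only an approximation of the scaling done by get_next_freq(); the
max_freq value used below is made up):

static unsigned long next_freq_sketch(unsigned long util, unsigned long max,
				      unsigned long max_freq)
{
	/* schedutil requests roughly 1.25 * max_freq * util / max */
	return (max_freq + (max_freq >> 2)) * util / max;
}

With max = 1024 and max_freq = 1200000 (kHz), an always-running cfs task
alone requests ~1500000 and is clamped to the highest OPP, but if rt tasks
steal half of the CPU, util_cfs converges towards 512 and the request drops
to ~750000, although the CPU never goes idle.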

An rt-app use case which creates an always-running cfs thread and an rt
thread that wakes up periodically, with both threads pinned on the same CPU,
shows a lot of frequency switches of the CPU whereas the CPU never goes idle
during the test. I can share the json file that I used for the test if
someone is interested.

For a 15-second long test on a hikey 6220 (octa-core Cortex-A53 platform),
the cpufreq statistics output (stats are reset just before the test):
$ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
without patchset : 1230
with patchset : 14

If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
performance improvements:

- Without patchset :
Test execution summary:
total time: 15.0009s
total number of events: 4903
total time taken by event execution: 14.9972
per-request statistics:
min: 1.23ms
avg: 3.06ms
max: 13.16ms
approx. 95 percentile: 12.73ms

Threads fairness:
events (avg/stddev): 4903.0000/0.00
execution time (avg/stddev): 14.9972/0.00

- With patchset:
Test execution summary:
total time: 15.0014s
total number of events: 7694
total time taken by event execution: 14.9979
per-request statistics:
min: 1.23ms
avg: 1.95ms
max: 10.49ms
approx. 95 percentile: 10.39ms

Threads fairness:
events (avg/stddev): 7694.0000/0.00
execution time (avg/stddev): 14.9979/0.00

The performance improvement is 56% for this use case.

- Patches 4-5 add utilization tracking for dl_rq in order to solve a similar
problem to the one with rt_rq.

- Patch 6 uses the dl and rt utilization in scale_rt_capacity() and removes
the dl and rt contributions from sched_rt_avg_update().

- Patches 7-8 add utilization tracking for interrupts and use it to select
the OPP.
A test with iperf on hikey 6220 gives:
w/o patchset w/ patchset
Tx 276 Mbits/sec 304 Mbits/sec +10%
Rx 299 Mbits/sec 328 Mbits/sec +09%

8 iterations of iperf -c server_address -r -t 5
stdev is lower than 1%
Only the WFI idle state is enabled (shallowest arm idle state)

- Patch 9 removes the unused sched_avg_update() code

- Patch 10 removes the unused sched_time_avg_ms

Changes since v3:
- add support for periodic update of the blocked utilization
- rebase on latest tip/sched/core

Changes since v2:
- move pelt code into a dedicated pelt.c file
- rebase on load tracking changes

Changes since v1:
- Only a rebase. I have addressed the comments on the previous version in
patch 1/2.

Vincent Guittot (10):
sched/pelt: Move pelt related code in a dedicated file
sched/rt: add rt_rq utilization tracking
cpufreq/schedutil: add rt utilization tracking
sched/dl: add dl_rq utilization tracking
cpufreq/schedutil: get max utilization
sched: remove rt and dl from sched_avg
sched/irq: add irq utilization tracking
cpufreq/schedutil: take into account interrupt
sched: remove rt_avg code
proc/sched: remove unused sched_time_avg_ms

include/linux/sched/sysctl.h | 1 -
kernel/sched/Makefile | 2 +-
kernel/sched/core.c | 38 +---
kernel/sched/cpufreq_schedutil.c | 24 ++-
kernel/sched/deadline.c | 7 +-
kernel/sched/fair.c | 381 +++----------------------------------
kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 63 +++++++
kernel/sched/rt.c | 10 +-
kernel/sched/sched.h | 57 ++++--
kernel/sysctl.c | 8 -
11 files changed, 563 insertions(+), 423 deletions(-)
create mode 100644 kernel/sched/pelt.c
create mode 100644 kernel/sched/pelt.h

--
2.7.4



2018-05-25 13:13:21

by Vincent Guittot

Subject: [PATCH v5 06/10] sched: remove rt and dl from sched_avg

The utilization of the CPU by rt and dl tasks is now tracked with PELT, so
we can use these metrics instead and remove their contribution from rt_avg,
which will then track only the interrupt and stolen virtual time.
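
As an illustration with made-up numbers: with rq->avg_rt.util_avg = 200 and
rq->avg_dl.util_avg = 100 out of SCHED_CAPACITY_SCALE = 1024,
scale_rt_capacity() below now removes those 300 units directly from the
capacity seen by CFS load balancing, in addition to the part still derived
from rt_avg (interrupt and stolen time), instead of relying on the slowly
decayed rt_avg for everything.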

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/deadline.c | 2 --
kernel/sched/fair.c | 2 ++
kernel/sched/pelt.c | 2 +-
kernel/sched/rt.c | 2 --
4 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 950b3fb..da839e7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1180,8 +1180,6 @@ static void update_curr_dl(struct rq *rq)
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
-
if (dl_entity_is_special(dl_se))
return;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 967e873..da75eda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7562,6 +7562,8 @@ static unsigned long scale_rt_capacity(int cpu)

used = div_u64(avg, total);

+ used += READ_ONCE(rq->avg_rt.util_avg);
+ used += READ_ONCE(rq->avg_dl.util_avg);
if (likely(used < SCHED_CAPACITY_SCALE))
return SCHED_CAPACITY_SCALE - used;

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index b07db80..3d5bd3a 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -237,7 +237,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
*/
sa->load_avg = div_u64(load * sa->load_sum, divider);
sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
- sa->util_avg = sa->util_sum / divider;
+ WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b4148a9..3393c63 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -970,8 +970,6 @@ static void update_curr_rt(struct rq *rq)
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
-
if (!rt_bandwidth_enabled())
return;

--
2.7.4


2018-05-25 13:13:44

by Vincent Guittot

Subject: [PATCH v5 10/10] proc/sched: remove unused sched_time_avg_ms

The /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere anymore.
Remove it.

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 8 --------
kernel/sched/sched.h | 1 -
kernel/sysctl.c | 8 --------
4 files changed, 18 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a151..913488d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -40,7 +40,6 @@ extern unsigned int sysctl_numa_balancing_scan_size;
#ifdef CONFIG_SCHED_DEBUG
extern __read_mostly unsigned int sysctl_sched_migration_cost;
extern __read_mostly unsigned int sysctl_sched_nr_migrate;
-extern __read_mostly unsigned int sysctl_sched_time_avg;

int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 213d277..9894bc7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -46,14 +46,6 @@ const_debug unsigned int sysctl_sched_features =
const_debug unsigned int sysctl_sched_nr_migrate = 32;

/*
- * period over which we average the RT time consumption, measured
- * in ms.
- *
- * default: 1s
- */
-const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
-
-/*
* period over which we measure -rt task CPU usage in us.
* default: 1s
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1929db7..5d55782 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1704,7 +1704,6 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);

-extern const_debug unsigned int sysctl_sched_time_avg;
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6a78cf7..d77a959 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,14 +368,6 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
- {
- .procname = "sched_time_avg_ms",
- .data = &sysctl_sched_time_avg,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &one,
- },
#ifdef CONFIG_SCHEDSTATS
{
.procname = "sched_schedstats",
--
2.7.4


2018-05-25 13:14:10

by Vincent Guittot

Subject: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

Add both the cfs and rt utilization when selecting an OPP for cfs tasks, as
rt tasks can preempt cfs tasks and steal their running time.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 28592b6..a84b5a5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
/* The fields below are only needed when sharing a policy: */
unsigned long util_cfs;
unsigned long util_dl;
+ unsigned long util_rt;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -178,14 +179,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->util_cfs = cpu_util_cfs(rq);
sg_cpu->util_dl = cpu_util_dl(rq);
+ sg_cpu->util_rt = cpu_util_rt(rq);
}

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util;

- if (rq->rt.rt_nr_running)
- return sg_cpu->max;
+ if (rq->rt.rt_nr_running) {
+ util = sg_cpu->max;
+ } else {
+ util = sg_cpu->util_dl;
+ util += sg_cpu->util_cfs;
+ util += sg_cpu->util_rt;
+ }

/*
* Utilization required by DEADLINE must always be granted while, for
@@ -197,7 +205,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+ return min(sg_cpu->max, util);
}

static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
--
2.7.4


2018-05-25 13:14:34

by Vincent Guittot

Subject: [PATCH v5 08/10] cpufreq/schedutil: take into account interrupt

The time spent under interrupt can be significant but it is not reflected
in the utilization of the CPU when deciding to choose an OPP. Now that we
have access to this metric, schedutil can take it into account when
selecting the OPP for a CPU.
The CPU utilization is:
  irq util_avg + (1 - irq util_avg / max capacity) * \Sum rq util_avg
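
As an illustrative example (numbers not taken from the iperf test below):
with a max capacity of 1024, irq util_avg = 128 and a summed rq utilization
(cfs + dl + rt) of 512, the aggregated utilization is
128 + (1 - 128/1024) * 512 = 128 + 448 = 576, instead of the 640 a naive sum
would give, because the rq signals are tracked with rq_clock_task and
therefore only cover the non-interrupt part of the time.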

A test with iperf on hikey (octa-core arm64) gives:
iperf -c server_address -r -t 5

w/o patch w/ patch
Tx 276 Mbits/sec 304 Mbits/sec +10%
Rx 299 Mbits/sec 328 Mbits/sec +09%

8 iterations
stdev is lower than 1%
Only the WFI idle state is enabled (shallowest idle state)

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 10 ++++++++++
kernel/sched/sched.h | 5 +++++
2 files changed, 15 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a84b5a5..06f2080 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -57,6 +57,7 @@ struct sugov_cpu {
unsigned long util_cfs;
unsigned long util_dl;
unsigned long util_rt;
+ unsigned long util_irq;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -180,6 +181,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->util_cfs = cpu_util_cfs(rq);
sg_cpu->util_dl = cpu_util_dl(rq);
sg_cpu->util_rt = cpu_util_rt(rq);
+ sg_cpu->util_irq = cpu_util_irq(rq);
}

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
@@ -190,9 +192,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
if (rq->rt.rt_nr_running) {
util = sg_cpu->max;
} else {
+ /* Sum rq utilization */
util = sg_cpu->util_dl;
util += sg_cpu->util_cfs;
util += sg_cpu->util_rt;
+
+ /* Weight rq's utilization to the normal context */
+ util *= (sg_cpu->max - sg_cpu->util_irq);
+ util /= sg_cpu->max;
+
+ /* Add interrupt utilization */
+ util += sg_cpu->util_irq;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7e8d5b..718c55d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2218,4 +2218,9 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
{
return rq->avg_rt.util_avg;
}
+
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+ return rq->avg_irq.util_avg;
+}
#endif
--
2.7.4


2018-05-25 13:15:00

by Vincent Guittot

Subject: [PATCH v5 07/10] sched/irq: add irq utilization tracking

Interrupt and steal time are the only remaining activities tracked by
rt_avg. As for the sched classes, we can use PELT to track their average
utilization of the CPU. But unlike the sched classes, we don't track when
entering/leaving interrupt; instead, we take into account the time spent
under interrupt context when we update the rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for the interrupt time during the update.

It is also important to note that, because
  rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupts gives a more accurate level of utilization of the CPU.
The CPU utilization is:
  avg_irq + (1 - avg_irq / max capacity) * \Sum avg_rq

Most of the time, avg_irq is small and negligible, so the use of the
approximation CPU utilization = \Sum avg_rq has been good enough so far.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 4 +++-
kernel/sched/fair.c | 26 +++++++-------------------
kernel/sched/pelt.c | 38 ++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 7 +++++++
kernel/sched/sched.h | 1 +
5 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d155518..ab58288 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -16,6 +16,8 @@
#include "../workqueue_internal.h"
#include "../smpboot.h"

+#include "pelt.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>

@@ -184,7 +186,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
- sched_rt_avg_update(rq, irq_delta + steal);
+ update_irq_load_avg(rq, irq_delta + steal);
#endif
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da75eda..1bb3379 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5323,8 +5323,6 @@ static void cpu_load_update(struct rq *this_rq, unsigned long this_load,

this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
}
-
- sched_avg_update(this_rq);
}

/* Used instead of source_load when we know the type == 0 */
@@ -7298,6 +7296,9 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
if (rq->avg_dl.util_avg)
return true;

+ if (rq->avg_irq.util_avg)
+ return true;
+
return false;
}

@@ -7362,6 +7363,7 @@ static void update_blocked_averages(int cpu)
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_irq_load_avg(rq, 0);
/* Don't need periodic decay once load/util_avg are null */
if (others_rqs_have_blocked(rq))
done = false;
@@ -7432,6 +7434,7 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_irq_load_avg(rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
@@ -7544,24 +7547,9 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, used, age_stamp, avg;
- s64 delta;
-
- /*
- * Since we're reading these variables without serialization make sure
- * we read them once before doing sanity checks on them.
- */
- age_stamp = READ_ONCE(rq->age_stamp);
- avg = READ_ONCE(rq->rt_avg);
- delta = __rq_clock_broken(rq) - age_stamp;
-
- if (unlikely(delta < 0))
- delta = 0;
-
- total = sched_avg_period() + delta;
-
- used = div_u64(avg, total);
+ unsigned long used;

+ used = READ_ONCE(rq->avg_irq.util_avg);
used += READ_ONCE(rq->avg_rt.util_avg);
used += READ_ONCE(rq->avg_dl.util_avg);
if (likely(used < SCHED_CAPACITY_SCALE))
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 3d5bd3a..d2e4f21 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -355,3 +355,41 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)

return 0;
}
+
+/*
+ * irq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_irq_load_avg(struct rq *rq, u64 running)
+{
+ int ret = 0;
+ /*
+ * We know the time that has been used by interrupt since last update
+ * but we don't know when. Let's be pessimistic and assume that the
+ * interrupt has happened just before the update. This is not so far from
+ * reality because the interrupt will most probably wake up a task and
+ * trigger an update of the rq clock during which the metric is updated.
+ * We start to decay with normal context time and then we add the
+ * interrupt context time.
+ * We can safely remove running from rq->clock because
+ * rq->clock += delta with delta >= running
+ */
+ ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+ 0,
+ 0,
+ 0);
+ ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+ 1,
+ 1,
+ 1);
+
+ if (ret)
+ ___update_load_avg(&rq->avg_irq, 1, 1);
+
+ return ret;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0e4f912..0ce9a5a 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -5,6 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_irq_load_avg(struct rq *rq, u64 running);

/*
* When a task is dequeued, its estimated utilization should not be update if
@@ -51,6 +52,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
{
return 0;
}
+
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0eb07a8..f7e8d5b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -850,6 +850,7 @@ struct rq {
u64 age_stamp;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
+ struct sched_avg avg_irq;
u64 idle_stamp;
u64 avg_idle;

--
2.7.4


2018-05-25 13:15:06

by Vincent Guittot

Subject: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

The schedutil governor relies on the cfs_rq's util_avg to choose the OPP when
cfs tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs
tasks are preempted by rt tasks and in this case util_avg reflects the
remaining capacity but not what cfs tasks want to use. In such a case,
schedutil can select a lower OPP whereas the CPU is overloaded. In order to
have a more accurate view of the utilization of the CPU, we track the
utilization that is "stolen" by rt tasks.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task, but they are the
same at the root group level, so the PELT windows of their util_sum are
aligned.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 15 ++++++++++++++-
kernel/sched/pelt.c | 23 +++++++++++++++++++++++
kernel/sched/pelt.h | 7 +++++++
kernel/sched/rt.c | 8 ++++++++
kernel/sched/sched.h | 7 +++++++
5 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6390c66..fb18bcc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,6 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

+static inline bool rt_rq_has_blocked(struct rq *rq)
+{
+ if (rq->avg_rt.util_avg)
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7349,6 +7357,10 @@ static void update_blocked_averages(int cpu)
if (cfs_rq_has_blocked(cfs_rq))
done = false;
}
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ /* Don't need periodic decay once load/util_avg are null */
+ if (rt_rq_has_blocked(rq))
+ done = false;

#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
@@ -7414,9 +7426,10 @@ static inline void update_blocked_averages(int cpu)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index e6ecbb2..213b922 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -309,3 +309,26 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

return 0;
}
+
+/*
+ * rt_rq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+ running,
+ running,
+ running)) {
+
+ ___update_load_avg(&rq->avg_rt, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9cac73e..b2983b7 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -3,6 +3,7 @@
int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be update if
@@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
return 0;
}

+static inline int
+update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ return 0;
+}
+
#endif


diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef3c4e6..b4148a9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -5,6 +5,8 @@
*/
#include "sched.h"

+#include "pelt.h"
+
int sched_rr_timeslice = RR_TIMESLICE;
int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

@@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

rt_queue_push_tasks(rq);

+ update_rt_rq_load_avg(rq_clock_task(rq), rq,
+ rq->curr->sched_class == &rt_sched_class);
+
return p;
}

@@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);

+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
/*
* The previous task needs to be made eligible for pushing
* if it is still active
@@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
struct sched_rt_entity *rt_se = &p->rt;

update_curr_rt(rq);
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);

watchdog(rq, p);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 757a3ee..7a16de9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -592,6 +592,7 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+
#endif /* CONFIG_SMP */
int rt_queued;

@@ -847,6 +848,7 @@ struct rq {

u64 rt_avg;
u64 age_stamp;
+ struct sched_avg avg_rt;
u64 idle_stamp;
u64 avg_idle;

@@ -2205,4 +2207,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)

return util;
}
+
+static inline unsigned long cpu_util_rt(struct rq *rq)
+{
+ return rq->avg_rt.util_avg;
+}
#endif
--
2.7.4


2018-05-25 13:15:28

by Vincent Guittot

Subject: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization from cfs tasks and make them appear lighter than they are.
As we want to use the same load tracking mechanism for both and prevent
useless dependencies between cfs and rt code, the pelt code is moved into a
dedicated file.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/Makefile | 2 +-
kernel/sched/fair.c | 333 +-------------------------------------------------
kernel/sched/pelt.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 43 +++++++
kernel/sched/sched.h | 19 +++
5 files changed, 375 insertions(+), 333 deletions(-)
create mode 100644 kernel/sched/pelt.c
create mode 100644 kernel/sched/pelt.h

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b3..7fe1834 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o

-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..6390c66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -255,9 +255,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
return cfs_rq->rq;
}

-/* An entity is a task if it doesn't "own" a runqueue */
-#define entity_is_task(se) (!se->my_q)
-
static inline struct task_struct *task_of(struct sched_entity *se)
{
SCHED_WARN_ON(!entity_is_task(se));
@@ -419,7 +416,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
return container_of(cfs_rq, struct rq, cfs);
}

-#define entity_is_task(se) 1

#define for_each_sched_entity(se) \
for (; se; se = NULL)
@@ -692,7 +688,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
}

#ifdef CONFIG_SMP
-
+#include "pelt.h"
#include "sched-pelt.h"

static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
@@ -2749,19 +2745,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
} while (0)

#ifdef CONFIG_SMP
-/*
- * XXX we want to get rid of these helpers and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
- return scale_load_down(se->load.weight);
-}
-
-static inline long se_runnable(struct sched_entity *se)
-{
- return scale_load_down(se->runnable_weight);
-}
-
static inline void
enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -3062,314 +3045,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
}

#ifdef CONFIG_SMP
-/*
- * Approximate:
- * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
- */
-static u64 decay_load(u64 val, u64 n)
-{
- unsigned int local_n;
-
- if (unlikely(n > LOAD_AVG_PERIOD * 63))
- return 0;
-
- /* after bounds checking we can collapse to 32-bit */
- local_n = n;
-
- /*
- * As y^PERIOD = 1/2, we can combine
- * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
- * With a look-up table which covers y^n (n<PERIOD)
- *
- * To achieve constant time decay_load.
- */
- if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
- val >>= local_n / LOAD_AVG_PERIOD;
- local_n %= LOAD_AVG_PERIOD;
- }
-
- val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
- return val;
-}
-
-static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
-{
- u32 c1, c2, c3 = d3; /* y^0 == 1 */
-
- /*
- * c1 = d1 y^p
- */
- c1 = decay_load((u64)d1, periods);
-
- /*
- * p-1
- * c2 = 1024 \Sum y^n
- * n=1
- *
- * inf inf
- * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
- * n=0 n=p
- */
- c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
-
- return c1 + c2 + c3;
-}
-
-/*
- * Accumulate the three separate parts of the sum; d1 the remainder
- * of the last (incomplete) period, d2 the span of full periods and d3
- * the remainder of the (incomplete) current period.
- *
- * d1 d2 d3
- * ^ ^ ^
- * | | |
- * |<->|<----------------->|<--->|
- * ... |---x---|------| ... |------|-----x (now)
- *
- * p-1
- * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
- * n=1
- *
- * = u y^p + (Step 1)
- *
- * p-1
- * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2)
- * n=1
- */
-static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
-{
- unsigned long scale_freq, scale_cpu;
- u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
- u64 periods;
-
- scale_freq = arch_scale_freq_capacity(cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
- delta += sa->period_contrib;
- periods = delta / 1024; /* A period is 1024us (~1ms) */
-
- /*
- * Step 1: decay old *_sum if we crossed period boundaries.
- */
- if (periods) {
- sa->load_sum = decay_load(sa->load_sum, periods);
- sa->runnable_load_sum =
- decay_load(sa->runnable_load_sum, periods);
- sa->util_sum = decay_load((u64)(sa->util_sum), periods);
-
- /*
- * Step 2
- */
- delta %= 1024;
- contrib = __accumulate_pelt_segments(periods,
- 1024 - sa->period_contrib, delta);
- }
- sa->period_contrib = delta;
-
- contrib = cap_scale(contrib, scale_freq);
- if (load)
- sa->load_sum += load * contrib;
- if (runnable)
- sa->runnable_load_sum += runnable * contrib;
- if (running)
- sa->util_sum += contrib * scale_cpu;
-
- return periods;
-}
-
-/*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series. To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
- *
- * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
- * p0 p1 p2
- * (now) (~1ms ago) (~2ms ago)
- *
- * Let u_i denote the fraction of p_i that the entity was runnable.
- *
- * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
- * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
- *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- * y^32 = 0.5
- *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
- *
- * When a period "rolls over" and we have new u_0`, multiplying the previous
- * sum again by y is sufficient to update:
- * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
- */
-static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
-{
- u64 delta;
-
- delta = now - sa->last_update_time;
- /*
- * This should only happen when time goes backwards, which it
- * unfortunately does during sched clock init when we swap over to TSC.
- */
- if ((s64)delta < 0) {
- sa->last_update_time = now;
- return 0;
- }
-
- /*
- * Use 1024ns as the unit of measurement since it's a reasonable
- * approximation of 1us and fast to compute.
- */
- delta >>= 10;
- if (!delta)
- return 0;
-
- sa->last_update_time += delta << 10;
-
- /*
- * running is a subset of runnable (weight) so running can't be set if
- * runnable is clear. But there are some corner cases where the current
- * se has been already dequeued but cfs_rq->curr still points to it.
- * This means that weight will be 0 but not running for a sched_entity
- * but also for a cfs_rq if the latter becomes idle. As an example,
- * this happens during idle_balance() which calls
- * update_blocked_averages()
- */
- if (!load)
- runnable = running = 0;
-
- /*
- * Now we know we crossed measurement unit boundaries. The *_avg
- * accrues by two steps:
- *
- * Step 1: accumulate *_sum since last_update_time. If we haven't
- * crossed period boundaries, finish.
- */
- if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
- return 0;
-
- return 1;
-}
-
-static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
-{
- u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
-
- /*
- * Step 2: update *_avg.
- */
- sa->load_avg = div_u64(load * sa->load_sum, divider);
- sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
- sa->util_avg = sa->util_sum / divider;
-}
-
-/*
- * When a task is dequeued, its estimated utilization should not be update if
- * its util_avg has not been updated at least once.
- * This flag is used to synchronize util_avg updates with util_est updates.
- * We map this information into the LSB bit of the utilization saved at
- * dequeue time (i.e. util_est.dequeued).
- */
-#define UTIL_AVG_UNCHANGED 0x1
-
-static inline void cfs_se_util_change(struct sched_avg *avg)
-{
- unsigned int enqueued;
-
- if (!sched_feat(UTIL_EST))
- return;
-
- /* Avoid store if the flag has been already set */
- enqueued = avg->util_est.enqueued;
- if (!(enqueued & UTIL_AVG_UNCHANGED))
- return;
-
- /* Reset flag to report util_avg has been updated */
- enqueued &= ~UTIL_AVG_UNCHANGED;
- WRITE_ONCE(avg->util_est.enqueued, enqueued);
-}
-
-/*
- * sched_entity:
- *
- * task:
- * se_runnable() == se_weight()
- *
- * group: [ see update_cfs_group() ]
- * se_weight() = tg->weight * grq->load_avg / tg->load_avg
- * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
- *
- * load_sum := runnable_sum
- * load_avg = se_weight(se) * runnable_avg
- *
- * runnable_load_sum := runnable_sum
- * runnable_load_avg = se_runnable(se) * runnable_avg
- *
- * XXX collapse load_sum and runnable_load_sum
- *
- * cfq_rs:
- *
- * load_sum = \Sum se_weight(se) * se->avg.load_sum
- * load_avg = \Sum se->avg.load_avg
- *
- * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- * runnable_load_avg = \Sum se->avg.runable_load_avg
- */
-
-static int
-__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
-{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
- ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
- return 1;
- }
-
- return 0;
-}
-
-static int
-__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
- cfs_rq->curr == se)) {
-
- ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
- cfs_se_util_change(&se->avg);
- return 1;
- }
-
- return 0;
-}
-
-static int
-__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
-{
- if (___update_load_sum(now, cpu, &cfs_rq->avg,
- scale_load_down(cfs_rq->load.weight),
- scale_load_down(cfs_rq->runnable_weight),
- cfs_rq->curr != NULL)) {
-
- ___update_load_avg(&cfs_rq->avg, 1, 1);
- return 1;
- }
-
- return 0;
-}
-
#ifdef CONFIG_FAIR_GROUP_SCHED
/**
* update_tg_load_avg - update the tg's load avg
@@ -4045,12 +3720,6 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

#else /* CONFIG_SMP */

-static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
-{
- return 0;
-}
-
#define UPDATE_TG 0x0
#define SKIP_AGE_LOAD 0x0
#define DO_ATTACH 0x0
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
new file mode 100644
index 0000000..e6ecbb2
--- /dev/null
+++ b/kernel/sched/pelt.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per Entity Load Tracking
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Interactivity improvements by Mike Galbraith
+ * (C) 2007 Mike Galbraith <[email protected]>
+ *
+ * Various enhancements by Dmitry Adamushko.
+ * (C) 2007 Dmitry Adamushko <[email protected]>
+ *
+ * Group scheduling enhancements by Srivatsa Vaddagiri
+ * Copyright IBM Corporation, 2007
+ * Author: Srivatsa Vaddagiri <[email protected]>
+ *
+ * Scaled math optimizations by Thomas Gleixner
+ * Copyright (C) 2007, Thomas Gleixner <[email protected]>
+ *
+ * Adaptive scheduling granularity, math enhancements by Peter Zijlstra
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
+ *
+ * Move PELT related code from fair.c into this pelt.c file
+ * Author: Vincent Guittot <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "sched-pelt.h"
+#include "pelt.h"
+
+/*
+ * Approximate:
+ * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static u64 decay_load(u64 val, u64 n)
+{
+ unsigned int local_n;
+
+ if (unlikely(n > LOAD_AVG_PERIOD * 63))
+ return 0;
+
+ /* after bounds checking we can collapse to 32-bit */
+ local_n = n;
+
+ /*
+ * As y^PERIOD = 1/2, we can combine
+ * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+ * With a look-up table which covers y^n (n<PERIOD)
+ *
+ * To achieve constant time decay_load.
+ */
+ if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+ val >>= local_n / LOAD_AVG_PERIOD;
+ local_n %= LOAD_AVG_PERIOD;
+ }
+
+ val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ return val;
+}
+
+static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
+{
+ u32 c1, c2, c3 = d3; /* y^0 == 1 */
+
+ /*
+ * c1 = d1 y^p
+ */
+ c1 = decay_load((u64)d1, periods);
+
+ /*
+ * p-1
+ * c2 = 1024 \Sum y^n
+ * n=1
+ *
+ * inf inf
+ * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
+ * n=0 n=p
+ */
+ c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
+
+ return c1 + c2 + c3;
+}
+
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
+/*
+ * Accumulate the three separate parts of the sum; d1 the remainder
+ * of the last (incomplete) period, d2 the span of full periods and d3
+ * the remainder of the (incomplete) current period.
+ *
+ * d1 d2 d3
+ * ^ ^ ^
+ * | | |
+ * |<->|<----------------->|<--->|
+ * ... |---x---|------| ... |------|-----x (now)
+ *
+ * p-1
+ * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
+ * n=1
+ *
+ * = u y^p + (Step 1)
+ *
+ * p-1
+ * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2)
+ * n=1
+ */
+static __always_inline u32
+accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+ unsigned long load, unsigned long runnable, int running)
+{
+ unsigned long scale_freq, scale_cpu;
+ u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
+ u64 periods;
+
+ scale_freq = arch_scale_freq_capacity(cpu);
+ scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ delta += sa->period_contrib;
+ periods = delta / 1024; /* A period is 1024us (~1ms) */
+
+ /*
+ * Step 1: decay old *_sum if we crossed period boundaries.
+ */
+ if (periods) {
+ sa->load_sum = decay_load(sa->load_sum, periods);
+ sa->runnable_load_sum =
+ decay_load(sa->runnable_load_sum, periods);
+ sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+
+ /*
+ * Step 2
+ */
+ delta %= 1024;
+ contrib = __accumulate_pelt_segments(periods,
+ 1024 - sa->period_contrib, delta);
+ }
+ sa->period_contrib = delta;
+
+ contrib = cap_scale(contrib, scale_freq);
+ if (load)
+ sa->load_sum += load * contrib;
+ if (runnable)
+ sa->runnable_load_sum += runnable * contrib;
+ if (running)
+ sa->util_sum += contrib * scale_cpu;
+
+ return periods;
+}
+
+/*
+ * We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series. To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ * p0 p1 p2
+ * (now) (~1ms ago) (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the with of a reasonably scheduling period, fixing:
+ * y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+ unsigned long load, unsigned long runnable, int running)
+{
+ u64 delta;
+
+ delta = now - sa->last_update_time;
+ /*
+ * This should only happen when time goes backwards, which it
+ * unfortunately does during sched clock init when we swap over to TSC.
+ */
+ if ((s64)delta < 0) {
+ sa->last_update_time = now;
+ return 0;
+ }
+
+ /*
+ * Use 1024ns as the unit of measurement since it's a reasonable
+ * approximation of 1us and fast to compute.
+ */
+ delta >>= 10;
+ if (!delta)
+ return 0;
+
+ sa->last_update_time += delta << 10;
+
+ /*
+ * running is a subset of runnable (weight) so running can't be set if
+ * runnable is clear. But there are some corner cases where the current
+ * se has been already dequeued but cfs_rq->curr still points to it.
+ * This means that weight will be 0 but not running for a sched_entity
+ * but also for a cfs_rq if the latter becomes idle. As an example,
+ * this happens during idle_balance() which calls
+ * update_blocked_averages()
+ */
+ if (!load)
+ runnable = running = 0;
+
+ /*
+ * Now we know we crossed measurement unit boundaries. The *_avg
+ * accrues by two steps:
+ *
+ * Step 1: accumulate *_sum since last_update_time. If we haven't
+ * crossed period boundaries, finish.
+ */
+ if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+ return 0;
+
+ return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+{
+ u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
+ /*
+ * Step 2: update *_avg.
+ */
+ sa->load_avg = div_u64(load * sa->load_sum, divider);
+ sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
+ sa->util_avg = sa->util_sum / divider;
+}
+
+/*
+ * sched_entity:
+ *
+ * task:
+ * se_runnable() == se_weight()
+ *
+ * group: [ see update_cfs_group() ]
+ * se_weight() = tg->weight * grq->load_avg / tg->load_avg
+ * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
+ * load_sum := runnable_sum
+ * load_avg = se_weight(se) * runnable_avg
+ *
+ * runnable_load_sum := runnable_sum
+ * runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
+ * cfq_rq:
+ *
+ * load_sum = \Sum se_weight(se) * se->avg.load_sum
+ * load_avg = \Sum se->avg.load_avg
+ *
+ * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ * runnable_load_avg = \Sum se->avg.runable_load_avg
+ */
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+{
+ if (entity_is_task(se))
+ se->runnable_weight = se->load.weight;
+
+ if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+ ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ return 1;
+ }
+
+ return 0;
+}
+
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (entity_is_task(se))
+ se->runnable_weight = se->load.weight;
+
+ if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+ cfs_rq->curr == se)) {
+
+ ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ cfs_se_util_change(&se->avg);
+ return 1;
+ }
+
+ return 0;
+}
+
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+{
+ if (___update_load_sum(now, cpu, &cfs_rq->avg,
+ scale_load_down(cfs_rq->load.weight),
+ scale_load_down(cfs_rq->runnable_weight),
+ cfs_rq->curr != NULL)) {
+
+ ___update_load_avg(&cfs_rq->avg, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
new file mode 100644
index 0000000..9cac73e
--- /dev/null
+++ b/kernel/sched/pelt.h
@@ -0,0 +1,43 @@
+#ifdef CONFIG_SMP
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+
+/*
+ * When a task is dequeued, its estimated utilization should not be update if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Avoid store if the flag has been already set */
+ enqueued = avg->util_est.enqueued;
+ if (!(enqueued & UTIL_AVG_UNCHANGED))
+ return;
+
+ /* Reset flag to report util_avg has been updated */
+ enqueued &= ~UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
+#else
+
+static inline int
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
+#endif
+
+
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4..757a3ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -666,7 +666,26 @@ struct dl_rq {
u64 bw_ratio;
};

+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se) (!se->my_q)
+#else
+#define entity_is_task(se) 1
+#endif
+
#ifdef CONFIG_SMP
+/*
+ * XXX we want to get rid of these helpers and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+ return scale_load_down(se->load.weight);
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+ return scale_load_down(se->runnable_weight);
+}

static inline bool sched_asym_prefer(int a, int b)
{
--
2.7.4


2018-05-25 13:15:32

by Vincent Guittot

Subject: [PATCH v5 09/10] sched: remove rt_avg code

rt_avg is no longer used anywhere, so we can remove all the related code.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 26 --------------------------
kernel/sched/sched.h | 17 -----------------
2 files changed, 43 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab58288..213d277 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -650,23 +650,6 @@ bool sched_can_stop_tick(struct rq *rq)
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
-
-void sched_avg_update(struct rq *rq)
-{
- s64 period = sched_avg_period();
-
- while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
- /*
- * Inline assembly required to prevent the compiler
- * optimising this loop into a divmod call.
- * See __iter_div_u64_rem() for another example of this.
- */
- asm("" : "+rm" (rq->age_stamp));
- rq->age_stamp += period;
- rq->rt_avg /= 2;
- }
-}
-
#endif /* CONFIG_SMP */

#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
@@ -5710,13 +5693,6 @@ void set_rq_offline(struct rq *rq)
}
}

-static void set_cpu_rq_start_time(unsigned int cpu)
-{
- struct rq *rq = cpu_rq(cpu);
-
- rq->age_stamp = sched_clock_cpu(cpu);
-}
-
/*
* used to mark begin/end of suspend/resume:
*/
@@ -5834,7 +5810,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
- set_cpu_rq_start_time(cpu);
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6102,7 +6077,6 @@ void __init sched_init(void)

#ifdef CONFIG_SMP
idle_thread_set_boot_cpu();
- set_cpu_rq_start_time(smp_processor_id());
#endif
init_sched_fair_class();

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 718c55d..1929db7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -846,8 +846,6 @@ struct rq {

struct list_head cfs_tasks;

- u64 rt_avg;
- u64 age_stamp;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
struct sched_avg avg_irq;
@@ -1710,11 +1708,6 @@ extern const_debug unsigned int sysctl_sched_time_avg;
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;

-static inline u64 sched_avg_period(void)
-{
- return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
-}
-
#ifdef CONFIG_SCHED_HRTICK

/*
@@ -1751,8 +1744,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
#endif

#ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1763,12 +1754,6 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
return SCHED_CAPACITY_SCALE;
}
#endif
-
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-{
- rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
- sched_avg_update(rq);
-}
#else
#ifndef arch_scale_cpu_capacity
static __always_inline
@@ -1777,8 +1762,6 @@ unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
return SCHED_CAPACITY_SCALE;
}
#endif
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-static inline void sched_avg_update(struct rq *rq) { }
#endif

struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
--
2.7.4


2018-05-25 13:16:35

by Vincent Guittot

Subject: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can use the max of the two values when aggregating the
utilization of the CPU.
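
As an illustration with made-up numbers: a deadline task with a runtime of
4ms every 16ms contributes a running_bw of 1/4, i.e.
(rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT = 256 in capacity
units, while rq->avg_dl.util_avg is the PELT-measured utilization of the dl
rq; cpu_util_dl() below now reports the larger of the two values.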

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/sched.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4526ba6..0eb07a8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
static inline unsigned long cpu_util_dl(struct rq *rq)
{
- return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+ unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+
+ util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+
+ return util;
}

static inline unsigned long cpu_util_cfs(struct rq *rq)
--
2.7.4


2018-05-25 13:16:47

by Vincent Guittot

Subject: [PATCH v5 04/10] sched/dl: add dl_rq utilization tracking

Similarly to what happens with rt tasks, cfs tasks can be preempted by dl
tasks, and the cfs utilization might no longer describe the real utilization
level.
The current dl bandwidth reflects the requirements to meet the deadlines when
tasks are enqueued, but not the current utilization of the dl sched class. We
track the dl class utilization to estimate the system utilization.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/deadline.c | 5 +++++
kernel/sched/fair.c | 11 ++++++++---
kernel/sched/pelt.c | 23 +++++++++++++++++++++++
kernel/sched/pelt.h | 6 ++++++
kernel/sched/sched.h | 1 +
5 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1356afd..950b3fb 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,7 @@
* Fabio Checconi <[email protected]>
*/
#include "sched.h"
+#include "pelt.h"

struct dl_bandwidth def_dl_bandwidth;

@@ -1761,6 +1762,8 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

deadline_queue_push_tasks(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq,
+ rq->curr->sched_class == &dl_sched_class);
return p;
}

@@ -1768,6 +1771,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
update_curr_dl(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
@@ -1784,6 +1788,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
{
update_curr_dl(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
/*
* Even when we have runtime, update_curr_dl() might have resulted in us
* not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb18bcc..967e873 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,11 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

-static inline bool rt_rq_has_blocked(struct rq *rq)
+static inline bool others_rqs_have_blocked(struct rq *rq)
{
if (rq->avg_rt.util_avg)
return true;

+ if (rq->avg_dl.util_avg)
+ return true;
+
return false;
}

@@ -7358,8 +7361,9 @@ static void update_blocked_averages(int cpu)
done = false;
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
/* Don't need periodic decay once load/util_avg are null */
- if (rt_rq_has_blocked(rq))
+ if (others_rqs_have_blocked(rq))
done = false;

#ifdef CONFIG_NO_HZ_COMMON
@@ -7427,9 +7431,10 @@ static inline void update_blocked_averages(int cpu)
update_rq_clock(rq);
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 213b922..b07db80 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -332,3 +332,26 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

return 0;
}
+
+/*
+ * dl_rq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+ running,
+ running,
+ running)) {
+
+ ___update_load_avg(&rq->avg_dl, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index b2983b7..0e4f912 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -4,6 +4,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be update if
@@ -45,6 +46,11 @@ update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}

+static inline int
+update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a16de9..4526ba6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -849,6 +849,7 @@ struct rq {
u64 rt_avg;
u64 age_stamp;
struct sched_avg avg_rt;
+ struct sched_avg avg_dl;
u64 idle_stamp;
u64 avg_idle;

--
2.7.4


2018-05-25 14:28:22

by Quentin Perret

Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

Hi Vincent,

On Friday 25 May 2018 at 15:12:22 (+0200), Vincent Guittot wrote:
> We want to track rt_rq's utilization as a part of the estimation of the
> whole rq's utilization. This is necessary because rt tasks can steal
> utilization to cfs tasks and make them lighter than they are.
> As we want to use the same load tracking mecanism for both and prevent
> useless dependency between cfs and rt code, pelt code is moved in a
> dedicated file.

I tried to do a quick build test to check if this patch actually introduces
function calls or not in the end. The base branch I used is today's
tip/sched/core ("2539fc82aa9b sched/fair: Update util_est before updating
schedutil").

* x86, x86_defconfig
- Without patch:
queper01 ~/work/linux> size vmlinux
text data bss dec hex filename
17476437 4942296 999628 23418361 16555f9 vmlinux

- With patch:
queper01 ~/work/linux> size vmlinux
text data bss dec hex filename
17476757 4942296 999628 23418681 1655739 vmlinux

* arm64, defconfig
- Without patch:
queper01 ~/work/linux> size vmlinux
text data bss dec hex filename
11349598 6440764 395336 18185698 1157de2 vmlinux

- With patch:
queper01 ~/work/linux> size vmlinux
text data bss dec hex filename
11349598 6440764 395336 18185698 1157de2 vmlinux

It's also true that I'm not using the same gcc version for both archs
(both quite old TBH -- 4.9.4 for arm64, 4.8.4 for x86).

The fact that the code size doesn't change for arm64 suggests that we
already had function calls for the __update_load_avg* functions, so
the patch isn't making things worse than they already are, but for x86,
this is different. Have you tried on x86 on your end?

And also, I understand these functions are large, but if we _really_
want to inline them even though they're big, why not putting them in
sched-pelt.h ? We probably wouldn't accept that for everything, but
those PELT functions are used all over the place, including latency
sensitive code paths (e.g. task wake-up).

Thanks,
Quentin

2018-05-25 15:56:37

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

On 25-May 15:12, Vincent Guittot wrote:
> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
^
only
otherwise, while RT tasks are running we go to max.

> tasks are running.
> When the CPU is overloaded by cfs and rt tasks, cfs tasks
^^^^^^^^^^
I would say we always have the problem whenever an RT task preempts a
CFS one, even just for a few ms, isn't it?

> are preempted by rt tasks and in this case util_avg reflects the remaining
> capacity but not what cfs want to use. In such case, schedutil can select a
> lower OPP whereas the CPU is overloaded. In order to have a more accurate
> view of the utilization of the CPU, we track the utilization that is
> "stolen" by rt tasks.
>
> rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
> the same at the root group level, so the PELT windows of the util_sum are
> aligned.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 15 ++++++++++++++-
> kernel/sched/pelt.c | 23 +++++++++++++++++++++++
> kernel/sched/pelt.h | 7 +++++++
> kernel/sched/rt.c | 8 ++++++++
> kernel/sched/sched.h | 7 +++++++
> 5 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6390c66..fb18bcc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7290,6 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> return false;
> }
>
> +static inline bool rt_rq_has_blocked(struct rq *rq)
> +{
> + if (rq->avg_rt.util_avg)

Should use READ_ONCE?

> + return true;
> +
> + return false;

What about just:

return READ_ONCE(rq->avg_rt.util_avg);

?

> +}
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> @@ -7349,6 +7357,10 @@ static void update_blocked_averages(int cpu)
> if (cfs_rq_has_blocked(cfs_rq))
> done = false;
> }
> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> + /* Don't need periodic decay once load/util_avg are null */
> + if (rt_rq_has_blocked(rq))
> + done = false;
>
> #ifdef CONFIG_NO_HZ_COMMON
> rq->last_blocked_load_update_tick = jiffies;
> @@ -7414,9 +7426,10 @@ static inline void update_blocked_averages(int cpu)
> rq_lock_irqsave(rq, &rf);
> update_rq_clock(rq);
> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> #ifdef CONFIG_NO_HZ_COMMON
> rq->last_blocked_load_update_tick = jiffies;
> - if (!cfs_rq_has_blocked(cfs_rq))
> + if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
> rq->has_blocked_load = 0;
> #endif
> rq_unlock_irqrestore(rq, &rf);
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index e6ecbb2..213b922 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -309,3 +309,26 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>
> return 0;
> }
> +
> +/*
> + * rt_rq:
> + *
> + * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
> + * util_sum = cpu_scale * load_sum
> + * runnable_load_sum = load_sum
> + *
> + */
> +
> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
> +{
> + if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
> + running,
> + running,
> + running)) {
> +

Not needed empty line?

> + ___update_load_avg(&rq->avg_rt, 1, 1);
> + return 1;
> + }
> +
> + return 0;
> +}
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 9cac73e..b2983b7 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -3,6 +3,7 @@
> int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
> int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
> int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>
> /*
> * When a task is dequeued, its estimated utilization should not be update if
> @@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> return 0;
> }
>
> +static inline int
> +update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
> +{
> + return 0;
> +}
> +
> #endif
>
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef3c4e6..b4148a9 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -5,6 +5,8 @@
> */
> #include "sched.h"
>
> +#include "pelt.h"
> +
> int sched_rr_timeslice = RR_TIMESLICE;
> int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
>
> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> rt_queue_push_tasks(rq);
>
> + update_rt_rq_load_avg(rq_clock_task(rq), rq,
> + rq->curr->sched_class == &rt_sched_class);
> +
> return p;
> }
>
> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> {
> update_curr_rt(rq);
>
> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> +
> /*
> * The previous task needs to be made eligible for pushing
> * if it is still active
> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
> struct sched_rt_entity *rt_se = &p->rt;
>
> update_curr_rt(rq);
> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);

Mmm... not entirely sure... can't we fold
update_rt_rq_load_avg() into update_curr_rt() ?

Currently update_curr_rt() is used in:
dequeue_task_rt
pick_next_task_rt
put_prev_task_rt
task_tick_rt

while we update_rt_rq_load_avg() only in:
pick_next_task_rt
put_prev_task_rt
task_tick_rt
and
update_blocked_averages

Why don't we need to update at dequeue_task_rt() time?

>
> watchdog(rq, p);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 757a3ee..7a16de9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -592,6 +592,7 @@ struct rt_rq {
> unsigned long rt_nr_total;
> int overloaded;
> struct plist_head pushable_tasks;
> +
> #endif /* CONFIG_SMP */
> int rt_queued;
>
> @@ -847,6 +848,7 @@ struct rq {
>
> u64 rt_avg;
> u64 age_stamp;
> + struct sched_avg avg_rt;
> u64 idle_stamp;
> u64 avg_idle;
>
> @@ -2205,4 +2207,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
>
> return util;
> }
> +
> +static inline unsigned long cpu_util_rt(struct rq *rq)
> +{
> + return rq->avg_rt.util_avg;

READ_ONCE?

> +}
> #endif
> --
> 2.7.4
>

--
#include <best/regards.h>

Patrick Bellasi

2018-05-25 16:15:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

On Fri, May 25, 2018 at 03:26:48PM +0100, Quentin Perret wrote:
> (both quite old TBH -- 4.9.4 for arm64, 4.8.4 for x86).

You really should try with a more recent compiler.

2018-05-25 18:06:14

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

On 25-May 15:26, Quentin Perret wrote:
> And also, I understand these functions are large, but if we _really_
> want to inline them even though they're big, why not putting them in
> sched-pelt.h ?

Had the same thought at first... but then I recalled that header is
generated from a script. Thus, eventually, it should be a different one.

> We probably wouldn't accept that for everything, but
> those PELT functions are used all over the place, including latency
> sensitive code paths (e.g. task wake-up).

We should better measure the overheads, if any, and check what
(a modern) compiler does. Maybe some hackbench run could help on the
first point.

> Thanks,
> Quentin

--
#include <best/regards.h>

Patrick Bellasi

2018-05-28 12:07:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 08/10] cpufreq/schedutil: take into account interrupt

Hi Juri,

On 28 May 2018 at 12:41, Juri Lelli <[email protected]> wrote:
> Hi Vincent,
>
> On 25/05/18 15:12, Vincent Guittot wrote:
>> The time spent under interrupt can be significant but it is not reflected
>> in the utilization of CPU when deciding to choose an OPP. Now that we have
>> access to this metric, schedutil can take it into account when selecting
>> the OPP for a CPU.
>> The CPU utilization is :
>> irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
>
> IIUC the code below you actually propose that
>
> util = [(max_cap - util_irq) * util_rq] / max_cap + util_irq
>
> where
>
> util_rq = /Sum rq util_avg
> util_irq = irq util_avg
>
> So, which one is what you have in mind? Or am I wrong? :)

mmh ... aren't they equal ?

util = [(max_cap - util_irq) * util_rq] / max_cap + util_irq
util = [(max_cap/max_cap - util_irq/max_cap) * util_rq] + util_irq
util = [(1 - util_irq/max_cap) * util_rq] + util_irq
util = util_irq + [(1 - util_irq/max_cap) * util_rq]
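
For illustration, a quick numeric check with made-up values (max_cap = 1024,
util_irq = 256, util_rq = 512 are arbitrary, only there to show both forms
land on the same number):

/* standalone toy program, not kernel code */
#include <stdio.h>

int main(void)
{
        unsigned long max_cap = 1024, util_irq = 256, util_rq = 512;

        /* form used in the patch: scale rq util to the non-irq time, add irq */
        unsigned long a = (max_cap - util_irq) * util_rq / max_cap + util_irq;

        /* form from the changelog, util_irq + (1 - util_irq/max_cap) * util_rq,
         * kept in integer arithmetic */
        unsigned long b = util_irq + (util_rq - util_irq * util_rq / max_cap);

        printf("%lu %lu\n", a, b);      /* both print 640 */
        return 0;
}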

>
> [...]
>
>> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> @@ -190,9 +192,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> if (rq->rt.rt_nr_running) {
>> util = sg_cpu->max;
>> } else {
>> + /* Sum rq utilization*/
>> util = sg_cpu->util_dl;
>> util += sg_cpu->util_cfs;
>> util += sg_cpu->util_rt;
>> +
>> + /* Weight rq's utilization to the normal context */
>> + util *= (sg_cpu->max - sg_cpu->util_irq);
>> + util /= sg_cpu->max;
>> +
>> + /* Add interrupt utilization */
>> + util += sg_cpu->util_irq;
>
> Thanks,
>
> - Juri

2018-05-28 12:38:38

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 08/10] cpufreq/schedutil: take into account interrupt

On 28/05/18 14:06, Vincent Guittot wrote:
> Hi Juri,
>
> On 28 May 2018 at 12:41, Juri Lelli <[email protected]> wrote:
> > Hi Vincent,
> >
> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> The time spent under interrupt can be significant but it is not reflected
> >> in the utilization of CPU when deciding to choose an OPP. Now that we have
> >> access to this metric, schedutil can take it into account when selecting
> >> the OPP for a CPU.
> >> The CPU utilization is :
> >> irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> >
> > IIUC the code below you actually propose that
> >
> > util = [(max_cap - util_irq) * util_rq] / max_cap + util_irq
> >
> > where
> >
> > util_rq = /Sum rq util_avg
> > util_irq = irq util_avg
> >
> > So, which one is what you have in mind? Or am I wrong? :)
>
> mmh ... aren't they equal ?
>
> util = [(max_cap - util_irq) * util_rq] / max_cap + util_irq
> util = [(max_cap/max_cap - util_irq/max_cap) * util_rq] + util_irq
> util = [(1 - util_irq/max_cap) * util_rq] + util_irq
> util = util_irq + [(1 - util_irq/max_cap) * util_rq]

Ah, indeed.

2018-05-28 13:55:22

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 08/10] cpufreq/schedutil: take into account interrupt

Hi Vincent,

On 25/05/18 15:12, Vincent Guittot wrote:
> The time spent under interrupt can be significant but it is not reflected
> in the utilization of CPU when deciding to choose an OPP. Now that we have
> access to this metric, schedutil can take it into account when selecting
> the OPP for a CPU.
> The CPU utilization is :
> irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg

IIUC the code below you actually propose that

util = [(max_cap - util_irq) * util_rq] / max_cap + util_irq

where

util_rq = /Sum rq util_avg
util_irq = irq util_avg

So, which one is what you have in mind? Or am I wrong? :)

[...]

> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> @@ -190,9 +192,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> if (rq->rt.rt_nr_running) {
> util = sg_cpu->max;
> } else {
> + /* Sum rq utilization*/
> util = sg_cpu->util_dl;
> util += sg_cpu->util_cfs;
> util += sg_cpu->util_rt;
> +
> + /* Weight rq's utilization to the normal context */
> + util *= (sg_cpu->max - sg_cpu->util_irq);
> + util /= sg_cpu->max;
> +
> + /* Add interrupt utilization */
> + util += sg_cpu->util_irq;

Thanks,

- Juri

2018-05-28 15:04:11

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

Hi Juri,

On 28 May 2018 at 12:12, Juri Lelli <[email protected]> wrote:
> Hi Vincent,
>
> On 25/05/18 15:12, Vincent Guittot wrote:
>> Now that we have both the dl class bandwidth requirement and the dl class
>> utilization, we can use the max of the 2 values when aggregating the
>> utilization of the CPU.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/sched.h | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 4526ba6..0eb07a8 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> static inline unsigned long cpu_util_dl(struct rq *rq)
>> {
>> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>
> I'd be tempted to say that we actually want to cap to this one above
> instead of using the max (as you are proposing below) or the
> (theoretical) power reduction benefits of using DEADLINE for certain
> tasks might vanish.

The problem that I'm facing is that the sched_entity bandwidth is
removed after the 0-lag time and the rq->dl.running_bw goes back to
zero but if the DL task has preempted a CFS task, the utilization of
the CFS task will be lower than reality and schedutil will set a lower
OPP whereas the CPU is always running. The example with a RT task
described in the cover letter can be run with a DL task and will give
similar results.
avg_dl.util_avg tracks the utilization of the rq seen by the scheduler
whereas rq->dl.running_bw gives the minimum to match DL requirement.

>
>> +
>> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>> +
>> + return util;
>
> Anyway, just a quick thought. I guess we should experiment with this a
> bit. Now, I don't unfortunately have a Arm platform at hand for testing.
> Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> this change?
>
> Oh, adding Joel and Alessio as well that experimented with DEADLINE
> lately.
>
> Thanks,
>
> - Juri

2018-05-28 15:23:45

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 28/05/18 16:57, Vincent Guittot wrote:
> Hi Juri,
>
> On 28 May 2018 at 12:12, Juri Lelli <[email protected]> wrote:
> > Hi Vincent,
> >
> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> utilization, we can use the max of the 2 values when aggregating the
> >> utilization of the CPU.
> >>
> >> Signed-off-by: Vincent Guittot <[email protected]>
> >> ---
> >> kernel/sched/sched.h | 6 +++++-
> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 4526ba6..0eb07a8 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> static inline unsigned long cpu_util_dl(struct rq *rq)
> >> {
> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >
> > I'd be tempted to say that we actually want to cap to this one above
> > instead of using the max (as you are proposing below) or the
> > (theoretical) power reduction benefits of using DEADLINE for certain
> > tasks might vanish.
>
> The problem that I'm facing is that the sched_entity bandwidth is
> removed after the 0-lag time and the rq->dl.running_bw goes back to
> zero but if the DL task has preempted a CFS task, the utilization of
> the CFS task will be lower than reality and schedutil will set a lower
> OPP whereas the CPU is always running. The example with a RT task
> described in the cover letter can be run with a DL task and will give
> similar results.
> avg_dl.util_avg tracks the utilization of the rq seen by the scheduler
> whereas rq->dl.running_bw gives the minimum to match DL requirement.

Mmm, I see. Note that I'm only being cautious, what you propose might
work OK, but it seems to me that we might lose some of the benefits of
running tasks with DEADLINE if we start selecting frequency as you
propose even when such tasks are running.

An idea might be to copy running_bw util into dl.util_avg when a DL task
goes to sleep, and then decay the latter as for the RT contribution. What
do you think?
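
Something along these lines, as a purely hypothetical sketch (the helper name
and the exact point where running_bw is sampled are invented, and it ignores
when running_bw is actually decremented):

/* seed the PELT signal from the bandwidth-based estimate when the last
 * DL task blocks, then let the normal updates decay it from there */
static void dl_seed_util_from_bw(struct rq *rq)
{
        unsigned long bw_util;

        bw_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
        WRITE_ONCE(rq->avg_dl.util_avg, bw_util);
        /* subsequent update_dl_rq_load_avg(now, rq, 0) calls decay it */
}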

2018-05-28 16:17:58

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

Hi Vincent,

On 25/05/18 15:12, Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can use the max of the 2 values when aggregating the
> utilization of the CPU.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/sched.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4526ba6..0eb07a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> static inline unsigned long cpu_util_dl(struct rq *rq)
> {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

I'd be tempted to say that we actually want to cap to this one above
instead of using the max (as you are proposing below) or the
(theoretical) power reduction benefits of using DEADLINE for certain
tasks might vanish.

> +
> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> +
> + return util;

Anyway, just a quick thought. I guess we should experiment with this a
bit. Now, I don't unfortunately have a Arm platform at hand for testing.
Claudio, Luca (now Cc-ed), would you be able to fire some tests with
this change?

Oh, adding Joel and Alessio as well that experimented with DEADLINE
lately.

Thanks,

- Juri

2018-05-28 16:35:50

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 28 May 2018 at 17:22, Juri Lelli <[email protected]> wrote:
> On 28/05/18 16:57, Vincent Guittot wrote:
>> Hi Juri,
>>
>> On 28 May 2018 at 12:12, Juri Lelli <[email protected]> wrote:
>> > Hi Vincent,
>> >
>> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> Now that we have both the dl class bandwidth requirement and the dl class
>> >> utilization, we can use the max of the 2 values when aggregating the
>> >> utilization of the CPU.
>> >>
>> >> Signed-off-by: Vincent Guittot <[email protected]>
>> >> ---
>> >> kernel/sched/sched.h | 6 +++++-
>> >> 1 file changed, 5 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index 4526ba6..0eb07a8 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >> static inline unsigned long cpu_util_dl(struct rq *rq)
>> >> {
>> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >
>> > I'd be tempted to say that we actually want to cap to this one above
>> > instead of using the max (as you are proposing below) or the
>> > (theoretical) power reduction benefits of using DEADLINE for certain
>> > tasks might vanish.
>>
>> The problem that I'm facing is that the sched_entity bandwidth is
>> removed after the 0-lag time and the rq->dl.running_bw goes back to
>> zero but if the DL task has preempted a CFS task, the utilization of
>> the CFS task will be lower than reality and schedutil will set a lower
>> OPP whereas the CPU is always running. The example with a RT task
>> described in the cover letter can be run with a DL task and will give
>> similar results.
>> avg_dl.util_avg tracks the utilization of the rq seen by the scheduler
>> whereas rq->dl.running_bw gives the minimum to match DL requirement.
>
> Mmm, I see. Note that I'm only being cautious, what you propose might
> work OK, but it seems to me that we might lose some of the benefits of
> running tasks with DEADLINE if we start selecting frequency as you
> propose even when such tasks are running.

I see your point. Taking into account the number of running cfs tasks to
choose between rq->dl.running_bw and avg_dl.util_avg could help

>
> An idea might be to copy running_bw util into dl.util_avg when a DL task
> goes to sleep, and then decay the latter as for the RT contribution. What
> do you think?

Not sure that this will work, because you will overwrite the value each
time a DL task goes to sleep, and the decay will mainly happen on the
update when the last DL task goes to sleep, which might not reflect what
has been used by DL tasks but only the requirement of the last running
DL task. The other benefit of PELT is to have a utilization tracking
signal which uses the same curve as CFS, so the increase of
cfs_rq->avg.util_avg and the decrease of avg_dl.util_avg will
compensate each other (or the opposite).
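
As a toy illustration of that compensation (user-space model, not kernel code:
the 32ms half-life matches the PELT constant, the 600/400 starting values are
invented, and the per-ms geometric update is only an approximation of what
PELT really does):

#include <stdio.h>
#include <math.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* per-ms decay, 32ms half-life */
        double util_dl = 600.0, util_cfs = 400.0;

        /* the DL task sleeps at t=0 and a CFS task runs instead */
        for (int t = 0; t <= 64; t++) {
                if (!(t % 16))
                        printf("t=%2dms dl=%6.1f cfs=%6.1f sum=%6.1f\n",
                               t, util_dl, util_cfs, util_dl + util_cfs);
                util_dl *= y;                                   /* sleeping: pure decay */
                util_cfs = util_cfs * y + 1024.0 * (1.0 - y);   /* running: decay + gain */
        }
        return 0;
}

Over those 64ms the sum only drifts from 1000 to ~1018 while the two signals
swap, which is the compensation described above.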

2018-05-29 05:09:16

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
[..]
> > +
> > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > +
> > + return util;
>
> Anyway, just a quick thought. I guess we should experiment with this a
> bit. Now, I don't unfortunately have a Arm platform at hand for testing.
> Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> this change?
>
> Oh, adding Joel and Alessio as well that experimented with DEADLINE
> lately.

I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
beyond what the running bandwidth is, or at least do that only if CFS tasks
are running and being preempted as you/Vincent mentioned in one of the
threads.

With our DL experiments, I didn't measure power but got it to a point where
the OPP is scaling correctly based on DL parameters. I think Alessio did
measure power at his setup but I can't recall now. Alessio?

thanks,

- Joel


2018-05-29 06:33:55

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 28/05/18 22:08, Joel Fernandes wrote:
> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
> [..]
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > > +
> > > + return util;
> >
> > Anyway, just a quick thought. I guess we should experiment with this a
> > bit. Now, I don't unfortunately have a Arm platform at hand for testing.
> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> > this change?
> >
> > Oh, adding Joel and Alessio as well that experimented with DEADLINE
> > lately.
>
> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
> beyond what the running bandwidth is, or at least do that only if CFS tasks
> are running and being preempted as you/Vincent mentioned in one of the
> threads.

It's however a bit awkward that we might be running at a higher OPP when
CFS tasks are running (even though they are of less priority). :/

> With our DL experiments, I didn't measure power but got it to a point where
> the OPP is scaling correctly based on DL parameters. I think Alessio did
> measure power at his setup but I can't recall now. Alessio?

I see. Thanks.

2018-05-29 06:51:21

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 29 May 2018 at 08:31, Juri Lelli <[email protected]> wrote:
> On 28/05/18 22:08, Joel Fernandes wrote:
>> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
>> [..]
>> > > +
>> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>> > > +
>> > > + return util;
>> >
>> > Anyway, just a quick thought. I guess we should experiment with this a
>> > bit. Now, I don't unfortunately have a Arm platform at hand for testing.
>> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
>> > this change?
>> >
>> > Oh, adding Joel and Alessio as well that experimented with DEADLINE
>> > lately.
>>
>> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
>> beyond what the running bandwidth is, or at least do that only if CFS tasks
>> are running and being preempted as you/Vincent mentioned in one of the
>> threads.
>
> It's however a bit awkward that we might be running at a higher OPP when
> CFS tasks are running (even though they are of less priority). :/

Even if cfs tasks have lower priority, that doesn't mean that we should
not take their needs into account.
In the same way, we run at max OPP as soon as an RT task is runnable

>
>> With our DL experiments, I didn't measure power but got it to a point where
>> the OPP is scaling correctly based on DL parameters. I think Alessio did
>> measure power at his setup but I can't recall now. Alessio?
>
> I see. Thanks.

2018-05-29 08:23:24

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

On Friday 25 May 2018 at 18:14:23 (+0200), Peter Zijlstra wrote:
> On Fri, May 25, 2018 at 03:26:48PM +0100, Quentin Perret wrote:
> > (both quite old TBH -- 4.9.4 for arm64, 4.8.4 for x86).
>
> You really should try with a more recent compiler.

Right, so I just gave it a try for x86 with gcc 8.0.1 (which seem to
introduce a lot of LTO-related enhancements) and I get the following:

Without patch
text data bss dec hex filename
17474129 4980348 995532 23450009 165d199 vmlinux

With patch
text data bss dec hex filename
17474049 4980348 995532 23449929 165d149 vmlinux

So it is still true that this patch actually changes the code size, most
likely because of new function calls. But maybe we don't care if the
impact on performance isn't noticeable ... As Patrick said, a hackbench
run should be more interesting here. I'll give it a try later today.

Thanks,
Quentin

2018-05-29 08:41:19

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

Hi Vincent,

On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can use the max of the 2 values when aggregating the
> utilization of the CPU.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/sched.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4526ba6..0eb07a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> static inline unsigned long cpu_util_dl(struct rq *rq)
> {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> +
> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

Would it make sense to use a UTIL_EST version of that signal here ? I
don't think that would make sense for the RT class with your patch-set
since you only really use the blocked part of the signal for RT IIUC,
but would that work for DL ?
> +
> + return util;
> }
>
> static inline unsigned long cpu_util_cfs(struct rq *rq)
> --
> 2.7.4
>

2018-05-29 09:49:13

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 29/05/18 08:48, Vincent Guittot wrote:
> On 29 May 2018 at 08:31, Juri Lelli <[email protected]> wrote:
> > On 28/05/18 22:08, Joel Fernandes wrote:
> >> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
> >> [..]
> >> > > +
> >> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> >> > > +
> >> > > + return util;
> >> >
> >> > Anyway, just a quick thought. I guess we should experiment with this a
> >> > bit. Now, I don't unfortunately have a Arm platform at hand for testing.
> >> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> >> > this change?
> >> >
> >> > Oh, adding Joel and Alessio as well that experimented with DEADLINE
> >> > lately.
> >>
> >> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
> >> beyond what the running bandwidth is, or at least do that only if CFS tasks
> >> are running and being preempted as you/Vincent mentioned in one of the
> >> threads.
> >
> > It's however a bit awkward that we might be running at a higher OPP when
> > CFS tasks are running (even though they are of less priority). :/
>
> Even if cfs tasks have lower priority, that doesn't mean that we should
> not take their needs into account.
> In the same way, we run at max OPP as soon as an RT task is runnable

Sure. What I fear is a little CFS utilization generating spikes because
dl.util_avg became big when running DL tasks. Not sure that can happen
though because such DL tasks should be throttled anyway.

2018-05-29 09:53:01

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 29/05/18 09:40, Quentin Perret wrote:
> Hi Vincent,
>
> On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > Now that we have both the dl class bandwidth requirement and the dl class
> > utilization, we can use the max of the 2 values when aggregating the
> > utilization of the CPU.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/sched.h | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 4526ba6..0eb07a8 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > static inline unsigned long cpu_util_dl(struct rq *rq)
> > {
> > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > +
> > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>
> Would it make sense to use a UTIL_EST version of that signal here ? I
> don't think that would make sense for the RT class with your patch-set
> since you only really use the blocked part of the signal for RT IIUC,
> but would that work for DL ?

Well, UTIL_EST for DL looks pretty much what we already do by computing
utilization based on dl.running_bw. That's why I was thinking of using
that as a starting point for dl.util_avg decay phase.

2018-05-29 13:37:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

Hi Patrick,

On 25 May 2018 at 17:54, Patrick Bellasi <[email protected]> wrote:
> On 25-May 15:12, Vincent Guittot wrote:
>> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
> ^
> only
> otherwise, while RT tasks are running we go to max.
>
>> tasks are running.
>> When the CPU is overloaded by cfs and rt tasks, cfs tasks
> ^^^^^^^^^^
> I would say we always have the problem whenever an RT task preempts a
> CFS one, even just for a few ms, isn't it?

The problem only happens when there is not enough time to run all
tasks (rt and cfs). If the cfs task is preempted for a few ms and the main
impact is only a delay in its execution, but there is still enough time
to do the cfs jobs (the cpu goes back to idle from time to time), there is no
"real" problem. For now, it means that it's not a problem as long as
the rt task doesn't take more than the margin that schedutil uses to
select a frequency: (max_freq + (max_freq >> 2)) * util / max_capacity
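
i.e. roughly the 25% headroom that get_next_freq() applies; as a stripped-down
sketch of the idea (ignoring frequency invariance and the clamping to the
policy limits):

/* next_freq = 1.25 * max_freq * util / max_capacity */
static unsigned int sketch_next_freq(unsigned int max_freq,
                                     unsigned long util, unsigned long max)
{
        return (max_freq + (max_freq >> 2)) * util / max;
}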

>
>> are preempted by rt tasks and in this case util_avg reflects the remaining
>> capacity but not what cfs want to use. In such case, schedutil can select a
>> lower OPP whereas the CPU is overloaded. In order to have a more accurate
>> view of the utilization of the CPU, we track the utilization that is
>> "stolen" by rt tasks.
>>
>> rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
>> the same at the root group level, so the PELT windows of the util_sum are
>> aligned.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 15 ++++++++++++++-
>> kernel/sched/pelt.c | 23 +++++++++++++++++++++++
>> kernel/sched/pelt.h | 7 +++++++
>> kernel/sched/rt.c | 8 ++++++++
>> kernel/sched/sched.h | 7 +++++++
>> 5 files changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6390c66..fb18bcc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7290,6 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> return false;
>> }
>>
>> +static inline bool rt_rq_has_blocked(struct rq *rq)
>> +{
>> + if (rq->avg_rt.util_avg)
>
> Should use READ_ONCE?

I was expecting that there would be only one read by default, but I can
add READ_ONCE

>
>> + return true;
>> +
>> + return false;
>
> What about just:
>
> return READ_ONCE(rq->avg_rt.util_avg);
>
> ?

This function is renamed and extended with other tracking signals in the
following patches, so we have to test several values in the function.
That's also why there is the if test: additional if tests are
going to be added
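
By the end of the series the helper ends up looking roughly like this (the
irq check assumes the avg_irq field added by the irq tracking patch; rt and
dl are as in the posted patches):

static inline bool others_rqs_have_blocked(struct rq *rq)
{
        if (rq->avg_rt.util_avg)
                return true;

        if (rq->avg_dl.util_avg)
                return true;

        if (rq->avg_irq.util_avg)
                return true;

        return false;
}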

>
>> +}
>> +
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>>
>> static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>> @@ -7349,6 +7357,10 @@ static void update_blocked_averages(int cpu)
>> if (cfs_rq_has_blocked(cfs_rq))
>> done = false;
>> }
>> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>> + /* Don't need periodic decay once load/util_avg are null */
>> + if (rt_rq_has_blocked(rq))
>> + done = false;
>>
>> #ifdef CONFIG_NO_HZ_COMMON
>> rq->last_blocked_load_update_tick = jiffies;
>> @@ -7414,9 +7426,10 @@ static inline void update_blocked_averages(int cpu)
>> rq_lock_irqsave(rq, &rf);
>> update_rq_clock(rq);
>> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>> #ifdef CONFIG_NO_HZ_COMMON
>> rq->last_blocked_load_update_tick = jiffies;
>> - if (!cfs_rq_has_blocked(cfs_rq))
>> + if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
>> rq->has_blocked_load = 0;
>> #endif
>> rq_unlock_irqrestore(rq, &rf);
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index e6ecbb2..213b922 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -309,3 +309,26 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
>>
>> return 0;
>> }
>> +
>> +/*
>> + * rt_rq:
>> + *
>> + * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
>> + * util_sum = cpu_scale * load_sum
>> + * runnable_load_sum = load_sum
>> + *
>> + */
>> +
>> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
>> +{
>> + if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
>> + running,
>> + running,
>> + running)) {
>> +
>
> Not needed empty line?

yes probably.

This empty is coming from the copy/paste of __update_load_avg_cfs_rq()
I will consolidate this in the next version

>
>> + ___update_load_avg(&rq->avg_rt, 1, 1);
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>> index 9cac73e..b2983b7 100644
>> --- a/kernel/sched/pelt.h
>> +++ b/kernel/sched/pelt.h
>> @@ -3,6 +3,7 @@
>> int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
>> int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
>> int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
>> +int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>>
>> /*
>> * When a task is dequeued, its estimated utilization should not be update if
>> @@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>> return 0;
>> }
>>
>> +static inline int
>> +update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
>> +{
>> + return 0;
>> +}
>> +
>> #endif
>>
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef3c4e6..b4148a9 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -5,6 +5,8 @@
>> */
>> #include "sched.h"
>>
>> +#include "pelt.h"
>> +
>> int sched_rr_timeslice = RR_TIMESLICE;
>> int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
>>
>> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>> rt_queue_push_tasks(rq);
>>
>> + update_rt_rq_load_avg(rq_clock_task(rq), rq,
>> + rq->curr->sched_class == &rt_sched_class);
>> +
>> return p;
>> }
>>
>> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>> {
>> update_curr_rt(rq);
>>
>> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +
>> /*
>> * The previous task needs to be made eligible for pushing
>> * if it is still active
>> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
>> struct sched_rt_entity *rt_se = &p->rt;
>>
>> update_curr_rt(rq);
>> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>
> Mmm... not entirely sure... can't we fold
> update_rt_rq_load_avg() into update_curr_rt() ?
>
> Currently update_curr_rt() is used in:
> dequeue_task_rt
> pick_next_task_rt
> put_prev_task_rt
> task_tick_rt
>
> while we update_rt_rq_load_avg() only in:
> pick_next_task_rt
> put_prev_task_rt
> task_tick_rt
> and
> update_blocked_averages
>
> Why don't we need to update at dequeue_task_rt() time?

We are tracking the rt rq and not sched entities, so we want to know when
the rt sched class is or is not the running class on the rq. Tracking
dequeue_task_rt is useless

>
>>
>> watchdog(rq, p);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 757a3ee..7a16de9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -592,6 +592,7 @@ struct rt_rq {
>> unsigned long rt_nr_total;
>> int overloaded;
>> struct plist_head pushable_tasks;
>> +
>> #endif /* CONFIG_SMP */
>> int rt_queued;
>>
>> @@ -847,6 +848,7 @@ struct rq {
>>
>> u64 rt_avg;
>> u64 age_stamp;
>> + struct sched_avg avg_rt;
>> u64 idle_stamp;
>> u64 avg_idle;
>>
>> @@ -2205,4 +2207,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
>>
>> return util;
>> }
>> +
>> +static inline unsigned long cpu_util_rt(struct rq *rq)
>> +{
>> + return rq->avg_rt.util_avg;
>
> READ_ONCE?
>
>> +}
>> #endif
>> --
>> 2.7.4
>>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-05-29 14:57:02

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

On Friday 25 May 2018 at 19:04:55 (+0100), Patrick Bellasi wrote:
> On 25-May 15:26, Quentin Perret wrote:
> > And also, I understand these functions are large, but if we _really_
> > want to inline them even though they're big, why not putting them in
> > sched-pelt.h ?
>
> Had the same thought at first... but then I recalled that header is
> generated from a script. Thus, eventually, it should be a different one.

Ah, good point. This patch already introduces a pelt.h so I guess that
could work as well.

>
> > We probably wouldn't accept that for everything, but
> > those PELT functions are used all over the place, including latency
> > sensitive code paths (e.g. task wake-up).
>
> We should better measure the overheads, if any, and check what
> (a modern) compiler does. Maybe some hackbench run could help on the
> first point.

FWIW, I ran a few hackbench tests today on my Intel box:
- Intel i7-6700 (4 cores / 8 threads) @ 3.40GHz
- Base kernel: today's tip/sched/core "2539fc82aa9b sched/fair: Update
util_est before updating schedutil"
- Compiler: GCC 7.3.0

The tables below summarize the results for:
perf stat --repeat 10 perf bench sched messaging --pipe --thread -l 50000 --group G

Without patch:
+---+-------+----------+---------+
| G | Tasks | Duration | Stddev |
+---+-------+----------+---------+
| 1 | 40 | 3.906 | +-0.84% |
| 2 | 80 | 8.569 | +-0.77% |
| 4 | 160 | 16.384 | +-0.46% |
| 8 | 320 | 33.686 | +-0.42% |
+---+-------+----------+---------+

With patch:
+---+-------+----------------+---------+
| G | Tasks | Duration | Stddev |
+---+-------+----------------+---------+
| 1 | 40 | 3.953 (+1.2%) | +-1.43% |
| 2 | 80 | 8.646 (+0.9%) | +-0.32% |
| 4 | 160 | 16.390 (+0.0%) | +-0.38% |
| 8 | 320 | 33.992 (+0.9%) | +-0.27% |
+---+-------+----------------+---------+

So there is (maybe) a little something on my box, but not so significant
IMHO ... :)

Thanks,
Quentin

2018-05-29 15:05:00

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

Hi Quentin,

On 29 May 2018 at 16:55, Quentin Perret <[email protected]> wrote:
>
> On Friday 25 May 2018 at 19:04:55 (+0100), Patrick Bellasi wrote:
> > On 25-May 15:26, Quentin Perret wrote:
> > > And also, I understand these functions are large, but if we _really_
> > > want to inline them even though they're big, why not putting them in
> > > sched-pelt.h ?
> >
> > Had the same thought at first... but then I recalled that header is
> > generated from a script. Thus, eventually, it should be a different one.
>
> Ah, good point. This patch already introduces a pelt.h so I guess that
> could work as well.
>
> >
> > > We probably wouldn't accept that for everything, but
> > > those PELT functions are used all over the place, including latency
> > > sensitive code paths (e.g. task wake-up).
> >
> > We should better measure the overheads, if any, and check what
> > (a modern) compiler does. Maybe some hackbench run could help on the
> > first point.
>
> FWIW, I ran a few hackbench tests today on my Intel box:
> - Intel i7-6700 (4 cores / 8 threads) @ 3.40GHz
> - Base kernel: today's tip/sched/core "2539fc82aa9b sched/fair: Update
> util_est before updating schedutil"
> - Compiler: GCC 7.3.0

Which cpufreq governor are you using ?

>
> The tables below summarize the results for:
> perf stat --repeat 10 perf bench sched messaging --pipe --thread -l 50000 --group G
>
> Without patch:
> +---+-------+----------+---------+
> | G | Tasks | Duration | Stddev |
> +---+-------+----------+---------+
> | 1 | 40 | 3.906 | +-0.84% |
> | 2 | 80 | 8.569 | +-0.77% |
> | 4 | 160 | 16.384 | +-0.46% |
> | 8 | 320 | 33.686 | +-0.42% |
> +---+-------+----------+---------+
>
> With patch:

Just to make sure. You mean only this patch and not the whole patchset ?

> +---+-------+----------------+---------+
> | G | Tasks | Duration | Stddev |
> +---+-------+----------------+---------+
> | 1 | 40 | 3.953 (+1.2%) | +-1.43% |
> | 2 | 80 | 8.646 (+0.9%) | +-0.32% |
> | 4 | 160 | 16.390 (+0.0%) | +-0.38% |
> | 8 | 320 | 33.992 (+0.9%) | +-0.27% |
> +---+-------+----------------+---------+
>
> So there is (maybe) a little something on my box, but not so significant
> IMHO ... :)
>
> Thanks,
> Quentin

2018-05-29 15:05:34

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 01/10] sched/pelt: Move pelt related code in a dedicated file

On Tuesday 29 May 2018 at 17:02:29 (+0200), Vincent Guittot wrote:
> Hi Quentin,
>
> On 29 May 2018 at 16:55, Quentin Perret <[email protected]> wrote:
> >
> > On Friday 25 May 2018 at 19:04:55 (+0100), Patrick Bellasi wrote:
> > > On 25-May 15:26, Quentin Perret wrote:
> > > > And also, I understand these functions are large, but if we _really_
> > > > want to inline them even though they're big, why not putting them in
> > > > sched-pelt.h ?
> > >
> > > Had the same thought at first... but then I recalled that header is
> > > generated from a script. Thus, eventually, it should be a different one.
> >
> > Ah, good point. This patch already introduces a pelt.h so I guess that
> > could work as well.
> >
> > >
> > > > We probably wouldn't accept that for everything, but
> > > > those PELT functions are used all over the place, including latency
> > > > sensitive code paths (e.g. task wake-up).
> > >
> > > We should better measure the overheads, if any, and check what
> > > (a modern) compiler does. Maybe some hackbench run could help on the
> > > first point.
> >
> > FWIW, I ran a few hackbench tests today on my Intel box:
> > - Intel i7-6700 (4 cores / 8 threads) @ 3.40GHz
> > - Base kernel: today's tip/sched/core "2539fc82aa9b sched/fair: Update
> > util_est before updating schedutil"
> > - Compiler: GCC 7.3.0
>
> Which cpufreq governor are you using ?

powersave, with the intel_pstate driver.

>
> >
> > The tables below summarize the results for:
> > perf stat --repeat 10 perf bench sched messaging --pipe --thread -l 50000 --group G
> >
> > Without patch:
> > +---+-------+----------+---------+
> > | G | Tasks | Duration | Stddev |
> > +---+-------+----------+---------+
> > | 1 | 40 | 3.906 | +-0.84% |
> > | 2 | 80 | 8.569 | +-0.77% |
> > | 4 | 160 | 16.384 | +-0.46% |
> > | 8 | 320 | 33.686 | +-0.42% |
> > +---+-------+----------+---------+
> >
> > With patch:
>
> Just to make sure. You mean only this patch and not the whole patchset ?

That's right, I applied only this patch.

>
> > +---+-------+----------------+---------+
> > | G | Tasks | Duration | Stddev |
> > +---+-------+----------------+---------+
> > | 1 | 40 | 3.953 (+1.2%) | +-1.43% |
> > | 2 | 80 | 8.646 (+0.9%) | +-0.32% |
> > | 4 | 160 | 16.390 (+0.0%) | +-0.38% |
> > | 8 | 320 | 33.992 (+0.9%) | +-0.27% |
> > +---+-------+----------------+---------+
> >
> > So there is (maybe) a little something on my box, but not so significant
> > IMHO ... :)
> >
> > Thanks,
> > Quentin

2018-05-30 07:04:09

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 25-05-18, 15:12, Vincent Guittot wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> /*
> * Utilization required by DEADLINE must always be granted while, for
> @@ -197,7 +205,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> + return min(sg_cpu->max, util);

Need to update comment above this line to include RT in that ?

> }

--
viresh

2018-05-30 08:25:21

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 30 May 2018 at 09:03, Viresh Kumar <[email protected]> wrote:
> On 25-05-18, 15:12, Vincent Guittot wrote:
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> /*
>> * Utilization required by DEADLINE must always be granted while, for
>> @@ -197,7 +205,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>> * ready for such an interface. So, we only do the latter for now.
>> */
>> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
>> + return min(sg_cpu->max, util);
>
> Need to update comment above this line to include RT in that ?

yes good point

>
>> }
>
> --
> viresh

2018-05-30 08:39:05

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> On 29/05/18 09:40, Quentin Perret wrote:
> > Hi Vincent,
> >
> > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > Now that we have both the dl class bandwidth requirement and the dl class
> > > utilization, we can use the max of the 2 values when aggregating the
> > > utilization of the CPU.
> > >
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > > kernel/sched/sched.h | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 4526ba6..0eb07a8 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > static inline unsigned long cpu_util_dl(struct rq *rq)
> > > {
> > > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> >
> > Would it make sense to use a UTIL_EST version of that signal here ? I
> > don't think that would make sense for the RT class with your patch-set
> > since you only really use the blocked part of the signal for RT IIUC,
> > but would that work for DL ?
>
> Well, UTIL_EST for DL looks pretty much what we already do by computing
> utilization based on dl.running_bw. That's why I was thinking of using
> that as a starting point for dl.util_avg decay phase.

Hmmm I see your point, but running_bw and the util_avg are fundamentally
different ... I mean, the util_avg doesn't know about the period, which is
an issue in itself I guess ...

If you have a long running DL task (say 100ms runtime) with a long period
(say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
the util_avg can go quite high, which means that you might end up
executing this task at max OPP. So if we really want to drive OPPs like
that for deadline, a util_est-like version of this util_avg signal
should help. Now, you can also argue that going to max OPP for a task
that _we know_ uses 1/10 of the CPU capacity isn't right ...
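
Rough numbers, assuming the usual 32ms PELT half-life and
SCHED_CAPACITY_SCALE = 1024 (back-of-the-envelope only, not measured):

/* standalone toy program */
#include <stdio.h>
#include <math.h>

int main(void)
{
        double bw_util = 1024.0 * 100 / 1000;   /* running_bw view: ~102 */
        /* util_avg at the end of a 100ms burst of continuous running,
         * using the geometric approximation 1024 * (1 - 0.5^(t/32ms)) */
        double pelt_util = 1024.0 * (1.0 - pow(0.5, 100.0 / 32.0));     /* ~907 */

        printf("running_bw ~%.0f vs util_avg ~%.0f\n", bw_util, pelt_util);
        return 0;
}

So the two views can legitimately disagree by a factor of ~9 for such a task.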

2018-05-30 08:52:51

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 30/05/18 09:37, Quentin Perret wrote:
> On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> > On 29/05/18 09:40, Quentin Perret wrote:
> > > Hi Vincent,
> > >
> > > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > > Now that we have both the dl class bandwidth requirement and the dl class
> > > > utilization, we can use the max of the 2 values when aggregating the
> > > > utilization of the CPU.
> > > >
> > > > Signed-off-by: Vincent Guittot <[email protected]>
> > > > ---
> > > > kernel/sched/sched.h | 6 +++++-
> > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > index 4526ba6..0eb07a8 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > > > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > > static inline unsigned long cpu_util_dl(struct rq *rq)
> > > > {
> > > > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > +
> > > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > >
> > > Would it make sense to use a UTIL_EST version of that signal here ? I
> > > don't think that would make sense for the RT class with your patch-set
> > > since you only really use the blocked part of the signal for RT IIUC,
> > > but would that work for DL ?
> >
> > Well, UTIL_EST for DL looks pretty much what we already do by computing
> > utilization based on dl.running_bw. That's why I was thinking of using
> > that as a starting point for dl.util_avg decay phase.
>
> Hmmm I see your point, but running_bw and the util_avg are fundamentally
> different ... I mean, the util_avg doesn't know about the period, which is
> an issue in itself I guess ...
>
> If you have a long running DL task (say 100ms runtime) with a long period
> (say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
> the util_avg can go quite high, which means that you might end up
> executing this task at max OPP. So if we really want to drive OPPs like
> that for deadline, a util_est-like version of this util_avg signal
> should help. Now, you can also argue that going to max OPP for a task
> that _we know_ uses 1/10 of the CPU capacity isn't right ...

Yep, that's my point. :)

2018-05-30 09:32:55

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

On 29-May 15:29, Vincent Guittot wrote:
> Hi Patrick,
> >> +static inline bool rt_rq_has_blocked(struct rq *rq)
> >> +{
> >> + if (rq->avg_rt.util_avg)
> >
> > Should use READ_ONCE?
>
> I was expecting that there would be only one read by default, but I can
> add READ_ONCE

I would say here it's required mainly for "documentation" purposes,
since we can use this function from non rq-locked paths, e.g.

update_sg_lb_stats()
update_nohz_stats()
update_blocked_averages()
rt_rq_has_blocked()

Thus, AFAIU, we should use READ_ONCE to "flag" that the value can
potentially be updated concurrently?

> >
> >> + return true;
> >> +
> >> + return false;
> >
> > What about just:
> >
> > return READ_ONCE(rq->avg_rt.util_avg);
> >
> > ?
>
> This function is renamed and extended with other tracking signals in the
> following patches, so we have to test several values in the function.
> That's also why there is the if test: additional if tests are
> going to be added

Right, makes sense.

[...]

> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index ef3c4e6..b4148a9 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -5,6 +5,8 @@
> >> */
> >> #include "sched.h"
> >>
> >> +#include "pelt.h"
> >> +
> >> int sched_rr_timeslice = RR_TIMESLICE;
> >> int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
> >>
> >> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >>
> >> rt_queue_push_tasks(rq);
> >>
> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq,
> >> + rq->curr->sched_class == &rt_sched_class);
> >> +
> >> return p;
> >> }
> >>
> >> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> >> {
> >> update_curr_rt(rq);
> >>
> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> >> +
> >> /*
> >> * The previous task needs to be made eligible for pushing
> >> * if it is still active
> >> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
> >> struct sched_rt_entity *rt_se = &p->rt;
> >>
> >> update_curr_rt(rq);
> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> >
> > Mmm... not entirely sure... can't we fold
> > update_rt_rq_load_avg() into update_curr_rt() ?
> >
> > Currently update_curr_rt() is used in:
> > dequeue_task_rt
> > pick_next_task_rt
> > put_prev_task_rt
> > task_tick_rt
> >
> > while we update_rt_rq_load_avg() only in:
> > pick_next_task_rt
> > put_prev_task_rt
> > task_tick_rt
> > and
> > update_blocked_averages
> >
> > Why don't we need to update at dequeue_task_rt() time?
>
> We are tracking the rt rq and not sched entities, so we want to know
> whether the rt sched class is the running class on the rq or not.
> Tracking dequeue_task_rt is useless

What about (push) migrations?

--
#include <best/regards.h>

Patrick Bellasi

2018-05-30 09:41:02

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 25-May 15:12, Vincent Guittot wrote:
> Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
> can preempt and steal cfs's running time.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 28592b6..a84b5a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -56,6 +56,7 @@ struct sugov_cpu {
> /* The fields below are only needed when sharing a policy: */
> unsigned long util_cfs;
> unsigned long util_dl;
> + unsigned long util_rt;
> unsigned long max;
>
> /* The field below is for single-CPU policies only: */
> @@ -178,14 +179,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> sg_cpu->util_cfs = cpu_util_cfs(rq);
> sg_cpu->util_dl = cpu_util_dl(rq);
> + sg_cpu->util_rt = cpu_util_rt(rq);
> }
>
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long util;
>
> - if (rq->rt.rt_nr_running)
> - return sg_cpu->max;
> + if (rq->rt.rt_nr_running) {
> + util = sg_cpu->max;

Why not just adding the following lines while keeping the return in
for the rq->rt.rt_nr_running case?

> + } else {
> + util = sg_cpu->util_dl;
> + util += sg_cpu->util_cfs;
> + util += sg_cpu->util_rt;
> + }
>
> /*
> * Utilization required by DEADLINE must always be granted while, for
> @@ -197,7 +205,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> + return min(sg_cpu->max, util);

... for the rq->rt.rt_nr_running case we don't really need to min
clamp util = sg_cpu->max with itself...

> }
>
> static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
> --
> 2.7.4
>

--
#include <best/regards.h>

Patrick Bellasi

2018-05-30 09:55:14

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 30 May 2018 at 11:40, Patrick Bellasi <[email protected]> wrote:
> On 25-May 15:12, Vincent Guittot wrote:
>> Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
>> can preempt and steal cfs's running time.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/cpufreq_schedutil.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index 28592b6..a84b5a5 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -56,6 +56,7 @@ struct sugov_cpu {
>> /* The fields below are only needed when sharing a policy: */
>> unsigned long util_cfs;
>> unsigned long util_dl;
>> + unsigned long util_rt;
>> unsigned long max;
>>
>> /* The field below is for single-CPU policies only: */
>> @@ -178,14 +179,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>> sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
>> sg_cpu->util_cfs = cpu_util_cfs(rq);
>> sg_cpu->util_dl = cpu_util_dl(rq);
>> + sg_cpu->util_rt = cpu_util_rt(rq);
>> }
>>
>> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> {
>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>> + unsigned long util;
>>
>> - if (rq->rt.rt_nr_running)
>> - return sg_cpu->max;
>> + if (rq->rt.rt_nr_running) {
>> + util = sg_cpu->max;
>
> Why not just adding the following lines while keeping the return in
> for the rq->rt.rt_nr_running case?
>
>> + } else {
>> + util = sg_cpu->util_dl;
>> + util += sg_cpu->util_cfs;
>> + util += sg_cpu->util_rt;
>> + }
>>
>> /*
>> * Utilization required by DEADLINE must always be granted while, for
>> @@ -197,7 +205,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>> * ready for such an interface. So, we only do the latter for now.
>> */
>> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
>> + return min(sg_cpu->max, util);
>
> ... for the rq->rt.rt_nr_running case we don't really need to min
> clamp util = sg_cpu->max with itself...

yes good point

>
>> }
>>
>> static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
>> --
>> 2.7.4
>>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
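
Putting the two remarks above together, the shape Patrick suggests and Vincent
agrees to would presumably look like this (a sketch, not the actual respin):

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
	struct rq *rq = cpu_rq(sg_cpu->cpu);
	unsigned long util;

	/* Early return: no point in min-clamping max with itself. */
	if (rq->rt.rt_nr_running)
		return sg_cpu->max;

	util  = sg_cpu->util_dl;
	util += sg_cpu->util_cfs;
	util += sg_cpu->util_rt;

	return min(sg_cpu->max, util);
}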

2018-05-30 10:09:04

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

On 30 May 2018 at 11:32, Patrick Bellasi <[email protected]> wrote:
> On 29-May 15:29, Vincent Guittot wrote:
>> Hi Patrick,
>> >> +static inline bool rt_rq_has_blocked(struct rq *rq)
>> >> +{
>> >> + if (rq->avg_rt.util_avg)
>> >
>> > Should use READ_ONCE?
>>
>> I was expecting that there will be only one read by default but I can
>> add READ_ONCE
>
> I would say here it's required mainly for "documentation" purposes,
> since we can use this function from non rq-locked paths, e.g.
>
> update_sg_lb_stats()
> update_nohz_stats()
> update_blocked_averages()
> rt_rq_has_blocked()
>
> Thus, AFAIU, we should use READ_ONCE to "flag" that the value can
> potentially be updated concurrently?

yes

>
>> >
>> >> + return true;
>> >> +
>> >> + return false;
>> >
>> > What about just:
>> >
>> > return READ_ONCE(rq->avg_rt.util_avg);
>> >
>> > ?
>>
>> This function is renamed and extended with the other classes' tracking
>> in the following patches, so we have to test several values in the
>> function. That's also why there is the if test: additional if tests
>> are going to be added
>
> Right, makes sense.
>
> [...]
>
>> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> >> index ef3c4e6..b4148a9 100644
>> >> --- a/kernel/sched/rt.c
>> >> +++ b/kernel/sched/rt.c
>> >> @@ -5,6 +5,8 @@
>> >> */
>> >> #include "sched.h"
>> >>
>> >> +#include "pelt.h"
>> >> +
>> >> int sched_rr_timeslice = RR_TIMESLICE;
>> >> int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
>> >>
>> >> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> >>
>> >> rt_queue_push_tasks(rq);
>> >>
>> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq,
>> >> + rq->curr->sched_class == &rt_sched_class);
>> >> +
>> >> return p;
>> >> }
>> >>
>> >> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>> >> {
>> >> update_curr_rt(rq);
>> >>
>> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> >> +
>> >> /*
>> >> * The previous task needs to be made eligible for pushing
>> >> * if it is still active
>> >> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
>> >> struct sched_rt_entity *rt_se = &p->rt;
>> >>
>> >> update_curr_rt(rq);
>> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> >
>> > Mmm... not entirely sure... can't we fold
>> > update_rt_rq_load_avg() into update_curr_rt() ?
>> >
>> > Currently update_curr_rt() is used in:
>> > dequeue_task_rt
>> > pick_next_task_rt
>> > put_prev_task_rt
>> > task_tick_rt
>> >
>> > while we update_rt_rq_load_avg() only in:
>> > pick_next_task_rt
>> > put_prev_task_rt
>> > task_tick_rt
>> > and
>> > update_blocked_averages
>> >
>> > Why don't we need to update at dequeue_task_rt() time?
>>
>> We are tracking the rt rq and not sched entities, so we want to know
>> whether the rt sched class is the running class on the rq or not.
>> Tracking dequeue_task_rt is useless
>
> What about (push) migrations?

it doesn't make any difference. put_prev_task_rt() says that the prev
task that was running was an rt task, so we can account the past time
as rt running time, and pick_next_task_rt() says that the next one
will be an rt task, so we have to account the elapsed time either as
rt or as non-rt time accordingly.

I can probably optimize the pick_next_task_rt by doing the below instead:

if (rq->curr->sched_class == &rt_sched_class)
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);

If prev task is a rt task, put_prev_task_rt has already done the update

>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
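
(Side note: read literally, the condition in the snippet proposed above fires
exactly when put_prev_task_rt() has already done the update; the optimization
described in words is presumably the inverted test, something like the
hypothetical sketch below, not the text as posted.)

	/* prev was not an rt task: only decay, nothing is running in rt */
	if (rq->curr->sched_class != &rt_sched_class)
		update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);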

2018-05-30 10:52:08

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 04/10] sched/dl: add dl_rq utilization tracking

On 25-May 15:12, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fb18bcc..967e873 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7290,11 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> return false;
> }
>
> -static inline bool rt_rq_has_blocked(struct rq *rq)
> +static inline bool others_rqs_have_blocked(struct rq *rq)

Here you are going to fold in IRQ's utilization which, strictly
speaking, is not a RQ. Moreover, we are checking only utilization.

Can we use a better matching name? E.g.
others_have_blocked_util
non_cfs_blocked_util
?

> {
> if (rq->avg_rt.util_avg)
> return true;
>
> + if (rq->avg_dl.util_avg)
> + return true;
> +
> return false;
> }
>
--
#include <best/regards.h>

Patrick Bellasi

2018-05-30 11:02:27

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

On 30-May 12:06, Vincent Guittot wrote:
> On 30 May 2018 at 11:32, Patrick Bellasi <[email protected]> wrote:
> > On 29-May 15:29, Vincent Guittot wrote:

[...]

> >> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> >> index ef3c4e6..b4148a9 100644
> >> >> --- a/kernel/sched/rt.c
> >> >> +++ b/kernel/sched/rt.c
> >> >> @@ -5,6 +5,8 @@
> >> >> */
> >> >> #include "sched.h"
> >> >>
> >> >> +#include "pelt.h"
> >> >> +
> >> >> int sched_rr_timeslice = RR_TIMESLICE;
> >> >> int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
> >> >>
> >> >> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >> >>
> >> >> rt_queue_push_tasks(rq);
> >> >>
> >> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq,
> >> >> + rq->curr->sched_class == &rt_sched_class);
> >> >> +
> >> >> return p;
> >> >> }
> >> >>
> >> >> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
> >> >> {
> >> >> update_curr_rt(rq);
> >> >>
> >> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> >> >> +
> >> >> /*
> >> >> * The previous task needs to be made eligible for pushing
> >> >> * if it is still active
> >> >> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
> >> >> struct sched_rt_entity *rt_se = &p->rt;
> >> >>
> >> >> update_curr_rt(rq);
> >> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> >> >
> >> > Mmm... not entirely sure... can't we fold
> >> > update_rt_rq_load_avg() into update_curr_rt() ?
> >> >
> >> > Currently update_curr_rt() is used in:
> >> > dequeue_task_rt
> >> > pick_next_task_rt
> >> > put_prev_task_rt
> >> > task_tick_rt
> >> >
> >> > while we update_rt_rq_load_avg() only in:
> >> > pick_next_task_rt
> >> > put_prev_task_rt
> >> > task_tick_rt
> >> > and
> >> > update_blocked_averages
> >> >
> >> > Why don't we need to update at dequeue_task_rt() time?
> >>
> >> We are tracking the rt rq and not sched entities, so we want to know
> >> whether the rt sched class is the running class on the rq or not.
> >> Tracking dequeue_task_rt is useless
> >
> > What about (push) migrations?
>
> it doesn't make any difference. put_prev_task_rt() says that the prev
> task that was running was an rt task, so we can account the past time
> as rt running time, and pick_next_task_rt() says that the next one
> will be an rt task, so we have to account the elapsed time either as
> rt or as non-rt time accordingly.

Right, I was missing that you are tracking RT (and DL) only at RQ
level... not SE level, thus we will not see migrations of blocked
utilization.

> I can probably optimize the pick_next_task_rt by doing the below instead:
>
> if (rq->curr->sched_class == &rt_sched_class)
> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>
> If prev task is a rt task, put_prev_task_rt has already done the update

Right.

Just one more question about not tracking SEs. Once we migrate an RT
task with the current solution, we will have to wait for its PELT
blocked utilization to decay completely before starting to ignore that
task's contribution, which means that:
1. we will see a higher utilization on the original CPU
2. we don't immediately see the increased utilization on the
destination CPU

I remember Juri had some patches to track SE utilization, thus fixing
the two issues above. Can you remind me why we decided to go just
for the RQ tracking solution?
Don't we expect any strange behaviors on real systems when RT tasks
are moved around?

Perhaps we should run some tests on Android...

--
#include <best/regards.h>

Patrick Bellasi
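
For scale, a rough standalone sketch of how quickly such a leftover blocked
contribution fades, assuming the usual 32ms PELT half-life (userspace
arithmetic, not code from the series):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* PELT decay per 1ms period */
	double util = 512.0;			/* leftover rt util_avg, say */

	/* halves every 32ms: 512, 256, 128, ... ~2 after 256ms */
	for (int ms = 0; ms <= 256; ms += 32)
		printf("%3d ms after migration: ~%.0f\n", ms, util * pow(y, ms));

	return 0;
}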

2018-05-30 11:52:35

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 04/10] sched/dl: add dl_rq utilization tracking

On 30 May 2018 at 12:50, Patrick Bellasi <[email protected]> wrote:
> On 25-May 15:12, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fb18bcc..967e873 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7290,11 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> return false;
>> }
>>
>> -static inline bool rt_rq_has_blocked(struct rq *rq)
>> +static inline bool others_rqs_have_blocked(struct rq *rq)
>
> Here you are going to fold in IRQ's utilization which, strictly
> speaking, is not a RQ. Moreover, we are checking only utilization.
>
> Can we use a better matching name? E.g.
> others_have_blocked_util
> non_cfs_blocked_util

others_have_blocked looks ok and consistent with cfs_rq_has_blocked

> ?
>
>> {
>> if (rq->avg_rt.util_avg)
>> return true;
>>
>> + if (rq->avg_dl.util_avg)
>> + return true;
>> +
>> return false;
>> }
>>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-05-30 14:41:23

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 02/10] sched/rt: add rt_rq utilization tracking

On 30 May 2018 at 13:01, Patrick Bellasi <[email protected]> wrote:
> On 30-May 12:06, Vincent Guittot wrote:
>> On 30 May 2018 at 11:32, Patrick Bellasi <[email protected]> wrote:
>> > On 29-May 15:29, Vincent Guittot wrote:
>
> [...]
>
>> >> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> >> >> index ef3c4e6..b4148a9 100644
>> >> >> --- a/kernel/sched/rt.c
>> >> >> +++ b/kernel/sched/rt.c
>> >> >> @@ -5,6 +5,8 @@
>> >> >> */
>> >> >> #include "sched.h"
>> >> >>
>> >> >> +#include "pelt.h"
>> >> >> +
>> >> >> int sched_rr_timeslice = RR_TIMESLICE;
>> >> >> int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
>> >> >>
>> >> >> @@ -1572,6 +1574,9 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>> >> >>
>> >> >> rt_queue_push_tasks(rq);
>> >> >>
>> >> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq,
>> >> >> + rq->curr->sched_class == &rt_sched_class);
>> >> >> +
>> >> >> return p;
>> >> >> }
>> >> >>
>> >> >> @@ -1579,6 +1584,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
>> >> >> {
>> >> >> update_curr_rt(rq);
>> >> >>
>> >> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> >> >> +
>> >> >> /*
>> >> >> * The previous task needs to be made eligible for pushing
>> >> >> * if it is still active
>> >> >> @@ -2308,6 +2315,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
>> >> >> struct sched_rt_entity *rt_se = &p->rt;
>> >> >>
>> >> >> update_curr_rt(rq);
>> >> >> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> >> >
>> >> > Mmm... not entirely sure... can't we fold
>> >> > update_rt_rq_load_avg() into update_curr_rt() ?
>> >> >
>> >> > Currently update_curr_rt() is used in:
>> >> > dequeue_task_rt
>> >> > pick_next_task_rt
>> >> > put_prev_task_rt
>> >> > task_tick_rt
>> >> >
>> >> > while we update_rt_rq_load_avg() only in:
>> >> > pick_next_task_rt
>> >> > put_prev_task_rt
>> >> > task_tick_rt
>> >> > and
>> >> > update_blocked_averages
>> >> >
>> >> > Why don't we need to update at dequeue_task_rt() time?
>> >>
>> >> We are tracking the rt rq and not sched entities, so we want to know
>> >> whether the rt sched class is the running class on the rq or not.
>> >> Tracking dequeue_task_rt is useless
>> >
>> > What about (push) migrations?
>>
>> it doesn't make any difference. put_prev_task_rt() says that the prev
>> task that was running was an rt task, so we can account the past time
>> as rt running time, and pick_next_task_rt() says that the next one
>> will be an rt task, so we have to account the elapsed time either as
>> rt or as non-rt time accordingly.
>
> Right, I was missing that you are tracking RT (and DL) only at RQ
> level... not SE level, thus we will not see migrations of blocked
> utilization.
>
>> I can probably optimize the pick_next_task_rt by doing the below instead:
>>
>> if (rq->curr->sched_class == &rt_sched_class)
>> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>>
>> If prev task is a rt task, put_prev_task_rt has already done the update
>
> Right.
>
> Just one more question about not tracking SEs. Once we migrate an RT
> task with the current solution, we will have to wait for its PELT
> blocked utilization to decay completely before starting to ignore that
> task's contribution, which means that:
> 1. we will see a higher utilization on the original CPU
> 2. we don't immediately see the increased utilization on the
> destination CPU
>
> I remember Juri had some patches to track SE utilization, thus fixing
> the two issues above. Can you remind me why we decided to go just
> for the RQ tracking solution?

I would say that one main reason is the overhead of tracking per SE.

Then, what we want is to track the other classes' utilization to
replace the current rt_avg.

And we want something to track the steal time of cfs, to compensate
for the fact that cfs util_avg will be lower than what cfs really
needs. So we really want rt util_avg to smoothly decrease when an rt
task migrates, to leave time for cfs util_avg to smoothly increase as
cfs tasks run more often.

Based on some discussions on IRC, I'm studying how to track the stolen
time more accurately

> Don't we expect any strange behaviors on real systems when RT tasks
> are moved around?

Which kind of strange behavior? We don't use rt util_avg for OPP
selection when an rt task is running

>
> Perhaps we should run some tests on Android...
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-05-30 15:57:34

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

On 05/25/2018 03:12 PM, Vincent Guittot wrote:
> interrupt and steal time are the only remaining activities tracked by
> rt_avg. Like for sched classes, we can use PELT to track their average
> utilization of the CPU. But unlike sched class, we don't track when
> entering/leaving interrupt; Instead, we take into account the time spent
> under interrupt context when we update rqs' clock (rq_clock_task).
> This also means that we have to decay the normal context time and account
> for interrupt time during the update.
>
> That's also important to note that because
> rq_clock == rq_clock_task + interrupt time
> and rq_clock_task is used by a sched class to compute its utilization, the
> util_avg of a sched class only reflects the utilization of the time spent
> in normal context and not of the whole time of the CPU. The utilization of
> interrupt gives a more accurate level of utilization of the CPU.
> The CPU utilization is :
> avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
>
> Most of the time, avg_irq is small and negligible so the use of the
> approximation CPU utilization = /Sum avg_rq was enough

[...]

> @@ -7362,6 +7363,7 @@ static void update_blocked_averages(int cpu)
> }
> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> + update_irq_load_avg(rq, 0);

So this one decays the signals only in case the update_rq_clock_task()
didn't call update_irq_load_avg() because 'irq_delta + steal' is 0, right?

[...]

> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 3d5bd3a..d2e4f21 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -355,3 +355,41 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>
> return 0;
> }
> +
> +/*
> + * irq:
> + *
> + * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
> + * util_sum = cpu_scale * load_sum
> + * runnable_load_sum = load_sum
> + *
> + */
> +
> +int update_irq_load_avg(struct rq *rq, u64 running)
> +{
> + int ret = 0;
> + /*
> + * We know the time that has been used by interrupt since last update
> + * but we don't when. Let be pessimistic and assume that interrupt has
> + * happened just before the update. This is not so far from reality
> + * because interrupt will most probably wake up task and trig an update
> + * of rq clock during which the metric si updated.
> + * We start to decay with normal context time and then we add the
> + * interrupt context time.
> + * We can safely remove running from rq->clock because
> + * rq->clock += delta with delta >= running

This is true as long as update_irq_load_avg() with a 'running != 0' is
called only after rq->clock moved forward (rq->clock += delta) (which is
true for update_rq_clock()->update_rq_clock_task()).

> + */
> + ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
> + 0,
> + 0,
> + 0);
> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
> + 1,
> + 1,
> + 1);

So you decay the signal in [sa->lut, rq->clock - running] (assumed to be
the portion of delta used by the task scheduler) and you increase it in
[rq->clock - running, rq->clock] (irq and virt portion of delta).

That means that this signal is updated on rq->clock whereas the others
are on rq->clock_task.

What about the ever growing clock diff between them? I see e.g ~6s after
20min uptime and up to 1.5ms 'running'.

It should be still safe to sum the sched class and irq signal in
sugov_aggregate_util() because they are independent, I guess.

[...]
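
For reference, a quick standalone check of the aggregation formula quoted from
the commit message above (avg_irq plus the task-clock signals scaled by the
wall-clock time left over after interrupts); SCHED_CAPACITY_SCALE = 1024 and
the example numbers are assumptions, not values from the thread:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* util = avg_irq + (1 - avg_irq / max) * sum(avg_rq) */
static unsigned long total_util(unsigned long avg_irq, unsigned long sum_avg_rq)
{
	unsigned long util;

	/* scale the rq_clock_task based part by (max - avg_irq) / max */
	util = sum_avg_rq * (SCHED_CAPACITY_SCALE - avg_irq) / SCHED_CAPACITY_SCALE;

	return avg_irq + util;
}

int main(void)
{
	/* e.g. ~10% of wall time in irq, rq signals summing to 600 */
	printf("%lu\n", total_util(102, 600));	/* ~642, not 702 */
	return 0;
}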

2018-05-30 16:47:58

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

Hi Vincent,

On Friday 25 May 2018 at 15:12:24 (+0200), Vincent Guittot wrote:
> Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
> can preempt and steal cfs's running time.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 28592b6..a84b5a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -56,6 +56,7 @@ struct sugov_cpu {
> /* The fields below are only needed when sharing a policy: */
> unsigned long util_cfs;
> unsigned long util_dl;
> + unsigned long util_rt;
> unsigned long max;
>
> /* The field below is for single-CPU policies only: */
> @@ -178,14 +179,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> sg_cpu->util_cfs = cpu_util_cfs(rq);
> sg_cpu->util_dl = cpu_util_dl(rq);
> + sg_cpu->util_rt = cpu_util_rt(rq);
> }
>
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long util;
>
> - if (rq->rt.rt_nr_running)
> - return sg_cpu->max;
> + if (rq->rt.rt_nr_running) {
> + util = sg_cpu->max;

So I understand why we want to go to max freq when an RT task is running,
but I think there are use cases where we might want to be more conservative
and use the util_avg of the RT rq instead. The first use case is
battery-powered devices where going to max isn't really affordable from
an energy standpoint. Android, for example, has been using a RT
utilization signal to select OPPs for quite a while now, because going
to max blindly is _very_ expensive.

And the second use-case is thermal pressure. On some modern CPUs, going to
max freq can lead to stringent thermal capping very quickly, at the
point where your CPUs might not have enough capacity to serve your tasks
properly. And that can ultimately hurt the very RT tasks you originally
tried to run fast. In these systems, in the long term, you'd be better off
not asking for more than what you really need ...

So what about having a sched_feature to select between going to max and
using the RT util_avg ? Obviously the default should keep the current
behaviour.

Thanks,
Quentin

2018-05-30 18:46:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

Hi Dietmar,

On 30 May 2018 at 17:55, Dietmar Eggemann <[email protected]> wrote:
> On 05/25/2018 03:12 PM, Vincent Guittot wrote:
>>
>> interrupt and steal time are the only remaining activities tracked by
>> rt_avg. Like for sched classes, we can use PELT to track their average
>> utilization of the CPU. But unlike sched class, we don't track when
>> entering/leaving interrupt; Instead, we take into account the time spent
>> under interrupt context when we update rqs' clock (rq_clock_task).
>> This also means that we have to decay the normal context time and account
>> for interrupt time during the update.
>>
>> That's also important to note that because
>> rq_clock == rq_clock_task + interrupt time
>> and rq_clock_task is used by a sched class to compute its utilization, the
>> util_avg of a sched class only reflects the utilization of the time spent
>> in normal context and not of the whole time of the CPU. The utilization of
>> interrupt gives a more accurate level of utilization of the CPU.
>> The CPU utilization is :
>> avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
>>
>> Most of the time, avg_irq is small and negligible so the use of the
>> approximation CPU utilization = /Sum avg_rq was enough
>
>
> [...]
>
>> @@ -7362,6 +7363,7 @@ static void update_blocked_averages(int cpu)
>> }
>> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
>> update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
>> + update_irq_load_avg(rq, 0);
>
>
> So this one decays the signals only in case the update_rq_clock_task()
> didn't call update_irq_load_avg() because 'irq_delta + steal' is 0, right?

yes

>
> [...]
>
>
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index 3d5bd3a..d2e4f21 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -355,3 +355,41 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int
>> running)
>> return 0;
>> }
>> +
>> +/*
>> + * irq:
>> + *
>> + * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
>> + * util_sum = cpu_scale * load_sum
>> + * runnable_load_sum = load_sum
>> + *
>> + */
>> +
>> +int update_irq_load_avg(struct rq *rq, u64 running)
>> +{
>> + int ret = 0;
>> + /*
>> + * We know the time that has been used by interrupt since last
>> update
>> + * but we don't when. Let be pessimistic and assume that interrupt
>> has
>> + * happened just before the update. This is not so far from
>> reality
>> + * because interrupt will most probably wake up task and trig an
>> update
>> + * of rq clock during which the metric si updated.
>> + * We start to decay with normal context time and then we add the
>> + * interrupt context time.
>> + * We can safely remove running from rq->clock because
>> + * rq->clock += delta with delta >= running
>
>
> This is true as long as update_irq_load_avg() with a 'running != 0' is called
> only after rq->clock moved forward (rq->clock += delta) (which is true for
> update_rq_clock()->update_rq_clock_task()).

yes

>
>> + */
>> + ret = ___update_load_sum(rq->clock - running, rq->cpu,
>> &rq->avg_irq,
>> + 0,
>> + 0,
>> + 0);
>> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
>> + 1,
>> + 1,
>> + 1);
>
>
> So you decay the signal in [sa->lut, rq->clock - running] (assumed to be the
> portion of delta used by the task scheduler) and you increase it in
> [rq->clock - running, rq->clock] (irq and virt portion of delta).
>
> That means that this signal is updated on rq->clock whereas the others are
> on rq->clock_task.
>
> What about the ever growing clock diff between them? I see e.g ~6s after
> 20min uptime and up to 1.5ms 'running'.
>
> It should be still safe to sum the sched class and irq signal in
> sugov_aggregate_util() because they are independent, I guess.

yes. the formula is explained in patch "cpufreq/schedutil: take into
account interrupt"


>
> [...]

2018-05-31 08:48:17

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 30/05/18 17:46, Quentin Perret wrote:
> Hi Vincent,
>
> On Friday 25 May 2018 at 15:12:24 (+0200), Vincent Guittot wrote:
> > Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
> > can preempt and steal cfs's running time.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/cpufreq_schedutil.c | 14 +++++++++++---
> > 1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 28592b6..a84b5a5 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -56,6 +56,7 @@ struct sugov_cpu {
> > /* The fields below are only needed when sharing a policy: */
> > unsigned long util_cfs;
> > unsigned long util_dl;
> > + unsigned long util_rt;
> > unsigned long max;
> >
> > /* The field below is for single-CPU policies only: */
> > @@ -178,14 +179,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> > sg_cpu->util_cfs = cpu_util_cfs(rq);
> > sg_cpu->util_dl = cpu_util_dl(rq);
> > + sg_cpu->util_rt = cpu_util_rt(rq);
> > }
> >
> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > + unsigned long util;
> >
> > - if (rq->rt.rt_nr_running)
> > - return sg_cpu->max;
> > + if (rq->rt.rt_nr_running) {
> > + util = sg_cpu->max;
>
> So I understand why we want to go to max freq when an RT task is running,
> but I think there are use cases where we might want to be more conservative
> and use the util_avg of the RT rq instead. The first use case is
> battery-powered devices where going to max isn't really affordable from
> an energy standpoint. Android, for example, has been using a RT
> utilization signal to select OPPs for quite a while now, because going
> to max blindly is _very_ expensive.
>
> And the second use-case is thermal pressure. On some modern CPUs, going to
> max freq can lead to stringent thermal capping very quickly, at the
> point where your CPUs might not have enough capacity to serve your tasks
> properly. And that can ultimately hurt the very RT tasks you originally
> tried to run fast. In these systems, in the long term, you'd be better off
> not asking for more than what you really need ...

Proposed the same at last LPC. Peter NAKed it (since RT is all about
meeting deadlines, and when using FIFO/RR we don't really know how fast
the CPU should go to meet them, so go to max is the only safe decision).

> So what about having a sched_feature to select between going to max and
> using the RT util_avg ? Obviously the default should keep the current
> behaviour.

Peter, would SCHED_FEAT make a difference? :)

Or Patrick's utilization capping applied to RT..

2018-05-31 10:28:25

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization


Hi Vincent, Juri,

On 28-May 18:34, Vincent Guittot wrote:
> On 28 May 2018 at 17:22, Juri Lelli <[email protected]> wrote:
> > On 28/05/18 16:57, Vincent Guittot wrote:
> >> Hi Juri,
> >>
> >> On 28 May 2018 at 12:12, Juri Lelli <[email protected]> wrote:
> >> > Hi Vincent,
> >> >
> >> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> >> utilization, we can use the max of the 2 values when agregating the
> >> >> utilization of the CPU.
> >> >>
> >> >> Signed-off-by: Vincent Guittot <[email protected]>
> >> >> ---
> >> >> kernel/sched/sched.h | 6 +++++-
> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> index 4526ba6..0eb07a8 100644
> >> >> --- a/kernel/sched/sched.h
> >> >> +++ b/kernel/sched/sched.h
> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
> >> >> {
> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >
> >> > I'd be tempted to say the we actually want to cap to this one above
> >> > instead of using the max (as you are proposing below) or the
> >> > (theoretical) power reduction benefits of using DEADLINE for certain
> >> > tasks might vanish.
> >>
> >> The problem that I'm facing is that the sched_entity bandwidth is
> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
> >> zero but if the DL task has preempted a CFS task, the utilization of
> >> the CFS task will be lower than reality and schedutil will set a lower
> >> OPP whereas the CPU is always running.

With UTIL_EST enabled I don't expect an OPP reduction below the
expected utilization of a CFS task.

IOW, when a periodic CFS task is preempted by a DL one, what we use
for OPP selection once the DL task is over is still the estimated
utilization for the CFS task itself. Thus, schedutil will eventually
(since we have quite conservative down scaling thresholds) go down to
the right OPP to serve that task.

> >> The example with a RT task described in the cover letter can be
> >> run with a DL task and will give similar results.

In the cover letter you say:

A rt-app use case which creates an always running cfs thread and a
rt threads that wakes up periodically with both threads pinned on
same CPU, show lot of frequency switches of the CPU whereas the CPU
never goes idles during the test.

I would say that's a quite specific corner case where your always
running CFS task has never accumulated a util_est sample.

Do we really have these cases in real systems?

Otherwise, it seems to me that we are trying to solve quite specific
corner cases by adding a not negligible level of "complexity".

Moreover, I also have the impression that we can fix these
use-cases by:

- improving the way we accumulate samples in util_est
i.e. by discarding preemption time

- maybe by improving the utilization aggregation in schedutil to
better understand DL requirements
i.e. a 10% utilization with a 100ms running time is way different
than the same utilization with a 1ms running time


--
#include <best/regards.h>

Patrick Bellasi
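
For reference, the CFS-side behaviour Patrick relies on above works, to my
understanding, along the lines of the sketch below (not quoted from this
series): with UTIL_EST, schedutil sees at least the estimated utilization of
the enqueued tasks, so a preempted periodic task does not drag the requested
OPP down while it is waiting.

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

	/* with UTIL_EST, never request less than the last-enqueue estimate */
	if (sched_feat(UTIL_EST))
		util = max_t(unsigned long, util,
			     READ_ONCE(rq->cfs.avg.util_est.enqueued));

	return util;
}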

2018-05-31 13:03:15

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 31 May 2018 at 12:27, Patrick Bellasi <[email protected]> wrote:
>
> Hi Vincent, Juri,
>
> On 28-May 18:34, Vincent Guittot wrote:
>> On 28 May 2018 at 17:22, Juri Lelli <[email protected]> wrote:
>> > On 28/05/18 16:57, Vincent Guittot wrote:
>> >> Hi Juri,
>> >>
>> >> On 28 May 2018 at 12:12, Juri Lelli <[email protected]> wrote:
>> >> > Hi Vincent,
>> >> >
>> >> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> >> Now that we have both the dl class bandwidth requirement and the dl class
>> >> >> utilization, we can use the max of the 2 values when agregating the
>> >> >> utilization of the CPU.
>> >> >>
>> >> >> Signed-off-by: Vincent Guittot <[email protected]>
>> >> >> ---
>> >> >> kernel/sched/sched.h | 6 +++++-
>> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> >> index 4526ba6..0eb07a8 100644
>> >> >> --- a/kernel/sched/sched.h
>> >> >> +++ b/kernel/sched/sched.h
>> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
>> >> >> {
>> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >
>> >> > I'd be tempted to say the we actually want to cap to this one above
>> >> > instead of using the max (as you are proposing below) or the
>> >> > (theoretical) power reduction benefits of using DEADLINE for certain
>> >> > tasks might vanish.
>> >>
>> >> The problem that I'm facing is that the sched_entity bandwidth is
>> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
>> >> zero but if the DL task has preempted a CFS task, the utilization of
>> >> the CFS task will be lower than reality and schedutil will set a lower
>> >> OPP whereas the CPU is always running.
>
> With UTIL_EST enabled I don't expect an OPP reduction below the
> expected utilization of a CFS task.

I'm not sure I fully catch what you mean, but all the tests that I ran
are using util_est (which is enabled by default if I'm not wrong). So
all the OPP drops that have been observed were with util_est

>
> IOW, when a periodic CFS task is preempted by a DL one, what we use
> for OPP selection once the DL task is over is still the estimated
> utilization for the CFS task itself. Thus, schedutil will eventually
> (since we have quite conservative down scaling thresholds) go down to
> the right OPP to serve that task.
>
>> >> The example with a RT task described in the cover letter can be
>> >> run with a DL task and will give similar results.
>
> In the cover letter you say:
>
> A rt-app use case which creates an always running cfs thread and a
> rt threads that wakes up periodically with both threads pinned on
> same CPU, show lot of frequency switches of the CPU whereas the CPU
> never goes idles during the test.
>
> I would say that's a quite specific corner case where your always
> running CFS task has never accumulated a util_est sample.
>
> Do we really have these cases in real systems?

My example is voluntarily an extreme one because it's easier to
highlight the problem

>
> Otherwise, it seems to me that we are trying to solve quite specific
> corner cases by adding a not negligible level of "complexity".

By complexity, do you mean :

Taking into account the number of running cfs tasks to choose between
rq->dl.running_bw and avg_dl.util_avg

I'm preparing a patchset that will provide the cfs waiting time in
addition to dl/rt util_avg for almost no additional cost. I will try
to send the proposal later today

>
> Moreover, I also have the impression that we can fix these
> use-cases by:
>
> - improving the way we accumulate samples in util_est
> i.e. by discarding preemption time
>
> - maybe by improving the utilization aggregation in schedutil to
> better understand DL requirements
> i.e. a 10% utilization with a 100ms running time is way different
> than the same utilization with a 1ms running time
>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-05-31 16:55:56

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

On 05/30/2018 08:45 PM, Vincent Guittot wrote:
> Hi Dietmar,
>
> On 30 May 2018 at 17:55, Dietmar Eggemann <[email protected]> wrote:
>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:

[...]

>>> + */
>>> + ret = ___update_load_sum(rq->clock - running, rq->cpu,
>>> &rq->avg_irq,
>>> + 0,
>>> + 0,
>>> + 0);
>>> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
>>> + 1,
>>> + 1,
>>> + 1);

Can you not change the function parameter list to the usual
(u64 now, struct rq *rq, int running)?

Something like this (only compile and boot tested):

-- >8 --

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9894bc7af37e..26ffd585cab8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -177,8 +177,22 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->clock_task += delta;

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
- if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
- update_irq_load_avg(rq, irq_delta + steal);
+ if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
+ /*
+ * We know the time that has been used by interrupt since last
+ * update but we don't when. Let be pessimistic and assume that
+ * interrupt has happened just before the update. This is not
+ * so far from reality because interrupt will most probably
+ * wake up task and trig an update of rq clock during which the
+ * metric si updated.
+ * We start to decay with normal context time and then we add
+ * the interrupt context time.
+ * We can safely remove running from rq->clock because
+ * rq->clock += delta with delta >= running
+ */
+ update_irq_load_avg(rq_clock(rq) - (irq_delta + steal), rq, 0);
+ update_irq_load_avg(rq_clock(rq), rq, 1);
+ }
#endif
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1bb3379c4b71..a245f853c271 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7363,7 +7363,7 @@ static void update_blocked_averages(int cpu)
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
- update_irq_load_avg(rq, 0);
+ update_irq_load_avg(rq_clock(rq), rq, 0);
/* Don't need periodic decay once load/util_avg are null */
if (others_rqs_have_blocked(rq))
done = false;
@@ -7434,7 +7434,7 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
- update_irq_load_avg(rq, 0);
+ update_irq_load_avg(rq_clock(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index d2e4f2186b13..ae01bb18e28c 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -365,31 +365,15 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
*
*/

-int update_irq_load_avg(struct rq *rq, u64 running)
+int update_irq_load_avg(u64 now, struct rq *rq, int running)
{
- int ret = 0;
- /*
- * We know the time that has been used by interrupt since last update
- * but we don't when. Let be pessimistic and assume that interrupt has
- * happened just before the update. This is not so far from reality
- * because interrupt will most probably wake up task and trig an update
- * of rq clock during which the metric si updated.
- * We start to decay with normal context time and then we add the
- * interrupt context time.
- * We can safely remove running from rq->clock because
- * rq->clock += delta with delta >= running
- */
- ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
- 0,
- 0,
- 0);
- ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
- 1,
- 1,
- 1);
-
- if (ret)
+ if (___update_load_sum(now, rq->cpu, &rq->avg_irq,
+ running,
+ running,
+ running)) {
___update_load_avg(&rq->avg_irq, 1, 1);
+ return 1;
+ }

- return ret;
+ return 0;
}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0ce9a5a5877a..ebc57301a9a8 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -5,7 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
-int update_irq_load_avg(struct rq *rq, u64 running);
+int update_irq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be update if



2018-06-01 13:54:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On Thursday 31 May 2018 at 15:02:04 (+0200), Vincent Guittot wrote:
> On 31 May 2018 at 12:27, Patrick Bellasi <[email protected]> wrote:
> >
> > Hi Vincent, Juri,
> >
> > On 28-May 18:34, Vincent Guittot wrote:
> >> On 28 May 2018 at 17:22, Juri Lelli <[email protected]> wrote:
> >> > On 28/05/18 16:57, Vincent Guittot wrote:
> >> >> Hi Juri,
> >> >>
> >> >> On 28 May 2018 at 12:12, Juri Lelli <[email protected]> wrote:
> >> >> > Hi Vincent,
> >> >> >
> >> >> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> >> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> >> >> utilization, we can use the max of the 2 values when agregating the
> >> >> >> utilization of the CPU.
> >> >> >>
> >> >> >> Signed-off-by: Vincent Guittot <[email protected]>
> >> >> >> ---
> >> >> >> kernel/sched/sched.h | 6 +++++-
> >> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >> >> >>
> >> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> >> index 4526ba6..0eb07a8 100644
> >> >> >> --- a/kernel/sched/sched.h
> >> >> >> +++ b/kernel/sched/sched.h
> >> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
> >> >> >> {
> >> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> >
> >> >> > I'd be tempted to say the we actually want to cap to this one above
> >> >> > instead of using the max (as you are proposing below) or the
> >> >> > (theoretical) power reduction benefits of using DEADLINE for certain
> >> >> > tasks might vanish.
> >> >>
> >> >> The problem that I'm facing is that the sched_entity bandwidth is
> >> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
> >> >> zero but if the DL task has preempted a CFS task, the utilization of
> >> >> the CFS task will be lower than reality and schedutil will set a lower
> >> >> OPP whereas the CPU is always running.
> >
> > With UTIL_EST enabled I don't expect an OPP reduction below the
> > expected utilization of a CFS task.
>
> I'm not sure I fully catch what you mean, but all the tests that I ran
> are using util_est (which is enabled by default if I'm not wrong). So
> all the OPP drops that have been observed were with util_est
>
> >
> > IOW, when a periodic CFS task is preempted by a DL one, what we use
> > for OPP selection once the DL task is over is still the estimated
> > utilization for the CFS task itself. Thus, schedutil will eventually
> > (since we have quite conservative down scaling thresholds) go down to
> > the right OPP to serve that task.
> >
> >> >> The example with a RT task described in the cover letter can be
> >> >> run with a DL task and will give similar results.
> >
> > In the cover letter you say:
> >
> > A rt-app use case which creates an always running cfs thread and a
> > rt threads that wakes up periodically with both threads pinned on
> > same CPU, show lot of frequency switches of the CPU whereas the CPU
> > never goes idles during the test.
> >
> > I would say that's a quite specific corner case where your always
> > running CFS task has never accumulated a util_est sample.
> >
> > Do we really have these cases in real systems?
>
> My example is voluntarily an extreme one because it's easier to
> highlight the problem
>
> >
> > Otherwise, it seems to me that we are trying to solve quite specific
> > corner cases by adding a not negligible level of "complexity".
>
> By complexity, do you mean :
>
> Taking into account the number of running cfs tasks to choose between
> rq->dl.running_bw and avg_dl.util_avg
>
> I'm preparing a patchset that will provide the cfs waiting time in
> addition to dl/rt util_avg for almost no additional cost. I will try
> to send the proposal later today


The code below adds tracking of the waiting level of cfs tasks caused by
rt/dl preemption. This waiting time can then be used when selecting an OPP
instead of the dl util_avg, which could become higher than the dl bandwidth
with a "long" runtime.

We only need one new call, for the 1st cfs task that is enqueued, to get
these additional metrics. The call to arch_scale_cpu_capacity() can be
removed once the latter is taken into account when computing the load
(which currently scales only with frequency).

For rt tasks, we must keep taking util_avg into account to have an idea of
the rt level on the cpu, whereas for dl this is given by the bandwidth.

---
kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
kernel/sched/pelt.c | 8 ++++++--
kernel/sched/sched.h | 4 +++-
3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eac1f9a..1682ea7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static inline void update_cfs_wait_util_avg(struct rq *rq)
+{
+ /*
+ * If cfs is already enqueued, we don't have anything to do because
+ * we already updated the non waiting time
+ */
+ if (rq->cfs.h_nr_running)
+ return;
+
+ /*
+ * If rt is running, we update the non wait time before increasing
+ * cfs.h_nr_running)
+ */
+ if (rq->curr->sched_class == &rt_sched_class)
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
+ /*
+ * If dl is running, we update the non waiting time before increasing
+ * cfs.h_nr_running)
+ */
+ if (rq->curr->sched_class == &dl_sched_class)
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5159,6 +5183,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

+ /* Update the tracking of waiting time */
+ update_cfs_wait_util_avg(rq);
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a559a53..ef8905e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -322,9 +322,11 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
{
+ unsigned long waiting = rq->cfs.h_nr_running ? arch_scale_cpu_capacity(NULL, rq->cpu) : 0;
+
if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
running,
- running,
+ waiting,
running)) {

___update_load_avg(&rq->avg_rt, 1, 1);
@@ -345,9 +347,11 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
{
+ unsigned long waiting = rq->cfs.h_nr_running ? arch_scale_cpu_capacity(NULL, rq->cpu) : 0;
+
if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
running,
- running,
+ waiting,
running)) {

___update_load_avg(&rq->avg_dl, 1, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0ea94de..9f72a05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2184,7 +2184,9 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
{
unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

- util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+ util = max_t(unsigned long, util,
+ READ_ONCE(rq->avg_dl.runnable_load_avg));
+
trace_printk("cpu_util_dl cpu%d %u %lu %llu", rq->cpu,
rq->cfs.h_nr_running,
READ_ONCE(rq->avg_dl.util_avg),
--
2.7.4



>
> >
> > Moreover, I also have the impression that we can fix these
> > use-cases by:
> >
> > - improving the way we accumulate samples in util_est
> > i.e. by discarding preemption time
> >
> > - maybe by improving the utilization aggregation in schedutil to
> > better understand DL requirements
> > i.e. a 10% utilization with a 100ms running time is way different
> > than the same utilization with a 1ms running time
> >
> >
> > --
> > #include <best/regards.h>
> >
> > Patrick Bellasi

2018-06-01 16:25:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On Thu, May 31, 2018 at 10:46:07AM +0200, Juri Lelli wrote:
> On 30/05/18 17:46, Quentin Perret wrote:

> > So I understand why we want to go to max freq when an RT task is running,
> > but I think there are use cases where we might want to be more conservative
> > and use the util_avg of the RT rq instead. The first use case is
> > battery-powered devices where going to max isn't really affordable from
> > an energy standpoint. Android, for example, has been using a RT
> > utilization signal to select OPPs for quite a while now, because going
> > to max blindly is _very_ expensive.
> >
> > And the second use-case is thermal pressure. On some modern CPUs, going to
> > max freq can lead to stringent thermal capping very quickly, at the
> > point where your CPUs might not have enough capacity to serve your tasks
> > properly. And that can ultimately hurt the very RT tasks you originally
> > tried to run fast. In these systems, in the long term, you'd be better off
> > not asking for more than what you really need ...
>
> Proposed the same at last LPC. Peter NAKed it (since RT is all about
> meeting deadlines, and when using FIFO/RR we don't really know how fast
> the CPU should go to meet them, so go to max is the only safe decision).
>
> > So what about having a sched_feature to select between going to max and
> > using the RT util_avg ? Obviously the default should keep the current
> > behaviour.
>
> Peter, would SCHED_FEAT make a difference? :)

Hurmph...

> Or Patrick's utilization capping applied to RT..

There might be something there, IIRC that tracks the max potential
utilization for the running tasks. So at that point we can set a
frequency to minimize idle time.

It's not perfect, because while the clamping thing effectively sets a
per-task bandwidth, the max filter is wrong. Also there's no CBS to
enforce anything.

With RT servers we could aggregate the group bandwidth and limit from
that...

2018-06-01 17:24:51

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 01-Jun 18:23, Peter Zijlstra wrote:
> On Thu, May 31, 2018 at 10:46:07AM +0200, Juri Lelli wrote:
> > On 30/05/18 17:46, Quentin Perret wrote:
>
> > > So I understand why we want to go to max freq when an RT task is running,
> > > but I think there are use cases where we might want to be more conservative
> > > and use the util_avg of the RT rq instead. The first use case is
> > > battery-powered devices where going to max isn't really affordable from
> > > an energy standpoint. Android, for example, has been using a RT
> > > utilization signal to select OPPs for quite a while now, because going
> > > to max blindly is _very_ expensive.
> > >
> > > And the second use-case is thermal pressure. On some modern CPUs, going to
> > > max freq can lead to stringent thermal capping very quickly, at the
> > > point where your CPUs might not have enough capacity to serve your tasks
> > > properly. And that can ultimately hurt the very RT tasks you originally
> > > tried to run fast. In these systems, in the long term, you'd be better off
> > > not asking for more than what you really need ...
> >
> > Proposed the same at last LPC. Peter NAKed it (since RT is all about
> > meeting deadlines, and when using FIFO/RR we don't really know how fast
> > the CPU should go to meet them, so go to max is the only safe decision).
> >
> > > So what about having a sched_feature to select between going to max and
> > > using the RT util_avg ? Obviously the default should keep the current
> > > behaviour.
> >
> > Peter, would SCHED_FEAT make a difference? :)
>
> Hurmph...
>
> > Or Patrick's utilization capping applied to RT..
>
> There might be something there, IIRC that tracks the max potential
> utilization for the running tasks. So at that point we can set a
> frequency to minimize idle time.

Or we can do the opposite: we go to max by default (as it is now) and
if you think that some RT tasks don't need the full speed, you can
apply a util_max to them.

That way, when a RT task is running alone on a CPU, we can run it
only at a custom max freq which is known to satisfy your latency
requirements.

If instead it's running with other CFS tasks, we already add the CFS
utilization, which will result in a speedup of the RT task to give
the CPU back to CFS sooner.
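
A minimal sketch of that behaviour, with invented helper and parameter names
(rt_util_max is the per-task/per-group user-space clamp, defaulting to the
full CPU capacity so that the default is still "go to max"):

static unsigned long aggregate_util_sketch(unsigned long max_cap,
                                           unsigned long rt_util_max,
                                           unsigned long cfs_util,
                                           int rt_nr_running)
{
        unsigned long util = cfs_util;

        /* RT still requests "max" by default; user-space may lower the clamp */
        if (rt_nr_running)
                util += rt_util_max;

        return util < max_cap ? util : max_cap;
}

With the default clamp, any runnable RT task still drives the request to
max_cap; with a lower clamp, a RT task running alone stays at the custom max
freq, and the CFS contribution on top speeds it up when the CPU is shared.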

> It's not perfect, because while the clamping thing effectively sets a
> per-task bandwidth, the max filter is wrong. Also there's no CBS to
> enforce anything.

Right, well... potentially, from user-space, if you carefully set the RT
cpu controller (both bandwidth and clamping) and keep track of the
allocated bandwidth, you can still ensure that all your RT tasks will
be able to run, according to their prio.

> With RT servers we could aggregate the group bandwidth and limit from
> that...

What we certainly miss, I think, is an EDF scheduler: it's not
possible to run certain RT tasks before others irrespective of their
relative priority.

--
#include <best/regards.h>

Patrick Bellasi

2018-06-01 17:47:31

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> > >> >> The example with a RT task described in the cover letter can be
> > >> >> run with a DL task and will give similar results.
> > >
> > > In the cover letter you says:
> > >
> > > A rt-app use case which creates an always running cfs thread and a
> > > rt threads that wakes up periodically with both threads pinned on
> > > same CPU, show lot of frequency switches of the CPU whereas the CPU
> > > never goes idles during the test.
> > >
> > > I would say that's a quite specific corner case where your always
> > > running CFS task has never accumulated a util_est sample.
> > >
> > > Do we really have these cases in real systems?
> >
> > My example is voluntary an extreme one because it's easier to
> > highlight the problem
> >
> > >
> > > Otherwise, it seems to me that we are trying to solve quite specific
> > > corner cases by adding a not negligible level of "complexity".
> >
> > By complexity, do you mean :
> >
> > Taking into account the number cfs running task to choose between
> > rq->dl.running_bw and avg_dl.util_avg
> >
> > I'm preparing a patchset that will provide the cfs waiting time in
> > addition to dl/rt util_avg for almost no additional cost. I will try
> > to sent the proposal later today
>
>
> The code below adds the tracking of the waiting level of cfs tasks because of
> rt/dl preemption. This waiting time can then be used when selecting an OPP
> instead of the dl util_avg which could become higher than dl bandwidth with
> "long" runtime
>
> We need only one new call for the 1st cfs task that is enqueued to get these additional metrics
> the call to arch_scale_cpu_capacity() can be removed once the later will be
> taken into account when computing the load (which scales only with freq
> currently)
>
> For rt tasks, we must keep taking util_avg into account to have an idea of the
> rt level on the cpu, whereas for dl this is given by the bandwidth
>
> ---
> kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
> kernel/sched/pelt.c | 8 ++++++--
> kernel/sched/sched.h | 4 +++-
> 3 files changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eac1f9a..1682ea7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static inline void update_cfs_wait_util_avg(struct rq *rq)
> +{
> + /*
> + * If cfs is already enqueued, we don't have anything to do because
> + * we already updated the non waiting time
> + */
> + if (rq->cfs.h_nr_running)
> + return;
> +
> + /*
> + * If rt is running, we update the non wait time before increasing
> + * cfs.h_nr_running)
> + */
> + if (rq->curr->sched_class == &rt_sched_class)
> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> +
> + /*
> + * If dl is running, we update the non waiting time before increasing
> + * cfs.h_nr_running)
> + */
> + if (rq->curr->sched_class == &dl_sched_class)
> + update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
> +}
> +

Please correct me if I'm wrong but the CFS preemption-decay happens in
set_next_entity -> update_load_avg when the CFS task is scheduled again after
the preemption. Then can we not fix this issue by doing our UTIL_EST magic
from set_next_entity? But yeah probably we need to be careful with overhead..

IMO I feel it's overkill to account dl_avg when we already have DL's running
bandwidth we can use. I understand it may be too instantaneous, but perhaps we
can fix CFS's problems within CFS itself and not have to do this kind of
extra external accounting?

I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
from within the CFS class as is done above.

thanks,

- Joel


2018-06-04 06:42:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 1 June 2018 at 19:45, Joel Fernandes <[email protected]> wrote:
> On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> > >> >> The example with a RT task described in the cover letter can be
>> > >> >> run with a DL task and will give similar results.
>> > >
>> > > In the cover letter you says:
>> > >
>> > > A rt-app use case which creates an always running cfs thread and a
>> > > rt threads that wakes up periodically with both threads pinned on
>> > > same CPU, show lot of frequency switches of the CPU whereas the CPU
>> > > never goes idles during the test.
>> > >
>> > > I would say that's a quite specific corner case where your always
>> > > running CFS task has never accumulated a util_est sample.
>> > >
>> > > Do we really have these cases in real systems?
>> >
>> > My example is voluntary an extreme one because it's easier to
>> > highlight the problem
>> >
>> > >
>> > > Otherwise, it seems to me that we are trying to solve quite specific
>> > > corner cases by adding a not negligible level of "complexity".
>> >
>> > By complexity, do you mean :
>> >
>> > Taking into account the number cfs running task to choose between
>> > rq->dl.running_bw and avg_dl.util_avg
>> >
>> > I'm preparing a patchset that will provide the cfs waiting time in
>> > addition to dl/rt util_avg for almost no additional cost. I will try
>> > to sent the proposal later today
>>
>>
>> The code below adds the tracking of the waiting level of cfs tasks because of
>> rt/dl preemption. This waiting time can then be used when selecting an OPP
>> instead of the dl util_avg which could become higher than dl bandwidth with
>> "long" runtime
>>
>> We need only one new call for the 1st cfs task that is enqueued to get these additional metrics
>> the call to arch_scale_cpu_capacity() can be removed once the later will be
>> taken into account when computing the load (which scales only with freq
>> currently)
>>
>> For rt tasks, we must keep taking util_avg into account to have an idea of the
>> rt level on the cpu, whereas for dl this is given by the bandwidth
>>
>> ---
>> kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
>> kernel/sched/pelt.c | 8 ++++++--
>> kernel/sched/sched.h | 4 +++-
>> 3 files changed, 36 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eac1f9a..1682ea7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>> }
>> #endif
>>
>> +static inline void update_cfs_wait_util_avg(struct rq *rq)
>> +{
>> + /*
>> + * If cfs is already enqueued, we don't have anything to do because
>> + * we already updated the non waiting time
>> + */
>> + if (rq->cfs.h_nr_running)
>> + return;
>> +
>> + /*
>> + * If rt is running, we update the non wait time before increasing
>> + * cfs.h_nr_running)
>> + */
>> + if (rq->curr->sched_class == &rt_sched_class)
>> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +
>> + /*
>> + * If dl is running, we update the non waiting time before increasing
>> + * cfs.h_nr_running)
>> + */
>> + if (rq->curr->sched_class == &dl_sched_class)
>> + update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +}
>> +
>
> Please correct me if I'm wrong but the CFS preemption-decay happens in
> set_next_entity -> update_load_avg when the CFS task is scheduled again after
> the preemption. Then can we not fix this issue by doing our UTIL_EST magic
> from set_next_entity? But yeah probably we need to be careful with overhead..

util_est is there to keep track of the last max. I'm not sure that
trying to add some magic to take preemption into account is the right
way to go. Mixing several pieces of information into the same metric
just makes the meaning of that metric fuzzier.

>
> IMO I feel its overkill to account dl_avg when we already have DL's running
> bandwidth we can use. I understand it may be too instanenous, but perhaps we

We keep using the dl bandwidth, which is quite correct for dl's needs but
doesn't reflect how dl has disturbed the other classes.

> can fix CFS's problems within CFS itself and not have to do this kind of
> extra external accounting ?
>
> I also feel its better if we don't have to call update_{rt,dl}_rq_load_avg
> from within CFS class as being done above.
>
> thanks,
>
> - Joel
>

2018-06-04 07:05:32

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

Hi Vincent,

On 04/06/18 08:41, Vincent Guittot wrote:
> On 1 June 2018 at 19:45, Joel Fernandes <[email protected]> wrote:
> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

[...]

> > IMO I feel its overkill to account dl_avg when we already have DL's running
> > bandwidth we can use. I understand it may be too instanenous, but perhaps we
>
> We keep using dl bandwidth which is quite correct for dl needs but
> doesn't reflect how it has disturbed other classes
>
> > can fix CFS's problems within CFS itself and not have to do this kind of
> > extra external accounting ?

I would also keep accounting for waiting time due to higher prio classes
all inside CFS. My impression, when discussing it with you on IRC, was
that we should be able to do that by not decaying cfs.util_avg when CFS
is preempted (creating a new signal for it). Isn't this enough?

I feel we should try to keep cross-class accounting/interaction at a
minimum.

Thanks,

- Juri

2018-06-04 07:15:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 4 June 2018 at 09:04, Juri Lelli <[email protected]> wrote:
> Hi Vincent,
>
> On 04/06/18 08:41, Vincent Guittot wrote:
>> On 1 June 2018 at 19:45, Joel Fernandes <[email protected]> wrote:
>> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>
> [...]
>
>> > IMO I feel its overkill to account dl_avg when we already have DL's running
>> > bandwidth we can use. I understand it may be too instanenous, but perhaps we
>>
>> We keep using dl bandwidth which is quite correct for dl needs but
>> doesn't reflect how it has disturbed other classes
>>
>> > can fix CFS's problems within CFS itself and not have to do this kind of
>> > extra external accounting ?
>
> I would also keep accounting for waiting time due to higher prio classes
> all inside CFS. My impression, when discussing it with you on IRC, was
> that we should be able to do that by not decaying cfs.util_avg when CFS
> is preempted (creating a new signal for it). Is not this enough?

We don't just want to avoid decaying the signal, we want to increase it
to reflect the amount of preemption.
So we can't do that in a current signal. You would like to add
another metric in cfs_rq then?
The place doesn't really matter, to be honest, cfs_rq or dl_rq, but
you will not avoid adding calls in the dl class to start/stop the
accounting of the preemption.

>
> I feel we should try to keep cross-class accounting/interaction at a
> minimum.

Accounting for cross-class preemption can't be done without
cross-class accounting.

Regards,
Vincent

>
> Thanks,
>
> - Juri

2018-06-04 10:14:49

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 04/06/18 09:14, Vincent Guittot wrote:
> On 4 June 2018 at 09:04, Juri Lelli <[email protected]> wrote:
> > Hi Vincent,
> >
> > On 04/06/18 08:41, Vincent Guittot wrote:
> >> On 1 June 2018 at 19:45, Joel Fernandes <[email protected]> wrote:
> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> >
> > [...]
> >
> >> > IMO I feel its overkill to account dl_avg when we already have DL's running
> >> > bandwidth we can use. I understand it may be too instanenous, but perhaps we
> >>
> >> We keep using dl bandwidth which is quite correct for dl needs but
> >> doesn't reflect how it has disturbed other classes
> >>
> >> > can fix CFS's problems within CFS itself and not have to do this kind of
> >> > extra external accounting ?
> >
> > I would also keep accounting for waiting time due to higher prio classes
> > all inside CFS. My impression, when discussing it with you on IRC, was
> > that we should be able to do that by not decaying cfs.util_avg when CFS
> > is preempted (creating a new signal for it). Is not this enough?
>
> We don't just want to not decay a signal but increase the signal to
> reflect the amount of preemption

OK.

> Then, we can't do that in a current signal. So you would like to add
> another metrics in cfs_rq ?

Since it's CFS related, I'd say it should fit in CFS.

> The place doesn't really matter to be honest in cfs_rq or in dl_rq but
> you will not prevent to add call in dl class to start/stop the
> accounting of the preemption
>
> >
> > I feel we should try to keep cross-class accounting/interaction at a
> > minimum.
>
> accounting for cross class preemption can't be done without
> cross-class accounting

Mmm, can't we distinguish in, say, pick_next_task_fair() whether prev was of
a higher prio class and act accordingly?

Thanks,

- Juri

2018-06-04 10:18:34

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On Friday 01 Jun 2018 at 18:23:59 (+0100), Patrick Bellasi wrote:
> On 01-Jun 18:23, Peter Zijlstra wrote:
> > On Thu, May 31, 2018 at 10:46:07AM +0200, Juri Lelli wrote:
> > > On 30/05/18 17:46, Quentin Perret wrote:
> >
> > > > So I understand why we want to got to max freq when a RT task is running,
> > > > but I think there are use cases where we might want to be more conservative
> > > > and use the util_avg of the RT rq instead. The first use case is
> > > > battery-powered devices where going to max isn't really affordable from
> > > > an energy standpoint. Android, for example, has been using a RT
> > > > utilization signal to select OPPs for quite a while now, because going
> > > > to max blindly is _very_ expensive.
> > > >
> > > > And the second use-case is thermal pressure. On some modern CPUs, going to
> > > > max freq can lead to stringent thermal capping very quickly, at the
> > > > point where your CPUs might not have enough capacity to serve your tasks
> > > > properly. And that can ultimately hurt the very RT tasks you originally
> > > > tried to run fast. In these systems, in the long term, you'd be better off
> > > > not asking for more than what you really need ...
> > >
> > > Proposed the same at last LPC. Peter NAKed it (since RT is all about
> > > meeting deadlines, and when using FIFO/RR we don't really know how fast
> > > the CPU should go to meet them, so go to max is the only safe decision).
> > >
> > > > So what about having a sched_feature to select between going to max and
> > > > using the RT util_avg ? Obviously the default should keep the current
> > > > behaviour.
> > >
> > > Peter, would SCHED_FEAT make a difference? :)
> >
> > Hurmph...

:)

> >
> > > Or Patrick's utilization capping applied to RT..
> >
> > There might be something there, IIRC that tracks the max potential
> > utilization for the running tasks. So at that point we can set a
> > frequency to minimize idle time.
>
> Or we can do the opposite: we go to max by default (as it is now) and
> if you think that some RT tasks don't need the full speed, you can
> apply a util_max to them.
>
> That way, when a RT task is running alone on a CPU, we can run it
> only at a custom max freq which is known to be ok according to your
> latency requirements.
>
> If instead it's running with other CFS tasks, we add already the CFS
> utilization, which will result into a speedup of the RT task to give
> back the CPU to CFS.

Hmmm, why not set a util_min constraint instead? The default for a
RT task should be a util_min of 1024, which means go to max. And then,
if userspace has knowledge about the tasks, it could decide to lower the
util_min value. That way, you would still let the util_avg grow if a
RT task runs flat out for a long time, and you would still be able to go
to higher frequencies. But if the util of the RT tasks is very low, you
wouldn't necessarily run at max freq, you would run at the freq matching
the util_min constraint.

So you probably want to: 1) forbid setting max_util constraints for RT;
2) have schedutil look at the min-capped RT util if rt_nr_running > 0;
and 3) have schedutil look at the non-capped RT util if rt_nr_running == 0.
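
In pseudo-C, a minimal sketch of that policy (illustrative only, the helper
and its parameters are made up and are not existing schedutil code; point 1
is simply enforced by never feeding a max-capped value here):

static unsigned long rt_freq_request(unsigned long rt_util_avg,
                                     unsigned long rt_util_min, /* 1024 unless user-space lowers it */
                                     int rt_nr_running)
{
        /* 2) RT tasks runnable: use the min-capped RT util */
        if (rt_nr_running)
                return rt_util_avg > rt_util_min ? rt_util_avg : rt_util_min;

        /* 3) no RT task runnable: use the non-capped, decaying RT util */
        return rt_util_avg;
}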

Does that make any sense ?

Thanks,
Quentin

>
> > It's not perfect, because while the clamping thing effectively sets a
> > per-task bandwidth, the max filter is wrong. Also there's no CBS to
> > enforce anything.
>
> Right, well... from user-space potentially if you carefully set the RT
> cpu's controller (both bandwidth and clamping) and keep track of the
> allocated bandwidth, you can still ensure that all your RT tasks will
> be able to run, according to their prio.
>
> > With RT servers we could aggregate the group bandwidth and limit from
> > that...
>
> What we certainly miss I think it's the EDF scheduler: it's not
> possible to run certain RT tasks before others irrespectively of they
> relative priority.
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-06-04 12:37:02

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

On 4 June 2018 at 12:12, Juri Lelli <[email protected]> wrote:
> On 04/06/18 09:14, Vincent Guittot wrote:
>> On 4 June 2018 at 09:04, Juri Lelli <[email protected]> wrote:
>> > Hi Vincent,
>> >
>> > On 04/06/18 08:41, Vincent Guittot wrote:
>> >> On 1 June 2018 at 19:45, Joel Fernandes <[email protected]> wrote:
>> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> >
>> > [...]
>> >
>> >> > IMO I feel its overkill to account dl_avg when we already have DL's running
>> >> > bandwidth we can use. I understand it may be too instanenous, but perhaps we
>> >>
>> >> We keep using dl bandwidth which is quite correct for dl needs but
>> >> doesn't reflect how it has disturbed other classes
>> >>
>> >> > can fix CFS's problems within CFS itself and not have to do this kind of
>> >> > extra external accounting ?
>> >
>> > I would also keep accounting for waiting time due to higher prio classes
>> > all inside CFS. My impression, when discussing it with you on IRC, was
>> > that we should be able to do that by not decaying cfs.util_avg when CFS
>> > is preempted (creating a new signal for it). Is not this enough?
>>
>> We don't just want to not decay a signal but increase the signal to
>> reflect the amount of preemption
>
> OK.
>
>> Then, we can't do that in a current signal. So you would like to add
>> another metrics in cfs_rq ?
>
> Since it's CFS related, I'd say it should fit in CFS.

It's both dl and cfs, as the goal is to track cfs being preempted by dl.
This means creating a new struct whereas some fields of the avg_dl struct
would stay unused, and duplicating some calls to ___update_load_sum, since
we track avg_dl anyway to replace sched_rt_avg_update,
and update_dl/rt_rq_load_avg are already called in fair.c for updating the
blocked load.

>
>> The place doesn't really matter to be honest in cfs_rq or in dl_rq but
>> you will not prevent to add call in dl class to start/stop the
>> accounting of the preemption
>>
>> >
>> > I feel we should try to keep cross-class accounting/interaction at a
>> > minimum.
>>
>> accounting for cross class preemption can't be done without
>> cross-class accounting
>
> Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of
> higher prio class and act accordingly?

we will not be able to tell the difference between rt/dl/stop
preemption by using only pick_next_task_fair

Thanks

>
> Thanks,
>
> - Juri

2018-06-04 15:17:35

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] cpufreq/schedutil: add rt utilization tracking

On 04-Jun 11:17, Quentin Perret wrote:
> On Friday 01 Jun 2018 at 18:23:59 (+0100), Patrick Bellasi wrote:
> > On 01-Jun 18:23, Peter Zijlstra wrote:
> > > On Thu, May 31, 2018 at 10:46:07AM +0200, Juri Lelli wrote:
> > > > On 30/05/18 17:46, Quentin Perret wrote:

[...]

> > > There might be something there, IIRC that tracks the max potential
> > > utilization for the running tasks. So at that point we can set a
> > > frequency to minimize idle time.
> >
> > Or we can do the opposite: we go to max by default (as it is now) and
> > if you think that some RT tasks don't need the full speed, you can
> > apply a util_max to them.
> >
> > That way, when a RT task is running alone on a CPU, we can run it
> > only at a custom max freq which is known to be ok according to your
> > latency requirements.
> >
> > If instead it's running with other CFS tasks, we add already the CFS
> > utilization, which will result into a speedup of the RT task to give
> > back the CPU to CFS.
>
> Hmmm why not setting a util_min constraint instead ? The default for a
> RT task should be a util_min of 1024, which means go to max. And then
> if userspace has knowledge about the tasks it could decide to lower the
> util_min value. That way, you would still let the util_avg grow if a
> RT task runs flat out for a long time, and you would still be able to go
> to higher frequencies. But if the util of the RT tasks is very low, you
> wouldn't necessarily run at max freq, you would run at the freq matching
> the util_min constraint.
>
> So you probably want to: 1) forbid setting max_util constraints for RT;
> 2) have schedutil look at the min-capped RT util if rt_nr_running > 0;
> and 3) have schedutil look at the non-capped RT util if rt_nr_running == 0.
>
> Does that make any sense ?

I would say that it "could" make sense... it really depends on
user-space IMO. You could have long running RT tasks that you still
don't want to run at max OPP for power/energy concerns, maybe?

Anyway, the good point is that this is a user-space policy.

Right now we do not do anything special for RT tasks from the
util_clamp side. User-space is in charge of configuring it
correctly, and it could also very well decide to use different clamps
for different RT tasks, maybe.
Thus, I would probably avoid the special cases you describe in your
last two points above.

--
#include <best/regards.h>

Patrick Bellasi

2018-06-04 16:52:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
> When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> reflect anymore the utilization of cfs tasks but only the remaining part that
> is not used by rt tasks. We should monitor the stolen utilization and take
> it into account when selecting OPP. This patchset doesn't change the OPP
> selection policy for RT tasks but only for CFS tasks

So the problem is that when RT/DL/stop/IRQ happens and preempts CFS
tasks, time continues and the CFS load tracking will see !running and
decay things.

Then, when we get back to CFS, we'll have lower load/util than we
expected.

In particular, your focus is on OPP selection, where we would have,
say, u=1 (always running task); after being preempted by our RT task for
a while, it will now have u=.5. With the effect that when the RT task
goes to sleep we'll drop our OPP to .5 max -- which is 'wrong', right?

Your solution is to track RT/DL/stop/IRQ with the identical PELT average
as we track cfs util. Such that we can then add the various averages to
reconstruct the actual utilisation signal.

This should work for the case of the utilization signal on UP. It gets
murkier when we consider that PELT migrates the signal around on SMP, but
we don't do that to the per-rq signals we have for RT/DL/stop/IRQ.

There is also the 'complaint' that this ends up with 2 util signals for
DL, complicating things.


So this patch-set tracks the !cfs occupation using the same function,
which is all good. But what if, instead of using that to compensate the
OPP selection, we employ it to renormalize the util signal?

If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
then I think your initial problem goes away. Because while the RT task
will push the util to .5, it will at the same time push the CPU capacity
to .5, and renormalized that gives 1.

NOTE: the renorm would then become something like:
scale_cpu = arch_scale_cpu_capacity() / rt_frac();
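
As a toy illustration of that renormalization (not kernel code; the helper
and its parameters are invented, with cfs_cap standing for the capacity left
to CFS once the rt pressure has been removed):

static unsigned long renorm_cfs_util(unsigned long cfs_util,
                                     unsigned long cfs_cap,
                                     unsigned long max_cap)
{
        unsigned long util;

        if (!cfs_cap)
                return max_cap;

        /*
         * e.g. an always running task decayed to 512 on a CPU whose
         * remaining CFS capacity is 512 renormalizes back to 1024.
         */
        util = cfs_util * max_cap / cfs_cap;

        return util < max_cap ? util : max_cap;
}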


On IRC I mentioned stopping the CFS clock when preempted, and while that
would result in fixed numbers, Vincent was right in pointing out the
numbers will be difficult to interpret, since the meaning will be purely
CPU local and I'm not sure you can actually fix it again with
normalization.

Imagine, running a .3 RT task, that would push the (always running) CFS
down to .7, but because we discard all !cfs time, it actually has 1. If
we try and normalize that we'll end up with ~1.43, which is of course
completely broken.


_However_, all that happens for util, also happens for load. So the above
scenario will also make the CPU appear less loaded than it actually is.

Now, we actually try and compensate for that by decreasing the capacity
of the CPU. But because the existing rt_avg and PELT signals are so
out-of-tune, this is likely to be less than ideal. With that fixed
however, the best this appears to do is, as per the above, preserve the
actual load. But what we really wanted is to actually inflate the load,
such that someone will take load from us -- we're doing less actual work
after all.

Possibly, we can do something like:

scale_cpu_capacity / (rt_frac^2)

for load, then we inflate the load and could maybe get rid of all this
capacity_of() sprinkling, but that needs more thinking.


But I really feel we need to consider both util and load, as this issue
affects both.

2018-06-04 17:15:51

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Monday 04 Jun 2018 at 18:50:47 (+0200), Peter Zijlstra wrote:
> On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> > reflect anymore the utilization of cfs tasks but only the remaining part that
> > is not used by rt tasks. We should monitor the stolen utilization and take
> > it into account when selecting OPP. This patchset doesn't change the OPP
> > selection policy for RT tasks but only for CFS tasks
>
> So the problem is that when RT/DL/stop/IRQ happens and preempts CFS
> tasks, time continues and the CFS load tracking will see !running and
> decay things.
>
> Then, when we get back to CFS, we'll have lower load/util than we
> expected.
>
> In particular, your focus is on OPP selection, and where we would have
> say: u=1 (always running task), after being preempted by our RT task for
> a while, it will now have u=.5. With the effect that when the RT task
> goes sleep we'll drop our OPP to .5 max -- which is 'wrong', right?
>
> Your solution is to track RT/DL/stop/IRQ with the identical PELT average
> as we track cfs util. Such that we can then add the various averages to
> reconstruct the actual utilisation signal.
>
> This should work for the case of the utilization signal on UP. When we
> consider that PELT migrates the signal around on SMP, but we don't do
> that to the per-rq signals we have for RT/DL/stop/IRQ.
>
> There is also the 'complaint' that this ends up with 2 util signals for
> DL, complicating things.
>
>
> So this patch-set tracks the !cfs occupation using the same function,
> which is all good. But what, if instead of using that to compensate the
> OPP selection, we employ that to renormalize the util signal?
>
> If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> then I think your initial problem goes away. Because while the RT task
> will push the util to .5, it will at the same time push the CPU capacity
> to .5, and renormalized that gives 1.
>
> NOTE: the renorm would then become something like:
> scale_cpu = arch_scale_cpu_capacity() / rt_frac();

Isn't it equivalent? I mean, you can remove RT/DL/stop/IRQ from the CPU
capacity and compare the CFS util_avg against that, or you can add
RT/DL/stop/IRQ to the CFS util_avg and compare it to arch_scale_cpu_capacity().
Both should be interchangeable, no? By adding the RT/DL/IRQ PELT signals
to the CFS util_avg, Vincent is proposing to go with the latter I think.

But aren't the signals we currently use to account for RT/DL/stop/IRQ in
cpu_capacity good enough for that ? Can't we just add the diff between
capacity_orig_of and capacity_of to the CFS util and do OPP selection with
that (for !nr_rt_running) ? Maybe add a min with dl running_bw to be on
the safe side ... ?

>
>
> On IRC I mentioned stopping the CFS clock when preempted, and while that
> would result in fixed numbers, Vincent was right in pointing out the
> numbers will be difficult to interpret, since the meaning will be purely
> CPU local and I'm not sure you can actually fix it again with
> normalization.
>
> Imagine, running a .3 RT task, that would push the (always running) CFS
> down to .7, but because we discard all !cfs time, it actually has 1. If
> we try and normalize that we'll end up with ~1.43, which is of course
> completely broken.
>
>
> _However_, all that happens for util, also happens for load. So the above
> scenario will also make the CPU appear less loaded than it actually is.
>
> Now, we actually try and compensate for that by decreasing the capacity
> of the CPU. But because the existing rt_avg and PELT signals are so
> out-of-tune, this is likely to be less than ideal. With that fixed
> however, the best this appears to do is, as per the above, preserve the
> actual load. But what we really wanted is to actually inflate the load,
> such that someone will take load from us -- we're doing less actual work
> after all.
>
> Possibly, we can do something like:
>
> scale_cpu_capacity / (rt_frac^2)
>
> for load, then we inflate the load and could maybe get rid of all this
> capacity_of() sprinkling, but that needs more thinking.
>
>
> But I really feel we need to consider both util and load, as this issue
> affects both.

2018-06-04 18:11:02

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 4 June 2018 at 18:50, Peter Zijlstra <[email protected]> wrote:
> On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
>> When both cfs and rt tasks compete to run on a CPU, we can see some frequency
>> drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
>> reflect anymore the utilization of cfs tasks but only the remaining part that
>> is not used by rt tasks. We should monitor the stolen utilization and take
>> it into account when selecting OPP. This patchset doesn't change the OPP
>> selection policy for RT tasks but only for CFS tasks
>
> So the problem is that when RT/DL/stop/IRQ happens and preempts CFS
> tasks, time continues and the CFS load tracking will see !running and
> decay things.
>
> Then, when we get back to CFS, we'll have lower load/util than we
> expected.
>
> In particular, your focus is on OPP selection, and where we would have
> say: u=1 (always running task), after being preempted by our RT task for
> a while, it will now have u=.5. With the effect that when the RT task
> goes sleep we'll drop our OPP to .5 max -- which is 'wrong', right?

yes that's the typical example

>
> Your solution is to track RT/DL/stop/IRQ with the identical PELT average
> as we track cfs util. Such that we can then add the various averages to
> reconstruct the actual utilisation signal.

yes and get the whole cpu utilization

>
> This should work for the case of the utilization signal on UP. When we
> consider that PELT migrates the signal around on SMP, but we don't do
> that to the per-rq signals we have for RT/DL/stop/IRQ.
>
> There is also the 'complaint' that this ends up with 2 util signals for
> DL, complicating things.

Yes. That's the main point of discussion: how to balance dl bandwidth
and dl utilization.

>
>
> So this patch-set tracks the !cfs occupation using the same function,
> which is all good. But what, if instead of using that to compensate the
> OPP selection, we employ that to renormalize the util signal?
>
> If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> then I think your initial problem goes away. Because while the RT task
> will push the util to .5, it will at the same time push the CPU capacity
> to .5, and renormalized that gives 1.
>
> NOTE: the renorm would then become something like:
> scale_cpu = arch_scale_cpu_capacity() / rt_frac();
>
>
> On IRC I mentioned stopping the CFS clock when preempted, and while that
> would result in fixed numbers, Vincent was right in pointing out the
> numbers will be difficult to interpret, since the meaning will be purely
> CPU local and I'm not sure you can actually fix it again with
> normalization.
>
> Imagine, running a .3 RT task, that would push the (always running) CFS
> down to .7, but because we discard all !cfs time, it actually has 1. If
> we try and normalize that we'll end up with ~1.43, which is of course
> completely broken.
>
>
> _However_, all that happens for util, also happens for load. So the above
> scenario will also make the CPU appear less loaded than it actually is.

The load will continue to increase because, for load, we track the
runnable state and not just running.

>
> Now, we actually try and compensate for that by decreasing the capacity
> of the CPU. But because the existing rt_avg and PELT signals are so
> out-of-tune, this is likely to be less than ideal. With that fixed
> however, the best this appears to do is, as per the above, preserve the
> actual load. But what we really wanted is to actually inflate the load,
> such that someone will take load from us -- we're doing less actual work
> after all.
>
> Possibly, we can do something like:
>
> scale_cpu_capacity / (rt_frac^2)
>
> for load, then we inflate the load and could maybe get rid of all this
> capacity_of() sprinkling, but that needs more thinking.
>
>
> But I really feel we need to consider both util and load, as this issue
> affects both.

My initial idea was to take the max between dl bandwidth and dl util_avg,
but util_avg can be higher than the bandwidth, and using it would make
schedutil select a higher OPP for no good reason when nothing else is
running around and needs compute capacity.

As you mentioned, scale_rt_capacity gives the remaining capacity for
cfs and it will behave like cfs util_avg now that it uses PELT. So as
long as cfs util_avg < scale_rt_capacity (we probably need a margin),
we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting the
OPP, because we have remaining spare capacity; but once cfs util_avg ==
scale_rt_capacity, we make sure to use the max OPP.
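
A rough sketch of that selection policy, with invented names and an arbitrary
margin, just to make the idea concrete:

static unsigned long pick_util_for_opp(unsigned long cfs_util,
                                       unsigned long rt_util,
                                       unsigned long dl_bw,
                                       unsigned long scale_rt_cap,
                                       unsigned long max_cap)
{
        unsigned long util;

        /* cfs consumes (nearly) all the capacity left by rt/dl: go to max */
        if (cfs_util + (scale_rt_cap >> 4) >= scale_rt_cap)
                return max_cap;

        /* otherwise sum the contributions, with dl counted as its bandwidth */
        util = cfs_util + rt_util + dl_bw;

        return util < max_cap ? util : max_cap;
}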

I will run some tests to make sure that all my tests still behave
correctly with such a policy.

2018-06-05 08:37:51

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Quentin,

On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
> This patchset initially tracked only the utilization of RT rq. During
> OSPM summit, it has been discussed the opportunity to extend it in order
> to get an estimate of the utilization of the CPU.
>
> - Patches 1-3 correspond to the content of patchset v4 and add utilization
> tracking for rt_rq.
>
> When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> reflect anymore the utilization of cfs tasks but only the remaining part that
> is not used by rt tasks. We should monitor the stolen utilization and take
> it into account when selecting OPP. This patchset doesn't change the OPP
> selection policy for RT tasks but only for CFS tasks
>
> A rt-app use case which creates an always running cfs thread and a rt threads
> that wakes up periodically with both threads pinned on same CPU, show lot of
> frequency switches of the CPU whereas the CPU never goes idles during the
> test. I can share the json file that I used for the test if someone is
> interested in.
>
> For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
> the cpufreq statistics outputs (stats are reset just before the test) :
> $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> without patchset : 1230
> with patchset : 14

I have attached the rt-app json file that I use for this test

>
> If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
> performance improvements:
>
> - Without patchset :
> Test execution summary:
> total time: 15.0009s
> total number of events: 4903
> total time taken by event execution: 14.9972
> per-request statistics:
> min: 1.23ms
> avg: 3.06ms
> max: 13.16ms
> approx. 95 percentile: 12.73ms
>
> Threads fairness:
> events (avg/stddev): 4903.0000/0.00
> execution time (avg/stddev): 14.9972/0.00
>
> - With patchset:
> Test execution summary:
> total time: 15.0014s
> total number of events: 7694
> total time taken by event execution: 14.9979
> per-request statistics:
> min: 1.23ms
> avg: 1.95ms
> max: 10.49ms
> approx. 95 percentile: 10.39ms
>
> Threads fairness:
> events (avg/stddev): 7694.0000/0.00
> execution time (avg/stddev): 14.9979/0.00
>
> The performance improvement is 56% for this use case.
>
> - Patches 4-5 add utilization tracking for dl_rq in order to solve similar
> problem as with rt_rq
>
> - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove
> dl and rt from sched_rt_avg_update
>
> - Patches 7-8 add utilization tracking for interrupt and use it select OPP
> A test with iperf on hikey 6220 gives:
> w/o patchset w/ patchset
> Tx 276 Mbits/sec 304 Mbits/sec +10%
> Rx 299 Mbits/sec 328 Mbits/sec +09%
>
> 8 iterations of iperf -c server_address -r -t 5
> stdev is lower than 1%
> Only WFI idle state is enable (shallowest arm idle state)
>
> - Patches 9 removes the unused sched_avg_update code
>
> - Patch 10 removes the unused sched_time_avg_ms
>
> Change since v3:
> - add support of periodic update of blocked utilization
> - rebase on lastest tip/sched/core
>
> Change since v2:
> - move pelt code into a dedicated pelt.c file
> - rebase on load tracking changes
>
> Change since v1:
> - Only a rebase. I have addressed the comments on previous version in
> patch 1/2
>
> Vincent Guittot (10):
> sched/pelt: Move pelt related code in a dedicated file
> sched/rt: add rt_rq utilization tracking
> cpufreq/schedutil: add rt utilization tracking
> sched/dl: add dl_rq utilization tracking
> cpufreq/schedutil: get max utilization
> sched: remove rt and dl from sched_avg
> sched/irq: add irq utilization tracking
> cpufreq/schedutil: take into account interrupt
> sched: remove rt_avg code
> proc/sched: remove unused sched_time_avg_ms
>
> include/linux/sched/sysctl.h | 1 -
> kernel/sched/Makefile | 2 +-
> kernel/sched/core.c | 38 +---
> kernel/sched/cpufreq_schedutil.c | 24 ++-
> kernel/sched/deadline.c | 7 +-
> kernel/sched/fair.c | 381 +++----------------------------------
> kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++
> kernel/sched/pelt.h | 63 +++++++
> kernel/sched/rt.c | 10 +-
> kernel/sched/sched.h | 57 ++++--
> kernel/sysctl.c | 8 -
> 11 files changed, 563 insertions(+), 423 deletions(-)
> create mode 100644 kernel/sched/pelt.c
> create mode 100644 kernel/sched/pelt.h
>
> --
> 2.7.4
>


Attachments:
test-rt.json (961.00 B)

2018-06-05 10:59:47

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Vincent,

On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
> Hi Quentin,
>
> On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
> > This patchset initially tracked only the utilization of RT rq. During
> > OSPM summit, it has been discussed the opportunity to extend it in order
> > to get an estimate of the utilization of the CPU.
> >
> > - Patches 1-3 correspond to the content of patchset v4 and add utilization
> > tracking for rt_rq.
> >
> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> > reflect anymore the utilization of cfs tasks but only the remaining part that
> > is not used by rt tasks. We should monitor the stolen utilization and take
> > it into account when selecting OPP. This patchset doesn't change the OPP
> > selection policy for RT tasks but only for CFS tasks
> >
> > A rt-app use case which creates an always running cfs thread and a rt threads
> > that wakes up periodically with both threads pinned on same CPU, show lot of
> > frequency switches of the CPU whereas the CPU never goes idles during the
> > test. I can share the json file that I used for the test if someone is
> > interested in.
> >
> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
> > the cpufreq statistics outputs (stats are reset just before the test) :
> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> > without patchset : 1230
> > with patchset : 14
>
> I have attached the rt-app json file that I use for this test

Thank you very much ! I did a quick test with a much simpler fix to this
RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
I get the following results on Hikey960:

Without patch:
cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
12
cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
640
With patch
cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
8
cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
12

Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
this is an actual issue for realistic use-cases ?

What about the diff below (just a quick hack to show the idea) applied
on tip/sched/core ?

---8<---
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a8ba6d1f262a..23a4fb1c2c25 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->util_dl = cpu_util_dl(rq);
}

+unsigned long scale_rt_capacity(int cpu);
static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ int cpu = sg_cpu->cpu;
+ unsigned long util, dl_bw;

if (rq->rt.rt_nr_running)
return sg_cpu->max;
@@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+ util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
+ util >>= SCHED_CAPACITY_SHIFT;
+ util = arch_scale_cpu_capacity(NULL, cpu) - util;
+ util += sg_cpu->util_cfs;
+ dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+
+ /* Make sure to always provide the reserved freq to DL. */
+ return max(util, dl_bw);
}

static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f01f0f395f9a..0e87cbe47c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

-static unsigned long scale_rt_capacity(int cpu)
+unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
u64 total, used, age_stamp, avg;
--->8---



>
> >
> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
> > performance improvements:
> >
> > - Without patchset :
> > Test execution summary:
> > total time: 15.0009s
> > total number of events: 4903
> > total time taken by event execution: 14.9972
> > per-request statistics:
> > min: 1.23ms
> > avg: 3.06ms
> > max: 13.16ms
> > approx. 95 percentile: 12.73ms
> >
> > Threads fairness:
> > events (avg/stddev): 4903.0000/0.00
> > execution time (avg/stddev): 14.9972/0.00
> >
> > - With patchset:
> > Test execution summary:
> > total time: 15.0014s
> > total number of events: 7694
> > total time taken by event execution: 14.9979
> > per-request statistics:
> > min: 1.23ms
> > avg: 1.95ms
> > max: 10.49ms
> > approx. 95 percentile: 10.39ms
> >
> > Threads fairness:
> > events (avg/stddev): 7694.0000/0.00
> > execution time (avg/stddev): 14.9979/0.00
> >
> > The performance improvement is 56% for this use case.
> >
> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar
> > problem as with rt_rq
> >
> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove
> > dl and rt from sched_rt_avg_update
> >
> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP
> > A test with iperf on hikey 6220 gives:
> > w/o patchset w/ patchset
> > Tx 276 Mbits/sec 304 Mbits/sec +10%
> > Rx 299 Mbits/sec 328 Mbits/sec +09%
> >
> > 8 iterations of iperf -c server_address -r -t 5
> > stdev is lower than 1%
> > Only WFI idle state is enable (shallowest arm idle state)
> >
> > - Patches 9 removes the unused sched_avg_update code
> >
> > - Patch 10 removes the unused sched_time_avg_ms
> >
> > Change since v3:
> > - add support of periodic update of blocked utilization
> > - rebase on lastest tip/sched/core
> >
> > Change since v2:
> > - move pelt code into a dedicated pelt.c file
> > - rebase on load tracking changes
> >
> > Change since v1:
> > - Only a rebase. I have addressed the comments on previous version in
> > patch 1/2
> >
> > Vincent Guittot (10):
> > sched/pelt: Move pelt related code in a dedicated file
> > sched/rt: add rt_rq utilization tracking
> > cpufreq/schedutil: add rt utilization tracking
> > sched/dl: add dl_rq utilization tracking
> > cpufreq/schedutil: get max utilization
> > sched: remove rt and dl from sched_avg
> > sched/irq: add irq utilization tracking
> > cpufreq/schedutil: take into account interrupt
> > sched: remove rt_avg code
> > proc/sched: remove unused sched_time_avg_ms
> >
> > include/linux/sched/sysctl.h | 1 -
> > kernel/sched/Makefile | 2 +-
> > kernel/sched/core.c | 38 +---
> > kernel/sched/cpufreq_schedutil.c | 24 ++-
> > kernel/sched/deadline.c | 7 +-
> > kernel/sched/fair.c | 381 +++----------------------------------
> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++
> > kernel/sched/pelt.h | 63 +++++++
> > kernel/sched/rt.c | 10 +-
> > kernel/sched/sched.h | 57 ++++--
> > kernel/sysctl.c | 8 -
> > 11 files changed, 563 insertions(+), 423 deletions(-)
> > create mode 100644 kernel/sched/pelt.c
> > create mode 100644 kernel/sched/pelt.h
> >
> > --
> > 2.7.4
> >



2018-06-05 12:01:24

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 5 June 2018 at 12:57, Quentin Perret <[email protected]> wrote:
> Hi Vincent,
>
> On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
>> Hi Quentin,
>>
>> On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
>> > This patchset initially tracked only the utilization of RT rq. During
>> > OSPM summit, it has been discussed the opportunity to extend it in order
>> > to get an estimate of the utilization of the CPU.
>> >
>> > - Patches 1-3 correspond to the content of patchset v4 and add utilization
>> > tracking for rt_rq.
>> >
>> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
>> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
>> > reflect anymore the utilization of cfs tasks but only the remaining part that
>> > is not used by rt tasks. We should monitor the stolen utilization and take
>> > it into account when selecting OPP. This patchset doesn't change the OPP
>> > selection policy for RT tasks but only for CFS tasks
>> >
>> > A rt-app use case which creates an always running cfs thread and a rt threads
>> > that wakes up periodically with both threads pinned on same CPU, show lot of
>> > frequency switches of the CPU whereas the CPU never goes idles during the
>> > test. I can share the json file that I used for the test if someone is
>> > interested in.
>> >
>> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
>> > the cpufreq statistics outputs (stats are reset just before the test) :
>> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> > without patchset : 1230
>> > with patchset : 14
>>
>> I have attached the rt-app json file that I use for this test
>
> Thank you very much ! I did a quick test with a much simpler fix to this
> RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
> I get the following results on Hikey960:
>
> Without patch:
> cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> 12
> cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
> 640
> With patch
> cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> 8
> cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
> 12
>
> Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
> this is an actual issue for realistic use-cases ?

Yes, I think it's worth syncing and consolidating things on the
same metric. The result will be saner and more robust, as we will have
the same behavior.

>
> What about the diff below (just a quick hack to show the idea) applied
> on tip/sched/core ?
>
> ---8<---
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index a8ba6d1f262a..23a4fb1c2c25 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> sg_cpu->util_dl = cpu_util_dl(rq);
> }
>
> +unsigned long scale_rt_capacity(int cpu);
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> + int cpu = sg_cpu->cpu;
> + unsigned long util, dl_bw;
>
> if (rq->rt.rt_nr_running)
> return sg_cpu->max;
> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> + util >>= SCHED_CAPACITY_SHIFT;
> + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> + util += sg_cpu->util_cfs;
> + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> +
> + /* Make sure to always provide the reserved freq to DL. */
> + return max(util, dl_bw);
> }
>
> static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f01f0f395f9a..0e87cbe47c8b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
> return load_idx;
> }
>
> -static unsigned long scale_rt_capacity(int cpu)
> +unsigned long scale_rt_capacity(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> u64 total, used, age_stamp, avg;
> --->8---
>
>
>
>>
>> >
>> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
>> > performance improvements:
>> >
>> > - Without patchset :
>> > Test execution summary:
>> > total time: 15.0009s
>> > total number of events: 4903
>> > total time taken by event execution: 14.9972
>> > per-request statistics:
>> > min: 1.23ms
>> > avg: 3.06ms
>> > max: 13.16ms
>> > approx. 95 percentile: 12.73ms
>> >
>> > Threads fairness:
>> > events (avg/stddev): 4903.0000/0.00
>> > execution time (avg/stddev): 14.9972/0.00
>> >
>> > - With patchset:
>> > Test execution summary:
>> > total time: 15.0014s
>> > total number of events: 7694
>> > total time taken by event execution: 14.9979
>> > per-request statistics:
>> > min: 1.23ms
>> > avg: 1.95ms
>> > max: 10.49ms
>> > approx. 95 percentile: 10.39ms
>> >
>> > Threads fairness:
>> > events (avg/stddev): 7694.0000/0.00
>> > execution time (avg/stddev): 14.9979/0.00
>> >
>> > The performance improvement is 56% for this use case.
>> >
>> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar
>> > problem as with rt_rq
>> >
>> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove
>> > dl and rt from sched_rt_avg_update
>> >
>> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP
>> > A test with iperf on hikey 6220 gives:
>> > w/o patchset w/ patchset
>> > Tx 276 Mbits/sec 304 Mbits/sec +10%
>> > Rx 299 Mbits/sec 328 Mbits/sec +09%
>> >
>> > 8 iterations of iperf -c server_address -r -t 5
>> > stdev is lower than 1%
>> > Only WFI idle state is enable (shallowest arm idle state)
>> >
>> > - Patches 9 removes the unused sched_avg_update code
>> >
>> > - Patch 10 removes the unused sched_time_avg_ms
>> >
>> > Change since v3:
>> > - add support of periodic update of blocked utilization
>> > - rebase on lastest tip/sched/core
>> >
>> > Change since v2:
>> > - move pelt code into a dedicated pelt.c file
>> > - rebase on load tracking changes
>> >
>> > Change since v1:
>> > - Only a rebase. I have addressed the comments on previous version in
>> > patch 1/2
>> >
>> > Vincent Guittot (10):
>> > sched/pelt: Move pelt related code in a dedicated file
>> > sched/rt: add rt_rq utilization tracking
>> > cpufreq/schedutil: add rt utilization tracking
>> > sched/dl: add dl_rq utilization tracking
>> > cpufreq/schedutil: get max utilization
>> > sched: remove rt and dl from sched_avg
>> > sched/irq: add irq utilization tracking
>> > cpufreq/schedutil: take into account interrupt
>> > sched: remove rt_avg code
>> > proc/sched: remove unused sched_time_avg_ms
>> >
>> > include/linux/sched/sysctl.h | 1 -
>> > kernel/sched/Makefile | 2 +-
>> > kernel/sched/core.c | 38 +---
>> > kernel/sched/cpufreq_schedutil.c | 24 ++-
>> > kernel/sched/deadline.c | 7 +-
>> > kernel/sched/fair.c | 381 +++----------------------------------
>> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++
>> > kernel/sched/pelt.h | 63 +++++++
>> > kernel/sched/rt.c | 10 +-
>> > kernel/sched/sched.h | 57 ++++--
>> > kernel/sysctl.c | 8 -
>> > 11 files changed, 563 insertions(+), 423 deletions(-)
>> > create mode 100644 kernel/sched/pelt.c
>> > create mode 100644 kernel/sched/pelt.h
>> >
>> > --
>> > 2.7.4
>> >
>
>

2018-06-05 12:12:40

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Quentin,

On 05/06/18 11:57, Quentin Perret wrote:

[...]

> What about the diff below (just a quick hack to show the idea) applied
> on tip/sched/core ?
>
> ---8<---
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index a8ba6d1f262a..23a4fb1c2c25 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> sg_cpu->util_dl = cpu_util_dl(rq);
> }
>
> +unsigned long scale_rt_capacity(int cpu);
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> + int cpu = sg_cpu->cpu;
> + unsigned long util, dl_bw;
>
> if (rq->rt.rt_nr_running)
> return sg_cpu->max;
> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

Sorry to be pedantic, but this (ATM) includes the DL avg contribution, so,
since we use max below, we will probably have the same problem that we
discussed for Vincent's approach (overestimation of the DL contribution,
while we could use running_bw).

> + util >>= SCHED_CAPACITY_SHIFT;
> + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> + util += sg_cpu->util_cfs;
> + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

Why this_bw instead of running_bw?

Thanks,

- Juri

2018-06-05 13:07:54

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> Hi Quentin,
>
> On 05/06/18 11:57, Quentin Perret wrote:
>
> [...]
>
> > What about the diff below (just a quick hack to show the idea) applied
> > on tip/sched/core ?
> >
> > ---8<---
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index a8ba6d1f262a..23a4fb1c2c25 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > sg_cpu->util_dl = cpu_util_dl(rq);
> > }
> >
> > +unsigned long scale_rt_capacity(int cpu);
> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > + int cpu = sg_cpu->cpu;
> > + unsigned long util, dl_bw;
> >
> > if (rq->rt.rt_nr_running)
> > return sg_cpu->max;
> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > * ready for such an interface. So, we only do the latter for now.
> > */
> > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
>
> Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> since we use max below, we will probably have the same problem that we
> discussed on Vincent's approach (overestimation of DL contribution while
> we could use running_bw).

Ah no, you're right, this isn't great for long-running deadline tasks.
We should definitely account for the running_bw here, not the dl avg...

I was trying to address the issue of RT stealing time from CFS here, but
the DL integration isn't quite right with this patch as-is, I agree ...

>
> > + util >>= SCHED_CAPACITY_SHIFT;
> > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > + util += sg_cpu->util_cfs;
> > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>
> Why this_bw instead of running_bw?

So IIUC, this_bw should basically give you the absolute reservation (== the
sum of runtime/deadline ratios of all DL tasks on that rq).

The reason I added this max is that I'm still not sure I understand
how we can safely drop the freq below that point. If we don't guarantee
to always stay at least at the freq required by DL, aren't we risking to
get a deadline task stuck at a low freq because of rate limiting ? In
that case, if the task uses all of its runtime then you might start
missing deadlines ...

My feeling is that the only safe thing to do is to guarantee to never go
below the freq required by DL, and to optimistically add CFS tasks
without raising the OPP if we have good reasons to think that DL is
using less than it reserved (which is what we would get by using
running_bw above, I suppose). Does that make any sense ?
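
As a rough worked example of what that floor means (assuming, say, a
single DL task with runtime = 10ms and deadline = 30ms, and BW_SHIFT = 20):

	this_bw ~= (10 / 30) << BW_SHIFT ~= 349525
	dl_bw    = (this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT ~= 341 (out of 1024)

so the request would never drop below roughly a third of fmax, even while
that task is sleeping.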

Thanks !
Quentin

2018-06-05 13:13:22

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:
> On 5 June 2018 at 12:57, Quentin Perret <[email protected]> wrote:
> > Hi Vincent,
> >
> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
> >> Hi Quentin,
> >>
> >> On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
> >> > This patchset initially tracked only the utilization of RT rq. During
> >> > OSPM summit, it has been discussed the opportunity to extend it in order
> >> > to get an estimate of the utilization of the CPU.
> >> >
> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization
> >> > tracking for rt_rq.
> >> >
> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> >> > reflect anymore the utilization of cfs tasks but only the remaining part that
> >> > is not used by rt tasks. We should monitor the stolen utilization and take
> >> > it into account when selecting OPP. This patchset doesn't change the OPP
> >> > selection policy for RT tasks but only for CFS tasks
> >> >
> >> > A rt-app use case which creates an always running cfs thread and a rt threads
> >> > that wakes up periodically with both threads pinned on same CPU, show lot of
> >> > frequency switches of the CPU whereas the CPU never goes idles during the
> >> > test. I can share the json file that I used for the test if someone is
> >> > interested in.
> >> >
> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
> >> > the cpufreq statistics outputs (stats are reset just before the test) :
> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> >> > without patchset : 1230
> >> > with patchset : 14
> >>
> >> I have attached the rt-app json file that I use for this test
> >
> > Thank you very much ! I did a quick test with a much simpler fix to this
> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
> > I get the following results on Hikey960:
> >
> > Without patch:
> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> > 12
> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
> > 640
> > With patch
> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> > 8
> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
> > 12
> >
> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
> > this is an actual issue for realistic use-cases ?
>
> yes I think that it's worth syncing and consolidating things on the
> same metric. The result will be saner and more robust as we will have
> the same behavior

TBH I'm not disagreeing with that, the PELT-everywhere approach feels
cleaner in a way, but do you have a use-case in mind where this will
definitely help ?

I mean, yes the rt_avg is a slow response to the RT pressure, but is
this always a problem ? Ramping down slower might actually help in some
cases, no ?

>
> >
> > What about the diff below (just a quick hack to show the idea) applied
> > on tip/sched/core ?
> >
> > ---8<---
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index a8ba6d1f262a..23a4fb1c2c25 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > sg_cpu->util_dl = cpu_util_dl(rq);
> > }
> >
> > +unsigned long scale_rt_capacity(int cpu);
> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > + int cpu = sg_cpu->cpu;
> > + unsigned long util, dl_bw;
> >
> > if (rq->rt.rt_nr_running)
> > return sg_cpu->max;
> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > * ready for such an interface. So, we only do the latter for now.
> > */
> > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> > + util >>= SCHED_CAPACITY_SHIFT;
> > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > + util += sg_cpu->util_cfs;
> > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > +
> > + /* Make sure to always provide the reserved freq to DL. */
> > + return max(util, dl_bw);
> > }
> >
> > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f01f0f395f9a..0e87cbe47c8b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
> > return load_idx;
> > }
> >
> > -static unsigned long scale_rt_capacity(int cpu)
> > +unsigned long scale_rt_capacity(int cpu)
> > {
> > struct rq *rq = cpu_rq(cpu);
> > u64 total, used, age_stamp, avg;
> > --->8---
> >
> >
> >
[...]

2018-06-05 13:17:31

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 05/06/18 14:05, Quentin Perret wrote:
> On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> > Hi Quentin,
> >
> > On 05/06/18 11:57, Quentin Perret wrote:
> >
> > [...]
> >
> > > What about the diff below (just a quick hack to show the idea) applied
> > > on tip/sched/core ?
> > >
> > > ---8<---
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index a8ba6d1f262a..23a4fb1c2c25 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > sg_cpu->util_dl = cpu_util_dl(rq);
> > > }
> > >
> > > +unsigned long scale_rt_capacity(int cpu);
> > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > {
> > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > + int cpu = sg_cpu->cpu;
> > > + unsigned long util, dl_bw;
> > >
> > > if (rq->rt.rt_nr_running)
> > > return sg_cpu->max;
> > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > * ready for such an interface. So, we only do the latter for now.
> > > */
> > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> >
> > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> > since we use max below, we will probably have the same problem that we
> > discussed on Vincent's approach (overestimation of DL contribution while
> > we could use running_bw).
>
> Ah no, you're right, this isn't great for long running deadline tasks.
> We should definitely account for the running_bw here, not the dl avg...
>
> I was trying to address the issue of RT stealing time from CFS here, but
> the DL integration isn't quite right which this patch as-is, I agree ...
>
> >
> > > + util >>= SCHED_CAPACITY_SHIFT;
> > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > > + util += sg_cpu->util_cfs;
> > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >
> > Why this_bw instead of running_bw?
>
> So IIUC, this_bw should basically give you the absolute reservation (== the
> sum of runtime/deadline ratios of all DL tasks on that rq).

Yep.

> The reason I added this max is because I'm still not sure to understand
> how we can safely drop the freq below that point ? If we don't guarantee
> to always stay at least at the freq required by DL, aren't we risking to
> start a deadline tasks stuck at a low freq because of rate limiting ? In
> this case, if that tasks uses all of its runtime then you might start
> missing deadlines ...

We decided to avoid (software) rate limiting for DL with e97a90f7069b
("sched/cpufreq: Rate limits for SCHED_DEADLINE").

> My feeling is that the only safe thing to do is to guarantee to never go
> below the freq required by DL, and to optimistically add CFS tasks
> without raising the OPP if we have good reasons to think that DL is
> using less than it required (which is what we should get by using
> running_bw above I suppose). Does that make any sense ?

We still can't avoid the hardware limits, though, so using running_bw
is a trade-off between safety (especially considering soft real-time
scenarios) and energy consumption (and it seems to be working in
practice).
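
A minimal sketch of that trade-off, assuming the hack quoted above as a
base (util and dl_bw as declared there) and only changing which rq->dl
field feeds the floor:

	/* Follow only the DL tasks that are currently active, not the full reservation */
	dl_bw = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* Still never request less than what the active DL tasks need right now */
	return max(util, dl_bw);

With running_bw the floor drops while the reserved tasks sleep, which is
where the energy saving comes from, at the price of the soft real-time
risk mentioned above.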

Thanks,

- Juri

2018-06-05 13:19:41

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 5 June 2018 at 15:12, Quentin Perret <[email protected]> wrote:
> On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:
>> On 5 June 2018 at 12:57, Quentin Perret <[email protected]> wrote:
>> > Hi Vincent,
>> >
>> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
>> >> Hi Quentin,
>> >>
>> >> On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
>> >> > This patchset initially tracked only the utilization of RT rq. During
>> >> > OSPM summit, it has been discussed the opportunity to extend it in order
>> >> > to get an estimate of the utilization of the CPU.
>> >> >
>> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization
>> >> > tracking for rt_rq.
>> >> >
>> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
>> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
>> >> > reflect anymore the utilization of cfs tasks but only the remaining part that
>> >> > is not used by rt tasks. We should monitor the stolen utilization and take
>> >> > it into account when selecting OPP. This patchset doesn't change the OPP
>> >> > selection policy for RT tasks but only for CFS tasks
>> >> >
>> >> > A rt-app use case which creates an always running cfs thread and a rt threads
>> >> > that wakes up periodically with both threads pinned on same CPU, show lot of
>> >> > frequency switches of the CPU whereas the CPU never goes idles during the
>> >> > test. I can share the json file that I used for the test if someone is
>> >> > interested in.
>> >> >
>> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
>> >> > the cpufreq statistics outputs (stats are reset just before the test) :
>> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >> > without patchset : 1230
>> >> > with patchset : 14
>> >>
>> >> I have attached the rt-app json file that I use for this test
>> >
>> > Thank you very much ! I did a quick test with a much simpler fix to this
>> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
>> > I get the following results on Hikey960:
>> >
>> > Without patch:
>> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> > 12
>> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>> > 640
>> > With patch
>> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> > 8
>> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>> > 12
>> >
>> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
>> > this is an actual issue for realistic use-cases ?
>>
>> yes I think that it's worth syncing and consolidating things on the
>> same metric. The result will be saner and more robust as we will have
>> the same behavior
>
> TBH I'm not disagreeing with that, the PELT-everywhere approach feels
> cleaner in a way, but do you have a use-case in mind where this will
> definitely help ?
>
> I mean, yes the rt_avg is a slow response to the RT pressure, but is
> this always a problem ? Ramping down slower might actually help in some
> cases no ?

I would say no, because when one decreases the other one will not
increase at the same pace, and we will end up with some wrong behavior
or decision.

[...]

2018-06-05 13:53:57

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:
> On 5 June 2018 at 15:12, Quentin Perret <[email protected]> wrote:
> > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:
> >> On 5 June 2018 at 12:57, Quentin Perret <[email protected]> wrote:
> >> > Hi Vincent,
> >> >
> >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
> >> >> Hi Quentin,
> >> >>
> >> >> On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
> >> >> > This patchset initially tracked only the utilization of RT rq. During
> >> >> > OSPM summit, it has been discussed the opportunity to extend it in order
> >> >> > to get an estimate of the utilization of the CPU.
> >> >> >
> >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization
> >> >> > tracking for rt_rq.
> >> >> >
> >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that
> >> >> > is not used by rt tasks. We should monitor the stolen utilization and take
> >> >> > it into account when selecting OPP. This patchset doesn't change the OPP
> >> >> > selection policy for RT tasks but only for CFS tasks
> >> >> >
> >> >> > A rt-app use case which creates an always running cfs thread and a rt threads
> >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of
> >> >> > frequency switches of the CPU whereas the CPU never goes idles during the
> >> >> > test. I can share the json file that I used for the test if someone is
> >> >> > interested in.
> >> >> >
> >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
> >> >> > the cpufreq statistics outputs (stats are reset just before the test) :
> >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> >> >> > without patchset : 1230
> >> >> > with patchset : 14
> >> >>
> >> >> I have attached the rt-app json file that I use for this test
> >> >
> >> > Thank you very much ! I did a quick test with a much simpler fix to this
> >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
> >> > I get the following results on Hikey960:
> >> >
> >> > Without patch:
> >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> >> > 12
> >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
> >> > 640
> >> > With patch
> >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> >> > 8
> >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
> >> > 12
> >> >
> >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
> >> > this is an actual issue for realistic use-cases ?
> >>
> >> yes I think that it's worth syncing and consolidating things on the
> >> same metric. The result will be saner and more robust as we will have
> >> the same behavior
> >
> > TBH I'm not disagreeing with that, the PELT-everywhere approach feels
> > cleaner in a way, but do you have a use-case in mind where this will
> > definitely help ?
> >
> > I mean, yes the rt_avg is a slow response to the RT pressure, but is
> > this always a problem ? Ramping down slower might actually help in some
> > cases no ?
>
> I would say no because when one will decrease the other one will not
> increase at the same pace and we will have some wrong behavior or
> decision

I think I get your point. Yes, sometimes, the slow-moving rt_avg can be
off a little bit (which can be good or bad, depending on the case) if your
RT task runs a lot with a very changing behaviour. And again, I'm not
fundamentally against the idea of having extra complexity for RT/IRQ PELT
signals _if_ we have a use-case. But is there a real use-case where we
really need all of that ? That's a genuine question, I honestly don't have
the answer :-)

[...]

2018-06-05 13:57:24

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 5 June 2018 at 15:52, Quentin Perret <[email protected]> wrote:
> On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:
>> On 5 June 2018 at 15:12, Quentin Perret <[email protected]> wrote:
>> > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote:
>> >> On 5 June 2018 at 12:57, Quentin Perret <[email protected]> wrote:
>> >> > Hi Vincent,
>> >> >
>> >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote:
>> >> >> Hi Quentin,
>> >> >>
>> >> >> On 25 May 2018 at 15:12, Vincent Guittot <[email protected]> wrote:
>> >> >> > This patchset initially tracked only the utilization of RT rq. During
>> >> >> > OSPM summit, it has been discussed the opportunity to extend it in order
>> >> >> > to get an estimate of the utilization of the CPU.
>> >> >> >
>> >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization
>> >> >> > tracking for rt_rq.
>> >> >> >
>> >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency
>> >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
>> >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that
>> >> >> > is not used by rt tasks. We should monitor the stolen utilization and take
>> >> >> > it into account when selecting OPP. This patchset doesn't change the OPP
>> >> >> > selection policy for RT tasks but only for CFS tasks
>> >> >> >
>> >> >> > A rt-app use case which creates an always running cfs thread and a rt threads
>> >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of
>> >> >> > frequency switches of the CPU whereas the CPU never goes idles during the
>> >> >> > test. I can share the json file that I used for the test if someone is
>> >> >> > interested in.
>> >> >> >
>> >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom),
>> >> >> > the cpufreq statistics outputs (stats are reset just before the test) :
>> >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >> >> > without patchset : 1230
>> >> >> > with patchset : 14
>> >> >>
>> >> >> I have attached the rt-app json file that I use for this test
>> >> >
>> >> > Thank you very much ! I did a quick test with a much simpler fix to this
>> >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity().
>> >> > I get the following results on Hikey960:
>> >> >
>> >> > Without patch:
>> >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >> > 12
>> >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>> >> > 640
>> >> > With patch
>> >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> >> > 8
>> >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans
>> >> > 12
>> >> >
>> >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think
>> >> > this is an actual issue for realistic use-cases ?
>> >>
>> >> yes I think that it's worth syncing and consolidating things on the
>> >> same metric. The result will be saner and more robust as we will have
>> >> the same behavior
>> >
>> > TBH I'm not disagreeing with that, the PELT-everywhere approach feels
>> > cleaner in a way, but do you have a use-case in mind where this will
>> > definitely help ?
>> >
>> > I mean, yes the rt_avg is a slow response to the RT pressure, but is
>> > this always a problem ? Ramping down slower might actually help in some
>> > cases no ?
>>
>> I would say no because when one will decrease the other one will not
>> increase at the same pace and we will have some wrong behavior or
>> decision
>
> I think I get your point. Yes, sometimes, the slow-moving rt_avg can be
> off a little bit (which can be good or bad, depending in the case) if your
> RT task runs a lot with very changing behaviour. And again, I'm not
> fundamentally against the idea of having extra complexity for RT/IRQ PELT
> signals _if_ we have a use-case. But is there a real use-case where we
> really need all of that ? That's a true question, I honestly don't have
> the answer :-)

The iperf test result is another example of the benefit

[...]

2018-06-05 14:03:23

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
> On 05/06/18 14:05, Quentin Perret wrote:
> > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> > > Hi Quentin,
> > >
> > > On 05/06/18 11:57, Quentin Perret wrote:
> > >
> > > [...]
> > >
> > > > What about the diff below (just a quick hack to show the idea) applied
> > > > on tip/sched/core ?
> > > >
> > > > ---8<---
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > index a8ba6d1f262a..23a4fb1c2c25 100644
> > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > > sg_cpu->util_dl = cpu_util_dl(rq);
> > > > }
> > > >
> > > > +unsigned long scale_rt_capacity(int cpu);
> > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > {
> > > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > > + int cpu = sg_cpu->cpu;
> > > > + unsigned long util, dl_bw;
> > > >
> > > > if (rq->rt.rt_nr_running)
> > > > return sg_cpu->max;
> > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > > * ready for such an interface. So, we only do the latter for now.
> > > > */
> > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> > >
> > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> > > since we use max below, we will probably have the same problem that we
> > > discussed on Vincent's approach (overestimation of DL contribution while
> > > we could use running_bw).
> >
> > Ah no, you're right, this isn't great for long running deadline tasks.
> > We should definitely account for the running_bw here, not the dl avg...
> >
> > I was trying to address the issue of RT stealing time from CFS here, but
> > the DL integration isn't quite right which this patch as-is, I agree ...
> >
> > >
> > > > + util >>= SCHED_CAPACITY_SHIFT;
> > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > > > + util += sg_cpu->util_cfs;
> > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > >
> > > Why this_bw instead of running_bw?
> >
> > So IIUC, this_bw should basically give you the absolute reservation (== the
> > sum of runtime/deadline ratios of all DL tasks on that rq).
>
> Yep.
>
> > The reason I added this max is because I'm still not sure to understand
> > how we can safely drop the freq below that point ? If we don't guarantee
> > to always stay at least at the freq required by DL, aren't we risking to
> > start a deadline tasks stuck at a low freq because of rate limiting ? In
> > this case, if that tasks uses all of its runtime then you might start
> > missing deadlines ...
>
> We decided to avoid (software) rate limiting for DL with e97a90f7069b
> ("sched/cpufreq: Rate limits for SCHED_DEADLINE").

Right, I spotted that one, but yeah you could also be limited by HW ...

>
> > My feeling is that the only safe thing to do is to guarantee to never go
> > below the freq required by DL, and to optimistically add CFS tasks
> > without raising the OPP if we have good reasons to think that DL is
> > using less than it required (which is what we should get by using
> > running_bw above I suppose). Does that make any sense ?
>
> Then we can't still avoid the hardware limits, so using running_bw is a
> trade off between safety (especially considering soft real-time
> scenarios) and energy consumption (which seems to be working in
> practice).

Ok, I see ... Have you guys already tried something like my patch above
(keeping the freq >= this_bw) in real-world use cases ? Is this costing
that much energy in practice ? If we fill the gaps left by DL (when it
doesn't use all of its runtime) with CFS tasks, that might not be so bad ...

Thank you very much for taking the time to explain all this, I really
appreciate it :-)

Quentin

2018-06-05 14:10:41

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 15:55:43 (+0200), Vincent Guittot wrote:
> On 5 June 2018 at 15:52, Quentin Perret <[email protected]> wrote:
> > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:
> >> On 5 June 2018 at 15:12, Quentin Perret <[email protected]> wrote:
> >> I would say no because when one will decrease the other one will not
> >> increase at the same pace and we will have some wrong behavior or
> >> decision
> >
> > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be
> > off a little bit (which can be good or bad, depending in the case) if your
> > RT task runs a lot with very changing behaviour. And again, I'm not
> > fundamentally against the idea of having extra complexity for RT/IRQ PELT
> > signals _if_ we have a use-case. But is there a real use-case where we
> > really need all of that ? That's a true question, I honestly don't have
> > the answer :-)
>
> The iperf test result is another example of the benefit

The iperf test result ? The sysbench test you mean ?

2018-06-05 14:14:03

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 05/06/18 15:01, Quentin Perret wrote:
> On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
> > On 05/06/18 14:05, Quentin Perret wrote:
> > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> > > > Hi Quentin,
> > > >
> > > > On 05/06/18 11:57, Quentin Perret wrote:
> > > >
> > > > [...]
> > > >
> > > > > What about the diff below (just a quick hack to show the idea) applied
> > > > > on tip/sched/core ?
> > > > >
> > > > > ---8<---
> > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > index a8ba6d1f262a..23a4fb1c2c25 100644
> > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > > > sg_cpu->util_dl = cpu_util_dl(rq);
> > > > > }
> > > > >
> > > > > +unsigned long scale_rt_capacity(int cpu);
> > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > {
> > > > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > > > + int cpu = sg_cpu->cpu;
> > > > > + unsigned long util, dl_bw;
> > > > >
> > > > > if (rq->rt.rt_nr_running)
> > > > > return sg_cpu->max;
> > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > > > * ready for such an interface. So, we only do the latter for now.
> > > > > */
> > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> > > >
> > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> > > > since we use max below, we will probably have the same problem that we
> > > > discussed on Vincent's approach (overestimation of DL contribution while
> > > > we could use running_bw).
> > >
> > > Ah no, you're right, this isn't great for long running deadline tasks.
> > > We should definitely account for the running_bw here, not the dl avg...
> > >
> > > I was trying to address the issue of RT stealing time from CFS here, but
> > > the DL integration isn't quite right which this patch as-is, I agree ...
> > >
> > > >
> > > > > + util >>= SCHED_CAPACITY_SHIFT;
> > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > > > > + util += sg_cpu->util_cfs;
> > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > >
> > > > Why this_bw instead of running_bw?
> > >
> > > So IIUC, this_bw should basically give you the absolute reservation (== the
> > > sum of runtime/deadline ratios of all DL tasks on that rq).
> >
> > Yep.
> >
> > > The reason I added this max is because I'm still not sure to understand
> > > how we can safely drop the freq below that point ? If we don't guarantee
> > > to always stay at least at the freq required by DL, aren't we risking to
> > > start a deadline tasks stuck at a low freq because of rate limiting ? In
> > > this case, if that tasks uses all of its runtime then you might start
> > > missing deadlines ...
> >
> > We decided to avoid (software) rate limiting for DL with e97a90f7069b
> > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").
>
> Right, I spotted that one, but yeah you could also be limited by HW ...
>
> >
> > > My feeling is that the only safe thing to do is to guarantee to never go
> > > below the freq required by DL, and to optimistically add CFS tasks
> > > without raising the OPP if we have good reasons to think that DL is
> > > using less than it required (which is what we should get by using
> > > running_bw above I suppose). Does that make any sense ?
> >
> > Then we can't still avoid the hardware limits, so using running_bw is a
> > trade off between safety (especially considering soft real-time
> > scenarios) and energy consumption (which seems to be working in
> > practice).
>
> Ok, I see ... Have you guys already tried something like my patch above
> (keeping the freq >= this_bw) in real world use cases ? Is this costing
> that much energy in practice ? If we fill the gaps left by DL (when it

IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so
he might add some numbers to my words above. I didn't (yet). But please
consider that I might be reserving (for example) 50% of the bandwidth for
my heavy and time-sensitive task and then have that task wake up only once
in a while (while I'll be keeping the clock speed up for the whole time). :/

> doesn't use all the runtime) with CFS tasks that might no be so bad ...
>
> Thank you very much for taking the time to explain all this, I really
> appreciate :-)

Sure. Thanks for participating in the discussion!

Best,

- Juri

2018-06-05 14:18:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:
> On 4 June 2018 at 18:50, Peter Zijlstra <[email protected]> wrote:

> > So this patch-set tracks the !cfs occupation using the same function,
> > which is all good. But what, if instead of using that to compensate the
> > OPP selection, we employ that to renormalize the util signal?
> >
> > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> > then I think your initial problem goes away. Because while the RT task
> > will push the util to .5, it will at the same time push the CPU capacity
> > to .5, and renormalized that gives 1.
> >
> > NOTE: the renorm would then become something like:
> > scale_cpu = arch_scale_cpu_capacity() / rt_frac();

Should probably be:

	scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())
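
Plugging the numbers from the earlier example into that renorm (reading it
as renormalized util = util / (1 - rt_frac())), as a sanity check:

	rt_frac()         = 0.5   (RT task eating half the CPU)
	cfs util          = 0.5   (the always-running CFS task only gets half)
	renormalized util = 0.5 / (1 - 0.5) = 1.0

i.e. the CFS task is correctly seen as needing the whole CPU, whereas the
clock-stopping variant discussed below reports util = 1 and
1 / (1 - 0.3) ~= 1.43 for the .3 RT case, which is the broken result.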

> >
> >
> > On IRC I mentioned stopping the CFS clock when preempted, and while that
> > would result in fixed numbers, Vincent was right in pointing out the
> > numbers will be difficult to interpret, since the meaning will be purely
> > CPU local and I'm not sure you can actually fix it again with
> > normalization.
> >
> > Imagine, running a .3 RT task, that would push the (always running) CFS
> > down to .7, but because we discard all !cfs time, it actually has 1. If
> > we try and normalize that we'll end up with ~1.43, which is of course
> > completely broken.
> >
> >
> > _However_, all that happens for util, also happens for load. So the above
> > scenario will also make the CPU appear less loaded than it actually is.
>
> The load will continue to increase because we track runnable state and
> not running for the load

Duh yes. So renormalizing it once, like proposed for util would actually
do the right thing there too. Would not that allow us to get rid of
much of the capacity magic in the load balance code?

/me thinks more..

Bah, no.. because you don't want this dynamic renormalization part of
the sums. So you want to keep it after the fact. :/

> As you mentioned, scale_rt_capacity give the remaining capacity for
> cfs and it will behave like cfs util_avg now that it uses PELT. So as
> long as cfs util_avg < scale_rt_capacity(we probably need a margin)
> we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
> OPP because we have remaining spare capacity but if cfs util_avg ==
> scale_rt_capacity, we make sure to use max OPP.

Good point, when cfs-util < cfs-cap then there is idle time and the util
number is 'right', when cfs-util == cfs-cap we're overcommitted and
should go max.

Since the util and cap values are aligned that should track nicely.
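
To make the rule concrete, a minimal sketch (function and parameter names here
are illustrative assumptions, not the posted patches; the margin is the one
Vincent says might be needed):

/* Sketch: utilization to request for one CPU's frequency selection. */
static unsigned long pick_freq_util(unsigned long max_cap,   /* arch CPU capacity        */
                                    unsigned long cfs_util,  /* cfs_rq util_avg          */
                                    unsigned long rt_util,   /* rt_rq util_avg           */
                                    unsigned long dl_bw,     /* DL bandwidth, same scale */
                                    unsigned long cfs_cap,   /* scale_rt_capacity()      */
                                    unsigned long margin)
{
        unsigned long util = dl_bw + cfs_util + rt_util;

        /* cfs-util (nearly) equal to cfs-cap: no idle time left, util is capped, go max */
        if (cfs_util + margin >= cfs_cap)
                return max_cap;

        /* otherwise there is spare capacity and the summed util reflects real demand */
        return util < max_cap ? util : max_cap;
}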

2018-06-05 14:22:52

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 15:09:54 (+0100), Quentin Perret wrote:
> On Tuesday 05 Jun 2018 at 15:55:43 (+0200), Vincent Guittot wrote:
> > On 5 June 2018 at 15:52, Quentin Perret <[email protected]> wrote:
> > > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote:
> > >> On 5 June 2018 at 15:12, Quentin Perret <[email protected]> wrote:
> > >> I would say no because when one will decrease the other one will not
> > >> increase at the same pace and we will have some wrong behavior or
> > >> decision
> > >
> > > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be
> > > off a little bit (which can be good or bad, depending in the case) if your
> > > RT task runs a lot with very changing behaviour. And again, I'm not
> > > fundamentally against the idea of having extra complexity for RT/IRQ PELT
> > > signals _if_ we have a use-case. But is there a real use-case where we
> > > really need all of that ? That's a true question, I honestly don't have
> > > the answer :-)
> >
> > The iperf test result is another example of the benefit
>
> The iperf test result ? The sysbench test you mean ?

Ah sorry, I missed that one from the cover letter ... I'll look into that
then :-)

Thanks,
Quentin

2018-06-05 15:06:10

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 05/06/18 16:18, Peter Zijlstra wrote:
> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:

[...]

> > As you mentioned, scale_rt_capacity give the remaining capacity for
> > cfs and it will behave like cfs util_avg now that it uses PELT. So as
> > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
> > OPP because we have remaining spare capacity but if cfs util_avg ==
> > scale_rt_capacity, we make sure to use max OPP.
>
> Good point, when cfs-util < cfs-cap then there is idle time and the util
> number is 'right', when cfs-util == cfs-cap we're overcommitted and
> should go max.
>
> Since the util and cap values are aligned that should track nicely.

Yeah. Makes sense to me as well. :)

2018-06-05 15:40:12

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 05-Jun 16:18, Peter Zijlstra wrote:
> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:
> > On 4 June 2018 at 18:50, Peter Zijlstra <[email protected]> wrote:
>
> > > So this patch-set tracks the !cfs occupation using the same function,
> > > which is all good. But what, if instead of using that to compensate the
> > > OPP selection, we employ that to renormalize the util signal?
> > >
> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> > > then I think your initial problem goes away. Because while the RT task
> > > will push the util to .5, it will at the same time push the CPU capacity
> > > to .5, and renormalized that gives 1.

And would not that also mean that a 50% task co-scheduled with the
same 50% RT task will be reported as a 100% util_avg task?

> > >
> > > NOTE: the renorm would then become something like:
> > > scale_cpu = arch_scale_cpu_capacity() / rt_frac();
>
> Should probably be:
>
> scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())
>
> > >
> > >
> > > On IRC I mentioned stopping the CFS clock when preempted, and while that
> > > would result in fixed numbers, Vincent was right in pointing out the
> > > numbers will be difficult to interpret, since the meaning will be purely
> > > CPU local and I'm not sure you can actually fix it again with
> > > normalization.
> > >
> > > Imagine, running a .3 RT task, that would push the (always running) CFS
> > > down to .7, but because we discard all !cfs time, it actually has 1. If
> > > we try and normalize that we'll end up with ~1.43, which is of course
> > > completely broken.
> > >
> > >
> > > _However_, all that happens for util, also happens for load. So the above
> > > scenario will also make the CPU appear less loaded than it actually is.
> >
> > The load will continue to increase because we track runnable state and
> > not running for the load
>
> Duh yes. So renormalizing it once, like proposed for util would actually
> do the right thing there too. Would not that allow us to get rid of
> much of the capacity magic in the load balance code?
>
> /me thinks more..
>
> Bah, no.. because you don't want this dynamic renormalization part of
> the sums. So you want to keep it after the fact. :/
>
> > As you mentioned, scale_rt_capacity give the remaining capacity for
> > cfs and it will behave like cfs util_avg now that it uses PELT. So as
> > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
> > OPP because we have remaining spare capacity but if cfs util_avg ==
> > scale_rt_capacity, we make sure to use max OPP.

What will happen for the 50% task of the example above?

> Good point, when cfs-util < cfs-cap then there is idle time and the util
> number is 'right', when cfs-util == cfs-cap we're overcommitted and
> should go max.

Again I cannot easily read the example above...

Would that mean that a 50% CFS task, preempted by a 50% RT task (which
already set OPP to max while RUNNABLE) will end up running at the max
OPP too?

> Since the util and cap values are aligned that should track nicely.

True... the only potential issue I see is that we are steering PELT
behaviors towards better driving schedutil to run high-demand
workloads while _maybe_ affecting quite sensibly the capacity of PELT
to describe how much CPU a task uses.

Ultimately, utilization has always been a metric on "how much you
use"... while here it seems to me we are bending it to be something to
define "how fast you have to run".

--
#include <best/regards.h>

Patrick Bellasi

2018-06-05 22:29:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tue, Jun 05, 2018 at 04:38:26PM +0100, Patrick Bellasi wrote:
> On 05-Jun 16:18, Peter Zijlstra wrote:
> > On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:

> > > As you mentioned, scale_rt_capacity give the remaining capacity for
> > > cfs and it will behave like cfs util_avg now that it uses PELT. So as
> > > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
> > > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
> > > OPP because we have remaining spare capacity but if cfs util_avg ==
> > > scale_rt_capacity, we make sure to use max OPP.
>
> What will happen for the 50% task of the example above?

When the cfs-cap reaches 50% (where cfs_cap := 1 - rt_avg - dl_avg -
stop_avg - irq_avg) a cfs-util of 50% means that there is no idle time.

So util will still be 50%, nothing funny. But frequency selection will
see util==cap and select max (it might not have because reduction could
be due to IRQ pressure for example).

At the moment cfs-cap rises (>50%), and the cfs-util stays at 50%, we'll
have 50% utilization. We know there is idle time, the task could use
more if it wanted to.

> > Good point, when cfs-util < cfs-cap then there is idle time and the util
> > number is 'right', when cfs-util == cfs-cap we're overcommitted and
> > should go max.
>
> Again I cannot easily read the example above...
>
> Would that mean that a 50% CFS task, preempted by a 50% RT task (which
> already set OPP to max while RUNNABLE) will end up running at the max
> OPP too?

Yes, because there is no idle time. When there is no idle time, max freq
is the right frequency.

The moment cfs-util drops below cfs-cap, we'll stop running at max,
because at that point we've found idle time to reduce frequency with.

> > Since the util and cap values are aligned that should track nicely.
>
> True... the only potential issue I see is that we are steering PELT
> behaviors towards better driving schedutil to run high-demand
> workloads while _maybe_ affecting quite sensibly the capacity of PELT
> to describe how much CPU a task uses.
>
> Ultimately, utilization has always been a metric on "how much you
> use"... while here it seems to me we are bending it to be something to
> define "how fast you have to run".

This latest proposal does not in fact change the util tracking. But in
general, 'how much do you use' can be a very difficult question, see the
whole turbo / hardware managed dvfs discussion a week or so ago.



2018-06-06 09:46:06

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:
> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:
> > On 4 June 2018 at 18:50, Peter Zijlstra <[email protected]> wrote:
>
> > > So this patch-set tracks the !cfs occupation using the same function,
> > > which is all good. But what, if instead of using that to compensate the
> > > OPP selection, we employ that to renormalize the util signal?
> > >
> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> > > then I think your initial problem goes away. Because while the RT task
> > > will push the util to .5, it will at the same time push the CPU capacity
> > > to .5, and renormalized that gives 1.
> > >
> > > NOTE: the renorm would then become something like:
> > > scale_cpu = arch_scale_cpu_capacity() / rt_frac();
>
> Should probably be:
>
> scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())
>
> > >
> > >
> > > On IRC I mentioned stopping the CFS clock when preempted, and while that
> > > would result in fixed numbers, Vincent was right in pointing out the
> > > numbers will be difficult to interpret, since the meaning will be purely
> > > CPU local and I'm not sure you can actually fix it again with
> > > normalization.
> > >
> > > Imagine, running a .3 RT task, that would push the (always running) CFS
> > > down to .7, but because we discard all !cfs time, it actually has 1. If
> > > we try and normalize that we'll end up with ~1.43, which is of course
> > > completely broken.
> > >
> > >
> > > _However_, all that happens for util, also happens for load. So the above
> > > scenario will also make the CPU appear less loaded than it actually is.
> >
> > The load will continue to increase because we track runnable state and
> > not running for the load
>
> Duh yes. So renormalizing it once, like proposed for util would actually
> do the right thing there too. Would not that allow us to get rid of
> much of the capacity magic in the load balance code?
>
> /me thinks more..
>
> Bah, no.. because you don't want this dynamic renormalization part of
> the sums. So you want to keep it after the fact. :/
>
> > As you mentioned, scale_rt_capacity give the remaining capacity for
> > cfs and it will behave like cfs util_avg now that it uses PELT. So as
> > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
> > OPP because we have remaining spare capacity but if cfs util_avg ==
> > scale_rt_capacity, we make sure to use max OPP.
>
> Good point, when cfs-util < cfs-cap then there is idle time and the util
> number is 'right', when cfs-util == cfs-cap we're overcommitted and
> should go max.
>
> Since the util and cap values are aligned that should track nicely.

So Vincent proposed to have a margin between cfs util and cfs cap to be
sure there is a little bit of idle time. This is _exactly_ what the
overutilized flag in EAS does. It would actually make a lot of sense
to use that flag in schedutil. The idea is basically to say: if there
isn't enough idle time on all CPUs, the util signals are kinda wrong, so
let's not make any decisions (task placement or OPP selection) based on
them. If overutilized, go to max freq. Does that make sense ?

Thanks,
Quentin
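
As a sketch of how such a condition could be used for OPP selection (purely
illustrative: the helper name, the ~20% headroom and the plumbing are
assumptions, not the actual EAS patches):

/* Sketch: does CFS still have idle time left on this CPU? */
static int cpu_has_idle_headroom(unsigned long cfs_util, unsigned long cfs_cap)
{
        /* require ~20% headroom between what CFS uses and what is left for it */
        return cfs_util * 1280 < cfs_cap * 1024;
}

/*
 * In the governor: if the CPU (or, with the root-domain flag, the whole
 * system) has no headroom, the util signals can no longer be trusted, so
 * request fmax; otherwise request the aggregated utilization as usual.
 */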

2018-06-06 10:00:18

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 6 June 2018 at 11:44, Quentin Perret <[email protected]> wrote:
> On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:
>> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:
>> > On 4 June 2018 at 18:50, Peter Zijlstra <[email protected]> wrote:
>>
>> > > So this patch-set tracks the !cfs occupation using the same function,
>> > > which is all good. But what, if instead of using that to compensate the
>> > > OPP selection, we employ that to renormalize the util signal?
>> > >
>> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
>> > > then I think your initial problem goes away. Because while the RT task
>> > > will push the util to .5, it will at the same time push the CPU capacity
>> > > to .5, and renormalized that gives 1.
>> > >
>> > > NOTE: the renorm would then become something like:
>> > > scale_cpu = arch_scale_cpu_capacity() / rt_frac();
>>
>> Should probably be:
>>
>> scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())
>>
>> > >
>> > >
>> > > On IRC I mentioned stopping the CFS clock when preempted, and while that
>> > > would result in fixed numbers, Vincent was right in pointing out the
>> > > numbers will be difficult to interpret, since the meaning will be purely
>> > > CPU local and I'm not sure you can actually fix it again with
>> > > normalization.
>> > >
>> > > Imagine, running a .3 RT task, that would push the (always running) CFS
>> > > down to .7, but because we discard all !cfs time, it actually has 1. If
>> > > we try and normalize that we'll end up with ~1.43, which is of course
>> > > completely broken.
>> > >
>> > >
>> > > _However_, all that happens for util, also happens for load. So the above
>> > > scenario will also make the CPU appear less loaded than it actually is.
>> >
>> > The load will continue to increase because we track runnable state and
>> > not running for the load
>>
>> Duh yes. So renormalizing it once, like proposed for util would actually
>> do the right thing there too. Would not that allow us to get rid of
>> much of the capacity magic in the load balance code?
>>
>> /me thinks more..
>>
>> Bah, no.. because you don't want this dynamic renormalization part of
>> the sums. So you want to keep it after the fact. :/
>>
>> > As you mentioned, scale_rt_capacity give the remaining capacity for
>> > cfs and it will behave like cfs util_avg now that it uses PELT. So as
>> > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
>> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
>> > OPP because we have remaining spare capacity but if cfs util_avg ==
>> > scale_rt_capacity, we make sure to use max OPP.
>>
>> Good point, when cfs-util < cfs-cap then there is idle time and the util
>> number is 'right', when cfs-util == cfs-cap we're overcommitted and
>> should go max.
>>
>> Since the util and cap values are aligned that should track nicely.
>
> So Vincent proposed to have a margin between cfs util and cfs cap to be
> sure there is a little bit of idle time. This is _exactly_ what the
> overutilized flag in EAS does. That would actually make a lot of sense
> to use that flag in schedutil. The idea is basically to say, if there
> isn't enough idle time on all CPUs, the util signal are kinda wrong, so
> let's not make any decisions (task placement or OPP selection) based on
> that. If overutilized, go to max freq. Does that make sense ?

Yes, it's similar to the overutilized flag except that
- this is done per CPU, whereas overutilization is for the whole system
- the test is done at every freq update and not only during some cfs
events, and it uses the last up-to-date value rather than a periodically
updated snapshot of the value
- this is also done without EAS

Then for the margin, it has to be discussed whether it is really needed or not

>
> Thanks,
> Quentin

2018-06-06 10:04:04

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On 6 June 2018 at 11:59, Vincent Guittot <[email protected]> wrote:
> On 6 June 2018 at 11:44, Quentin Perret <[email protected]> wrote:
>> On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:

[snip]

>>>
>>> > As you mentioned, scale_rt_capacity give the remaining capacity for
>>> > cfs and it will behave like cfs util_avg now that it uses PELT. So as
>>> > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
>>> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
>>> > OPP because we have remaining spare capacity but if cfs util_avg ==
>>> > scale_rt_capacity, we make sure to use max OPP.
>>>
>>> Good point, when cfs-util < cfs-cap then there is idle time and the util
>>> number is 'right', when cfs-util == cfs-cap we're overcommitted and
>>> should go max.
>>>
>>> Since the util and cap values are aligned that should track nicely.

I have re-run my tests and the results seem to be ok so far.

I'm going to clean up the code used for the test a bit and send a new
version of the proposal

>>
>> So Vincent proposed to have a margin between cfs util and cfs cap to be
>> sure there is a little bit of idle time. This is _exactly_ what the
>> overutilized flag in EAS does. That would actually make a lot of sense
>> to use that flag in schedutil. The idea is basically to say, if there
>> isn't enough idle time on all CPUs, the util signal are kinda wrong, so
>> let's not make any decisions (task placement or OPP selection) based on
>> that. If overutilized, go to max freq. Does that make sense ?
>
> Yes it's similar to the overutilized except that
> - this is done per cpu and whereas overutilization is for the whole system
> - the test is done at every freq update and not only during some cfs
> event and it uses the last up to date value and not a periodically
> updated snapshot of the value
> - this is done also without EAS
>
> Then for the margin, it has to be discussed if it is really needed or not
>
>>
>> Thanks,
>> Quentin

2018-06-06 10:14:21

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Wednesday 06 Jun 2018 at 11:59:04 (+0200), Vincent Guittot wrote:
> On 6 June 2018 at 11:44, Quentin Perret <[email protected]> wrote:
> > On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote:
> >> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote:
> >> > On 4 June 2018 at 18:50, Peter Zijlstra <[email protected]> wrote:
> >>
> >> > > So this patch-set tracks the !cfs occupation using the same function,
> >> > > which is all good. But what, if instead of using that to compensate the
> >> > > OPP selection, we employ that to renormalize the util signal?
> >> > >
> >> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity,
> >> > > then I think your initial problem goes away. Because while the RT task
> >> > > will push the util to .5, it will at the same time push the CPU capacity
> >> > > to .5, and renormalized that gives 1.
> >> > >
> >> > > NOTE: the renorm would then become something like:
> >> > > scale_cpu = arch_scale_cpu_capacity() / rt_frac();
> >>
> >> Should probably be:
> >>
> >> scale_cpu = arch_scale_cpu_capacity() / (1 - rt_frac())
> >>
> >> > >
> >> > >
> >> > > On IRC I mentioned stopping the CFS clock when preempted, and while that
> >> > > would result in fixed numbers, Vincent was right in pointing out the
> >> > > numbers will be difficult to interpret, since the meaning will be purely
> >> > > CPU local and I'm not sure you can actually fix it again with
> >> > > normalization.
> >> > >
> >> > > Imagine, running a .3 RT task, that would push the (always running) CFS
> >> > > down to .7, but because we discard all !cfs time, it actually has 1. If
> >> > > we try and normalize that we'll end up with ~1.43, which is of course
> >> > > completely broken.
> >> > >
> >> > >
> >> > > _However_, all that happens for util, also happens for load. So the above
> >> > > scenario will also make the CPU appear less loaded than it actually is.
> >> >
> >> > The load will continue to increase because we track runnable state and
> >> > not running for the load
> >>
> >> Duh yes. So renormalizing it once, like proposed for util would actually
> >> do the right thing there too. Would not that allow us to get rid of
> >> much of the capacity magic in the load balance code?
> >>
> >> /me thinks more..
> >>
> >> Bah, no.. because you don't want this dynamic renormalization part of
> >> the sums. So you want to keep it after the fact. :/
> >>
> >> > As you mentioned, scale_rt_capacity give the remaining capacity for
> >> > cfs and it will behave like cfs util_avg now that it uses PELT. So as
> >> > long as cfs util_avg < scale_rt_capacity(we probably need a margin)
> >> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
> >> > OPP because we have remaining spare capacity but if cfs util_avg ==
> >> > scale_rt_capacity, we make sure to use max OPP.
> >>
> >> Good point, when cfs-util < cfs-cap then there is idle time and the util
> >> number is 'right', when cfs-util == cfs-cap we're overcommitted and
> >> should go max.
> >>
> >> Since the util and cap values are aligned that should track nicely.
> >
> > So Vincent proposed to have a margin between cfs util and cfs cap to be
> > sure there is a little bit of idle time. This is _exactly_ what the
> > overutilized flag in EAS does. That would actually make a lot of sense
> > to use that flag in schedutil. The idea is basically to say, if there
> > isn't enough idle time on all CPUs, the util signal are kinda wrong, so
> > let's not make any decisions (task placement or OPP selection) based on
> > that. If overutilized, go to max freq. Does that make sense ?
>
> Yes it's similar to the overutilized except that
> - this is done per cpu and whereas overutilization is for the whole system

Is this a good thing ? It has to be discussed. Anyway, the patch from
Morten which is part of the latest EAS posting (v3) introduces a
cpu_overutilized() function which does what you want, I think.

> - the test is done at every freq update and not only during some cfs
> event and it uses the last up to date value and not a periodically
> updated snapshot of the value

Yeah good point. Now, the overutilized flag is attached to the root domain
so you should be able to set/clear it from RT/DL whenever that makes sense
I suppose. That's just a flag about the current state of the system so I
don't see why it should be touched only by CFS.

> - this is done also without EAS

The overutilized flag doesn't have to come with EAS if it is useful for
something else (OPP selection).

>
> Then for the margin, it has to be discussed if it is really needed or not

+1

Thanks,
Quentin

2018-06-06 13:08:02

by Claudio Scordino

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Quentin,

Il 05/06/2018 16:13, Juri Lelli ha scritto:
> On 05/06/18 15:01, Quentin Perret wrote:
>> On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
>>> On 05/06/18 14:05, Quentin Perret wrote:
>>>> On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
>>>>> Hi Quentin,
>>>>>
>>>>> On 05/06/18 11:57, Quentin Perret wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> What about the diff below (just a quick hack to show the idea) applied
>>>>>> on tip/sched/core ?
>>>>>>
>>>>>> ---8<---
>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>>>> index a8ba6d1f262a..23a4fb1c2c25 100644
>>>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>>>> @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>>>>>> sg_cpu->util_dl = cpu_util_dl(rq);
>>>>>> }
>>>>>>
>>>>>> +unsigned long scale_rt_capacity(int cpu);
>>>>>> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>>>>>> {
>>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>>>>>> + int cpu = sg_cpu->cpu;
>>>>>> + unsigned long util, dl_bw;
>>>>>>
>>>>>> if (rq->rt.rt_nr_running)
>>>>>> return sg_cpu->max;
>>>>>> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>>>>>> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>>>>>> * ready for such an interface. So, we only do the latter for now.
>>>>>> */
>>>>>> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
>>>>>> + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
>>>>>
>>>>> Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
>>>>> since we use max below, we will probably have the same problem that we
>>>>> discussed on Vincent's approach (overestimation of DL contribution while
>>>>> we could use running_bw).
>>>>
>>>> Ah no, you're right, this isn't great for long running deadline tasks.
>>>> We should definitely account for the running_bw here, not the dl avg...
>>>>
>>>> I was trying to address the issue of RT stealing time from CFS here, but
>>>> the DL integration isn't quite right which this patch as-is, I agree ...
>>>>
>>>>>
>>>>>> + util >>= SCHED_CAPACITY_SHIFT;
>>>>>> + util = arch_scale_cpu_capacity(NULL, cpu) - util;
>>>>>> + util += sg_cpu->util_cfs;
>>>>>> + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>>>>>
>>>>> Why this_bw instead of running_bw?
>>>>
>>>> So IIUC, this_bw should basically give you the absolute reservation (== the
>>>> sum of runtime/deadline ratios of all DL tasks on that rq).
>>>
>>> Yep.
>>>
>>>> The reason I added this max is because I'm still not sure to understand
>>>> how we can safely drop the freq below that point ? If we don't guarantee
>>>> to always stay at least at the freq required by DL, aren't we risking to
>>>> start a deadline tasks stuck at a low freq because of rate limiting ? In
>>>> this case, if that tasks uses all of its runtime then you might start
>>>> missing deadlines ...
>>>
>>> We decided to avoid (software) rate limiting for DL with e97a90f7069b
>>> ("sched/cpufreq: Rate limits for SCHED_DEADLINE").
>>
>> Right, I spotted that one, but yeah you could also be limited by HW ...
>>
>>>
>>>> My feeling is that the only safe thing to do is to guarantee to never go
>>>> below the freq required by DL, and to optimistically add CFS tasks
>>>> without raising the OPP if we have good reasons to think that DL is
>>>> using less than it required (which is what we should get by using
>>>> running_bw above I suppose). Does that make any sense ?
>>>
>>> Then we can't still avoid the hardware limits, so using running_bw is a
>>> trade off between safety (especially considering soft real-time
>>> scenarios) and energy consumption (which seems to be working in
>>> practice).
>>
>> Ok, I see ... Have you guys already tried something like my patch above
>> (keeping the freq >= this_bw) in real world use cases ? Is this costing
>> that much energy in practice ? If we fill the gaps left by DL (when it
>
> IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so
> he might add some numbers to my words above. I didn't (yet). But, please
> consider that I might be reserving (for example) 50% of bandwidth for my
> heavy and time sensitive task and then have that task wake up only once
> in a while (but I'll be keeping clock speed up for the whole time). :/

As far as I can remember, we never tested energy consumption of running_bw vs this_bw, as at OSPM'17 we had already decided to use running_bw implementing GRUB-PA.
The rationale is that, as Juri pointed out, the amount of spare (i.e. reclaimable) bandwidth in this_bw is very user-dependent.
For example, the user can let this_bw be much higher than the measured bandwidth, just to be sure that the deadlines are met even in corner cases.
In practice, this means that the task executes for quite a short time and then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, at the 0lag time).
Using this_bw rather than running_bw, the CPU frequency would remain at the same fixed value even when the task is blocked.
I understand that in some cases it could even be better (i.e. no waste of energy in frequency switches).
However, IMHO, these are corner cases and in the average case it is better to rely on running_bw and reduce the CPU frequency accordingly.

Best regards,

Claudio
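
To put rough numbers on that trade-off (illustrative only): a task reserving
runtime/deadline = 50% but typically executing for ~10% of its period will,
with running_bw, see the frequency request fall back towards the ~10% it
actually consumes shortly after it blocks (once its bandwidth is reclaimed at
the 0-lag time), whereas with this_bw schedutil keeps asking for at least 50%
of the CPU's capacity the whole time, even while the task sleeps.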

2018-06-06 13:23:18

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Claudio,

On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote:
> Hi Quentin,
>
> Il 05/06/2018 16:13, Juri Lelli ha scritto:
> > On 05/06/18 15:01, Quentin Perret wrote:
> > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
> > > > On 05/06/18 14:05, Quentin Perret wrote:
> > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> > > > > > Hi Quentin,
> > > > > >
> > > > > > On 05/06/18 11:57, Quentin Perret wrote:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > What about the diff below (just a quick hack to show the idea) applied
> > > > > > > on tip/sched/core ?
> > > > > > >
> > > > > > > ---8<---
> > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644
> > > > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > > > > > sg_cpu->util_dl = cpu_util_dl(rq);
> > > > > > > }
> > > > > > > +unsigned long scale_rt_capacity(int cpu);
> > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > > > {
> > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > > > > > + int cpu = sg_cpu->cpu;
> > > > > > > + unsigned long util, dl_bw;
> > > > > > > if (rq->rt.rt_nr_running)
> > > > > > > return sg_cpu->max;
> > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > > > > > * ready for such an interface. So, we only do the latter for now.
> > > > > > > */
> > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> > > > > >
> > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> > > > > > since we use max below, we will probably have the same problem that we
> > > > > > discussed on Vincent's approach (overestimation of DL contribution while
> > > > > > we could use running_bw).
> > > > >
> > > > > Ah no, you're right, this isn't great for long running deadline tasks.
> > > > > We should definitely account for the running_bw here, not the dl avg...
> > > > >
> > > > > I was trying to address the issue of RT stealing time from CFS here, but
> > > > > the DL integration isn't quite right which this patch as-is, I agree ...
> > > > >
> > > > > >
> > > > > > > + util >>= SCHED_CAPACITY_SHIFT;
> > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > > > > > > + util += sg_cpu->util_cfs;
> > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > > >
> > > > > > Why this_bw instead of running_bw?
> > > > >
> > > > > So IIUC, this_bw should basically give you the absolute reservation (== the
> > > > > sum of runtime/deadline ratios of all DL tasks on that rq).
> > > >
> > > > Yep.
> > > >
> > > > > The reason I added this max is because I'm still not sure to understand
> > > > > how we can safely drop the freq below that point ? If we don't guarantee
> > > > > to always stay at least at the freq required by DL, aren't we risking to
> > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In
> > > > > this case, if that tasks uses all of its runtime then you might start
> > > > > missing deadlines ...
> > > >
> > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b
> > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").
> > >
> > > Right, I spotted that one, but yeah you could also be limited by HW ...
> > >
> > > >
> > > > > My feeling is that the only safe thing to do is to guarantee to never go
> > > > > below the freq required by DL, and to optimistically add CFS tasks
> > > > > without raising the OPP if we have good reasons to think that DL is
> > > > > using less than it required (which is what we should get by using
> > > > > running_bw above I suppose). Does that make any sense ?
> > > >
> > > > Then we can't still avoid the hardware limits, so using running_bw is a
> > > > trade off between safety (especially considering soft real-time
> > > > scenarios) and energy consumption (which seems to be working in
> > > > practice).
> > >
> > > Ok, I see ... Have you guys already tried something like my patch above
> > > (keeping the freq >= this_bw) in real world use cases ? Is this costing
> > > that much energy in practice ? If we fill the gaps left by DL (when it
> >
> > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so
> > he might add some numbers to my words above. I didn't (yet). But, please
> > consider that I might be reserving (for example) 50% of bandwidth for my
> > heavy and time sensitive task and then have that task wake up only once
> > in a while (but I'll be keeping clock speed up for the whole time). :/
>
> As far as I can remember, we never tested energy consumption of running_bw
> vs this_bw, as at OSPM'17 we had already decided to use running_bw
> implementing GRUB-PA.
> The rationale is that, as Juri pointed out, the amount of spare (i.e.
> reclaimable) bandwidth in this_bw is very user-dependent. For example,
> the user can let this_bw be much higher than the measured bandwidth, just
> to be sure that the deadlines are met even in corner cases.

Ok I see the issue. Trusting userspace isn't necessarily the right thing
to do, I totally agree with that.

> In practice, this means that the task executes for quite a short time and
> then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced,
> at the 0lag time).
> Using this_bw rather than running_bw, the CPU frequency would remain at
> the same fixed value even when the task is blocked.
> I understand that on some cases it could even be better (i.e. no waste
> of energy in frequency switch).

+1, I'm pretty sure using this_bw is pretty much always worse than
using running_bw from an energy standpoint. The waste of energy in
frequency changes should be less than the energy wasted by staying at a
too high frequency for a long time, otherwise DVFS isn't a good idea to
begin with :-)

> However, IMHO, these are corner cases and in the average case it is better
> to rely on running_bw and reduce the CPU frequency accordingly.

My point was that accepting to go at a lower frequency than required by
this_bw is fundamentally unsafe. If you're at a low frequency when a DL
task starts, there are real situations where you won't be able to
increase the frequency immediately, which can eventually lead to missing
deadlines. Now, if this risk is known, has been discussed, and is
accepted, that's fair enough. I'm just too late for the discussion :-)

Thank you for your time !
Quentin

2018-06-06 13:54:16

by Claudio Scordino

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Quentin,

2018-06-06 15:20 GMT+02:00 Quentin Perret <[email protected]>:
>
> Hi Claudio,
>
> On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote:
> > Hi Quentin,
> >
> > Il 05/06/2018 16:13, Juri Lelli ha scritto:
> > > On 05/06/18 15:01, Quentin Perret wrote:
> > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
> > > > > On 05/06/18 14:05, Quentin Perret wrote:
> > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> > > > > > > Hi Quentin,
> > > > > > >
> > > > > > > On 05/06/18 11:57, Quentin Perret wrote:
> > > > > > >
> > > > > > > [...]
> > > > > > >
> > > > > > > > What about the diff below (just a quick hack to show the idea) applied
> > > > > > > > on tip/sched/core ?
> > > > > > > >
> > > > > > > > ---8<---
> > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644
> > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > > > > > > sg_cpu->util_dl = cpu_util_dl(rq);
> > > > > > > > }
> > > > > > > > +unsigned long scale_rt_capacity(int cpu);
> > > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > > > > {
> > > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > > > > > > + int cpu = sg_cpu->cpu;
> > > > > > > > + unsigned long util, dl_bw;
> > > > > > > > if (rq->rt.rt_nr_running)
> > > > > > > > return sg_cpu->max;
> > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > > > > > > * ready for such an interface. So, we only do the latter for now.
> > > > > > > > */
> > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> > > > > > >
> > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> > > > > > > since we use max below, we will probably have the same problem that we
> > > > > > > discussed on Vincent's approach (overestimation of DL contribution while
> > > > > > > we could use running_bw).
> > > > > >
> > > > > > Ah no, you're right, this isn't great for long running deadline tasks.
> > > > > > We should definitely account for the running_bw here, not the dl avg...
> > > > > >
> > > > > > I was trying to address the issue of RT stealing time from CFS here, but
> > > > > > the DL integration isn't quite right which this patch as-is, I agree ...
> > > > > >
> > > > > > >
> > > > > > > > + util >>= SCHED_CAPACITY_SHIFT;
> > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > > > > > > > + util += sg_cpu->util_cfs;
> > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > > > >
> > > > > > > Why this_bw instead of running_bw?
> > > > > >
> > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the
> > > > > > sum of runtime/deadline ratios of all DL tasks on that rq).
> > > > >
> > > > > Yep.
> > > > >
> > > > > > The reason I added this max is because I'm still not sure to understand
> > > > > > how we can safely drop the freq below that point ? If we don't guarantee
> > > > > > to always stay at least at the freq required by DL, aren't we risking to
> > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In
> > > > > > this case, if that tasks uses all of its runtime then you might start
> > > > > > missing deadlines ...
> > > > >
> > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b
> > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").
> > > >
> > > > Right, I spotted that one, but yeah you could also be limited by HW ...
> > > >
> > > > >
> > > > > > My feeling is that the only safe thing to do is to guarantee to never go
> > > > > > below the freq required by DL, and to optimistically add CFS tasks
> > > > > > without raising the OPP if we have good reasons to think that DL is
> > > > > > using less than it required (which is what we should get by using
> > > > > > running_bw above I suppose). Does that make any sense ?
> > > > >
> > > > > Then we can't still avoid the hardware limits, so using running_bw is a
> > > > > trade off between safety (especially considering soft real-time
> > > > > scenarios) and energy consumption (which seems to be working in
> > > > > practice).
> > > >
> > > > Ok, I see ... Have you guys already tried something like my patch above
> > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing
> > > > that much energy in practice ? If we fill the gaps left by DL (when it
> > >
> > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so
> > > he might add some numbers to my words above. I didn't (yet). But, please
> > > consider that I might be reserving (for example) 50% of bandwidth for my
> > > heavy and time sensitive task and then have that task wake up only once
> > > in a while (but I'll be keeping clock speed up for the whole time). :/
> >
> > As far as I can remember, we never tested energy consumption of running_bw
> > vs this_bw, as at OSPM'17 we had already decided to use running_bw
> > implementing GRUB-PA.
> > The rationale is that, as Juri pointed out, the amount of spare (i.e.
> > reclaimable) bandwidth in this_bw is very user-dependent. For example,
> > the user can let this_bw be much higher than the measured bandwidth, just
> > to be sure that the deadlines are met even in corner cases.
>
> Ok I see the issue. Trusting userspace isn't necessarily the right thing
> to do, I totally agree with that.
>
> > In practice, this means that the task executes for quite a short time and
> > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced,
> > at the 0lag time).
> > Using this_bw rather than running_bw, the CPU frequency would remain at
> > the same fixed value even when the task is blocked.
> > I understand that on some cases it could even be better (i.e. no waste
> > of energy in frequency switch).
>
> +1, I'm pretty sure using this_bw is pretty much always worst than
> using running_bw from an energy standpoint,. The waste of energy in
> frequency changes should be less than the energy wasted by staying at a
> too high frequency for a long time, otherwise DVFS isn't a good idea to
> begin with :-)
>
> > However, IMHO, these are corner cases and in the average case it is better
> > to rely on running_bw and reduce the CPU frequency accordingly.
>
> My point was that accepting to go at a lower frequency than required by
> this_bw is fundamentally unsafe. If you're at a low frequency when a DL
> task starts, there are real situations where you won't be able to
> increase the frequency immediately, which can eventually lead to missing
> deadlines.


I see. Unfortunately, I'm having quite crazy days so I couldn't follow
the original discussion on LKML properly. My fault.
Anyway, to answer your question (if this time I have understood it correctly).

You're right: the tests have shown that whenever the DL task period
gets comparable with the time needed to switch frequency, the amount of
missed deadlines becomes non-negligible.
To give you a rough idea, this already happens with periods of 10msec
on an Odroid XU4.
The reason is that the task instance starts at too low a frequency,
and the system can't switch frequency in time to meet the
deadline.

This is a known issue, partially discussed during the RT Summit'17.
However, the community has been more in favour of reducing energy
consumption than of meeting firm deadlines.
If you need a safe system, in fact, you'd better think about
disabling DVFS completely and relying on a fixed CPU frequency.

A possible trade-off could be a further entry in sysfs to let system
designers switch from (default) running_bw to (more pessimistic)
this_bw.
However, I'm not sure the community wants a further knob on sysfs just
to make RT people happier :)

Best,

Claudio

2018-06-06 14:11:56

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

On Wednesday 06 Jun 2018 at 15:53:27 (+0200), Claudio Scordino wrote:
> Hi Quentin,
>
> 2018-06-06 15:20 GMT+02:00 Quentin Perret <[email protected]>:
> >
> > Hi Claudio,
> >
> > On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote:
> > > Hi Quentin,
> > >
> > > Il 05/06/2018 16:13, Juri Lelli ha scritto:
> > > > On 05/06/18 15:01, Quentin Perret wrote:
> > > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote:
> > > > > > On 05/06/18 14:05, Quentin Perret wrote:
> > > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> > > > > > > > Hi Quentin,
> > > > > > > >
> > > > > > > > On 05/06/18 11:57, Quentin Perret wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > > What about the diff below (just a quick hack to show the idea) applied
> > > > > > > > > on tip/sched/core ?
> > > > > > > > >
> > > > > > > > > ---8<---
> > > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644
> > > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > > > > > > > sg_cpu->util_dl = cpu_util_dl(rq);
> > > > > > > > > }
> > > > > > > > > +unsigned long scale_rt_capacity(int cpu);
> > > > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > > > > > {
> > > > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > > > > > > > + int cpu = sg_cpu->cpu;
> > > > > > > > > + unsigned long util, dl_bw;
> > > > > > > > > if (rq->rt.rt_nr_running)
> > > > > > > > > return sg_cpu->max;
> > > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > > > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > > > > > > > > * ready for such an interface. So, we only do the latter for now.
> > > > > > > > > */
> > > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
> > > > > > > >
> > > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so,
> > > > > > > > since we use max below, we will probably have the same problem that we
> > > > > > > > discussed on Vincent's approach (overestimation of DL contribution while
> > > > > > > > we could use running_bw).
> > > > > > >
> > > > > > > Ah no, you're right, this isn't great for long running deadline tasks.
> > > > > > > We should definitely account for the running_bw here, not the dl avg...
> > > > > > >
> > > > > > > I was trying to address the issue of RT stealing time from CFS here, but
> > > > > > > the DL integration isn't quite right which this patch as-is, I agree ...
> > > > > > >
> > > > > > > >
> > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT;
> > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > > > > > > > > + util += sg_cpu->util_cfs;
> > > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > > > > >
> > > > > > > > Why this_bw instead of running_bw?
> > > > > > >
> > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the
> > > > > > > sum of runtime/deadline ratios of all DL tasks on that rq).
> > > > > >
> > > > > > Yep.
> > > > > >
> > > > > > > The reason I added this max is because I'm still not sure to understand
> > > > > > > how we can safely drop the freq below that point ? If we don't guarantee
> > > > > > > to always stay at least at the freq required by DL, aren't we risking to
> > > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In
> > > > > > > this case, if that tasks uses all of its runtime then you might start
> > > > > > > missing deadlines ...
> > > > > >
> > > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b
> > > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE").
> > > > >
> > > > > Right, I spotted that one, but yeah you could also be limited by HW ...
> > > > >
> > > > > >
> > > > > > > My feeling is that the only safe thing to do is to guarantee to never go
> > > > > > > below the freq required by DL, and to optimistically add CFS tasks
> > > > > > > without raising the OPP if we have good reasons to think that DL is
> > > > > > > using less than it required (which is what we should get by using
> > > > > > > running_bw above I suppose). Does that make any sense ?
> > > > > >
> > > > > > Then we can't still avoid the hardware limits, so using running_bw is a
> > > > > > trade off between safety (especially considering soft real-time
> > > > > > scenarios) and energy consumption (which seems to be working in
> > > > > > practice).
> > > > >
> > > > > Ok, I see ... Have you guys already tried something like my patch above
> > > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing
> > > > > that much energy in practice ? If we fill the gaps left by DL (when it
> > > >
> > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so
> > > > he might add some numbers to my words above. I didn't (yet). But, please
> > > > consider that I might be reserving (for example) 50% of bandwidth for my
> > > > heavy and time sensitive task and then have that task wake up only once
> > > > in a while (but I'll be keeping clock speed up for the whole time). :/
> > >
> > > As far as I can remember, we never tested energy consumption of running_bw
> > > vs this_bw, as at OSPM'17 we had already decided to use running_bw
> > > implementing GRUB-PA.
> > > The rationale is that, as Juri pointed out, the amount of spare (i.e.
> > > reclaimable) bandwidth in this_bw is very user-dependent. For example,
> > > the user can let this_bw be much higher than the measured bandwidth, just
> > > to be sure that the deadlines are met even in corner cases.
> >
> > Ok I see the issue. Trusting userspace isn't necessarily the right thing
> > to do, I totally agree with that.
> >
> > > In practice, this means that the task executes for quite a short time and
> > > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced,
> > > at the 0lag time).
> > > Using this_bw rather than running_bw, the CPU frequency would remain at
> > > the same fixed value even when the task is blocked.
> > > I understand that on some cases it could even be better (i.e. no waste
> > > of energy in frequency switch).
> >
> > +1, I'm pretty sure using this_bw is pretty much always worst than
> > using running_bw from an energy standpoint,. The waste of energy in
> > frequency changes should be less than the energy wasted by staying at a
> > too high frequency for a long time, otherwise DVFS isn't a good idea to
> > begin with :-)
> >
> > > However, IMHO, these are corner cases and in the average case it is better
> > > to rely on running_bw and reduce the CPU frequency accordingly.
> >
> > My point was that accepting to go at a lower frequency than required by
> > this_bw is fundamentally unsafe. If you're at a low frequency when a DL
> > task starts, there are real situations where you won't be able to
> > increase the frequency immediately, which can eventually lead to missing
> > deadlines.
>
>
> I see. Unfortunately, I'm having quite crazy days so I couldn't follow
> the original discussion on LKML properly. My fault.

No problem !

> Anyway, to answer your question (if this time I have understood it correctly).
>
> You're right: the tests have shown that whenever the DL task period
> gets comparable with the time for switching frequency, the amount of
> missed deadlines becomes not negligible.
> To give you a rough idea, this already happens with periods of 10msec
> on a Odroid XU4.
> The reason is that the task instance starts at a too low frequency,
> and the system can't switch frequency in time for meeting the
> deadline.

Ok that's very interesting ...

>
> This is a known issue, partially discussed during the RT Summit'17.
> However, the community has been more in favour of reducing the energy
> consumption than meeting firm deadlines.
> If you need a safe system, in fact, you'd better thinking about
> disabling DVFS completely and relying on a fixed CPU frequency.

Yeah, understood. My proposition was sort of a middle-ground solution. I
was basically suggesting to select OPPs with something like:

max(this_bw, running_bw + cfs_util)

The idea is that you're always guaranteed to request a high enough
frequency for this_bw, and you can opportunistically try to run CFS tasks
without raising the OPP if running_bw is low. In the worst case, the CFS
tasks will be preempted and delayed a bit. But DL should always be safe.
And if the CFS activity is sufficient to fill the gap between running_bw
and this_bw, then you should be pretty energy efficient as well.

Now, that doesn't always solve the issue you mentioned earlier. If there
isn't much CFS traffic going on, and if the delta between this_bw and
running_bw is high, then you'll be stuck at a high freq although the
system utilization is low ... As usual, it's just a trade-off :-)
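
A sketch of that middle-ground aggregation (illustrative only; this_bw and
running_bw are assumed to be already scaled to the capacity range, as the
BW_SHIFT conversion in the diff quoted earlier does):

/*
 * Sketch: never request less than the DL reservation, but only go above it
 * when the active DL bandwidth plus CFS utilization actually asks for it.
 */
static unsigned long dl_safe_util(unsigned long max_cap,
                                  unsigned long this_bw,     /* total DL reservation   */
                                  unsigned long running_bw,  /* currently active DL bw */
                                  unsigned long cfs_util)
{
        unsigned long util = running_bw + cfs_util;

        if (util < this_bw)
                util = this_bw;             /* floor at the reserved DL bandwidth */

        return util < max_cap ? util : max_cap;
}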

>
> A possible trade-off could be a further entry in sys to let system
> designers switching from (default) running_bw to (more pessimistic)
> this_bw.
> However, I'm not sure the community wants a further knob on sysfs just
> to make RT people happier :)
>
> Best,
>
> Claudio

2018-06-06 16:07:34

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

Hi Dietmar,

Sorry for the late answer

On 31 May 2018 at 18:54, Dietmar Eggemann <[email protected]> wrote:
> On 05/30/2018 08:45 PM, Vincent Guittot wrote:
>> Hi Dietmar,
>>
>> On 30 May 2018 at 17:55, Dietmar Eggemann <[email protected]> wrote:
>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:
>
> [...]
>
>>>> + */
>>>> + ret = ___update_load_sum(rq->clock - running, rq->cpu,
>>>> &rq->avg_irq,
>>>> + 0,
>>>> + 0,
>>>> + 0);
>>>> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
>>>> + 1,
>>>> + 1,
>>>> + 1);
>
> Can you not change the function parameter list to the usual
> (u64 now, struct rq *rq, int running)?
>
> Something like this (only compile and boot tested):

To be honest, I prefer to keep the specific sequence above in a
dedicated function instead of adding it to core code.
Furthermore, we end up calling ___update_load_avg twice instead
of only once. This will set an intermediate and incorrect value in
util_avg, and that value can be read in the meantime.
Vincent
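
For reference, the two-step sequence Vincent is defending here is the one from
the posted patch (quoted further up); annotated, it reads roughly:

	/* 1) decay the signal over the window in which no interrupt time was accounted */
	ret  = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq, 0, 0, 0);
	/* 2) then account the interrupt time itself as fully running */
	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq, 1, 1, 1);
	/* 3) only now fold the sums into util_avg, so no intermediate value is exposed */
	if (ret)
		___update_load_avg(&rq->avg_irq, 1, 1);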

>
> -- >8 --
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9894bc7af37e..26ffd585cab8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -177,8 +177,22 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> rq->clock_task += delta;
>
> #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> - if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> - update_irq_load_avg(rq, irq_delta + steal);
> + if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
> + /*
> + * We know the time that has been used by interrupt since last
> + * update but we don't when. Let be pessimistic and assume that
> + * interrupt has happened just before the update. This is not
> + * so far from reality because interrupt will most probably
> + * wake up task and trig an update of rq clock during which the
> + * metric si updated.
> + * We start to decay with normal context time and then we add
> + * the interrupt context time.
> + * We can safely remove running from rq->clock because
> + * rq->clock += delta with delta >= running
> + */
> + update_irq_load_avg(rq_clock(rq) - (irq_delta + steal), rq, 0);
> + update_irq_load_avg(rq_clock(rq), rq, 1);
> + }
> #endif
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1bb3379c4b71..a245f853c271 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7363,7 +7363,7 @@ static void update_blocked_averages(int cpu)
> }
> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> - update_irq_load_avg(rq, 0);
> + update_irq_load_avg(rq_clock(rq), rq, 0);
> /* Don't need periodic decay once load/util_avg are null */
> if (others_rqs_have_blocked(rq))
> done = false;
> @@ -7434,7 +7434,7 @@ static inline void update_blocked_averages(int cpu)
> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> - update_irq_load_avg(rq, 0);
> + update_irq_load_avg(rq_clock(rq), rq, 0);
> #ifdef CONFIG_NO_HZ_COMMON
> rq->last_blocked_load_update_tick = jiffies;
> if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index d2e4f2186b13..ae01bb18e28c 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -365,31 +365,15 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> *
> */
>
> -int update_irq_load_avg(struct rq *rq, u64 running)
> +int update_irq_load_avg(u64 now, struct rq *rq, int running)
> {
> - int ret = 0;
> - /*
> - * We know the time that has been used by interrupt since last update
> - * but we don't when. Let be pessimistic and assume that interrupt has
> - * happened just before the update. This is not so far from reality
> - * because interrupt will most probably wake up task and trig an update
> - * of rq clock during which the metric si updated.
> - * We start to decay with normal context time and then we add the
> - * interrupt context time.
> - * We can safely remove running from rq->clock because
> - * rq->clock += delta with delta >= running
> - */
> - ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
> - 0,
> - 0,
> - 0);
> - ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
> - 1,
> - 1,
> - 1);
> -
> - if (ret)
> + if (___update_load_sum(now, rq->cpu, &rq->avg_irq,
> + running,
> + running,
> + running)) {
> ___update_load_avg(&rq->avg_irq, 1, 1);
> + return 1;
> + }
>
> - return ret;
> + return 0;
> }
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 0ce9a5a5877a..ebc57301a9a8 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -5,7 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
> int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> -int update_irq_load_avg(struct rq *rq, u64 running);
> +int update_irq_load_avg(u64 now, struct rq *rq, int running);
>
> /*
> * When a task is dequeued, its estimated utilization should not be update if
>
>

2018-06-06 21:16:54

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi,

On Wed, 6 Jun 2018 14:20:46 +0100
Quentin Perret <[email protected]> wrote:
[...]
> > However, IMHO, these are corner cases and in the average case it is
> > better to rely on running_bw and reduce the CPU frequency
> > accordingly.
>
> My point was that accepting to go at a lower frequency than required
> by this_bw is fundamentally unsafe. If you're at a low frequency when
> a DL task starts, there are real situations where you won't be able
> to increase the frequency immediately, which can eventually lead to
> missing deadlines. Now, if this risk is known, has been discussed,
> and is accepted, that's fair enough. I'm just too late for the
> discussion :-)

Well, our conclusion was that this issue can be addressed when
designing the scheduling parameters:
- If we do not consider frequency scaling, a task can respect its
deadlines if the SCHED_DEADLINE runtime is larger than the task's
execution time and the SCHED_DEADLINE period is smaller than the
task's period (and if some kind of "global" admission test is
respected)
- Considering frequency scaling (and 0-time frequency switches), the
SCHED_DEADLINE runtime must be larger than the task execution time at
the highest frequency
- If the frequency switch time is larger than 0, then the
SCHED_DEADLINE runtime must be larger than the task execution time
(at the highest frequency) plus the frequency switch time

If this third condition is respected, I think that deadline misses can
be avoided even if running_bw is used (and the CPU takes a considerable
time to switch frequency). Of course, this requires an over-allocation
of runtime (and the global admission test is more likely to
fail)...
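
To make the third condition concrete, here is a toy calculation (the
numbers below are made up; only the relation "runtime larger than WCET
at the highest frequency plus the frequency switch time" comes from the
argument above):

#include <stdio.h>

int main(void)
{
	unsigned long long wcet_ns   = 2000000ULL;  /* 2 ms WCET at the highest frequency */
	unsigned long long switch_ns = 3000000ULL;  /* 3 ms worst-case frequency switch   */
	unsigned long long period_ns = 20000000ULL; /* 20 ms SCHED_DEADLINE period        */

	unsigned long long runtime_ns = wcet_ns + switch_ns;

	printf("runtime >= %llu ns\n", runtime_ns);
	printf("reserved bandwidth: %.2f%% instead of %.2f%%\n",
	       100.0 * runtime_ns / period_ns, 100.0 * wcet_ns / period_ns);
	return 0;
}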



Luca

2018-06-06 21:55:24

by luca abeni

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi all,

sorry; I missed the beginning of this thread... Anyway, below I add
some comments:

On Wed, 6 Jun 2018 15:05:58 +0200
Claudio Scordino <[email protected]> wrote:
[...]
> >> Ok, I see ... Have you guys already tried something like my patch
> >> above (keeping the freq >= this_bw) in real world use cases ? Is
> >> this costing that much energy in practice ? If we fill the gaps
> >> left by DL (when it
> >
> > IIRC, Claudio (now Cc-ed) did experiment a bit with both
> > approaches, so he might add some numbers to my words above. I
> > didn't (yet). But, please consider that I might be reserving (for
> > example) 50% of bandwidth for my heavy and time sensitive task and
> > then have that task wake up only once in a while (but I'll be
> > keeping clock speed up for the whole time). :/
>
> As far as I can remember, we never tested energy consumption of
> running_bw vs this_bw, as at OSPM'17 we had already decided to use
> running_bw implementing GRUB-PA. The rationale is that, as Juri
> pointed out, the amount of spare (i.e. reclaimable) bandwidth in
> this_bw is very user-dependent.

Yes, I agree with this. The appropriateness of using this_bw or
running_bw is highly workload-dependent... If a periodic task consumes
much less than its runtime (or if a sporadic task has inter-activation
times much larger than the SCHED_DEADLINE period), then running_bw has
to be preferred. But if a periodic task consumes almost all of its
runtime before blocking, then this_bw has to be preferred...

But this also depends on the hardware: if the frequency switch time is
small, then running_bw is more appropriate... On the other hand,
this_bw works much better if the frequency switch time is high.
(Talking about this, I remember Claudio measured frequency switch times
of almost 3ms... Is this really due to hardware issues? Or maybe
there is some software issue involved? 3ms looks like a lot of time...)

Anyway, this is why I proposed to use some kind of /sys/... file to
control the kind of deadline utilization used for frequency scaling: in
this way, the system designer / administrator, who hopefully has the
needed information about workload and hardware, can optimize the
frequency scaling behaviour by deciding if running_bw or this_bw will be
used.


Luca

> For example, the user can let this_bw
> be much higher than the measured bandwidth, just to be sure that the
> deadlines are met even in corner cases. In practice, this means that
> the task executes for quite a short time and then blocks (with its
> bandwidth reclaimed, hence the CPU frequency reduced, at the 0lag
> time). Using this_bw rather than running_bw, the CPU frequency would
> remain at the same fixed value even when the task is blocked. I
> understand that on some cases it could even be better (i.e. no waste
> of energy in frequency switch). However, IMHO, these are corner cases
> and in the average case it is better to rely on running_bw and reduce
> the CPU frequency accordingly.
>
> Best regards,
>
> Claudio


2018-06-07 08:31:07

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

On 06/06/2018 06:06 PM, Vincent Guittot wrote:
> Hi Dietmar,
>
> Sorry for the late answer
>
> On 31 May 2018 at 18:54, Dietmar Eggemann <[email protected]> wrote:
>> On 05/30/2018 08:45 PM, Vincent Guittot wrote:
>>> Hi Dietmar,
>>>
>>> On 30 May 2018 at 17:55, Dietmar Eggemann <[email protected]> wrote:
>>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:
>>
>> [...]
>>
>>>>> + */
>>>>> + ret = ___update_load_sum(rq->clock - running, rq->cpu,
>>>>> &rq->avg_irq,
>>>>> + 0,
>>>>> + 0,
>>>>> + 0);
>>>>> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
>>>>> + 1,
>>>>> + 1,
>>>>> + 1);
>>
>> Can you not change the function parameter list to the usual
>> (u64 now, struct rq *rq, int running)?
>>
>> Something like this (only compile and boot tested):
>
> To be honest, I prefer to keep the specific sequence above in a
> dedicated function instead of adding it in core code.

No big issue.

> Furthermore, we end up calling ___update_load_avg twice instead
> of only once. This will set an intermediate and incorrect value in
> util_avg, and this value can be read in the meantime.

Can't buy this argument though because this is true with the current
implementation as well since the 'decay load sum' - 'accrue load sum'
sequence is not atomic.

What about calling update_irq_load_avg(rq, 0) in update_rq_clock_task()
if (irq_delta + steal) eq. 0 and sched_feat(NONTASK_CAPACITY) eq. true
in this #ifdef CONFIG_XXX_TIME_ACCOUNTING block?

Maintaining an irq/steal time signal only makes sense if at least one of
the CONFIG_XXX_TIME_ACCOUNTING options is set and NONTASK_CAPACITY is true.
The call to update_irq_load_avg() in update_blocked_averages() isn't guarded
by them.
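
For illustration only (not even compile-tested), the kind of guard I have
in mind around that call would look something like:

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
	/* Only decay the irq signal when it can actually be maintained. */
	if (sched_feat(NONTASK_CAPACITY))
		update_irq_load_avg(rq, 0);
#endif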

[...]

2018-06-07 08:46:15

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

On 7 June 2018 at 10:29, Dietmar Eggemann <[email protected]> wrote:
> On 06/06/2018 06:06 PM, Vincent Guittot wrote:
>>
>> Hi Dietmar,
>>
>> Sorry for the late answer
>>
>> On 31 May 2018 at 18:54, Dietmar Eggemann <[email protected]>
>> wrote:
>>>
>>> On 05/30/2018 08:45 PM, Vincent Guittot wrote:
>>>>
>>>> Hi Dietmar,
>>>>
>>>> On 30 May 2018 at 17:55, Dietmar Eggemann <[email protected]>
>>>> wrote:
>>>>>
>>>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:
>>>
>>>
>>> [...]
>>>
>>>>>> + */
>>>>>> + ret = ___update_load_sum(rq->clock - running, rq->cpu,
>>>>>> &rq->avg_irq,
>>>>>> + 0,
>>>>>> + 0,
>>>>>> + 0);
>>>>>> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
>>>>>> + 1,
>>>>>> + 1,
>>>>>> + 1);
>>>
>>>
>>> Can you not change the function parameter list to the usual
>>> (u64 now, struct rq *rq, int running)?
>>>
>>> Something like this (only compile and boot tested):
>>
>>
>> To be honest, I prefer to keep the specific sequence above in a
>> dedicated function instead of adding it in core code.
>
>
> No big issue.
>
>> Furthermore, we end up calling ___update_load_avg twice instead
>> of only once. This will set an intermediate and incorrect value in
>> util_avg, and this value can be read in the meantime.
>
>
> Can't buy this argument though because this is true with the current
> implementation as well since the 'decay load sum' - 'accrue load sum'
> sequence is not atomic.

It's not a problem that the _sum variables are updated in different
steps, because they are internal variables.
Only util_avg is used "outside", and it is updated only after both the
idle and running steps have been applied.
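
To make the ordering explicit, the sequence in the patch is (comments
added here only for clarity):

	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
				 0, 0, 0);	/* decay the non-irq time; touches internal *_sum only */
	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
				  1, 1, 1);	/* accrue the irq time; still internal *_sum only */

	if (ret)
		___update_load_avg(&rq->avg_irq, 1, 1);	/* single write of the externally visible util_avg */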

>
> What about calling update_irq_load_avg(rq, 0) in update_rq_clock_task() if
> (irq_delta + steal) eq. 0 and sched_feat(NONTASK_CAPACITY) eq. true in this
> #ifdef CONFIG_XXX_TIME_ACCOUNTING block?

update_irq_load_avg(rq, 0) is called in update_blocked_averages() so the
irq signal decays smoothly like the other blocked signals, and it removes
the need to call update_irq_load_avg(rq, 0) on every call to
update_rq_clock_task(), which can be significant.

>
> Maintaining an irq/steal time signal only makes sense if at least one of the
> CONFIG_XXX_TIME_ACCOUNTING options is set and NONTASK_CAPACITY is true. The call
> to update_irq_load_avg() in update_blocked_averages() isn't guarded by them.

good point

>
> [...]

2018-06-07 09:25:16

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v5 00/10] track CPU utilization

Hi Luca,

On Wednesday 06 Jun 2018 at 23:05:36 (+0200), luca abeni wrote:
> Hi,
>
> On Wed, 6 Jun 2018 14:20:46 +0100
> Quentin Perret <[email protected]> wrote:
> [...]
> > > However, IMHO, these are corner cases and in the average case it is
> > > better to rely on running_bw and reduce the CPU frequency
> > > accordingly.
> >
> > My point was that accepting to go at a lower frequency than required
> > by this_bw is fundamentally unsafe. If you're at a low frequency when
> > a DL task starts, there are real situations where you won't be able
> > to increase the frequency immediately, which can eventually lead to
> > missing deadlines. Now, if this risk is known, has been discussed,
> > and is accepted, that's fair enough. I'm just too late for the
> > discussion :-)
>
> Well, our conclusion was that this issue can be addressed when
> designing the scheduling parameters:
> - If we do not consider frequency scaling, a task can respect its
> deadlines if the SCHED_DEADLINE runtime is larger than the task's
> execution time and the SCHED_DEADLINE period is smaller than the
> task's period (and if some kind of "global" admission test is
> respected)
> - Considering frequency scaling (and 0-time frequency switches), the
> SCHED_DEADLINE runtime must be larger than the task execution time at
> the highest frequency
> - If the frequency switch time is larger than 0, then the
> SCHED_DEADLINE runtime must be larger than the task execution time
> (at the highest frequency) plus the frequency switch time
>
> If this third condition is respected, I think that deadline misses can
> be avoided even if running_bw is used (and the CPU takes a considerable
> time to switch frequency). Of course, this requires an over-allocation
> of runtime (and the global admission test is more likely to
> fail)...

Ah, right, this third condition should definitely be a valid workaround
to the issue I mentioned earlier. And the runtime parameter is already
very much target-dependent I guess, so it should be fine to add another
target-specific component (the frequency-switching time) to the runtime
estimation.

And, if you really need to have tight runtimes to fit all of your tasks,
then you should just use a fixed frequency I guess ... At least the
current implementation gives a choice to the user between being
energy-efficient using sugov with over-allocated runtimes and having
tighter runtimes with the performance/powersave/userspace governor, so
that's all good :-)

Thank you very much,
Quentin

2018-06-07 10:13:49

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v5 07/10] sched/irq: add irq utilization tracking

On 06/07/2018 10:44 AM, Vincent Guittot wrote:
> On 7 June 2018 at 10:29, Dietmar Eggemann <[email protected]> wrote:
>> On 06/06/2018 06:06 PM, Vincent Guittot wrote:
>>>
>>> Hi Dietmar,
>>>
>>> Sorry for the late answer
>>>
>>> On 31 May 2018 at 18:54, Dietmar Eggemann <[email protected]>
>>> wrote:
>>>>
>>>> On 05/30/2018 08:45 PM, Vincent Guittot wrote:
>>>>>
>>>>> Hi Dietmar,
>>>>>
>>>>> On 30 May 2018 at 17:55, Dietmar Eggemann <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> On 05/25/2018 03:12 PM, Vincent Guittot wrote:

[...]

>> Can't buy this argument though because this is true with the current
>> implementation as well since the 'decay load sum' - 'accrue load sum'
>> sequence is not atomic.
>
> It's not a problem that the _sum variables are updated in different
> steps, because they are internal variables.
> Only util_avg is used "outside", and it is updated only after both the
> idle and running steps have been applied.

You're right here!

>> What about calling update_irq_load_avg(rq, 0) in update_rq_clock_task() if
>> (irq_delta + steal) eq. 0 and sched_feat(NONTASK_CAPACITY) eq. true in this
>> #ifdef CONFIG_XXX_TIME_ACCOUNTING block?
>
> update_irq_load_avg(rq, 0) is called in update_blocked_averages() so the
> irq signal decays smoothly like the other blocked signals, and it removes
> the need to call update_irq_load_avg(rq, 0) on every call to
> update_rq_clock_task(), which can be significant.

OK.

>> Maintaining an irq/steal time signal only makes sense if at least one of the
>> CONFIG_XXX_TIME_ACCOUNTING options is set and NONTASK_CAPACITY is true. The call
>> to update_irq_load_avg() in update_blocked_averages() isn't guarded by them.
>
> good point

[...]