2018-06-28 19:08:23

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH v7 00/11] track CPU utilization

This patchset initially tracked only the utilization of the RT rq. During
the OSPM summit, the opportunity to extend it in order to get an estimate of
the utilization of the whole CPU was discussed.

- Patch 1 moves the PELT code into a dedicated file and removes some blank lines

- Patches 2-3 add utilization tracking for rt_rq.

When both cfs and rt tasks compete to run on a CPU, we can see frequency
drops with the schedutil governor. In such a case, the cfs_rq's utilization
no longer reflects the utilization of cfs tasks but only the remaining part
that is not used by rt tasks. We should monitor this stolen utilization and
take it into account when selecting the OPP. This patchset doesn't change
the OPP selection policy for RT tasks, only for CFS tasks.
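
To give a rough, illustrative example: if an rt task consumes about half of
the CPU time, an always-running cfs task can only run for the other half, so
its util_avg settles around half of the CPU capacity (~512 out of 1024).
Schedutil then requests roughly half of the maximum frequency even though
the CPU has no idle time at all, which is what produces the frequency drops
described above.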

An rt-app use case that creates an always-running cfs thread and an rt
thread that wakes up periodically, with both threads pinned on the same CPU,
shows a lot of CPU frequency switches even though the CPU never goes idle
during the test. I can share the JSON file that I used for the test if
someone is interested.

For a 15 second long test on a hikey 6220 (octo core Cortex-A53 platform),
the cpufreq statistics output (stats are reset just before the test):
$ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
without patchset : 1230
with patchset : 14

If we replace the cfs thread of rt-app with a sysbench cpu test, we can see
performance improvements:

- Without patchset :
Test execution summary:
total time: 15.0009s
total number of events: 4903
total time taken by event execution: 14.9972
per-request statistics:
min: 1.23ms
avg: 3.06ms
max: 13.16ms
approx. 95 percentile: 12.73ms

Threads fairness:
events (avg/stddev): 4903.0000/0.00
execution time (avg/stddev): 14.9972/0.00

- With patchset:
Test execution summary:
total time: 15.0014s
total number of events: 7694
total time taken by event execution: 14.9979
per-request statistics:
min: 1.23ms
avg: 1.95ms
max: 10.49ms
approx. 95 percentile: 10.39ms

Threads fairness:
events (avg/stddev): 7694.0000/0.00
execution time (avg/stddev): 14.9979/0.00

The performance improvement is 56% for this use case.

- Patches 4-5 add utilization tracking for dl_rq in order to solve a similar
problem as with rt_rq. Nevertheless, we keep using the dl bandwidth as the
default level of requirement for dl tasks. The dl utilization is used to
check that the CPU is not overloaded, which is not always reflected when
using the dl bandwidth.

- Patches 6-7 add utilization tracking for interrupts and use it to select
the OPP. A test with iperf on hikey 6220 gives:
              w/o patchset     w/ patchset
Tx            276 Mbits/sec    304 Mbits/sec   +10%
Rx            299 Mbits/sec    328 Mbits/sec    +9%

8 iterations of iperf -c server_address -r -t 5
stddev is lower than 1%
Only the WFI idle state is enabled (shallowest arm idle state)

- Patch 8 merges sugov_aggregate_util and sugov_get_util as proposed by Peter

- Patch 9 uses the rt, dl and interrupt utilization in scale_rt_capacity()
and removes the use of sched_rt_avg_update().

- Patch 10 removes the unused sched_avg_update() code

- Patch 11 removes the unused sched_time_avg_ms

Change since v6:
- add more comments about the load tracking metrics
- merge sugov_aggregate_util and sugov_get_util

Change since v4:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Change since v3:
- add support of periodic update of blocked utilization
- rebase on latest tip/sched/core

Change since v2:
- move pelt code into a dedicated pelt.c file
- rebase on load tracking changes

Change since v1:
- Only a rebase. I have addressed the comments on the previous version in
patch 1/2


Vincent Guittot (11):
sched/pelt: Move pelt related code in a dedicated file
sched/rt: add rt_rq utilization tracking
cpufreq/schedutil: use rt utilization tracking
sched/dl: add dl_rq utilization tracking
cpufreq/schedutil: use dl utilization tracking
sched/irq: add irq utilization tracking
cpufreq/schedutil: take into account interrupt
sched: schedutil: remove sugov_aggregate_util()
sched: use pelt for scale_rt_capacity()
sched: remove rt_avg code
proc/sched: remove unused sched_time_avg_ms

include/linux/sched/sysctl.h | 1 -
kernel/sched/Makefile | 2 +-
kernel/sched/core.c | 38 +---
kernel/sched/cpufreq_schedutil.c | 65 ++++---
kernel/sched/deadline.c | 8 +-
kernel/sched/fair.c | 403 +++++----------------------------------
kernel/sched/pelt.c | 399 ++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 72 +++++++
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 68 +++++--
kernel/sysctl.c | 8 -
11 files changed, 632 insertions(+), 447 deletions(-)
create mode 100644 kernel/sched/pelt.c
create mode 100644 kernel/sched/pelt.h

--
2.7.4



2018-06-28 19:07:49

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 02/11] sched/rt: add rt_rq utilization tracking

The schedutil governor relies on cfs_rq's util_avg to choose the OPP when
cfs tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs
tasks are preempted by rt tasks and in this case util_avg reflects the
remaining capacity but not what cfs tasks actually want to use. In such a
case, schedutil can select a lower OPP whereas the CPU is overloaded. In
order to have a more accurate view of the utilization of the CPU, we track
the utilization of rt tasks. Only util_avg is correctly tracked, not
load_avg and runnable_load_avg, which are useless for rt_rq.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task, but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.
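
As a minimal sketch (not part of this patch) of how the new metric is meant
to be consumed, later patches in the series simply add the rt contribution
on top of the cfs one when estimating how busy the CPU really is, roughly
(cpu_total_util() is a hypothetical helper name used only for illustration):

	/* illustrative only: combine cfs and rt utilization, capped at capacity */
	static unsigned long cpu_total_util(struct rq *rq)
	{
		unsigned long max = arch_scale_cpu_capacity(NULL, cpu_of(rq));

		return min(cpu_util_cfs(rq) + cpu_util_rt(rq), max);
	}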

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 15 ++++++++++++++-
kernel/sched/pelt.c | 25 +++++++++++++++++++++++++
kernel/sched/pelt.h | 7 +++++++
kernel/sched/rt.c | 13 +++++++++++++
kernel/sched/sched.h | 7 +++++++
5 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdab9ed..328bedc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,6 +7289,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

+static inline bool rt_rq_has_blocked(struct rq *rq)
+{
+ if (READ_ONCE(rq->avg_rt.util_avg))
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7348,6 +7356,10 @@ static void update_blocked_averages(int cpu)
if (cfs_rq_has_blocked(cfs_rq))
done = false;
}
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ /* Don't need periodic decay once load/util_avg are null */
+ if (rt_rq_has_blocked(rq))
+ done = false;

#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
@@ -7413,9 +7425,10 @@ static inline void update_blocked_averages(int cpu)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index e6ecbb2..a00b1ba 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -309,3 +309,28 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

return 0;
}
+
+/*
+ * rt_rq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ * load_avg and runnable_load_avg are not supported and meaningless.
+ *
+ */
+
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+ running,
+ running,
+ running)) {
+
+ ___update_load_avg(&rq->avg_rt, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9cac73e..b2983b7 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -3,6 +3,7 @@
int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be update if
@@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
return 0;
}

+static inline int
+update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ return 0;
+}
+
#endif


diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 47556b0..0e3e57a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -5,6 +5,8 @@
*/
#include "sched.h"

+#include "pelt.h"
+
int sched_rr_timeslice = RR_TIMESLICE;
int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

@@ -1572,6 +1574,14 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

rt_queue_push_tasks(rq);

+ /*
+ * If prev task was rt, put_prev_task() has already updated the
+ * utilization. We only care about the case where we start to schedule
+ * an rt task
+ */
+ if (rq->curr->sched_class != &rt_sched_class)
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+
return p;
}

@@ -1579,6 +1589,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);

+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
/*
* The previous task needs to be made eligible for pushing
* if it is still active
@@ -2308,6 +2320,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
struct sched_rt_entity *rt_se = &p->rt;

update_curr_rt(rq);
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);

watchdog(rq, p);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db6878e..f2b12b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -592,6 +592,7 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+
#endif /* CONFIG_SMP */
int rt_queued;

@@ -847,6 +848,7 @@ struct rq {

u64 rt_avg;
u64 age_stamp;
+ struct sched_avg avg_rt;
u64 idle_stamp;
u64 avg_idle;

@@ -2205,4 +2207,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)

return util;
}
+
+static inline unsigned long cpu_util_rt(struct rq *rq)
+{
+ return rq->avg_rt.util_avg;
+}
#endif
--
2.7.4


2018-06-28 19:08:01

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util()

There is no reason for sugov_get_util() and sugov_aggregate_util() to be
separate functions.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[rebased after adding irq tracking and fixed some compilation errors]
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++--------------------------
kernel/sched/sched.h | 2 +-
2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index b77bfef..d04f941 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -53,12 +53,7 @@ struct sugov_cpu {
unsigned int iowait_boost_max;
u64 last_update;

- /* The fields below are only needed when sharing a policy: */
- unsigned long util_cfs;
- unsigned long util_dl;
unsigned long bw_dl;
- unsigned long util_rt;
- unsigned long util_irq;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -182,38 +177,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

-static void sugov_get_util(struct sugov_cpu *sg_cpu)
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util, irq, max;

- sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
- sg_cpu->util_cfs = cpu_util_cfs(rq);
- sg_cpu->util_dl = cpu_util_dl(rq);
- sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util_rt = cpu_util_rt(rq);
- sg_cpu->util_irq = cpu_util_irq(rq);
-}
-
-static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
-{
- struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util, max = sg_cpu->max;
+ sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);

if (rq->rt.rt_nr_running)
- return sg_cpu->max;
+ return max;
+
+ irq = cpu_util_irq(rq);

- if (unlikely(sg_cpu->util_irq >= max))
+ if (unlikely(irq >= max))
return max;

/* Sum rq utilization */
- util = sg_cpu->util_cfs;
- util += sg_cpu->util_rt;
+ util = cpu_util_cfs(rq);
+ util += cpu_util_rt(rq);

/*
* Interrupt time is not seen by rqs utilization so we can compare
* them with the CPU capacity
*/
- if ((util + sg_cpu->util_dl) >= max)
+ if ((util + cpu_util_dl(rq)) >= max)
return max;

/*
@@ -231,11 +219,11 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
*/

/* Weight rqs utilization to normal context window */
- util *= (max - sg_cpu->util_irq);
+ util *= (max - irq);
util /= max;

/* Add interrupt utilization */
- util += sg_cpu->util_irq;
+ util += irq;

/* Add DL bandwidth requirement */
util += sg_cpu->bw_dl;
@@ -418,9 +406,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,

busy = sugov_cpu_is_busy(sg_cpu);

- sugov_get_util(sg_cpu);
+ util = sugov_get_util(sg_cpu);
max = sg_cpu->max;
- util = sugov_aggregate_util(sg_cpu);
sugov_iowait_apply(sg_cpu, time, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
/*
@@ -459,9 +446,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long j_util, j_max;

- sugov_get_util(j_sg_cpu);
+ j_util = sugov_get_util(j_sg_cpu);
j_max = j_sg_cpu->max;
- j_util = sugov_aggregate_util(j_sg_cpu);
sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);

if (j_util * max > j_max * util) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9438e68..59a633d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2219,7 +2219,7 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)

static inline unsigned long cpu_util_rt(struct rq *rq)
{
- return rq->avg_rt.util_avg;
+ return READ_ONCE(rq->avg_rt.util_avg);
}

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
--
2.7.4


2018-06-28 19:08:03

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms

The /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
Remove it.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Luis R. Rodriguez" <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 8 --------
kernel/sched/sched.h | 1 -
kernel/sysctl.c | 8 --------
4 files changed, 18 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a151..913488d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -40,7 +40,6 @@ extern unsigned int sysctl_numa_balancing_scan_size;
#ifdef CONFIG_SCHED_DEBUG
extern __read_mostly unsigned int sysctl_sched_migration_cost;
extern __read_mostly unsigned int sysctl_sched_nr_migrate;
-extern __read_mostly unsigned int sysctl_sched_time_avg;

int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9aae7f..6935691 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,14 +48,6 @@ const_debug unsigned int sysctl_sched_features =
const_debug unsigned int sysctl_sched_nr_migrate = 32;

/*
- * period over which we average the RT time consumption, measured
- * in ms.
- *
- * default: 1s
- */
-const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
-
-/*
* period over which we measure -rt task CPU usage in us.
* default: 1s
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c71ea81..47b9175 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1706,7 +1706,6 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);

-extern const_debug unsigned int sysctl_sched_time_avg;
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2d9837c..f22f76b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,14 +368,6 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
- {
- .procname = "sched_time_avg_ms",
- .data = &sysctl_sched_time_avg,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &one,
- },
#ifdef CONFIG_SCHEDSTATS
{
.procname = "sched_schedstats",
--
2.7.4


2018-06-28 19:08:13

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 09/11] sched: use pelt for scale_rt_capacity()

The utilization of the CPU by rt, dl and interrupts is now tracked with
PELT, so we can use these metrics instead of rt_avg to evaluate the
remaining capacity available for the cfs class.

scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for cfs instead of a scaling factor, because rt, dl and
interrupt now provide absolute utilization values.

The same formula as schedutil is used:
  irq util_avg + (1 - irq util_avg / max capacity) * /Sum rq util_avg
but the implementation is different because it doesn't return the same value
and doesn't benefit from the same optimization.
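
To illustrate why the two users of this formula don't return the same value,
here is a stand-alone sketch (not kernel code, arbitrary numbers, max
capacity = 1024, and schedutil's deadline bandwidth term omitted for
brevity): schedutil estimates the demand put on the CPU, while
scale_rt_capacity() computes the capacity left over for cfs.

	#include <stdio.h>

	int main(void)
	{
		unsigned long max = 1024, irq = 128, rt = 128, dl = 128, cfs = 300;

		/* schedutil-like view: rq demand scaled to the full window, plus irq */
		unsigned long demand = (cfs + rt) * (max - irq) / max + irq;

		/* scale_rt_capacity()-like view: capacity remaining for cfs */
		unsigned long free = (max - (rt + dl)) * (max - irq) / max;

		printf("demand=%lu free=%lu\n", demand, free);	/* demand=502 free=672 */
		return 0;
	}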

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/deadline.c | 2 --
kernel/sched/fair.c | 41 +++++++++++++++++++----------------------
kernel/sched/pelt.c | 2 +-
kernel/sched/rt.c | 2 --
4 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f4de2698..68b8a9f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1180,8 +1180,6 @@ static void update_curr_dl(struct rq *rq)
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
-
if (dl_entity_is_special(dl_se))
return;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2758e3..ce0dcbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, used, age_stamp, avg;
- s64 delta;
-
- /*
- * Since we're reading these variables without serialization make sure
- * we read them once before doing sanity checks on them.
- */
- age_stamp = READ_ONCE(rq->age_stamp);
- avg = READ_ONCE(rq->rt_avg);
- delta = __rq_clock_broken(rq) - age_stamp;
+ unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long used, irq, free;

- if (unlikely(delta < 0))
- delta = 0;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ irq = READ_ONCE(rq->avg_irq.util_avg);

- total = sched_avg_period() + delta;
+ if (unlikely(irq >= max))
+ return 1;
+#endif

- used = div_u64(avg, total);
+ used = READ_ONCE(rq->avg_rt.util_avg);
+ used += READ_ONCE(rq->avg_dl.util_avg);

- if (likely(used < SCHED_CAPACITY_SCALE))
- return SCHED_CAPACITY_SCALE - used;
+ if (unlikely(used >= max))
+ return 1;

- return 1;
+ free = max - used;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ free *= (max - irq);
+ free /= max;
+#endif
+ return free;
}

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
+ unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;

- cpu_rq(cpu)->cpu_capacity_orig = capacity;
-
- capacity *= scale_rt_capacity(cpu);
- capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);

if (!capacity)
capacity = 1;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index ead6d8b..35475c0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -237,7 +237,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
*/
sa->load_avg = div_u64(load * sa->load_sum, divider);
sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
- sa->util_avg = sa->util_sum / divider;
+ WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0e3e57a..2a881bd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -970,8 +970,6 @@ static void update_curr_rt(struct rq *rq)
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
-
if (!rt_bandwidth_enabled())
return;

--
2.7.4


2018-06-28 19:08:17

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 07/11] cpufreq/schedutil: take into account interrupt

The time spent under interrupt can be significant but it is not reflected
in the utilization of the CPU when deciding to choose an OPP. Now that we
have access to this metric, schedutil can take it into account when
selecting the OPP for a CPU.
The rqs' utilization doesn't see the time spent under interrupt context and
is reported in the normal context time window. We need to compensate for
this when adding the interrupt utilization.

The CPU utilization is :
irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
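
As a rough worked example with a max capacity of 1024: if the interrupt
util_avg is 256 and the summed rq util_avg is 512, the estimated CPU
utilization is 256 + (1 - 256/1024) * 512 = 256 + 384 = 640 (the numbers are
only illustrative).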

A test with iperf on hikey (octo arm64) gives:
iperf -c server_address -r -t 5

              w/o patch        w/ patch
Tx            276 Mbits/sec    304 Mbits/sec   +10%
Rx            299 Mbits/sec    328 Mbits/sec    +9%

8 iterations
stddev is lower than 1%
Only the WFI idle state is enabled (shallowest idle state)

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++++++++----
kernel/sched/sched.h | 13 +++++++++++++
2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index edfbfc1..b77bfef 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -58,6 +58,7 @@ struct sugov_cpu {
unsigned long util_dl;
unsigned long bw_dl;
unsigned long util_rt;
+ unsigned long util_irq;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -190,21 +191,30 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->util_dl = cpu_util_dl(rq);
sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util_rt = cpu_util_rt(rq);
+ sg_cpu->util_irq = cpu_util_irq(rq);
}

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util;
+ unsigned long util, max = sg_cpu->max;

if (rq->rt.rt_nr_running)
return sg_cpu->max;

+ if (unlikely(sg_cpu->util_irq >= max))
+ return max;
+
+ /* Sum rq utilization */
util = sg_cpu->util_cfs;
util += sg_cpu->util_rt;

- if ((util + sg_cpu->util_dl) >= sg_cpu->max)
- return sg_cpu->max;
+ /*
+ * Interrupt time is not seen by rqs utilization so we can compare
+ * them with the CPU capacity
+ */
+ if ((util + sg_cpu->util_dl) >= max)
+ return max;

/*
* As there is still idle time on the CPU, we need to compute the
@@ -220,10 +230,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* ready for such an interface. So, we only do the latter for now.
*/

+ /* Weight rqs utilization to normal context window */
+ util *= (max - sg_cpu->util_irq);
+ util /= max;
+
+ /* Add interrupt utilization */
+ util += sg_cpu->util_irq;
+
/* Add DL bandwidth requirement */
util += sg_cpu->bw_dl;

- return min(sg_cpu->max, util);
+ return min(max, util);
}

/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 377be2b..9438e68 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2221,4 +2221,17 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
{
return rq->avg_rt.util_avg;
}
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+ return rq->avg_irq.util_avg;
+}
+#else
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+ return 0;
+}
+
+#endif
#endif
--
2.7.4


2018-06-28 19:08:31

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 01/11] sched/pelt: Move pelt related code in a dedicated file

We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization from cfs tasks and make them appear lighter than they are.
As we want to use the same load tracking mechanism for both and prevent
useless dependencies between cfs and rt code, the PELT code is moved into
a dedicated file.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/Makefile | 2 +-
kernel/sched/fair.c | 333 +-------------------------------------------------
kernel/sched/pelt.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 43 +++++++
kernel/sched/sched.h | 19 +++
5 files changed, 375 insertions(+), 333 deletions(-)
create mode 100644 kernel/sched/pelt.c
create mode 100644 kernel/sched/pelt.h

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b3..7fe1834 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o

-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4cc1441..bdab9ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -255,9 +255,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
return cfs_rq->rq;
}

-/* An entity is a task if it doesn't "own" a runqueue */
-#define entity_is_task(se) (!se->my_q)
-
static inline struct task_struct *task_of(struct sched_entity *se)
{
SCHED_WARN_ON(!entity_is_task(se));
@@ -419,7 +416,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
return container_of(cfs_rq, struct rq, cfs);
}

-#define entity_is_task(se) 1

#define for_each_sched_entity(se) \
for (; se; se = NULL)
@@ -692,7 +688,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
}

#ifdef CONFIG_SMP
-
+#include "pelt.h"
#include "sched-pelt.h"

static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
@@ -2749,19 +2745,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
} while (0)

#ifdef CONFIG_SMP
-/*
- * XXX we want to get rid of these helpers and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
- return scale_load_down(se->load.weight);
-}
-
-static inline long se_runnable(struct sched_entity *se)
-{
- return scale_load_down(se->runnable_weight);
-}
-
static inline void
enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -3062,314 +3045,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
}

#ifdef CONFIG_SMP
-/*
- * Approximate:
- * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
- */
-static u64 decay_load(u64 val, u64 n)
-{
- unsigned int local_n;
-
- if (unlikely(n > LOAD_AVG_PERIOD * 63))
- return 0;
-
- /* after bounds checking we can collapse to 32-bit */
- local_n = n;
-
- /*
- * As y^PERIOD = 1/2, we can combine
- * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
- * With a look-up table which covers y^n (n<PERIOD)
- *
- * To achieve constant time decay_load.
- */
- if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
- val >>= local_n / LOAD_AVG_PERIOD;
- local_n %= LOAD_AVG_PERIOD;
- }
-
- val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
- return val;
-}
-
-static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
-{
- u32 c1, c2, c3 = d3; /* y^0 == 1 */
-
- /*
- * c1 = d1 y^p
- */
- c1 = decay_load((u64)d1, periods);
-
- /*
- * p-1
- * c2 = 1024 \Sum y^n
- * n=1
- *
- * inf inf
- * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
- * n=0 n=p
- */
- c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
-
- return c1 + c2 + c3;
-}
-
-/*
- * Accumulate the three separate parts of the sum; d1 the remainder
- * of the last (incomplete) period, d2 the span of full periods and d3
- * the remainder of the (incomplete) current period.
- *
- * d1 d2 d3
- * ^ ^ ^
- * | | |
- * |<->|<----------------->|<--->|
- * ... |---x---|------| ... |------|-----x (now)
- *
- * p-1
- * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
- * n=1
- *
- * = u y^p + (Step 1)
- *
- * p-1
- * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2)
- * n=1
- */
-static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
-{
- unsigned long scale_freq, scale_cpu;
- u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
- u64 periods;
-
- scale_freq = arch_scale_freq_capacity(cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
- delta += sa->period_contrib;
- periods = delta / 1024; /* A period is 1024us (~1ms) */
-
- /*
- * Step 1: decay old *_sum if we crossed period boundaries.
- */
- if (periods) {
- sa->load_sum = decay_load(sa->load_sum, periods);
- sa->runnable_load_sum =
- decay_load(sa->runnable_load_sum, periods);
- sa->util_sum = decay_load((u64)(sa->util_sum), periods);
-
- /*
- * Step 2
- */
- delta %= 1024;
- contrib = __accumulate_pelt_segments(periods,
- 1024 - sa->period_contrib, delta);
- }
- sa->period_contrib = delta;
-
- contrib = cap_scale(contrib, scale_freq);
- if (load)
- sa->load_sum += load * contrib;
- if (runnable)
- sa->runnable_load_sum += runnable * contrib;
- if (running)
- sa->util_sum += contrib * scale_cpu;
-
- return periods;
-}
-
-/*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series. To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
- *
- * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
- * p0 p1 p2
- * (now) (~1ms ago) (~2ms ago)
- *
- * Let u_i denote the fraction of p_i that the entity was runnable.
- *
- * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
- * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
- *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- * y^32 = 0.5
- *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
- *
- * When a period "rolls over" and we have new u_0`, multiplying the previous
- * sum again by y is sufficient to update:
- * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
- */
-static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
-{
- u64 delta;
-
- delta = now - sa->last_update_time;
- /*
- * This should only happen when time goes backwards, which it
- * unfortunately does during sched clock init when we swap over to TSC.
- */
- if ((s64)delta < 0) {
- sa->last_update_time = now;
- return 0;
- }
-
- /*
- * Use 1024ns as the unit of measurement since it's a reasonable
- * approximation of 1us and fast to compute.
- */
- delta >>= 10;
- if (!delta)
- return 0;
-
- sa->last_update_time += delta << 10;
-
- /*
- * running is a subset of runnable (weight) so running can't be set if
- * runnable is clear. But there are some corner cases where the current
- * se has been already dequeued but cfs_rq->curr still points to it.
- * This means that weight will be 0 but not running for a sched_entity
- * but also for a cfs_rq if the latter becomes idle. As an example,
- * this happens during idle_balance() which calls
- * update_blocked_averages()
- */
- if (!load)
- runnable = running = 0;
-
- /*
- * Now we know we crossed measurement unit boundaries. The *_avg
- * accrues by two steps:
- *
- * Step 1: accumulate *_sum since last_update_time. If we haven't
- * crossed period boundaries, finish.
- */
- if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
- return 0;
-
- return 1;
-}
-
-static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
-{
- u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
-
- /*
- * Step 2: update *_avg.
- */
- sa->load_avg = div_u64(load * sa->load_sum, divider);
- sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
- sa->util_avg = sa->util_sum / divider;
-}
-
-/*
- * When a task is dequeued, its estimated utilization should not be update if
- * its util_avg has not been updated at least once.
- * This flag is used to synchronize util_avg updates with util_est updates.
- * We map this information into the LSB bit of the utilization saved at
- * dequeue time (i.e. util_est.dequeued).
- */
-#define UTIL_AVG_UNCHANGED 0x1
-
-static inline void cfs_se_util_change(struct sched_avg *avg)
-{
- unsigned int enqueued;
-
- if (!sched_feat(UTIL_EST))
- return;
-
- /* Avoid store if the flag has been already set */
- enqueued = avg->util_est.enqueued;
- if (!(enqueued & UTIL_AVG_UNCHANGED))
- return;
-
- /* Reset flag to report util_avg has been updated */
- enqueued &= ~UTIL_AVG_UNCHANGED;
- WRITE_ONCE(avg->util_est.enqueued, enqueued);
-}
-
-/*
- * sched_entity:
- *
- * task:
- * se_runnable() == se_weight()
- *
- * group: [ see update_cfs_group() ]
- * se_weight() = tg->weight * grq->load_avg / tg->load_avg
- * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
- *
- * load_sum := runnable_sum
- * load_avg = se_weight(se) * runnable_avg
- *
- * runnable_load_sum := runnable_sum
- * runnable_load_avg = se_runnable(se) * runnable_avg
- *
- * XXX collapse load_sum and runnable_load_sum
- *
- * cfq_rs:
- *
- * load_sum = \Sum se_weight(se) * se->avg.load_sum
- * load_avg = \Sum se->avg.load_avg
- *
- * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- * runnable_load_avg = \Sum se->avg.runable_load_avg
- */
-
-static int
-__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
-{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
- ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
- return 1;
- }
-
- return 0;
-}
-
-static int
-__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
- cfs_rq->curr == se)) {
-
- ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
- cfs_se_util_change(&se->avg);
- return 1;
- }
-
- return 0;
-}
-
-static int
-__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
-{
- if (___update_load_sum(now, cpu, &cfs_rq->avg,
- scale_load_down(cfs_rq->load.weight),
- scale_load_down(cfs_rq->runnable_weight),
- cfs_rq->curr != NULL)) {
-
- ___update_load_avg(&cfs_rq->avg, 1, 1);
- return 1;
- }
-
- return 0;
-}
-
#ifdef CONFIG_FAIR_GROUP_SCHED
/**
* update_tg_load_avg - update the tg's load avg
@@ -4045,12 +3720,6 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

#else /* CONFIG_SMP */

-static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
-{
- return 0;
-}
-
#define UPDATE_TG 0x0
#define SKIP_AGE_LOAD 0x0
#define DO_ATTACH 0x0
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
new file mode 100644
index 0000000..e6ecbb2
--- /dev/null
+++ b/kernel/sched/pelt.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per Entity Load Tracking
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Interactivity improvements by Mike Galbraith
+ * (C) 2007 Mike Galbraith <[email protected]>
+ *
+ * Various enhancements by Dmitry Adamushko.
+ * (C) 2007 Dmitry Adamushko <[email protected]>
+ *
+ * Group scheduling enhancements by Srivatsa Vaddagiri
+ * Copyright IBM Corporation, 2007
+ * Author: Srivatsa Vaddagiri <[email protected]>
+ *
+ * Scaled math optimizations by Thomas Gleixner
+ * Copyright (C) 2007, Thomas Gleixner <[email protected]>
+ *
+ * Adaptive scheduling granularity, math enhancements by Peter Zijlstra
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
+ *
+ * Move PELT related code from fair.c into this pelt.c file
+ * Author: Vincent Guittot <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "sched-pelt.h"
+#include "pelt.h"
+
+/*
+ * Approximate:
+ * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static u64 decay_load(u64 val, u64 n)
+{
+ unsigned int local_n;
+
+ if (unlikely(n > LOAD_AVG_PERIOD * 63))
+ return 0;
+
+ /* after bounds checking we can collapse to 32-bit */
+ local_n = n;
+
+ /*
+ * As y^PERIOD = 1/2, we can combine
+ * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+ * With a look-up table which covers y^n (n<PERIOD)
+ *
+ * To achieve constant time decay_load.
+ */
+ if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+ val >>= local_n / LOAD_AVG_PERIOD;
+ local_n %= LOAD_AVG_PERIOD;
+ }
+
+ val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ return val;
+}
+
+static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
+{
+ u32 c1, c2, c3 = d3; /* y^0 == 1 */
+
+ /*
+ * c1 = d1 y^p
+ */
+ c1 = decay_load((u64)d1, periods);
+
+ /*
+ * p-1
+ * c2 = 1024 \Sum y^n
+ * n=1
+ *
+ * inf inf
+ * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
+ * n=0 n=p
+ */
+ c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
+
+ return c1 + c2 + c3;
+}
+
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
+/*
+ * Accumulate the three separate parts of the sum; d1 the remainder
+ * of the last (incomplete) period, d2 the span of full periods and d3
+ * the remainder of the (incomplete) current period.
+ *
+ * d1 d2 d3
+ * ^ ^ ^
+ * | | |
+ * |<->|<----------------->|<--->|
+ * ... |---x---|------| ... |------|-----x (now)
+ *
+ * p-1
+ * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
+ * n=1
+ *
+ * = u y^p + (Step 1)
+ *
+ * p-1
+ * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2)
+ * n=1
+ */
+static __always_inline u32
+accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+ unsigned long load, unsigned long runnable, int running)
+{
+ unsigned long scale_freq, scale_cpu;
+ u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
+ u64 periods;
+
+ scale_freq = arch_scale_freq_capacity(cpu);
+ scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ delta += sa->period_contrib;
+ periods = delta / 1024; /* A period is 1024us (~1ms) */
+
+ /*
+ * Step 1: decay old *_sum if we crossed period boundaries.
+ */
+ if (periods) {
+ sa->load_sum = decay_load(sa->load_sum, periods);
+ sa->runnable_load_sum =
+ decay_load(sa->runnable_load_sum, periods);
+ sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+
+ /*
+ * Step 2
+ */
+ delta %= 1024;
+ contrib = __accumulate_pelt_segments(periods,
+ 1024 - sa->period_contrib, delta);
+ }
+ sa->period_contrib = delta;
+
+ contrib = cap_scale(contrib, scale_freq);
+ if (load)
+ sa->load_sum += load * contrib;
+ if (runnable)
+ sa->runnable_load_sum += runnable * contrib;
+ if (running)
+ sa->util_sum += contrib * scale_cpu;
+
+ return periods;
+}
+
+/*
+ * We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series. To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ * p0 p1 p2
+ * (now) (~1ms ago) (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the with of a reasonably scheduling period, fixing:
+ * y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+ unsigned long load, unsigned long runnable, int running)
+{
+ u64 delta;
+
+ delta = now - sa->last_update_time;
+ /*
+ * This should only happen when time goes backwards, which it
+ * unfortunately does during sched clock init when we swap over to TSC.
+ */
+ if ((s64)delta < 0) {
+ sa->last_update_time = now;
+ return 0;
+ }
+
+ /*
+ * Use 1024ns as the unit of measurement since it's a reasonable
+ * approximation of 1us and fast to compute.
+ */
+ delta >>= 10;
+ if (!delta)
+ return 0;
+
+ sa->last_update_time += delta << 10;
+
+ /*
+ * running is a subset of runnable (weight) so running can't be set if
+ * runnable is clear. But there are some corner cases where the current
+ * se has been already dequeued but cfs_rq->curr still points to it.
+ * This means that weight will be 0 but not running for a sched_entity
+ * but also for a cfs_rq if the latter becomes idle. As an example,
+ * this happens during idle_balance() which calls
+ * update_blocked_averages()
+ */
+ if (!load)
+ runnable = running = 0;
+
+ /*
+ * Now we know we crossed measurement unit boundaries. The *_avg
+ * accrues by two steps:
+ *
+ * Step 1: accumulate *_sum since last_update_time. If we haven't
+ * crossed period boundaries, finish.
+ */
+ if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+ return 0;
+
+ return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+{
+ u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
+ /*
+ * Step 2: update *_avg.
+ */
+ sa->load_avg = div_u64(load * sa->load_sum, divider);
+ sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
+ sa->util_avg = sa->util_sum / divider;
+}
+
+/*
+ * sched_entity:
+ *
+ * task:
+ * se_runnable() == se_weight()
+ *
+ * group: [ see update_cfs_group() ]
+ * se_weight() = tg->weight * grq->load_avg / tg->load_avg
+ * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
+ * load_sum := runnable_sum
+ * load_avg = se_weight(se) * runnable_avg
+ *
+ * runnable_load_sum := runnable_sum
+ * runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
+ * cfq_rq:
+ *
+ * load_sum = \Sum se_weight(se) * se->avg.load_sum
+ * load_avg = \Sum se->avg.load_avg
+ *
+ * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ * runnable_load_avg = \Sum se->avg.runable_load_avg
+ */
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+{
+ if (entity_is_task(se))
+ se->runnable_weight = se->load.weight;
+
+ if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+ ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ return 1;
+ }
+
+ return 0;
+}
+
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (entity_is_task(se))
+ se->runnable_weight = se->load.weight;
+
+ if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+ cfs_rq->curr == se)) {
+
+ ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ cfs_se_util_change(&se->avg);
+ return 1;
+ }
+
+ return 0;
+}
+
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+{
+ if (___update_load_sum(now, cpu, &cfs_rq->avg,
+ scale_load_down(cfs_rq->load.weight),
+ scale_load_down(cfs_rq->runnable_weight),
+ cfs_rq->curr != NULL)) {
+
+ ___update_load_avg(&cfs_rq->avg, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
new file mode 100644
index 0000000..9cac73e
--- /dev/null
+++ b/kernel/sched/pelt.h
@@ -0,0 +1,43 @@
+#ifdef CONFIG_SMP
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+
+/*
+ * When a task is dequeued, its estimated utilization should not be update if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Avoid store if the flag has been already set */
+ enqueued = avg->util_est.enqueued;
+ if (!(enqueued & UTIL_AVG_UNCHANGED))
+ return;
+
+ /* Reset flag to report util_avg has been updated */
+ enqueued &= ~UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
+#else
+
+static inline int
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
+#endif
+
+
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6601baf..db6878e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -666,7 +666,26 @@ struct dl_rq {
u64 bw_ratio;
};

+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se) (!se->my_q)
+#else
+#define entity_is_task(se) 1
+#endif
+
#ifdef CONFIG_SMP
+/*
+ * XXX we want to get rid of these helpers and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+ return scale_load_down(se->load.weight);
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+ return scale_load_down(se->runnable_weight);
+}

static inline bool sched_asym_prefer(int a, int b)
{
--
2.7.4


2018-06-28 19:08:57

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 03/11] cpufreq/schedutil: use rt utilization tracking

Add both cfs and rt utilization when selecting an OPP for cfs tasks, as rt
can preempt and steal cfs's running time.

rt util_avg is used to take into account the utilization of rt tasks
on the CPU when selecting the OPP. If an rt task migrates, the rt utilization
will not migrate but will decay over time. On an overloaded CPU, the cfs
utilization only reflects the remaining utilization available on the CPU.
When an rt task migrates away, the cfs utilization will increase as tasks
start to use the newly available capacity. At the same pace, the rt
utilization will decay and both variations will compensate each other to
keep the overall utilization unchanged and prevent any OPP drop.
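
As an illustrative example: if an rt task that was generating an rt util_avg
of about 300 migrates away, rq->avg_rt.util_avg does not drop instantly but
decays with the PELT time constant (halving roughly every 32ms), while the
cfs tasks that start using the freed CPU time see their util_avg ramp up at
the same pace; the sum of the two, and therefore the selected OPP, stays
roughly constant during the transition.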

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3cde464..9c5e92e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
/* The fields below are only needed when sharing a policy: */
unsigned long util_cfs;
unsigned long util_dl;
+ unsigned long util_rt;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -186,15 +187,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->util_cfs = cpu_util_cfs(rq);
sg_cpu->util_dl = cpu_util_dl(rq);
+ sg_cpu->util_rt = cpu_util_rt(rq);
}

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util;

if (rq->rt.rt_nr_running)
return sg_cpu->max;

+ util = sg_cpu->util_dl;
+ util += sg_cpu->util_cfs;
+ util += sg_cpu->util_rt;
+
/*
* Utilization required by DEADLINE must always be granted while, for
* FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
@@ -205,7 +212,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+ return min(sg_cpu->max, util);
}

/**
--
2.7.4


2018-06-28 19:09:19

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 05/11] cpufreq/schedutil: use dl utilization tracking

Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can detect when the CPU is fully used and should run at max.
Otherwise, we keep using the dl bandwidth requirement to define the
utilization of the CPU.
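
As an illustrative example: a deadline task may have reserved only 10% of
the CPU bandwidth while, together with the cfs and rt activity, the actual
running time saturates the CPU; the dl util_avg makes this overload visible
and schedutil goes straight to the max OPP, whereas the bandwidth
requirement alone would have under-estimated the needed frequency.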

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
kernel/sched/sched.h | 7 ++++++-
2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 9c5e92e..edfbfc1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
/* The fields below are only needed when sharing a policy: */
unsigned long util_cfs;
unsigned long util_dl;
+ unsigned long bw_dl;
unsigned long util_rt;
unsigned long max;

@@ -187,6 +188,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->util_cfs = cpu_util_cfs(rq);
sg_cpu->util_dl = cpu_util_dl(rq);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util_rt = cpu_util_rt(rq);
}

@@ -198,20 +200,29 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
if (rq->rt.rt_nr_running)
return sg_cpu->max;

- util = sg_cpu->util_dl;
- util += sg_cpu->util_cfs;
+ util = sg_cpu->util_cfs;
util += sg_cpu->util_rt;

+ if ((util + sg_cpu->util_dl) >= sg_cpu->max)
+ return sg_cpu->max;
+
/*
- * Utilization required by DEADLINE must always be granted while, for
- * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
- * gracefully reduce the frequency when no tasks show up for longer
+ * As there is still idle time on the CPU, we need to compute the
+ * utilization level of the CPU.
+ *
+ * Bandwidth required by DEADLINE must always be granted while, for
+ * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+ * to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set util_dl as min/guaranteed freq and
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
+
+ /* Add DL bandwidth requirement */
+ util += sg_cpu->bw_dl;
+
return min(sg_cpu->max, util);
}

@@ -367,7 +378,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu, struct sugov_policy *sg_policy)
{
- if (cpu_util_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->util_dl)
+ if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
sg_policy->need_freq_update = true;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7d7d4f4..ef5d6aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2192,11 +2192,16 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif

#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
-static inline unsigned long cpu_util_dl(struct rq *rq)
+static inline unsigned long cpu_bw_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}

+static inline unsigned long cpu_util_dl(struct rq *rq)
+{
+ return READ_ONCE(rq->avg_dl.util_avg);
+}
+
static inline unsigned long cpu_util_cfs(struct rq *rq)
{
unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
--
2.7.4


2018-06-28 19:09:33

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 06/11] sched/irq: add irq utilization tracking

Interrupt and steal time are the only remaining activities tracked by
rt_avg. As for the sched classes, we can use PELT to track their average
utilization of the CPU. But unlike the sched classes, we don't track when
entering/leaving interrupt; instead, we take into account the time spent
under interrupt context when we update the rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for the interrupt time during the update.
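
As an illustrative example: if 2ms have elapsed since the last update and
0.5ms of that was spent in interrupt context, update_irq_load_avg() (added
below) first decays avg_irq as if it were idle over the first 1.5ms
(rq->clock - running) and then accumulates it as fully running over the last
0.5ms, matching the pessimistic assumption that the interrupt happened just
before the update.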

It's also important to note that, because
  rq_clock == rq_clock_task + interrupt time
and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupts gives a more accurate level of utilization of the CPU.
The CPU utilization is:
  avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time avg_irq is small and negligible, so the use of the
approximation CPU utilization = /Sum avg_rq has been good enough so far.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 4 +++-
kernel/sched/fair.c | 13 ++++++++++---
kernel/sched/pelt.c | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 16 ++++++++++++++++
kernel/sched/sched.h | 3 +++
5 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d8fac..e5263a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -18,6 +18,8 @@
#include "../workqueue_internal.h"
#include "../smpboot.h"

+#include "pelt.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>

@@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
- sched_rt_avg_update(rq, irq_delta + steal);
+ update_irq_load_avg(rq, irq_delta + steal);
#endif
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffce4b2..d2758e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,7 +7289,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

-static inline bool others_rqs_have_blocked(struct rq *rq)
+static inline bool others_have_blocked(struct rq *rq)
{
if (READ_ONCE(rq->avg_rt.util_avg))
return true;
@@ -7297,6 +7297,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
if (READ_ONCE(rq->avg_dl.util_avg))
return true;

+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ if (READ_ONCE(rq->avg_irq.util_avg))
+ return true;
+#endif
+
return false;
}

@@ -7361,8 +7366,9 @@ static void update_blocked_averages(int cpu)
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_irq_load_avg(rq, 0);
/* Don't need periodic decay once load/util_avg are null */
- if (others_rqs_have_blocked(rq))
+ if (others_have_blocked(rq))
done = false;

#ifdef CONFIG_NO_HZ_COMMON
@@ -7431,9 +7437,10 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_irq_load_avg(rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 8b78b63..ead6d8b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)

return 0;
}
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+/*
+ * irq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_irq_load_avg(struct rq *rq, u64 running)
+{
+ int ret = 0;
+ /*
+ * We know the time that has been used by interrupt since last update
+ * but we don't know when. Let's be pessimistic and assume that the
+ * interrupt has happened just before the update. This is not so far from
+ * reality because the interrupt will most probably wake up a task and
+ * trigger an update of the rq clock, during which the metric is updated.
+ * We first decay over the normal context time and then add the
+ * interrupt context time.
+ * We can safely remove running from rq->clock because
+ * rq->clock += delta with delta >= running
+ */
+ ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+ 0,
+ 0,
+ 0);
+ ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+ 1,
+ 1,
+ 1);
+
+ if (ret)
+ ___update_load_avg(&rq->avg_irq, 1, 1);
+
+ return ret;
+}
+#endif
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0e4f912..d2894db 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+int update_irq_load_avg(struct rq *rq, u64 running);
+#else
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+ return 0;
+}
+#endif
+
/*
* When a task is dequeued, its estimated utilization should not be update if
* its util_avg has not been updated at least once.
@@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
{
return 0;
}
+
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef5d6aa..377be2b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -850,6 +850,9 @@ struct rq {
u64 age_stamp;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ struct sched_avg avg_irq;
+#endif
u64 idle_stamp;
u64 avg_idle;

--
2.7.4


2018-06-28 23:22:03

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 10/11] sched: remove rt_avg code

rt_avg is no longer used anywhere, so we can remove all the related code.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/core.c | 26 --------------------------
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 17 -----------------
3 files changed, 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5263a4..e9aae7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -652,23 +652,6 @@ bool sched_can_stop_tick(struct rq *rq)
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
-
-void sched_avg_update(struct rq *rq)
-{
- s64 period = sched_avg_period();
-
- while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
- /*
- * Inline assembly required to prevent the compiler
- * optimising this loop into a divmod call.
- * See __iter_div_u64_rem() for another example of this.
- */
- asm("" : "+rm" (rq->age_stamp));
- rq->age_stamp += period;
- rq->rt_avg /= 2;
- }
-}
-
#endif /* CONFIG_SMP */

#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
@@ -5719,13 +5702,6 @@ void set_rq_offline(struct rq *rq)
}
}

-static void set_cpu_rq_start_time(unsigned int cpu)
-{
- struct rq *rq = cpu_rq(cpu);
-
- rq->age_stamp = sched_clock_cpu(cpu);
-}
-
/*
* used to mark begin/end of suspend/resume:
*/
@@ -5843,7 +5819,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
- set_cpu_rq_start_time(cpu);
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6111,7 +6086,6 @@ void __init sched_init(void)

#ifdef CONFIG_SMP
idle_thread_set_boot_cpu();
- set_cpu_rq_start_time(smp_processor_id());
#endif
init_sched_fair_class();

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce0dcbf..7ddb13a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5322,8 +5322,6 @@ static void cpu_load_update(struct rq *this_rq, unsigned long this_load,

this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
}
-
- sched_avg_update(this_rq);
}

/* Used instead of source_load when we know the type == 0 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59a633d..c71ea81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -846,8 +846,6 @@ struct rq {

struct list_head cfs_tasks;

- u64 rt_avg;
- u64 age_stamp;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
@@ -1712,11 +1710,6 @@ extern const_debug unsigned int sysctl_sched_time_avg;
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;

-static inline u64 sched_avg_period(void)
-{
- return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
-}
-
#ifdef CONFIG_SCHED_HRTICK

/*
@@ -1753,8 +1746,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
#endif

#ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1765,12 +1756,6 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
return SCHED_CAPACITY_SCALE;
}
#endif
-
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-{
- rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
- sched_avg_update(rq);
-}
#else
#ifndef arch_scale_cpu_capacity
static __always_inline
@@ -1779,8 +1764,6 @@ unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
return SCHED_CAPACITY_SCALE;
}
#endif
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-static inline void sched_avg_update(struct rq *rq) { }
#endif

struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
--
2.7.4


2018-06-28 23:47:55

by Vincent Guittot

[permalink] [raw]
Subject: [PATCH 04/11] sched/dl: add dl_rq utilization tracking

Similarly to what happens with rt tasks, cfs tasks can be preempted by dl
tasks and cfs's utilization might no longer describe the real utilization
level.
The current dl bandwidth reflects the requirements to meet deadlines when
tasks are enqueued, but not the current utilization of the dl sched class.
We track the dl class utilization to estimate the system utilization.
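
As an illustration of the distinction (a sketch only, assuming the
cpu_util_dl()/cpu_bw_dl() helpers used by the schedutil patches of this
series; 'other_util' and the function name are made up for the example):

/*
 * The dl *bandwidth* is what the task model requires, the dl *utilization*
 * is what has actually been consumed. A governor can keep requesting the
 * bandwidth by default but still detect real saturation with the measured
 * utilization.
 */
static unsigned long sketch_dl_freq_request(struct rq *rq,
					    unsigned long other_util,
					    unsigned long max)
{
	if (other_util + cpu_util_dl(rq) >= max)
		return max;	/* the CPU is actually overloaded */

	/* otherwise stick to the dl bandwidth requirement */
	return min(max, other_util + cpu_bw_dl(rq));
}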

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/deadline.c | 6 ++++++
kernel/sched/fair.c | 11 ++++++++---
kernel/sched/pelt.c | 23 +++++++++++++++++++++++
kernel/sched/pelt.h | 6 ++++++
kernel/sched/sched.h | 1 +
5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fbfc3f1..f4de2698 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,7 @@
* Fabio Checconi <[email protected]>
*/
#include "sched.h"
+#include "pelt.h"

struct dl_bandwidth def_dl_bandwidth;

@@ -1761,6 +1762,9 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

deadline_queue_push_tasks(rq);

+ if (rq->curr->sched_class != &dl_sched_class)
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+
return p;
}

@@ -1768,6 +1772,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
update_curr_dl(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
@@ -1784,6 +1789,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
{
update_curr_dl(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
/*
* Even when we have runtime, update_curr_dl() might have resulted in us
* not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 328bedc..ffce4b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,11 +7289,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

-static inline bool rt_rq_has_blocked(struct rq *rq)
+static inline bool others_rqs_have_blocked(struct rq *rq)
{
if (READ_ONCE(rq->avg_rt.util_avg))
return true;

+ if (READ_ONCE(rq->avg_dl.util_avg))
+ return true;
+
return false;
}

@@ -7357,8 +7360,9 @@ static void update_blocked_averages(int cpu)
done = false;
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
/* Don't need periodic decay once load/util_avg are null */
- if (rt_rq_has_blocked(rq))
+ if (others_rqs_have_blocked(rq))
done = false;

#ifdef CONFIG_NO_HZ_COMMON
@@ -7426,9 +7430,10 @@ static inline void update_blocked_averages(int cpu)
update_rq_clock(rq);
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a00b1ba..8b78b63 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -334,3 +334,26 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

return 0;
}
+
+/*
+ * dl_rq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+ running,
+ running,
+ running)) {
+
+ ___update_load_avg(&rq->avg_dl, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index b2983b7..0e4f912 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -4,6 +4,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be update if
@@ -45,6 +46,11 @@ update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}

+static inline int
+update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2b12b0..7d7d4f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -849,6 +849,7 @@ struct rq {
u64 rt_avg;
u64 age_stamp;
struct sched_avg avg_rt;
+ struct sched_avg avg_dl;
u64 idle_stamp;
u64 avg_idle;

--
2.7.4


2018-06-29 03:48:31

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms

On Thu, Jun 28, 2018 at 05:45:14PM +0200, Vincent Guittot wrote:
> /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
> Remove it
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: "Luis R. Rodriguez" <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>

Reviewed-by: Luis R. Rodriguez <[email protected]>

Luis

2018-06-29 07:05:09

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 11/11] proc/sched: remove unused sched_time_avg_ms

On Thu, 28 Jun 2018 at 21:03, Luis R. Rodriguez <[email protected]> wrote:
>
> On Thu, Jun 28, 2018 at 05:45:14PM +0200, Vincent Guittot wrote:
> > /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere.
> > Remove it
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > Cc: "Luis R. Rodriguez" <[email protected]>
> > Signed-off-by: Vincent Guittot <[email protected]>
>
> Reviewed-by: Luis R. Rodriguez <[email protected]>

Thanks

Vincent

>
> Luis

2018-07-05 12:38:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] track CPU utilization

On Thu, Jun 28, 2018 at 05:45:03PM +0200, Vincent Guittot wrote:
> Vincent Guittot (11):
> sched/pelt: Move pelt related code in a dedicated file
> sched/rt: add rt_rq utilization tracking
> cpufreq/schedutil: use rt utilization tracking
> sched/dl: add dl_rq utilization tracking
> cpufreq/schedutil: use dl utilization tracking
> sched/irq: add irq utilization tracking
> cpufreq/schedutil: take into account interrupt
> sched: schedutil: remove sugov_aggregate_util()
> sched: use pelt for scale_rt_capacity()
> sched: remove rt_avg code
> proc/sched: remove unused sched_time_avg_ms
>
> include/linux/sched/sysctl.h | 1 -
> kernel/sched/Makefile | 2 +-
> kernel/sched/core.c | 38 +---
> kernel/sched/cpufreq_schedutil.c | 65 ++++---
> kernel/sched/deadline.c | 8 +-
> kernel/sched/fair.c | 403 +++++----------------------------------
> kernel/sched/pelt.c | 399 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/pelt.h | 72 +++++++
> kernel/sched/rt.c | 15 +-
> kernel/sched/sched.h | 68 +++++--
> kernel/sysctl.c | 8 -
> 11 files changed, 632 insertions(+), 447 deletions(-)
> create mode 100644 kernel/sched/pelt.c
> create mode 100644 kernel/sched/pelt.h

OK, this looks good I suppose. Rafael, are you OK with me taking these?

I have the below on top because I once again forgot how it all worked;
does this work for you Vincent?

---
Subject: sched/cpufreq: Clarify sugov_get_util()

Add a few comments (hopefully) clarifying some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
cpufreq_schedutil.c | 69 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 51 insertions(+), 18 deletions(-)

--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct
return cpufreq_driver_resolve_freq(policy, freq);
}

+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ * cpu_util_{cfs,rt,dl,irq}()
+ * cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued in
+ * the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured meric but a value computed
+ * based on the task model parameters and gives the minimal u required to meet
+ * deadlines.
+ */
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
@@ -188,26 +208,50 @@ static unsigned long sugov_get_util(stru
if (rt_rq_is_runnable(&rq->rt))
return max;

+ /*
+ * Early check to see if IRQ/steal time saturates the CPU, can be
+ * because of inaccuracies in how we track these -- see
+ * update_irq_load_avg().
+ */
irq = cpu_util_irq(rq);
-
if (unlikely(irq >= max))
return max;

- /* Sum rq utilization */
+ /*
+ * Because the time spend on RT/DL tasks is visible as 'lost' time to
+ * CFS tasks and we use the same metric to track the effective
+ * utilization (PELT windows are synchronized) we can directly add them
+ * to obtain the CPU's actual utilization.
+ */
util = cpu_util_cfs(rq);
util += cpu_util_rt(rq);

/*
- * Interrupt time is not seen by rqs utilization nso we can compare
- * them with the CPU capacity
+ * We do not make cpu_util_dl() a permanent part of this sum because we
+ * want to use cpu_bw_dl() later on, but we need to check if the
+ * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
+ * f_max when there is no idle time.
+ *
+ * NOTE: numerical errors or stop class might cause us to not quite hit
+ * saturation when we should -- something for later.
*/
if ((util + cpu_util_dl(rq)) >= max)
return max;

/*
- * As there is still idle time on the CPU, we need to compute the
- * utilization level of the CPU.
+ * There is still idle time; further improve the number by using the
+ * irq metric. Because IRQ/steal time is hidden from the task clock we
+ * need to scale the task numbers:
*
+ * 1 - irq
+ * U' = irq + ------- * U
+ * max
+ */
+ util *= (max - irq);
+ util /= max;
+ util += irq;
+
+ /*
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
@@ -217,18 +261,7 @@ static unsigned long sugov_get_util(stru
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
-
- /* Weight rqs utilization to normal context window */
- util *= (max - irq);
- util /= max;
-
- /* Add interrupt utilization */
- util += irq;
-
- /* Add DL bandwidth requirement */
- util += sg_cpu->bw_dl;
-
- return min(max, util);
+ return min(max, util + sg_cpu->bw_dl);
}

/**

2018-07-05 13:35:00

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] track CPU utilization

Hi Peter

On Thu, 5 Jul 2018 at 14:36, Peter Zijlstra <[email protected]> wrote:
>

>
> OK, this looks good I suppose. Rafael, are you OK with me taking these?
>
> I have the below on top because I once again forgot how it all worked;
> does this work for you Vincent?

Yes looks good to me

Thanks

>
> ---
> Subject: sched/cpufreq: Clarify sugov_get_util()
>
> Add a few comments (hopefully) clarifying some of the magic in
> sugov_get_util().
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> cpufreq_schedutil.c | 69 ++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 51 insertions(+), 18 deletions(-)
>

2018-07-06 05:59:02

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 03/11] cpufreq/schedutil: use rt utilization tracking

On 28-06-18, 17:45, Vincent Guittot wrote:
> Add both cfs and rt utilization when selecting an OPP for cfs tasks as rt
> can preempt and steal cfs's running time.
>
> rt util_avg is used to take into account the utilization of rt tasks
> on the CPU when selecting OPP. If a rt task migrate, the rt utilization
> will not migrate but will decay over time. On an overloaded CPU, cfs
> utilization reflects the remaining utilization avialable on CPU. When rt
> task migrates, the cfs utilization will increase when tasks will start to
> use the newly available capacity. At the same pace, rt utilization will
> decay and both variations will compensate each other to keep unchanged
> overall utilization and will prevent any OPP drop.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-07-06 06:00:53

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 05/11] cpufreq/schedutil: use dl utilization tracking

On 28-06-18, 17:45, Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can detect when CPU is fully used so we should run at max.
> Otherwise, we keep using the dl bandwidth requirement to define the
> utilization of the CPU
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
> kernel/sched/sched.h | 7 ++++++-
> 2 files changed, 23 insertions(+), 7 deletions(-)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-07-06 06:01:59

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 07/11] cpufreq/schedutil: take into account interrupt

On 28-06-18, 17:45, Vincent Guittot wrote:
> The time spent under interrupt can be significant but it is not reflected
> in the utilization of CPU when deciding to choose an OPP. Now that we have
> access to this metric, schedutil can take it into account when selecting
> the OPP for a CPU.
> rqs utilization don't see the time spend under interrupt context and report
> their value in the normal context time window. We need to compensate this when
> adding interrupt utilization
>
> The CPU utilization is :
> irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
>
> A test with iperf on hikey (octo arm64) gives:
> iperf -c server_address -r -t 5
>
> w/o patch w/ patch
> Tx 276 Mbits/sec 304 Mbits/sec +10%
> Rx 299 Mbits/sec 328 Mbits/sec +09%
>
> 8 iterations
> stdev is lower than 1%
> Only WFI idle state is enable (shallowest diel state)
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++++++++----
> kernel/sched/sched.h | 13 +++++++++++++
> 2 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index edfbfc1..b77bfef 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -58,6 +58,7 @@ struct sugov_cpu {
> unsigned long util_dl;
> unsigned long bw_dl;
> unsigned long util_rt;
> + unsigned long util_irq;
> unsigned long max;
>
> /* The field below is for single-CPU policies only: */
> @@ -190,21 +191,30 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> sg_cpu->util_dl = cpu_util_dl(rq);
> sg_cpu->bw_dl = cpu_bw_dl(rq);
> sg_cpu->util_rt = cpu_util_rt(rq);
> + sg_cpu->util_irq = cpu_util_irq(rq);
> }
>
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> - unsigned long util;
> + unsigned long util, max = sg_cpu->max;
>
> if (rq->rt.rt_nr_running)
> return sg_cpu->max;
>
> + if (unlikely(sg_cpu->util_irq >= max))
> + return max;
> +
> + /* Sum rq utilization */
> util = sg_cpu->util_cfs;
> util += sg_cpu->util_rt;
>
> - if ((util + sg_cpu->util_dl) >= sg_cpu->max)
> - return sg_cpu->max;
> + /*
> + * Interrupt time is not seen by rqs utilization nso we can compare

nso ?

> + * them with the CPU capacity
> + */
> + if ((util + sg_cpu->util_dl) >= max)
> + return max;
>
> /*
> * As there is still idle time on the CPU, we need to compute the
> @@ -220,10 +230,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> * ready for such an interface. So, we only do the latter for now.
> */
>
> + /* Weight rqs utilization to normal context window */
> + util *= (max - sg_cpu->util_irq);
> + util /= max;
> +
> + /* Add interrupt utilization */
> + util += sg_cpu->util_irq;
> +
> /* Add DL bandwidth requirement */
> util += sg_cpu->bw_dl;
>
> - return min(sg_cpu->max, util);
> + return min(max, util);
> }
>
> /**
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 377be2b..9438e68 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2221,4 +2221,17 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
> {
> return rq->avg_rt.util_avg;
> }
> +
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +static inline unsigned long cpu_util_irq(struct rq *rq)
> +{
> + return rq->avg_irq.util_avg;
> +}
> +#else
> +static inline unsigned long cpu_util_irq(struct rq *rq)
> +{
> + return 0;
> +}
> +
> +#endif
> #endif

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-07-06 06:05:01

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 08/11] sched: schedutil: remove sugov_aggregate_util()

On 28-06-18, 17:45, Vincent Guittot wrote:
> There is no reason why sugov_get_util() and sugov_aggregate_util()
> were in fact separate functions.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> [rebased after adding irq tracking and fixed some compilation errors]
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++--------------------------
> kernel/sched/sched.h | 2 +-
> 2 files changed, 16 insertions(+), 30 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index b77bfef..d04f941 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -53,12 +53,7 @@ struct sugov_cpu {
> unsigned int iowait_boost_max;
> u64 last_update;
>
> - /* The fields below are only needed when sharing a policy: */
> - unsigned long util_cfs;
> - unsigned long util_dl;
> unsigned long bw_dl;
> - unsigned long util_rt;
> - unsigned long util_irq;
> unsigned long max;
>
> /* The field below is for single-CPU policies only: */
> @@ -182,38 +177,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> return cpufreq_driver_resolve_freq(policy, freq);
> }
>
> -static void sugov_get_util(struct sugov_cpu *sg_cpu)
> +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long util, irq, max;
>
> - sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> - sg_cpu->util_cfs = cpu_util_cfs(rq);
> - sg_cpu->util_dl = cpu_util_dl(rq);
> - sg_cpu->bw_dl = cpu_bw_dl(rq);
> - sg_cpu->util_rt = cpu_util_rt(rq);
> - sg_cpu->util_irq = cpu_util_irq(rq);
> -}
> -
> -static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> -{
> - struct rq *rq = cpu_rq(sg_cpu->cpu);
> - unsigned long util, max = sg_cpu->max;
> + sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> + sg_cpu->bw_dl = cpu_bw_dl(rq);
>
> if (rq->rt.rt_nr_running)
> - return sg_cpu->max;
> + return max;
> +
> + irq = cpu_util_irq(rq);
>
> - if (unlikely(sg_cpu->util_irq >= max))
> + if (unlikely(irq >= max))
> return max;
>
> /* Sum rq utilization */
> - util = sg_cpu->util_cfs;
> - util += sg_cpu->util_rt;
> + util = cpu_util_cfs(rq);
> + util += cpu_util_rt(rq);
>
> /*
> * Interrupt time is not seen by rqs utilization nso we can compare
> * them with the CPU capacity
> */
> - if ((util + sg_cpu->util_dl) >= max)
> + if ((util + cpu_util_dl(rq)) >= max)
> return max;
>
> /*
> @@ -231,11 +219,11 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> */
>
> /* Weight rqs utilization to normal context window */
> - util *= (max - sg_cpu->util_irq);
> + util *= (max - irq);
> util /= max;
>
> /* Add interrupt utilization */
> - util += sg_cpu->util_irq;
> + util += irq;
>
> /* Add DL bandwidth requirement */
> util += sg_cpu->bw_dl;
> @@ -418,9 +406,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>
> busy = sugov_cpu_is_busy(sg_cpu);
>
> - sugov_get_util(sg_cpu);
> + util = sugov_get_util(sg_cpu);
> max = sg_cpu->max;
> - util = sugov_aggregate_util(sg_cpu);
> sugov_iowait_apply(sg_cpu, time, &util, &max);
> next_f = get_next_freq(sg_policy, util, max);
> /*
> @@ -459,9 +446,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> unsigned long j_util, j_max;
>
> - sugov_get_util(j_sg_cpu);
> + j_util = sugov_get_util(j_sg_cpu);
> j_max = j_sg_cpu->max;
> - j_util = sugov_aggregate_util(j_sg_cpu);
> sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);
>
> if (j_util * max > j_max * util) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9438e68..59a633d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2219,7 +2219,7 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)
>
> static inline unsigned long cpu_util_rt(struct rq *rq)
> {
> - return rq->avg_rt.util_avg;
> + return READ_ONCE(rq->avg_rt.util_avg);
> }
>
> #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-07-06 06:07:50

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] track CPU utilization

On 05-07-18, 14:36, Peter Zijlstra wrote:
> Subject: sched/cpufreq: Clarify sugov_get_util()
>
> Add a few comments (hopefully) clarifying some of the magic in
> sugov_get_util().
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> cpufreq_schedutil.c | 69 ++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 51 insertions(+), 18 deletions(-)
>
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct
> return cpufreq_driver_resolve_freq(policy, freq);
> }
>
> +/*
> + * This function computes an effective utilization for the given CPU, to be
> + * used for frequency selection given the linear relation: f = u * f_max.
> + *
> + * The scheduler tracks the following metrics:
> + *
> + * cpu_util_{cfs,rt,dl,irq}()
> + * cpu_bw_dl()
> + *
> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> + * synchronized windows and are thus directly comparable.
> + *
> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> + * which excludes things like IRQ and steal-time. These latter are then accrued in
> + * the irq utilization.
> + *
> + * The DL bandwidth number otoh is not a measured meric but a value computed

metric

> + * based on the task model parameters and gives the minimal u required to meet

u ?

> + * deadlines.
> + */
> static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> @@ -188,26 +208,50 @@ static unsigned long sugov_get_util(stru
> if (rt_rq_is_runnable(&rq->rt))
> return max;
>
> + /*
> + * Early check to see if IRQ/steal time saturates the CPU, can be
> + * because of inaccuracies in how we track these -- see
> + * update_irq_load_avg().
> + */
> irq = cpu_util_irq(rq);
> -
> if (unlikely(irq >= max))
> return max;
>
> - /* Sum rq utilization */
> + /*
> + * Because the time spend on RT/DL tasks is visible as 'lost' time to
> + * CFS tasks and we use the same metric to track the effective
> + * utilization (PELT windows are synchronized) we can directly add them
> + * to obtain the CPU's actual utilization.
> + */
> util = cpu_util_cfs(rq);
> util += cpu_util_rt(rq);
>
> /*
> - * Interrupt time is not seen by rqs utilization nso we can compare
> - * them with the CPU capacity
> + * We do not make cpu_util_dl() a permanent part of this sum because we
> + * want to use cpu_bw_dl() later on, but we need to check if the
> + * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
> + * f_max when there is no idle time.
> + *
> + * NOTE: numerical errors or stop class might cause us to not quite hit
> + * saturation when we should -- something for later.
> */
> if ((util + cpu_util_dl(rq)) >= max)
> return max;
>
> /*
> - * As there is still idle time on the CPU, we need to compute the
> - * utilization level of the CPU.
> + * There is still idle time; further improve the number by using the
> + * irq metric. Because IRQ/steal time is hidden from the task clock we
> + * need to scale the task numbers:
> *
> + * 1 - irq
> + * U' = irq + ------- * U
> + * max
> + */
> + util *= (max - irq);
> + util /= max;
> + util += irq;
> +
> + /*
> * Bandwidth required by DEADLINE must always be granted while, for
> * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> * to gracefully reduce the frequency when no tasks show up for longer
> @@ -217,18 +261,7 @@ static unsigned long sugov_get_util(stru
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> -
> - /* Weight rqs utilization to normal context window */
> - util *= (max - irq);
> - util /= max;
> -
> - /* Add interrupt utilization */
> - util += irq;
> -
> - /* Add DL bandwidth requirement */
> - util += sg_cpu->bw_dl;
> -
> - return min(max, util);
> + return min(max, util + sg_cpu->bw_dl);
> }
>

Acked-by: Viresh Kumar <[email protected]>

--
viresh

2018-07-06 09:15:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/11] cpufreq/schedutil: take into account interrupt

On Fri, Jul 06, 2018 at 11:30:33AM +0530, Viresh Kumar wrote:
> On 28-06-18, 17:45, Vincent Guittot wrote:
> > The time spent under interrupt can be significant but it is not reflected
> > in the utilization of CPU when deciding to choose an OPP. Now that we have
> > access to this metric, schedutil can take it into account when selecting
> > the OPP for a CPU.
> > rqs utilization don't see the time spend under interrupt context and report
> > their value in the normal context time window. We need to compensate this when
> > adding interrupt utilization
> >
> > The CPU utilization is :
> > irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> >
> > A test with iperf on hikey (octo arm64) gives:
> > iperf -c server_address -r -t 5
> >
> > w/o patch w/ patch
> > Tx 276 Mbits/sec 304 Mbits/sec +10%
> > Rx 299 Mbits/sec 328 Mbits/sec +09%
> >
> > 8 iterations
> > stdev is lower than 1%
> > Only WFI idle state is enable (shallowest diel state)

Also s/diel/idle/

> > + /*
> > + * Interrupt time is not seen by rqs utilization nso we can compare
>
> nso ?
>
> > + * them with the CPU capacity
> > + */

Already fixed ;-)

2018-07-06 09:20:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 00/11] track CPU utilization

On Fri, Jul 06, 2018 at 11:35:22AM +0530, Viresh Kumar wrote:
> On 05-07-18, 14:36, Peter Zijlstra wrote:
> > +/*
> > + * This function computes an effective utilization for the given CPU, to be
> > + * used for frequency selection given the linear relation: f = u * f_max.
> > + *
> > + * The scheduler tracks the following metrics:
> > + *
> > + * cpu_util_{cfs,rt,dl,irq}()
> > + * cpu_bw_dl()
> > + *
> > + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> > + * synchronized windows and are thus directly comparable.
> > + *
> > + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> > + * which excludes things like IRQ and steal-time. These latter are then accrued in
> > + * the irq utilization.
> > + *
> > + * The DL bandwidth number otoh is not a measured meric but a value computed
>
> metric

Indeed, fixed.

> > + * based on the task model parameters and gives the minimal u required to meet
>
> u ?

utilization, but for lazy people :-) I'll use the whole word.

> > + * deadlines.
> > + */



2018-07-06 09:23:05

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 07/11] cpufreq/schedutil: take into account interrupt

On Fri, 6 Jul 2018 at 11:14, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jul 06, 2018 at 11:30:33AM +0530, Viresh Kumar wrote:
> > On 28-06-18, 17:45, Vincent Guittot wrote:
> > > The time spent under interrupt can be significant but it is not reflected
> > > in the utilization of CPU when deciding to choose an OPP. Now that we have
> > > access to this metric, schedutil can take it into account when selecting
> > > the OPP for a CPU.
> > > rqs utilization don't see the time spend under interrupt context and report
> > > their value in the normal context time window. We need to compensate this when
> > > adding interrupt utilization
> > >
> > > The CPU utilization is :
> > > irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> > >
> > > A test with iperf on hikey (octo arm64) gives:
> > > iperf -c server_address -r -t 5
> > >
> > > w/o patch w/ patch
> > > Tx 276 Mbits/sec 304 Mbits/sec +10%
> > > Rx 299 Mbits/sec 328 Mbits/sec +09%
> > >
> > > 8 iterations
> > > stdev is lower than 1%
> > > Only WFI idle state is enable (shallowest diel state)
>
> Also s/diel/idle/
>
> > > + /*
> > > + * Interrupt time is not seen by rqs utilization nso we can compare
> >
> > nso ?
> >
> > > + * them with the CPU capacity
> > > + */
>
> Already fixed ;-)

Thanks

2018-07-15 22:16:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()


* Vincent Guittot <[email protected]> wrote:

> The utilization of the CPU by rt, dl and interrupts are now tracked with
> PELT so we can use these metrics instead of rt_avg to evaluate the remaining
> capacity available for cfs class.
>
> scale_rt_capacity() behavior has been changed and now returns the remaining
> capacity available for cfs instead of a scaling factor because rt, dl and
> interrupt provide now absolute utilization value.
>
> The same formula as schedutil is used:
> irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> but the implementation is different because it doesn't return the same value
> and doesn't benefit of the same optimization
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/deadline.c | 2 --
> kernel/sched/fair.c | 41 +++++++++++++++++++----------------------
> kernel/sched/pelt.c | 2 +-
> kernel/sched/rt.c | 2 --
> 4 files changed, 20 insertions(+), 27 deletions(-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2758e3..ce0dcbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
> static unsigned long scale_rt_capacity(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> - u64 total, used, age_stamp, avg;
> - s64 delta;
> -
> - /*
> - * Since we're reading these variables without serialization make sure
> - * we read them once before doing sanity checks on them.
> - */
> - age_stamp = READ_ONCE(rq->age_stamp);
> - avg = READ_ONCE(rq->rt_avg);
> - delta = __rq_clock_broken(rq) - age_stamp;
> + unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> + unsigned long used, irq, free;
>
> - if (unlikely(delta < 0))
> - delta = 0;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> + irq = READ_ONCE(rq->avg_irq.util_avg);
>
> - total = sched_avg_period() + delta;
> + if (unlikely(irq >= max))
> + return 1;
> +#endif

Note that 'irq' is unused outside that macro block, resulting in a new warning on
defconfig builds:

CC kernel/sched/fair.o
kernel/sched/fair.c: In function ‘scale_rt_capacity’:
kernel/sched/fair.c:7553:22: warning: unused variable ‘irq’ [-Wunused-variable]
unsigned long used, irq, free;
^~~

I have applied the delta fix below for simplicity, but what we really want is a
cleanup of that function to eliminate the #ifdefs. One solution would be to factor
out the 'irq' utilization value into a helper inline, and double check that if the
configs are off the compiler does the right thing and eliminates this identity
transformation for the irq==0 case:

free *= (max - irq);
free /= max;

If the compiler refuses to optimize this away (due to the zero and overflow
cases), try to find something more clever?
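
A minimal sketch of such a helper, reusing the guards this series already
uses (the helper name is made up here, and this is not necessarily the
cleanup that will end up being merged):

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
static inline unsigned long rq_irq_util_avg(struct rq *rq)
{
	return READ_ONCE(rq->avg_irq.util_avg);
}
#else
static inline unsigned long rq_irq_util_avg(struct rq *rq)
{
	return 0;	/* constant 0 when irq time is not accounted */
}
#endif

With that, scale_rt_capacity() can do the free *= (max - irq); free /= max;
scaling unconditionally; whether the compiler manages to fold it away for
the constant irq == 0 case is exactly the open question raised above.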

Thanks,

Ingo

kernel/sched/fair.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3221db0511a..d5f7d521e448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
- unsigned long used, irq, free;
+ unsigned long used, free;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ unsigned long irq;
+#endif

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
irq = READ_ONCE(rq->avg_irq.util_avg);

2018-07-15 22:47:46

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()

On Mon, 2018-07-16 at 00:15 +0200, Ingo Molnar wrote:
> * Vincent Guittot <[email protected]> wrote:
>
> > The utilization of the CPU by rt, dl and interrupts are now tracked with
> > PELT so we can use these metrics instead of rt_avg to evaluate the remaining
> > capacity available for cfs class.
> >
> > scale_rt_capacity() behavior has been changed and now returns the remaining
> > capacity available for cfs instead of a scaling factor because rt, dl and
> > interrupt provide now absolute utilization value.
> >
> > The same formula as schedutil is used:
> > irq util_avg + (1 - irq util_avg / max capacity ) * /Sum rq util_avg
> > but the implementation is different because it doesn't return the same value
> > and doesn't benefit of the same optimization
[]
> I have applied the delta fix below for simplicity, but what we really want is a
> cleanup of that function to eliminate the #ifdefs. One solution would be to factor
> out the 'irq' utilization value into a helper inline, and double check that if the
> configs are off the compiler does the right thing and eliminates this identity
> transformation for the irq==0 case:
[]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
[]
> @@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> - unsigned long used, irq, free;
> + unsigned long used, free;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> + unsigned long irq;
> +#endif
>
> #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

Perhaps combine these two #if defined blocks into
a single block

> irq = READ_ONCE(rq->avg_irq.util_avg);
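
For illustration only, the two #if blocks combined into one could look like
this at the top of scale_rt_capacity(), right after the existing
declarations (a sketch; the final cleanup may well differ):

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
	unsigned long irq = READ_ONCE(rq->avg_irq.util_avg);

	if (unlikely(irq >= max))
		return 1;
#endif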

Subject: [tip:sched/core] sched/rt: Add rt_rq utilization tracking

Commit-ID: 371bf42732694d142b0de026e152266c039b97d3
Gitweb: https://git.kernel.org/tip/371bf42732694d142b0de026e152266c039b97d3
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:05 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

sched/rt: Add rt_rq utilization tracking

The schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS
tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks
are preempted by RT tasks and in this case util_avg reflects the remaining
capacity but not what CFS wants to use. In such a case, schedutil can select a
lower OPP even though the CPU is overloaded. In order to have a more accurate
view of the utilization of the CPU, we track the utilization of RT tasks.
Only util_avg is correctly tracked; load_avg and runnable_load_avg are not,
as they are meaningless for rt_rq.

rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
the same at the root group level, so the PELT windows of the util_sum are
aligned.
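
As a simple illustration (not part of this patch), having the windows aligned
means the two signals can just be summed to see how busy the CPU really is;
the function name below is made up for the example:

static inline unsigned long sketch_cfs_plus_rt_util(struct rq *rq)
{
	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

	/* directly comparable because the PELT windows are synchronized */
	util += READ_ONCE(rq->avg_rt.util_avg);

	return util;
}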

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 15 ++++++++++++++-
kernel/sched/pelt.c | 25 +++++++++++++++++++++++++
kernel/sched/pelt.h | 7 +++++++
kernel/sched/rt.c | 13 +++++++++++++
kernel/sched/sched.h | 7 +++++++
5 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39ab46cea6c5..5b453213cd18 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,6 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

+static inline bool rt_rq_has_blocked(struct rq *rq)
+{
+ if (READ_ONCE(rq->avg_rt.util_avg))
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7349,6 +7357,10 @@ static void update_blocked_averages(int cpu)
if (cfs_rq_has_blocked(cfs_rq))
done = false;
}
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ /* Don't need periodic decay once load/util_avg are null */
+ if (rt_rq_has_blocked(rq))
+ done = false;

#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
@@ -7414,9 +7426,10 @@ static inline void update_blocked_averages(int cpu)
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index e6ecbb2b8698..a00b1ba3dd5b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -309,3 +309,28 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

return 0;
}
+
+/*
+ * rt_rq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ * load_avg and runnable_load_avg are not supported and meaningless.
+ *
+ */
+
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+ running,
+ running,
+ running)) {
+
+ ___update_load_avg(&rq->avg_rt, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 9cac73efd64a..b2983b741d57 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -3,6 +3,7 @@
int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be update if
@@ -38,6 +39,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
return 0;
}

+static inline int
+update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ return 0;
+}
+
#endif


diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 572567078b60..0dc8ad1915e6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -5,6 +5,8 @@
*/
#include "sched.h"

+#include "pelt.h"
+
int sched_rr_timeslice = RR_TIMESLICE;
int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

@@ -1576,6 +1578,14 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

rt_queue_push_tasks(rq);

+ /*
+ * If prev task was rt, put_prev_task() has already updated the
+ * utilization. We only care of the case where we start to schedule a
+ * rt task
+ */
+ if (rq->curr->sched_class != &rt_sched_class)
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+
return p;
}

@@ -1583,6 +1593,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
update_curr_rt(rq);

+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
/*
* The previous task needs to be made eligible for pushing
* if it is still active
@@ -2312,6 +2324,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
struct sched_rt_entity *rt_se = &p->rt;

update_curr_rt(rq);
+ update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);

watchdog(rq, p);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 00d6f2594c4e..405dd9ba6b39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
+
#endif /* CONFIG_SMP */
int rt_queued;

@@ -854,6 +855,7 @@ struct rq {

u64 rt_avg;
u64 age_stamp;
+ struct sched_avg avg_rt;
u64 idle_stamp;
u64 avg_idle;

@@ -2212,4 +2214,9 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)

return util;
}
+
+static inline unsigned long cpu_util_rt(struct rq *rq)
+{
+ return rq->avg_rt.util_avg;
+}
#endif

Subject: [tip:sched/core] sched/pelt: Move PELT related code in a dedicated file

Commit-ID: c079629862b20c101e8336362a8b042ec7d942fe
Gitweb: https://git.kernel.org/tip/c079629862b20c101e8336362a8b042ec7d942fe
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:04 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

sched/pelt: Move PELT related code in a dedicated file

We want to track rt_rq's utilization as a part of the estimation of the
whole rq's utilization. This is necessary because rt tasks can steal
utilization from cfs tasks and make them appear lighter than they are.
As we want to use the same load tracking mechanism for both and avoid a
useless dependency between cfs and rt code, the PELT code is moved into a
dedicated file.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/Makefile | 2 +-
kernel/sched/fair.c | 333 +-------------------------------------------------
kernel/sched/pelt.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 43 +++++++
kernel/sched/sched.h | 19 +++
5 files changed, 375 insertions(+), 333 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..7fe183404c38 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle.o fair.o rt.o deadline.o
obj-y += wait.o wait_bit.o swait.o completion.o

-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08b89ae34233..39ab46cea6c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -255,9 +255,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
return cfs_rq->rq;
}

-/* An entity is a task if it doesn't "own" a runqueue */
-#define entity_is_task(se) (!se->my_q)
-
static inline struct task_struct *task_of(struct sched_entity *se)
{
SCHED_WARN_ON(!entity_is_task(se));
@@ -419,7 +416,6 @@ static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
return container_of(cfs_rq, struct rq, cfs);
}

-#define entity_is_task(se) 1

#define for_each_sched_entity(se) \
for (; se; se = NULL)
@@ -692,7 +688,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
}

#ifdef CONFIG_SMP
-
+#include "pelt.h"
#include "sched-pelt.h"

static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
@@ -2751,19 +2747,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
} while (0)

#ifdef CONFIG_SMP
-/*
- * XXX we want to get rid of these helpers and use the full load resolution.
- */
-static inline long se_weight(struct sched_entity *se)
-{
- return scale_load_down(se->load.weight);
-}
-
-static inline long se_runnable(struct sched_entity *se)
-{
- return scale_load_down(se->runnable_weight);
-}
-
static inline void
enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -3064,314 +3047,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
}

#ifdef CONFIG_SMP
-/*
- * Approximate:
- * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
- */
-static u64 decay_load(u64 val, u64 n)
-{
- unsigned int local_n;
-
- if (unlikely(n > LOAD_AVG_PERIOD * 63))
- return 0;
-
- /* after bounds checking we can collapse to 32-bit */
- local_n = n;
-
- /*
- * As y^PERIOD = 1/2, we can combine
- * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
- * With a look-up table which covers y^n (n<PERIOD)
- *
- * To achieve constant time decay_load.
- */
- if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
- val >>= local_n / LOAD_AVG_PERIOD;
- local_n %= LOAD_AVG_PERIOD;
- }
-
- val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
- return val;
-}
-
-static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
-{
- u32 c1, c2, c3 = d3; /* y^0 == 1 */
-
- /*
- * c1 = d1 y^p
- */
- c1 = decay_load((u64)d1, periods);
-
- /*
- * p-1
- * c2 = 1024 \Sum y^n
- * n=1
- *
- * inf inf
- * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
- * n=0 n=p
- */
- c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
-
- return c1 + c2 + c3;
-}
-
-/*
- * Accumulate the three separate parts of the sum; d1 the remainder
- * of the last (incomplete) period, d2 the span of full periods and d3
- * the remainder of the (incomplete) current period.
- *
- * d1 d2 d3
- * ^ ^ ^
- * | | |
- * |<->|<----------------->|<--->|
- * ... |---x---|------| ... |------|-----x (now)
- *
- * p-1
- * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
- * n=1
- *
- * = u y^p + (Step 1)
- *
- * p-1
- * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2)
- * n=1
- */
-static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
-{
- unsigned long scale_freq, scale_cpu;
- u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
- u64 periods;
-
- scale_freq = arch_scale_freq_capacity(cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
- delta += sa->period_contrib;
- periods = delta / 1024; /* A period is 1024us (~1ms) */
-
- /*
- * Step 1: decay old *_sum if we crossed period boundaries.
- */
- if (periods) {
- sa->load_sum = decay_load(sa->load_sum, periods);
- sa->runnable_load_sum =
- decay_load(sa->runnable_load_sum, periods);
- sa->util_sum = decay_load((u64)(sa->util_sum), periods);
-
- /*
- * Step 2
- */
- delta %= 1024;
- contrib = __accumulate_pelt_segments(periods,
- 1024 - sa->period_contrib, delta);
- }
- sa->period_contrib = delta;
-
- contrib = cap_scale(contrib, scale_freq);
- if (load)
- sa->load_sum += load * contrib;
- if (runnable)
- sa->runnable_load_sum += runnable * contrib;
- if (running)
- sa->util_sum += contrib * scale_cpu;
-
- return periods;
-}
-
-/*
- * We can represent the historical contribution to runnable average as the
- * coefficients of a geometric series. To do this we sub-divide our runnable
- * history into segments of approximately 1ms (1024us); label the segment that
- * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
- *
- * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
- * p0 p1 p2
- * (now) (~1ms ago) (~2ms ago)
- *
- * Let u_i denote the fraction of p_i that the entity was runnable.
- *
- * We then designate the fractions u_i as our co-efficients, yielding the
- * following representation of historical load:
- * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
- *
- * We choose y based on the with of a reasonably scheduling period, fixing:
- * y^32 = 0.5
- *
- * This means that the contribution to load ~32ms ago (u_32) will be weighted
- * approximately half as much as the contribution to load within the last ms
- * (u_0).
- *
- * When a period "rolls over" and we have new u_0`, multiplying the previous
- * sum again by y is sufficient to update:
- * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
- * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
- */
-static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
-{
- u64 delta;
-
- delta = now - sa->last_update_time;
- /*
- * This should only happen when time goes backwards, which it
- * unfortunately does during sched clock init when we swap over to TSC.
- */
- if ((s64)delta < 0) {
- sa->last_update_time = now;
- return 0;
- }
-
- /*
- * Use 1024ns as the unit of measurement since it's a reasonable
- * approximation of 1us and fast to compute.
- */
- delta >>= 10;
- if (!delta)
- return 0;
-
- sa->last_update_time += delta << 10;
-
- /*
- * running is a subset of runnable (weight) so running can't be set if
- * runnable is clear. But there are some corner cases where the current
- * se has been already dequeued but cfs_rq->curr still points to it.
- * This means that weight will be 0 but not running for a sched_entity
- * but also for a cfs_rq if the latter becomes idle. As an example,
- * this happens during idle_balance() which calls
- * update_blocked_averages()
- */
- if (!load)
- runnable = running = 0;
-
- /*
- * Now we know we crossed measurement unit boundaries. The *_avg
- * accrues by two steps:
- *
- * Step 1: accumulate *_sum since last_update_time. If we haven't
- * crossed period boundaries, finish.
- */
- if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
- return 0;
-
- return 1;
-}
-
-static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
-{
- u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
-
- /*
- * Step 2: update *_avg.
- */
- sa->load_avg = div_u64(load * sa->load_sum, divider);
- sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
- sa->util_avg = sa->util_sum / divider;
-}
-
-/*
- * When a task is dequeued, its estimated utilization should not be updated if
- * its util_avg has not been updated at least once.
- * This flag is used to synchronize util_avg updates with util_est updates.
- * We map this information into the LSB bit of the utilization saved at
- * dequeue time (i.e. util_est.dequeued).
- */
-#define UTIL_AVG_UNCHANGED 0x1
-
-static inline void cfs_se_util_change(struct sched_avg *avg)
-{
- unsigned int enqueued;
-
- if (!sched_feat(UTIL_EST))
- return;
-
- /* Avoid store if the flag has been already set */
- enqueued = avg->util_est.enqueued;
- if (!(enqueued & UTIL_AVG_UNCHANGED))
- return;
-
- /* Reset flag to report util_avg has been updated */
- enqueued &= ~UTIL_AVG_UNCHANGED;
- WRITE_ONCE(avg->util_est.enqueued, enqueued);
-}
-
-/*
- * sched_entity:
- *
- * task:
- * se_runnable() == se_weight()
- *
- * group: [ see update_cfs_group() ]
- * se_weight() = tg->weight * grq->load_avg / tg->load_avg
- * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
- *
- * load_sum := runnable_sum
- * load_avg = se_weight(se) * runnable_avg
- *
- * runnable_load_sum := runnable_sum
- * runnable_load_avg = se_runnable(se) * runnable_avg
- *
- * XXX collapse load_sum and runnable_load_sum
- *
- * cfs_rq:
- *
- * load_sum = \Sum se_weight(se) * se->avg.load_sum
- * load_avg = \Sum se->avg.load_avg
- *
- * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- * runnable_load_avg = \Sum se->avg.runnable_load_avg
- */
-
-static int
-__update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
-{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
- ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
- return 1;
- }
-
- return 0;
-}
-
-static int
-__update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
- cfs_rq->curr == se)) {
-
- ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
- cfs_se_util_change(&se->avg);
- return 1;
- }
-
- return 0;
-}
-
-static int
-__update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
-{
- if (___update_load_sum(now, cpu, &cfs_rq->avg,
- scale_load_down(cfs_rq->load.weight),
- scale_load_down(cfs_rq->runnable_weight),
- cfs_rq->curr != NULL)) {
-
- ___update_load_avg(&cfs_rq->avg, 1, 1);
- return 1;
- }
-
- return 0;
-}
-
#ifdef CONFIG_FAIR_GROUP_SCHED
/**
* update_tg_load_avg - update the tg's load avg
@@ -4039,12 +3714,6 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

#else /* CONFIG_SMP */

-static inline int
-update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
-{
- return 0;
-}
-
#define UPDATE_TG 0x0
#define SKIP_AGE_LOAD 0x0
#define DO_ATTACH 0x0
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
new file mode 100644
index 000000000000..e6ecbb2b8698
--- /dev/null
+++ b/kernel/sched/pelt.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per Entity Load Tracking
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Interactivity improvements by Mike Galbraith
+ * (C) 2007 Mike Galbraith <[email protected]>
+ *
+ * Various enhancements by Dmitry Adamushko.
+ * (C) 2007 Dmitry Adamushko <[email protected]>
+ *
+ * Group scheduling enhancements by Srivatsa Vaddagiri
+ * Copyright IBM Corporation, 2007
+ * Author: Srivatsa Vaddagiri <[email protected]>
+ *
+ * Scaled math optimizations by Thomas Gleixner
+ * Copyright (C) 2007, Thomas Gleixner <[email protected]>
+ *
+ * Adaptive scheduling granularity, math enhancements by Peter Zijlstra
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
+ *
+ * Move PELT related code from fair.c into this pelt.c file
+ * Author: Vincent Guittot <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "sched-pelt.h"
+#include "pelt.h"
+
+/*
+ * Approximate:
+ * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static u64 decay_load(u64 val, u64 n)
+{
+ unsigned int local_n;
+
+ if (unlikely(n > LOAD_AVG_PERIOD * 63))
+ return 0;
+
+ /* after bounds checking we can collapse to 32-bit */
+ local_n = n;
+
+ /*
+ * As y^PERIOD = 1/2, we can combine
+ * y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+ * With a look-up table which covers y^n (n<PERIOD)
+ *
+ * To achieve constant time decay_load.
+ */
+ if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+ val >>= local_n / LOAD_AVG_PERIOD;
+ local_n %= LOAD_AVG_PERIOD;
+ }
+
+ val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
+ return val;
+}
+
+static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
+{
+ u32 c1, c2, c3 = d3; /* y^0 == 1 */
+
+ /*
+ * c1 = d1 y^p
+ */
+ c1 = decay_load((u64)d1, periods);
+
+ /*
+ * p-1
+ * c2 = 1024 \Sum y^n
+ * n=1
+ *
+ * inf inf
+ * = 1024 ( \Sum y^n - \Sum y^n - y^0 )
+ * n=0 n=p
+ */
+ c2 = LOAD_AVG_MAX - decay_load(LOAD_AVG_MAX, periods) - 1024;
+
+ return c1 + c2 + c3;
+}
+
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
+/*
+ * Accumulate the three separate parts of the sum; d1 the remainder
+ * of the last (incomplete) period, d2 the span of full periods and d3
+ * the remainder of the (incomplete) current period.
+ *
+ * d1 d2 d3
+ * ^ ^ ^
+ * | | |
+ * |<->|<----------------->|<--->|
+ * ... |---x---|------| ... |------|-----x (now)
+ *
+ * p-1
+ * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
+ * n=1
+ *
+ * = u y^p + (Step 1)
+ *
+ * p-1
+ * d1 y^p + 1024 \Sum y^n + d3 y^0 (Step 2)
+ * n=1
+ */
+static __always_inline u32
+accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+ unsigned long load, unsigned long runnable, int running)
+{
+ unsigned long scale_freq, scale_cpu;
+ u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
+ u64 periods;
+
+ scale_freq = arch_scale_freq_capacity(cpu);
+ scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+ delta += sa->period_contrib;
+ periods = delta / 1024; /* A period is 1024us (~1ms) */
+
+ /*
+ * Step 1: decay old *_sum if we crossed period boundaries.
+ */
+ if (periods) {
+ sa->load_sum = decay_load(sa->load_sum, periods);
+ sa->runnable_load_sum =
+ decay_load(sa->runnable_load_sum, periods);
+ sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+
+ /*
+ * Step 2
+ */
+ delta %= 1024;
+ contrib = __accumulate_pelt_segments(periods,
+ 1024 - sa->period_contrib, delta);
+ }
+ sa->period_contrib = delta;
+
+ contrib = cap_scale(contrib, scale_freq);
+ if (load)
+ sa->load_sum += load * contrib;
+ if (runnable)
+ sa->runnable_load_sum += runnable * contrib;
+ if (running)
+ sa->util_sum += contrib * scale_cpu;
+
+ return periods;
+}
+
+/*
+ * We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series. To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ * p0 p1 p2
+ * (now) (~1ms ago) (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ * u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the width of a reasonable scheduling period, fixing:
+ * y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ * load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int
+___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+ unsigned long load, unsigned long runnable, int running)
+{
+ u64 delta;
+
+ delta = now - sa->last_update_time;
+ /*
+ * This should only happen when time goes backwards, which it
+ * unfortunately does during sched clock init when we swap over to TSC.
+ */
+ if ((s64)delta < 0) {
+ sa->last_update_time = now;
+ return 0;
+ }
+
+ /*
+ * Use 1024ns as the unit of measurement since it's a reasonable
+ * approximation of 1us and fast to compute.
+ */
+ delta >>= 10;
+ if (!delta)
+ return 0;
+
+ sa->last_update_time += delta << 10;
+
+ /*
+ * running is a subset of runnable (weight) so running can't be set if
+ * runnable is clear. But there are some corner cases where the current
+ * se has been already dequeued but cfs_rq->curr still points to it.
+ * This means that weight will be 0 but not running for a sched_entity
+ * but also for a cfs_rq if the latter becomes idle. As an example,
+ * this happens during idle_balance() which calls
+ * update_blocked_averages()
+ */
+ if (!load)
+ runnable = running = 0;
+
+ /*
+ * Now we know we crossed measurement unit boundaries. The *_avg
+ * accrues by two steps:
+ *
+ * Step 1: accumulate *_sum since last_update_time. If we haven't
+ * crossed period boundaries, finish.
+ */
+ if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+ return 0;
+
+ return 1;
+}
+
+static __always_inline void
+___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+{
+ u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
+
+ /*
+ * Step 2: update *_avg.
+ */
+ sa->load_avg = div_u64(load * sa->load_sum, divider);
+ sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
+ sa->util_avg = sa->util_sum / divider;
+}
+
+/*
+ * sched_entity:
+ *
+ * task:
+ * se_runnable() == se_weight()
+ *
+ * group: [ see update_cfs_group() ]
+ * se_weight() = tg->weight * grq->load_avg / tg->load_avg
+ * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
+ *
+ * load_sum := runnable_sum
+ * load_avg = se_weight(se) * runnable_avg
+ *
+ * runnable_load_sum := runnable_sum
+ * runnable_load_avg = se_runnable(se) * runnable_avg
+ *
+ * XXX collapse load_sum and runnable_load_sum
+ *
+ * cfs_rq:
+ *
+ * load_sum = \Sum se_weight(se) * se->avg.load_sum
+ * load_avg = \Sum se->avg.load_avg
+ *
+ * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
+ * runnable_load_avg = \Sum se->avg.runnable_load_avg
+ */
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+{
+ if (entity_is_task(se))
+ se->runnable_weight = se->load.weight;
+
+ if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+ ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ return 1;
+ }
+
+ return 0;
+}
+
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (entity_is_task(se))
+ se->runnable_weight = se->load.weight;
+
+ if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+ cfs_rq->curr == se)) {
+
+ ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ cfs_se_util_change(&se->avg);
+ return 1;
+ }
+
+ return 0;
+}
+
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+{
+ if (___update_load_sum(now, cpu, &cfs_rq->avg,
+ scale_load_down(cfs_rq->load.weight),
+ scale_load_down(cfs_rq->runnable_weight),
+ cfs_rq->curr != NULL)) {
+
+ ___update_load_avg(&cfs_rq->avg, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
new file mode 100644
index 000000000000..9cac73efd64a
--- /dev/null
+++ b/kernel/sched/pelt.h
@@ -0,0 +1,43 @@
+#ifdef CONFIG_SMP
+
+int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
+int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+
+/*
+ * When a task is dequeued, its estimated utilization should not be updated if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Avoid store if the flag has been already set */
+ enqueued = avg->util_est.enqueued;
+ if (!(enqueued & UTIL_AVG_UNCHANGED))
+ return;
+
+ /* Reset flag to report util_avg has been updated */
+ enqueued &= ~UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
+#else
+
+static inline int
+update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
+#endif
+
+
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..00d6f2594c4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -673,7 +673,26 @@ struct dl_rq {
u64 bw_ratio;
};

+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se) (!se->my_q)
+#else
+#define entity_is_task(se) 1
+#endif
+
#ifdef CONFIG_SMP
+/*
+ * XXX we want to get rid of these helpers and use the full load resolution.
+ */
+static inline long se_weight(struct sched_entity *se)
+{
+ return scale_load_down(se->load.weight);
+}
+
+static inline long se_runnable(struct sched_entity *se)
+{
+ return scale_load_down(se->runnable_weight);
+}

static inline bool sched_asym_prefer(int a, int b)
{

Subject: [tip:sched/core] cpufreq/schedutil: Use RT utilization tracking

Commit-ID: 3ae117c6cd7c4783819a0766aa97b9493a8a0f62
Gitweb: https://git.kernel.org/tip/3ae117c6cd7c4783819a0766aa97b9493a8a0f62
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:06 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

cpufreq/schedutil: Use RT utilization tracking

Add both CFS and RT utilization when selecting an OPP for CFS tasks as RT
can preempt and steal CFS's running time.

RT util_avg is used to take into account the utilization of RT tasks on
the CPU when selecting an OPP. If an RT task migrates, its RT utilization
does not migrate with it but decays over time. On an overloaded CPU, the
CFS utilization reflects only the remaining utilization available on the
CPU. When the RT task migrates away, the CFS utilization will increase as
tasks start to use the newly available capacity. At the same time, the RT
utilization will decay, and both variations compensate each other so that
the overall utilization stays unchanged and no OPP drop occurs.
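
As an illustration only (not kernel code; the sample values and variable
names are made up), a minimal user-space C sketch of the aggregation that
the hunk below adds, i.e. summing CFS and RT utilization and clamping to
the CPU capacity:

#include <stdio.h>

int main(void)
{
    unsigned long max = 1024;     /* stands in for arch_scale_cpu_capacity() */
    unsigned long util_cfs = 300; /* stands in for cpu_util_cfs(rq) */
    unsigned long util_rt = 200;  /* stands in for cpu_util_rt(rq) */
    unsigned long util_dl = 0;    /* stands in for cpu_util_dl(rq) */
    unsigned long util;

    /* Sum the utilizations and clamp to the CPU capacity */
    util = util_dl + util_cfs + util_rt;
    if (util > max)
        util = max;

    printf("requested capacity = %lu / %lu\n", util, max);
    return 0;
}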

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c907fde01eaa..da29b5a33adb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
/* The fields below are only needed when sharing a policy: */
unsigned long util_cfs;
unsigned long util_dl;
+ unsigned long util_rt;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -186,15 +187,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->util_cfs = cpu_util_cfs(rq);
sg_cpu->util_dl = cpu_util_dl(rq);
+ sg_cpu->util_rt = cpu_util_rt(rq);
}

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util;

if (rt_rq_is_runnable(&rq->rt))
return sg_cpu->max;

+ util = sg_cpu->util_dl;
+ util += sg_cpu->util_cfs;
+ util += sg_cpu->util_rt;
+
/*
* Utilization required by DEADLINE must always be granted while, for
* FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
@@ -205,7 +212,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+ return min(sg_cpu->max, util);
}

/**

Subject: [tip:sched/core] cpufreq/schedutil: Use DL utilization tracking

Commit-ID: 8cc90515a4fa419ccfc4703ff127699cdcb96839
Gitweb: https://git.kernel.org/tip/8cc90515a4fa419ccfc4703ff127699cdcb96839
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:08 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

cpufreq/schedutil: Use DL utilization tracking

Now that we have both the DL class bandwidth requirement and the DL class
utilization, we can detect when the CPU is fully used and should therefore
run at the max OPP. Otherwise, we keep using the DL bandwidth requirement
to define the utilization of the CPU.
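
For illustration, a user-space sketch (sample values are arbitrary, not
kernel code) of the decision introduced below: if the measured rq
utilizations saturate the CPU we request the max, otherwise the DL
bandwidth requirement, not its util_avg, is what gets added to the
request:

#include <stdio.h>

int main(void)
{
    unsigned long max = 1024;
    unsigned long util_cfs = 400, util_rt = 100;
    unsigned long util_dl = 600;  /* measured DL utilization */
    unsigned long bw_dl = 150;    /* DL bandwidth requirement */
    unsigned long util = util_cfs + util_rt;

    if (util + util_dl >= max) {
        printf("CPU fully used: request max (%lu)\n", max);
        return 0;
    }

    util += bw_dl;  /* grant the DL bandwidth, not its util_avg */
    printf("requested capacity = %lu / %lu\n", util < max ? util : max, max);
    return 0;
}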

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++++++------
kernel/sched/sched.h | 7 ++++++-
2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index da29b5a33adb..07760bc7f69a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -56,6 +56,7 @@ struct sugov_cpu {
/* The fields below are only needed when sharing a policy: */
unsigned long util_cfs;
unsigned long util_dl;
+ unsigned long bw_dl;
unsigned long util_rt;
unsigned long max;

@@ -187,6 +188,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->util_cfs = cpu_util_cfs(rq);
sg_cpu->util_dl = cpu_util_dl(rq);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util_rt = cpu_util_rt(rq);
}

@@ -198,20 +200,29 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
if (rt_rq_is_runnable(&rq->rt))
return sg_cpu->max;

- util = sg_cpu->util_dl;
- util += sg_cpu->util_cfs;
+ util = sg_cpu->util_cfs;
util += sg_cpu->util_rt;

+ if ((util + sg_cpu->util_dl) >= sg_cpu->max)
+ return sg_cpu->max;
+
/*
- * Utilization required by DEADLINE must always be granted while, for
- * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
- * gracefully reduce the frequency when no tasks show up for longer
+ * As there is still idle time on the CPU, we need to compute the
+ * utilization level of the CPU.
+ *
+ * Bandwidth required by DEADLINE must always be granted while, for
+ * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+ * to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set util_dl as min/guaranteed freq and
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
+
+ /* Add DL bandwidth requirement */
+ util += sg_cpu->bw_dl;
+
return min(sg_cpu->max, util);
}

@@ -367,7 +378,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu, struct sugov_policy *sg_policy)
{
- if (cpu_util_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->util_dl)
+ if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
sg_policy->need_freq_update = true;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab8b5296b5f6..9028f268f867 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2199,11 +2199,16 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif

#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
-static inline unsigned long cpu_util_dl(struct rq *rq)
+static inline unsigned long cpu_bw_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}

+static inline unsigned long cpu_util_dl(struct rq *rq)
+{
+ return READ_ONCE(rq->avg_dl.util_avg);
+}
+
static inline unsigned long cpu_util_cfs(struct rq *rq)
{
unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

Subject: [tip:sched/core] sched/dl: Add dl_rq utilization tracking

Commit-ID: 3727e0e16340cbdf83818f5bf0113505c6876057
Gitweb: https://git.kernel.org/tip/3727e0e16340cbdf83818f5bf0113505c6876057
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:07 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:20 +0200

sched/dl: Add dl_rq utilization tracking

Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
tasks and the CFS utilization might no longer describe the real
utilization level.

The current DL bandwidth reflects the requirement to meet deadlines when
tasks are enqueued, but not the current utilization of the DL sched class.
We now track the DL class utilization to estimate the system utilization.
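
As a side note, a small user-space sketch (assuming the usual PELT
parameters, 1024us periods and y^32 = 0.5; this is not the kernel
implementation) of how such a per-class utilization signal decays once
the class stops running:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double y = pow(0.5, 1.0 / 32.0); /* y^32 == 0.5 */
    double util = 512.0;             /* some initial DL utilization */
    int ms;

    /* The signal loses half its value every ~32ms of idle time */
    for (ms = 0; ms <= 128; ms += 32)
        printf("after %3d ms idle: util ~ %6.1f\n", ms, util * pow(y, ms));

    return 0;
}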

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/deadline.c | 6 ++++++
kernel/sched/fair.c | 11 ++++++++---
kernel/sched/pelt.c | 23 +++++++++++++++++++++++
kernel/sched/pelt.h | 6 ++++++
kernel/sched/sched.h | 1 +
5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fbfc3f1d368a..f4de26982d80 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,7 @@
* Fabio Checconi <[email protected]>
*/
#include "sched.h"
+#include "pelt.h"

struct dl_bandwidth def_dl_bandwidth;

@@ -1761,6 +1762,9 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

deadline_queue_push_tasks(rq);

+ if (rq->curr->sched_class != &dl_sched_class)
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+
return p;
}

@@ -1768,6 +1772,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
update_curr_dl(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
@@ -1784,6 +1789,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
{
update_curr_dl(rq);

+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
/*
* Even when we have runtime, update_curr_dl() might have resulted in us
* not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b453213cd18..f096275c7df2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,11 +7290,14 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

-static inline bool rt_rq_has_blocked(struct rq *rq)
+static inline bool others_rqs_have_blocked(struct rq *rq)
{
if (READ_ONCE(rq->avg_rt.util_avg))
return true;

+ if (READ_ONCE(rq->avg_dl.util_avg))
+ return true;
+
return false;
}

@@ -7358,8 +7361,9 @@ static void update_blocked_averages(int cpu)
done = false;
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
/* Don't need periodic decay once load/util_avg are null */
- if (rt_rq_has_blocked(rq))
+ if (others_rqs_have_blocked(rq))
done = false;

#ifdef CONFIG_NO_HZ_COMMON
@@ -7427,9 +7431,10 @@ static inline void update_blocked_averages(int cpu)
update_rq_clock(rq);
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq) && !rt_rq_has_blocked(rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a00b1ba3dd5b..8b78b6320cda 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -334,3 +334,26 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

return 0;
}
+
+/*
+ * dl_rq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+ running,
+ running,
+ running)) {
+
+ ___update_load_avg(&rq->avg_dl, 1, 1);
+ return 1;
+ }
+
+ return 0;
+}
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index b2983b741d57..0e4f912461ad 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -4,6 +4,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

/*
* When a task is dequeued, its estimated utilization should not be updated if
@@ -45,6 +46,11 @@ update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
return 0;
}

+static inline int
+update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 405dd9ba6b39..ab8b5296b5f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -856,6 +856,7 @@ struct rq {
u64 rt_avg;
u64 age_stamp;
struct sched_avg avg_rt;
+ struct sched_avg avg_dl;
u64 idle_stamp;
u64 avg_idle;


Subject: [tip:sched/core] sched/irq: Add IRQ utilization tracking

Commit-ID: 91c27493e78df6849baaa21a9d66e26de8b875c0
Gitweb: https://git.kernel.org/tip/91c27493e78df6849baaa21a9d66e26de8b875c0
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:09 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

sched/irq: Add IRQ utilization tracking

Interrupt and steal time are the only remaining activities tracked by
rt_avg. As for the sched classes, we can use PELT to track their average
utilization of the CPU. But unlike the sched classes, we don't track when
entering/leaving interrupt; instead, we take into account the time spent
in interrupt context when we update the rq's clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for the interrupt time during the update.

It is also important to note that because:

rq_clock == rq_clock_task + interrupt time

and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives a more accurate level of utilization of the CPU.

The CPU utilization is:

avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time, avg_irq is small and negligible, so the approximation
CPU utilization = /Sum avg_rq was good enough.
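
For illustration, a user-space sketch of this formula in integer math
(the sample values are arbitrary):

#include <stdio.h>

int main(void)
{
    unsigned long max = 1024;        /* max capacity */
    unsigned long avg_irq = 100;     /* IRQ utilization */
    unsigned long sum_avg_rq = 600;  /* \Sum avg_rq (cfs + rt + dl) */
    unsigned long util;

    /* avg_irq + (1 - avg_irq / max) * \Sum avg_rq, in integer math */
    util = avg_irq + (max - avg_irq) * sum_avg_rq / max;

    printf("CPU utilization ~ %lu / %lu\n", util, max);
    return 0;
}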

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 4 +++-
kernel/sched/fair.c | 13 ++++++++++---
kernel/sched/pelt.c | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 16 ++++++++++++++++
kernel/sched/sched.h | 3 +++
5 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..38107a95baca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -17,6 +17,8 @@
#include "../workqueue_internal.h"
#include "../smpboot.h"

+#include "pelt.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>

@@ -185,7 +187,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
- sched_rt_avg_update(rq, irq_delta + steal);
+ update_irq_load_avg(rq, irq_delta + steal);
#endif
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f096275c7df2..c2782b29c79f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7290,7 +7290,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
return false;
}

-static inline bool others_rqs_have_blocked(struct rq *rq)
+static inline bool others_have_blocked(struct rq *rq)
{
if (READ_ONCE(rq->avg_rt.util_avg))
return true;
@@ -7298,6 +7298,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
if (READ_ONCE(rq->avg_dl.util_avg))
return true;

+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ if (READ_ONCE(rq->avg_irq.util_avg))
+ return true;
+#endif
+
return false;
}

@@ -7362,8 +7367,9 @@ static void update_blocked_averages(int cpu)
}
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_irq_load_avg(rq, 0);
/* Don't need periodic decay once load/util_avg are null */
- if (others_rqs_have_blocked(rq))
+ if (others_have_blocked(rq))
done = false;

#ifdef CONFIG_NO_HZ_COMMON
@@ -7432,9 +7438,10 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_irq_load_avg(rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
+ if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 8b78b6320cda..ead6d8b4a8b8 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)

return 0;
}
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+/*
+ * irq:
+ *
+ * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
+ * util_sum = cpu_scale * load_sum
+ * runnable_load_sum = load_sum
+ *
+ */
+
+int update_irq_load_avg(struct rq *rq, u64 running)
+{
+ int ret = 0;
+ /*
+ * We know the time that has been used by interrupt since the last
+ * update but we don't know when. Let's be pessimistic and assume that
+ * the interrupt has happened just before the update. This is not so
+ * far from reality because the interrupt will most probably wake up a
+ * task and trigger an update of the rq clock, during which the metric
+ * is updated.
+ * We start to decay with normal context time and then we add the
+ * interrupt context time.
+ * We can safely remove running from rq->clock because
+ * rq->clock += delta with delta >= running
+ */
+ ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+ 0,
+ 0,
+ 0);
+ ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+ 1,
+ 1,
+ 1);
+
+ if (ret)
+ ___update_load_avg(&rq->avg_irq, 1, 1);
+
+ return ret;
+}
+#endif
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 0e4f912461ad..d2894db28955 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);

+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+int update_irq_load_avg(struct rq *rq, u64 running);
+#else
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+ return 0;
+}
+#endif
+
/*
* When a task is dequeued, its estimated utilization should not be updated if
* its util_avg has not been updated at least once.
@@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
{
return 0;
}
+
+static inline int
+update_irq_load_avg(struct rq *rq, u64 running)
+{
+ return 0;
+}
#endif


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9028f268f867..b26d0c9948dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -857,6 +857,9 @@ struct rq {
u64 age_stamp;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ struct sched_avg avg_irq;
+#endif
u64 idle_stamp;
u64 avg_idle;


Subject: [tip:sched/core] cpufreq/schedutil: Take time spent in interrupts into account

Commit-ID: 9033ea11889f88f243445495f72441e22256d5e9
Gitweb: https://git.kernel.org/tip/9033ea11889f88f243445495f72441e22256d5e9
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:10 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

cpufreq/schedutil: Take time spent in interrupts into account

The time spent executing IRQ handlers can be significant but it is not
reflected in the utilization of the CPU when deciding to choose an OPP.
Now that we have access to this metric, schedutil can take it into account
when selecting the OPP for a CPU.

The rqs' utilization doesn't see the time spent in interrupt context and
is reported over the normal context time window. We need to compensate for
this when adding the interrupt utilization.

The CPU utilization is:

IRQ util_avg + (1 - IRQ util_avg / max capacity) * /Sum rq util_avg
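
As an illustration of the aggregation order used in the hunk below
(user-space sketch only, with made-up sample values): sum the rq
utilization, check for saturation, weight it to the normal context
window, then add the IRQ utilization and the DL bandwidth:

#include <stdio.h>

int main(void)
{
    unsigned long max = 1024;
    unsigned long util_cfs = 300, util_rt = 100, util_dl = 50;
    unsigned long util_irq = 128, bw_dl = 100;
    unsigned long util = util_cfs + util_rt;

    if (util_irq >= max || util + util_dl >= max) {
        printf("request max (%lu)\n", max);
        return 0;
    }

    util = util * (max - util_irq) / max; /* weight to normal context */
    util += util_irq;                     /* add interrupt utilization */
    util += bw_dl;                        /* add DL bandwidth requirement */

    printf("requested capacity = %lu / %lu\n", util < max ? util : max, max);
    return 0;
}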

A test with iperf on hikey (octo arm64) gives the following speedup:

iperf -c server_address -r -t 5

w/o patch w/ patch
Tx 276 Mbits/sec 304 Mbits/sec +10%
Rx 299 Mbits/sec 328 Mbits/sec +9%

8 iterations
stdev is lower than 1%

Only WFI idle state is enabled (shallowest idle state).

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 25 +++++++++++++++++++++----
kernel/sched/sched.h | 13 +++++++++++++
2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 07760bc7f69a..7016bde9d194 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -58,6 +58,7 @@ struct sugov_cpu {
unsigned long util_dl;
unsigned long bw_dl;
unsigned long util_rt;
+ unsigned long util_irq;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -190,21 +191,30 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->util_dl = cpu_util_dl(rq);
sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util_rt = cpu_util_rt(rq);
+ sg_cpu->util_irq = cpu_util_irq(rq);
}

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util;
+ unsigned long util, max = sg_cpu->max;

if (rt_rq_is_runnable(&rq->rt))
return sg_cpu->max;

+ if (unlikely(sg_cpu->util_irq >= max))
+ return max;
+
+ /* Sum rq utilization */
util = sg_cpu->util_cfs;
util += sg_cpu->util_rt;

- if ((util + sg_cpu->util_dl) >= sg_cpu->max)
- return sg_cpu->max;
+ /*
+ * Interrupt time is not seen by RQS utilization so we can compare
+ * them with the CPU capacity
+ */
+ if ((util + sg_cpu->util_dl) >= max)
+ return max;

/*
* As there is still idle time on the CPU, we need to compute the
@@ -220,10 +230,17 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* ready for such an interface. So, we only do the latter for now.
*/

+ /* Weight RQS utilization to normal context window */
+ util *= (max - sg_cpu->util_irq);
+ util /= max;
+
+ /* Add interrupt utilization */
+ util += sg_cpu->util_irq;
+
/* Add DL bandwidth requirement */
util += sg_cpu->bw_dl;

- return min(sg_cpu->max, util);
+ return min(max, util);
}

/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b26d0c9948dd..b2833e2b4b6a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2228,4 +2228,17 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
{
return rq->avg_rt.util_avg;
}
+
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+ return rq->avg_irq.util_avg;
+}
+#else
+static inline unsigned long cpu_util_irq(struct rq *rq)
+{
+ return 0;
+}
+
+#endif
#endif

Subject: [tip:sched/core] sched/cpufreq: Remove sugov_aggregate_util()

Commit-ID: dfa444dc2ff62edbaf1ff95ed22dd2ce8a5715da
Gitweb: https://git.kernel.org/tip/dfa444dc2ff62edbaf1ff95ed22dd2ce8a5715da
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:11 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 15 Jul 2018 23:51:21 +0200

sched/cpufreq: Remove sugov_aggregate_util()

There is no reason why sugov_get_util() and sugov_aggregate_util()
were in fact separate functions.

Signed-off-by: Vincent Guittot <[email protected]>
[ Rebased after adding irq tracking and fixed some compilation errors. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 44 ++++++++++++++--------------------------
kernel/sched/sched.h | 2 +-
2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7016bde9d194..c9622b3f183d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -53,12 +53,7 @@ struct sugov_cpu {
unsigned int iowait_boost_max;
u64 last_update;

- /* The fields below are only needed when sharing a policy: */
- unsigned long util_cfs;
- unsigned long util_dl;
unsigned long bw_dl;
- unsigned long util_rt;
- unsigned long util_irq;
unsigned long max;

/* The field below is for single-CPU policies only: */
@@ -182,38 +177,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

-static void sugov_get_util(struct sugov_cpu *sg_cpu)
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util, irq, max;

- sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
- sg_cpu->util_cfs = cpu_util_cfs(rq);
- sg_cpu->util_dl = cpu_util_dl(rq);
- sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util_rt = cpu_util_rt(rq);
- sg_cpu->util_irq = cpu_util_irq(rq);
-}
-
-static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
-{
- struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util, max = sg_cpu->max;
+ sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ sg_cpu->bw_dl = cpu_bw_dl(rq);

if (rt_rq_is_runnable(&rq->rt))
- return sg_cpu->max;
+ return max;
+
+ irq = cpu_util_irq(rq);

- if (unlikely(sg_cpu->util_irq >= max))
+ if (unlikely(irq >= max))
return max;

/* Sum rq utilization */
- util = sg_cpu->util_cfs;
- util += sg_cpu->util_rt;
+ util = cpu_util_cfs(rq);
+ util += cpu_util_rt(rq);

/*
* Interrupt time is not seen by RQS utilization so we can compare
* them with the CPU capacity
*/
- if ((util + sg_cpu->util_dl) >= max)
+ if ((util + cpu_util_dl(rq)) >= max)
return max;

/*
@@ -231,11 +219,11 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
*/

/* Weight RQS utilization to normal context window */
- util *= (max - sg_cpu->util_irq);
+ util *= (max - irq);
util /= max;

/* Add interrupt utilization */
- util += sg_cpu->util_irq;
+ util += irq;

/* Add DL bandwidth requirement */
util += sg_cpu->bw_dl;
@@ -418,9 +406,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,

busy = sugov_cpu_is_busy(sg_cpu);

- sugov_get_util(sg_cpu);
+ util = sugov_get_util(sg_cpu);
max = sg_cpu->max;
- util = sugov_aggregate_util(sg_cpu);
sugov_iowait_apply(sg_cpu, time, &util, &max);
next_f = get_next_freq(sg_policy, util, max);
/*
@@ -459,9 +446,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long j_util, j_max;

- sugov_get_util(j_sg_cpu);
+ j_util = sugov_get_util(j_sg_cpu);
j_max = j_sg_cpu->max;
- j_util = sugov_aggregate_util(j_sg_cpu);
sugov_iowait_apply(j_sg_cpu, time, &j_util, &j_max);

if (j_util * max > j_max * util) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2833e2b4b6a..061d51fb5b44 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2226,7 +2226,7 @@ static inline unsigned long cpu_util_cfs(struct rq *rq)

static inline unsigned long cpu_util_rt(struct rq *rq)
{
- return rq->avg_rt.util_avg;
+ return READ_ONCE(rq->avg_rt.util_avg);
}

#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)

Subject: [tip:sched/core] sched/core: Remove the rt_avg code

Commit-ID: bbb62c0b024a1c721232667fa1d625cf6b3a555b
Gitweb: https://git.kernel.org/tip/bbb62c0b024a1c721232667fa1d625cf6b3a555b
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:13 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 16 Jul 2018 00:16:29 +0200

sched/core: Remove the rt_avg code

rt_avg is not used anywhere anymore, so we can remove all related code.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 26 --------------------------
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 17 -----------------
3 files changed, 45 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 38107a95baca..a691b07390ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -651,23 +651,6 @@ bool sched_can_stop_tick(struct rq *rq)
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
-
-void sched_avg_update(struct rq *rq)
-{
- s64 period = sched_avg_period();
-
- while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
- /*
- * Inline assembly required to prevent the compiler
- * optimising this loop into a divmod call.
- * See __iter_div_u64_rem() for another example of this.
- */
- asm("" : "+rm" (rq->age_stamp));
- rq->age_stamp += period;
- rq->rt_avg /= 2;
- }
-}
-
#endif /* CONFIG_SMP */

#if defined(CONFIG_RT_GROUP_SCHED) || (defined(CONFIG_FAIR_GROUP_SCHED) && \
@@ -5716,13 +5699,6 @@ void set_rq_offline(struct rq *rq)
}
}

-static void set_cpu_rq_start_time(unsigned int cpu)
-{
- struct rq *rq = cpu_rq(cpu);
-
- rq->age_stamp = sched_clock_cpu(cpu);
-}
-
/*
* used to mark begin/end of suspend/resume:
*/
@@ -5840,7 +5816,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)

int sched_cpu_starting(unsigned int cpu)
{
- set_cpu_rq_start_time(cpu);
sched_rq_cpu_starting(cpu);
sched_tick_start(cpu);
return 0;
@@ -6108,7 +6083,6 @@ void __init sched_init(void)

#ifdef CONFIG_SMP
idle_thread_set_boot_cpu();
- set_cpu_rq_start_time(smp_processor_id());
#endif
init_sched_fair_class();

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d265fa9756a2..d5f7d521e448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5323,8 +5323,6 @@ static void cpu_load_update(struct rq *this_rq, unsigned long this_load,

this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
}
-
- sched_avg_update(this_rq);
}

/* Used instead of source_load when we know the type == 0 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 061d51fb5b44..14aac2d2de80 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -853,8 +853,6 @@ struct rq {

struct list_head cfs_tasks;

- u64 rt_avg;
- u64 age_stamp;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
@@ -1719,11 +1717,6 @@ extern const_debug unsigned int sysctl_sched_time_avg;
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;

-static inline u64 sched_avg_period(void)
-{
- return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
-}
-
#ifdef CONFIG_SCHED_HRTICK

/*
@@ -1760,8 +1753,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
#endif

#ifdef CONFIG_SMP
-extern void sched_avg_update(struct rq *rq);
-
#ifndef arch_scale_cpu_capacity
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
@@ -1772,12 +1763,6 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
return SCHED_CAPACITY_SCALE;
}
#endif
-
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
-{
- rq->rt_avg += rt_delta * arch_scale_freq_capacity(cpu_of(rq));
- sched_avg_update(rq);
-}
#else
#ifndef arch_scale_cpu_capacity
static __always_inline
@@ -1786,8 +1771,6 @@ unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
return SCHED_CAPACITY_SCALE;
}
#endif
-static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-static inline void sched_avg_update(struct rq *rq) { }
#endif

struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)

Subject: [tip:sched/core] sched/core: Use PELT for scale_rt_capacity()

Commit-ID: 523e979d31648112bad07f427c183525c0258c75
Gitweb: https://git.kernel.org/tip/523e979d31648112bad07f427c183525c0258c75
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:12 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 16 Jul 2018 00:16:25 +0200

sched/core: Use PELT for scale_rt_capacity()

The utilization of the CPU by RT, DL and IRQs is now tracked with PELT,
so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for the CFS class.

scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for CFS instead of a scaling factor, because RT, DL and
IRQ now provide absolute utilization values.

The same formula as schedutil is used:

IRQ util_avg + (1 - IRQ util_avg / max capacity) * /Sum rq util_avg

but the implementation is different because it doesn't return the same
value and doesn't benefit from the same optimization.
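
For illustration, a user-space sketch of the new computation (sample
values are arbitrary; the kernel version below reads the rq's PELT
signals directly):

#include <stdio.h>

int main(void)
{
    unsigned long max = 1024; /* stands in for arch_scale_cpu_capacity() */
    unsigned long irq = 64;   /* stands in for rq->avg_irq.util_avg */
    unsigned long used, free;

    used = 200;               /* stands in for rq->avg_rt.util_avg */
    used += 100;              /* stands in for rq->avg_dl.util_avg */

    if (irq >= max || used >= max) {
        printf("capacity left for CFS: 1\n");
        return 0;
    }

    /* Remove RT/DL utilization, then the fraction of time eaten by IRQs */
    free = max - used;
    free = free * (max - irq) / max;
    printf("capacity left for CFS: %lu / %lu\n", free, max);
    return 0;
}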

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/deadline.c | 2 --
kernel/sched/fair.c | 44 ++++++++++++++++++++++----------------------
kernel/sched/pelt.c | 2 +-
kernel/sched/rt.c | 2 --
4 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f4de26982d80..68b8a9f1c9ca 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1180,8 +1180,6 @@ static void update_curr_dl(struct rq *rq)
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
-
if (dl_entity_is_special(dl_se))
return;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2782b29c79f..d265fa9756a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7551,39 +7551,39 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
static unsigned long scale_rt_capacity(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- u64 total, used, age_stamp, avg;
- s64 delta;
-
- /*
- * Since we're reading these variables without serialization make sure
- * we read them once before doing sanity checks on them.
- */
- age_stamp = READ_ONCE(rq->age_stamp);
- avg = READ_ONCE(rq->rt_avg);
- delta = __rq_clock_broken(rq) - age_stamp;
+ unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+ unsigned long used, free;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ unsigned long irq;
+#endif

- if (unlikely(delta < 0))
- delta = 0;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ irq = READ_ONCE(rq->avg_irq.util_avg);

- total = sched_avg_period() + delta;
+ if (unlikely(irq >= max))
+ return 1;
+#endif

- used = div_u64(avg, total);
+ used = READ_ONCE(rq->avg_rt.util_avg);
+ used += READ_ONCE(rq->avg_dl.util_avg);

- if (likely(used < SCHED_CAPACITY_SCALE))
- return SCHED_CAPACITY_SCALE - used;
+ if (unlikely(used >= max))
+ return 1;

- return 1;
+ free = max - used;
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+ free *= (max - irq);
+ free /= max;
+#endif
+ return free;
}

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
+ unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;

- cpu_rq(cpu)->cpu_capacity_orig = capacity;
-
- capacity *= scale_rt_capacity(cpu);
- capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);

if (!capacity)
capacity = 1;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index ead6d8b4a8b8..35475c0c5419 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -237,7 +237,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
*/
sa->load_avg = div_u64(load * sa->load_sum, divider);
sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
- sa->util_avg = sa->util_sum / divider;
+ WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0dc8ad1915e6..2df72abfa24a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -973,8 +973,6 @@ static void update_curr_rt(struct rq *rq)
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);

- sched_rt_avg_update(rq, delta_exec);
-
if (!rt_bandwidth_enabled())
return;


Subject: [tip:sched/core] sched/sysctl: Remove unused sched_time_avg_ms sysctl

Commit-ID: 5fd778915ad29184a5ff8eb82d1118f6916b79e4
Gitweb: https://git.kernel.org/tip/5fd778915ad29184a5ff8eb82d1118f6916b79e4
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 28 Jun 2018 17:45:14 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 16 Jul 2018 00:16:29 +0200

sched/sysctl: Remove unused sched_time_avg_ms sysctl

The /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere anymore,
so remove it.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Luis R. Rodriguez <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 8 --------
kernel/sched/sched.h | 1 -
kernel/sysctl.c | 8 --------
4 files changed, 18 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a1512ec55..913488d828cb 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -40,7 +40,6 @@ extern unsigned int sysctl_numa_balancing_scan_size;
#ifdef CONFIG_SCHED_DEBUG
extern __read_mostly unsigned int sysctl_sched_migration_cost;
extern __read_mostly unsigned int sysctl_sched_nr_migrate;
-extern __read_mostly unsigned int sysctl_sched_time_avg;

int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a691b07390ab..ba6bb805693a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -46,14 +46,6 @@ const_debug unsigned int sysctl_sched_features =
*/
const_debug unsigned int sysctl_sched_nr_migrate = 32;

-/*
- * period over which we average the RT time consumption, measured
- * in ms.
- *
- * default: 1s
- */
-const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
-
/*
* period over which we measure -rt task CPU usage in us.
* default: 1s
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14aac2d2de80..ebb4b3c3ece7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1713,7 +1713,6 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);

extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);

-extern const_debug unsigned int sysctl_sched_time_avg;
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2d9837c0aff4..f22f76b7a138 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -368,14 +368,6 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
- {
- .procname = "sched_time_avg_ms",
- .data = &sysctl_sched_time_avg,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &one,
- },
#ifdef CONFIG_SCHEDSTATS
{
.procname = "sched_schedstats",

Subject: [tip:sched/core] sched/cpufreq: Clarify sugov_get_util()

Commit-ID: 45f5519ec55e75af3565dd737586d3b041834f71
Gitweb: https://git.kernel.org/tip/45f5519ec55e75af3565dd737586d3b041834f71
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 5 Jul 2018 14:36:17 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 16 Jul 2018 00:16:29 +0200

sched/cpufreq: Clarify sugov_get_util()

Add a few comments to (hopefully) clarify some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 75 +++++++++++++++++++++++++++++-----------
1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c9622b3f183d..97dcd4472a0e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -177,6 +177,26 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}

+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ * cpu_util_{cfs,rt,dl,irq}()
+ * cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued
+ * in the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured metric but a value computed
+ * based on the task model parameters and gives the minimal utilization
+ * required to meet deadlines.
+ */
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
@@ -188,47 +208,60 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
if (rt_rq_is_runnable(&rq->rt))
return max;

+ /*
+ * Early check to see if IRQ/steal time saturates the CPU; this can be
+ * because of inaccuracies in how we track these -- see
+ * update_irq_load_avg().
+ */
irq = cpu_util_irq(rq);
-
if (unlikely(irq >= max))
return max;

- /* Sum rq utilization */
+ /*
+ * Because the time spent on RT/DL tasks is visible as 'lost' time to
+ * CFS tasks and we use the same metric to track the effective
+ * utilization (PELT windows are synchronized) we can directly add them
+ * to obtain the CPU's actual utilization.
+ */
util = cpu_util_cfs(rq);
util += cpu_util_rt(rq);

/*
- * Interrupt time is not seen by RQS utilization so we can compare
- * them with the CPU capacity
+ * We do not make cpu_util_dl() a permanent part of this sum because we
+ * want to use cpu_bw_dl() later on, but we need to check if the
+ * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
+ * f_max when there is no idle time.
+ *
+ * NOTE: numerical errors or stop class might cause us to not quite hit
+ * saturation when we should -- something for later.
*/
if ((util + cpu_util_dl(rq)) >= max)
return max;

/*
- * As there is still idle time on the CPU, we need to compute the
- * utilization level of the CPU.
+ * There is still idle time; further improve the number by using the
+ * irq metric. Because IRQ/steal time is hidden from the task clock we
+ * need to scale the task numbers:
*
+ *              max - irq
+ *   U' = irq + --------- * U
+ *                 max
+ */
+ util *= (max - irq);
+ util /= max;
+ util += irq;
+
+ /*
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
- * Ideally we would like to set util_dl as min/guaranteed freq and
- * util_cfs + util_dl as requested freq. However, cpufreq is not yet
- * ready for such an interface. So, we only do the latter for now.
+ * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+ * bw_dl as requested freq. However, cpufreq is not yet ready for such
+ * an interface. So, we only do the latter for now.
*/
-
- /* Weight RQS utilization to normal context window */
- util *= (max - irq);
- util /= max;
-
- /* Add interrupt utilization */
- util += irq;
-
- /* Add DL bandwidth requirement */
- util += sg_cpu->bw_dl;
-
- return min(max, util);
+ return min(max, util + sg_cpu->bw_dl);
}

/**
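
To make the arithmetic in the comment block above concrete, here is a small
self-contained userspace sketch (not kernel code; the early RT check and the
frequency resolution step are left out, and all numbers are invented for
illustration) that mirrors the scaling steps performed by sugov_get_util():

#include <stdio.h>

/*
 * Toy model of the schedutil effective utilization, in capacity units
 * where max = 1024. Parameter names mirror the quantities used above;
 * this is a sketch, not the kernel implementation.
 */
static unsigned long effective_util(unsigned long util_cfs,
				    unsigned long util_rt,
				    unsigned long util_dl,
				    unsigned long irq,
				    unsigned long bw_dl,
				    unsigned long max)
{
	unsigned long util;

	if (irq >= max)			/* IRQ/steal time saturates the CPU */
		return max;

	util = util_cfs + util_rt;	/* same PELT clock, directly additive */

	if (util + util_dl >= max)	/* no idle time left: ask for f_max */
		return max;

	/* U' = irq + (max - irq) / max * U : rescale to the full CPU clock */
	util = util * (max - irq) / max + irq;

	/* grant the DEADLINE bandwidth on top, clamped to the capacity */
	util += bw_dl;
	return util < max ? util : max;
}

int main(void)
{
	/* e.g. 40% cfs, 10% rt, no dl running, 5% irq, 10% dl bandwidth */
	printf("u = %lu / 1024\n", effective_util(410, 102, 0, 51, 102, 1024));
	return 0;
}
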

2018-07-16 11:26:03

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()

Hi Ingo,

On Mon, 16 Jul 2018 at 00:15, Ingo Molnar <[email protected]> wrote:
>
>
> * Vincent Guittot <[email protected]> wrote:
>

> > @@ -7550,39 +7550,36 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
> > static unsigned long scale_rt_capacity(int cpu)
> > {
> > struct rq *rq = cpu_rq(cpu);
> > - u64 total, used, age_stamp, avg;
> > - s64 delta;
> > -
> > - /*
> > - * Since we're reading these variables without serialization make sure
> > - * we read them once before doing sanity checks on them.
> > - */
> > - age_stamp = READ_ONCE(rq->age_stamp);
> > - avg = READ_ONCE(rq->rt_avg);
> > - delta = __rq_clock_broken(rq) - age_stamp;
> > + unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> > + unsigned long used, irq, free;
> >
> > - if (unlikely(delta < 0))
> > - delta = 0;
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > + irq = READ_ONCE(rq->avg_irq.util_avg);
> >
> > - total = sched_avg_period() + delta;
> > + if (unlikely(irq >= max))
> > + return 1;
> > +#endif
>
> Note that 'irq' is unused outside that macro block, resulting in a new warning on
> defconfig builds:
>
> CC kernel/sched/fair.o
> kernel/sched/fair.c: In function ‘scale_rt_capacity’:
> kernel/sched/fair.c:7553:22: warning: unused variable ‘irq’ [-Wunused-variable]
> unsigned long used, irq, free;
> ^~~
>
> I have applied the delta fix below for simplicity, but what we really want is a
> cleanup of that function to eliminate the #ifdefs. One solution would be to factor
> out the 'irq' utilization value into a helper inline, and double check that if the
> configs are off the compiler does the right thing and eliminates this identity
> transformation for the irq==0 case:
>
> free *= (max - irq);
> free /= max;
>
> If the compiler refuses to optimize this away (due to the zero and overflow
> cases), try to find something more clever?

Thanks for the fix.
I'm off for now and will look at your proposal above once I'm back.

Regards,
Vincent

>
> Thanks,
>
> Ingo
>
> kernel/sched/fair.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e3221db0511a..d5f7d521e448 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7550,7 +7550,10 @@ static unsigned long scale_rt_capacity(int cpu)
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> - unsigned long used, irq, free;
> + unsigned long used, free;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> + unsigned long irq;
> +#endif
>
> #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> irq = READ_ONCE(rq->avg_irq.util_avg);
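
To picture the cleanup suggested above, one possible shape of such a helper is
sketched below as a self-contained userspace mock-up. The struct and config
macro are stand-ins, and while the names cpu_util_irq()/scale_irq_capacity()
are reused from elsewhere in this thread, this is only an illustration, not an
actual kernel patch:

#include <stdio.h>

#define CONFIG_IRQ_TIME_ACCOUNTING 1	/* set to 0 to mimic the !irq-accounting build */

struct rq_stub {
	unsigned long avg_irq_util;	/* stand-in for rq->avg_irq.util_avg */
};

#if CONFIG_IRQ_TIME_ACCOUNTING
static inline unsigned long cpu_util_irq(struct rq_stub *rq)
{
	return rq->avg_irq_util;
}

static inline unsigned long
scale_irq_capacity(unsigned long util, unsigned long irq, unsigned long max)
{
	util *= (max - irq);
	util /= max;
	return util;
}
#else
static inline unsigned long cpu_util_irq(struct rq_stub *rq)
{
	return 0;
}

static inline unsigned long
scale_irq_capacity(unsigned long util, unsigned long irq, unsigned long max)
{
	return util;	/* identity transform; trivially optimized away */
}
#endif

/* the scale_rt_capacity()-like caller then needs no #ifdef at all */
static unsigned long remaining_capacity(struct rq_stub *rq, unsigned long used,
					unsigned long max)
{
	unsigned long irq = cpu_util_irq(rq);

	if (irq >= max)
		return 1;

	return scale_irq_capacity(max - used, irq, max);
}

int main(void)
{
	struct rq_stub rq = { .avg_irq_util = 51 };

	printf("capacity left: %lu\n", remaining_capacity(&rq, 300, 1024));
	return 0;
}

Whether the compiler fully eliminates the identity transform in the disabled
case is exactly the open question raised above.
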

2018-07-16 11:41:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 09/11] sched: use pelt for scale_rt_capacity()


* Vincent Guittot <[email protected]> wrote:

> > If the compiler refuses to optimize this away (due to the zero and overflow
> > cases), try to find something more clever?
>
> Thanks for the fix.
> I'm off for now and will look at your proposal above once I'm back.

Sounds good, there's no rush, we've still got time until ~rc7.

Thanks,

Ingo

2018-07-26 03:10:31

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 06/11] sched/irq: add irq utilization tracking

Hi Vincent,
On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
<[email protected]> wrote:
>
> interrupt and steal time are the only remaining activities tracked by
> rt_avg. Like for sched classes, we can use PELT to track their average
> utilization of the CPU. But unlike sched class, we don't track when
> entering/leaving interrupt; Instead, we take into account the time spent
> under interrupt context when we update rqs' clock (rq_clock_task).
> This also means that we have to decay the normal context time and account
> for interrupt time during the update.
>
> It's also important to note that, because
> rq_clock == rq_clock_task + interrupt time
> and rq_clock_task is used by a sched class to compute its utilization, the
> util_avg of a sched class only reflects the utilization of the time spent
> in normal context and not of the whole time of the CPU. The utilization of
> interrupt gives a more accurate level of utilization of the CPU.
> The CPU utilization is :
> avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
>
> Most of the time, avg_irq is small and negligible, so using the
> approximation CPU utilization = /Sum avg_rq was enough.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/core.c | 4 +++-
> kernel/sched/fair.c | 13 ++++++++++---
> kernel/sched/pelt.c | 40 ++++++++++++++++++++++++++++++++++++++++
> kernel/sched/pelt.h | 16 ++++++++++++++++
> kernel/sched/sched.h | 3 +++
> 5 files changed, 72 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78d8fac..e5263a4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -18,6 +18,8 @@
> #include "../workqueue_internal.h"
> #include "../smpboot.h"
>
> +#include "pelt.h"
> +
> #define CREATE_TRACE_POINTS
> #include <trace/events/sched.h>
>
> @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>
> #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> - sched_rt_avg_update(rq, irq_delta + steal);
> + update_irq_load_avg(rq, irq_delta + steal);

I think we should not add steal time into irq load tracking. Steal
time is always 0 on a native kernel, which doesn't matter, but what will
happen when a guest disables IRQ_TIME_ACCOUNTING and enables
PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
addition, we haven't exposed power management for performance, which
means that e.g. the schedutil governor cannot cooperate with the passive
mode intel_pstate driver to tune the OPP. Decaying the old steal time
avg and adding the new one just wastes cpu cycles.

Regards,
Wanpeng Li

> #endif
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffce4b2..d2758e3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7289,7 +7289,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> return false;
> }
>
> -static inline bool others_rqs_have_blocked(struct rq *rq)
> +static inline bool others_have_blocked(struct rq *rq)
> {
> if (READ_ONCE(rq->avg_rt.util_avg))
> return true;
> @@ -7297,6 +7297,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
> if (READ_ONCE(rq->avg_dl.util_avg))
> return true;
>
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> + if (READ_ONCE(rq->avg_irq.util_avg))
> + return true;
> +#endif
> +
> return false;
> }
>
> @@ -7361,8 +7366,9 @@ static void update_blocked_averages(int cpu)
> }
> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> + update_irq_load_avg(rq, 0);
> /* Don't need periodic decay once load/util_avg are null */
> - if (others_rqs_have_blocked(rq))
> + if (others_have_blocked(rq))
> done = false;
>
> #ifdef CONFIG_NO_HZ_COMMON
> @@ -7431,9 +7437,10 @@ static inline void update_blocked_averages(int cpu)
> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
> update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> + update_irq_load_avg(rq, 0);
> #ifdef CONFIG_NO_HZ_COMMON
> rq->last_blocked_load_update_tick = jiffies;
> - if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
> + if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
> rq->has_blocked_load = 0;
> #endif
> rq_unlock_irqrestore(rq, &rf);
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 8b78b63..ead6d8b 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>
> return 0;
> }
> +
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +/*
> + * irq:
> + *
> + * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
> + * util_sum = cpu_scale * load_sum
> + * runnable_load_sum = load_sum
> + *
> + */
> +
> +int update_irq_load_avg(struct rq *rq, u64 running)
> +{
> + int ret = 0;
> + /*
> + * We know the time that has been used by interrupts since the last update
> + * but we don't know when. Let's be pessimistic and assume that the interrupt
> + * has happened just before the update. This is not so far from reality
> + * because an interrupt will most probably wake up a task and trigger an
> + * update of the rq clock, during which the metric is updated.
> + * We start to decay with normal context time and then we add the
> + * interrupt context time.
> + * We can safely remove running from rq->clock because
> + * rq->clock += delta with delta >= running
> + */
> + ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
> + 0,
> + 0,
> + 0);
> + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
> + 1,
> + 1,
> + 1);
> +
> + if (ret)
> + ___update_load_avg(&rq->avg_irq, 1, 1);
> +
> + return ret;
> +}
> +#endif
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 0e4f912..d2894db 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
> int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> +int update_irq_load_avg(struct rq *rq, u64 running);
> +#else
> +static inline int
> +update_irq_load_avg(struct rq *rq, u64 running)
> +{
> + return 0;
> +}
> +#endif
> +
> /*
> * When a task is dequeued, its estimated utilization should not be update if
> * its util_avg has not been updated at least once.
> @@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> {
> return 0;
> }
> +
> +static inline int
> +update_irq_load_avg(struct rq *rq, u64 running)
> +{
> + return 0;
> +}
> #endif
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ef5d6aa..377be2b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -850,6 +850,9 @@ struct rq {
> u64 age_stamp;
> struct sched_avg avg_rt;
> struct sched_avg avg_dl;
> +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> + struct sched_avg avg_irq;
> +#endif
> u64 idle_stamp;
> u64 avg_idle;
>
> --
> 2.7.4
>

2018-07-30 16:44:50

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 06/11] sched/irq: add irq utilization tracking

Hi Wanpeng,

On Thu, 26 Jul 2018 at 05:09, Wanpeng Li <[email protected]> wrote:
>
> Hi Vincent,
> On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
> <[email protected]> wrote:
> >
> > interrupt and steal time are the only remaining activities tracked by
> > rt_avg. Like for sched classes, we can use PELT to track their average
> > utilization of the CPU. But unlike sched class, we don't track when
> > entering/leaving interrupt; Instead, we take into account the time spent
> > under interrupt context when we update rqs' clock (rq_clock_task).
> > This also means that we have to decay the normal context time and account
> > for interrupt time during the update.
> >
> > It's also important to note that, because
> > rq_clock == rq_clock_task + interrupt time
> > and rq_clock_task is used by a sched class to compute its utilization, the
> > util_avg of a sched class only reflects the utilization of the time spent
> > in normal context and not of the whole time of the CPU. The utilization of
> > interrupt gives a more accurate level of utilization of the CPU.
> > The CPU utilization is :
> > avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
> >
> > Most of the time, avg_irq is small and negligible, so using the
> > approximation CPU utilization = /Sum avg_rq was enough.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/core.c | 4 +++-
> > kernel/sched/fair.c | 13 ++++++++++---
> > kernel/sched/pelt.c | 40 ++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/pelt.h | 16 ++++++++++++++++
> > kernel/sched/sched.h | 3 +++
> > 5 files changed, 72 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 78d8fac..e5263a4 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -18,6 +18,8 @@
> > #include "../workqueue_internal.h"
> > #include "../smpboot.h"
> >
> > +#include "pelt.h"
> > +
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/sched.h>
> >
> > @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> >
> > #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > - sched_rt_avg_update(rq, irq_delta + steal);
> > + update_irq_load_avg(rq, irq_delta + steal);
>
> I think we should not add steal time into irq load tracking. Steal
> time is always 0 on a native kernel, which doesn't matter, but what will
> happen when a guest disables IRQ_TIME_ACCOUNTING and enables
> PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> addition, we haven't exposed power management for performance, which
> means that e.g. the schedutil governor cannot cooperate with the passive
> mode intel_pstate driver to tune the OPP. Decaying the old steal time
> avg and adding the new one just wastes cpu cycles.

In fact, I have kept the same behavior as with rt_avg, which was
already adding steal time when computing scale_rt_capacity, which is
used to reflect the remaining capacity for FAIR tasks and is used in
load balance. I'm not sure that it's worth using different variables
for irq and steal.
That being said, I see a possible optimization in schedutil when
PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is disabled.
With this kind of config, scale_irq_capacity can be a nop for
schedutil but still scale the utilization for scale_rt_capacity.

Regards,
Vincent

>
> Regards,
> Wanpeng Li
>
> > #endif
> > }
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ffce4b2..d2758e3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7289,7 +7289,7 @@ static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> > return false;
> > }
> >
> > -static inline bool others_rqs_have_blocked(struct rq *rq)
> > +static inline bool others_have_blocked(struct rq *rq)
> > {
> > if (READ_ONCE(rq->avg_rt.util_avg))
> > return true;
> > @@ -7297,6 +7297,11 @@ static inline bool others_rqs_have_blocked(struct rq *rq)
> > if (READ_ONCE(rq->avg_dl.util_avg))
> > return true;
> >
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > + if (READ_ONCE(rq->avg_irq.util_avg))
> > + return true;
> > +#endif
> > +
> > return false;
> > }
> >
> > @@ -7361,8 +7366,9 @@ static void update_blocked_averages(int cpu)
> > }
> > update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> > update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> > + update_irq_load_avg(rq, 0);
> > /* Don't need periodic decay once load/util_avg are null */
> > - if (others_rqs_have_blocked(rq))
> > + if (others_have_blocked(rq))
> > done = false;
> >
> > #ifdef CONFIG_NO_HZ_COMMON
> > @@ -7431,9 +7437,10 @@ static inline void update_blocked_averages(int cpu)
> > update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
> > update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
> > update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
> > + update_irq_load_avg(rq, 0);
> > #ifdef CONFIG_NO_HZ_COMMON
> > rq->last_blocked_load_update_tick = jiffies;
> > - if (!cfs_rq_has_blocked(cfs_rq) && !others_rqs_have_blocked(rq))
> > + if (!cfs_rq_has_blocked(cfs_rq) && !others_have_blocked(rq))
> > rq->has_blocked_load = 0;
> > #endif
> > rq_unlock_irqrestore(rq, &rf);
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index 8b78b63..ead6d8b 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -357,3 +357,43 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> >
> > return 0;
> > }
> > +
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +/*
> > + * irq:
> > + *
> > + * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
> > + * util_sum = cpu_scale * load_sum
> > + * runnable_load_sum = load_sum
> > + *
> > + */
> > +
> > +int update_irq_load_avg(struct rq *rq, u64 running)
> > +{
> > + int ret = 0;
> > + /*
> > + * We know the time that has been used by interrupts since the last update
> > + * but we don't know when. Let's be pessimistic and assume that the interrupt
> > + * has happened just before the update. This is not so far from reality
> > + * because an interrupt will most probably wake up a task and trigger an
> > + * update of the rq clock, during which the metric is updated.
> > + * We start to decay with normal context time and then we add the
> > + * interrupt context time.
> > + * We can safely remove running from rq->clock because
> > + * rq->clock += delta with delta >= running
> > + */
> > + ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
> > + 0,
> > + 0,
> > + 0);
> > + ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
> > + 1,
> > + 1,
> > + 1);
> > +
> > + if (ret)
> > + ___update_load_avg(&rq->avg_irq, 1, 1);
> > +
> > + return ret;
> > +}
> > +#endif
> > diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> > index 0e4f912..d2894db 100644
> > --- a/kernel/sched/pelt.h
> > +++ b/kernel/sched/pelt.h
> > @@ -6,6 +6,16 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
> > int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
> > int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> >
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > +int update_irq_load_avg(struct rq *rq, u64 running);
> > +#else
> > +static inline int
> > +update_irq_load_avg(struct rq *rq, u64 running)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > /*
> > * When a task is dequeued, its estimated utilization should not be update if
> > * its util_avg has not been updated at least once.
> > @@ -51,6 +61,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
> > {
> > return 0;
> > }
> > +
> > +static inline int
> > +update_irq_load_avg(struct rq *rq, u64 running)
> > +{
> > + return 0;
> > +}
> > #endif
> >
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ef5d6aa..377be2b 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -850,6 +850,9 @@ struct rq {
> > u64 age_stamp;
> > struct sched_avg avg_rt;
> > struct sched_avg avg_dl;
> > +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > + struct sched_avg avg_irq;
> > +#endif
> > u64 idle_stamp;
> > u64 avg_idle;
> >
> > --
> > 2.7.4
> >

2018-07-31 03:34:21

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH 06/11] sched/irq: add irq utilization tracking

On Tue, 31 Jul 2018 at 00:43, Vincent Guittot
<[email protected]> wrote:
>
> Hi Wanpeng,
>
> On Thu, 26 Jul 2018 at 05:09, Wanpeng Li <[email protected]> wrote:
> >
> > Hi Vincent,
> > On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
> > <[email protected]> wrote:
> > >
> > > interrupt and steal time are the only remaining activities tracked by
> > > rt_avg. Like for sched classes, we can use PELT to track their average
> > > utilization of the CPU. But unlike sched class, we don't track when
> > > entering/leaving interrupt; Instead, we take into account the time spent
> > > under interrupt context when we update rqs' clock (rq_clock_task).
> > > This also means that we have to decay the normal context time and account
> > > for interrupt time during the update.
> > >
> > > It's also important to note that, because
> > > rq_clock == rq_clock_task + interrupt time
> > > and rq_clock_task is used by a sched class to compute its utilization, the
> > > util_avg of a sched class only reflects the utilization of the time spent
> > > in normal context and not of the whole time of the CPU. The utilization of
> > > interrupt gives a more accurate level of utilization of the CPU.
> > > The CPU utilization is :
> > > avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
> > >
> > > Most of the time, avg_irq is small and negligible, so using the
> > > approximation CPU utilization = /Sum avg_rq was enough.
> > >
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > > kernel/sched/core.c | 4 +++-
> > > kernel/sched/fair.c | 13 ++++++++++---
> > > kernel/sched/pelt.c | 40 ++++++++++++++++++++++++++++++++++++++++
> > > kernel/sched/pelt.h | 16 ++++++++++++++++
> > > kernel/sched/sched.h | 3 +++
> > > 5 files changed, 72 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 78d8fac..e5263a4 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -18,6 +18,8 @@
> > > #include "../workqueue_internal.h"
> > > #include "../smpboot.h"
> > >
> > > +#include "pelt.h"
> > > +
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/sched.h>
> > >
> > > @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > >
> > > #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > > if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > > - sched_rt_avg_update(rq, irq_delta + steal);
> > > + update_irq_load_avg(rq, irq_delta + steal);
> >
> > I think we should not add steal time into irq load tracking. Steal
> > time is always 0 on a native kernel, which doesn't matter, but what will
> > happen when a guest disables IRQ_TIME_ACCOUNTING and enables
> > PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> > addition, we haven't exposed power management for performance, which
> > means that e.g. the schedutil governor cannot cooperate with the passive
> > mode intel_pstate driver to tune the OPP. Decaying the old steal time
> > avg and adding the new one just wastes cpu cycles.
>
> In fact, I have kept the same behavior as with rt_avg, which was
> already adding steal time when computing scale_rt_capacity, which is
> used to reflect the remaining capacity for FAIR tasks and is used in
> load balance. I'm not sure that it's worth using different variables
> for irq and steal.
> That being said, I see a possible optimization in schedutil when
> PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is disabled.
> With this kind of config, scale_irq_capacity can be a nop for
> schedutil but still scale the utilization for scale_rt_capacity.

Yeah, this is what I had in mind before; you can make a patch for that. :)

Regards,
Wanpeng Li

2018-07-31 08:22:32

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 06/11] sched/irq: add irq utilization tracking

On Tue, 31 Jul 2018 at 05:32, Wanpeng Li <[email protected]> wrote:

> > > >
> > > > #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > > > if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > > > - sched_rt_avg_update(rq, irq_delta + steal);
> > > > + update_irq_load_avg(rq, irq_delta + steal);
> > >
> > > I think we should not add steal time into irq load tracking. Steal
> > > time is always 0 on a native kernel, which doesn't matter, but what will
> > > happen when a guest disables IRQ_TIME_ACCOUNTING and enables
> > > PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> > > addition, we haven't exposed power management for performance, which
> > > means that e.g. the schedutil governor cannot cooperate with the passive
> > > mode intel_pstate driver to tune the OPP. Decaying the old steal time
> > > avg and adding the new one just wastes cpu cycles.
> >
> > In fact, I have kept the same behavior as with rt_avg, which was
> > already adding steal time when computing scale_rt_capacity, which is
> > used to reflect the remaining capacity for FAIR tasks and is used in
> > load balance. I'm not sure that it's worth using different variables
> > for irq and steal.
> > That being said, I see a possible optimization in schedutil when
> > PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is disabled.
> > With this kind of config, scale_irq_capacity can be a nop for
> > schedutil but still scale the utilization for scale_rt_capacity.
>
> Yeah, this is what I had in mind before; you can make a patch for that. :)

OK, I'm going to prepare a patch.

Thanks

>
> Regards,
> Wanpeng Li
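
For readers who want to picture the follow-up patch agreed on above, here is a
minimal userspace sketch of the idea. The helper name schedutil_scale_irq()
and the scaffolding are invented for illustration, and the real patch may look
quite different; the point is only that when PARAVIRT_TIME_ACCOUNTING is the
sole contributor, rq->avg_irq carries steal time only, so OPP selection can
ignore it while the capacity side keeps scaling by it:

#include <stdio.h>

#define CONFIG_IRQ_TIME_ACCOUNTING	0
#define CONFIG_PARAVIRT_TIME_ACCOUNTING	1

/* capacity side: always discount IRQ/steal time from what CFS can use */
static unsigned long
scale_irq_capacity(unsigned long util, unsigned long irq, unsigned long max)
{
	util *= (max - irq);
	util /= max;
	return util;
}

#if CONFIG_PARAVIRT_TIME_ACCOUNTING && !CONFIG_IRQ_TIME_ACCOUNTING
/* schedutil side: pure steal time should not request a higher OPP */
static unsigned long
schedutil_scale_irq(unsigned long util, unsigned long irq, unsigned long max)
{
	return util;
}
#else
/* schedutil side: rescale to the full clock and add the irq utilization */
static unsigned long
schedutil_scale_irq(unsigned long util, unsigned long irq, unsigned long max)
{
	return scale_irq_capacity(util, irq, max) + irq;
}
#endif

int main(void)
{
	/* 50% task utilization, 5% steal time, capacity 1024 */
	printf("OPP selection sees %lu, capacity scaling sees %lu\n",
	       schedutil_scale_irq(512, 51, 1024),
	       scale_irq_capacity(512, 51, 1024));
	return 0;
}
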