From: Luca Abeni <[email protected]>
Hi all,
here is yet another version of the patchset implementing CPU reclaiming
(using the GRUB algorithm[1]) for SCHED_DEADLINE.
Basically, this feature allows SCHED_DEADLINE tasks to consume more
than their reserved runtime, up to a maximum fraction of the CPU time
(so that some spare CPU time is left for the other tasks to execute),
provided that this does not break the guarantees of the other
SCHED_DEADLINE tasks.
The patchset applies on top of tip/master.
The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of active tasks (first 2 patches), and modifying the
runtime accounting rule (see patches 0004 and 0008).
The original GRUB algorithm is modified as described in [2,3] to support
multiple CPUs (the original algorithm considered a single CPU;
this one tracks U_act per runqueue) and to leave an "unreclaimable"
fraction of CPU time to non-SCHED_DEADLINE tasks (see patch 0005: the
original algorithm can consume 100% of the CPU time, starving all the
other tasks).
Patch 0003 uses the "inactive timer" introduced in patch 0002 to fix
dl_overflow() and __setparam_dl().
Patch 0006 allows enabling CPU reclaiming only on selected tasks.
Patches 0007, 0008 and 0009 fix an issue found by Daniel in my last
submission, by basing the GRUB reclaiming algorithm on the inactive
utilization U_inact as shown in [3]. Here, U_inact is computed as
the difference between the "rq utilization" (see patch 0007) and
U_act.
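To summarize how the pieces fit together: after the whole series, the
runtime is decreased as "dq = -max{u / Umax, (1 - U_inact - U_extra)} dt".
A sketch of the resulting code follows (this is only a summary of the
real grub_reclaim(); utilizations are stored multiplied by 2^20):

	/*
	 * Sketch only, not the patch code.
	 * delta: wall-clock time executed by the task;
	 * u:     the task's own utilization (dl_runtime / dl_period);
	 * deadline_bw_inv: 1 / Umax in 8-bit fixed point;
	 * extra_bw: per-runqueue extra bandwidth that -deadline tasks
	 *           are allowed to reclaim.
	 */
	u64 grub_reclaim_sketch(u64 delta, struct rq *rq, u64 u)
	{
		s64 u_inact = rq->dl.this_bw - rq->dl.running_bw;
		s64 u_act = (1 << 20) - u_inact - rq->dl.extra_bw;
		s64 u_act_min = (u * rq->dl.deadline_bw_inv) >> 8;

		if (u_act < u_act_min)
			u_act = u_act_min;

		return (delta * u_act) >> 20;	/* runtime charged: dq = -u_act dt */
	}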
Changes since v4:
the most important change is the addition of patches 0007, 0008
and 0009 to solve the fairness problem highlighted by Daniel:
http://lkml.iu.edu/hypermail/linux/kernel/1701.0/01290.html
I added this fix in the last 3 patches for easier reviewing (up
to patch 0006, the patches are similar to the previous ones),
but I can merge / reorder patches if needed.
I also performed some additional testing, finding and fixing some
bugs in the first 6 patches. Daniel and Claudio also stress-tested
the patchset, finding some other bugs that have been fixed.
Other than this, I tried to address all the comments I received, and to
add comments requested in the previous reviews.
Finally, I updated the patches to apply on top of tip/master.
[1] Lipari, G., & Baruah, S. (2000). Greedy reclamation of unused bandwidth in constant-bandwidth servers. In Proceedings of the 12th Euromicro Conference on Real-Time Systems (Euromicro RTS 2000), pp. 193-200. IEEE.
[2] Abeni, L., Lelli, J., Scordino, C., & Palopoli, L. (2014, October). Greedy CPU reclaiming for SCHED_DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), Dusseldorf, Germany.
[3] Abeni, L., Lipari, G., Parri, A., & Sun, Y. (2016, April). Multicore CPU reclaiming: parallel or sequential? In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC), pp. 1877-1884. ACM.
Luca Abeni (9):
sched/deadline: track the active utilization
sched/deadline: improve the tracking of active utilization
sched/deadline: implement GRUB accounting
sched/deadline: do not reclaim the whole CPU bandwidth
sched/deadline: make GRUB a task's flag
sched/deadline: base GRUB reclaiming on the inactive utilization
sched/deadline: also reclaim bandwidth not used by dl tasks
sched/deadline: fix the update of the total -deadline utilization
sched/deadline: track the "total rq utilization" too
include/linux/sched.h | 17 +++
include/uapi/linux/sched.h | 1 +
kernel/sched/core.c | 65 ++++----
kernel/sched/deadline.c | 366 ++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 62 +++++++-
5 files changed, 454 insertions(+), 57 deletions(-)
--
2.7.4
From: Luca Abeni <[email protected]>
This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so-called "0-lag time".
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Claudio Scordino <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
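Note (not part of the patch, just a worked example): the "0-lag time"
used below is computed as

	zerolag_time = dl_se->deadline -
		       div64_long(dl_se->runtime * dl_se->dl_period,
				  dl_se->dl_runtime);

(and then made relative to rq_clock() before arming the timer). For
example, a task with dl_runtime = 10ms and dl_period = 100ms that blocks
with 4ms of runtime left has its 0-lag time 4ms / (10ms / 100ms) = 40ms
before its current scheduling deadline; the inactive timer fires at that
instant, unless it has already passed.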
include/linux/sched.h | 17 ++++
kernel/sched/core.c | 3 +
kernel/sched/deadline.c | 208 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 +
4 files changed, 215 insertions(+), 15 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d67eee8..952cac8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -445,16 +445,33 @@ struct sched_dl_entity {
*
* @dl_yielded tells if task gave up the CPU before consuming
* all its available runtime during the last job.
+ *
+ * @dl_non_contending tells if task is inactive while still
+ * contributing to the active utilization. In other words, it
+ * indicates if the inactive timer has been armed and its handler
+ * has not been executed yet. This flag is useful to avoid race
+ * conditions between the inactive timer handler and the wakeup
+ * code.
*/
int dl_throttled;
int dl_boosted;
int dl_yielded;
+ int dl_non_contending;
/*
* Bandwidth enforcement timer. Each -deadline task has its
* own bandwidth to be enforced, thus we need one timer per task.
*/
struct hrtimer dl_timer;
+
+ /*
+ * Inactive timer, responsible for decreasing the active utilization
+ * at the "0-lag time". When a -deadline task blocks, it contributes
+ * to GRUB's active utilization until the "0-lag time", hence a
+ * timer is needed to decrease the active utilization at the correct
+ * time.
+ */
+ struct hrtimer inactive_timer;
};
union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d6cad9..bf0b0b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2165,6 +2165,7 @@ void __dl_clear_params(struct task_struct *p)
dl_se->dl_throttled = 0;
dl_se->dl_yielded = 0;
+ dl_se->dl_non_contending = 0;
}
/*
@@ -2196,6 +2197,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
RB_CLEAR_NODE(&p->dl.rb_node);
init_dl_task_timer(&p->dl);
+ init_inactive_task_timer(&p->dl);
__dl_clear_params(p);
INIT_LIST_HEAD(&p->rt.run_list);
@@ -2518,6 +2520,7 @@ static int dl_overflow(struct task_struct *p, int policy,
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
__dl_clear(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
+ dl_change_utilization(p, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
__dl_clear(dl_b, p->dl.dl_bw);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cef9adb..86aed82 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
dl_rq->running_bw = 0;
}
+void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+ if (!task_on_rq_queued(p)) {
+ struct rq *rq = task_rq(p);
+
+ if (p->dl.dl_non_contending) {
+ sub_running_bw(p->dl.dl_bw, &rq->dl);
+ p->dl.dl_non_contending = 0;
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_not_contending is not set, and
+ * will not touch the rq's active utilization,
+ * so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+ put_task_struct(p);
+ }
+ }
+}
+
+static void task_non_contending(struct task_struct *p)
+{
+ struct sched_dl_entity *dl_se = &p->dl;
+ struct hrtimer *timer = &dl_se->inactive_timer;
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+ s64 zerolag_time;
+
+ /*
+ * If this is a non-deadline task that has been boosted,
+ * do nothing
+ */
+ if (dl_se->dl_runtime == 0)
+ return;
+
+ WARN_ON(hrtimer_active(&dl_se->inactive_timer));
+ WARN_ON(dl_se->dl_non_contending);
+
+ zerolag_time = dl_se->deadline -
+ div64_long((dl_se->runtime * dl_se->dl_period),
+ dl_se->dl_runtime);
+
+ /*
+ * Using relative times instead of the absolute "0-lag time"
+ * allows to simplify the code
+ */
+ zerolag_time -= rq_clock(rq);
+
+ /*
+ * If the "0-lag time" already passed, decrease the active
+ * utilization now, instead of starting a timer
+ */
+ if (zerolag_time < 0) {
+ if (dl_task(p))
+ sub_running_bw(dl_se->dl_bw, dl_rq);
+ if (!dl_task(p) || p->state == TASK_DEAD)
+ __dl_clear_params(p);
+
+ return;
+ }
+
+ dl_se->dl_non_contending = 1;
+ get_task_struct(p);
+ hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
+}
+
+static void task_contending(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ /*
+ * If this is a non-deadline task that has been boosted,
+ * do nothing
+ */
+ if (dl_se->dl_runtime == 0)
+ return;
+
+ if (dl_se->dl_non_contending) {
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_not_contending is not set, and
+ * will not touch the rq's active utilization,
+ * so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
+ put_task_struct(dl_task_of(dl_se));
+ dl_se->dl_non_contending = 0;
+ } else {
+ /*
+ * Since "dl_non_contending" is not set, the
+ * task's utilization has already been removed from
+ * active utilization (either when the task blocked,
+ * when the "inactive timer" fired).
+ * So, add it back.
+ */
+ add_running_bw(dl_se->dl_bw, dl_rq);
+ }
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -615,10 +716,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* The task might have changed its scheduling policy to something
* different than SCHED_DEADLINE (through switched_from_dl()).
*/
- if (!dl_task(p)) {
- __dl_clear_params(p);
+ if (!dl_task(p))
goto unlock;
- }
/*
* The task might have been boosted by someone else and might be in the
@@ -837,6 +936,49 @@ static void update_curr_dl(struct rq *rq)
}
}
+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+ struct sched_dl_entity *dl_se = container_of(timer,
+ struct sched_dl_entity,
+ inactive_timer);
+ struct task_struct *p = dl_task_of(dl_se);
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+
+ if (!dl_task(p) || p->state == TASK_DEAD) {
+ if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
+ sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+ dl_se->dl_non_contending = 0;
+ }
+ __dl_clear_params(p);
+
+ goto unlock;
+ }
+ if (dl_se->dl_non_contending == 0)
+ goto unlock;
+
+ sched_clock_tick();
+ update_rq_clock(rq);
+
+ sub_running_bw(dl_se->dl_bw, &rq->dl);
+ dl_se->dl_non_contending = 0;
+unlock:
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+
+ return HRTIMER_NORESTART;
+}
+
+void init_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+ struct hrtimer *timer = &dl_se->inactive_timer;
+
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ timer->function = inactive_task_timer;
+}
+
#ifdef CONFIG_SMP
static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -969,9 +1111,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
* we want a replenishment of its runtime.
*/
if (flags & ENQUEUE_WAKEUP) {
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-
- add_running_bw(dl_se->dl_bw, dl_rq);
+ task_contending(dl_se);
update_dl_entity(dl_se, pi_se);
}
else if (flags & ENQUEUE_REPLENISH)
@@ -1040,7 +1180,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* add_running_bw().
*/
if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
- add_running_bw(p->dl.dl_bw, &rq->dl);
+ if (flags & ENQUEUE_WAKEUP)
+ task_contending(&p->dl);
+
return;
}
@@ -1065,7 +1207,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
sub_running_bw(p->dl.dl_bw, &rq->dl);
/*
- * This check allows to decrease the active utilization in two cases:
+ * This check allows to start the inactive timer (or to immediately
+ * decrease the active utilization, if needed) in two cases:
* when the task blocks and when it is terminating
* (p->state == TASK_DEAD). We can handle the two cases in the same
* way, because from GRUB's point of view the same thing is happening
@@ -1073,7 +1216,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* or "inactive")
*/
if (flags & DEQUEUE_SLEEP)
- sub_running_bw(p->dl.dl_bw, &rq->dl);
+ task_non_contending(p);
}
/*
@@ -1151,6 +1294,28 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
return cpu;
}
+static void migrate_task_rq_dl(struct task_struct *p)
+{
+ if ((p->state == TASK_WAKING) && (p->dl.dl_non_contending)) {
+ struct rq *rq = task_rq(p);
+
+ raw_spin_lock(&rq->lock);
+ sub_running_bw(p->dl.dl_bw, &rq->dl);
+ p->dl.dl_non_contending = 0;
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_not_contending is not set, and
+ * will not touch the rq's active utilization,
+ * so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+ put_task_struct(p);
+
+ raw_spin_unlock(&rq->lock);
+ }
+}
+
static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
{
/*
@@ -1792,13 +1957,23 @@ void __init init_sched_dl_class(void)
static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
/*
- * Start the deadline timer; if we switch back to dl before this we'll
- * continue consuming our current CBS slice. If we stay outside of
- * SCHED_DEADLINE until the deadline passes, the timer will reset the
- * task.
+ * task_non_contending() can start the "inactive timer" (if the 0-lag
+ * time is in the future). If the task switches back to dl before
+ * the "inactive timer" fires, it can continue to consume its current
+ * runtime using its current deadline. If it stays outside of
+ * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+ * will reset the task parameters.
*/
- if (!start_dl_timer(p))
- __dl_clear_params(p);
+ if (task_on_rq_queued(p) && p->dl.dl_runtime)
+ task_non_contending(p);
+
+ /*
+ * We cannot use inactive_task_timer() to invoke sub_running_bw()
+ * at the 0-lag time, because the task could have been migrated
+ * in the meanwhile, while it was SCHED_OTHER.
+ */
+ if (p->dl.dl_non_contending)
+ p->dl.dl_non_contending = 0;
/*
* Since this might be the only -deadline task on the rq,
@@ -1817,6 +1992,8 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
*/
static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+ put_task_struct(p);
/* If p is not queued we will update its parameters at next wakeup. */
if (!task_on_rq_queued(p))
@@ -1891,6 +2068,7 @@ const struct sched_class dl_sched_class = {
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_dl,
+ .migrate_task_rq = migrate_task_rq_dl,
.set_cpus_allowed = set_cpus_allowed_dl,
.rq_online = rq_online_dl,
.rq_offline = rq_offline_dl,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index caaa7d3..57bb79b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -244,6 +244,7 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
}
+void dl_change_utilization(struct task_struct *p, u64 new_bw);
extern void init_dl_bw(struct dl_bw *dl_b);
#ifdef CONFIG_CGROUP_SCHED
@@ -1490,6 +1491,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
extern struct dl_bandwidth def_dl_bandwidth;
extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);
unsigned long to_ratio(u64 period, u64 runtime);
--
2.7.4
From: Luca Abeni <[email protected]>
Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/core.c | 38 ++++++++++++++------------------------
kernel/sched/deadline.c | 23 +++++++++++++----------
2 files changed, 27 insertions(+), 34 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf0b0b9..20c62e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2487,9 +2487,6 @@ static inline int dl_bw_cpus(int i)
* allocated bandwidth to reflect the new situation.
*
* This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
*/
static int dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
@@ -2514,16 +2511,29 @@ static int dl_overflow(struct task_struct *p, int policy,
cpus = dl_bw_cpus(task_cpu(p));
if (dl_policy(policy) && !task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ if (hrtimer_active(&p->dl.inactive_timer))
+ __dl_clear(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ /*
+ * XXX this is slightly incorrect: when the task
+ * utilization decreases, we should delay the total
+ * utilization change until the task's 0-lag point.
+ * But this would require to set the task's "inactive
+ * timer" when the task is not inactive.
+ */
__dl_clear(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
dl_change_utilization(p, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ /*
+ * Do not decrease the total deadline utilization here,
+ * switched_from_dl() will take care to do it at the correct
+ * (0-lag) time.
+ */
err = 0;
}
raw_spin_unlock(&dl_b->lock);
@@ -3964,26 +3974,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
dl_se->flags = attr->sched_flags;
dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
- /*
- * Changing the parameters of a task is 'tricky' and we're not doing
- * the correct thing -- also see task_dead_dl() and switched_from_dl().
- *
- * What we SHOULD do is delay the bandwidth release until the 0-lag
- * point. This would include retaining the task_struct until that time
- * and change dl_overflow() to not immediately decrement the current
- * amount.
- *
- * Instead we retain the current runtime/deadline and let the new
- * parameters take effect after the current reservation period lapses.
- * This is safe (albeit pessimistic) because the 0-lag point is always
- * before the current scheduling deadline.
- *
- * We can still have temporary overloads because we do not delay the
- * change in bandwidth until that time; so admission control is
- * not on the safe side. It does however guarantee tasks will never
- * consume more than promised.
- */
}
/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 86aed82..238713e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -121,8 +121,14 @@ static void task_non_contending(struct task_struct *p)
if (zerolag_time < 0) {
if (dl_task(p))
sub_running_bw(dl_se->dl_bw, dl_rq);
- if (!dl_task(p) || p->state == TASK_DEAD)
+ if (!dl_task(p) || p->state == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ raw_spin_lock(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
__dl_clear_params(p);
+ raw_spin_unlock(&dl_b->lock);
+ }
return;
}
@@ -948,10 +954,16 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
rq = task_rq_lock(p, &rf);
if (!dl_task(p) || p->state == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
dl_se->dl_non_contending = 0;
}
+
+ raw_spin_lock(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
+ raw_spin_unlock(&dl_b->lock);
__dl_clear_params(p);
goto unlock;
@@ -1473,15 +1485,6 @@ static void task_fork_dl(struct task_struct *p)
static void task_dead_dl(struct task_struct *p)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-
- /*
- * Since we are TASK_DEAD we won't slip out of the domain!
- */
- raw_spin_lock_irq(&dl_b->lock);
- /* XXX we should retain the bw until 0-lag */
- dl_b->total_bw -= p->dl.dl_bw;
- raw_spin_unlock_irq(&dl_b->lock);
}
static void set_curr_task_dl(struct rq *rq)
--
2.7.4
From: Luca Abeni <[email protected]>
This commit introduces a per-runqueue "extra utilization" that can be
reclaimed by deadline tasks. In this way, the maximum fraction of CPU
time that can be reclaimed by deadline tasks is fixed (and configurable)
and does not depend on the total deadline utilization.
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
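A worked example of the new bookkeeping (illustrative numbers, not taken
from the patch), showing how __dl_update() spreads the allocated
bandwidth over the runqueues of the root domain:

	/*
	 * Root domain with 4 CPUs, default limit Umax = 0.95:
	 * every runqueue starts with extra_bw ~= 0.95 (times 2^20).
	 *
	 * Admitting a -deadline task with bandwidth 0.5:
	 *   __dl_add(dl_b, tsk_bw, cpus)
	 *     dl_b->total_bw += 0.5;
	 *     __dl_update(dl_b, -(0.5 / 4));  // each rq: extra_bw -= 0.125
	 * so every runqueue ends up with extra_bw ~= 0.825; __dl_clear()
	 * gives the 0.125 back to each runqueue when the task leaves.
	 */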
kernel/sched/core.c | 21 ++++++++++++---------
kernel/sched/deadline.c | 26 ++++++++++++++++----------
kernel/sched/sched.h | 37 +++++++++++++++++++++++++++++++++++--
3 files changed, 63 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 88e108b..69895fb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2456,7 +2456,7 @@ inline struct dl_bw *dl_bw_of(int i)
return &cpu_rq(i)->rd->dl_bw;
}
-static inline int dl_bw_cpus(int i)
+inline int dl_bw_cpus(int i)
{
struct root_domain *rd = cpu_rq(i)->rd;
int cpus = 0;
@@ -2474,7 +2474,7 @@ inline struct dl_bw *dl_bw_of(int i)
return &cpu_rq(i)->dl.dl_bw;
}
-static inline int dl_bw_cpus(int i)
+inline int dl_bw_cpus(int i)
{
return 1;
}
@@ -2512,8 +2512,8 @@ static int dl_overflow(struct task_struct *p, int policy,
if (dl_policy(policy) && !task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
- __dl_clear(dl_b, p->dl.dl_bw);
- __dl_add(dl_b, new_bw);
+ __dl_clear(dl_b, p->dl.dl_bw, cpus);
+ __dl_add(dl_b, new_bw, cpus);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
@@ -2524,8 +2524,8 @@ static int dl_overflow(struct task_struct *p, int policy,
* But this would require to set the task's "inactive
* timer" when the task is not inactive.
*/
- __dl_clear(dl_b, p->dl.dl_bw);
- __dl_add(dl_b, new_bw);
+ __dl_clear(dl_b, p->dl.dl_bw, cpus);
+ __dl_add(dl_b, new_bw, cpus);
dl_change_utilization(p, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
@@ -5460,7 +5460,7 @@ int task_can_attach(struct task_struct *p,
* We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()).
*/
- __dl_add(dl_b, p->dl.dl_bw);
+ __dl_add(dl_b, p->dl.dl_bw, cpus);
}
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
@@ -6717,12 +6717,15 @@ static void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
- if (dl_b->bw == -1)
+ if (dl_b->bw == -1) {
cpu_rq(cpu)->dl.deadline_bw_inv = 1 << 8;
- else
+ cpu_rq(cpu)->dl.extra_bw = 1 << 20;
+ } else {
cpu_rq(cpu)->dl.deadline_bw_inv =
to_ratio(global_rt_runtime(),
global_rt_period()) >> 12;
+ cpu_rq(cpu)->dl.extra_bw = new_bw;
+ }
}
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c393c3d..5547101 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -153,7 +153,7 @@ static void task_non_contending(struct task_struct *p)
if (p->state == TASK_DEAD)
sub_rq_bw(p->dl.dl_bw, &rq->dl);
raw_spin_lock(&dl_b->lock);
- __dl_clear(dl_b, p->dl.dl_bw);
+ __dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
__dl_clear_params(p);
raw_spin_unlock(&dl_b->lock);
}
@@ -243,11 +243,15 @@ void init_dl_rq(struct dl_rq *dl_rq)
#else
init_dl_bw(&dl_rq->dl_bw);
#endif
- if (global_rt_runtime() == RUNTIME_INF)
+ if (global_rt_runtime() == RUNTIME_INF) {
dl_rq->deadline_bw_inv = 1 << 8;
- else
+ dl_rq->extra_bw = 1 << 20;
+ } else {
dl_rq->deadline_bw_inv =
to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
+ dl_rq->extra_bw =
+ to_ratio(global_rt_period(), global_rt_runtime());
+ }
}
#ifdef CONFIG_SMP
@@ -909,12 +913,14 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
*/
u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
{
- u64 u_act;
+ s64 u_act;
+ s64 u_act_min;
- if (rq->dl.this_bw - rq->dl.running_bw > (1 << 20) - u)
- u_act = u;
- else
- u_act = (1 << 20) - rq->dl.this_bw + rq->dl.running_bw;
+ u_act = (1 << 20) - rq->dl.this_bw - rq->dl.extra_bw +
+ rq->dl.running_bw;
+ u_act_min = (u * rq->dl.deadline_bw_inv) >> 8;
+ if (u_act < u_act_min)
+ u_act = u_act_min;
return (delta * u_act) >> 20;
}
@@ -1023,7 +1029,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
}
raw_spin_lock(&dl_b->lock);
- __dl_clear(dl_b, p->dl.dl_bw);
+ __dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
raw_spin_unlock(&dl_b->lock);
__dl_clear_params(p);
@@ -1989,7 +1995,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
* until we complete the update.
*/
raw_spin_lock(&src_dl_b->lock);
- __dl_clear(src_dl_b, p->dl.dl_bw);
+ __dl_clear(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
raw_spin_unlock(&src_dl_b->lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3818b3c..aec71f3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -219,22 +219,27 @@ static inline int dl_bandwidth_enabled(void)
}
extern struct dl_bw *dl_bw_of(int i);
+extern int dl_bw_cpus(int i);
struct dl_bw {
raw_spinlock_t lock;
u64 bw, total_bw;
};
+static inline void __dl_update(struct dl_bw *dl_b, s64 bw);
+
static inline
-void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
{
dl_b->total_bw -= tsk_bw;
+ __dl_update(dl_b, (s32)tsk_bw / cpus);
}
static inline
-void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
{
dl_b->total_bw += tsk_bw;
+ __dl_update(dl_b, -((s32)tsk_bw / cpus));
}
static inline
@@ -576,6 +581,7 @@ struct dl_rq {
* runqueue (inactive utilization = this_bw - running_bw).
*/
u64 this_bw;
+ u64 extra_bw;
/*
* Inverse of the fraction of CPU utilization that can be reclaimed
@@ -1951,6 +1957,33 @@ extern void nohz_balance_exit_idle(unsigned int cpu);
static inline void nohz_balance_exit_idle(unsigned int cpu) { }
#endif
+
+#ifdef CONFIG_SMP
+static inline
+void __dl_update(struct dl_bw *dl_b, s64 bw)
+{
+ struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
+ int i;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+ for_each_cpu_and(i, rd->span, cpu_active_mask) {
+ struct rq *rq = cpu_rq(i);
+
+ rq->dl.extra_bw += bw;
+ }
+}
+#else
+static inline
+void __dl_update(struct dl_bw *dl_b, s64 bw)
+{
+ struct dl_rq *dl = container_of(dl_b, struct dl_rq, dl_bw);
+
+ dl->extra_bw += bw;
+}
+#endif
+
+
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
struct irqtime {
u64 tick_delta;
--
2.7.4
From: Luca Abeni <[email protected]>
The original GRUB algorithm tends to reclaim 100% of the CPU time,
and this allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of CPU time.
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
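Some illustrative numbers (not from the patch) for the new fixed-point
factor: with the default sched_rt_runtime_us / sched_rt_period_us =
950000 / 1000000, the reclaimable fraction is Umax = 0.95 and:

	/*
	 * deadline_bw_inv = to_ratio(global_rt_runtime(),
	 *			      global_rt_period()) >> 12
	 *                 = ((1000000 << 20) / 950000) >> 12 ~= 269
	 * i.e. 1/0.95 in 8-bit fixed point.
	 *
	 * With Uact = 0.25 (running_bw = 0.25 * 2^20), the new rule
	 *   dq = (delta * running_bw * deadline_bw_inv) >> 20 >> 8
	 *      ~= delta * 0.25 / 0.95 ~= 0.263 * delta
	 * charges ~0.263ms of runtime per 1ms executed instead of the
	 * 0.25ms charged by plain "dq = -Uact dt", so reclaiming tasks
	 * can consume at most ~95% of the CPU instead of 100%.
	 */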
kernel/sched/core.c | 6 ++++++
kernel/sched/deadline.c | 7 ++++++-
kernel/sched/sched.h | 6 ++++++
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20c62e7..efa88eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6716,6 +6716,12 @@ static void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
+ if (dl_b->bw == -1)
+ cpu_rq(cpu)->dl.deadline_bw_inv = 1 << 8;
+ else
+ cpu_rq(cpu)->dl.deadline_bw_inv =
+ to_ratio(global_rt_runtime(),
+ global_rt_period()) >> 12;
}
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6035311..e964051 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -212,6 +212,11 @@ void init_dl_rq(struct dl_rq *dl_rq)
#else
init_dl_bw(&dl_rq->dl_bw);
#endif
+ if (global_rt_runtime() == RUNTIME_INF)
+ dl_rq->deadline_bw_inv = 1 << 8;
+ else
+ dl_rq->deadline_bw_inv =
+ to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
}
#ifdef CONFIG_SMP
@@ -871,7 +876,7 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
*/
u64 grub_reclaim(u64 delta, struct rq *rq)
{
- return (delta * rq->dl.running_bw) >> 20;
+ return (delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57bb79b..141549b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,12 @@ struct dl_rq {
* task blocks
*/
u64 running_bw;
+
+ /*
+ * Inverse of the fraction of CPU utilization that can be reclaimed
+ * by the GRUB algorithm.
+ */
+ u64 deadline_bw_inv;
};
#ifdef CONFIG_SMP
--
2.7.4
From: Luca Abeni <[email protected]>
Instead of decreasing the runtime as "dq = -Uact dt" (possibly
divided by the maximum utilization available for deadline tasks),
decrease it as "dq = -(1 - Uinact) dt", where Uinact is the "inactive
utilization".
In this way, the maximum fraction of CPU time that can be reclaimed
is given by the total utilization of deadline tasks.
This approach solves some fairness issues that have been noticed with
"traditional" global GRUB reclaiming.
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
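A quick numeric illustration of the new rule (example numbers, not from
the patch):

	/*
	 * Runqueue with this_bw = 0.7 and running_bw = 0.4
	 *   -> Uinact = this_bw - running_bw = 0.3
	 *
	 * Reclaiming task with u = dl_bw = 0.4:
	 *   Uinact (0.3) <= 1 - u (0.6), so
	 *   u_act = 1 - Uinact = 0.7   ->   dq = -0.7 dt
	 *
	 * If Uinact were instead 0.65 (> 1 - u), the task would be
	 * charged at its own utilization: u_act = u = 0.4.
	 * (All the values above are stored multiplied by 2^20.)
	 */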
kernel/sched/deadline.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d70a7b9..c393c3d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -900,14 +900,23 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
/*
* This function implements the GRUB accounting rule:
* according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as "dq = -Uact dt", where
- * Uact is the (per-runqueue) active utilization.
- * Since rq->dl.running_bw contains Uact * 2^20, the result
- * has to be shifted right by 20.
+ * not decreased as "dq = -dt", but as "dq = -(1 - Uinact) dt", where
+ * Uinact is the (per-runqueue) inactive utilization, computed as the
+ * difference between the "total runqueue utilization" and the runqueue
+ * active utilization.
+ * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
+ * multiplied by 2^20, the result has to be shifted right by 20.
*/
-u64 grub_reclaim(u64 delta, struct rq *rq)
+u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
{
- return (delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
+ u64 u_act;
+
+ if (rq->dl.this_bw - rq->dl.running_bw > (1 << 20) - u)
+ u_act = u;
+ else
+ u_act = (1 << 20) - rq->dl.this_bw + rq->dl.running_bw;
+
+ return (delta * u_act) >> 20;
}
/*
@@ -953,7 +962,7 @@ static void update_curr_dl(struct rq *rq)
sched_rt_avg_update(rq, delta_exec);
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
- delta_exec = grub_reclaim(delta_exec, rq);
+ delta_exec = grub_reclaim(delta_exec, rq, curr->dl.dl_bw);
dl_se->runtime -= delta_exec;
throttle:
--
2.7.4
From: Luca Abeni <[email protected]>
The total rq utilization is defined as the sum of the utilizations of
the tasks that are "assigned" to a runqueue, independently of their
state (TASK_RUNNING or blocked).
Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Claudio Scordino <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
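As a rough summary (a sketch based on this series, not text taken from
the patches) of how the two per-runqueue counters evolve:

	/*
	 * running_bw ("active utilization"): increased when a task
	 * becomes active (wakes up), decreased at the task's 0-lag time
	 * after it blocks (or immediately, if that time already passed).
	 *
	 * this_bw ("total rq utilization"): increased when a task is
	 * assigned to this runqueue (setscheduler, migration to it),
	 * decreased when the task moves away (migration, policy change,
	 * termination).
	 *
	 * Invariant: running_bw <= this_bw (see the SCHED_WARN_ON()s),
	 * and Uinact = this_bw - running_bw; e.g. this_bw = 0.7 and
	 * running_bw = 0.4 give Uinact = 0.3 (times 2^20 in the code).
	 */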
kernel/sched/deadline.c | 87 +++++++++++++++++++++++++++++++++++++------------
kernel/sched/sched.h | 11 +++++++
2 files changed, 78 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1951379..d70a7b9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -51,6 +51,7 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
dl_rq->running_bw += dl_bw;
SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+ SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
}
static inline
@@ -65,6 +66,29 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
dl_rq->running_bw = 0;
}
+static inline
+void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+ u64 old = dl_rq->this_bw;
+
+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ dl_rq->this_bw += dl_bw;
+ SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
+}
+
+static inline
+void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+ u64 old = dl_rq->this_bw;
+
+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ dl_rq->this_bw -= dl_bw;
+ SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
+ if (dl_rq->this_bw > old)
+ dl_rq->this_bw = 0;
+ SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
+}
+
void dl_change_utilization(struct task_struct *p, u64 new_bw)
{
if (!task_on_rq_queued(p)) {
@@ -83,6 +107,8 @@ void dl_change_utilization(struct task_struct *p, u64 new_bw)
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
put_task_struct(p);
}
+ sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ add_rq_bw(new_bw, &rq->dl);
}
}
@@ -124,6 +150,8 @@ static void task_non_contending(struct task_struct *p)
if (!dl_task(p) || p->state == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ if (p->state == TASK_DEAD)
+ sub_rq_bw(p->dl.dl_bw, &rq->dl);
raw_spin_lock(&dl_b->lock);
__dl_clear(dl_b, p->dl.dl_bw);
__dl_clear_params(p);
@@ -138,7 +166,7 @@ static void task_non_contending(struct task_struct *p)
hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
}
-static void task_contending(struct sched_dl_entity *dl_se)
+static void task_contending(struct sched_dl_entity *dl_se, int flags)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -149,6 +177,9 @@ static void task_contending(struct sched_dl_entity *dl_se)
if (dl_se->dl_runtime == 0)
return;
+ if (flags & ENQUEUE_MIGRATED)
+ add_rq_bw(dl_se->dl_bw, dl_rq);
+
if (dl_se->dl_non_contending) {
/*
* If the timer handler is currently running and the
@@ -978,6 +1009,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+ sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
dl_se->dl_non_contending = 0;
}
@@ -1143,7 +1175,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
* we want a replenishment of its runtime.
*/
if (flags & ENQUEUE_WAKEUP) {
- task_contending(dl_se);
+ task_contending(dl_se, flags);
update_dl_entity(dl_se, pi_se);
}
else if (flags & ENQUEUE_REPLENISH)
@@ -1196,8 +1228,10 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
dl_check_constrained_dl(&p->dl);
- if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
+ if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
+ add_rq_bw(p->dl.dl_bw, &rq->dl);
add_running_bw(p->dl.dl_bw, &rq->dl);
+ }
/*
* If p is throttled, we do not enqueue it. In fact, if it exhausted
@@ -1213,7 +1247,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
*/
if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
if (flags & ENQUEUE_WAKEUP)
- task_contending(&p->dl);
+ task_contending(&p->dl, flags);
return;
}
@@ -1235,8 +1269,10 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
update_curr_dl(rq);
__dequeue_task_dl(rq, p, flags);
- if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
+ if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
sub_running_bw(p->dl.dl_bw, &rq->dl);
+ sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ }
/*
* This check allows to start the inactive timer (or to immediately
@@ -1328,22 +1364,24 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
static void migrate_task_rq_dl(struct task_struct *p)
{
- if ((p->state == TASK_WAKING) && (p->dl.dl_non_contending)) {
+ if (p->state == TASK_WAKING) {
struct rq *rq = task_rq(p);
raw_spin_lock(&rq->lock);
- sub_running_bw(p->dl.dl_bw, &rq->dl);
- p->dl.dl_non_contending = 0;
- /*
- * If the timer handler is currently running and the
- * timer cannot be cancelled, inactive_task_timer()
- * will see that dl_not_contending is not set, and
- * will not touch the rq's active utilization,
- * so we are still safe.
- */
- if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
- put_task_struct(p);
-
+ if (p->dl.dl_non_contending) {
+ sub_running_bw(p->dl.dl_bw, &rq->dl);
+ p->dl.dl_non_contending = 0;
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_not_contending is not set, and
+ * will not touch the rq's active utilization,
+ * so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+ put_task_struct(p);
+ }
+ sub_rq_bw(p->dl.dl_bw, &rq->dl);
raw_spin_unlock(&rq->lock);
}
}
@@ -1791,7 +1829,9 @@ static int push_dl_task(struct rq *rq)
deactivate_task(rq, next_task, 0);
sub_running_bw(next_task->dl.dl_bw, &rq->dl);
+ sub_rq_bw(next_task->dl.dl_bw, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
+ add_rq_bw(next_task->dl.dl_bw, &later_rq->dl);
add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;
@@ -1881,7 +1921,9 @@ static void pull_dl_task(struct rq *this_rq)
deactivate_task(src_rq, p, 0);
sub_running_bw(p->dl.dl_bw, &src_rq->dl);
+ sub_rq_bw(p->dl.dl_bw, &src_rq->dl);
set_task_cpu(p, this_cpu);
+ add_rq_bw(p->dl.dl_bw, &this_rq->dl);
add_running_bw(p->dl.dl_bw, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
@@ -1990,6 +2032,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
if (task_on_rq_queued(p) && p->dl.dl_runtime)
task_non_contending(p);
+ if (!task_on_rq_queued(p))
+ sub_rq_bw(p->dl.dl_bw, &rq->dl);
+
/*
* We cannot use inactive_task_timer() to invoke sub_running_bw()
* at the 0-lag time, because the task could have been migrated
@@ -2019,9 +2064,11 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
put_task_struct(p);
/* If p is not queued we will update its parameters at next wakeup. */
- if (!task_on_rq_queued(p))
- return;
+ if (!task_on_rq_queued(p)) {
+ add_rq_bw(p->dl.dl_bw, &rq->dl);
+ return;
+ }
/*
* If p is boosted we already updated its params in
* rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 141549b..3818b3c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -567,6 +567,17 @@ struct dl_rq {
u64 running_bw;
/*
+ * Utilization of the tasks "assigned" to this runqueue (including
+ * the tasks that are in runqueue and the tasks that executed on this
+ * CPU and blocked). Increased when a task moves to this runqueue, and
+ * decreased when the task moves away (migrates, changes scheduling
+ * policy, or terminates).
+ * This is needed to compute the "inactive utilization" for the
+ * runqueue (inactive utilization = this_bw - running_bw).
+ */
+ u64 this_bw;
+
+ /*
* Inverse of the fraction of CPU utilization that can be reclaimed
* by the GRUB algorithm.
*/
--
2.7.4
From: Luca Abeni <[email protected]>
According to the GRUB (Greedy Reclamation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
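For reference (a sketch, not part of the patch): the utilizations summed
into running_bw are the per-task dl_bw values, i.e. runtime/period in
2^20 fixed point as computed by to_ratio(); the numbers below show what
the new rule charges.

	/* Rough sketch of the existing helper: runtime/period in Q20. */
	unsigned long to_ratio(u64 period, u64 runtime)
	{
		if (runtime == RUNTIME_INF)
			return 1ULL << 20;
		if (period == 0)
			return 0;

		return div64_u64(runtime << 20, period);
	}

	/*
	 * One active task with runtime 10ms / period 100ms:
	 *   running_bw = to_ratio(100ms, 10ms) = 0.1 * 2^20 = 104857
	 *   grub_reclaim(1000000, rq) = (1000000 * 104857) >> 20 ~= 99999
	 * so only ~0.1ms of budget is consumed per 1ms executed, and the
	 * task can run for up to runtime / Uact = 100ms per period
	 * (a later patch then caps the reclaimable fraction below 100%).
	 */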
kernel/sched/deadline.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 238713e..6035311 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -862,6 +862,19 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
/*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^20, the result
+ * has to be shifted right by 20.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+ return (delta * rq->dl.running_bw) >> 20;
+}
+
+/*
* Update the current task's runtime statistics (provided it is still
* a -deadline task and has not been removed from the dl_rq).
*/
@@ -903,6 +916,7 @@ static void update_curr_dl(struct rq *rq)
sched_rt_avg_update(rq, delta_exec);
+ delta_exec = grub_reclaim(delta_exec, rq);
dl_se->runtime -= delta_exec;
throttle:
--
2.7.4
From: Luca Abeni <[email protected]>
Make GRUB reclaiming opt-in on a per-task basis: add the
SCHED_FLAG_RECLAIM flag for sched_setattr(), and apply grub_reclaim()
in update_curr_dl() only to SCHED_DEADLINE tasks that set this flag.
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
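For anyone who wants to test this: a minimal userspace sketch (not part
of the patch) of a task opting in to reclaiming. It uses the raw
sched_setattr() syscall as in Documentation/scheduler/sched-deadline.txt,
assumes the libc/kernel headers define SYS_sched_setattr, and defines
SCHED_FLAG_RECLAIM locally since installed uapi headers will not have it
yet.

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/types.h>

	#define SCHED_DEADLINE		6
	#define SCHED_FLAG_RECLAIM	0x02	/* added by this patch */

	struct sched_attr {
		__u32 size;
		__u32 sched_policy;
		__u64 sched_flags;
		__s32 sched_nice;
		__u32 sched_priority;
		__u64 sched_runtime;	/* ns */
		__u64 sched_deadline;	/* ns */
		__u64 sched_period;	/* ns */
	};

	int main(void)
	{
		struct sched_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_policy = SCHED_DEADLINE;
		attr.sched_flags = SCHED_FLAG_RECLAIM;	/* opt in to GRUB reclaiming */
		attr.sched_runtime  =  10 * 1000 * 1000;	/*  10ms */
		attr.sched_deadline = 100 * 1000 * 1000;	/* 100ms */
		attr.sched_period   = 100 * 1000 * 1000;	/* 100ms */

		if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
			perror("sched_setattr");
			return 1;
		}

		for (;;)	/* CPU hog: may now run past 10ms per 100ms */
			;
	}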
include/uapi/linux/sched.h | 1 +
kernel/sched/core.c | 3 ++-
kernel/sched/deadline.c | 3 ++-
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..e2a6c7b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,5 +47,6 @@
* For the sched_{set,get}attr() calls
*/
#define SCHED_FLAG_RESET_ON_FORK 0x01
+#define SCHED_FLAG_RECLAIM 0x02
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efa88eb..88e108b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4140,7 +4140,8 @@ static int __sched_setscheduler(struct task_struct *p,
return -EINVAL;
}
- if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+ if (attr->sched_flags &
+ ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
return -EINVAL;
/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e964051..1951379 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -921,7 +921,8 @@ static void update_curr_dl(struct rq *rq)
sched_rt_avg_update(rq, delta_exec);
- delta_exec = grub_reclaim(delta_exec, rq);
+ if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+ delta_exec = grub_reclaim(delta_exec, rq);
dl_se->runtime -= delta_exec;
throttle:
--
2.7.4
From: Luca Abeni <[email protected]>
Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.
When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the 0-lag
time); a more theoretically sound solution will follow.
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Luca Abeni <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/deadline.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 6 +++++
2 files changed, 64 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a2ce590..cef9adb 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,28 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
return !RB_EMPTY_NODE(&dl_se->rb_node);
}
+static inline
+void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+ u64 old = dl_rq->running_bw;
+
+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ dl_rq->running_bw += dl_bw;
+ SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+}
+
+static inline
+void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+ u64 old = dl_rq->running_bw;
+
+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ dl_rq->running_bw -= dl_bw;
+ SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
+ if (dl_rq->running_bw > old)
+ dl_rq->running_bw = 0;
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -946,8 +968,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
* parameters of the task might need updating. Otherwise,
* we want a replenishment of its runtime.
*/
- if (flags & ENQUEUE_WAKEUP)
+ if (flags & ENQUEUE_WAKEUP) {
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ add_running_bw(dl_se->dl_bw, dl_rq);
update_dl_entity(dl_se, pi_se);
+ }
else if (flags & ENQUEUE_REPLENISH)
replenish_dl_entity(dl_se, pi_se);
@@ -998,14 +1024,25 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
dl_check_constrained_dl(&p->dl);
+ if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
+ add_running_bw(p->dl.dl_bw, &rq->dl);
+
/*
- * If p is throttled, we do nothing. In fact, if it exhausted
+ * If p is throttled, we do not enqueue it. In fact, if it exhausted
* its budget it needs a replenishment and, since it now is on
* its rq, the bandwidth timer callback (which clearly has not
* run yet) will take care of this.
+ * However, the active utilization does not depend on the fact
+ * that the task is on the runqueue or not (but depends on the
+ * task's state - in GRUB parlance, "inactive" vs "active contending").
+ * In other words, even if a task is throttled its utilization must
+ * be counted in the active utilization; hence, we need to call
+ * add_running_bw().
*/
- if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+ if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+ add_running_bw(p->dl.dl_bw, &rq->dl);
return;
+ }
enqueue_dl_entity(&p->dl, pi_se, flags);
@@ -1023,6 +1060,20 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
__dequeue_task_dl(rq, p, flags);
+
+ if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
+ sub_running_bw(p->dl.dl_bw, &rq->dl);
+
+ /*
+ * This check allows to decrease the active utilization in two cases:
+ * when the task blocks and when it is terminating
+ * (p->state == TASK_DEAD). We can handle the two cases in the same
+ * way, because from GRUB's point of view the same thing is happening
+ * (the task moves from "active contending" to "active non contending"
+ * or "inactive")
+ */
+ if (flags & DEQUEUE_SLEEP)
+ sub_running_bw(p->dl.dl_bw, &rq->dl);
}
/*
@@ -1551,7 +1602,9 @@ static int push_dl_task(struct rq *rq)
}
deactivate_task(rq, next_task, 0);
+ sub_running_bw(next_task->dl.dl_bw, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
+ add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;
@@ -1639,7 +1692,9 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;
deactivate_task(src_rq, p, 0);
+ sub_running_bw(p->dl.dl_bw, &src_rq->dl);
set_task_cpu(p, this_cpu);
+ add_running_bw(p->dl.dl_bw, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57caf36..caaa7d3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,12 @@ struct dl_rq {
#else
struct dl_bw dl_bw;
#endif
+ /*
+ * "Active utilization" for this runqueue: increased when a
+ * task wakes up (becomes TASK_RUNNING) and decreased when a
+ * task blocks
+ */
+ u64 running_bw;
};
#ifdef CONFIG_SMP
--
2.7.4
On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..952cac8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -445,16 +445,33 @@ struct sched_dl_entity {
> *
> * @dl_yielded tells if task gave up the CPU before consuming
> * all its available runtime during the last job.
> + *
> + * @dl_non_contending tells if task is inactive while still
> + * contributing to the active utilization. In other words, it
> + * indicates if the inactive timer has been armed and its handler
> + * has not been executed yet. This flag is useful to avoid race
> + * conditions between the inactive timer handler and the wakeup
> + * code.
> */
> int dl_throttled;
> int dl_boosted;
> int dl_yielded;
> + int dl_non_contending;
We should maybe look into making those bits :1, not something for this
patch though;
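For illustration, that would presumably end up looking something like:

	unsigned int		dl_throttled      : 1;
	unsigned int		dl_boosted        : 1;
	unsigned int		dl_yielded        : 1;
	unsigned int		dl_non_contending : 1;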
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index cef9adb..86aed82 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> dl_rq->running_bw = 0;
> }
>
> +void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +{
> + if (!task_on_rq_queued(p)) {
> + struct rq *rq = task_rq(p);
> +
> + if (p->dl.dl_non_contending) {
> + sub_running_bw(p->dl.dl_bw, &rq->dl);
> + p->dl.dl_non_contending = 0;
> + /*
> + * If the timer handler is currently running and the
> + * timer cannot be cancelled, inactive_task_timer()
> + * will see that dl_not_contending is not set, and
> + * will not touch the rq's active utilization,
> + * so we are still safe.
> + */
> + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> + put_task_struct(p);
> + }
> + }
> +}
If we rearrange that slightly we can avoid the double indent:
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -67,23 +67,23 @@ void sub_running_bw(u64 dl_bw, struct dl
void dl_change_utilization(struct task_struct *p, u64 new_bw)
{
- if (!task_on_rq_queued(p)) {
- struct rq *rq = task_rq(p);
+ if (task_on_rq_queued(p))
+ return;
- if (p->dl.dl_non_contending) {
- sub_running_bw(p->dl.dl_bw, &rq->dl);
- p->dl.dl_non_contending = 0;
- /*
- * If the timer handler is currently running and the
- * timer cannot be cancelled, inactive_task_timer()
- * will see that dl_not_contending is not set, and
- * will not touch the rq's active utilization,
- * so we are still safe.
- */
- if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
- put_task_struct(p);
- }
- }
+ if (!p->dl.dl_non_contending)
+ return;
+
+ sub_running_bw(p->dl.dl_bw, &task_rq(p)->dl);
+ p->dl.dl_non_contending = 0;
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_not_contending is not set, and
+ * will not touch the rq's active utilization,
+ * so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+ put_task_struct(p);
}
static void task_non_contending(struct task_struct *p)
> +
> +static void task_non_contending(struct task_struct *p)
> +{
> +}
> +
> +static void task_contending(struct sched_dl_entity *dl_se)
> +{
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +
> + /*
> + * If this is a non-deadline task that has been boosted,
> + * do nothing
> + */
> + if (dl_se->dl_runtime == 0)
> + return;
> +
> + if (dl_se->dl_non_contending) {
> + /*
> + * If the timer handler is currently running and the
> + * timer cannot be cancelled, inactive_task_timer()
> + * will see that dl_not_contending is not set, and
> + * will not touch the rq's active utilization,
> + * so we are still safe.
> + */
> + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
> + put_task_struct(dl_task_of(dl_se));
> + dl_se->dl_non_contending = 0;
This had me worried for a little while as being a use-after-free, but
this put_task_struct() cannot be the last in this case. Might be worth a
comment.
> + } else {
> + /*
> + * Since "dl_non_contending" is not set, the
> + * task's utilization has already been removed from
> + * active utilization (either when the task blocked,
> + * when the "inactive timer" fired).
> + * So, add it back.
> + */
> + add_running_bw(dl_se->dl_bw, dl_rq);
> + }
> +}
In general I feel it would be nice to have a state diagram included
somewhere near these two functions. It would be nice to not have to dig
out the PDF every time.
> +static void migrate_task_rq_dl(struct task_struct *p)
> +{
> + if ((p->state == TASK_WAKING) && (p->dl.dl_non_contending)) {
> + struct rq *rq = task_rq(p);
> +
> + raw_spin_lock(&rq->lock);
> + sub_running_bw(p->dl.dl_bw, &rq->dl);
> + p->dl.dl_non_contending = 0;
> + /*
> + * If the timer handler is currently running and the
> + * timer cannot be cancelled, inactive_task_timer()
> + * will see that dl_not_contending is not set, and
> + * will not touch the rq's active utilization,
> + * so we are still safe.
> + */
> + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> + put_task_struct(p);
> +
> + raw_spin_unlock(&rq->lock);
> + }
> +}
This can similarly be reduced in indent, albeit this is only a single
indent level, so not as onerous as the other one.
What had me puzzled for a while here is taking the lock; because some
callers of set_task_cpu() must in fact hold this lock already. Worth a
comment I feel.
Once I figured out the exact locking; I realized you do this on
cross-cpu wakeups. We spend a fair amount of effort not to take both rq
locks. But I suppose you have to here.
On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> @@ -2518,6 +2520,7 @@ static int dl_overflow(struct task_struct *p, int policy,
> !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> __dl_clear(dl_b, p->dl.dl_bw);
> __dl_add(dl_b, new_bw);
> + dl_change_utilization(p, new_bw);
> err = 0;
Every time I see that I want to do this..
---
kernel/sched/core.c | 4 ++--
kernel/sched/deadline.c | 2 +-
kernel/sched/sched.h | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b31fc05a0f1..b845ee4b3e55 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2512,11 +2512,11 @@ static int dl_overflow(struct task_struct *p, int policy,
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ __dl_sub(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ __dl_sub(dl_b, p->dl.dl_bw);
err = 0;
}
raw_spin_unlock(&dl_b->lock);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a2ce59015642..229660088138 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1695,7 +1695,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
* until we complete the update.
*/
raw_spin_lock(&src_dl_b->lock);
- __dl_clear(src_dl_b, p->dl.dl_bw);
+ __dl_sub(src_dl_b, p->dl.dl_bw);
raw_spin_unlock(&src_dl_b->lock);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5cbf92214ad8..1a521324ecee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -226,7 +226,7 @@ struct dl_bw {
};
static inline
-void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
{
dl_b->total_bw -= tsk_bw;
}
On Fri, Mar 24, 2017 at 04:52:58AM +0100, luca abeni wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 20c62e7..efa88eb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6716,6 +6716,12 @@ static void sched_dl_do_global(void)
> raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>
> rcu_read_unlock_sched();
> + if (dl_b->bw == -1)
> + cpu_rq(cpu)->dl.deadline_bw_inv = 1 << 8;
> + else
> + cpu_rq(cpu)->dl.deadline_bw_inv =
> + to_ratio(global_rt_runtime(),
> + global_rt_period()) >> 12;
Coding style requires braces here (on both legs of the condition)..
Also, I find deadline_bw_inv an awkward name; would something like
bw_ratio or so be more accurate?
> + if (global_rt_runtime() == RUNTIME_INF)
> + dl_rq->deadline_bw_inv = 1 << 8;
> + else
> + dl_rq->deadline_bw_inv =
> + to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
That's almost the same code; do we want a helper function?
>
> u64 grub_reclaim(u64 delta, struct rq *rq)
> {
> + return (delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
> }
At which point we might want a note about how this doesn't overflow I
suppose.
Also:
delta *= rq->dl.running_bw;
delta *= rq->dl.bw_ratio;
delta >>= 20 + 8;
return delta;
Might be more readable ?
Alternatively:
delta = (delta * rq->dl.running_bw) >> 8;
delta = (delta * rq->dl.bw_ratio) >> 20;
return delta;
But I doubt we care about those extra 8 bit of space; delta should not
be over 36 bits (~64 seconds) anyway I suppose.
Hi Peter,
On Fri, 24 Mar 2017 14:20:41 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
>
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d67eee8..952cac8 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -445,16 +445,33 @@ struct sched_dl_entity {
> > *
> > * @dl_yielded tells if task gave up the CPU before consuming
> > * all its available runtime during the last job.
> > + *
> > + * @dl_non_contending tells if task is inactive while still
> > + * contributing to the active utilization. In other words, it
> > + * indicates if the inactive timer has been armed and its handler
> > + * has not been executed yet. This flag is useful to avoid race
> > + * conditions between the inactive timer handler and the wakeup
> > + * code.
> > */
> > int dl_throttled;
> > int dl_boosted;
> > int dl_yielded;
> > + int dl_non_contending;
>
> We should maybe look into making those bits :1, not something for this
> patch though;
Yes, grouping all the flags in a single field was my intention too... I
planned to submit a patch to do this after merging the reclaiming
patches... But maybe it is better to do this first :)
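For reference, a minimal sketch of the ":1" grouping being discussed (the
struct name and exact layout here are only an illustration of the idea, not
code from the patchset):

struct sched_dl_flags_sketch {
	unsigned int		dl_throttled		: 1;
	unsigned int		dl_boosted		: 1;
	unsigned int		dl_yielded		: 1;
	unsigned int		dl_non_contending	: 1;
};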
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index cef9adb..86aed82 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> > dl_rq->running_bw = 0;
> > }
> >
> > +void dl_change_utilization(struct task_struct *p, u64 new_bw)
> > +{
> > + if (!task_on_rq_queued(p)) {
> > + struct rq *rq = task_rq(p);
> > +
> > + if (p->dl.dl_non_contending) {
> > + sub_running_bw(p->dl.dl_bw, &rq->dl);
> > + p->dl.dl_non_contending = 0;
> > + /*
> > + * If the timer handler is currently running and the
> > + * timer cannot be cancelled, inactive_task_timer()
> > + * will see that dl_not_contending is not set, and
> > + * will not touch the rq's active utilization,
> > + * so we are still safe.
> > + */
> > + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > + put_task_struct(p);
> > + }
> > + }
> > +}
>
> If we rearrange that slightly we can avoid the double indent:
[...]
Ok; I'll rework the code in this way on Monday.
[...]
> > + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
> > + put_task_struct(dl_task_of(dl_se));
> > + dl_se->dl_non_contending = 0;
>
> This had me worried for a little while as being a use-after-free, but
> this put_task_struct() cannot be the last in this case. Might be
> worth a comment.
Or maybe it is better to move "dl_se->dl_non_contending = 0;" before
hrtimer_try_to_cancel()?
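A sketch of that reordering, reusing the statements and the comment from the
quoted hunk (illustration only, not a tested patch):

	dl_se->dl_non_contending = 0;
	/*
	 * If the timer handler is currently running and the
	 * timer cannot be cancelled, inactive_task_timer()
	 * will see that dl_non_contending is not set, and
	 * will not touch the rq's active utilization,
	 * so we are still safe.
	 */
	if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
		put_task_struct(dl_task_of(dl_se));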
>
> > + } else {
> > + /*
> > + * Since "dl_non_contending" is not set, the
> > + * task's utilization has already been removed from
> > + * active utilization (either when the task blocked,
> > + * when the "inactive timer" fired).
> > + * So, add it back.
> > + */
> > + add_running_bw(dl_se->dl_bw, dl_rq);
> > + }
> > +}
>
> In general I feel it would be nice to have a state diagram included
> somewhere near these two functions. It would be nice to not have to
> dig out the PDF every time.
Ok... Since I am not good at ascii art, would it be ok to add a textual
description? If yes, I'll add a comment like:
"
The utilization of a task is added to the runqueue's active utilization
when the task becomes active (is enqueued in the runqueue), and is
removed when the task becomes inactive. A task does not become
immediately inactive when it blocks, but becomes inactive at the so
called "0 lag time"; so, we setup the "inactive timer" to fire at the
"0 lag time". When the "inactive timer" fires, the task utilization is
removed from the runqueue's active utilization. If the task wakes up
again on the same runqueue before the "0 lag time", the active
utilization must not be changed and the "inactive timer" must be
cancelled. If the task wakes up again on a different runqueue before
the "0 lag time", then the task's utilization must be removed from the
previous runqueue's active utilization and must be added to the new
runqueue's active utilization.
In order to avoid races between a task waking up on a runqueue while the
"inactive timer" is running on a different CPU, the "dl_non_contending"
flag is used to indicate that a task is not on a runqueue but is active
(so, the flag is set when the task blocks and is cleared when the
"inactive timer" fires or when the task wakes up).
"
(if this is ok, where can I add this comment?)
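As a concrete illustration of the "0 lag time" used above (made-up numbers):
a task with dl_runtime = 10ms and dl_period = 100ms that blocks with 2ms of
runtime left and its deadline 30ms away reaches the 0-lag time
runtime * dl_period / dl_runtime = 20ms before the deadline, i.e. 10ms from
now; that is when the inactive timer is set to fire and the task's
utilization is removed from the runqueue's active utilization.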
> > +static void migrate_task_rq_dl(struct task_struct *p)
> > +{
> > + if ((p->state == TASK_WAKING) && (p->dl.dl_non_contending)) {
> > + struct rq *rq = task_rq(p);
> > +
> > + raw_spin_lock(&rq->lock);
> > + sub_running_bw(p->dl.dl_bw, &rq->dl);
> > + p->dl.dl_non_contending = 0;
> > + /*
> > + * If the timer handler is currently running and the
> > + * timer cannot be cancelled, inactive_task_timer()
> > + * will see that dl_not_contending is not set, and
> > + * will not touch the rq's active utilization,
> > + * so we are still safe.
> > + */
> > + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > + put_task_struct(p);
> > +
> > + raw_spin_unlock(&rq->lock);
> > + }
> > +}
>
> This can similarly be reduced in indent, albeit this is only a single
> indent level, so not as onerous as the other one.
Ok; I'll do this on Monday.
> What had me puzzled for a while here is taking the lock; because some
> callers of set_task_cpu() must in fact hold this lock already. Worth a
> comment I feel.
Ok; I'll add a comment mentioning that since state is TASK_WAKING we do
not have the lock.
> Once I figured out the exact locking; I realized you do this on
> cross-cpu wakeups. We spend a fair amount of effort not to take both
> rq locks. But I suppose you have to here.
The problem is that when a task wakes up before the "0 lag time" on a
different runqueue, we must "migrate" its utilization from the old
runqueue to the new one (see comment above). And I need the lock to
avoid races... The only alternative I can think of is to add a new
lock protecting the active utilization, and to take it instead of the
runqueue lock (I do not know if this is enough, I need to check...).
I'll have a look on Monday.
Thanks,
Luca
Hi Peter,
On Fri, 24 Mar 2017 15:00:15 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 04:52:58AM +0100, luca abeni wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 20c62e7..efa88eb 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6716,6 +6716,12 @@ static void sched_dl_do_global(void)
> > raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> >
> > rcu_read_unlock_sched();
> > + if (dl_b->bw == -1)
> > + cpu_rq(cpu)->dl.deadline_bw_inv = 1 << 8;
> > + else
> > + cpu_rq(cpu)->dl.deadline_bw_inv =
> > + to_ratio(global_rt_runtime(),
> > + global_rt_period()) >> 12;
>
> Coding style requires braces here (on both legs of the condition)..
Sorry about this; checkpatch did not complain and I did not check the
coding rules. I'll add the braces.
> Also, I find deadline_bw_inv an awkward name; would something like
> bw_ratio or so be more accurate?
I am not good at finding names :)
(I used "deadline_bw_inv" because it represents the inverse of the
deadline tasks' bandwidth)
I'll change the name to bw_ratio or something better (suggestions?)
> > + if (global_rt_runtime() == RUNTIME_INF)
> > + dl_rq->deadline_bw_inv = 1 << 8;
> > + else
> > + dl_rq->deadline_bw_inv =
> > + to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
>
> That's almost the same code; do we want a helper function?
OK, I'll look at this.
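For reference, one possible shape for such a helper (a hedged sketch only:
the function and field names follow the rename suggested above and are not
from the posted patch):

static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
{
	if (global_rt_runtime() == RUNTIME_INF) {
		dl_rq->bw_ratio = 1 << 8;
	} else {
		dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
					   global_rt_period()) >> 12;
	}
}

sched_dl_do_global() could then call it per CPU instead of open-coding the
same computation.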
> > u64 grub_reclaim(u64 delta, struct rq *rq)
> > {
> > + return (delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
> > }
>
> At which point we might want a note about how this doesn't overflow I
> suppose.
I'll add it on Monday.
>
> Also:
>
> delta *= rq->dl.running_bw;
> delta *= rq->dl.bw_ratio;
> delta >>= 20 + 8;
>
> return delta;
>
> Might be more readable ?
>
> Alternatively:
>
> delta = (delta * rq->dl.running_bw) >> 8;
> delta = (delta * rq->dl.bw_ratio) >> 20;
>
> return delta;
>
> But I doubt we care about those extra 8 bits of space; delta should not
> be over 36 bits (~64 seconds) anyway I suppose.
I think the version with all the shifts after the multiplications is
more precise, right?
Thanks,
Luca
On Fri, 24 Mar 2017 22:47:15 +0100
luca abeni <[email protected]> wrote:
> Ok... Since I am not good at ascii art, would it be ok to add a textual
> description? If yes, I'll add a comment like:
> "
> The utilization of a task is added to the runqueue's active utilization
> when the task becomes active (is enqueued in the runqueue), and is
> removed when the task becomes inactive. A task does not become
> immediately inactive when it blocks, but becomes inactive at the so
> called "0 lag time"; so, we setup the "inactive timer" to fire at the
> "0 lag time". When the "inactive timer" fires, the task utilization is
> removed from the runqueue's active utilization. If the task wakes up
> again on the same runqueue before the "0 lag time", the active
> utilization must not be changed and the "inactive timer" must be
> cancelled. If the task wakes up again on a different runqueue before
> the "0 lag time", then the task's utilization must be removed from the
> previous runqueue's active utilization and must be added to the new
> runqueue's active utilization.
> In order to avoid races between a task waking up on a runqueue while the
> "inactive timer" is running on a different CPU, the "dl_non_contending"
> flag is used to indicate that a task is not on a runqueue but is active
> (so, the flag is set when the task blocks and is cleared when the
> "inactive timer" fires or when the task wakes up).
Sure, the above is great if you never want anyone to read it ;)
Can you please break it up a little. My head starts to spin by the
third line down.
-- Steve
> "
> (if this is ok, where can I add this comment?)
On Fri, 24 Mar 2017 22:58:27 +0100
luca abeni <[email protected]> wrote:
> Hi Peter,
>
> On Fri, 24 Mar 2017 15:00:15 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Mar 24, 2017 at 04:52:58AM +0100, luca abeni wrote:
> >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 20c62e7..efa88eb 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6716,6 +6716,12 @@ static void sched_dl_do_global(void)
> > > raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> > >
> > > rcu_read_unlock_sched();
> > > + if (dl_b->bw == -1)
> > > + cpu_rq(cpu)->dl.deadline_bw_inv = 1 << 8;
> > > + else
> > > + cpu_rq(cpu)->dl.deadline_bw_inv =
> > > + to_ratio(global_rt_runtime(),
> > > + global_rt_period()) >> 12;
> >
> > Coding style requires braces here (on both legs of the condition)..
>
> Sorry about this; checkpatch did not complain and I did not check the
> coding rules. I'll add the braces.
I'm not sure it's completely documented anywhere.
The brackets are not needed if there's one statement after the if, but
for readability, it's sometimes best to put brackets in if there's more
than one line. That can even include comments. It's not a hard rule,
but more of a preference. I'm personally OK with the above, but Peter,
being the maintainer, has the final say on this kind of rule.
-- Steve
Hello Luca,
On 23 March 2017 at 21:52, luca abeni <[email protected]> wrote:
> From: Luca Abeni <[email protected]>
>
> Active utilization is defined as the total utilization of active
> (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> when a task wakes up and is decreased when a task blocks.
>
> When a task is migrated from CPUi to CPUj, immediately subtract the
> task's utilization from CPUi and add it to CPUj. This mechanism is
> implemented by modifying the pull and push functions.
> Note: this is not fully correct from the theoretical point of view
> (the utilization should be removed from CPUi only at the 0 lag
> time), a more theoretically sound solution will follow.
>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Luca Abeni <[email protected]>
> Tested-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> kernel/sched/deadline.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 6 +++++
> 2 files changed, 64 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a2ce590..cef9adb 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -43,6 +43,28 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
> return !RB_EMPTY_NODE(&dl_se->rb_node);
> }
>
> +static inline
> +void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> +{
> + u64 old = dl_rq->running_bw;
> +
> + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> + dl_rq->running_bw += dl_bw;
> + SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> +}
> +
> +static inline
> +void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> +{
> + u64 old = dl_rq->running_bw;
> +
> + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> + dl_rq->running_bw -= dl_bw;
> + SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> + if (dl_rq->running_bw > old)
> + dl_rq->running_bw = 0;
> +}
> +
> static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> {
> struct sched_dl_entity *dl_se = &p->dl;
> @@ -946,8 +968,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> * parameters of the task might need updating. Otherwise,
> * we want a replenishment of its runtime.
> */
> - if (flags & ENQUEUE_WAKEUP)
> + if (flags & ENQUEUE_WAKEUP) {
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +
> + add_running_bw(dl_se->dl_bw, dl_rq);
> update_dl_entity(dl_se, pi_se);
> + }
What do we do for tasks that go from normal to dl? Unless I'm
mistaken, their utilisation won't be accounted for when they are first
enqueued.
> else if (flags & ENQUEUE_REPLENISH)
> replenish_dl_entity(dl_se, pi_se);
>
> @@ -998,14 +1024,25 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
> dl_check_constrained_dl(&p->dl);
>
> + if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
> + add_running_bw(p->dl.dl_bw, &rq->dl);
> +
> /*
> - * If p is throttled, we do nothing. In fact, if it exhausted
> + * If p is throttled, we do not enqueue it. In fact, if it exhausted
> * its budget it needs a replenishment and, since it now is on
> * its rq, the bandwidth timer callback (which clearly has not
> * run yet) will take care of this.
> + * However, the active utilization does not depend on the fact
> + * that the task is on the runqueue or not (but depends on the
> + * task's state - in GRUB parlance, "inactive" vs "active contending").
> + * In other words, even if a task is throttled its utilization must
> + * be counted in the active utilization; hence, we need to call
> + * add_running_bw().
> */
> - if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> + if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> + add_running_bw(p->dl.dl_bw, &rq->dl);
> return;
> + }
>
> enqueue_dl_entity(&p->dl, pi_se, flags);
>
> @@ -1023,6 +1060,20 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> {
> update_curr_dl(rq);
> __dequeue_task_dl(rq, p, flags);
> +
> + if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
> + sub_running_bw(p->dl.dl_bw, &rq->dl);
> +
> + /*
> + * This check allows to decrease the active utilization in two cases:
> + * when the task blocks and when it is terminating
> + * (p->state == TASK_DEAD). We can handle the two cases in the same
> + * way, because from GRUB's point of view the same thing is happening
> + * (the task moves from "active contending" to "active non contending"
> + * or "inactive")
> + */
> + if (flags & DEQUEUE_SLEEP)
> + sub_running_bw(p->dl.dl_bw, &rq->dl);
> }
>
> /*
> @@ -1551,7 +1602,9 @@ static int push_dl_task(struct rq *rq)
> }
>
> deactivate_task(rq, next_task, 0);
> + sub_running_bw(next_task->dl.dl_bw, &rq->dl);
> set_task_cpu(next_task, later_rq->cpu);
> + add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
> activate_task(later_rq, next_task, 0);
> ret = 1;
>
> @@ -1639,7 +1692,9 @@ static void pull_dl_task(struct rq *this_rq)
> resched = true;
>
> deactivate_task(src_rq, p, 0);
> + sub_running_bw(p->dl.dl_bw, &src_rq->dl);
> set_task_cpu(p, this_cpu);
> + add_running_bw(p->dl.dl_bw, &this_rq->dl);
> activate_task(this_rq, p, 0);
> dmin = p->dl.deadline;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 57caf36..caaa7d3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -558,6 +558,12 @@ struct dl_rq {
> #else
> struct dl_bw dl_bw;
> #endif
> + /*
> + * "Active utilization" for this runqueue: increased when a
> + * task wakes up (becomes TASK_RUNNING) and decreased when a
> + * task blocks
> + */
> + u64 running_bw;
Would it be a good idea to initialise and clear this field in
rq_online/offline_dl()?
> };
>
> #ifdef CONFIG_SMP
> --
> 2.7.4
>
On 23 March 2017 at 21:52, luca abeni <[email protected]> wrote:
> From: Luca Abeni <[email protected]>
>
> This patch implements a more theoretically sound algorithm for
> tracking active utilization: instead of decreasing it when a
> task blocks, use a timer (the "inactive timer", named after the
> "Inactive" task state of the GRUB algorithm) to decrease the
> active utilization at the so called "0-lag time".
>
> Signed-off-by: Luca Abeni <[email protected]>
> Tested-by: Claudio Scordino <[email protected]>
> Tested-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> include/linux/sched.h | 17 ++++
> kernel/sched/core.c | 3 +
> kernel/sched/deadline.c | 208 ++++++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 2 +
> 4 files changed, 215 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..952cac8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -445,16 +445,33 @@ struct sched_dl_entity {
> *
> * @dl_yielded tells if task gave up the CPU before consuming
> * all its available runtime during the last job.
> + *
> + * @dl_non_contending tells if task is inactive while still
> + * contributing to the active utilization. In other words, it
> + * indicates if the inactive timer has been armed and its handler
> + * has not been executed yet. This flag is useful to avoid race
> + * conditions between the inactive timer handler and the wakeup
> + * code.
> */
> int dl_throttled;
> int dl_boosted;
> int dl_yielded;
> + int dl_non_contending;
>
> /*
> * Bandwidth enforcement timer. Each -deadline task has its
> * own bandwidth to be enforced, thus we need one timer per task.
> */
> struct hrtimer dl_timer;
> +
> + /*
> + * Inactive timer, responsible for decreasing the active utilization
> + * at the "0-lag time". When a -deadline task blocks, it contributes
> + * to GRUB's active utilization until the "0-lag time", hence a
> + * timer is needed to decrease the active utilization at the correct
> + * time.
> + */
> + struct hrtimer inactive_timer;
> };
>
> union rcu_special {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d6cad9..bf0b0b9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2165,6 +2165,7 @@ void __dl_clear_params(struct task_struct *p)
>
> dl_se->dl_throttled = 0;
> dl_se->dl_yielded = 0;
> + dl_se->dl_non_contending = 0;
> }
>
> /*
> @@ -2196,6 +2197,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>
> RB_CLEAR_NODE(&p->dl.rb_node);
> init_dl_task_timer(&p->dl);
> + init_inactive_task_timer(&p->dl);
> __dl_clear_params(p);
>
> INIT_LIST_HEAD(&p->rt.run_list);
> @@ -2518,6 +2520,7 @@ static int dl_overflow(struct task_struct *p, int policy,
> !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> __dl_clear(dl_b, p->dl.dl_bw);
> __dl_add(dl_b, new_bw);
> + dl_change_utilization(p, new_bw);
> err = 0;
> } else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> __dl_clear(dl_b, p->dl.dl_bw);
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index cef9adb..86aed82 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> dl_rq->running_bw = 0;
> }
>
> +void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +{
> + if (!task_on_rq_queued(p)) {
> + struct rq *rq = task_rq(p);
> +
> + if (p->dl.dl_non_contending) {
> + sub_running_bw(p->dl.dl_bw, &rq->dl);
> + p->dl.dl_non_contending = 0;
> + /*
> + * If the timer handler is currently running and the
> + * timer cannot be cancelled, inactive_task_timer()
> + * will see that dl_not_contending is not set, and
> + * will not touch the rq's active utilization,
> + * so we are still safe.
> + */
> + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> + put_task_struct(p);
> + }
> + }
> +}
> +
> +static void task_non_contending(struct task_struct *p)
> +{
> + struct sched_dl_entity *dl_se = &p->dl;
> + struct hrtimer *timer = &dl_se->inactive_timer;
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_rq(dl_rq);
> + s64 zerolag_time;
> +
> + /*
> + * If this is a non-deadline task that has been boosted,
> + * do nothing
> + */
> + if (dl_se->dl_runtime == 0)
> + return;
> +
> + WARN_ON(hrtimer_active(&dl_se->inactive_timer));
> + WARN_ON(dl_se->dl_non_contending);
> +
> + zerolag_time = dl_se->deadline -
> + div64_long((dl_se->runtime * dl_se->dl_period),
> + dl_se->dl_runtime);
> +
> + /*
> + * Using relative times instead of the absolute "0-lag time"
> + * allows to simplify the code
> + */
> + zerolag_time -= rq_clock(rq);
> +
> + /*
> + * If the "0-lag time" already passed, decrease the active
> + * utilization now, instead of starting a timer
> + */
> + if (zerolag_time < 0) {
> + if (dl_task(p))
> + sub_running_bw(dl_se->dl_bw, dl_rq);
> + if (!dl_task(p) || p->state == TASK_DEAD)
> + __dl_clear_params(p);
> +
> + return;
> + }
> +
> + dl_se->dl_non_contending = 1;
> + get_task_struct(p);
> + hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
> +}
> +
> +static void task_contending(struct sched_dl_entity *dl_se)
> +{
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +
> + /*
> + * If this is a non-deadline task that has been boosted,
> + * do nothing
> + */
> + if (dl_se->dl_runtime == 0)
> + return;
> +
> + if (dl_se->dl_non_contending) {
> + /*
> + * If the timer handler is currently running and the
> + * timer cannot be cancelled, inactive_task_timer()
> + * will see that dl_not_contending is not set, and
> + * will not touch the rq's active utilization,
> + * so we are still safe.
> + */
> + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
> + put_task_struct(dl_task_of(dl_se));
> + dl_se->dl_non_contending = 0;
> + } else {
> + /*
> + * Since "dl_non_contending" is not set, the
> + * task's utilization has already been removed from
> + * active utilization (either when the task blocked,
> + * when the "inactive timer" fired).
> + * So, add it back.
> + */
> + add_running_bw(dl_se->dl_bw, dl_rq);
> + }
> +}
> +
> static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> {
> struct sched_dl_entity *dl_se = &p->dl;
> @@ -615,10 +716,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> * The task might have changed its scheduling policy to something
> * different than SCHED_DEADLINE (through switched_from_dl()).
> */
> - if (!dl_task(p)) {
> - __dl_clear_params(p);
> + if (!dl_task(p))
> goto unlock;
> - }
>
> /*
> * The task might have been boosted by someone else and might be in the
> @@ -837,6 +936,49 @@ static void update_curr_dl(struct rq *rq)
> }
> }
>
> +static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> +{
> + struct sched_dl_entity *dl_se = container_of(timer,
> + struct sched_dl_entity,
> + inactive_timer);
> + struct task_struct *p = dl_task_of(dl_se);
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, &rf);
> +
> + if (!dl_task(p) || p->state == TASK_DEAD) {
> + if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
> + sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
> + dl_se->dl_non_contending = 0;
> + }
> + __dl_clear_params(p);
> +
> + goto unlock;
> + }
> + if (dl_se->dl_non_contending == 0)
> + goto unlock;
> +
> + sched_clock_tick();
> + update_rq_clock(rq);
> +
> + sub_running_bw(dl_se->dl_bw, &rq->dl);
> + dl_se->dl_non_contending = 0;
> +unlock:
> + task_rq_unlock(rq, p, &rf);
> + put_task_struct(p);
> +
> + return HRTIMER_NORESTART;
> +}
> +
> +void init_inactive_task_timer(struct sched_dl_entity *dl_se)
To be consistent with the other DL related functions:
s/init_inactive_task_timer(...)/init_dl_inactive_task_timer(...)
> +{
> + struct hrtimer *timer = &dl_se->inactive_timer;
> +
> + hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + timer->function = inactive_task_timer;
> +}
> +
> #ifdef CONFIG_SMP
>
> static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> @@ -969,9 +1111,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> * we want a replenishment of its runtime.
> */
> if (flags & ENQUEUE_WAKEUP) {
> - struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> -
> - add_running_bw(dl_se->dl_bw, dl_rq);
> + task_contending(dl_se);
> update_dl_entity(dl_se, pi_se);
> }
> else if (flags & ENQUEUE_REPLENISH)
> @@ -1040,7 +1180,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> * add_running_bw().
> */
> if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> - add_running_bw(p->dl.dl_bw, &rq->dl);
> + if (flags & ENQUEUE_WAKEUP)
> + task_contending(&p->dl);
> +
> return;
> }
>
> @@ -1065,7 +1207,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> sub_running_bw(p->dl.dl_bw, &rq->dl);
>
> /*
> - * This check allows to decrease the active utilization in two cases:
> + * This check allows to start the inactive timer (or to immediately
> + * decrease the active utilization, if needed) in two cases:
> * when the task blocks and when it is terminating
> * (p->state == TASK_DEAD). We can handle the two cases in the same
> * way, because from GRUB's point of view the same thing is happening
> @@ -1073,7 +1216,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> * or "inactive")
> */
> if (flags & DEQUEUE_SLEEP)
> - sub_running_bw(p->dl.dl_bw, &rq->dl);
> + task_non_contending(p);
> }
>
> /*
> @@ -1151,6 +1294,28 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> return cpu;
> }
>
> +static void migrate_task_rq_dl(struct task_struct *p)
> +{
> + if ((p->state == TASK_WAKING) && (p->dl.dl_non_contending)) {
> + struct rq *rq = task_rq(p);
> +
> + raw_spin_lock(&rq->lock);
> + sub_running_bw(p->dl.dl_bw, &rq->dl);
> + p->dl.dl_non_contending = 0;
> + /*
> + * If the timer handler is currently running and the
> + * timer cannot be cancelled, inactive_task_timer()
> + * will see that dl_not_contending is not set, and
> + * will not touch the rq's active utilization,
> + * so we are still safe.
> + */
> + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> + put_task_struct(p);
> +
> + raw_spin_unlock(&rq->lock);
> + }
> +}
> +
> static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> {
> /*
> @@ -1792,13 +1957,23 @@ void __init init_sched_dl_class(void)
> static void switched_from_dl(struct rq *rq, struct task_struct *p)
> {
> /*
> - * Start the deadline timer; if we switch back to dl before this we'll
> - * continue consuming our current CBS slice. If we stay outside of
> - * SCHED_DEADLINE until the deadline passes, the timer will reset the
> - * task.
> + * task_non_contending() can start the "inactive timer" (if the 0-lag
> + * time is in the future). If the task switches back to dl before
> + * the "inactive timer" fires, it can continue to consume its current
> + * runtime using its current deadline. If it stays outside of
> + * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
> + * will reset the task parameters.
> */
> - if (!start_dl_timer(p))
> - __dl_clear_params(p);
> + if (task_on_rq_queued(p) && p->dl.dl_runtime)
> + task_non_contending(p);
> +
> + /*
> + * We cannot use inactive_task_timer() to invoke sub_running_bw()
> + * at the 0-lag time, because the task could have been migrated
> + * while SCHED_OTHER in the meanwhile.
> + */
> + if (p->dl.dl_non_contending)
> + p->dl.dl_non_contending = 0;
>
> /*
> * Since this might be the only -deadline task on the rq,
> @@ -1817,6 +1992,8 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> */
> static void switched_to_dl(struct rq *rq, struct task_struct *p)
> {
> + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> + put_task_struct(p);
>
> /* If p is not queued we will update its parameters at next wakeup. */
> if (!task_on_rq_queued(p))
> @@ -1891,6 +2068,7 @@ const struct sched_class dl_sched_class = {
>
> #ifdef CONFIG_SMP
> .select_task_rq = select_task_rq_dl,
> + .migrate_task_rq = migrate_task_rq_dl,
> .set_cpus_allowed = set_cpus_allowed_dl,
> .rq_online = rq_online_dl,
> .rq_offline = rq_offline_dl,
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index caaa7d3..57bb79b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -244,6 +244,7 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
> dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
> }
>
> +void dl_change_utilization(struct task_struct *p, u64 new_bw);
> extern void init_dl_bw(struct dl_bw *dl_b);
>
> #ifdef CONFIG_CGROUP_SCHED
> @@ -1490,6 +1491,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
> extern struct dl_bandwidth def_dl_bandwidth;
> extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
> extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
> +extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);
>
> unsigned long to_ratio(u64 period, u64 runtime);
>
> --
> 2.7.4
>
Hi Mathieu,
On Sun, 26 Mar 2017 11:04:18 -0600
Mathieu Poirier <[email protected]> wrote:
[...]
> > @@ -946,8 +968,12 @@ enqueue_dl_entity(struct sched_dl_entity
> > *dl_se,
> > * parameters of the task might need updating. Otherwise,
> > * we want a replenishment of its runtime.
> > */
> > - if (flags & ENQUEUE_WAKEUP)
> > + if (flags & ENQUEUE_WAKEUP) {
> > + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > +
> > + add_running_bw(dl_se->dl_bw, dl_rq);
> > update_dl_entity(dl_se, pi_se);
> > + }
>
> What do we do for tasks that go from normal to dl? Unless I'm
> mistaken, their utilisation won't be accounted for when they are first
> enqueued.
This case should be handled by the "if (flags & ENQUEUE_WAKEUP)" branch
in enqueue_dl_entity().
[...]
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 57caf36..caaa7d3 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -558,6 +558,12 @@ struct dl_rq {
> > #else
> > struct dl_bw dl_bw;
> > #endif
> > + /*
> > + * "Active utilization" for this runqueue: increased when a
> > + * task wakes up (becomes TASK_RUNNING) and decreased when a
> > + * task blocks
> > + */
> > + u64 running_bw;
>
> Would it be a good idea to initialise and clear this field in
> rq_online/offline_dl()?
Thanks for pointing this out... I have to admit I do not know the
online/offline code too much (and I am not sure if I am handling this
correctly).
I'll try to have a look on Monday.
Thanks,
Luca
Hi Mathieu,
On Sun, 26 Mar 2017 11:32:59 -0600
Mathieu Poirier <[email protected]> wrote:
[...]
> > + task_rq_unlock(rq, p, &rf);
> > + put_task_struct(p);
> > +
> > + return HRTIMER_NORESTART;
> > +}
> > +
> > +void init_inactive_task_timer(struct sched_dl_entity *dl_se)
>
> To be consistent with the other DL related functions:
>
> s/init_inactive_task_timer(...)/init_dl_inactive_task_timer(...)
Thanks; I'll change the name of this function.
Thanks,
Luca
On 24/03/17 22:47, Luca Abeni wrote:
> Hi Peter,
>
> On Fri, 24 Mar 2017 14:20:41 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> >
[...]
> >
> > In general I feel it would be nice to have a state diagram included
> > somewhere near these two functions. It would be nice to not have to
> > dig out the PDF every time.
>
> Ok... Since I am not good at ascii art, would it be ok to add a textual
> description? If yes, I'll add a comment like:
> "
> The utilization of a task is added to the runqueue's active utilization
> when the task becomes active (is enqueued in the runqueue), and is
Is enqueued for the first time on a new period, maybe? It seems to be
contradictory w.r.t. what below (if wakeup before 0 lag time) otherwise.
> removed when the task becomes inactive. A task does not become
> immediately inactive when it blocks, but becomes inactive at the so
> called "0 lag time"; so, we setup the "inactive timer" to fire at the
> "0 lag time". When the "inactive timer" fires, the task utilization is
> removed from the runqueue's active utilization. If the task wakes up
> again on the same runqueue before the "0 lag time", the active
> utilization must not be changed and the "inactive timer" must be
> cancelled. If the task wakes up again on a different runqueue before
> the "0 lag time", then the task's utilization must be removed from the
> previous runqueue's active utilization and must be added to the new
> runqueue's active utilization.
> In order to avoid races between a task waking up on a runqueue while the
> "inactive timer" is running on a different CPU, the "dl_non_contending"
> flag is used to indicate that a task is not on a runqueue but is active
> (so, the flag is set when the task blocks and is cleared when the
> "inactive timer" fires or when the task wakes up).
> "
> (if this is ok, where can I add this comment?)
>
Thanks for this Luca. Not sure it adds much to your text above, but we
might want to consider adding something like below?
--->8---
1st enqueue +------------------+
| |
+---------------->+ ACTIVEcontending |
| | |
| +----+------+------+
| | ^
| | |
+--------+-------+ | |
| | dequeue | | wakeup before
| INACTIVE | | | 0 lag time
| | | |
+--------+-------+ | |
^ | |
| V |
| +----+------+------+
| | |
+-----------------+ ACTIVEnonCONTEND |
| |
0 lag time +------------------+
elapsed
--->8---
Thanks,
- Juri
On Fri, 24 Mar 2017 22:47:15 +0100
luca abeni <[email protected]> wrote:
[...]
> > > + } else {
> > > + /*
> > > + * Since "dl_non_contending" is not set, the
> > > + * task's utilization has already been removed
> > > from
> > > + * active utilization (either when the task
> > > blocked,
> > > + * when the "inactive timer" fired).
> > > + * So, add it back.
> > > + */
> > > + add_running_bw(dl_se->dl_bw, dl_rq);
> > > + }
> > > +}
> >
> > In general I feel it would be nice to have a state diagram included
> > somewhere near these two functions. It would be nice to not have to
> > dig out the PDF every time.
>
> Ok... Since I am not good at ascii art, would it be ok to add a
> textual description? If yes, I'll add a comment like:
> "
> The utilization of a task is added to the runqueue's active
> utilization when the task becomes active (is enqueued in the
> runqueue), and is removed when the task becomes inactive. A task does
> not become immediately inactive when it blocks, but becomes inactive
> at the so called "0 lag time"; so, we setup the "inactive timer" to
> fire at the "0 lag time". When the "inactive timer" fires, the task
> utilization is removed from the runqueue's active utilization. If the
> task wakes up again on the same runqueue before the "0 lag time", the
> active utilization must not be changed and the "inactive timer" must
> be cancelled. If the task wakes up again on a different runqueue
> before the "0 lag time", then the task's utilization must be removed
> from the previous runqueue's active utilization and must be added to
> the new runqueue's active utilization.
> In order to avoid races between a task waking up on a runqueue while
> the "inactive timer" is running on a different CPU, the
> "dl_non_contending" flag is used to indicate that a task is not on a
> runqueue but is active (so, the flag is set when the task blocks and
> is cleared when the "inactive timer" fires or when the task wakes
> up). "
After re-reading all of this, I realized two things:
1) The comment I added before the definition of the dl_non_contending
flag in sched.h is confusing. I'll try to rephrase it
2) The "dl_non_contending" name might be part of the issue, here.
It tries to map concepts from the GRUB paper to the implementation,
so it is understandable and makes things clear (I hope) for people
who know the paper... But it is not the best name for people who do not
know the GRUB paper (and expecting someone to study the paper to
understand the code is not good :).
So, what about "inactive_timer_armed" (or similar)? This would
immediately clarify what the flag is about... What do you think?
Would this renaming be useful?
Thanks,
Luca
Hi Juri,
On Mon, 27 Mar 2017 08:17:45 +0100
Juri Lelli <[email protected]> wrote:
[...]
> > > In general I feel it would be nice to have a state diagram
> > > included somewhere near these two functions. It would be nice to
> > > not have to dig out the PDF every time.
> >
> > Ok... Since I am not good at ascii art, would it be ok to add a
> > textual description? If yes, I'll add a comment like:
> > "
> > The utilization of a task is added to the runqueue's active
> > utilization when the task becomes active (is enqueued in the
> > runqueue), and is
>
> Is enqueued for the first time on a new period, maybe? It seems to be
> contradictory w.r.t. what below (if wakeup before 0 lag time)
> otherwise.
I think it should be "is enqueued in the runqueue and was previously
not active" (I did not write the "and was previously not active" to
avoid complicating the sentence even more... But this
"simplification" was not a good idea :). The fact that this happens in a
new period or not is (in my understanding) irrelevant...
> > removed when the task becomes inactive. A task does not become
> > immediately inactive when it blocks, but becomes inactive at the so
> > called "0 lag time"; so, we setup the "inactive timer" to fire at
> > the "0 lag time". When the "inactive timer" fires, the task
> > utilization is removed from the runqueue's active utilization. If
> > the task wakes up again on the same runqueue before the "0 lag
> > time", the active utilization must not be changed and the "inactive
> > timer" must be cancelled. If the task wakes up again on a different
> > runqueue before the "0 lag time", then the task's utilization must
> > be removed from the previous runqueue's active utilization and must
> > be added to the new runqueue's active utilization.
> > In order to avoid races between a task waking up on a runqueue
> > while the "inactive timer" is running on a different CPU, the
> > "dl_non_contending" flag is used to indicate that a task is not on
> > a runqueue but is active (so, the flag is set when the task blocks
> > and is cleared when the "inactive timer" fires or when the task
> > wakes up). "
> > (if this is ok, where can I add this comment?)
> >
>
> Thanks for this Luca. Not sure it adds much to your text above, but we
> might want to consider adding something like below?
>
> --->8---
> 1st enqueue +------------------+
> | |
> +---------------->+ ACTIVEcontending |
> | | |
> | +----+------+------+
> | | ^
> | | |
> +--------+-------+ | |
> | | dequeue | | wakeup before
> | INACTIVE | | | 0 lag time
> | | | |
> +--------+-------+ | |
> ^ | |
> | V |
> | +----+------+------+
> | | |
> +-----------------+ ACTIVEnonCONTEND |
> | |
> 0 lag time +------------------+
> elapsed
> --->8---
I am not sure if introducing the "active non contending" name is a good
idea or not (see my previous email), but I am not the best person to
decide this... If people like this figure, I am more than happy to add
it :)
(but then maybe we can change "0 lag time elapsed" with "inactive timer
fires" and we can display in the figure the state of the
"dl_non_contending"/"inactive_timer_armed" flag)
Thanks,
Luca
On Fri, 24 Mar 2017 22:31:46 -0400
Steven Rostedt <[email protected]> wrote:
> On Fri, 24 Mar 2017 22:47:15 +0100
> luca abeni <[email protected]> wrote:
>
> > Ok... Since I am not good at ascii art, would it be ok to add a
> > textual description? If yes, I'll add a comment like:
> > "
> > The utilization of a task is added to the runqueue's active
> > utilization when the task becomes active (is enqueued in the
> > runqueue), and is removed when the task becomes inactive. A task
> > does not become immediately inactive when it blocks, but becomes
> > inactive at the so called "0 lag time"; so, we setup the "inactive
> > timer" to fire at the "0 lag time". When the "inactive timer"
> > fires, the task utilization is removed from the runqueue's active
> > utilization. If the task wakes up again on the same runqueue before
> > the "0 lag time", the active utilization must not be changed and
> > the "inactive timer" must be cancelled. If the task wakes up again
> > on a different runqueue before the "0 lag time", then the task's
> > utilization must be removed from the previous runqueue's active
> > utilization and must be added to the new runqueue's active
> > utilization. In order to avoid races between a task waking up on a
> > runqueue while the "inactive timer" is running on a different CPU,
> > the "dl_non_contending" flag is used to indicate that a task is not
> > on a runqueue but is active (so, the flag is set when the task
> > blocks and is cleared when the "inactive timer" fires or when the
> > task wakes up).
>
> Sure, the above is great if you never want anyone to read it ;)
>
> Can you please break it up a little. My head starts to spin by the
> third line down.
Ok... Maybe finding a clean and understandable way to explain the
above sentence is something that can be done at the OSPM summit?
Thanks,
Luca
On Fri, Mar 24, 2017 at 10:38:31PM -0400, Steven Rostedt wrote:
> > > > @@ -6716,6 +6716,12 @@ static void sched_dl_do_global(void)
> > > > raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> > > >
> > > > rcu_read_unlock_sched();
> > > > + if (dl_b->bw == -1)
> > > > + cpu_rq(cpu)->dl.deadline_bw_inv = 1 << 8;
> > > > + else
> > > > + cpu_rq(cpu)->dl.deadline_bw_inv =
> > > > + to_ratio(global_rt_runtime(),
> > > > + global_rt_period()) >> 12;
> > >
> > > Coding style requires braces here (on both legs of the condition)..
> I'm not sure it's completely documented anywhere.
Two parts;
1) I prefer braces over any multi-line block, irrespective of whether it's
a single statement or not. This is, afaik, not strictly documented in
coding style.
Rationale is that missing braces are bad, and the larger the single
statement, the harder it is to be sure it is in fact a single
statement.
2) If one leg needs braces, then both should get it. This is in fact
part of CodingStyle.
On 27/03/17 09:43, Luca Abeni wrote:
> Hi Juri,
>
> On Mon, 27 Mar 2017 08:17:45 +0100
> Juri Lelli <[email protected]> wrote:
> [...]
> > > > In general I feel it would be nice to have a state diagram
> > > > included somewhere near these two functions. It would be nice to
> > > > not have to dig out the PDF every time.
> > >
> > > Ok... Since I am not good at ascii art, would it be ok to add a
> > > textual description? If yes, I'll add a comment like:
> > > "
> > > The utilization of a task is added to the runqueue's active
> > > utilization when the task becomes active (is enqueued in the
> > > runqueue), and is
> >
> > Is enqueued for the first time on a new period, maybe? It seems to be
> > contradictory w.r.t. what below (if wakeup before 0 lag time)
> > otherwise.
> I think it should be "is enqueued in the runqueue and was previously
> not active" (I did not write the "and was previously not active" to
Right.
> avoid complicanting the sentence even more... But this
> "simplification" was not a good idea :). The fact that this happens in a
> new period or not is (in my understanding) irrelevant...
>
>
> > > removed when the task becomes inactive. A task does not become
> > > immediately inactive when it blocks, but becomes inactive at the so
> > > called "0 lag time"; so, we setup the "inactive timer" to fire at
> > > the "0 lag time". When the "inactive timer" fires, the task
> > > utilization is removed from the runqueue's active utilization. If
> > > the task wakes up again on the same runqueue before the "0 lag
> > > time", the active utilization must not be changed and the "inactive
> > > timer" must be cancelled. If the task wakes up again on a different
> > > runqueue before the "0 lag time", then the task's utilization must
> > > be removed from the previous runqueue's active utilization and must
> > > be added to the new runqueue's active utilization.
> > > In order to avoid races between a task waking up on a runqueue
> > > while the "inactive timer" is running on a different CPU, the
> > > "dl_non_contending" flag is used to indicate that a task is not on
> > > a runqueue but is active (so, the flag is set when the task blocks
> > > and is cleared when the "inactive timer" fires or when the task
> > > wakes up). "
> > > (if this is ok, where can I add this comment?)
> > >
> >
> > Thanks for this Luca. Not sure it adds much to your text above, but we
> > might want to consider adding something like below?
> >
> > --->8---
> > 1st enqueue +------------------+
> > | |
> > +---------------->+ ACTIVEcontending |
> > | | |
> > | +----+------+------+
> > | | ^
> > | | |
> > +--------+-------+ | |
> > | | dequeue | | wakeup before
> > | INACTIVE | | | 0 lag time
> > | | | |
> > +--------+-------+ | |
> > ^ | |
> > | V |
> > | +----+------+------+
> > | | |
> > +-----------------+ ACTIVEnonCONTEND |
> > | |
> > 0 lag time +------------------+
> > elapsed
> > --->8---
>
> I am not sure if introducing the "active non contending" name is a good
> idea or not (see my previous email), but I am not the best person to
> decide this... If people like this figure, I am more than happy to add
> it :)
> (but then maybe we can change "0 lag time elapsed" with "inactive timer
> fires" and we can display in the figure the state of the
> "dl_non_contending"/"inactive_timer_armed" flag)
>
Sure. Let's see what people think about what you say in the other email
and I'll update the figure accordingly.
Hi guys,
2017-03-27 10:20 GMT+02:00 Luca Abeni <[email protected]>:
> On Fri, 24 Mar 2017 22:31:46 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> On Fri, 24 Mar 2017 22:47:15 +0100
>> luca abeni <[email protected]> wrote:
>>
>> > Ok... Since I am not good at ascii art, would it be ok to add a
>> > textual description? If yes, I'll add a comment like:
>> > "
>> > The utilization of a task is added to the runqueue's active
>> > utilization when the task becomes active (is enqueued in the
>> > runqueue), and is removed when the task becomes inactive. A task
>> > does not become immediately inactive when it blocks, but becomes
>> > inactive at the so called "0 lag time"; so, we setup the "inactive
>> > timer" to fire at the "0 lag time". When the "inactive timer"
>> > fires, the task utilization is removed from the runqueue's active
>> > utilization. If the task wakes up again on the same runqueue before
>> > the "0 lag time", the active utilization must not be changed and
>> > the "inactive timer" must be cancelled. If the task wakes up again
>> > on a different runqueue before the "0 lag time", then the task's
>> > utilization must be removed from the previous runqueue's active
>> > utilization and must be added to the new runqueue's active
>> > utilization. In order to avoid races between a task waking up on a
>> > runqueue while the "inactive timer" is running on a different CPU,
>> > the "dl_non_contending" flag is used to indicate that a task is not
>> > on a runqueue but is active (so, the flag is set when the task
>> > blocks and is cleared when the "inactive timer" fires or when the
>> > task wakes up).
>>
>> Sure, the above is great if you never want anyone to read it ;)
>>
>> Can you please break it up a little. My head starts to spin by the
>> third line down.
>
> Ok... Maybe finding a clean and understandable way to explain the
> above sentence is something that can be done at the OSPM summit?
What about adding the following text to Documentation/?
Does it make things clearer?
Cheers,
Claudio
The following figure illustrates the state names of the GRUB algorithm:
------------
(d) | Active |
------------->| |
| | Contending |
| ------------
| A |
---------- | |
| | | |
| Inactive | |(b) | (a)
| | | |
---------- | |
A | V
| ------------
| | Active |
--------------| Non |
(c) | Contending |
------------
A task can be in one of the following states:
- ActiveContending: if it is ready for execution (or executing);
- ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
time;
- Inactive: if it is blocked and has surpassed the 0-lag time.
For each runqueue, the GRUB algorithm keeps track of two different bandwidths:
- Active bandwidth (running_bw): this is the sum of the bandwidths of all
tasks in active state (i.e., ActiveContending or ActiveNonContending);
- Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
"belonging" to the runqueue, including the tasks in Inactive state.
State transitions:
(a) When a task blocks, it does not become immediately inactive since its
bandwidth cannot be immediately reclaimed without breaking the
real-time guarantees. It therefore enters a transitional state called
ActiveNonContending. The scheduler arms the "inactive timer" to fire at
the 0-lag time, when the task's bandwidth can be reclaimed without
breaking the real-time guarantees.
(b) If the task wakes up before the inactive timer fires, the task re-enters
the ActiveContending state and the "inactive timer" is canceled.
In addition, if the task wakes up on a different runqueue, then
the task's utilization must be removed from the previous runqueue's active
utilization and must be added to the new runqueue's active utilization.
In order to avoid races between a task waking up on a runqueue while the
"inactive timer" is running on a different CPU, the "dl_non_contending"
flag is used to indicate that a task is not on a runqueue but is active
(so, the flag is set when the task blocks and is cleared when the
"inactive timer" fires or when the task wakes up).
(c) When the "inactive timer" fires, the task enters the Inactive state and its
utilization is removed from the runqueue's active utilization.
(d) When an inactive task wakes up, it enters the ActiveContending state and
its utilization is added to the active utilization of the runqueue where
it has been enqueued.
The algorithm reclaims the bandwidth of the tasks in Inactive state.
It does so by decrementing the runtime of the executing task Ti at a pace equal
to
(1 - Uinact)*dt
where Uinact is the inactive utilization, computed as (this_bw - running_bw).
The GRUB algorithm is described in the following papers:
[1] G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time Systems,
2000.
[2] L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
Dusseldorf, Germany, 2014.
[3] L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel or
sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
Computing, 2016.
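A quick worked example of the reclaiming rule above (made-up numbers): with
this_bw = 0.5 and running_bw = 0.4 on a runqueue, Uinact = 0.1 and the
executing task's runtime is decreased as dq = -(1 - 0.1) dt = -0.9 dt, so
10ms of execution consumes only 9ms of runtime; the remaining 1ms is the
bandwidth reclaimed from tasks in the Inactive state.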
On Fri, Mar 24, 2017 at 04:53:02AM +0100, luca abeni wrote:
> +static inline
> +void __dl_update(struct dl_bw *dl_b, s64 bw)
> +{
> + struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
> + int i;
> +
> + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> + "sched RCU must be held");
> + for_each_cpu_and(i, rd->span, cpu_active_mask) {
> + struct rq *rq = cpu_rq(i);
> +
> + rq->dl.extra_bw += bw;
> + }
So this is unfortunate (and we already have one such instance).
It effectively does a for_each_online_cpu() with IRQs disabled, and on
SGI class hardware that takes _forever_.
This is also what I got stuck on trying to rewrite AC to use Tommaso's
recoverable thing. In the end I had to do a 2 stage try/commit variant.
Which ended up being a pain and I didn't finish.
I'm not saying this patch is bad, but this is something we need to think
about.
On Fri, Mar 24, 2017 at 04:53:01AM +0100, luca abeni wrote:
> From: Luca Abeni <[email protected]>
>
> Instead of decreasing the runtime as "dq = -Uact dt" (eventually
> divided by the maximum utilization available for deadline tasks),
> decrease it as "dq = -(1 - Uinact) dt", where Uinact is the "inactive
> utilization".
> In this way, the maximum fraction of CPU time that can be reclaimed
> is given by the total utilization of deadline tasks.
> This approach solves some fairness issues that have been noticed with
> "traditional" global GRUB reclaiming.
I think the Changelog could do with explicit enumeration of what "some"
is.
> Signed-off-by: Luca Abeni <[email protected]>
> Tested-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> kernel/sched/deadline.c | 23 ++++++++++++++++-------
> 1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d70a7b9..c393c3d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -900,14 +900,23 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
> /*
> * This function implements the GRUB accounting rule:
> * according to the GRUB reclaiming algorithm, the runtime is
> + * not decreased as "dq = -dt", but as "dq = (1 - Uinact) dt", where
Changelog had it right I think: dq = -(1 - Uinact) dt
> + * Uinact is the (per-runqueue) inactive utilization, computed as the
> + * difference between the "total runqueue utilization" and the runqueue
> + * active utilization.
> + * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
> + * multiplied by 2^20, the result has to be shifted right by 20.
> */
> -u64 grub_reclaim(u64 delta, struct rq *rq)
> +u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
> {
> + u64 u_act;
> +
> + if (rq->dl.this_bw - rq->dl.running_bw > (1 << 20) - u)
> + u_act = u;
> + else
> + u_act = (1 << 20) - rq->dl.this_bw + rq->dl.running_bw;
> +
> + return (delta * u_act) >> 20;
But that's not what is done here I think, something like this instead:
Uinact = Utot - Uact

            -t_u dt           ; Uinact > (1 - t_u)
     dq = {
            -(1 - Uinact) dt
And nowhere do we have an explanation for that.
Now, I suspect we can write that like: dq = -max{ t_u, (1 - Uinact) } dt,
which would suggest this is a sanity check on Utot, which I suspect can
be over 1. Is this what is happening?
#define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT)
static inline
u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
{
u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
u64 u_act;
/*
* What we want to write is:
*
* max(BW_UNIT - u_inact, dl_se->dl_bw)
*
* but we cannot do that since Utot can be larger than 1,
* which means u_inact can be larger than 1, which would
* have the above result in negative values.
*/
if (u_inact > (BW_UNIT - dl_se->dl_bw))
u_act = dl_se->dl_bw;
else
u_act = BW_UNIT - u_inact;
return (delta * u_act) >> BW_SHIFT;
}
Hmm?
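Tracing that version with made-up numbers: with this_bw = 1.2 and
running_bw = 0.3 on a runqueue, u_inact = 0.9; for a task with dl_bw = 0.25,
BW_UNIT - u_inact would be 0.1, below the task's own bandwidth, so the clamp
picks u_act = 0.25 and 10ms of execution consumes 2.5ms of runtime instead of
1ms. Without the clamp the task would be charged at only 0.1 and could end up
reclaiming bandwidth that belongs to tasks queued on other runqueues.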
Hi Peter,
On Mon, 27 Mar 2017 16:03:41 +0200
Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 04:53:02AM +0100, luca abeni wrote:
>
> > +static inline
> > +void __dl_update(struct dl_bw *dl_b, s64 bw)
> > +{
> > + struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
> > + int i;
> > +
> > + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> > + "sched RCU must be held");
> > + for_each_cpu_and(i, rd->span, cpu_active_mask) {
> > + struct rq *rq = cpu_rq(i);
> > +
> > + rq->dl.extra_bw += bw;
> > + }
>
> So this is unfortunate (and we already have one such instance).
>
> It effectively does a for_each_online_cpu() with IRQs disabled, and
> on SGI class hardware that takes _forever_.
I have to admit I copied this code from somewhere else... :)
I am happy to discuss a better solution.
Thanks,
Luca
>
> This is also what I got stuck on trying to rewrite AC to use Tommaso's
> recoverable thing. In the end I had to do a 2 stage try/commit
> variant. Which ended up being a pain and I didn't finish.
>
> I'm not saying this patch is bad, but this is something we need to
> think about.
On Mon, 27 Mar 2017 16:26:33 +0200
Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 04:53:01AM +0100, luca abeni wrote:
> > From: Luca Abeni <[email protected]>
> >
> > Instead of decreasing the runtime as "dq = -Uact dt" (eventually
> > divided by the maximum utilization available for deadline tasks),
> > decrease it as "dq = -(1 - Uinact) dt", where Uinact is the
> > "inactive utilization".
>
> > In this way, the maximum fraction of CPU time that can be reclaimed
> > is given by the total utilization of deadline tasks.
> > This approach solves some fairness issues that have been noticed
> > with "traditional" global GRUB reclaiming.
>
> I think the Changelog could do with explicit enumeration of what
> "some" is.
Sorry, when writing the changelog I've been lazy; I'll add a link to
Daniel's email showing the problem in action.
> > Signed-off-by: Luca Abeni <[email protected]>
> > Tested-by: Daniel Bristot de Oliveira <[email protected]>
> > ---
> > kernel/sched/deadline.c | 23 ++++++++++++++++-------
> > 1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index d70a7b9..c393c3d 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -900,14 +900,23 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
> > /*
> >  * This function implements the GRUB accounting rule:
> > * according to the GRUB reclaiming algorithm, the runtime is
> > + * not decreased as "dq = -dt", but as "dq = (1 - Uinact) dt", where
>
> Changelog had it right I think: dq = -(1 - Uinact) dt
Sorry about the typo... I'll fix it
> > + * Uinact is the (per-runqueue) inactive utilization, computed as the
> > + * difference between the "total runqueue utilization" and the runqueue
> > + * active utilization.
> > + * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
> > + * multiplied by 2^20, the result has to be shifted right by 20.
> > */
> > -u64 grub_reclaim(u64 delta, struct rq *rq)
> > +u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
> > {
> > + u64 u_act;
> > +
> > + if (rq->dl.this_bw - rq->dl.running_bw > (1 << 20) - u)
> > + u_act = u;
> > + else
> > + u_act = (1 << 20) - rq->dl.this_bw + rq->dl.running_bw;
> > +
> > + return (delta * u_act) >> 20;
>
> But that's not what is done here I think, something like this instead:
>
> Uinact = Utot - Uact
>
>        -t_u dt           ; Uinact > (1 - t_u)
> dq = {
>        -(1 - Uinact) dt  ; otherwise
>
>
> And nowhere do we have an explanation for that.
Sorry about this confusion... The accounting should be
dq = -(1 - Uinact)dt
but if (1 - Uinact) is too small (smaller than the task's utilization)
then we use the task's utilization instead (otherwise, we end up
reclaiming other runqueues' time). I realized that this check was
needed after writing the comments, and I forgot to update the comments
when I fixed the code :(
> Now, I suspect we can write that like: dq = -max{ t_u, (1 - Uinact) }
> dt, which would suggest this is a sanity check on Utot, which I
> suspect can be over 1. Is this what is happening?
Right... I'll fix the code and comments according to your suggestion.
Thanks,
Luca
> #define BW_SHIFT 20
> #define BW_UNIT (1 << BW_SHIFT)
>
> static inline
> u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
> {
> u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
> u64 u_act;
>
> /*
> * What we want to write is:
> *
> * max(BW_UNIT - u_inact, dl_se->dl_bw)
> *
> * but we cannot do that since Utot can be larger than 1,
> * which means u_inact can be larger than 1, which would
> * have the above result in negative values.
> */
> if (u_inact > (BW_UNIT - dl_se->dl_bw))
> u_act = dl_se->dl_bw;
> else
> u_act = BW_UNIT - u_inact;
>
> return (delta * u_act) >> BW_SHIFT;
> }
>
> Hmm?
On Mon, Mar 27, 2017 at 04:56:51PM +0200, Luca Abeni wrote:
> > > +u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
> > > {
> > > + u64 u_act;
> > > +
> > > + if (rq->dl.this_bw - rq->dl.running_bw > (1 << 20) - u)
> > > + u_act = u;
> > > + else
> > > + u_act = (1 << 20) - rq->dl.this_bw + rq->dl.running_bw;
> > > +
> > > + return (delta * u_act) >> 20;
> >
> > But that's not what is done here I think, something like this instead:
> >
> > Uinact = Utot - Uact
> >
> >        -t_u dt           ; Uinact > (1 - t_u)
> > dq = {
> >        -(1 - Uinact) dt  ; otherwise
> >
> >
> > And nowhere do we have an explanation for that.
>
> Sorry about this confusion... The accounting should be
> dq = -(1 - Uinact)dt
> but if (1 - Uinact) is too small (smaller than the task's utilization)
> then we use the task's utilization instead (otherwise, we end up
> reclaiming other runqueues' time). I realized that this check was
> needed after writing the comments, and I forgot to update the comments
> when I fixed the code :(
>
> > Now, I suspect we can write that like: dq = -max{ t_u, (1 - Uinact) }
> > dt, which would suggest this is a sanity check on Utot, which I
> > suspect can be over 1. Is this what is happening?
>
> Right... I'll fix the code and comments according to your suggestion.
But doesn't that suggest there is now another corner case where we
'always' select t_u because of Utot overload?
My intuition suggests we'd reclaim insufficient time in that case, but
I've not thought much about it.
I feel we want a few words explaining the trade-offs made here and the
corner cases explored.
Does that make sense?
On Mon, 27 Mar 2017 17:53:35 +0200
Peter Zijlstra <[email protected]> wrote:
> On Mon, Mar 27, 2017 at 04:56:51PM +0200, Luca Abeni wrote:
>
> > > > +u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
> > > > {
> > > > + u64 u_act;
> > > > +
> > > > + if (rq->dl.this_bw - rq->dl.running_bw > (1 << 20) - u)
> > > > + u_act = u;
> > > > + else
> > > > + u_act = (1 << 20) - rq->dl.this_bw + rq->dl.running_bw;
> > > > +
> > > > + return (delta * u_act) >> 20;
> > >
> > > But that's not what is done here I think, something like this
> > > instead:
> > >
> > > Uinact = Utot - Uact
> > >
> > >        -t_u dt           ; Uinact > (1 - t_u)
> > > dq = {
> > >        -(1 - Uinact) dt  ; otherwise
> > >
> > >
> > > And nowhere do we have an explanation for that.
> >
> > Sorry about this confusion... The accounting should be
> > dq = -(1 - Uinact)dt
> > but if (1 - Uinact) is too small (smaller than the task's
> > utilization) then we use the task's utilization instead (otherwise,
> > we end up reclaiming other runqueues' time). I realized that this
> > check was needed after writing the comments, and I forgot to update
> > the comments when I fixed the code :(
> >
> > > Now, I suspect we can write that like: dq = -max{ t_u, (1 -
> > > Uinact) } dt, which would suggest this is a sanity check on Utot,
> > > which I suspect can be over 1. Is this what is happening?
> >
> > Right... I'll fix the code and comments according to your
> > suggestion.
>
> But doesn't that suggest there is now another corner case where we
> 'always' select t_u because of Utot overload?
>
> My intuition suggests we'd reclaim insufficient time in that case, but
> I've not thought much about it.
Well, setting U_act = u_i (task utilization) means that task i is
reclaiming the whole CPU time (then, the next patch will make sure that
deadline tasks cannot consume 100% of the CPU time on a single CPU).
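(As an illustrative example with made-up numbers: a task with runtime 25 ms and period 100 ms has u_i = 0.25; if its runtime is depleted as dq = -u_i dt, the 25 ms budget lasts 25 / 0.25 = 100 ms of wall-clock time, i.e. the whole period, so the task can keep the CPU busy continuously until that limit kicks in.)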
> I feel we want a few words explaining the trade-offs made here and the
> corner cases explored.
>
> Does that make sense?
I think it is a good idea; maybe at the OSPM summit we can work on
finding the correct wording for these comments?
Thanks,
Luca
Hi Peter,
I put this change in a local tree together with other fixes / cleanups
I plan to submit in the next weeks. Should I send it together with the
other patches, or are you going to apply it separately?
In the first case, what is the correct authorship / SOB chain (I ask
because I keep getting this wrong every time :)
Thanks,
Luca
On Fri, 24 Mar 2017 14:23:51 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> > @@ -2518,6 +2520,7 @@ static int dl_overflow(struct task_struct *p, int policy,
> > !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> > __dl_clear(dl_b, p->dl.dl_bw);
> > __dl_add(dl_b, new_bw);
> > + dl_change_utilization(p, new_bw);
> > err = 0;
>
> Every time I see that I want to do this..
>
>
> ---
> kernel/sched/core.c | 4 ++--
> kernel/sched/deadline.c | 2 +-
> kernel/sched/sched.h | 2 +-
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3b31fc05a0f1..b845ee4b3e55 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2512,11 +2512,11 @@ static int dl_overflow(struct task_struct *p, int policy,
> err = 0;
> } else if (dl_policy(policy) && task_has_dl_policy(p) &&
> !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> - __dl_clear(dl_b, p->dl.dl_bw);
> + __dl_sub(dl_b, p->dl.dl_bw);
> __dl_add(dl_b, new_bw);
> err = 0;
> } else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> - __dl_clear(dl_b, p->dl.dl_bw);
> + __dl_sub(dl_b, p->dl.dl_bw);
> err = 0;
> }
> raw_spin_unlock(&dl_b->lock);
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a2ce59015642..229660088138 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1695,7 +1695,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
> * until we complete the update.
> */
> raw_spin_lock(&src_dl_b->lock);
> - __dl_clear(src_dl_b, p->dl.dl_bw);
> + __dl_sub(src_dl_b, p->dl.dl_bw);
> raw_spin_unlock(&src_dl_b->lock);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5cbf92214ad8..1a521324ecee 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -226,7 +226,7 @@ struct dl_bw {
> };
>
> static inline
> -void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
> +void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
> {
> dl_b->total_bw -= tsk_bw;
> }
Hi Peter,
On Fri, 24 Mar 2017 22:47:15 +0100
luca abeni <[email protected]> wrote:
> Hi Peter,
>
> On Fri, 24 Mar 2017 14:20:41 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index d67eee8..952cac8 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -445,16 +445,33 @@ struct sched_dl_entity {
> > > *
> > > * @dl_yielded tells if task gave up the CPU before consuming
> > > * all its available runtime during the last job.
> > > + *
> > > + * @dl_non_contending tells if task is inactive while still
> > > + * contributing to the active utilization. In other words, it
> > > + * indicates if the inactive timer has been armed and its handler
> > > + * has not been executed yet. This flag is useful to avoid race
> > > + * conditions between the inactive timer handler and the wakeup
> > > + * code.
> > > */
> > > int dl_throttled;
> > > int dl_boosted;
> > > int dl_yielded;
> > > + int dl_non_contending;
> >
> > We should maybe look into making those bits :1, not something for this
> > patch though;
>
> Yes, grouping all the flags in a single field was my intention too... I
> planned to submit a patch to do this after merging the reclaiming
> patches... But maybe it is better to do this first :)
I implemented this change, but before submitting the patch I have a
small question.
I implemented some helpers to access the various
{throttled,boosted,yielded,non_contending} flags. I have some
"dl_{throttled,boosted,...}()" inline functions for reading the values
of the flags, and some inline functions for setting / clearing the
flags. For these, I have two possibilities:
- using two separate "dl_set_{throttled,...}()" and
"dl_clear_{throttled,..}()" functions for each flag
- using one single "dl_set_{throttled,...}(dl, value)" function per
flag, in which the flag's value is specified.
I have no preferences (with the first proposal, I introduce more inline
functions, but I think the functions can be made more efficient /
optimized). Which one of the two proposals is preferred? (or, there is
a third, better, idea that I overlooked?)
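For concreteness, the first proposal could be generated with a macro along these lines (just a sketch: the single "flags" word and all the names below are made up, not taken from an actual patch):

struct dl_flags_word {
	unsigned int flags;
};

#define DL_FLAG_ACCESSORS(name, bit)					\
static inline int dl_##name(const struct dl_flags_word *dl)		\
{									\
	return !!(dl->flags & (1U << (bit)));				\
}									\
static inline void dl_set_##name(struct dl_flags_word *dl)		\
{									\
	dl->flags |= (1U << (bit));					\
}									\
static inline void dl_clear_##name(struct dl_flags_word *dl)		\
{									\
	dl->flags &= ~(1U << (bit));					\
}

DL_FLAG_ACCESSORS(throttled, 0)
DL_FLAG_ACCESSORS(boosted, 1)
DL_FLAG_ACCESSORS(yielded, 2)
DL_FLAG_ACCESSORS(non_contending, 3)

The second proposal would simply collapse dl_set_*() and dl_clear_*() into a single dl_set_*(dl, value) helper per flag.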
Thanks,
Luca
>
>
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index cef9adb..86aed82 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> > > dl_rq->running_bw = 0;
> > > }
> > >
> > > +void dl_change_utilization(struct task_struct *p, u64 new_bw)
> > > +{
> > > + if (!task_on_rq_queued(p)) {
> > > + struct rq *rq = task_rq(p);
> > > +
> > > + if (p->dl.dl_non_contending) {
> > > + sub_running_bw(p->dl.dl_bw, &rq->dl);
> > > + p->dl.dl_non_contending = 0;
> > > + /*
> > > + * If the timer handler is currently running and the
> > > + * timer cannot be cancelled, inactive_task_timer()
> > > + * will see that dl_not_contending is not set, and
> > > + * will not touch the rq's active utilization,
> > > + * so we are still safe.
> > > + */
> > > + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > > + put_task_struct(p);
> > > + }
> > > + }
> > > +}
> >
> > If we rearrange that slightly we can avoid the double indent:
> [...]
> Ok; I'll rework the code in this way on Monday.
>
> [...]
> > > + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
> > > + put_task_struct(dl_task_of(dl_se));
> > > + dl_se->dl_non_contending = 0;
> >
> > This had me worried for a little while as being a use-after-free, but
> > this put_task_struct() cannot be the last in this case. Might be
> > worth a comment.
>
> Or maybe it is better to move "dl_se->dl_non_contending = 0;" before
> hrtimer_try_to_cancel()?
>
> >
> > > + } else {
> > > + /*
> > > + * Since "dl_non_contending" is not set, the
> > > + * task's utilization has already been removed from
> > > + * active utilization (either when the task blocked,
> > > + * when the "inactive timer" fired).
> > > + * So, add it back.
> > > + */
> > > + add_running_bw(dl_se->dl_bw, dl_rq);
> > > + }
> > > +}
> >
> > In general I feel it would be nice to have a state diagram included
> > somewhere near these two functions. It would be nice to not have to
> > dig out the PDF every time.
>
> Ok... Since I am not good at ascii art, would it be ok to add a textual
> description? If yes, I'll add a comment like:
> "
> The utilization of a task is added to the runqueue's active utilization
> when the task becomes active (is enqueued in the runqueue), and is
> removed when the task becomes inactive. A task does not become
> immediately inactive when it blocks, but becomes inactive at the so
> called "0 lag time"; so, we setup the "inactive timer" to fire at the
> "0 lag time". When the "inactive timer" fires, the task utilization is
> removed from the runqueue's active utilization. If the task wakes up
> again on the same runqueue before the "0 lag time", the active
> utilization must not be changed and the "inactive timer" must be
> cancelled. If the task wakes up again on a different runqueue before
> the "0 lag time", then the task's utilization must be removed from the
> previous runqueue's active utilization and must be added to the new
> runqueue's active utilization.
> In order to avoid races between a task waking up on a runqueue while the
> "inactive timer" is running on a different CPU, the "dl_non_contending"
> flag is used to indicate that a task is not on a runqueue but is active
> (so, the flag is set when the task blocks and is cleared when the
> "inactive timer" fires or when the task wakes up).
> "
> (if this is ok, where can I add this comment?)
>
>
> > > +static void migrate_task_rq_dl(struct task_struct *p)
> > > +{
> > > + if ((p->state == TASK_WAKING) && (p->dl.dl_non_contending)) {
> > > + struct rq *rq = task_rq(p);
> > > +
> > > + raw_spin_lock(&rq->lock);
> > > + sub_running_bw(p->dl.dl_bw, &rq->dl);
> > > + p->dl.dl_non_contending = 0;
> > > + /*
> > > + * If the timer handler is currently running and the
> > > + * timer cannot be cancelled, inactive_task_timer()
> > > + * will see that dl_not_contending is not set, and
> > > + * will not touch the rq's active utilization,
> > > + * so we are still safe.
> > > + */
> > > + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > > + put_task_struct(p);
> > > +
> > > + raw_spin_unlock(&rq->lock);
> > > + }
> > > +}
> >
> > This can similarly be reduced in indent, albeit this is only a single
> > indent level, so not as onerous as the other one.
>
> Ok; I'll do this on Monday.
>
>
> > What had me puzzled for a while here is taking the lock; because some
> > callers of set_task_cpu() must in fact hold this lock already. Worth a
> > comment I feel.
>
> Ok; I'll add a comment mentioning that since state is TASK_WAKING we do
> not have the lock.
>
>
> > Once I figured out the exact locking; I realized you do this on
> > cross-cpu wakeups. We spend a fair amount of effort not to take both
> > rq locks. But I suppose you have to here.
>
> The problem is that when a task wakes up before the "0 lag time" on a
> different runqueue, we must "migrate" its utilization from the old
> runqueue to the new one (see comment above). And I need the lock to
> avoid races... The only alternative I can think of is to add a new
> lock protecting the active utilization, and to take it instead of the
> runqueue lock (I do not know if this is enough, I need to check...).
> I'll have a look on Monday.
>
>
>
> Thanks,
> Luca
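As a side note on the state description quoted above: the same information could also be condensed into a small transition list next to the text, along these lines (an informal sketch; the state names are only illustrative and do not appear in the code):

 * INACTIVE        -- wakeup (enqueue) ----------------------> CONTENDING
 *                    running_bw += dl_bw
 *
 * CONTENDING      -- blocks (dequeue) ----------------------> NON CONTENDING
 *                    arm the "inactive timer" to fire at the
 *                    "0 lag time", dl_non_contending = 1
 *
 * NON CONTENDING  -- "inactive timer" fires at "0 lag time" -> INACTIVE
 *                    running_bw -= dl_bw, dl_non_contending = 0
 *
 * NON CONTENDING  -- wakeup before the "0 lag time" ---------> CONTENDING
 *                    cancel the timer, dl_non_contending = 0;
 *                    if the wakeup is on a different rq, also
 *                    move the utilization to the new rq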
On Mon, Jul 24, 2017 at 10:06:09AM +0200, Luca Abeni wrote:
> > Yes, grouping all the flags in a single field was my intention too... I
> > planned to submit a patch to do this after merging the reclaiming
> > patches... But maybe it is better to do this first :)
>
> I implemented this change, but before submitting the patch I have a
> small question.
> I implemented some helpers to access the various
> {throttled,boosted,yielded,non_contending} flags. I have some
> "dl_{throttled,boosted,...}()" inline functions for reading the values
> of the flags, and some inline functions for setting / clearing the
> flags. For these, I have two possibilities:
> - using two separate "dl_set_{throttled,...}()" and
> "dl_clear_{throttled,..}()" functions for each flag
> - using one single "dl_set_{throttled,...}(dl, value)" function per
> flag, in which the flag's value is specified.
>
> I have no preferences (with the first proposal, I introduce more inline
> functions, but I think the functions can be made more efficient /
> optimized). Which one of the two proposals is preferred? (or, there is
> a third, better, idea that I overlooked?)
- Use bitfields and let the compiler sort it out.
- Use macros to generate all the inlines as per the first.
Personally, because I'm lazy, I'd try the bitfield thing first and see
what kind of code that generates. If that's not too horrendous, keep it :-)
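For reference, the bitfield variant would be shaped roughly like this (a minimal sketch with an example struct, not the real sched_dl_entity layout):

struct dl_flags_bits {
	unsigned int dl_throttled	: 1;
	unsigned int dl_boosted		: 1;
	unsigned int dl_yielded		: 1;
	unsigned int dl_non_contending	: 1;
};

/* The flags are then read and written like plain integers and the
 * compiler emits the masking itself: */
static inline void dl_block_example(struct dl_flags_bits *f)
{
	f->dl_non_contending = 1;
}

static inline int dl_is_throttled_example(const struct dl_flags_bits *f)
{
	return f->dl_throttled;
}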
On Mon, Jul 24, 2017 at 09:54:54AM +0200, Luca Abeni wrote:
> Hi Peter,
>
> I put this change in a local tree together with other fixes / cleanups
> I plan to submit in the next weeks. Should I send it together with the
> other patches, or are you going to apply it separately?
Posting them in a series is fine; it is customary to put independent
things first such that they will not get stuck after the larger changes.
> In the first case, what is the correct authorship / SOB chain (I ask
> because I keep getting this wrong every time :)
Yes, this is a 'fun' case :-) I'd just merge the change into your patch
introducing it and forget I 'contributed' the name change.
For larger patches you could do something like (in your email body):
From: Peter Zijlstra <[email protected]>
Changelog goes here...
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Luca...
---
$PATCH
Which says this patch is from me, carried by you, and then I'll stick
another SoB on to indicate I took it back. It's a bit weird, but we've
done it before.
On Mon, 24 Jul 2017 11:04:52 +0200
Peter Zijlstra <[email protected]> wrote:
> On Mon, Jul 24, 2017 at 10:06:09AM +0200, Luca Abeni wrote:
> > > Yes, grouping all the flags in a single field was my intention too... I
> > > planned to submit a patch to do this after merging the reclaiming
> > > patches... But maybe it is better to do this first :)
> >
> > I implemented this change, but before submitting the patch I have a
> > small question.
> > I implemented some helpers to access the various
> > {throttled,boosted,yielded,non_contending} flags. I have some
> > "dl_{throttled,boosted,...}()" inline functions for reading the values
> > of the flags, and some inline functions for setting / clearing the
> > flags. For these, I have two possibilities:
>
> > - using two separate "dl_set_{throttled,...}()" and
> > "dl_clear_{throttled,..}()" functions for each flag
>
> > - using one single "dl_set_{throttled,...}(dl, value)" function per
> > flag, in which the flag's value is specified.
> >
> > I have no preferences (with the first proposal, I introduce more inline
> > functions, but I think the functions can be made more efficient /
> > optimized). Which one of the two proposals is preferred? (or, there is
> > a third, better, idea that I overlooked?)
>
> - Use bitfields and let the compiler sort it out.
>
> - Use macros to generate all the inlines as per the first.
>
>
> Personally, because I'm lazy, I'd try the bitfield thing first and see
> what kind of code that generates. If that's not too horrendous, keep it :-)
Thanks for the suggestions; I'll test the C bitfields and I'll see how
the assembly generated by gcc compares with the inline functions (I did
not propose this idea originally because I got the impression that
people tend not to trust gcc)
Thanks,
Luca
On Mon, 24 Jul 2017 11:11:30 +0200
Peter Zijlstra <[email protected]> wrote:
> On Mon, Jul 24, 2017 at 09:54:54AM +0200, Luca Abeni wrote:
> > Hi Peter,
> >
> > I put this change in a local tree together with other fixes / cleanups
> > I plan to submit in the next weeks. Should I send it together with the
> > other patches, or are you going to apply it separately?
>
> Posting them in a series is fine; it is customary to put independent
> things first such that they will not get stuck after the larger changes.
>
> > In the first case, what is the correct authorship / SOB chain (I ask
> > because I keep getting this wrong every time :)
>
> Yes, this is a 'fun' case :-) I'd just merge the change into your patch
> introducing it and forget I 'contributed' the name change.
I think this patch is independent from the other patches I have in my
tree... So, I will go for the solution you describe below.
Thanks,
Luca
>
> For larger patches you could do something like (in your email body):
>
>
> From: Peter Zijlstra <[email protected]>
>
> Changelog goes here...
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Luca...
> ---
>
> $PATCH
>
>
> Which says this patch is from me, carried by you, and then I'll stick
> another SoB on to indicate I took it back. It's a bit weird, but we've
> done it before.