2016-04-01 15:18:24

by Luca Abeni

Subject: [RFC v2 0/7] CPU reclaiming for SCHED_DEADLINE

Hi all,

this patchset implements CPU reclaiming (using the GRUB algorithm[1])
for SCHED_DEADLINE: basically, this feature allows SCHED_DEADLINE tasks
to consume more than their reserved runtime, up to a maximum fraction
of the CPU time (so that other tasks are left some spare CPU time to
execute), if this does not break the guarantees of other SCHED_DEADLINE
tasks.
The patchset applies on top of tip/master.
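
To enable reclaiming for a task (patch 0006 makes it opt-in through a new
sched_setattr() flag), something like the following can be used. This is
only an illustrative sketch: it assumes the patched uapi headers (which
define SCHED_FLAG_RECLAIM), and struct sched_attr is defined by hand
because glibc does not export it:

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* SCHED_DEADLINE, SCHED_FLAG_RECLAIM */

/* Not exported by glibc; same layout as the kernel's struct sched_attr */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size           = sizeof(attr),
		.sched_policy   = SCHED_DEADLINE,
		.sched_flags    = SCHED_FLAG_RECLAIM,	/* enable GRUB reclaiming */
		.sched_runtime  =  10 * 1000 * 1000,	/*  10 ms */
		.sched_deadline = 100 * 1000 * 1000,	/* 100 ms */
		.sched_period   = 100 * 1000 * 1000,	/* 100 ms */
	};

	/* pid 0 = the calling task */
	return syscall(__NR_sched_setattr, 0, &attr, 0);
}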

With respect to the first version of the RFC:
- I tried to address all the comments I received
- I removed some patches that were not really used in the patchset
- I rebased on tip/master
- I removed all the checkpatch warnings
- I added a new patch (patch 0004) to update the total -deadline
utilization (dl_b->total_bw) at the correct time, addressing the
large comment in __setparam_dl() (see both dl_overflow() and
__setparam_dl()).


The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of active tasks (first 3 patches), and modifying the
runtime accounting rule (see patch 0005). The original GRUB algorithm is
modified as described in [2] to support multiple CPUs (the original
algorithm only considered a single CPU; this one tracks U_act per
runqueue) and to leave an "unreclaimable" fraction of CPU time to
non-SCHED_DEADLINE tasks (the original algorithm can consume 100% of the CPU
time, starving all the other tasks).
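
As a concrete example (numbers only for illustration): with a single active
SCHED_DEADLINE task of utilisation 0.25 (runtime 25ms every 100ms), U_act is
0.25 and the runtime is depleted as "dq = -0.25 dt", so the 25ms budget lasts
for 100ms of actual execution; the task can potentially run for its whole
period and reclaim all the CPU time left idle by other tasks (patch 0007
then limits the reclaimable fraction, so that some CPU time is always left
to non-deadline tasks).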

I tried to split the patches so that the whole patchset can be better
understood; if they should be organized in a different way, let me know.
The first 3 patches (tracking of per-runqueue active utilization) can
be useful for frequency scaling too.
Patches 0005-0007 implement the reclaiming algorithm, and patch 0004
uses the "inactive timer" introduced in patch 0003
to fix dl_overflow() and __setparam_dl().


Luca Abeni (7):
Track the active utilisation
Correctly track the active utilisation for migrating tasks
Improve the tracking of active utilisation
Fix the update of the total -deadline utilization
GRUB accounting
Make GRUB a task's flag
Do not reclaim the whole CPU bandwidth

include/linux/sched.h | 1 +
include/uapi/linux/sched.h | 1 +
kernel/sched/core.c | 44 ++++-----
kernel/sched/deadline.c | 225 +++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 13 +++
5 files changed, 239 insertions(+), 45 deletions(-)

--
2.5.0


2016-04-01 15:18:26

by Luca Abeni

Subject: [RFC v2 2/7] Correctly track the active utilisation for migrating tasks

Fix active utilisation accounting on migration: when a task is migrated
from CPUi to CPUj, immediately subtract the task's utilisation from
CPUi and add it to CPUj. This mechanism is implemented by modifying the
pull and push functions.

Note: this is not fully correct from the theoretical point of view
(the utilisation should be removed from CPUi only at the 0 lag time),
but doing the right thing would be _MUCH_ more complex (leaving the
timer armed when the task is on a different CPU... Inactive timers should
be moved from per-task timers to per-runqueue lists of timers! Bah...)

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/deadline.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3c64ebf..05cfccb 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1530,7 +1530,9 @@ retry:
}

deactivate_task(rq, next_task, 0);
+ sub_running_bw(&next_task->dl, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
+ add_running_bw(&next_task->dl, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;

@@ -1618,7 +1620,9 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;

deactivate_task(src_rq, p, 0);
+ sub_running_bw(&p->dl, &src_rq->dl);
set_task_cpu(p, this_cpu);
+ add_running_bw(&p->dl, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;

--
2.5.0

2016-04-01 15:18:28

by Luca Abeni

Subject: [RFC v2 5/7] GRUB accounting

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/deadline.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ca7910a..647d779 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -783,6 +783,19 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);

/*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^20, the result
+ * has to be shifted right by 20.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+ return (delta * rq->dl.running_bw) >> 20;
+}
+
+/*
* Update the current task's runtime statistics (provided it is still
* a -deadline task and has not been removed from the dl_rq).
*/
@@ -821,6 +834,7 @@ static void update_curr_dl(struct rq *rq)

sched_rt_avg_update(rq, delta_exec);

+ delta_exec = grub_reclaim(delta_exec, rq);
dl_se->runtime -= delta_exec;

throttle:
--
2.5.0

2016-04-01 15:18:36

by Luca Abeni

Subject: [RFC v2 7/7] Do not reclaim the whole CPU bandwidth

The original GRUB algorithm tends to reclaim 100% of the CPU time... And this
allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a specified
fraction of CPU time.
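
For example (numbers only for illustration), with the default rt bandwidth
settings (sched_rt_runtime_us = 950000, sched_rt_period_us = 1000000) the
unreclaimable fraction is 1 - 0.95 = 0.05, stored as 0.05 * 2^20 in
non_deadline_bw; the accounting rule then becomes "dq = -(0.05 + Uact) dt",
so a single reclaiming task with utilisation U consumes at most
U / (U + 0.05) of the CPU (about 83% for U = 0.25) instead of reclaiming
all of it.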

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/core.c | 4 ++++
kernel/sched/deadline.c | 7 ++++++-
kernel/sched/sched.h | 6 ++++++
3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3224132..b22fe83 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7941,6 +7941,10 @@ static void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

rcu_read_unlock_sched();
+ if (dl_b->bw == -1)
+ cpu_rq(cpu)->dl.non_deadline_bw = 0;
+ else
+ cpu_rq(cpu)->dl.non_deadline_bw = (1 << 20) - new_bw;
}
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b56f76f..b6ecec2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -154,6 +154,11 @@ void init_dl_rq(struct dl_rq *dl_rq)
#else
init_dl_bw(&dl_rq->dl_bw);
#endif
+ if (global_rt_runtime() == RUNTIME_INF)
+ dl_rq->non_deadline_bw = 0;
+ else
+ dl_rq->non_deadline_bw = (1 << 20) -
+ to_ratio(global_rt_period(), global_rt_runtime());
}

#ifdef CONFIG_SMP
@@ -792,7 +797,7 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
*/
u64 grub_reclaim(u64 delta, struct rq *rq)
{
- return (delta * rq->dl.running_bw) >> 20;
+ return (delta * (rq->dl.non_deadline_bw + rq->dl.running_bw)) >> 20;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22d36b2..9fb3413 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -526,6 +526,12 @@ struct dl_rq {
* task blocks
*/
s64 running_bw;
+
+ /*
+ * Fraction of the CPU utilization that cannot be reclaimed
+ * by the GRUB algorithm.
+ */
+ s64 non_deadline_bw;
};

#ifdef CONFIG_SMP
--
2.5.0

2016-04-01 15:19:18

by Luca Abeni

Subject: [RFC v2 6/7] Make GRUB a task's flag

Signed-off-by: Luca Abeni <[email protected]>
---
include/uapi/linux/sched.h | 1 +
kernel/sched/core.c | 3 ++-
kernel/sched/deadline.c | 3 ++-
3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..9279562 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -48,5 +48,6 @@
* For the sched_{set,get}attr() calls
*/
#define SCHED_FLAG_RESET_ON_FORK 0x01
+#define SCHED_FLAG_RECLAIM 0x02

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4158d1f..3224132 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3822,7 +3822,8 @@ recheck:
return -EINVAL;
}

- if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+ if (attr->sched_flags &
+ ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
return -EINVAL;

/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 647d779..b56f76f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -834,7 +834,8 @@ static void update_curr_dl(struct rq *rq)

sched_rt_avg_update(rq, delta_exec);

- delta_exec = grub_reclaim(delta_exec, rq);
+ if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+ delta_exec = grub_reclaim(delta_exec, rq);
dl_se->runtime -= delta_exec;

throttle:
--
2.5.0

2016-04-01 15:19:34

by Luca Abeni

Subject: [RFC v2 4/7] Fix the update of the total -deadline utilization

Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().
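
For example, without this change a task switching from SCHED_DEADLINE to
SCHED_OTHER has its bandwidth removed from dl_b->total_bw immediately, so a
new -deadline task can be admitted right away even though the old bandwidth
should still be considered "in use" until the 0-lag time; this can cause a
transient overload. With this change, the bandwidth is released by the
inactive timer (or immediately, if the 0-lag time has already passed).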

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/core.c | 36 ++++++++++++------------------------
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++--------
2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 23d235c..4158d1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2347,9 +2347,6 @@ static inline int dl_bw_cpus(int i)
* allocated bandwidth to reflect the new situation.
*
* This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
*/
static int dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
@@ -2377,11 +2374,22 @@ static int dl_overflow(struct task_struct *p, int policy,
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ /*
+ * XXX this is slightly incorrect: when the task
+ * utilization decreases, we should delay the total
+ * utilization change until the task's 0-lag point.
+ * But this would require to set the task's "inactive
+ * timer" when the task is not inactive.
+ */
__dl_clear(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ /*
+ * Do not decrease the total deadline utilization here,
+ * switched_from_dl() will take care to do it at the correct
+ * (0-lag) time.
+ */
err = 0;
}
raw_spin_unlock(&dl_b->lock);
@@ -3647,26 +3655,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
dl_se->flags = attr->sched_flags;
dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
- /*
- * Changing the parameters of a task is 'tricky' and we're not doing
- * the correct thing -- also see task_dead_dl() and switched_from_dl().
- *
- * What we SHOULD do is delay the bandwidth release until the 0-lag
- * point. This would include retaining the task_struct until that time
- * and change dl_overflow() to not immediately decrement the current
- * amount.
- *
- * Instead we retain the current runtime/deadline and let the new
- * parameters take effect after the current reservation period lapses.
- * This is safe (albeit pessimistic) because the 0-lag point is always
- * before the current scheduling deadline.
- *
- * We can still have temporary overloads because we do not delay the
- * change in bandwidth until that time; so admission control is
- * not on the safe side. It does however guarantee tasks will never
- * consume more than promised.
- */
}

/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 97cd5f2..ca7910a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -98,8 +98,14 @@ static void task_go_inactive(struct task_struct *p)
*/
if (ktime_us_delta(act, now) < 0) {
sub_running_bw(dl_se, dl_rq);
- if (!dl_task(p))
+ if (!dl_task(p)) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ raw_spin_lock(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
__dl_clear_params(p);
+ raw_spin_unlock(&dl_b->lock);
+ }

return;
}
@@ -865,8 +871,13 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)

rq = task_rq_lock(p, &flags);

- if (!dl_task(p)) {
+ if (!dl_task(p) || p->state == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ raw_spin_lock(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
__dl_clear_params(p);
+ raw_spin_unlock(&dl_b->lock);

goto unlock;
}
@@ -1341,15 +1352,21 @@ static void task_fork_dl(struct task_struct *p)

static void task_dead_dl(struct task_struct *p)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-
/*
* Since we are TASK_DEAD we won't slip out of the domain!
*/
- raw_spin_lock_irq(&dl_b->lock);
- /* XXX we should retain the bw until 0-lag */
- dl_b->total_bw -= p->dl.dl_bw;
- raw_spin_unlock_irq(&dl_b->lock);
+ if (!hrtimer_active(&p->dl.inactive_timer)) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ /*
+ * If the "inactive timer is not active, the 0-lag time
+ * is already passed, so we immediately decrease the
+ * total deadline utilization
+ */
+ raw_spin_lock_irq(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
+ raw_spin_unlock_irq(&dl_b->lock);
+ }
}

static void set_curr_task_dl(struct rq *rq)
--
2.5.0

2016-04-01 15:20:41

by Luca Abeni

Subject: [RFC v2 3/7] Improve the tracking of active utilisation

This patch implements a more theoretically sound algorithm for
tracking the active utilisation: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilisation at the so-called "0-lag time".
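
For example (numbers only for illustration): a task with dl_runtime = 10ms
and dl_period = 100ms that blocks with 4ms of runtime left and its deadline
100ms in the future has its 0-lag time at
deadline - runtime * dl_period / dl_runtime, i.e. (100 - 4 * 100 / 10) ms =
60ms in the future; the inactive timer is armed to decrease the active
utilisation at that point, instead of decreasing it immediately when the
task blocks.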

Signed-off-by: Luca Abeni <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/deadline.c | 158 +++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 1 +
4 files changed, 139 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c617ea1..f285461 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,6 +1351,7 @@ struct sched_dl_entity {
* own bandwidth to be enforced, thus we need one timer per task.
*/
struct hrtimer dl_timer;
+ struct hrtimer inactive_timer;
};

union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de38077..23d235c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2076,6 +2076,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)

RB_CLEAR_NODE(&p->dl.rb_node);
init_dl_task_timer(&p->dl);
+ init_inactive_task_timer(&p->dl);
__dl_clear_params(p);

INIT_LIST_HEAD(&p->rt.run_list);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 05cfccb..97cd5f2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -47,6 +47,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
u64 se_bw = dl_se->dl_bw;

+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
dl_rq->running_bw += se_bw;
}

@@ -54,11 +55,59 @@ static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
u64 se_bw = dl_se->dl_bw;

+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
dl_rq->running_bw -= se_bw;
if (WARN_ON(dl_rq->running_bw < 0))
dl_rq->running_bw = 0;
}

+static void task_go_inactive(struct task_struct *p)
+{
+ struct sched_dl_entity *dl_se = &p->dl;
+ struct hrtimer *timer = &dl_se->inactive_timer;
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+ ktime_t now, act;
+ s64 delta;
+ u64 zerolag_time;
+
+ WARN_ON(dl_se->dl_runtime == 0);
+
+ /* If the inactive timer is already armed, return immediately */
+ if (hrtimer_active(&dl_se->inactive_timer))
+ return;
+
+
+ /*
+ * We want the timer to fire at the "0 lag time", but considering
+ * that it is actually coming from rq->clock and not from
+ * hrtimer's time base reading.
+ */
+ zerolag_time = dl_se->deadline -
+ div64_long((dl_se->runtime * dl_se->dl_period),
+ dl_se->dl_runtime);
+
+ act = ns_to_ktime(zerolag_time);
+ now = hrtimer_cb_get_time(timer);
+ delta = ktime_to_ns(now) - rq_clock(rq);
+ act = ktime_add_ns(act, delta);
+
+ /*
+ * If the "0-lag time" already passed, decrease the active
+ * utilization now, instead of starting a timer
+ */
+ if (ktime_us_delta(act, now) < 0) {
+ sub_running_bw(dl_se, dl_rq);
+ if (!dl_task(p))
+ __dl_clear_params(p);
+
+ return;
+ }
+
+ get_task_struct(p);
+ hrtimer_start(timer, act, HRTIMER_MODE_ABS);
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -526,7 +575,18 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);

- add_running_bw(dl_se, dl_rq);
+ /*
+ * If the "inactive timer" is still active, stop it and leave
+ * the active utilisation unchanged.
+ * Otherwise, increase the active utilisation.
+ * If the timer cannot be cancelled, inactive_task_timer() will
+ * find the task state as TASK_RUNNING, and will do nothing, so
+ * we are still safe.
+ */
+ if (hrtimer_active(&dl_se->inactive_timer))
+ hrtimer_try_to_cancel(&dl_se->inactive_timer);
+ else
+ add_running_bw(dl_se, dl_rq);

if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
@@ -614,14 +674,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)

rq = task_rq_lock(p, &flags);

- /*
- * The task might have changed its scheduling policy to something
- * different than SCHED_DEADLINE (through switched_fromd_dl()).
- */
- if (!dl_task(p)) {
- __dl_clear_params(p);
+ if (!dl_task(p))
goto unlock;
- }

/*
* The task might have been boosted by someone else and might be in the
@@ -800,6 +854,44 @@ throttle:
}
}

+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+ struct sched_dl_entity *dl_se = container_of(timer,
+ struct sched_dl_entity,
+ inactive_timer);
+ struct task_struct *p = dl_task_of(dl_se);
+ unsigned long flags;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+
+ if (!dl_task(p)) {
+ __dl_clear_params(p);
+
+ goto unlock;
+ }
+ if (p->state == TASK_RUNNING)
+ goto unlock;
+
+ sched_clock_tick();
+ update_rq_clock(rq);
+
+ sub_running_bw(dl_se, &rq->dl);
+unlock:
+ task_rq_unlock(rq, p, &flags);
+ put_task_struct(p);
+
+ return HRTIMER_NORESTART;
+}
+
+void init_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+ struct hrtimer *timer = &dl_se->inactive_timer;
+
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ timer->function = inactive_task_timer;
+}
+
#ifdef CONFIG_SMP

static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -976,7 +1068,8 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* run yet) will take care of this.
*/
if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
- add_running_bw(&p->dl, &rq->dl);
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) < 0)
+ add_running_bw(&p->dl, &rq->dl);
return;
}

@@ -997,7 +1090,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
update_curr_dl(rq);
__dequeue_task_dl(rq, p, flags);
if (flags & DEQUEUE_SLEEP)
- sub_running_bw(&p->dl, &rq->dl);
+ task_go_inactive(p);
}

/*
@@ -1071,6 +1164,23 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
}
rcu_read_unlock();

+ if (rq != cpu_rq(cpu)) {
+ int migrate_active;
+
+ raw_spin_lock(&rq->lock);
+ migrate_active = hrtimer_active(&p->dl.inactive_timer);
+ if (migrate_active)
+ sub_running_bw(&p->dl, &rq->dl);
+ raw_spin_unlock(&rq->lock);
+ if (migrate_active) {
+ rq = cpu_rq(cpu);
+ raw_spin_lock(&rq->lock);
+ add_running_bw(&p->dl, &rq->dl);
+ raw_spin_unlock(&rq->lock);
+ }
+ }
+
+
out:
return cpu;
}
@@ -1232,8 +1342,6 @@ static void task_fork_dl(struct task_struct *p)
static void task_dead_dl(struct task_struct *p)
{
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
- struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
- struct rq *rq = rq_of_dl_rq(dl_rq);

/*
* Since we are TASK_DEAD we won't slip out of the domain!
@@ -1242,9 +1350,6 @@ static void task_dead_dl(struct task_struct *p)
/* XXX we should retain the bw until 0-lag */
dl_b->total_bw -= p->dl.dl_bw;
raw_spin_unlock_irq(&dl_b->lock);
-
- if (task_on_rq_queued(p))
- sub_running_bw(&p->dl, &rq->dl);
}

static void set_curr_task_dl(struct rq *rq)
@@ -1720,15 +1825,23 @@ void __init init_sched_dl_class(void)
static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
/*
- * Start the deadline timer; if we switch back to dl before this we'll
- * continue consuming our current CBS slice. If we stay outside of
- * SCHED_DEADLINE until the deadline passes, the timer will reset the
- * task.
+ * task_go_inactive() can start the "inactive timer" (if the 0-lag
+ * time is in the future). If the task switches back to dl before
+ * the "inactive timer" fires, it can continue to consume its current
+ * runtime using its current deadline. If it stays outside of
+ * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+ * will reset the task parameters.
*/
- if (!start_dl_timer(p))
- __dl_clear_params(p);
+ if (task_on_rq_queued(p) && p->dl.dl_runtime)
+ task_go_inactive(p);

- if (task_on_rq_queued(p))
+ /*
+ * We cannot use inactive_task_timer() to invoke sub_running_bw()
+ * at the 0-lag time, because the task could have been migrated
+ * while SCHED_OTHER in the meanwhile.
+ */
+ if (hrtimer_active(&p->dl.inactive_timer) &&
+ !hrtimer_callback_running(&p->dl.inactive_timer))
sub_running_bw(&p->dl, &rq->dl);

/*
@@ -1750,6 +1863,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
if (dl_time_before(p->dl.deadline, rq_clock(rq)))
setup_new_dl_entity(&p->dl, &p->dl);
+ add_running_bw(&p->dl, &rq->dl);

if (task_on_rq_queued(p) && rq->curr != p) {
#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc05c29..22d36b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1315,6 +1315,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
extern struct dl_bandwidth def_dl_bandwidth;
extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);

unsigned long to_ratio(u64 period, u64 runtime);

--
2.5.0

2016-04-01 15:21:28

by Luca Abeni

Subject: [RFC v2 1/7] Track the active utilisation

The active utilisation here is defined as the total utilisation of the
active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.
This might need to be fixed / improved by decreasing the active
utilisation at the so-called "0-lag time" instead of when the task blocks.
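
For example (numbers only for illustration): a task with runtime 25ms every
100ms has dl_bw = to_ratio(100ms, 25ms) = 0.25 * 2^20 = 262144; this value
is added to rq->dl.running_bw when the task wakes up and subtracted from it
when the task blocks.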

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/deadline.c | 32 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 6 ++++++
2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c7a036f..3c64ebf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,22 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
return !RB_EMPTY_NODE(&dl_se->rb_node);
}

+static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ u64 se_bw = dl_se->dl_bw;
+
+ dl_rq->running_bw += se_bw;
+}
+
+static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ u64 se_bw = dl_se->dl_bw;
+
+ dl_rq->running_bw -= se_bw;
+ if (WARN_ON(dl_rq->running_bw < 0))
+ dl_rq->running_bw = 0;
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -510,6 +526,8 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);

+ add_running_bw(dl_se, dl_rq);
+
if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
@@ -957,8 +975,10 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* its rq, the bandwidth timer callback (which clearly has not
* run yet) will take care of this.
*/
- if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+ if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+ add_running_bw(&p->dl, &rq->dl);
return;
+ }

enqueue_dl_entity(&p->dl, pi_se, flags);

@@ -976,6 +996,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
__dequeue_task_dl(rq, p, flags);
+ if (flags & DEQUEUE_SLEEP)
+ sub_running_bw(&p->dl, &rq->dl);
}

/*
@@ -1210,6 +1232,8 @@ static void task_fork_dl(struct task_struct *p)
static void task_dead_dl(struct task_struct *p)
{
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
+ struct rq *rq = rq_of_dl_rq(dl_rq);

/*
* Since we are TASK_DEAD we won't slip out of the domain!
@@ -1218,6 +1242,9 @@ static void task_dead_dl(struct task_struct *p)
/* XXX we should retain the bw until 0-lag */
dl_b->total_bw -= p->dl.dl_bw;
raw_spin_unlock_irq(&dl_b->lock);
+
+ if (task_on_rq_queued(p))
+ sub_running_bw(&p->dl, &rq->dl);
}

static void set_curr_task_dl(struct rq *rq)
@@ -1697,6 +1724,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
if (!start_dl_timer(p))
__dl_clear_params(p);

+ if (task_on_rq_queued(p))
+ sub_running_bw(&p->dl, &rq->dl);
+
/*
* Since this might be the only -deadline task on the rq,
* this is the right place to try to pull some other one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e6d4a3f..bc05c29 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -520,6 +520,12 @@ struct dl_rq {
#else
struct dl_bw dl_bw;
#endif
+ /*
+ * "Active utilization" for this runqueue: increased when a
+ * task wakes up (becomes TASK_RUNNING) and decreased when a
+ * task blocks
+ */
+ s64 running_bw;
};

#ifdef CONFIG_SMP
--
2.5.0

2016-04-05 12:23:34

by Peter Zijlstra

Subject: Re: [RFC v2 1/7] Track the active utilisation

On Fri, Apr 01, 2016 at 05:12:27PM +0200, Luca Abeni wrote:
> @@ -1210,6 +1232,8 @@ static void task_fork_dl(struct task_struct *p)
> static void task_dead_dl(struct task_struct *p)
> {
> struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> + struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
> + struct rq *rq = rq_of_dl_rq(dl_rq);
>
> /*
> * Since we are TASK_DEAD we won't slip out of the domain!
> @@ -1218,6 +1242,9 @@ static void task_dead_dl(struct task_struct *p)
> /* XXX we should retain the bw until 0-lag */
> dl_b->total_bw -= p->dl.dl_bw;
> raw_spin_unlock_irq(&dl_b->lock);
> +
> + if (task_on_rq_queued(p))
> + sub_running_bw(&p->dl, &rq->dl);
> }
>

A dead task cannot be running, nor be queued, right? ISTR you remove
this hunk in a later patch as well.

2016-04-05 12:24:30

by Peter Zijlstra

Subject: Re: [RFC v2 2/7] Correctly track the active utilisation for migrating tasks

On Fri, Apr 01, 2016 at 05:12:28PM +0200, Luca Abeni wrote:
> Fix active utilisation accounting on migration: when a task is migrated
> from CPUi to CPUj, immediately subtract the task's utilisation from
> CPUi and add it to CPUj. This mechanism is implemented by modifying the
> pull and push functions.
>
> Note: this is not fully correct from the theoretical point of view
> (the utilisation should be removed from CPUi only at the 0 lag time),
> but doing the right thing would be _MUCH_ more complex (leaving the
> timer armed when the task is on a different CPU... Inactive timers should
> be moved from per-task timers to per-runqueue lists of timers! Bah...)
>
> Signed-off-by: Luca Abeni <[email protected]>
> ---
> kernel/sched/deadline.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 3c64ebf..05cfccb 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1530,7 +1530,9 @@ retry:
> }
>
> deactivate_task(rq, next_task, 0);
> + sub_running_bw(&next_task->dl, &rq->dl);
> set_task_cpu(next_task, later_rq->cpu);
> + add_running_bw(&next_task->dl, &later_rq->dl);
> activate_task(later_rq, next_task, 0);
> ret = 1;
>
> @@ -1618,7 +1620,9 @@ static void pull_dl_task(struct rq *this_rq)
> resched = true;
>
> deactivate_task(src_rq, p, 0);
> + sub_running_bw(&p->dl, &src_rq->dl);
> set_task_cpu(p, this_cpu);
> + add_running_bw(&p->dl, &this_rq->dl);
> activate_task(this_rq, p, 0);
> dmin = p->dl.deadline;
>

Are these the only places a DL task might be migrated from? In
particular I worry about the case where we assign an existing DL task to
a different cpuset.

Juri?

2016-04-05 12:42:24

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> +static void task_go_inactive(struct task_struct *p)
> +{
> + struct sched_dl_entity *dl_se = &p->dl;
> + struct hrtimer *timer = &dl_se->inactive_timer;
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_rq(dl_rq);
> + ktime_t now, act;
> + s64 delta;
> + u64 zerolag_time;

s64 zerolag_time;

> +
> + WARN_ON(dl_se->dl_runtime == 0);
> +
> + /* If the inactive timer is already armed, return immediately */
> + if (hrtimer_active(&dl_se->inactive_timer))
> + return;
> +
> +
> + /*
> + * We want the timer to fire at the "0 lag time", but considering
> + * that it is actually coming from rq->clock and not from
> + * hrtimer's time base reading.
> + */
> + zerolag_time = dl_se->deadline -
> + div64_long((dl_se->runtime * dl_se->dl_period),
> + dl_se->dl_runtime);
> +
> + act = ns_to_ktime(zerolag_time);
> + now = hrtimer_cb_get_time(timer);
> + delta = ktime_to_ns(now) - rq_clock(rq);
> + act = ktime_add_ns(act, delta);

Would something like:

zerolag_time -= rq_clock(rq);

> +
> + /*
> + * If the "0-lag time" already passed, decrease the active
> + * utilization now, instead of starting a timer
> + */
> + if (ktime_us_delta(act, now) < 0) {

if (zerolag_time < 0)

> + sub_running_bw(dl_se, dl_rq);
> + if (!dl_task(p))
> + __dl_clear_params(p);
> +
> + return;
> + }
> +
> + get_task_struct(p);
> + hrtimer_start(timer, act, HRTIMER_MODE_ABS);

hrtimer_start(timer, ns_to_ktime(zerolag), HRTIMER_MODE_REL);

> +}

Not be simpler ?

2016-04-05 12:42:59

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> @@ -526,7 +575,18 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> struct rq *rq = rq_of_dl_rq(dl_rq);
>
> - add_running_bw(dl_se, dl_rq);
> + /*
> + * If the "inactive timer" is still active, stop it and leave
> + * the active utilisation unchanged.
> + * Otherwise, increase the active utilisation.
> + * If the timer cannot be cancelled, inactive_task_timer() will
> + * find the task state as TASK_RUNNING, and will do nothing, so
> + * we are still safe.
> + */
> + if (hrtimer_active(&dl_se->inactive_timer))
> + hrtimer_try_to_cancel(&dl_se->inactive_timer);

_try_, what happens if that fails?

> + else
> + add_running_bw(dl_se, dl_rq);
>
> if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
> dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {

2016-04-05 12:59:03

by Peter Zijlstra

Subject: Re: [RFC v2 4/7] Fix the update of the total -deadline utilization

On Fri, Apr 01, 2016 at 05:12:30PM +0200, Luca Abeni wrote:
> + /*
> + * XXX this is slightly incorrect: when the task
> + * utilization decreases, we should delay the total
> + * utilization change until the task's 0-lag point.
> + * But this would require to set the task's "inactive
> + * timer" when the task is not inactive.
> + */

Could you quickly remind me why this is a problem? The timeline does
continue running right? So even if the task isn't active now, it will
still reach its 0-lag point.

> __dl_clear(dl_b, p->dl.dl_bw);
> __dl_add(dl_b, new_bw);
> err = 0;

2016-04-05 14:48:06

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> + /*
> + * We cannot use inactive_task_timer() to invoke sub_running_bw()
> + * at the 0-lag time, because the task could have been migrated
> + * while SCHED_OTHER in the meanwhile.
> + */
> + if (hrtimer_active(&p->dl.inactive_timer) &&
> + !hrtimer_callback_running(&p->dl.inactive_timer))
> sub_running_bw(&p->dl, &rq->dl);

hrtimer_is_queued() ?

2016-04-05 15:00:43

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> +static void task_go_inactive(struct task_struct *p)
> +{
> + struct sched_dl_entity *dl_se = &p->dl;
> + struct hrtimer *timer = &dl_se->inactive_timer;
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_rq(dl_rq);
> + ktime_t now, act;
> + s64 delta;
> + u64 zerolag_time;
> +
> + WARN_ON(dl_se->dl_runtime == 0);
> +
> + /* If the inactive timer is already armed, return immediately */
> + if (hrtimer_active(&dl_se->inactive_timer))
> + return;

So while we start the timer on the local cpu, we don't migrate the timer
when we migrate the task, so the callback can happen on a remote cpu,
right?

Therefore, the timer function might still be running, but just have done
task_rq_unlock(), which would have allowed our cpu to acquire the
rq->lock and get here.

Then the above check is true, we'll quit, but effectively the inactive
timer will not run 'again'.

> +
> +
> + /*
> + * We want the timer to fire at the "0 lag time", but considering
> + * that it is actually coming from rq->clock and not from
> + * hrtimer's time base reading.
> + */
> + zerolag_time = dl_se->deadline -
> + div64_long((dl_se->runtime * dl_se->dl_period),
> + dl_se->dl_runtime);
> +
> + act = ns_to_ktime(zerolag_time);
> + now = hrtimer_cb_get_time(timer);
> + delta = ktime_to_ns(now) - rq_clock(rq);
> + act = ktime_add_ns(act, delta);
> +
> + /*
> + * If the "0-lag time" already passed, decrease the active
> + * utilization now, instead of starting a timer
> + */
> + if (ktime_us_delta(act, now) < 0) {
> + sub_running_bw(dl_se, dl_rq);
> + if (!dl_task(p))
> + __dl_clear_params(p);
> +
> + return;
> + }
> +
> + get_task_struct(p);
> + hrtimer_start(timer, act, HRTIMER_MODE_ABS);
> +}


> @@ -1071,6 +1164,23 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> }
> rcu_read_unlock();
>
> + if (rq != cpu_rq(cpu)) {

I don't think this is right, you want:

if (task_cpu(p) != cpu) {

because @cpu does not need to be task_cpu().

> + int migrate_active;
> +
> + raw_spin_lock(&rq->lock);

Which then also means @rq is 'wrong', so you'll have to add:

rq = task_rq(p);

before this.

> + migrate_active = hrtimer_active(&p->dl.inactive_timer);
> + if (migrate_active)
> + sub_running_bw(&p->dl, &rq->dl);
> + raw_spin_unlock(&rq->lock);

At this point task_rq() is still the above rq, so if the inactive timer
hits here it will lock this rq and subtract the running bw here _again_,
right?

> + if (migrate_active) {
> + rq = cpu_rq(cpu);
> + raw_spin_lock(&rq->lock);
> + add_running_bw(&p->dl, &rq->dl);
> + raw_spin_unlock(&rq->lock);
> + }
> + }

2016-04-05 16:47:48

by Luca Abeni

Subject: Re: [RFC v2 1/7] Track the active utilisation

Hi Peter,

first of all, thanks for all the reviews!

On Tue, 5 Apr 2016 14:23:27 +0200
Peter Zijlstra <[email protected]> wrote:
[...]
> > @@ -1218,6 +1242,9 @@ static void task_dead_dl(struct task_struct *p)
> > /* XXX we should retain the bw until 0-lag */
> > dl_b->total_bw -= p->dl.dl_bw;
> > raw_spin_unlock_irq(&dl_b->lock);
> > +
> > + if (task_on_rq_queued(p))
> > + sub_running_bw(&p->dl, &rq->dl);
> > }
> >
>
> A dead task cannot be running, not be queued, right? ISTR you remove
> this hunk in a later patch as well.
I suspect this is some "better safe than sorry" code I added trying
to solve some issue that I later solved in a different way... But I
forgot to remove it.

I'll fix the patch and re-test.



Thanks,
Luca

2016-04-05 16:51:13

by Luca Abeni

Subject: Re: [RFC v2 2/7] Correctly track the active utilisation for migrating tasks

On Tue, 5 Apr 2016 14:24:25 +0200
Peter Zijlstra <[email protected]> wrote:
[...]
> > @@ -1618,7 +1620,9 @@ static void pull_dl_task(struct rq *this_rq)
> > resched = true;
> >
> > deactivate_task(src_rq, p, 0);
> > + sub_running_bw(&p->dl, &src_rq->dl);
> > set_task_cpu(p, this_cpu);
> > + add_running_bw(&p->dl, &this_rq->dl);
> > activate_task(this_rq, p, 0);
> > dmin = p->dl.deadline;
> >
>
> Are these the only places a DL task might be migrated from? In
> particular I worry about the case where we assign an existing DL task to
> a different cpuset.
I was under the impression that these (+ the select_task_rq_dl() thing)
covered all of the migration cases... But I was probably wrong: now that
you say this, I realise that I never tested moving dl tasks between
different cpusets.

I'll test that, and fix the patch if some new issue appears.



Thanks,
Luca

2016-04-05 17:00:39

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

Hi Peter,

On Tue, 5 Apr 2016 14:42:18 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> > +static void task_go_inactive(struct task_struct *p)
> > +{
> > + struct sched_dl_entity *dl_se = &p->dl;
> > + struct hrtimer *timer = &dl_se->inactive_timer;
> > + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > + struct rq *rq = rq_of_dl_rq(dl_rq);
> > + ktime_t now, act;
> > + s64 delta;
> > + u64 zerolag_time;
>
> s64 zerolag_time;
Ok.

[...]
> Would something like:
>
> zerolag_time -= rq_clock(rq);
>
> > +
> > + /*
> > + * If the "0-lag time" already passed, decrease the active
> > + * utilization now, instead of starting a timer
> > + */
> > + if (ktime_us_delta(act, now) < 0) {
>
> if (zerolag_time < 0)
>
> > + sub_running_bw(dl_se, dl_rq);
> > + if (!dl_task(p))
> > + __dl_clear_params(p);
> > +
> > + return;
> > + }
> > +
> > + get_task_struct(p);
> > + hrtimer_start(timer, act, HRTIMER_MODE_ABS);
>
> hrtimer_start(timer, ns_to_ktime(zerolag), HRTIMER_MODE_REL);
>
> > +}
>
> Not be simpler ?
I think I blindly copied this code from the deadline timer, but yes,
you are right, I am doing a lot of useless conversions here. I'll fix
this.



Thanks,
Luca

2016-04-05 17:05:27

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

Hi Peter,

On Tue, 5 Apr 2016 14:42:42 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> > @@ -526,7 +575,18 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> > struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > struct rq *rq = rq_of_dl_rq(dl_rq);
> >
> > - add_running_bw(dl_se, dl_rq);
> > + /*
> > + * If the "inactive timer" is still active, stop it and leave
> > + * the active utilisation unchanged.
> > + * Otherwise, increase the active utilisation.
> > + * If the timer cannot be cancelled, inactive_task_timer() will
> > + * find the task state as TASK_RUNNING, and will do nothing, so
> > + * we are still safe.
> > + */
> > + if (hrtimer_active(&dl_se->inactive_timer))
> > + hrtimer_try_to_cancel(&dl_se->inactive_timer);
>
> _try_, what happens if that fails?

I think inactive_task_timer() will run, but will see p->state == TASK_RUNNING
and will return immediately.

I think I have actually seen this happening during my tests, because adding
the "if (p->state == TASK_RUNNING)" in inactive_task_timer() fixed some issues
that I was seeing.



Thanks,
Luca

2016-04-05 17:16:29

by Luca Abeni

Subject: Re: [RFC v2 4/7] Fix the update of the total -deadline utilization

Hi Peter,

On Tue, 5 Apr 2016 14:58:59 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 01, 2016 at 05:12:30PM +0200, Luca Abeni wrote:
> > + /*
> > + * XXX this is slightly incorrect: when the task
> > + * utilization decreases, we should delay the total
> > + * utilization change until the task's 0-lag point.
> > + * But this would require to set the task's "inactive
> > + * timer" when the task is not inactive.
> > + */
>
> Could you quickly remind me why this is a problem?

Do you mean, why it is a problem to activate the "inactive timer"
when the task is inactive?
I designed inactive_task_timer() to be executed when the task is not
active (for example, if the task's state is TASK_RUNNING, inactive_task_timer()
returns without doing anything).
One problem could be that inactive_task_timer() decreases the utilisation of
the task's rq (I am beginning to wonder if using the task's rq instead of the
rq of the CPU on which the timer is running is a good idea, BTW); if the task
is running, it can migrate to a different rq and inactive_task_timer() will
decrease the wrong utilisation.

Or do you mean why updating the utilisation now is a problem?
This is slightly incorrect because, by decreasing the utilisation too early, we
risk admitting tasks that create a transient overload (with some unexpected
deadline misses).

Or did I misunderstand your question?



Thanks,
Luca

> The timeline does
> continue running right? So even if the task isn't active now, it will
> still reach its 0-lag point.

2016-04-05 17:17:50

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, 5 Apr 2016 16:48:03 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> > + /*
> > + * We cannot use inactive_task_timer() to invoke sub_running_bw()
> > + * at the 0-lag time, because the task could have been migrated
> > + * while SCHED_OTHER in the meanwhile.
> > + */
> > + if (hrtimer_active(&p->dl.inactive_timer) &&
> > + !hrtimer_callback_running(&p->dl.inactive_timer))
> > sub_running_bw(&p->dl, &rq->dl);
>
> hrtimer_is_queued() ?

Uhm... I do not remember why I used this condition, but hrtimer_is_queued()
should be the right thing to use, yes... I'll update the patch.



Thanks,
Luca

2016-04-05 17:57:11

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, 5 Apr 2016 17:00:36 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> > +static void task_go_inactive(struct task_struct *p)
> > +{
> > + struct sched_dl_entity *dl_se = &p->dl;
> > + struct hrtimer *timer = &dl_se->inactive_timer;
> > + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > + struct rq *rq = rq_of_dl_rq(dl_rq);
> > + ktime_t now, act;
> > + s64 delta;
> > + u64 zerolag_time;
> > +
> > + WARN_ON(dl_se->dl_runtime == 0);
> > +
> > + /* If the inactive timer is already armed, return immediately */
> > + if (hrtimer_active(&dl_se->inactive_timer))
> > + return;
>
> So while we start the timer on the local cpu, we don't migrate the timer
> when we migrate the task, so the callback can happen on a remote cpu,
> right?
>
> Therefore, the timer function might still be running, but just have done
> task_rq_unlock(), which would have allowed our cpu to acquire the
> rq->lock and get here.
>
> Then the above check is true, we'll quit, but effectively the inactive
> timer will not run 'again'.
Uhm... So the problem is:
- Task T wakes up, but cannot cancel its inactive timer, because it is running
+ This should not be a problem: inactive_task_timer() will return without
doing anything
- Before inactive_task_timer() can actually run, task T migrates to a different CPU
- Before the timer finishes running, the task blocks again... So, task_go_inactive()
sees the timer as active and returns immediately. But the timer has already
executed (without doing anything). So no one decreases the rq utilisation.

I did not think about this issue, and I never managed to trigger it in my
tests... I'll try to see how it can be addressed. Do you have any suggestions?

[...]
> > @@ -1071,6 +1164,23 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > }
> > rcu_read_unlock();
> >
> > + if (rq != cpu_rq(cpu)) {
>
> I don't think this is right, you want:
>
> if (task_cpu(p) != cpu) {
>
> because @cpu does not need to be task_cpu().
Uhm... I must have misunderstood something in the code, then :(
What I want to do here is to check if select_task_rq_dl() selected
a new CPU for this task... Since at the beginning of the function
rq is set as
rq = cpu_rq(cpu);
I was thinking about checking if this is still true (if not, it
means that the value of "cpu" changed).

I'll look at it again.


>
> > + int migrate_active;
> > +
> > + raw_spin_lock(&rq->lock);
>
> Which then also means @rq is 'wrong', so you'll have to add:
>
> rq = task_rq(p);
Ok; I completely misunderstood the current code, then... :(


>
> before this.
>
> > + migrate_active = hrtimer_active(&p->dl.inactive_timer);
> > + if (migrate_active)
> > + sub_running_bw(&p->dl, &rq->dl);
> > + raw_spin_unlock(&rq->lock);
>
> At this point task_rq() is still the above rq, so if the inactive timer
> hits here it will lock this rq and subtract the running bw here _again_,
> right?
I think it will see the task state as TASK_RUNNING, so it will do nothing.
Or it will be cancelled later when the task is enqueued... I'll double check this.



Thanks,
Luca

>
> > + if (migrate_active) {
> > + rq = cpu_rq(cpu);
> > + raw_spin_lock(&rq->lock);
> > + add_running_bw(&p->dl, &rq->dl);
> > + raw_spin_unlock(&rq->lock);
> > + }
> > + }

2016-04-05 18:00:57

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:
> > > + if (rq != cpu_rq(cpu)) {
> >
> > I don't think this is right, you want:
> >
> > if (task_cpu(p) != cpu) {
> >
> > because @cpu does not need to be task_cpu().
> Uhm... I must have misunderstood something in the code, then :(
> What I want to do here is to check if select_task_rq_dl() selected
> a new CPU for this task... Since at the beginning of the function
> rq is set as
> rq = cpu_rq(cpu);
> I was thinking about checking if this is still true (if not, it
> means that the value of "cpu" changed).
>
> I'll look at it again.

Basically because:

ac66f5477239 ("sched/numa: Introduce migrate_swap()")

we cannot (in general) assume .cpu == task_cpu(p).

Now it might still be true for deadline tasks, but I find it easier to
simply not rely on such assumptions.

2016-04-05 18:02:56

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:

> > > + migrate_active = hrtimer_active(&p->dl.inactive_timer);
> > > + if (migrate_active)
> > > + sub_running_bw(&p->dl, &rq->dl);
> > > + raw_spin_unlock(&rq->lock);
> >
> > At this point task_rq() is still the above rq, so if the inactive timer
> > hits here it will lock this rq and subtract the running bw here _again_,
> > right?
> I think it will see the task state as TASK_RUNNING, so it will do nothing.
> Or it will be cancelled later when the task is enqueued... I'll double check this.

Right, so this is select_task_rq_dl(), we run this in wakeups, before
TASK_RUNNING.

2016-04-05 18:11:33

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:
> On Tue, 5 Apr 2016 17:00:36 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Apr 01, 2016 at 05:12:29PM +0200, Luca Abeni wrote:
> > > +static void task_go_inactive(struct task_struct *p)
> > > +{
> > > + struct sched_dl_entity *dl_se = &p->dl;
> > > + struct hrtimer *timer = &dl_se->inactive_timer;
> > > + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > > + struct rq *rq = rq_of_dl_rq(dl_rq);
> > > + ktime_t now, act;
> > > + s64 delta;
> > > + u64 zerolag_time;
> > > +
> > > + WARN_ON(dl_se->dl_runtime == 0);
> > > +
> > > + /* If the inactive timer is already armed, return immediately */
> > > + if (hrtimer_active(&dl_se->inactive_timer))
> > > + return;
> >
> > So while we start the timer on the local cpu, we don't migrate the timer
> > when we migrate the task, so the callback can happen on a remote cpu,
> > right?
> >
> > Therefore, the timer function might still be running, but just have done
> > task_rq_unlock(), which would have allowed our cpu to acquire the
> > rq->lock and get here.
> >
> > Then the above check is true, we'll quit, but effectively the inactive
> > timer will not run 'again'.
> Uhm... So the problem is:
> - Task T wakes up, but cannot cancel its inactive timer, because it is running
> + This should not be a problem: inactive_task_timer() will return without
> doing anything
> - Before inactive_task_timer() can actually run, task T migrates to a different CPU
> - Before the timer finishes running, the task blocks again... So, task_go_inactive()
> sees the timer as active and returns immediately. But the timer has already
> executed (without doing anything). So no one decreases the rq utilisation.
>
> I did not think about this issue, and I never managed to trigger it in my
> tests... I'll try to see how it can be addressed. Do you have any suggestions?

So my brain is about to give out, but it might be easiest to simply
track if the current task's bandwidth is added, with a per-task variable
under pi and rq lock.
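
Something along these lines, perhaps (rough standalone sketch only, to
illustrate the idea; names are made up and it is obviously untested):

/*
 * A per-task flag records whether the task's bandwidth is currently
 * accounted in running_bw, making add/sub idempotent: a "late" inactive
 * timer callback can then neither subtract the bandwidth twice nor leave
 * it accounted forever. All updates happen under pi_lock + rq->lock.
 */
struct dl_entity_model { unsigned long bw; int contending; };
struct dl_rq_model     { unsigned long running_bw; };

void add_running_bw(struct dl_entity_model *se, struct dl_rq_model *rq)
{
	if (se->contending)
		return;			/* already accounted */
	se->contending = 1;
	rq->running_bw += se->bw;
}

void sub_running_bw(struct dl_entity_model *se, struct dl_rq_model *rq)
{
	if (!se->contending)
		return;			/* nothing to remove */
	se->contending = 0;
	rq->running_bw -= se->bw;
}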

2016-04-05 19:24:36

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, 5 Apr 2016 20:02:52 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:
>
> > > > + migrate_active = hrtimer_active(&p->dl.inactive_timer);
> > > > + if (migrate_active)
> > > > + sub_running_bw(&p->dl, &rq->dl);
> > > > + raw_spin_unlock(&rq->lock);
> > >
> > > At this point task_rq() is still the above rq, so if the inactive timer
> > > hits here it will lock this rq and subtract the running bw here _again_,
> > > right?
> > I think it will see the task state as TASK_RUNNING, so it will do nothing.
> > Or it will be cancelled later when the task is enqueued... I'll double check this.
>
> Right, so this is select_task_rq_dl(), we run this in wakeups, before
> TASK_RUNNING.

Sigh... I knew I was missing something here... :(
So, I think the solution here is to use double_lock_balance() (or something
like that) to take both the rq locks so that the inactive timer handler cannot
run between sub_running_bw() and add_running_bw()... I'll try this.



Thanks,
Luca

2016-04-05 19:31:10

by Peter Zijlstra

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, Apr 05, 2016 at 09:24:24PM +0200, luca abeni wrote:
> On Tue, 5 Apr 2016 20:02:52 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:
> >
> > > > > + migrate_active = hrtimer_active(&p->dl.inactive_timer);
> > > > > + if (migrate_active)
> > > > > + sub_running_bw(&p->dl, &rq->dl);
> > > > > + raw_spin_unlock(&rq->lock);
> > > >
> > > > At this point task_rq() is still the above rq, so if the inactive timer
> > > > hits here it will lock this rq and subtract the running bw here _again_,
> > > > right?
> > > I think it will see the task state as TASK_RUNNING, so it will do nothing.
> > > Or it will be cancelled later when the task is enqueued... I'll double check this.
> >
> > Right, so this is select_task_rq_dl(), we run this in wakeups, before
> > TASK_RUNNING.
>
> Sigh... I knew I was missing something here... :(
> So, I think the solution here is to use double_lock_balance() (or something
> like that) to take both the rq locks so that the inactive timer handler cannot
> run between sub_running_bw() and add_running_bw()... I'll try this.

I'm not sure that'll fix it, because after you unlock both again, the timer
can still hit, and there task_rq() will still be the first rq, not the
second. So we again subtract twice from the old rq.

Only after __set_task_cpu()'s store to task_thread_info(p)->cpu will the
timer hit the new rq.

And you cannot hold a lock over that..

2016-04-05 19:32:46

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, 5 Apr 2016 21:24:24 +0200
luca abeni <[email protected]> wrote:

> On Tue, 5 Apr 2016 20:02:52 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:
> >
> > > > > + migrate_active = hrtimer_active(&p->dl.inactive_timer);
> > > > > + if (migrate_active)
> > > > > + sub_running_bw(&p->dl, &rq->dl);
> > > > > + raw_spin_unlock(&rq->lock);
> > > >
> > > > At this point task_rq() is still the above rq, so if the inactive timer
> > > > hits here it will lock this rq and subtract the running bw here _again_,
> > > > right?
> > > I think it will see the task state as TASK_RUNNING, so it will do nothing.
> > > Or it will be cancelled later when the task is enqueued... I'll double check this.
> >
> > Right, so this is select_task_rq_dl(), we run this in wakeups, before
> > TASK_RUNNING.
>
> Sigh... I knew I was missing something here... :(
> So, I think the solution here is to use double_lock_balance() (or something
> like that) to take both the rq locks so that the inactive timer handler cannot
> run between sub_running_bw() and add_running_bw()... I'll try this.
Thinking about this again: isn't p->pi_lock saving us here?
I mean:
- try_to_wake_up() takes p->pi_lock before doing anything else
- so, select_task_rq() is invoked with p->pi_lock locked
- but inactive_task_timer() does "rq = task_rq_lock(p, &flags)", and
task_rq_lock() tries to take p->pi_lock
- so, we should be safe, no?

Maybe this is why I never managed to trigger this race... :)



Thanks,
Luca

2016-04-05 19:35:58

by Luca Abeni

Subject: Re: [RFC v2 3/7] Improve the tracking of active utilisation

On Tue, 5 Apr 2016 20:00:50 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Apr 05, 2016 at 07:56:57PM +0200, luca abeni wrote:
> > > > + if (rq != cpu_rq(cpu)) {
> > >
> > > I don't think this is right, you want:
> > >
> > > if (task_cpu(p) != cpu) {
> > >
> > > because @cpu does not need to be task_cpu().
> > Uhm... I must have misunderstood something in the code, then :(
> > What I want to do here is to check if select_task_rq_dl() selected
> > a new CPU for this task... Since at the beginning of the function
> > rq is set as
> > rq = cpu_rq(cpu);
> > I was thinking about checking if this is still true (if not, it
> > means that the value of "cpu" changed).
> >
> > I'll look at it again.
>
> Basically because:
>
> ac66f5477239 ("sched/numa: Introduce migrate_swap()")

Thanks; I am going to look at it


Thanks,
Luca

>
> we cannot (in general) assume .cpu == task_cpu(p).
>
> Now it might still be true for deadline tasks, but I find it easier to
> simply not rely on such assumptions.