2016-12-30 11:33:47

by Luca Abeni

Subject: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: Luca Abeni <[email protected]>

Hi all,

here is a new version of the patchset implementing CPU reclaiming
(using the GRUB algorithm[1]) for SCHED_DEADLINE.
Basically, this feature allows SCHED_DEADLINE tasks to consume more
than their reserved runtime, up to a maximum fraction of the CPU time
(so that some spare CPU time is left for the other tasks to execute),
provided that this does not break the guarantees of the other
SCHED_DEADLINE tasks.
The patchset applies on top of tip/master.


The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of the active tasks (first 2 patches) and on modifying
the runtime accounting rule (see patch 0004). The original GRUB algorithm
is modified as described in [2] to support multiple CPUs (the original
algorithm considered one single CPU only, while this one tracks U_act per
runqueue) and to leave an "unreclaimable" fraction of CPU time to
non-SCHED_DEADLINE tasks (see patch 0005: the original algorithm can
consume 100% of the CPU time, starving all the other tasks).
Patch 0003 uses the "inactive timer" introduced in patch 0002 to fix
dl_overflow() and __setparam_dl().
Patch 0006 makes it possible to enable CPU reclaiming on selected tasks
only.
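
As a purely illustrative example of the effect of reclaiming: a task
with a reservation of 10ms every 100ms is guaranteed 10% of the CPU
time; if it is the only active SCHED_DEADLINE task, U_act = 0.1 and the
modified accounting rule depletes its runtime at one tenth of the
wall-clock rate, so the task can execute for much longer than 10ms per
period - up to the maximum reclaimable fraction of the CPU time
enforced by patch 0005.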


Changes since v3:
the most important change is the introduction of a new "dl_non_contending"
flag in the "sched_dl_entity" structure, which avoids a race condition
identified by Peter
(http://lkml.iu.edu/hypermail/linux/kernel/1604.0/02822.html) and Juri
(http://lkml.iu.edu/hypermail/linux/kernel/1611.1/02298.html).
For the moment, I added a new field (similar to the other "dl_*" flags)
to the deadline scheduling entity; if needed, I can move all the dl_*
flags into a single field in a follow-up patch.

Other than this, I tried to address all the comments I received and to
add the code comments requested in the previous reviews.
In particular, the add_running_bw() and sub_running_bw() functions are now
marked as inline and have been simplified as suggested by Daniel and
Steven.
The overflow and underflow checks in these functions have been modified
as suggested by Peter; because of a limitation of SCHED_WARN_ON(), the
code in sub_running_bw() is slightly more complex. If SCHED_WARN_ON() is
improved (as suggested in a previous email of mine), I can simplify
sub_running_bw() in a follow-up patch.
I also updated the patches to apply on top of tip/master.
Finally, I (hopefully) fixed an issue with my usage of get_task_struct() /
put_task_struct() in the previous patches: I called "get_task_struct(p)"
before arming the "inactive task timer" and "put_task_struct(p)" in the
timer handler, but I forgot to call "put_task_struct(p)" when the timer
was successfully cancelled; this should be fixed in the new version of
patch 0002.
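
For reference, the intended pattern looks like this (a minimal sketch
based on the hunks in patches 0001/0002, not the literal kernel code):

	/* Arming the timer: take a reference on the task, so that it
	 * cannot be freed before the timer handler has run */
	get_task_struct(p);
	hrtimer_start(&dl_se->inactive_timer,
		      ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);

	/* Cancelling the timer: if hrtimer_try_to_cancel() returns 1
	 * the handler will never run, so its reference must be dropped
	 * here (this is the put that was missing in v3) */
	if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
		put_task_struct(p);

	/* Otherwise the handler runs and drops the reference itself,
	 * via put_task_struct(p) at the end of inactive_task_timer() */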

[1] Lipari, G. and Baruah, S. (2000). "Greedy reclamation of unused bandwidth in constant-bandwidth servers". In Proceedings of the 12th Euromicro Conference on Real-Time Systems (Euromicro RTS 2000), pp. 193-200. IEEE.
[2] Abeni, L., Lelli, J., Scordino, C. and Palopoli, L. (2014). "Greedy CPU reclaiming for SCHED_DEADLINE". In Proceedings of the Real-Time Linux Workshop (RTLWS), Düsseldorf, Germany, October 2014.

Luca Abeni (6):
sched/deadline: track the active utilization
sched/deadline: improve the tracking of active utilization
sched/deadline: fix the update of the total -deadline utilization
sched/deadline: implement GRUB accounting
sched/deadline: do not reclaim the whole CPU bandwidth
sched/deadline: make GRUB a task's flag

include/linux/sched.h | 18 +++-
include/uapi/linux/sched.h | 1 +
kernel/sched/core.c | 45 ++++----
kernel/sched/deadline.c | 260 +++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 13 +++
5 files changed, 291 insertions(+), 46 deletions(-)

--
2.7.4


2016-12-30 11:33:55

by Luca Abeni

Subject: [RFC v4 1/6] sched/deadline: track the active utilization

From: Luca Abeni <[email protected]>

Active utilization is defined as the total utilization of the active
(TASK_RUNNING) tasks queued on a runqueue; hence, it is increased when
a task wakes up and decreased when a task blocks.

When a task is migrated from CPUi to CPUj, its utilization is
immediately subtracted from CPUi's active utilization and added to
CPUj's. This mechanism is implemented by modifying the pull and push
functions.
Note: this is not fully correct from a theoretical point of view
(the utilization should be removed from CPUi only at the 0-lag
time); a more theoretically sound solution will follow.

Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/deadline.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 6 +++++
2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 70ef2b1..23c840e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,28 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
return !RB_EMPTY_NODE(&dl_se->rb_node);
}

+static inline
+void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ u64 old = dl_rq->running_bw;
+
+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ dl_rq->running_bw += dl_se->dl_bw;
+ SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+}
+
+static inline
+void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ u64 old = dl_rq->running_bw;
+
+ lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+ dl_rq->running_bw -= dl_se->dl_bw;
+ SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
+ if (dl_rq->running_bw > old)
+ dl_rq->running_bw = 0;
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -909,8 +931,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
* parameters of the task might need updating. Otherwise,
* we want a replenishment of its runtime.
*/
- if (flags & ENQUEUE_WAKEUP)
+ if (flags & ENQUEUE_WAKEUP) {
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ add_running_bw(dl_se, dl_rq);
update_dl_entity(dl_se, pi_se);
+ }
else if (flags & ENQUEUE_REPLENISH)
replenish_dl_entity(dl_se, pi_se);

@@ -947,14 +973,25 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
return;
}

+ if (p->on_rq == TASK_ON_RQ_MIGRATING)
+ add_running_bw(&p->dl, &rq->dl);
+
/*
- * If p is throttled, we do nothing. In fact, if it exhausted
+ * If p is throttled, we do not enqueue it. In fact, if it exhausted
* its budget it needs a replenishment and, since it now is on
* its rq, the bandwidth timer callback (which clearly has not
* run yet) will take care of this.
+ * However, the active utilization does not depend on whether
+ * the task is on the runqueue or not (it depends on the task's
+ * state - in GRUB parlance, "inactive" vs "active contending").
+ * In other words, even if a task is throttled its utilization must
+ * be counted in the active utilization; hence, we need to call
+ * add_running_bw().
*/
- if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+ if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+ add_running_bw(&p->dl, &rq->dl);
return;
+ }

enqueue_dl_entity(&p->dl, pi_se, flags);

@@ -972,6 +1009,21 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
__dequeue_task_dl(rq, p, flags);
+
+ if (p->on_rq == TASK_ON_RQ_MIGRATING)
+ sub_running_bw(&p->dl, &rq->dl);
+
+ /*
+ * This check makes it possible to start the inactive timer (or to
+ * immediately decrease the active utilization, if needed) in two cases:
+ * when the task blocks and when it is terminating
+ * (p->state == TASK_DEAD). We can handle the two cases in the same
+ * way, because from GRUB's point of view the same thing is happening
+ * (the task moves from "active contending" to "active non contending"
+ * or "inactive")
+ */
+ if (flags & DEQUEUE_SLEEP)
+ sub_running_bw(&p->dl, &rq->dl);
}

/*
@@ -1501,7 +1553,9 @@ static int push_dl_task(struct rq *rq)
}

deactivate_task(rq, next_task, 0);
+ sub_running_bw(&next_task->dl, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
+ add_running_bw(&next_task->dl, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;

@@ -1589,7 +1643,9 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;

deactivate_task(src_rq, p, 0);
+ sub_running_bw(&p->dl, &src_rq->dl);
set_task_cpu(p, this_cpu);
+ add_running_bw(&p->dl, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;

@@ -1695,6 +1751,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
if (!start_dl_timer(p))
__dl_clear_params(p);

+ if (task_on_rq_queued(p))
+ sub_running_bw(&p->dl, &rq->dl);
+
/*
* Since this might be the only -deadline task on the rq,
* this is the right place to try to pull some other one
@@ -1712,6 +1771,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
*/
static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
+ add_running_bw(&p->dl, &rq->dl);

/* If p is not queued we will update its parameters at next wakeup. */
if (!task_on_rq_queued(p))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b34c78..0659772 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -536,6 +536,12 @@ struct dl_rq {
#else
struct dl_bw dl_bw;
#endif
+ /*
+ * "Active utilization" for this runqueue: increased when a
+ * task wakes up (becomes TASK_RUNNING) and decreased when a
+ * task blocks
+ */
+ u64 running_bw;
};

#ifdef CONFIG_SMP
--
2.7.4

2016-12-30 11:34:00

by Luca Abeni

Subject: [RFC v4 2/6] sched/deadline: improve the tracking of active utilization

From: Luca Abeni <[email protected]>

This patch implements a more theoretically sound algorithm for
tracking the active utilization: instead of decreasing it when a
task blocks, a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) is used to decrease the
active utilization at the so-called "0-lag time".
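
As a purely illustrative sketch of the 0-lag arithmetic (the values
are made up; the computation mirrors task_go_inactive() below):

	#include <stdio.h>
	#include <stdint.h>

	/* A task with 10ms runtime every 100ms blocks with 4ms of
	 * runtime left and an absolute deadline at 250ms (values in ns) */
	int main(void)
	{
		int64_t dl_runtime = 10000000, dl_period = 100000000;
		int64_t runtime = 4000000, deadline = 250000000;
		int64_t zerolag = deadline - runtime * dl_period / dl_runtime;

		/* 250ms - 4ms * (100 / 10) = 210ms: the inactive timer is
		 * armed to fire at that instant, and only then is the task's
		 * bandwidth subtracted from the active utilization */
		printf("0-lag time: %lld ns\n", (long long)zerolag);
		return 0;
	}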

Signed-off-by: Luca Abeni <[email protected]>
---
include/linux/sched.h | 18 +++++-
kernel/sched/core.c | 2 +
kernel/sched/deadline.c | 150 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 1 +
4 files changed, 158 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d19052..f34633c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,14 +1451,30 @@ struct sched_dl_entity {
*
* @dl_yielded tells if task gave up the cpu before consuming
* all its available runtime during the last job.
+ *
+ * @dl_non_contending tells if task is inactive while still
+ * contributing to the active utilization. In other words, it
+ * indicates if the inactive timer has been armed and its handler
+ * has not been executed yet. This flag is useful to avoid race
+ * conditions between the inactive timer handler and the wakeup
+ * code.
*/
- int dl_throttled, dl_boosted, dl_yielded;
+ int dl_throttled, dl_boosted, dl_yielded, dl_non_contending;

/*
* Bandwidth enforcement timer. Each -deadline task has its
* own bandwidth to be enforced, thus we need one timer per task.
*/
struct hrtimer dl_timer;
+
+ /*
+ * Inactive timer, responsible for decreasing the active utilization
+ * at the "0-lag time". When a -deadline task blocks, it contributes
+ * to GRUB's active utilization until the "0-lag time", hence a
+ * timer is needed to decrease the active utilization at the correct
+ * time.
+ */
+ struct hrtimer inactive_timer;
};

union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c56fb57..98f9944 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2187,6 +2187,7 @@ void __dl_clear_params(struct task_struct *p)

dl_se->dl_throttled = 0;
dl_se->dl_yielded = 0;
+ dl_se->dl_non_contending = 0;
}

/*
@@ -2218,6 +2219,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)

RB_CLEAR_NODE(&p->dl.rb_node);
init_dl_task_timer(&p->dl);
+ init_inactive_task_timer(&p->dl);
__dl_clear_params(p);

INIT_LIST_HEAD(&p->rt.run_list);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 23c840e..cdb7274 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -65,6 +65,46 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
dl_rq->running_bw = 0;
}

+static void task_go_inactive(struct task_struct *p)
+{
+ struct sched_dl_entity *dl_se = &p->dl;
+ struct hrtimer *timer = &dl_se->inactive_timer;
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+ s64 zerolag_time;
+
+ WARN_ON(dl_se->dl_runtime == 0);
+
+ WARN_ON(hrtimer_active(&dl_se->inactive_timer));
+ WARN_ON(dl_se->dl_non_contending);
+
+ zerolag_time = dl_se->deadline -
+ div64_long((dl_se->runtime * dl_se->dl_period),
+ dl_se->dl_runtime);
+
+ /*
+ * Using relative times instead of the absolute "0-lag time"
+ * simplifies the code
+ */
+ zerolag_time -= rq_clock(rq);
+
+ /*
+ * If the "0-lag time" already passed, decrease the active
+ * utilization now, instead of starting a timer
+ */
+ if (zerolag_time < 0) {
+ sub_running_bw(dl_se, dl_rq);
+ if (!dl_task(p))
+ __dl_clear_params(p);
+
+ return;
+ }
+
+ dl_se->dl_non_contending = 1;
+ get_task_struct(p);
+ hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -610,10 +650,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* The task might have changed its scheduling policy to something
* different than SCHED_DEADLINE (through switched_from_dl()).
*/
- if (!dl_task(p)) {
- __dl_clear_params(p);
+ if (!dl_task(p))
goto unlock;
- }

/*
* The task might have been boosted by someone else and might be in the
@@ -800,6 +838,48 @@ static void update_curr_dl(struct rq *rq)
}
}

+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+ struct sched_dl_entity *dl_se = container_of(timer,
+ struct sched_dl_entity,
+ inactive_timer);
+ struct task_struct *p = dl_task_of(dl_se);
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+
+ if (!dl_task(p) || p->state == TASK_DEAD) {
+ if (p->state == TASK_DEAD && dl_se->dl_non_contending)
+ sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
+
+ __dl_clear_params(p);
+
+ goto unlock;
+ }
+ if (dl_se->dl_non_contending == 0)
+ goto unlock;
+
+ sched_clock_tick();
+ update_rq_clock(rq);
+
+ sub_running_bw(dl_se, &rq->dl);
+ dl_se->dl_non_contending = 0;
+unlock:
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+
+ return HRTIMER_NORESTART;
+}
+
+void init_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+ struct hrtimer *timer = &dl_se->inactive_timer;
+
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ timer->function = inactive_task_timer;
+}
+
#ifdef CONFIG_SMP

static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -934,7 +1014,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
if (flags & ENQUEUE_WAKEUP) {
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);

- add_running_bw(dl_se, dl_rq);
+ if (dl_se->dl_non_contending) {
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_not_contending is not set, and
+ * will do nothing, so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
+ put_task_struct(dl_task_of(dl_se));
+ WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);
+ dl_se->dl_non_contending = 0;
+ } else {
+ /*
+ * Since "dl_non_contending" is not set, the
+ * task's utilization has already been removed from
+ * active utilization (either when the task blocked,
+ * when the "inactive timer" fired, or when it has
+ * been cancelled in select_task_rq_dl()).
+ * So, add it back.
+ */
+ add_running_bw(dl_se, dl_rq);
+ }
update_dl_entity(dl_se, pi_se);
}
else if (flags & ENQUEUE_REPLENISH)
@@ -1023,7 +1124,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* or "inactive")
*/
if (flags & DEQUEUE_SLEEP)
- sub_running_bw(&p->dl, &rq->dl);
+ task_go_inactive(p);
}

/*
@@ -1097,6 +1198,22 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
}
rcu_read_unlock();

+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (p->dl.dl_non_contending) {
+ sub_running_bw(&p->dl, &rq->dl);
+ p->dl.dl_non_contending = 0;
+ /*
+ * If the timer handler is currently running and the
+ * timer cannot be cancelled, inactive_task_timer()
+ * will see that dl_non_contending is not set, and
+ * will do nothing, so we are still safe.
+ */
+ if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+ put_task_struct(p);
+ }
+ raw_spin_unlock(&rq->lock);
+
out:
return cpu;
}
@@ -1743,16 +1860,25 @@ void __init init_sched_dl_class(void)
static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
/*
- * Start the deadline timer; if we switch back to dl before this we'll
- * continue consuming our current CBS slice. If we stay outside of
- * SCHED_DEADLINE until the deadline passes, the timer will reset the
- * task.
+ * task_go_inactive() can start the "inactive timer" (if the 0-lag
+ * time is in the future). If the task switches back to dl before
+ * the "inactive timer" fires, it can continue to consume its current
+ * runtime using its current deadline. If it stays outside of
+ * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+ * will reset the task parameters.
*/
- if (!start_dl_timer(p))
- __dl_clear_params(p);
+ if (task_on_rq_queued(p) && p->dl.dl_runtime)
+ task_go_inactive(p);

- if (task_on_rq_queued(p))
+ /*
+ * We cannot use inactive_task_timer() to invoke sub_running_bw()
+ * at the 0-lag time, because the task could have been migrated
+ * in the meanwhile, while it was running as SCHED_OTHER.
+ */
+ if (p->dl.dl_non_contending) {
sub_running_bw(&p->dl, &rq->dl);
+ p->dl.dl_non_contending = 0;
+ }

/*
* Since this might be the only -deadline task on the rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0659772..e422803 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1367,6 +1367,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
extern struct dl_bandwidth def_dl_bandwidth;
extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);

unsigned long to_ratio(u64 period, u64 runtime);

--
2.7.4

2016-12-30 11:34:05

by Luca Abeni

Subject: [RFC v4 4/6] sched/deadline: implement GRUB accounting

From: Luca Abeni <[email protected]>

According to the GRUB (Greedy Reclamation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.
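
As a purely illustrative sketch of the fixed-point arithmetic (the
numbers are made up; running_bw stores Uact << 20, as explained in the
comment added by this patch):

	#include <stdio.h>
	#include <stdint.h>

	/* userspace mirror of grub_reclaim() */
	static uint64_t grub_reclaim(uint64_t delta, uint64_t running_bw)
	{
		return (delta * running_bw) >> 20;
	}

	int main(void)
	{
		uint64_t delta_exec = 1000000;	/* 1ms of execution, in ns */
		uint64_t running_bw = 1 << 19;	/* Uact = 0.5 in fixed point */

		/* only Uact * delta_exec = 0.5ms is charged to the task */
		printf("charged: %llu ns\n", (unsigned long long)
		       grub_reclaim(delta_exec, running_bw));
		return 0;
	}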

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/deadline.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c087c3d..361887b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -764,6 +764,19 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);

/*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^20, the result
+ * has to be shifted right by 20.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+ return (delta * rq->dl.running_bw) >> 20;
+}
+
+/*
* Update the current task's runtime statistics (provided it is still
* a -deadline task and has not been removed from the dl_rq).
*/
@@ -805,6 +818,7 @@ static void update_curr_dl(struct rq *rq)

sched_rt_avg_update(rq, delta_exec);

+ delta_exec = grub_reclaim(delta_exec, rq);
dl_se->runtime -= delta_exec;

throttle:
--
2.7.4

2016-12-30 11:34:10

by Luca Abeni

Subject: [RFC v4 3/6] sched/deadline: fix the update of the total -deadline utilization

From: Luca Abeni <[email protected]>

Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/core.c | 36 ++++++++++++------------------------
kernel/sched/deadline.c | 32 +++++++++++++++++++++++---------
2 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98f9944..5030b3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2509,9 +2509,6 @@ static inline int dl_bw_cpus(int i)
* allocated bandwidth to reflect the new situation.
*
* This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
*/
static int dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
@@ -2540,11 +2537,22 @@ static int dl_overflow(struct task_struct *p, int policy,
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
!__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ /*
+ * XXX this is slightly incorrect: when the task
+ * utilization decreases, we should delay the total
+ * utilization change until the task's 0-lag point.
+ * But this would require setting the task's "inactive
+ * timer" when the task is not inactive.
+ */
__dl_clear(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
err = 0;
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ /*
+ * Do not decrease the total deadline utilization here:
+ * switched_from_dl() will take care of it at the correct
+ * (0-lag) time.
+ */
err = 0;
}
raw_spin_unlock(&dl_b->lock);
@@ -3914,26 +3922,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
dl_se->flags = attr->sched_flags;
dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
- /*
- * Changing the parameters of a task is 'tricky' and we're not doing
- * the correct thing -- also see task_dead_dl() and switched_from_dl().
- *
- * What we SHOULD do is delay the bandwidth release until the 0-lag
- * point. This would include retaining the task_struct until that time
- * and change dl_overflow() to not immediately decrement the current
- * amount.
- *
- * Instead we retain the current runtime/deadline and let the new
- * parameters take effect after the current reservation period lapses.
- * This is safe (albeit pessimistic) because the 0-lag point is always
- * before the current scheduling deadline.
- *
- * We can still have temporary overloads because we do not delay the
- * change in bandwidth until that time; so admission control is
- * not on the safe side. It does however guarantee tasks will never
- * consume more than promised.
- */
}

/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cdb7274..c087c3d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -94,8 +94,14 @@ static void task_go_inactive(struct task_struct *p)
*/
if (zerolag_time < 0) {
sub_running_bw(dl_se, dl_rq);
- if (!dl_task(p))
+ if (!dl_task(p)) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ raw_spin_lock(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
__dl_clear_params(p);
+ raw_spin_unlock(&dl_b->lock);
+ }

return;
}
@@ -850,9 +856,14 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
rq = task_rq_lock(p, &rf);

if (!dl_task(p) || p->state == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
if (p->state == TASK_DEAD && dl_se->dl_non_contending)
sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));

+ raw_spin_lock(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
+ raw_spin_unlock(&dl_b->lock);
__dl_clear_params(p);

goto unlock;
@@ -1375,15 +1386,18 @@ static void task_fork_dl(struct task_struct *p)

static void task_dead_dl(struct task_struct *p)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ if (!hrtimer_active(&p->dl.inactive_timer)) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));

- /*
- * Since we are TASK_DEAD we won't slip out of the domain!
- */
- raw_spin_lock_irq(&dl_b->lock);
- /* XXX we should retain the bw until 0-lag */
- dl_b->total_bw -= p->dl.dl_bw;
- raw_spin_unlock_irq(&dl_b->lock);
+ /*
+ * If the "inactive timer is not active, the 0-lag time
+ * is already passed, so we immediately decrease the
+ * total deadline utilization
+ */
+ raw_spin_lock_irq(&dl_b->lock);
+ __dl_clear(dl_b, p->dl.dl_bw);
+ raw_spin_unlock_irq(&dl_b->lock);
+ }
}

static void set_curr_task_dl(struct rq *rq)
--
2.7.4

2016-12-30 11:34:08

by Luca Abeni

Subject: [RFC v4 6/6] sched/deadline: make GRUB a task's flag

From: Luca Abeni <[email protected]>
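
Make GRUB reclaiming opt-in per task: the runtime is depleted according
to the GRUB rule only for SCHED_DEADLINE tasks that set the new
SCHED_FLAG_RECLAIM flag; all the other tasks keep the "dq = -dt"
accounting.

As a usage illustration (a minimal userspace sketch, not part of the
patch: the runtime/period values are made up, sched_setattr() has no
glibc wrapper so the raw syscall is used, and the call needs the
appropriate privileges):

	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define SCHED_DEADLINE		6
	#define SCHED_FLAG_RECLAIM	0x02	/* introduced by this patch */

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
	};

	int main(void)
	{
		struct sched_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_policy = SCHED_DEADLINE;
		attr.sched_flags = SCHED_FLAG_RECLAIM;
		attr.sched_runtime  =  10 * 1000 * 1000;	/*  10ms */
		attr.sched_deadline = 100 * 1000 * 1000;	/* 100ms */
		attr.sched_period   = 100 * 1000 * 1000;	/* 100ms */

		if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0) {
			perror("sched_setattr");
			return 1;
		}
		/* ... periodic work; runtime exceeding 10ms per period can
		 * now be reclaimed according to GRUB ... */
		return 0;
	}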

Signed-off-by: Luca Abeni <[email protected]>
---
include/uapi/linux/sched.h | 1 +
kernel/sched/core.c | 3 ++-
kernel/sched/deadline.c | 3 ++-
3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..e2a6c7b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,5 +47,6 @@
* For the sched_{set,get}attr() calls
*/
#define SCHED_FLAG_RESET_ON_FORK 0x01
+#define SCHED_FLAG_RECLAIM 0x02

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4010af7..af9c882 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4089,7 +4089,8 @@ static int __sched_setscheduler(struct task_struct *p,
return -EINVAL;
}

- if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+ if (attr->sched_flags &
+ ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
return -EINVAL;

/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7585dfb..93ff400 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -823,7 +823,8 @@ static void update_curr_dl(struct rq *rq)

sched_rt_avg_update(rq, delta_exec);

- delta_exec = grub_reclaim(delta_exec, rq);
+ if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+ delta_exec = grub_reclaim(delta_exec, rq);
dl_se->runtime -= delta_exec;

throttle:
--
2.7.4

2016-12-30 11:34:42

by Luca Abeni

Subject: [RFC v4 5/6] sched/deadline: do not reclaim the whole CPU bandwidth

From: Luca Abeni <[email protected]>

The original GRUB algorithm tends to reclaim 100% of the CPU time,
which allows a SCHED_DEADLINE CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a specified
fraction of the CPU time.
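
As a purely illustrative computation (default values of
/proc/sys/kernel/sched_rt_{runtime,period}_us; to_ratio() mirrors the
kernel's 20-bit fixed-point helper):

	#include <stdio.h>
	#include <stdint.h>

	static uint64_t to_ratio(uint64_t period, uint64_t runtime)
	{
		return (runtime << 20) / period;
	}

	int main(void)
	{
		uint64_t rt_period  = 1000000;	/* default, in us */
		uint64_t rt_runtime =  950000;	/* default, in us */
		uint64_t non_deadline_bw =
			(1 << 20) - to_ratio(rt_period, rt_runtime);

		/* 52429 / 2^20: ~5% of the CPU is never reclaimed, so
		 * grub_reclaim() charges at least that fraction of the
		 * elapsed time to the running task */
		printf("non_deadline_bw = %llu (%.2f%% of the CPU)\n",
		       (unsigned long long)non_deadline_bw,
		       non_deadline_bw * 100.0 / (1 << 20));
		return 0;
	}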

Signed-off-by: Luca Abeni <[email protected]>
---
kernel/sched/core.c | 4 ++++
kernel/sched/deadline.c | 7 ++++++-
kernel/sched/sched.h | 6 ++++++
3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5030b3c..4010af7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8286,6 +8286,10 @@ static void sched_dl_do_global(void)
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

rcu_read_unlock_sched();
+ if (dl_b->bw == -1)
+ cpu_rq(cpu)->dl.non_deadline_bw = 0;
+ else
+ cpu_rq(cpu)->dl.non_deadline_bw = (1 << 20) - new_bw;
}
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 361887b..7585dfb 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -151,6 +151,11 @@ void init_dl_rq(struct dl_rq *dl_rq)
#else
init_dl_bw(&dl_rq->dl_bw);
#endif
+ if (global_rt_runtime() == RUNTIME_INF)
+ dl_rq->non_deadline_bw = 0;
+ else
+ dl_rq->non_deadline_bw = (1 << 20) -
+ to_ratio(global_rt_period(), global_rt_runtime());
}

#ifdef CONFIG_SMP
@@ -773,7 +778,7 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
*/
u64 grub_reclaim(u64 delta, struct rq *rq)
{
- return (delta * rq->dl.running_bw) >> 20;
+ return (delta * (rq->dl.non_deadline_bw + rq->dl.running_bw)) >> 20;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e422803..ef4bdaa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,6 +542,12 @@ struct dl_rq {
* task blocks
*/
u64 running_bw;
+
+ /*
+ * Fraction of the CPU utilization that cannot be reclaimed
+ * by the GRUB algorithm.
+ */
+ u64 non_deadline_bw;
};

#ifdef CONFIG_SMP
--
2.7.4