Subject: [PATCH v4 0/7] SCHED_DEADLINE server infrastructure

This is v4 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low-priority
tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.

In v1, a discussion was raised about the consequences of using
deadline-based servers on fixed-priority workloads. As a demonstration,
here is the baseline timerlat scheduling latency as-is, with a kernel
build as background workload:

# rtla timerlat top -u -d 10m

--------------------- %< ------------------------
                                                      Timer Latency
  0 01:42:24  |       IRQ Timer Latency (us)        |     Thread Timer Latency (us)       |    Ret user Timer Latency (us)
CPU COUNT     |      cur      min      avg      max |      cur      min      avg      max |      cur      min      avg      max
  0 #6143559  |        0        0        0       92 |        2        1        3       98 |        4        1        5      100
  1 #6143559  |        1        0        0       97 |        7        1        5      101 |        9        1        7      103
  2 #6143559  |        0        0        0       88 |        3        1        5       95 |        5        1        7       99
  3 #6143559  |        0        0        0       90 |        6        1        5      103 |       10        1        7      126
  4 #6143558  |        1        0        0       81 |        7        1        4       86 |        9        1        7       90
  5 #6143558  |        0        0        0       74 |        3        1        5       79 |        4        1        7       83
  6 #6143558  |        0        0        0       83 |        2        1        5       89 |        3        0        7      108
  7 #6143558  |        0        0        0       85 |        3        1        4      126 |        5        1        6      137
--------------------- >% ------------------------

And here are the same tests with the DL server activating without any delay:
--------------------- %< ------------------------
  0 00:10:01  |       IRQ Timer Latency (us)        |     Thread Timer Latency (us)       |    Ret user Timer Latency (us)
CPU COUNT     |      cur      min      avg      max |      cur      min      avg      max |      cur      min      avg      max
  0 #579147   |        0        0        0       54 |        2        1       52    61095 |        2        2       56    61102
  1 #578766   |        0        0        0       83 |        2        1       49    55824 |        3        2       53    55831
  2 #578559   |        0        0        1       59 |        2        1       50    55760 |        3        2       54    55770
  3 #578318   |        0        0        0       76 |        2        1       49    55751 |        3        2       54    55760
  4 #578611   |        0        0        0       64 |        2        1       49    55811 |        3        2       53    55820
  5 #578347   |        0        0        1       40 |        2        1       50    56121 |        3        2       55    56133
  6 #578938   |        0        0        1       75 |        2        1       49    55755 |        3        2       53    55764
  7 #578631   |        0        0        1       36 |        3        1       51    55528 |        4        2       55    55541
--------------------- >% ------------------------

The problem with a DL-server-only implementation is that FIFO tasks might
suffer preemption from NORMAL tasks even when spare CPU cycles are
available. In fact, the fair deadline server is enqueued right away when
NORMAL tasks wake up, and they are first scheduled by the server, thus
potentially preempting a well-behaving FIFO task. This is, of course, not
ideal.
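
To make the mechanics concrete, here is a minimal sketch of the
non-deferred behavior (the names come from later patches in this series
and are used here for illustration only):

/*
 * Sketch only: without defer, the first fair task waking up on a CPU
 * immediately starts the fair server which, being a DL entity, can
 * preempt a running FIFO task even when spare CPU cycles exist.
 */
static void fair_wakeup_sketch(struct rq *rq)
{
	if (rq->cfs.h_nr_running == 1)	/* first fair task woke up */
		dl_server_start(&rq->fair_server);
}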

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
overkill.

Juri and I discussed this and thought about delaying the server
activation to the 0-lag time, thus enabling the server only
if the fair scheduler is about to starve.

Patch 6/7 adds the possibility to defer the server start
to the (absolute deadline - runtime) point in time. This is
achieved by enqueuing the dl server throttled, with the next
replenishment time set to activate the server at
(absolute deadline - runtime).
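
For illustration, a minimal sketch of the deferred activation instant
(the helper name is hypothetical; the fields are those of struct
sched_dl_entity):

/*
 * Sketch only: the 0-lag time is the latest instant at which the
 * server can be activated and still consume its full remaining
 * runtime before the absolute deadline.
 */
static u64 dl_server_defer_time_sketch(struct sched_dl_entity *dl_se)
{
	/* activate the server at (absolute deadline - runtime) */
	return dl_se->deadline - dl_se->runtime;
}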

Patch 7/7 adds a per-rq interface for the knobs:
fair_server_runtime (950 ms)
fair_server_period (1s)
fair_server_defer (enabled)
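
For example, assuming the knobs end up as per-CPU debugfs files (the
path below is illustrative, runtime/period in nanoseconds), CPU 0 could
be configured with:

~# echo  950000000 > /sys/kernel/debug/sched/fair_server/cpu0/fair_server_runtime
~# echo 1000000000 > /sys/kernel/debug/sched/fair_server/cpu0/fair_server_period
~# echo 1 > /sys/kernel/debug/sched/fair_server/cpu0/fair_server_defer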

With defer enabled on CPUs [0:3], the results get better, showing
a behavior similar to the one we have with RT throttling:
--------------------- %< ------------------------
  0 00:10:01  |       IRQ Timer Latency (us)        |     Thread Timer Latency (us)       |    Ret user Timer Latency (us)
CPU COUNT     |      cur      min      avg      max |      cur      min      avg      max |      cur      min      avg      max
  0 #600003   |        0        0        0       34 |        6        1        5       75 |       10        2        7      108
  1 #600003   |        1        0        1       38 |        9        1        6       96 |       14        2        9      144
  2 #600005   |        1        0        1       85 |       10        1        6       94 |       14        2        9      120
  3 #600006   |        0        0        1       72 |        8        1        6      103 |       13        2        9      108
  4 #600005   |        1        0        1       61 |       10        1        6      110 |       14        2        9      126
  5 #578569   |        0        0        0       65 |       13        1       49    55962 |       20        2       54    55974
  6 #578852   |        0        0        0       56 |        5        1       48    55559 |        9        2       53    55568
  7 #578710   |        0        0        0       91 |       10        1       49    55773 |       16        2       53    55786
--------------------- >% ------------------------

Here are some osnoise measurements, with osnoise threads running as
FIFO:1 in different setups (defer enabled):
- CPU 2 isolated
- CPU 3 isolated, sharing the CPU with a CFS busy-loop task
- CPU 8 non-isolated
- CPU 9 non-isolated, sharing the CPU with a CFS busy-loop task

--------------------- %< ------------------------
~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done # for RT kernel
~# sysctl kernel.sched_rt_runtime_us=-1
~# tuna isolate -c 2
~# tuna isolate -c 3
~# taskset -c 3 ./f &
~# taskset -c 9 ./f &
~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
Operating System Noise
duration: 0 00:10:00 | time is in us
CPU Period        Runtime     Noise  % CPU Aval  Max Noise  Max Single    HW   NMI      IRQ  Softirq  Thread
  2 #599        599000000         3    99.99999          2           1     3     0        0        0       0
  3 #598        598001768  31188796    94.78449      53907       53907     0     0  2842602        0    2394
  8 #599        599000000    918224    99.84670       1735          36     0    88   615903        0   37958
  9 #598        598000000  31441197    94.74227      53875       53448     0    88  3417253        0    1364
--------------------- >% ------------------------

The system runs fine!
- no crashes (famous last words)
- the FIFO property is kept
- per-CPU interface, because it is more flexible, and to detach this
  from the throttling concept.

Global is broken, but it will > /dev/null.

Changes from v3:
- Add the defer server (Daniel)
- Add a per-rq interface (Daniel, with Peter's feedback)
- Add an option to not defer the server (for Joel)
- Typos and one-liner fixes (Valentin, Luca, Peter)
- Fair scheduler running on the dl server does not account as an RT task (Daniel)
- Changed the condition to enable the server (RT && fair tasks) (Daniel)

Changes from v2:
- Refactor/rephrase/typo changes
- Deferrable server using throttling
- The server starts when RT && fair tasks are enqueued
- Interface with runtime/period/defer options

Changes from v1:
- Rebased on 6.4-rc1 tip/sched/core

Daniel Bristot de Oliveira (2):
sched/deadline: Deferrable dl server
sched/fair: Fair server interface

Peter Zijlstra (5):
sched: Unify runtime accounting across classes
sched/deadline: Collect sched_dl_entity initialization
sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
sched/deadline: Introduce deadline servers
sched/fair: Add trivial fair server

include/linux/sched.h | 31 ++-
kernel/sched/core.c | 23 +-
kernel/sched/deadline.c | 555 ++++++++++++++++++++++++++-------------
kernel/sched/debug.c | 177 +++++++++++++
kernel/sched/fair.c | 92 ++++++-
kernel/sched/rt.c | 21 +-
kernel/sched/sched.h | 64 ++++-
kernel/sched/stop_task.c | 13 +-
8 files changed, 737 insertions(+), 239 deletions(-)

--
2.40.1



Subject: [PATCH v4 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity

From: Peter Zijlstra <[email protected]>

In preparation for introducing a !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().
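
For illustration, this is the kind of !task user the move enables (a
sketch only, along the lines of the servers introduced later in this
series; not part of this patch):

/*
 * Sketch: with the bandwidth accounting inside enqueue_dl_entity(),
 * a sched_dl_entity without a backing task_struct can be enqueued
 * directly, without going through enqueue_task_dl().
 */
static void dl_server_start_sketch(struct rq *rq, struct sched_dl_entity *dl_se)
{
	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
	resched_curr(rq);
}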

Reviewed-by: Phil Auld <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/deadline.c | 130 ++++++++++++++++++++++------------------
kernel/sched/sched.h | 6 ++
2 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f8c402079404..957baaf6dc92 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -391,12 +391,12 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
* up, and checks if the task is still in the "ACTIVE non contending"
* state or not (in the second case, it updates running_bw).
*/
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
{
- struct sched_dl_entity *dl_se = &p->dl;
struct hrtimer *timer = &dl_se->inactive_timer;
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
+ struct task_struct *p = dl_task_of(dl_se);
s64 zerolag_time;

/*
@@ -428,13 +428,14 @@ static void task_non_contending(struct task_struct *p)
if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
if (dl_task(p))
sub_running_bw(dl_se, dl_rq);
+
if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));

if (READ_ONCE(p->__state) == TASK_DEAD)
- sub_rq_bw(&p->dl, &rq->dl);
+ sub_rq_bw(dl_se, &rq->dl);
raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+ __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
raw_spin_unlock(&dl_b->lock);
__dl_clear_params(dl_se);
}
@@ -1627,6 +1628,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)

update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);

+ /*
+ * Check if a constrained deadline task was activated
+ * after the deadline but before the next period.
+ * If that is the case, the task will be throttled and
+ * the replenishment timer will be set to the next period.
+ */
+ if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+ dl_check_constrained_dl(dl_se);
+
+ if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ add_rq_bw(dl_se, dl_rq);
+ add_running_bw(dl_se, dl_rq);
+ }
+
+ /*
+ * If p is throttled, we do not enqueue it. In fact, if it exhausted
+ * its budget it needs a replenishment and, since it now is on
+ * its rq, the bandwidth timer callback (which clearly has not
+ * run yet) will take care of this.
+ * However, the active utilization does not depend on the fact
+ * that the task is on the runqueue or not (but depends on the
+ * task's state - in GRUB parlance, "inactive" vs "active contending").
+ * In other words, even if a task is throttled its utilization must
+ * be counted in the active utilization; hence, we need to call
+ * add_running_bw().
+ */
+ if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+ if (flags & ENQUEUE_WAKEUP)
+ task_contending(dl_se, flags);
+
+ return;
+ }
+
/*
* If this is a wakeup or a new instance, the scheduling
* parameters of the task might need updating. Otherwise,
@@ -1646,9 +1682,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
__enqueue_dl_entity(dl_se);
}

-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
__dequeue_dl_entity(dl_se);
+
+ if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ sub_running_bw(dl_se, dl_rq);
+ sub_rq_bw(dl_se, dl_rq);
+ }
+
+ /*
+ * This check allows to start the inactive timer (or to immediately
+ * decrease the active utilization, if needed) in two cases:
+ * when the task blocks and when it is terminating
+ * (p->state == TASK_DEAD). We can handle the two cases in the same
+ * way, because from GRUB's point of view the same thing is happening
+ * (the task moves from "active contending" to "active non contending"
+ * or "inactive")
+ */
+ if (flags & DEQUEUE_SLEEP)
+ task_non_contending(dl_se);
}

static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1693,76 +1748,35 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
return;
}

- /*
- * Check if a constrained deadline task was activated
- * after the deadline but before the next period.
- * If that is the case, the task will be throttled and
- * the replenishment timer will be set to the next period.
- */
- if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
- dl_check_constrained_dl(&p->dl);
-
- if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
- add_rq_bw(&p->dl, &rq->dl);
- add_running_bw(&p->dl, &rq->dl);
- }
-
- /*
- * If p is throttled, we do not enqueue it. In fact, if it exhausted
- * its budget it needs a replenishment and, since it now is on
- * its rq, the bandwidth timer callback (which clearly has not
- * run yet) will take care of this.
- * However, the active utilization does not depend on the fact
- * that the task is on the runqueue or not (but depends on the
- * task's state - in GRUB parlance, "inactive" vs "active contending").
- * In other words, even if a task is throttled its utilization must
- * be counted in the active utilization; hence, we need to call
- * add_running_bw().
- */
- if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
- if (flags & ENQUEUE_WAKEUP)
- task_contending(&p->dl, flags);
-
- return;
- }
-
check_schedstat_required();
update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);

+ if (p->on_rq == TASK_ON_RQ_MIGRATING)
+ flags |= ENQUEUE_MIGRATING;
+
enqueue_dl_entity(&p->dl, flags);

- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}

static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
- dequeue_dl_entity(&p->dl);
- dequeue_pushable_dl_task(rq, p);
+ dequeue_dl_entity(&p->dl, flags);
+
+ if (!p->dl.dl_throttled)
+ dequeue_pushable_dl_task(rq, p);
}

static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
- __dequeue_task_dl(rq, p, flags);

- if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
- sub_running_bw(&p->dl, &rq->dl);
- sub_rq_bw(&p->dl, &rq->dl);
- }
+ if (p->on_rq == TASK_ON_RQ_MIGRATING)
+ flags |= DEQUEUE_MIGRATING;

- /*
- * This check allows to start the inactive timer (or to immediately
- * decrease the active utilization, if needed) in two cases:
- * when the task blocks and when it is terminating
- * (p->state == TASK_DEAD). We can handle the two cases in the same
- * way, because from GRUB's point of view the same thing is happening
- * (the task moves from "active contending" to "active non contending"
- * or "inactive")
- */
- if (flags & DEQUEUE_SLEEP)
- task_non_contending(p);
+ __dequeue_task_dl(rq, p, flags);
}

/*
@@ -2578,7 +2592,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
* will reset the task parameters.
*/
if (task_on_rq_queued(p) && p->dl.dl_runtime)
- task_non_contending(p);
+ task_non_contending(&p->dl);

/*
* In case a task is setscheduled out from SCHED_DEADLINE we need to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5e0df4bba476..9f48ed3e9028 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2193,6 +2193,10 @@ extern const u32 sched_prio_to_wmult[40];
* MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
* in the runqueue.
*
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
@@ -2203,6 +2207,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */

#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
@@ -2217,6 +2222,7 @@ extern const u32 sched_prio_to_wmult[40];
#define ENQUEUE_MIGRATED 0x00
#endif
#define ENQUEUE_INITIAL 0x80
+#define ENQUEUE_MIGRATING 0x100

#define RETRY_TASK ((void *)-1UL)

--
2.40.1