Subject: [RFC PATCH V3 0/6] SCHED_DEADLINE server infrastructure

This is RFC v3 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low-priority
tasks (e.g., SCHED_OTHER) when higher-priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.

I rebased Peter's patches (adding changelogs where needed) on
tip/sched/core as of today and incorporated fixes to issues discussed
during RFC v1 & v2.

During the v1 discussion, concerns were raised about the consequences of
using deadline-based servers on fixed-priority workloads. As a
demonstration, here is the baseline timerlat** scheduling latency as-is,
with a kernel build as background workload:

# rtla timerlat top -u -d 10m

--------------------- %< ------------------------
0 01:00:01 | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max | cur min avg max
0 #3599960 | 1 0 1 31 | 6 1 6 65 | 9 2 9 86
1 #3599972 | 1 0 1 41 | 4 1 5 54 | 7 2 7 78
2 #3599966 | 1 0 1 36 | 6 1 6 65 | 9 2 9 81
3 #3599945 | 0 0 1 31 | 6 1 6 55 | 9 2 9 84
4 #3599939 | 1 0 1 32 | 4 1 6 53 | 7 2 8 85
5 #3599944 | 0 0 1 31 | 4 1 6 50 | 6 2 9 54
6 #3599945 | 1 0 1 38 | 5 1 6 53 | 8 2 9 88
7 #3599944 | 0 0 1 36 | 4 1 5 62 | 6 2 8 86
--------------------- >% ------------------------

And this is the same test with the DL server activating without any delay*:
--------------------- %< ------------------------
0 00:10:01 | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max | cur min avg max
0 #595748 | 1 0 1 254 | 8 1 31 1417 | 12 2 33 1422
1 #597951 | 1 0 1 239 | 6 1 27 1435 | 9 2 30 1438
2 #595060 | 1 0 1 24 | 5 1 28 1437 | 7 2 30 1441
3 #595914 | 1 0 1 218 | 6 1 29 1382 | 9 2 32 1385
4 #597829 | 1 0 1 233 | 8 1 26 1368 | 11 2 29 1427
5 #596314 | 2 0 1 21 | 7 1 29 1442 | 10 2 32 1447
6 #595532 | 1 0 1 238 | 6 1 31 1389 | 9 2 34 1392
7 #595852 | 0 0 1 34 | 6 1 30 1481 | 9 2 33 1484
--------------------- >% ------------------------

The problem with the DL-server-only implementation is that FIFO tasks
might suffer preemption from NORMAL tasks even when spare CPU cycles are
available. In fact, the fair deadline server is enqueued right away when
NORMAL tasks wake up, and they are first scheduled by the server, thus
potentially preempting a well-behaving FIFO task. This is of course not
ideal.
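
For reference, this is the (simplified) piece of patch 5/6 that causes
the immediate preemption: the server is started as soon as the first
fair task is enqueued.

	/* enqueue_task_fair(), excerpt from patch 5/6 */
	if (!rq->cfs.h_nr_running)
		dl_server_start(&rq->fair_server);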

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
overkill. Juri and I discussed it, and that is why Juri added
patch 6/6.

Patch 6/6 adds a PoC of a starvation monitor/watchdog that delays the
enqueuing of deadline servers to the point at which fair tasks might
actually start to suffer from starvation (HZ/2 was picked arbitrarily
for now).
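
In a nutshell (simplified from patch 6/6), the server is now armed via a
per-rq timer instead of starting immediately:

	/* enqueue_task_fair(), excerpt from patch 6/6 */
	if (!rq->cfs.h_nr_running)
		fair_server_watchdog_start(rq);	/* fires at jiffies + HZ/2 */

	/* timer callback: only now the DL server really kicks in */
	static void fair_server_watchdog(struct timer_list *list)
	{
		struct rq *rq = container_of(list, struct rq, fair_server_wd);
		...
		if (rq->cfs.h_nr_running) {
			dl_server_start(&rq->fair_server);
			resched_curr(rq);
		}
		...
	}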

With that in place, the results get better again*:

--------------------- %< ------------------------
0 01:00:01 | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max | cur min avg max
0 #3600004 | 1 0 1 29 | 8 1 5 50 | 11 2 8 66
1 #3600010 | 1 0 1 30 | 7 1 5 50 | 10 2 8 58
2 #3600010 | 0 0 1 30 | 5 1 5 43 | 7 2 7 70
3 #3600010 | 1 0 1 25 | 8 1 6 52 | 12 2 8 74
4 #3600010 | 1 0 1 63 | 8 1 6 72 | 12 2 8 88
5 #3600010 | 1 0 1 26 | 8 1 6 59 | 11 2 8 94
6 #3600010 | 1 0 1 29 | 9 1 5 55 | 12 2 8 82
7 #3600003 | 0 0 1 117 | 6 1 5 124 | 9 2 7 127
--------------------- >% ------------------------

So, that is a step in the right direction, but we can improve it. Here
are the next steps I am taking:

- Getting the parameters from the sysctl sched_rt... knobs (a rough
sketch follows this list)
- Trying to delay the start of the server for the 0-laxity time
- Maybe starting the server throttled, with the replenishment time
at 0-laxity
- Maybe implementing a starvation monitor offload, where the DL server
is started remotely, avoiding the overhead of its activation - like
stalld does;
- Testing with micro-interference to measure overheads.
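
As a rough sketch of the first item (hypothetical only: the
sysctl_sched_rt_{runtime,period} variables are the existing RT
throttling knobs, but the mapping below is an assumption, not part of
this series), the hard-coded TICK_NSEC values of patch 5/6 could be
replaced with something like:

	void fair_server_init(struct rq *rq)
	{
		struct sched_dl_entity *dl_se = &rq->fair_server;
		/* sysctls are in us; ignore the -1 "disabled" case for brevity */
		u64 rt_period  = (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
		u64 rt_runtime = (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;

		init_dl_entity(dl_se);

		/* give fair tasks the bandwidth RT throttling used to reserve */
		dl_se->dl_runtime  = rt_period - rt_runtime;
		dl_se->dl_deadline = rt_period;
		dl_se->dl_period   = rt_period;

		dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
	}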

Here are some osnoise measurements, with osnoise threads running as FIFO:1
under different setups*:
- CPU 2 isolated
- CPU 3 isolated, shared with a CFS busy-loop task
- CPU 8 non-isolated
- CPU 9 non-isolated, shared with a CFS busy-loop task

--------------------- %< ------------------------
# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1 -q
Operating System Noise
duration: 0 00:12:39 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI IRQ Softirq Thread
2 #757 757000000 49 99.99999 14 3 0 0 106 0 0
3 #757 757001039 39322713 94.80546 52992 1103 0 0 3657933 0 59685
8 #757 757000000 779821 99.89698 1513 4 0 113 794899 0 189
9 #757 757001043 39922677 94.72620 53775 1105 0 112 4361243 0 49009
--------------------- >% ------------------------

The results are promising, but there is a problem when HRTICK_DL is not
set... still checking it. No splat in any of these scenarios.

* tests run with throttling disabled, on the 6.3 stable RT kernel, but
also on 6.4 and tip/sched/core.
** timerlat with user-space support is still under development; you need
these patch series:
https://lore.kernel.org/all/[email protected]
https://lore.kernel.org/all/[email protected]
or just run without the -u option :-)

Changes from v2:
- rebased on 6.4-rc1 tip/sched/core

Juri Lelli (1):
sched/fair: Implement starvation monitor

Peter Zijlstra (5):
sched: Unify runtime accounting across classes
sched/deadline: Collect sched_dl_entity initialization
sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
sched/deadline: Introduce deadline servers
sched/fair: Add trivial fair server

include/linux/sched.h | 24 +-
kernel/sched/core.c | 23 +-
kernel/sched/deadline.c | 497 +++++++++++++++++++++++++--------------
kernel/sched/fair.c | 143 +++++++++--
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 60 +++--
kernel/sched/stop_task.c | 13 +-
7 files changed, 538 insertions(+), 237 deletions(-)

--
2.40.1



Subject: [RFC PATCH V3 2/6] sched/deadline: Collect sched_dl_entity initialization

From: Peter Zijlstra <[email protected]>

Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/core.c | 5 +----
kernel/sched/deadline.c | 22 +++++++++++++++-------
kernel/sched/sched.h | 5 +----
3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ac38225e6d09..e34b02cbe41f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4511,10 +4511,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
memset(&p->stats, 0, sizeof(p->stats));
#endif

- RB_CLEAR_NODE(&p->dl.rb_node);
- init_dl_task_timer(&p->dl);
- init_dl_inactive_task_timer(&p->dl);
- __dl_clear_params(p);
+ init_dl_entity(&p->dl);

INIT_LIST_HEAD(&p->rt.run_list);
p->rt.timeout = 0;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 030e7c11607f..22e5e64812c9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -333,6 +333,8 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
__add_rq_bw(new_bw, &rq->dl);
}

+static void __dl_clear_params(struct sched_dl_entity *dl_se);
+
/*
* The utilization of a task cannot be immediately removed from
* the rq active utilization (running_bw) when the task blocks.
@@ -432,7 +434,7 @@ static void task_non_contending(struct task_struct *p)
raw_spin_lock(&dl_b->lock);
__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
raw_spin_unlock(&dl_b->lock);
- __dl_clear_params(p);
+ __dl_clear_params(dl_se);
}

return;
@@ -1205,7 +1207,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}

-void init_dl_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
{
struct hrtimer *timer = &dl_se->dl_timer;

@@ -1415,7 +1417,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
raw_spin_lock(&dl_b->lock);
__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
raw_spin_unlock(&dl_b->lock);
- __dl_clear_params(p);
+ __dl_clear_params(dl_se);

goto unlock;
}
@@ -1431,7 +1433,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
}

-void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
{
struct hrtimer *timer = &dl_se->inactive_timer;

@@ -2974,10 +2976,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
/*
* This function clears the sched_dl_entity static params.
*/
-void __dl_clear_params(struct task_struct *p)
+static void __dl_clear_params(struct sched_dl_entity *dl_se)
{
- struct sched_dl_entity *dl_se = &p->dl;
-
dl_se->dl_runtime = 0;
dl_se->dl_deadline = 0;
dl_se->dl_period = 0;
@@ -2995,6 +2995,14 @@ void __dl_clear_params(struct task_struct *p)
#endif
}

+void init_dl_entity(struct sched_dl_entity *dl_se)
+{
+ RB_CLEAR_NODE(&dl_se->rb_node);
+ init_dl_task_timer(dl_se);
+ init_dl_inactive_task_timer(dl_se);
+ __dl_clear_params(dl_se);
+}
+
bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
{
struct sched_dl_entity *dl_se = &p->dl;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index da0cec2fc63a..fa6512070fa7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -284,8 +284,6 @@ struct rt_bandwidth {
unsigned int rt_period_active;
};

-void __dl_clear_params(struct task_struct *p);
-
static inline int dl_bandwidth_enabled(void)
{
return sysctl_sched_rt_runtime >= 0;
@@ -2390,8 +2388,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);

-extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
-extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_entity(struct sched_dl_entity *dl_se);

#define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT)
--
2.40.1


Subject: [RFC PATCH V3 3/6] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity

From: Peter Zijlstra <[email protected]>

In preparation for introducing a !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/deadline.c | 130 ++++++++++++++++++++++------------------
kernel/sched/sched.h | 6 ++
2 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 22e5e64812c9..869734eecb2c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -389,12 +389,12 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
* up, and checks if the task is still in the "ACTIVE non contending"
* state or not (in the second case, it updates running_bw).
*/
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
{
- struct sched_dl_entity *dl_se = &p->dl;
struct hrtimer *timer = &dl_se->inactive_timer;
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
+ struct task_struct *p = dl_task_of(dl_se);
s64 zerolag_time;

/*
@@ -426,13 +426,14 @@ static void task_non_contending(struct task_struct *p)
if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
if (dl_task(p))
sub_running_bw(dl_se, dl_rq);
+
if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));

if (READ_ONCE(p->__state) == TASK_DEAD)
- sub_rq_bw(&p->dl, &rq->dl);
+ sub_rq_bw(dl_se, &rq->dl);
raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+ __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
raw_spin_unlock(&dl_b->lock);
__dl_clear_params(dl_se);
}
@@ -1629,6 +1630,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)

update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);

+ /*
+ * Check if a constrained deadline task was activated
+ * after the deadline but before the next period.
+ * If that is the case, the task will be throttled and
+ * the replenishment timer will be set to the next period.
+ */
+ if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+ dl_check_constrained_dl(dl_se);
+
+ if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ add_rq_bw(dl_se, dl_rq);
+ add_running_bw(dl_se, dl_rq);
+ }
+
+ /*
+ * If p is throttled, we do not enqueue it. In fact, if it exhausted
+ * its budget it needs a replenishment and, since it now is on
+ * its rq, the bandwidth timer callback (which clearly has not
+ * run yet) will take care of this.
+ * However, the active utilization does not depend on the fact
+ * that the task is on the runqueue or not (but depends on the
+ * task's state - in GRUB parlance, "inactive" vs "active contending").
+ * In other words, even if a task is throttled its utilization must
+ * be counted in the active utilization; hence, we need to call
+ * add_running_bw().
+ */
+ if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+ if (flags & ENQUEUE_WAKEUP)
+ task_contending(dl_se, flags);
+
+ return;
+ }
+
/*
* If this is a wakeup or a new instance, the scheduling
* parameters of the task might need updating. Otherwise,
@@ -1648,9 +1684,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
__enqueue_dl_entity(dl_se);
}

-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
__dequeue_dl_entity(dl_se);
+
+ if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ sub_running_bw(dl_se, dl_rq);
+ sub_rq_bw(dl_se, dl_rq);
+ }
+
+ /*
+ * This check allows to start the inactive timer (or to immediately
+ * decrease the active utilization, if needed) in two cases:
+ * when the task blocks and when it is terminating
+ * (p->state == TASK_DEAD). We can handle the two cases in the same
+ * way, because from GRUB's point of view the same thing is happening
+ * (the task moves from "active contending" to "active non contending"
+ * or "inactive")
+ */
+ if (flags & DEQUEUE_SLEEP)
+ task_non_contending(dl_se);
}

static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1695,76 +1750,35 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
return;
}

- /*
- * Check if a constrained deadline task was activated
- * after the deadline but before the next period.
- * If that is the case, the task will be throttled and
- * the replenishment timer will be set to the next period.
- */
- if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
- dl_check_constrained_dl(&p->dl);
-
- if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
- add_rq_bw(&p->dl, &rq->dl);
- add_running_bw(&p->dl, &rq->dl);
- }
-
- /*
- * If p is throttled, we do not enqueue it. In fact, if it exhausted
- * its budget it needs a replenishment and, since it now is on
- * its rq, the bandwidth timer callback (which clearly has not
- * run yet) will take care of this.
- * However, the active utilization does not depend on the fact
- * that the task is on the runqueue or not (but depends on the
- * task's state - in GRUB parlance, "inactive" vs "active contending").
- * In other words, even if a task is throttled its utilization must
- * be counted in the active utilization; hence, we need to call
- * add_running_bw().
- */
- if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
- if (flags & ENQUEUE_WAKEUP)
- task_contending(&p->dl, flags);
-
- return;
- }
-
check_schedstat_required();
update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);

+ if (p->on_rq == TASK_ON_RQ_MIGRATING)
+ flags |= ENQUEUE_MIGRATING;
+
enqueue_dl_entity(&p->dl, flags);

- if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}

static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
- dequeue_dl_entity(&p->dl);
- dequeue_pushable_dl_task(rq, p);
+ dequeue_dl_entity(&p->dl, flags);
+
+ if (!p->dl.dl_throttled)
+ dequeue_pushable_dl_task(rq, p);
}

static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
- __dequeue_task_dl(rq, p, flags);

- if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
- sub_running_bw(&p->dl, &rq->dl);
- sub_rq_bw(&p->dl, &rq->dl);
- }
+ if (p->on_rq == TASK_ON_RQ_MIGRATING)
+ flags |= DEQUEUE_MIGRATING;

- /*
- * This check allows to start the inactive timer (or to immediately
- * decrease the active utilization, if needed) in two cases:
- * when the task blocks and when it is terminating
- * (p->state == TASK_DEAD). We can handle the two cases in the same
- * way, because from GRUB's point of view the same thing is happening
- * (the task moves from "active contending" to "active non contending"
- * or "inactive")
- */
- if (flags & DEQUEUE_SLEEP)
- task_non_contending(p);
+ __dequeue_task_dl(rq, p, flags);
}

/*
@@ -2580,7 +2594,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
* will reset the task parameters.
*/
if (task_on_rq_queued(p) && p->dl.dl_runtime)
- task_non_contending(p);
+ task_non_contending(&p->dl);

if (!task_on_rq_queued(p)) {
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa6512070fa7..aaf163695c2e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2142,6 +2142,10 @@ extern const u32 sched_prio_to_wmult[40];
* MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
* in the runqueue.
*
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
@@ -2152,6 +2156,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING 0x80 /* Matches ENQUEUE_MIGRATING */

#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
@@ -2165,6 +2170,7 @@ extern const u32 sched_prio_to_wmult[40];
#else
#define ENQUEUE_MIGRATED 0x00
#endif
+#define ENQUEUE_MIGRATING 0x80

#define RETRY_TASK ((void *)-1UL)

--
2.40.1


Subject: [RFC PATCH V3 5/6] sched/fair: Add trivial fair server

From: Peter Zijlstra <[email protected]>

Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
3 files changed, 34 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b88b822ec89..7506dde9849d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10058,6 +10058,7 @@ void __init sched_init(void)
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+ fair_server_init(rq);

#ifdef CONFIG_SCHED_CORE
rq->core = rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c58d8e55b69..f493f05c1f84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6336,6 +6336,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
*/
util_est_enqueue(&rq->cfs, p);

+ if (!rq->cfs.h_nr_running)
+ dl_server_start(&rq->fair_server);
+
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6480,6 +6483,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
rq->next_balance = jiffies;

dequeue_throttle:
+ if (!rq->cfs.h_nr_running)
+ dl_server_stop(&rq->fair_server);
+
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}
@@ -8221,6 +8227,29 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
return pick_next_task_fair(rq, NULL, NULL);
}

+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+ return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+ return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+ struct sched_dl_entity *dl_se = &rq->fair_server;
+
+ init_dl_entity(dl_se);
+
+ dl_se->dl_runtime = TICK_NSEC;
+ dl_se->dl_deadline = 20 * TICK_NSEC;
+ dl_se->dl_period = 20 * TICK_NSEC;
+
+ dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
/*
* Account for a descheduled task:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 390c99e2f8a8..d4a7c0823c53 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -353,6 +353,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick);

+extern void fair_server_init(struct rq *);
+
#ifdef CONFIG_CGROUP_SCHED

struct cfs_rq;
@@ -1015,6 +1017,8 @@ struct rq {
struct rt_rq rt;
struct dl_rq dl;

+ struct sched_dl_entity fair_server;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
struct list_head leaf_cfs_rq_list;
--
2.40.1


Subject: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

From: Juri Lelli <[email protected]>

Starting the deadline server for lower-priority classes right away when
the first task is enqueued might break guarantees, as tasks belonging to
intermediate priority classes could be uselessly preempted. E.g., a
well-behaving (non-hog) FIFO task can be preempted by NORMAL tasks even
if there are still CPU cycles available for NORMAL tasks to run, as
they'll be running inside the fair deadline server for some period of
time.

To prevent this issue, implement a starvation monitor mechanism that
starts the deadline server only if a (fair in this case) task hasn't
been scheduled for some interval of time after it has been enqueued.
Use pick/put functions to manage starvation monitor status.

Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 4 ++++
2 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f493f05c1f84..75eadd85e2b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6315,6 +6315,53 @@ static int sched_idle_cpu(int cpu)
}
#endif

+
+static void fair_server_watchdog(struct timer_list *list)
+{
+ struct rq *rq = container_of(list, struct rq, fair_server_wd);
+ struct rq_flags rf;
+
+ rq_lock_irqsave(rq, &rf);
+ rq->fair_server_wd_running = 0;
+
+ if (!rq->cfs.h_nr_running)
+ goto out;
+
+ update_rq_clock(rq);
+ dl_server_start(&rq->fair_server);
+ rq->fair_server_active = 1;
+ resched_curr(rq);
+
+out:
+ rq_unlock_irqrestore(rq, &rf);
+}
+
+static inline void fair_server_watchdog_start(struct rq *rq)
+{
+ if (rq->fair_server_wd_running || rq->fair_server_active)
+ return;
+
+ timer_setup(&rq->fair_server_wd, fair_server_watchdog, 0);
+ rq->fair_server_wd.expires = jiffies + FAIR_SERVER_WATCHDOG_INTERVAL;
+ add_timer_on(&rq->fair_server_wd, cpu_of(rq));
+ rq->fair_server_active = 0;
+ rq->fair_server_wd_running = 1;
+}
+
+static inline void fair_server_watchdog_stop(struct rq *rq, bool stop_server)
+{
+ if (!rq->fair_server_wd_running && !stop_server)
+ return;
+
+ del_timer(&rq->fair_server_wd);
+ rq->fair_server_wd_running = 0;
+
+ if (stop_server && rq->fair_server_active) {
+ dl_server_stop(&rq->fair_server);
+ rq->fair_server_active = 0;
+ }
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -6337,7 +6384,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
util_est_enqueue(&rq->cfs, p);

if (!rq->cfs.h_nr_running)
- dl_server_start(&rq->fair_server);
+ fair_server_watchdog_start(rq);

/*
* If in_iowait is set, the code below may not trigger any cpufreq
@@ -6484,7 +6531,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

dequeue_throttle:
if (!rq->cfs.h_nr_running)
- dl_server_stop(&rq->fair_server);
+ fair_server_watchdog_stop(rq, true);

util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
@@ -8193,6 +8240,7 @@ done: __maybe_unused;
hrtick_start_fair(rq, p);

update_misfit_status(p, rq);
+ fair_server_watchdog_stop(rq, false);

return p;

@@ -8248,6 +8296,8 @@ void fair_server_init(struct rq *rq)
dl_se->dl_period = 20 * TICK_NSEC;

dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+
+ rq->fair_server_wd_running = 0;
}

/*
@@ -8262,6 +8312,9 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+
+ if (rq->cfs.h_nr_running)
+ fair_server_watchdog_start(rq);
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d4a7c0823c53..cab5d2b1e71f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -353,6 +353,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick);

+#define FAIR_SERVER_WATCHDOG_INTERVAL (HZ >> 1)
extern void fair_server_init(struct rq *);

#ifdef CONFIG_CGROUP_SCHED
@@ -1018,6 +1019,9 @@ struct rq {
struct dl_rq dl;

struct sched_dl_entity fair_server;
+ int fair_server_active;
+ struct timer_list fair_server_wd;
+ int fair_server_wd_running;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
--
2.40.1


Subject: [RFC PATCH V3 4/6] sched/deadline: Introduce deadline servers

From: Peter Zijlstra <[email protected]>

Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling was introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low-priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no
lower-priority activity needs to run (there are mechanisms to fix this
issue as well, but, again, with their own problems).

Introduce deadline servers to service low-priority tasks' needs under
starvation conditions. Deadline servers are built by extending the
SCHED_DEADLINE implementation to allow 2-level scheduling (a
sched_deadline entity becomes a container for lower-priority scheduling
entities).
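
For the big picture, this is how a scheduling class is expected to plug
into the new interface (condensed from the fair server usage added in
patch 5/6; my_class_has_tasks()/my_class_pick() are placeholder names):

	/* at init time, register the nested pick/has_tasks callbacks */
	dl_server_init(dl_se, rq, my_class_has_tasks, my_class_pick);

	/* when the first task of the served class becomes runnable */
	dl_server_start(dl_se);

	/* when the last task of the served class goes away */
	dl_server_stop(dl_se);

	/* from the class' runtime accounting, to consume server budget */
	dl_server_update(dl_se, delta_exec);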

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
include/linux/sched.h | 22 ++-
kernel/sched/core.c | 17 ++
kernel/sched/deadline.c | 350 +++++++++++++++++++++++++++-------------
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 29 ++++
5 files changed, 309 insertions(+), 113 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26b1925a702a..4c90d7693a75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -64,12 +64,14 @@ struct robust_list_head;
struct root_domain;
struct rq;
struct sched_attr;
+struct sched_dl_entity;
struct sched_param;
struct seq_file;
struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct task_struct;
struct user_event_mm;

/*
@@ -600,6 +602,9 @@ struct sched_rt_entity {
#endif
} __randomize_layout;

+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
struct sched_dl_entity {
struct rb_node rb_node;

@@ -647,6 +652,7 @@ struct sched_dl_entity {
unsigned int dl_yielded : 1;
unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1;
+ unsigned int dl_server : 1;

/*
* Bandwidth enforcement timer. Each -deadline task has its
@@ -661,7 +667,20 @@ struct sched_dl_entity {
* timer is needed to decrease the active utilization at the correct
* time.
*/
- struct hrtimer inactive_timer;
+ struct hrtimer inactive_timer;
+
+ /*
+ * Bits for DL-server functionality. Also see the comment near
+ * dl_server_update().
+ *
+ * @rq the runqueue this server is for
+ *
+ * @server_has_tasks() returns true if @server_pick return a
+ * runnable task.
+ */
+ struct rq *rq;
+ dl_server_has_tasks_f server_has_tasks;
+ dl_server_pick_f server_pick;

#ifdef CONFIG_RT_MUTEXES
/*
@@ -790,6 +809,7 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
+ struct sched_dl_entity *server;
const struct sched_class *sched_class;

#ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e34b02cbe41f..5b88b822ec89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3803,6 +3803,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
rq->idle_stamp = 0;
}
#endif
+
+ p->server = NULL;
}

/*
@@ -6013,12 +6015,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_next_task_idle(rq);
}

+ /*
+ * This is the fast path; it cannot be a DL server pick;
+ * therefore even if @p == @prev, ->server must be NULL.
+ */
+ if (p->server)
+ p->server = NULL;
+
return p;
}

restart:
put_prev_task_balance(rq, prev, rf);

+ /*
+ * We've updated @prev and no longer need the server link, clear it.
+ * Must be done before ->pick_next_task() because that can (re)set
+ * ->server.
+ */
+ if (prev->server)
+ prev->server = NULL;
+
for_each_class(class) {
p = class->pick_next_task(rq);
if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 869734eecb2c..c67056ff5749 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -52,8 +52,14 @@ static int __init sched_dl_sysctl_init(void)
late_initcall(sched_dl_sysctl_init);
#endif

+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+ return dl_se->dl_server;
+}
+
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
+ BUG_ON(dl_server(dl_se));
return container_of(dl_se, struct task_struct, dl);
}

@@ -62,14 +68,22 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
return container_of(dl_rq, struct rq, dl);
}

-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = task_rq(p);
+ struct rq *rq = dl_se->rq;
+
+ if (!dl_server(dl_se))
+ rq = task_rq(dl_task_of(dl_se));

- return &rq->dl;
+ return rq;
}

+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+ return &rq_of_dl_se(dl_se)->dl;
+}
+
+
static inline int on_dl_rq(struct sched_dl_entity *dl_se)
{
return !RB_EMPTY_NODE(&dl_se->rb_node);
@@ -392,9 +406,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
static void task_non_contending(struct sched_dl_entity *dl_se)
{
struct hrtimer *timer = &dl_se->inactive_timer;
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq);
- struct task_struct *p = dl_task_of(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+ struct dl_rq *dl_rq = &rq->dl;
s64 zerolag_time;

/*
@@ -424,25 +437,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
* utilization now, instead of starting a timer
*/
if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
- if (dl_task(p))
+ if (dl_server(dl_se)) {
sub_running_bw(dl_se, dl_rq);
+ } else {
+ struct task_struct *p = dl_task_of(dl_se);
+
+ if (dl_task(p))
+ sub_running_bw(dl_se, dl_rq);

- if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));

- if (READ_ONCE(p->__state) == TASK_DEAD)
- sub_rq_bw(dl_se, &rq->dl);
- raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
- raw_spin_unlock(&dl_b->lock);
- __dl_clear_params(dl_se);
+ if (READ_ONCE(p->__state) == TASK_DEAD)
+ sub_rq_bw(dl_se, &rq->dl);
+ raw_spin_lock(&dl_b->lock);
+ __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+ raw_spin_unlock(&dl_b->lock);
+ __dl_clear_params(dl_se);
+ }
}

return;
}

dl_se->dl_non_contending = 1;
- get_task_struct(p);
+ if (!dl_server(dl_se))
+ get_task_struct(dl_task_of(dl_se));
+
hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
}

@@ -469,8 +490,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
* will not touch the rq's active utilization,
* so we are still safe.
*/
- if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
- put_task_struct(dl_task_of(dl_se));
+ if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+ if (!dl_server(dl_se))
+ put_task_struct(dl_task_of(dl_se));
+ }
} else {
/*
* Since "dl_non_contending" is not set, the
@@ -483,10 +506,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
}
}

-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- struct sched_dl_entity *dl_se = &p->dl;
-
return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
}

@@ -573,8 +594,6 @@ static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

if (p->nr_cpus_allowed > 1)
dl_rq->dl_nr_migratory++;
-
- update_dl_migration(dl_rq);
}

static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
@@ -583,8 +602,6 @@ static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

if (p->nr_cpus_allowed > 1)
dl_rq->dl_nr_migratory--;
-
- update_dl_migration(dl_rq);
}

#define __node_2_pdl(node) \
@@ -762,8 +779,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
}
#endif /* CONFIG_SMP */

+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags);

static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
@@ -1011,8 +1030,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
*/
static void update_dl_entity(struct sched_dl_entity *dl_se)
{
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq);
+ struct rq *rq = rq_of_dl_se(dl_se);

if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
dl_entity_overflow(dl_se, rq_clock(rq))) {
@@ -1043,11 +1061,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
* actually started or not (i.e., the replenishment instant is in
* the future or in the past).
*/
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
{
- struct sched_dl_entity *dl_se = &p->dl;
struct hrtimer *timer = &dl_se->dl_timer;
- struct rq *rq = task_rq(p);
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
ktime_t now, act;
s64 delta;

@@ -1081,13 +1099,33 @@ static int start_dl_timer(struct task_struct *p)
* and observe our state.
*/
if (!hrtimer_is_queued(timer)) {
- get_task_struct(p);
+ if (!dl_server(dl_se))
+ get_task_struct(dl_task_of(dl_se));
hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
}

return 1;
}

+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+ /*
+ * Queueing this task back might have overloaded rq, check if we need
+ * to kick someone away.
+ */
+ if (has_pushable_dl_tasks(rq)) {
+ /*
+ * Nothing relies on rq->lock after this, so its safe to drop
+ * rq->lock.
+ */
+ rq_unpin_lock(rq, rf);
+ push_dl_task(rq);
+ rq_repin_lock(rq, rf);
+ }
+#endif
+}
+
/*
* This is the bandwidth enforcement timer callback. If here, we know
* a task is not on its dl_rq, since the fact that the timer was running
@@ -1106,10 +1144,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
struct sched_dl_entity *dl_se = container_of(timer,
struct sched_dl_entity,
dl_timer);
- struct task_struct *p = dl_task_of(dl_se);
+ struct task_struct *p;
struct rq_flags rf;
struct rq *rq;

+ if (dl_server(dl_se)) {
+ struct rq *rq = rq_of_dl_se(dl_se);
+ struct rq_flags rf;
+
+ rq_lock(rq, &rf);
+ if (dl_se->dl_throttled) {
+ sched_clock_tick();
+ update_rq_clock(rq);
+
+ if (dl_se->server_has_tasks(dl_se)) {
+ enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+ resched_curr(rq);
+ __push_dl_task(rq, &rf);
+ } else {
+ replenish_dl_entity(dl_se);
+ }
+
+ }
+ rq_unlock(rq, &rf);
+
+ return HRTIMER_NORESTART;
+ }
+
+ p = dl_task_of(dl_se);
rq = task_rq_lock(p, &rf);

/*
@@ -1180,21 +1242,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
else
resched_curr(rq);

-#ifdef CONFIG_SMP
- /*
- * Queueing this task back might have overloaded rq, check if we need
- * to kick someone away.
- */
- if (has_pushable_dl_tasks(rq)) {
- /*
- * Nothing relies on rq->lock after this, so its safe to drop
- * rq->lock.
- */
- rq_unpin_lock(rq, &rf);
- push_dl_task(rq);
- rq_repin_lock(rq, &rf);
- }
-#endif
+ __push_dl_task(rq, &rf);

unlock:
task_rq_unlock(rq, p, &rf);
@@ -1236,12 +1284,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
*/
static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+ struct rq *rq = rq_of_dl_se(dl_se);

if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
- if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
+ if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
return;
dl_se->dl_throttled = 1;
if (dl_se->runtime > 0)
@@ -1296,29 +1343,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
return (delta * u_act) >> BW_SHIFT;
}

-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+ int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
{
- struct task_struct *curr = rq->curr;
- struct sched_dl_entity *dl_se = &curr->dl;
- s64 delta_exec, scaled_delta_exec;
- int cpu = cpu_of(rq);
-
- if (!dl_task(curr) || !on_dl_rq(dl_se))
- return;
+ s64 scaled_delta_exec;

- /*
- * Consumed budget is computed considering the time as
- * observed by schedulable tasks (excluding time spent
- * in hardirq context, etc.). Deadlines are instead
- * computed using hard walltime. This seems to be the more
- * natural solution, but the full ramifications of this
- * approach need further study.
- */
- delta_exec = update_curr_common(rq);
if (unlikely(delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
@@ -1336,10 +1367,9 @@ static void update_curr_dl(struct rq *rq)
* according to current frequency and CPU maximum capacity.
*/
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
- scaled_delta_exec = grub_reclaim(delta_exec,
- rq,
- &curr->dl);
+ scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
} else {
+ int cpu = cpu_of(rq);
unsigned long scale_freq = arch_scale_freq_capacity(cpu);
unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);

@@ -1358,11 +1388,21 @@ static void update_curr_dl(struct rq *rq)
(dl_se->flags & SCHED_FLAG_DL_OVERRUN))
dl_se->dl_overrun = 1;

- __dequeue_task_dl(rq, curr, 0);
- if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
- enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+ dequeue_dl_entity(dl_se, 0);
+ if (!dl_server(dl_se)) {
+ /* XXX: After v2, from __dequeue_task_dl() */
+ update_stats_dequeue_dl(&rq->dl, dl_se, 0);
+ dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+ }
+
+ if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+ if (dl_server(dl_se))
+ enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+ else
+ enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+ }

- if (!is_leftmost(curr, &rq->dl))
+ if (!is_leftmost(dl_se, &rq->dl))
resched_curr(rq);
}

@@ -1392,20 +1432,82 @@ static void update_curr_dl(struct rq *rq)
}
}

+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+ update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+ if (!dl_server(dl_se)) {
+ dl_se->dl_server = 1;
+ setup_new_dl_entity(dl_se);
+ }
+ enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+ dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+ dl_server_has_tasks_f has_tasks,
+ dl_server_pick_f pick)
+{
+ dl_se->rq = rq;
+ dl_se->server_has_tasks = has_tasks;
+ dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct sched_dl_entity *dl_se = &curr->dl;
+ s64 delta_exec;
+
+ if (!dl_task(curr) || !on_dl_rq(dl_se))
+ return;
+
+ /*
+ * Consumed budget is computed considering the time as
+ * observed by schedulable tasks (excluding time spent
+ * in hardirq context, etc.). Deadlines are instead
+ * computed using hard walltime. This seems to be the more
+ * natural solution, but the full ramifications of this
+ * approach need further study.
+ */
+ delta_exec = update_curr_common(rq);
+ update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
{
struct sched_dl_entity *dl_se = container_of(timer,
struct sched_dl_entity,
inactive_timer);
- struct task_struct *p = dl_task_of(dl_se);
+ struct task_struct *p = NULL;
struct rq_flags rf;
struct rq *rq;

- rq = task_rq_lock(p, &rf);
+ if (!dl_server(dl_se)) {
+ p = dl_task_of(dl_se);
+ rq = task_rq_lock(p, &rf);
+ } else {
+ rq = dl_se->rq;
+ rq_lock(rq, &rf);
+ }

sched_clock_tick();
update_rq_clock(rq);

+ if (dl_server(dl_se))
+ goto no_task;
+
if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));

@@ -1422,14 +1524,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)

goto unlock;
}
+
+no_task:
if (dl_se->dl_non_contending == 0)
goto unlock;

sub_running_bw(dl_se, &rq->dl);
dl_se->dl_non_contending = 0;
unlock:
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
+
+ if (!dl_server(dl_se)) {
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ } else {
+ rq_unlock(rq, &rf);
+ }

return HRTIMER_NORESTART;
}
@@ -1487,34 +1596,35 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}

+static inline void update_dl_migration(struct dl_rq *dl_rq) {}
+
#endif /* CONFIG_SMP */

static inline
void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- int prio = dl_task_of(dl_se)->prio;
u64 deadline = dl_se->deadline;

- WARN_ON(!dl_prio(prio));
dl_rq->dl_nr_running++;
add_nr_running(rq_of_dl_rq(dl_rq), 1);

inc_dl_deadline(dl_rq, deadline);
- inc_dl_migration(dl_se, dl_rq);
+ if (!dl_server(dl_se))
+ inc_dl_migration(dl_se, dl_rq);
+ update_dl_migration(dl_rq);
}

static inline
void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- int prio = dl_task_of(dl_se)->prio;
-
- WARN_ON(!dl_prio(prio));
WARN_ON(!dl_rq->dl_nr_running);
dl_rq->dl_nr_running--;
sub_nr_running(rq_of_dl_rq(dl_rq), 1);

dec_dl_deadline(dl_rq, dl_se->deadline);
- dec_dl_migration(dl_se, dl_rq);
+ if (!dl_server(dl_se))
+ dec_dl_migration(dl_se, dl_rq);
+ update_dl_migration(dl_rq);
}

static inline bool __dl_less(struct rb_node *a, const struct rb_node *b)
@@ -1676,8 +1786,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
} else if (flags & ENQUEUE_REPLENISH) {
replenish_dl_entity(dl_se);
} else if ((flags & ENQUEUE_RESTORE) &&
- dl_time_before(dl_se->deadline,
- rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+ dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
setup_new_dl_entity(dl_se);
}

@@ -1762,14 +1871,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
enqueue_pushable_dl_task(rq, p);
}

-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
- update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
- dequeue_dl_entity(&p->dl, flags);
-
- if (!p->dl.dl_throttled)
- dequeue_pushable_dl_task(rq, p);
-}

static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
@@ -1778,7 +1879,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (p->on_rq == TASK_ON_RQ_MIGRATING)
flags |= DEQUEUE_MIGRATING;

- __dequeue_task_dl(rq, p, flags);
+ dequeue_dl_entity(&p->dl, flags);
+ if (!p->dl.dl_throttled)
+ dequeue_pushable_dl_task(rq, p);
}

/*
@@ -1968,12 +2071,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
}

#ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
{
- hrtick_start(rq, p->dl.runtime);
+ hrtick_start(rq, dl_se->runtime);
}
#else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
{
}
#endif
@@ -1993,9 +2096,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;

- if (hrtick_enabled_dl(rq))
- start_hrtick_dl(rq, p);
-
if (rq->curr->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);

@@ -2018,12 +2118,26 @@ static struct task_struct *pick_task_dl(struct rq *rq)
struct dl_rq *dl_rq = &rq->dl;
struct task_struct *p;

+again:
if (!sched_dl_runnable(rq))
return NULL;

dl_se = pick_next_dl_entity(dl_rq);
WARN_ON_ONCE(!dl_se);
- p = dl_task_of(dl_se);
+
+
+ if (dl_server(dl_se)) {
+ p = dl_se->server_pick(dl_se);
+ if (!p) {
+ // XXX should not happen, warn?!
+ dl_se->dl_yielded = 1;
+ update_curr_dl_se(rq, dl_se, 0);
+ goto again;
+ }
+ p->server = dl_se;
+ } else {
+ p = dl_task_of(dl_se);
+ }

return p;
}
@@ -2033,9 +2147,20 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
struct task_struct *p;

p = pick_task_dl(rq);
- if (p)
+ if (!p)
+ return p;
+
+ /*
+ * XXX: re-check !dl_server, changed from v2 because of
+ * pick_next_task_dl change
+ */
+ if (!dl_server(&p->dl))
set_next_task_dl(rq, p, true);

+ /* XXX not quite right */
+ if (hrtick_enabled(rq))
+ start_hrtick_dl(rq, &p->dl);
+
return p;
}

@@ -2073,8 +2198,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
* be set and schedule() will start a new hrtick for the next task.
*/
if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
- is_leftmost(p, &rq->dl))
- start_hrtick_dl(rq, p);
+ is_leftmost(&p->dl, &rq->dl))
+ start_hrtick_dl(rq, &p->dl);
}

static void task_fork_dl(struct task_struct *p)
@@ -3003,6 +3128,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
dl_se->dl_yielded = 0;
dl_se->dl_non_contending = 0;
dl_se->dl_overrun = 0;
+ dl_se->dl_server = 0;

#ifdef CONFIG_RT_MUTEXES
dl_se->pi_se = dl_se;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fda67f05190d..0c58d8e55b69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,6 +930,8 @@ s64 update_curr_common(struct rq *rq)

account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);
+ if (curr->server)
+ dl_server_update(curr->server, delta_exec);

return delta_exec;
}
@@ -958,6 +960,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
+ if (curtask->server)
+ dl_server_update(curtask->server, delta_exec);
}

account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aaf163695c2e..390c99e2f8a8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -324,6 +324,35 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_cpu_busy(int cpu, struct task_struct *p);

+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ * dl_se::rq -- runqueue we belong to.
+ *
+ * dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ * server when it runs out of tasks to run.
+ *
+ * dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ * returns NULL.
+ *
+ * dl_server_update() -- called from update_curr_common(), propagates runtime
+ * to the server.
+ *
+ * dl_server_start()
+ * dl_server_stop() -- start/stop the server when it has (no) tasks
+ *
+ * dl_server_init()
+ *
+ * XXX
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+ dl_server_has_tasks_f has_tasks,
+ dl_server_pick_f pick);
+
#ifdef CONFIG_CGROUP_SCHED

struct cfs_rq;
--
2.40.1


Subject: [RFC PATCH V3 1/6] sched: Unify runtime accounting across classes

From: Peter Zijlstra <[email protected]>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/deadline.c | 15 +++--------
kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++----------
kernel/sched/rt.c | 15 +++--------
kernel/sched/sched.h | 12 ++-------
kernel/sched/stop_task.c | 13 +--------
6 files changed, 53 insertions(+), 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1292d38d66cc..26b1925a702a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,7 +521,7 @@ struct sched_statistics {
u64 block_max;
s64 sum_block_runtime;

- u64 exec_max;
+ s64 exec_max;
u64 slice_max;

u64 nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f827067ad03b..030e7c11607f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1301,9 +1301,8 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec, scaled_delta_exec;
+ s64 delta_exec, scaled_delta_exec;
int cpu = cpu_of(rq);
- u64 now;

if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1316,21 +1315,13 @@ static void update_curr_dl(struct rq *rq)
* natural solution, but the full ramifications of this
* approach need further study.
*/
- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec <= 0)) {
+ delta_exec = update_curr_common(rq);
+ if (unlikely(delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
return;
}

- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
- trace_sched_stat_runtime(curr, delta_exec, 0);
-
- update_current_exec_runtime(curr, now, delta_exec);
-
if (dl_entity_is_special(dl_se))
return;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..fda67f05190d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,23 +891,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */

-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
{
- struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_clock_task(rq_of(cfs_rq));
- u64 delta_exec;
-
- if (unlikely(!curr))
- return;
+ u64 now = rq_clock_task(rq);
+ s64 delta_exec;

delta_exec = now - curr->exec_start;
- if (unlikely((s64)delta_exec <= 0))
- return;
+ if (unlikely(delta_exec <= 0))
+ return delta_exec;

curr->exec_start = now;
+ curr->sum_exec_runtime += delta_exec;

if (schedstat_enabled()) {
struct sched_statistics *stats;
@@ -917,8 +911,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
max(delta_exec, stats->exec_max));
}

- curr->sum_exec_runtime += delta_exec;
- schedstat_add(cfs_rq->exec_clock, delta_exec);
+ return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ s64 delta_exec;
+
+ delta_exec = update_curr_se(rq, &curr->se);
+ if (unlikely(delta_exec <= 0))
+ return delta_exec;
+
+ trace_sched_stat_runtime(curr, delta_exec, 0);
+
+ account_group_exec_runtime(curr, delta_exec);
+ cgroup_account_cputime(curr, delta_exec);
+
+ return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 delta_exec;
+
+ if (unlikely(!curr))
+ return;
+
+ delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+ if (unlikely(delta_exec <= 0))
+ return;

curr->vruntime += calc_delta_fair(delta_exec, curr);
update_min_vruntime(cfs_rq);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e5074115..efec4f3fef83 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1046,24 +1046,15 @@ static void update_curr_rt(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_rt_entity *rt_se = &curr->rt;
- u64 delta_exec;
- u64 now;
+ s64 delta_exec;

if (curr->sched_class != &rt_sched_class)
return;

- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec <= 0))
+ delta_exec = update_curr_common(rq);
+ if (unlikely(delta_exec <= 0))
return;

- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
- trace_sched_stat_runtime(curr, delta_exec, 0);
-
- update_current_exec_runtime(curr, now, delta_exec);
-
if (!rt_bandwidth_enabled())
return;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 556496c77dc2..da0cec2fc63a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2176,6 +2176,8 @@ struct affinity_context {
unsigned int flags;
};

+extern s64 update_curr_common(struct rq *rq);
+
struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
@@ -3207,16 +3209,6 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif

-static inline void update_current_exec_runtime(struct task_struct *curr,
- u64 now, u64 delta_exec)
-{
- curr->se.sum_exec_runtime += delta_exec;
- account_group_exec_runtime(curr, delta_exec);
-
- curr->se.exec_start = now;
- cgroup_account_cputime(curr, delta_exec);
-}
-
#ifdef CONFIG_SCHED_MM_CID

#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 85590599b4d6..7595494ceb6d 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)

static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
{
- struct task_struct *curr = rq->curr;
- u64 now, delta_exec;
-
- now = rq_clock_task(rq);
- delta_exec = now - curr->se.exec_start;
- if (unlikely((s64)delta_exec < 0))
- delta_exec = 0;
-
- schedstat_set(curr->stats.exec_max,
- max(curr->stats.exec_max, delta_exec));
-
- update_current_exec_runtime(curr, now, delta_exec);
+ update_curr_common(rq);
}

/*
--
2.40.1


Date: 2023-06-12 02:02:43
From: Joel Fernandes
Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

Hello,

On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
<[email protected]> wrote:
>
> From: Juri Lelli <[email protected]>
>
> Starting deadline server for lower priority classes right away when
> first task is enqueued might break guarantees, as tasks belonging to
> intermediate priority classes could be uselessly preempted. E.g., a well
> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> there are still CPU cycles available for NORMAL tasks to run, as they'll
> be running inside the fair deadline server for some period of time.
>
> To prevent this issue, implement a starvation monitor mechanism that
> starts the deadline server only if a (fair in this case) task hasn't
> been scheduled for some interval of time after it has been enqueued.
> Use pick/put functions to manage starvation monitor status.

Me and Vineeth were discussing that another way of resolving this
issue is to use a DL-server for RT as well, and then using a smaller
deadline for RT. That way the RT is more likely to be selected due to
its earlier deadline/period.
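
For concreteness, the ordering this relies on, with parameters made up
purely for the example (not taken from the series):

--------------------- %< ------------------------
  rt_server : runtime 950us, period  1ms  -> wakeup deadline = now +  1ms
  cfs_server: runtime 500us, period 20ms  -> wakeup deadline = now + 20ms
--------------------- >% ------------------------

EDF always picks the earlier absolute deadline, so an RT server with the
smaller period would normally be selected ahead of the fair server.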

Another approach could be to implement the 0-laxity scheduling as a
general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
And then opt-in the feature for the CFS deadline server task.

Lastly, if the goal is to remove RT throttling code eventually, are
you also planning to remove RT group scheduling as well? Are there
users of RT group scheduling that might be impacted? On the other
hand, RT throttling / group scheduling code can be left as it is
(perhaps documenting it as deprecated) and the server stuff can be
implemented via a CONFIG option.

- Joel

> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 4 ++++
> 2 files changed, 59 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f493f05c1f84..75eadd85e2b3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6315,6 +6315,53 @@ static int sched_idle_cpu(int cpu)
> }
> #endif
>
> +
> +static void fair_server_watchdog(struct timer_list *list)
> +{
> + struct rq *rq = container_of(list, struct rq, fair_server_wd);
> + struct rq_flags rf;
> +
> + rq_lock_irqsave(rq, &rf);
> + rq->fair_server_wd_running = 0;
> +
> + if (!rq->cfs.h_nr_running)
> + goto out;
> +
> + update_rq_clock(rq);
> + dl_server_start(&rq->fair_server);
> + rq->fair_server_active = 1;
> + resched_curr(rq);
> +
> +out:
> + rq_unlock_irqrestore(rq, &rf);
> +}
> +
> +static inline void fair_server_watchdog_start(struct rq *rq)
> +{
> + if (rq->fair_server_wd_running || rq->fair_server_active)
> + return;
> +
> + timer_setup(&rq->fair_server_wd, fair_server_watchdog, 0);
> + rq->fair_server_wd.expires = jiffies + FAIR_SERVER_WATCHDOG_INTERVAL;
> + add_timer_on(&rq->fair_server_wd, cpu_of(rq));
> + rq->fair_server_active = 0;
> + rq->fair_server_wd_running = 1;
> +}
> +
> +static inline void fair_server_watchdog_stop(struct rq *rq, bool stop_server)
> +{
> + if (!rq->fair_server_wd_running && !stop_server)
> + return;
> +
> + del_timer(&rq->fair_server_wd);
> + rq->fair_server_wd_running = 0;
> +
> + if (stop_server && rq->fair_server_active) {
> + dl_server_stop(&rq->fair_server);
> + rq->fair_server_active = 0;
> + }
> +}
> +
> /*
> * The enqueue_task method is called before nr_running is
> * increased. Here we update the fair scheduling stats and
> @@ -6337,7 +6384,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> util_est_enqueue(&rq->cfs, p);
>
> if (!rq->cfs.h_nr_running)
> - dl_server_start(&rq->fair_server);
> + fair_server_watchdog_start(rq);
>
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> @@ -6484,7 +6531,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> dequeue_throttle:
> if (!rq->cfs.h_nr_running)
> - dl_server_stop(&rq->fair_server);
> + fair_server_watchdog_stop(rq, true);
>
> util_est_update(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> @@ -8193,6 +8240,7 @@ done: __maybe_unused;
> hrtick_start_fair(rq, p);
>
> update_misfit_status(p, rq);
> + fair_server_watchdog_stop(rq, false);
>
> return p;
>
> @@ -8248,6 +8296,8 @@ void fair_server_init(struct rq *rq)
> dl_se->dl_period = 20 * TICK_NSEC;
>
> dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
> +
> + rq->fair_server_wd_running = 0;
> }
>
> /*
> @@ -8262,6 +8312,9 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
> cfs_rq = cfs_rq_of(se);
> put_prev_entity(cfs_rq, se);
> }
> +
> + if (rq->cfs.h_nr_running)
> + fair_server_watchdog_start(rq);
> }
>
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d4a7c0823c53..cab5d2b1e71f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -353,6 +353,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> dl_server_has_tasks_f has_tasks,
> dl_server_pick_f pick);
>
> +#define FAIR_SERVER_WATCHDOG_INTERVAL (HZ >> 1)
> extern void fair_server_init(struct rq *);
>
> #ifdef CONFIG_CGROUP_SCHED
> @@ -1018,6 +1019,9 @@ struct rq {
> struct dl_rq dl;
>
> struct sched_dl_entity fair_server;
> + int fair_server_active;
> + struct timer_list fair_server_wd;
> + int fair_server_wd_running;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /* list of leaf cfs_rq on this CPU: */
> --
> 2.40.1
>

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On 6/12/23 03:57, Joel Fernandes wrote:
> Hello,
>
> On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
> <[email protected]> wrote:
>>
>> From: Juri Lelli <[email protected]>
>>
>> Starting deadline server for lower priority classes right away when
>> first task is enqueued might break guarantees, as tasks belonging to
>> intermediate priority classes could be uselessly preempted. E.g., a well
>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
>> there are still CPU cycles available for NORMAL tasks to run, as they'll
>> be running inside the fair deadline server for some period of time.
>>
>> To prevent this issue, implement a starvation monitor mechanism that
>> starts the deadline server only if a (fair in this case) task hasn't
>> been scheduled for some interval of time after it has been enqueued.
>> Use pick/put functions to manage starvation monitor status.
>
> Me and Vineeth were discussing that another way of resolving this
> issue is to use a DL-server for RT as well, and then using a smaller
> deadline for RT. That way the RT is more likely to be selected due to
> its earlier deadline/period.

It would not be that different from what we have now.

One of the problems of throttling nowadays is that it accounts for a large window
of time, and any "imprecision" can cause the mechanism not to work as expected.

For example, we work on a fully-isolated CPU scenario, where some very sporadic
workload can be placed on the isolated CPU because of per-cpu kernel activities,
e.g., kworkers... We need to let them run, but for a minimal amount of time, for
instance, 20 us, to bound the interference.

The current mechanism does not give this precision because the IRQ accounting
does not account for runtime for the rt throttling (which makes sense). So the
RT throttling has the 20 us stolen from IRQs and keeps running. The same will
happen if we swap the current mechanism with a DL server for the RT.

Also, thinking about short deadlines to fake a fixed priority is... not starting
well. A fixed-priority higher instance is not a property of a deadline-based
scheduler, and Linux has a fixed-priority hierarchy: STOP -> DL -> RT -> CFS...
It is simple, and that is good.

That is why it is better to boost CFS instead of throttling RT. By boosting
CFS, you do not need a server for RT, and we account for anything on top of CFS
for free (IRQ/DL/FIFO...).

>
> Another approach could be to implement the 0-laxity scheduling as a
> general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
> tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
> And then opt-in the feature for the CFS deadline server task.

A 0-laxity scheduler is not as simple as it sounds, as the priority also depends
on the "C" (runtime, generally WCET), which is hard to find and embeds
pessimism. Also, having such a feature would make other mechanisms harder, as
well as debugging things. For example, proxy-execution or a more precise
schedulability test...
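
To pin the term down: laxity (slack) at time t is how long the entity can
still wait and yet complete its remaining runtime before the deadline. A
hypothetical helper (not from the patch set) just to make the dependence
on "C" explicit:

--------------------- %< ------------------------
/* laxity = (deadline - now) - remaining runtime; 0 means "run now or miss" */
static inline s64 dl_entity_laxity(const struct sched_dl_entity *dl_se, u64 now)
{
	return (s64)(dl_se->deadline - now) - dl_se->runtime;
}
--------------------- >% ------------------------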

In a paper, the scheduler alone is the solution. In real life, the solution
for problems like locking is as fundamental as the scheduler. We need to keep
things simple to expand on these other topics as well.

So, I do not think we need all the drawbacks of a mixed solution to just fix
the throttling problem, and EDF is more capable and explored for the
general case.

With this patch's idea (and expansions), we can fix the throttling problem
without breaking other behaviors like scheduling order...

>
> Lastly, if the goal is to remove RT throttling code eventually, are
> you also planning to remove RT group scheduling as well? Are there
> users of RT group scheduling that might be impacted? On the other
> hand, RT throttling / group scheduling code can be left as it is
> (perhaps documenting it as deprecated) and the server stuff can be
> implemented via a CONFIG option.

I think that the idea is to have the DL servers eventually replace the group
schedule. But I also believe that it is better to start by solving the
throttling and then moving to other constructions on top of the mechanism.

-- Daniel
>
> - Joel
>
>> Signed-off-by: Juri Lelli <[email protected]>
>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>


2023-06-12 20:41:39

by Joel Fernandes

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

Hello Daniel!

On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
<[email protected]> wrote:
[...]
> > On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
> > <[email protected]> wrote:
> >>
> >> From: Juri Lelli <[email protected]>
> >>
> >> Starting deadline server for lower priority classes right away when
> >> first task is enqueued might break guarantees, as tasks belonging to
> >> intermediate priority classes could be uselessly preempted. E.g., a well
> >> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> >> there are still CPU cycles available for NORMAL tasks to run, as they'll
> >> be running inside the fair deadline server for some period of time.
> >>
> >> To prevent this issue, implement a starvation monitor mechanism that
> >> starts the deadline server only if a (fair in this case) task hasn't
> >> been scheduled for some interval of time after it has been enqueued.
> >> Use pick/put functions to manage starvation monitor status.
> >
> > Me and Vineeth were discussing that another way of resolving this
> > issue is to use a DL-server for RT as well, and then using a smaller
> > deadline for RT. That way the RT is more likely to be selected due to
> > its earlier deadline/period.
>
> It would not be that different from what we have now.
>
> One of the problems of throttling nowadays is that it accounts for a large window
> of time, and any "imprecision" can cause the mechanism not to work as expected.
>
> For example, we work on a fully-isolated CPU scenario, where some very sporadic
> workload can be placed on the isolated CPU because of per-cpu kernel activities,
> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
> instance, 20 us, to bound the interference.
>
> The current mechanism does not give this precision because the IRQ accounting
> does not account for runtime for the rt throttling (which makes sense).

I lost you here, "Runtime for the rt throttling" does not make much
sense to me as a statement.

> So the
> RT throttling has the 20 us stolen from IRQs and keeps running. The same will
> happen if we swap the current mechanism with a DL server for the RT.

I read this about 10 times to learn that *maybe* you mean that IRQs
stole time from the "Well behaved running time" of the RT class. I am
not seeing how that is related to creation of a DL-server for the RT
class. Maybe describe your point a bit more clearly?

>
> Also, thinking about short deadlines to fake a fixed priority is... not starting
> well. A fixed-priority higher instance is not a property of a deadline-based
> scheduler, and Linux has a fixed-priority hierarchy: STOP -> DL -> RT -> CFS...
> It is simple, and that is good.
>
> That is why it is better to boost CFS instead of throttling RT. By boosting
> CFS, you do not need a server for RT, and we account for anything on top of CFS
> for free (IRQ/DL/FIFO...).

I did mention in my last email that it is not ideal. I just brought it
up as an option. It might reduce the problem being seen and is better
than not having it.

> > Another approach could be to implement the 0-laxity scheduling as a
> > general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
> > tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
> > And then opt-in the feature for the CFS deadline server task.
>
> A 0-laxity scheduler is not as simple as it sounds, as the priority also depends
> on the "C" (runtime, generally WCET), which is hard to find and embeds
> pessimism. Also, having such a feature would make other mechanisms harder, as
> well as debugging things. For example, proxy-execution or a more precise
> schedulability test...

I think you did not read my email properly, I was saying make the
0-laxity default-off and the opt-in for certain DL tasks. That may
work perfectly well for a system like ChromeOS where likely we will
use the DL server as the sole deadline task and opt-in for the
0-laxity. Then we don't need watchdog hacks at all and it all cleanly
works within the DL class itself. There are the drawbacks of the
pessimism/locking etc (I already knew that by the way as the obvious
drawbacks of 0-laxity) but I am not immediately seeing how this
CFS-watchdog with 0-laxity is any different from the DL-server itself
having such a property. If you really have a concrete point on why
that won't work, and if you could clarify that more clearly why a
watchdog is better than it, that would be great.

> In a paper, the scheduler alone is the solution. In real life, the solution
> for problems like locking is as fundamental as the scheduler. We need to keep
> things simple to expand on these other topics as well.
>
> So, I do not think we need all the drawbacks of a mixed solution to just fix
> the throttling problem, and EDF is more capable and explored for the
> general case.

Again, I was saying making it opt-in seems like a reasonable approach
and just enabling such property for the DL server.

> With this patch's idea (and expansions), we can fix the throttling problem
> without breaking other behaviors like scheduling order...

I don't mind the watchdog patch as such, of course. I presented its
mechanics at OSSNA and I know how it works, but I feel the DL server
opting-in for 0-laxity would be cleaner while keeping such behavior as
default-off for regular DL uses, that's my opinion -- but what else am
I missing? Either way, no harm in discussing alternate approaches as
well even if we are settling for the watchdog.

> > Lastly, if the goal is to remove RT throttling code eventually, are
> > you also planning to remove RT group scheduling as well? Are there
> > users of RT group scheduling that might be impacted? On the other
> > hand, RT throttling / group scheduling code can be left as it is
> > (perhaps documenting it as deprecated) and the server stuff can be
> > implemented via a CONFIG option.
>
> I think that the idea is to have the DL servers eventually replace the group
> schedule. But I also believe that it is better to start by solving the
> throttling and then moving to other constructions on top of the mechanism.

Hmm. For throttling at the root level yes, but I am not seeing how
you can replace the group scheduling code for existing users of RT
Cgroups with this. The throttling in the RT group scheduling code is
not exactly only about "not starving CFS", it is more related to
letting RT groups run with certain bandwidth. So you cannot really
delete it if there are real users of that code -- you'll have to
migrate those users away first (to an alternate implementation like
DL). If there are no users of RT group scheduling, that's lovely
though. We don't use it in ChromeOS fwiw.

- Joel

2023-06-13 13:35:31

by Phil Auld

Subject: Re: [RFC PATCH V3 4/6] sched/deadline: Introduce deadline servers

Hi,

On Thu, Jun 08, 2023 at 05:58:16PM +0200 Daniel Bristot de Oliveira wrote:
> From: Peter Zijlstra <[email protected]>
>
> Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
> with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
>
> RT Throttling has been introduced a while ago as a (mostly debug)
> countermeasure one can utilize to reserve some CPU time for low priority
> tasks (usually background type of work, e.g. workqueues, timers, etc.).
> It however has its own problems (see documentation) and the undesired
> effect of unconditionally throttling FIFO tasks even when no lower
> priority activity needs to run (there are mechanisms to fix this issue
> as well, but, again, with their own problems).
>
> Introduce deadline servers to service low priority tasks needs under
> starvation conditions. Deadline servers are built extending SCHED_DEADLINE
> implementation to allow 2-level scheduling (a sched_deadline entity
> becomes a container for lower priority scheduling entities).
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> include/linux/sched.h | 22 ++-
> kernel/sched/core.c | 17 ++
> kernel/sched/deadline.c | 350 +++++++++++++++++++++++++++-------------
> kernel/sched/fair.c | 4 +
> kernel/sched/sched.h | 29 ++++
> 5 files changed, 309 insertions(+), 113 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 26b1925a702a..4c90d7693a75 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -64,12 +64,14 @@ struct robust_list_head;
> struct root_domain;
> struct rq;
> struct sched_attr;
> +struct sched_dl_entity;
> struct sched_param;
> struct seq_file;
> struct sighand_struct;
> struct signal_struct;
> struct task_delay_info;
> struct task_group;
> +struct task_struct;
> struct user_event_mm;
>
> /*
> @@ -600,6 +602,9 @@ struct sched_rt_entity {
> #endif
> } __randomize_layout;
>
> +typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
> +typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
> +
> struct sched_dl_entity {
> struct rb_node rb_node;
>
> @@ -647,6 +652,7 @@ struct sched_dl_entity {
> unsigned int dl_yielded : 1;
> unsigned int dl_non_contending : 1;
> unsigned int dl_overrun : 1;
> + unsigned int dl_server : 1;
>
> /*
> * Bandwidth enforcement timer. Each -deadline task has its
> @@ -661,7 +667,20 @@ struct sched_dl_entity {
> * timer is needed to decrease the active utilization at the correct
> * time.
> */
> - struct hrtimer inactive_timer;
> + struct hrtimer inactive_timer;
> +
> + /*
> + * Bits for DL-server functionality. Also see the comment near
> + * dl_server_update().
> + *
> + * @rq the runqueue this server is for
> + *
> + * @server_has_tasks() returns true if @server_pick return a
> + * runnable task.
> + */
> + struct rq *rq;
> + dl_server_has_tasks_f server_has_tasks;
> + dl_server_pick_f server_pick;
>
> #ifdef CONFIG_RT_MUTEXES
> /*
> @@ -790,6 +809,7 @@ struct task_struct {
> struct sched_entity se;
> struct sched_rt_entity rt;
> struct sched_dl_entity dl;
> + struct sched_dl_entity *server;
> const struct sched_class *sched_class;


I like the cleanups in the first 3 patches a lot. I'm still wrapping my head
around the rest. But "server" is a very nondescript, generic term for
something in task_struct. Maybe "sched_server" or something?

Cheers,
Phil

>
> #ifdef CONFIG_SCHED_CORE
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e34b02cbe41f..5b88b822ec89 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3803,6 +3803,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> rq->idle_stamp = 0;
> }
> #endif
> +
> + p->server = NULL;
> }
>
> /*
> @@ -6013,12 +6015,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> p = pick_next_task_idle(rq);
> }
>
> + /*
> + * This is the fast path; it cannot be a DL server pick;
> + * therefore even if @p == @prev, ->server must be NULL.
> + */
> + if (p->server)
> + p->server = NULL;
> +
> return p;
> }
>
> restart:
> put_prev_task_balance(rq, prev, rf);
>
> + /*
> + * We've updated @prev and no longer need the server link, clear it.
> + * Must be done before ->pick_next_task() because that can (re)set
> + * ->server.
> + */
> + if (prev->server)
> + prev->server = NULL;
> +
> for_each_class(class) {
> p = class->pick_next_task(rq);
> if (p)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 869734eecb2c..c67056ff5749 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -52,8 +52,14 @@ static int __init sched_dl_sysctl_init(void)
> late_initcall(sched_dl_sysctl_init);
> #endif
>
> +static bool dl_server(struct sched_dl_entity *dl_se)
> +{
> + return dl_se->dl_server;
> +}
> +
> static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
> {
> + BUG_ON(dl_server(dl_se));
> return container_of(dl_se, struct task_struct, dl);
> }
>
> @@ -62,14 +68,22 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
> return container_of(dl_rq, struct rq, dl);
> }
>
> -static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
> +static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
> {
> - struct task_struct *p = dl_task_of(dl_se);
> - struct rq *rq = task_rq(p);
> + struct rq *rq = dl_se->rq;
> +
> + if (!dl_server(dl_se))
> + rq = task_rq(dl_task_of(dl_se));
>
> - return &rq->dl;
> + return rq;
> }
>
> +static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
> +{
> + return &rq_of_dl_se(dl_se)->dl;
> +}
> +
> +
> static inline int on_dl_rq(struct sched_dl_entity *dl_se)
> {
> return !RB_EMPTY_NODE(&dl_se->rb_node);
> @@ -392,9 +406,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
> static void task_non_contending(struct sched_dl_entity *dl_se)
> {
> struct hrtimer *timer = &dl_se->inactive_timer;
> - struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> - struct rq *rq = rq_of_dl_rq(dl_rq);
> - struct task_struct *p = dl_task_of(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> + struct dl_rq *dl_rq = &rq->dl;
> s64 zerolag_time;
>
> /*
> @@ -424,25 +437,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
> * utilization now, instead of starting a timer
> */
> if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
> - if (dl_task(p))
> + if (dl_server(dl_se)) {
> sub_running_bw(dl_se, dl_rq);
> + } else {
> + struct task_struct *p = dl_task_of(dl_se);
> +
> + if (dl_task(p))
> + sub_running_bw(dl_se, dl_rq);
>
> - if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
> - struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> + if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
> + struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
>
> - if (READ_ONCE(p->__state) == TASK_DEAD)
> - sub_rq_bw(dl_se, &rq->dl);
> - raw_spin_lock(&dl_b->lock);
> - __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
> - raw_spin_unlock(&dl_b->lock);
> - __dl_clear_params(dl_se);
> + if (READ_ONCE(p->__state) == TASK_DEAD)
> + sub_rq_bw(dl_se, &rq->dl);
> + raw_spin_lock(&dl_b->lock);
> + __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
> + raw_spin_unlock(&dl_b->lock);
> + __dl_clear_params(dl_se);
> + }
> }
>
> return;
> }
>
> dl_se->dl_non_contending = 1;
> - get_task_struct(p);
> + if (!dl_server(dl_se))
> + get_task_struct(dl_task_of(dl_se));
> +
> hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
> }
>
> @@ -469,8 +490,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
> * will not touch the rq's active utilization,
> * so we are still safe.
> */
> - if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
> - put_task_struct(dl_task_of(dl_se));
> + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
> + if (!dl_server(dl_se))
> + put_task_struct(dl_task_of(dl_se));
> + }
> } else {
> /*
> * Since "dl_non_contending" is not set, the
> @@ -483,10 +506,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
> }
> }
>
> -static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> +static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> - struct sched_dl_entity *dl_se = &p->dl;
> -
> return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
> }
>
> @@ -573,8 +594,6 @@ static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>
> if (p->nr_cpus_allowed > 1)
> dl_rq->dl_nr_migratory++;
> -
> - update_dl_migration(dl_rq);
> }
>
> static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> @@ -583,8 +602,6 @@ static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>
> if (p->nr_cpus_allowed > 1)
> dl_rq->dl_nr_migratory--;
> -
> - update_dl_migration(dl_rq);
> }
>
> #define __node_2_pdl(node) \
> @@ -762,8 +779,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
> }
> #endif /* CONFIG_SMP */
>
> +static void
> +enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
> static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> -static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> +static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
> static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags);
>
> static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> @@ -1011,8 +1030,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
> */
> static void update_dl_entity(struct sched_dl_entity *dl_se)
> {
> - struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> - struct rq *rq = rq_of_dl_rq(dl_rq);
> + struct rq *rq = rq_of_dl_se(dl_se);
>
> if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
> dl_entity_overflow(dl_se, rq_clock(rq))) {
> @@ -1043,11 +1061,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
> * actually started or not (i.e., the replenishment instant is in
> * the future or in the past).
> */
> -static int start_dl_timer(struct task_struct *p)
> +static int start_dl_timer(struct sched_dl_entity *dl_se)
> {
> - struct sched_dl_entity *dl_se = &p->dl;
> struct hrtimer *timer = &dl_se->dl_timer;
> - struct rq *rq = task_rq(p);
> + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_rq(dl_rq);
> ktime_t now, act;
> s64 delta;
>
> @@ -1081,13 +1099,33 @@ static int start_dl_timer(struct task_struct *p)
> * and observe our state.
> */
> if (!hrtimer_is_queued(timer)) {
> - get_task_struct(p);
> + if (!dl_server(dl_se))
> + get_task_struct(dl_task_of(dl_se));
> hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
> }
>
> return 1;
> }
>
> +static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
> +{
> +#ifdef CONFIG_SMP
> + /*
> + * Queueing this task back might have overloaded rq, check if we need
> + * to kick someone away.
> + */
> + if (has_pushable_dl_tasks(rq)) {
> + /*
> + * Nothing relies on rq->lock after this, so its safe to drop
> + * rq->lock.
> + */
> + rq_unpin_lock(rq, rf);
> + push_dl_task(rq);
> + rq_repin_lock(rq, rf);
> + }
> +#endif
> +}
> +
> /*
> * This is the bandwidth enforcement timer callback. If here, we know
> * a task is not on its dl_rq, since the fact that the timer was running
> @@ -1106,10 +1144,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> struct sched_dl_entity *dl_se = container_of(timer,
> struct sched_dl_entity,
> dl_timer);
> - struct task_struct *p = dl_task_of(dl_se);
> + struct task_struct *p;
> struct rq_flags rf;
> struct rq *rq;
>
> + if (dl_server(dl_se)) {
> + struct rq *rq = rq_of_dl_se(dl_se);
> + struct rq_flags rf;
> +
> + rq_lock(rq, &rf);
> + if (dl_se->dl_throttled) {
> + sched_clock_tick();
> + update_rq_clock(rq);
> +
> + if (dl_se->server_has_tasks(dl_se)) {
> + enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
> + resched_curr(rq);
> + __push_dl_task(rq, &rf);
> + } else {
> + replenish_dl_entity(dl_se);
> + }
> +
> + }
> + rq_unlock(rq, &rf);
> +
> + return HRTIMER_NORESTART;
> + }
> +
> + p = dl_task_of(dl_se);
> rq = task_rq_lock(p, &rf);
>
> /*
> @@ -1180,21 +1242,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> else
> resched_curr(rq);
>
> -#ifdef CONFIG_SMP
> - /*
> - * Queueing this task back might have overloaded rq, check if we need
> - * to kick someone away.
> - */
> - if (has_pushable_dl_tasks(rq)) {
> - /*
> - * Nothing relies on rq->lock after this, so its safe to drop
> - * rq->lock.
> - */
> - rq_unpin_lock(rq, &rf);
> - push_dl_task(rq);
> - rq_repin_lock(rq, &rf);
> - }
> -#endif
> + __push_dl_task(rq, &rf);
>
> unlock:
> task_rq_unlock(rq, p, &rf);
> @@ -1236,12 +1284,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
> */
> static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
> {
> - struct task_struct *p = dl_task_of(dl_se);
> - struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
> + struct rq *rq = rq_of_dl_se(dl_se);
>
> if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
> dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
> - if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
> + if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
> return;
> dl_se->dl_throttled = 1;
> if (dl_se->runtime > 0)
> @@ -1296,29 +1343,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
> return (delta * u_act) >> BW_SHIFT;
> }
>
> -/*
> - * Update the current task's runtime statistics (provided it is still
> - * a -deadline task and has not been removed from the dl_rq).
> - */
> -static void update_curr_dl(struct rq *rq)
> +static inline void
> +update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> + int flags);
> +static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
> {
> - struct task_struct *curr = rq->curr;
> - struct sched_dl_entity *dl_se = &curr->dl;
> - s64 delta_exec, scaled_delta_exec;
> - int cpu = cpu_of(rq);
> -
> - if (!dl_task(curr) || !on_dl_rq(dl_se))
> - return;
> + s64 scaled_delta_exec;
>
> - /*
> - * Consumed budget is computed considering the time as
> - * observed by schedulable tasks (excluding time spent
> - * in hardirq context, etc.). Deadlines are instead
> - * computed using hard walltime. This seems to be the more
> - * natural solution, but the full ramifications of this
> - * approach need further study.
> - */
> - delta_exec = update_curr_common(rq);
> if (unlikely(delta_exec <= 0)) {
> if (unlikely(dl_se->dl_yielded))
> goto throttle;
> @@ -1336,10 +1367,9 @@ static void update_curr_dl(struct rq *rq)
> * according to current frequency and CPU maximum capacity.
> */
> if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> - scaled_delta_exec = grub_reclaim(delta_exec,
> - rq,
> - &curr->dl);
> + scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
> } else {
> + int cpu = cpu_of(rq);
> unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
>
> @@ -1358,11 +1388,21 @@ static void update_curr_dl(struct rq *rq)
> (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
> dl_se->dl_overrun = 1;
>
> - __dequeue_task_dl(rq, curr, 0);
> - if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
> - enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> + dequeue_dl_entity(dl_se, 0);
> + if (!dl_server(dl_se)) {
> + /* XXX: After v2, from __dequeue_task_dl() */
> + update_stats_dequeue_dl(&rq->dl, dl_se, 0);
> + dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
> + }
> +
> + if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
> + if (dl_server(dl_se))
> + enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
> + else
> + enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
> + }
>
> - if (!is_leftmost(curr, &rq->dl))
> + if (!is_leftmost(dl_se, &rq->dl))
> resched_curr(rq);
> }
>
> @@ -1392,20 +1432,82 @@ static void update_curr_dl(struct rq *rq)
> }
> }
>
> +void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
> +{
> + update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
> +}
> +
> +void dl_server_start(struct sched_dl_entity *dl_se)
> +{
> + if (!dl_server(dl_se)) {
> + dl_se->dl_server = 1;
> + setup_new_dl_entity(dl_se);
> + }
> + enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
> +}
> +
> +void dl_server_stop(struct sched_dl_entity *dl_se)
> +{
> + dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
> +}
> +
> +void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> + dl_server_has_tasks_f has_tasks,
> + dl_server_pick_f pick)
> +{
> + dl_se->rq = rq;
> + dl_se->server_has_tasks = has_tasks;
> + dl_se->server_pick = pick;
> +}
> +
> +/*
> + * Update the current task's runtime statistics (provided it is still
> + * a -deadline task and has not been removed from the dl_rq).
> + */
> +static void update_curr_dl(struct rq *rq)
> +{
> + struct task_struct *curr = rq->curr;
> + struct sched_dl_entity *dl_se = &curr->dl;
> + s64 delta_exec;
> +
> + if (!dl_task(curr) || !on_dl_rq(dl_se))
> + return;
> +
> + /*
> + * Consumed budget is computed considering the time as
> + * observed by schedulable tasks (excluding time spent
> + * in hardirq context, etc.). Deadlines are instead
> + * computed using hard walltime. This seems to be the more
> + * natural solution, but the full ramifications of this
> + * approach need further study.
> + */
> + delta_exec = update_curr_common(rq);
> + update_curr_dl_se(rq, dl_se, delta_exec);
> +}
> +
> static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> {
> struct sched_dl_entity *dl_se = container_of(timer,
> struct sched_dl_entity,
> inactive_timer);
> - struct task_struct *p = dl_task_of(dl_se);
> + struct task_struct *p = NULL;
> struct rq_flags rf;
> struct rq *rq;
>
> - rq = task_rq_lock(p, &rf);
> + if (!dl_server(dl_se)) {
> + p = dl_task_of(dl_se);
> + rq = task_rq_lock(p, &rf);
> + } else {
> + rq = dl_se->rq;
> + rq_lock(rq, &rf);
> + }
>
> sched_clock_tick();
> update_rq_clock(rq);
>
> + if (dl_server(dl_se))
> + goto no_task;
> +
> if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
> struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
>
> @@ -1422,14 +1524,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>
> goto unlock;
> }
> +
> +no_task:
> if (dl_se->dl_non_contending == 0)
> goto unlock;
>
> sub_running_bw(dl_se, &rq->dl);
> dl_se->dl_non_contending = 0;
> unlock:
> - task_rq_unlock(rq, p, &rf);
> - put_task_struct(p);
> +
> + if (!dl_server(dl_se)) {
> + task_rq_unlock(rq, p, &rf);
> + put_task_struct(p);
> + } else {
> + rq_unlock(rq, &rf);
> + }
>
> return HRTIMER_NORESTART;
> }
> @@ -1487,34 +1596,35 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>
> +static inline void update_dl_migration(struct dl_rq *dl_rq) {}
> +
> #endif /* CONFIG_SMP */
>
> static inline
> void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> - int prio = dl_task_of(dl_se)->prio;
> u64 deadline = dl_se->deadline;
>
> - WARN_ON(!dl_prio(prio));
> dl_rq->dl_nr_running++;
> add_nr_running(rq_of_dl_rq(dl_rq), 1);
>
> inc_dl_deadline(dl_rq, deadline);
> - inc_dl_migration(dl_se, dl_rq);
> + if (!dl_server(dl_se))
> + inc_dl_migration(dl_se, dl_rq);
> + update_dl_migration(dl_rq);
> }
>
> static inline
> void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> - int prio = dl_task_of(dl_se)->prio;
> -
> - WARN_ON(!dl_prio(prio));
> WARN_ON(!dl_rq->dl_nr_running);
> dl_rq->dl_nr_running--;
> sub_nr_running(rq_of_dl_rq(dl_rq), 1);
>
> dec_dl_deadline(dl_rq, dl_se->deadline);
> - dec_dl_migration(dl_se, dl_rq);
> + if (!dl_server(dl_se))
> + dec_dl_migration(dl_se, dl_rq);
> + update_dl_migration(dl_rq);
> }
>
> static inline bool __dl_less(struct rb_node *a, const struct rb_node *b)
> @@ -1676,8 +1786,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
> } else if (flags & ENQUEUE_REPLENISH) {
> replenish_dl_entity(dl_se);
> } else if ((flags & ENQUEUE_RESTORE) &&
> - dl_time_before(dl_se->deadline,
> - rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
> + dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
> setup_new_dl_entity(dl_se);
> }
>
> @@ -1762,14 +1871,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> enqueue_pushable_dl_task(rq, p);
> }
>
> -static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> -{
> - update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
> - dequeue_dl_entity(&p->dl, flags);
> -
> - if (!p->dl.dl_throttled)
> - dequeue_pushable_dl_task(rq, p);
> -}
>
> static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> {
> @@ -1778,7 +1879,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> if (p->on_rq == TASK_ON_RQ_MIGRATING)
> flags |= DEQUEUE_MIGRATING;
>
> - __dequeue_task_dl(rq, p, flags);
> + dequeue_dl_entity(&p->dl, flags);
> + if (!p->dl.dl_throttled)
> + dequeue_pushable_dl_task(rq, p);
> }
>
> /*
> @@ -1968,12 +2071,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> }
>
> #ifdef CONFIG_SCHED_HRTICK
> -static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> +static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
> {
> - hrtick_start(rq, p->dl.runtime);
> + hrtick_start(rq, dl_se->runtime);
> }
> #else /* !CONFIG_SCHED_HRTICK */
> -static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> +static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
> {
> }
> #endif
> @@ -1993,9 +2096,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
> if (!first)
> return;
>
> - if (hrtick_enabled_dl(rq))
> - start_hrtick_dl(rq, p);
> -
> if (rq->curr->sched_class != &dl_sched_class)
> update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>
> @@ -2018,12 +2118,26 @@ static struct task_struct *pick_task_dl(struct rq *rq)
> struct dl_rq *dl_rq = &rq->dl;
> struct task_struct *p;
>
> +again:
> if (!sched_dl_runnable(rq))
> return NULL;
>
> dl_se = pick_next_dl_entity(dl_rq);
> WARN_ON_ONCE(!dl_se);
> - p = dl_task_of(dl_se);
> +
> +
> + if (dl_server(dl_se)) {
> + p = dl_se->server_pick(dl_se);
> + if (!p) {
> + // XXX should not happen, warn?!
> + dl_se->dl_yielded = 1;
> + update_curr_dl_se(rq, dl_se, 0);
> + goto again;
> + }
> + p->server = dl_se;
> + } else {
> + p = dl_task_of(dl_se);
> + }
>
> return p;
> }
> @@ -2033,9 +2147,20 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> struct task_struct *p;
>
> p = pick_task_dl(rq);
> - if (p)
> + if (!p)
> + return p;
> +
> + /*
> + * XXX: re-check !dl_server, changed from v2 because of
> + * pick_next_task_dl change
> + */
> + if (!dl_server(&p->dl))
> set_next_task_dl(rq, p, true);
>
> + /* XXX not quite right */
> + if (hrtick_enabled(rq))
> + start_hrtick_dl(rq, &p->dl);
> +
> return p;
> }
>
> @@ -2073,8 +2198,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> * be set and schedule() will start a new hrtick for the next task.
> */
> if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
> - is_leftmost(p, &rq->dl))
> - start_hrtick_dl(rq, p);
> + is_leftmost(&p->dl, &rq->dl))
> + start_hrtick_dl(rq, &p->dl);
> }
>
> static void task_fork_dl(struct task_struct *p)
> @@ -3003,6 +3128,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
> dl_se->dl_yielded = 0;
> dl_se->dl_non_contending = 0;
> dl_se->dl_overrun = 0;
> + dl_se->dl_server = 0;
>
> #ifdef CONFIG_RT_MUTEXES
> dl_se->pi_se = dl_se;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fda67f05190d..0c58d8e55b69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -930,6 +930,8 @@ s64 update_curr_common(struct rq *rq)
>
> account_group_exec_runtime(curr, delta_exec);
> cgroup_account_cputime(curr, delta_exec);
> + if (curr->server)
> + dl_server_update(curr->server, delta_exec);
>
> return delta_exec;
> }
> @@ -958,6 +960,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
> trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> cgroup_account_cputime(curtask, delta_exec);
> account_group_exec_runtime(curtask, delta_exec);
> + if (curtask->server)
> + dl_server_update(curtask->server, delta_exec);
> }
>
> account_cfs_rq_runtime(cfs_rq, delta_exec);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aaf163695c2e..390c99e2f8a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -324,6 +324,35 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
> extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> extern int dl_cpu_busy(int cpu, struct task_struct *p);
>
> +/*
> + * SCHED_DEADLINE supports servers (nested scheduling) with the following
> + * interface:
> + *
> + * dl_se::rq -- runqueue we belong to.
> + *
> + * dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
> + * server when it runs out of tasks to run.
> + *
> + * dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
> + * returns NULL.
> + *
> + * dl_server_update() -- called from update_curr_common(), propagates runtime
> + * to the server.
> + *
> + * dl_server_start()
> + * dl_server_stop() -- start/stop the server when it has (no) tasks
> + *
> + * dl_server_init()
> + *
> + * XXX
> + */
> +extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
> +extern void dl_server_start(struct sched_dl_entity *dl_se);
> +extern void dl_server_stop(struct sched_dl_entity *dl_se);
> +extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> + dl_server_has_tasks_f has_tasks,
> + dl_server_pick_f pick);
> +
> #ifdef CONFIG_CGROUP_SCHED
>
> struct cfs_rq;
> --
> 2.40.1
>

--


2023-06-13 13:51:07

by Phil Auld

Subject: Re: [RFC PATCH V3 1/6] sched: Unify runtime accounting across classes

On Thu, Jun 08, 2023 at 05:58:13PM +0200 Daniel Bristot de Oliveira wrote:
> From: Peter Zijlstra <[email protected]>
>
> All classes use sched_entity::exec_start to track runtime and have
> copies of the exact same code around to compute runtime.
>
> Collapse all that.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>

Reviewed-by: Phil Auld <[email protected]>

> ---
> include/linux/sched.h | 2 +-
> kernel/sched/deadline.c | 15 +++--------
> kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++----------
> kernel/sched/rt.c | 15 +++--------
> kernel/sched/sched.h | 12 ++-------
> kernel/sched/stop_task.c | 13 +--------
> 6 files changed, 53 insertions(+), 61 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1292d38d66cc..26b1925a702a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -521,7 +521,7 @@ struct sched_statistics {
> u64 block_max;
> s64 sum_block_runtime;
>
> - u64 exec_max;
> + s64 exec_max;
> u64 slice_max;
>
> u64 nr_migrations_cold;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f827067ad03b..030e7c11607f 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1301,9 +1301,8 @@ static void update_curr_dl(struct rq *rq)
> {
> struct task_struct *curr = rq->curr;
> struct sched_dl_entity *dl_se = &curr->dl;
> - u64 delta_exec, scaled_delta_exec;
> + s64 delta_exec, scaled_delta_exec;
> int cpu = cpu_of(rq);
> - u64 now;
>
> if (!dl_task(curr) || !on_dl_rq(dl_se))
> return;
> @@ -1316,21 +1315,13 @@ static void update_curr_dl(struct rq *rq)
> * natural solution, but the full ramifications of this
> * approach need further study.
> */
> - now = rq_clock_task(rq);
> - delta_exec = now - curr->se.exec_start;
> - if (unlikely((s64)delta_exec <= 0)) {
> + delta_exec = update_curr_common(rq);
> + if (unlikely(delta_exec <= 0)) {
> if (unlikely(dl_se->dl_yielded))
> goto throttle;
> return;
> }
>
> - schedstat_set(curr->stats.exec_max,
> - max(curr->stats.exec_max, delta_exec));
> -
> - trace_sched_stat_runtime(curr, delta_exec, 0);
> -
> - update_current_exec_runtime(curr, now, delta_exec);
> -
> if (dl_entity_is_special(dl_se))
> return;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6189d1a45635..fda67f05190d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -891,23 +891,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> }
> #endif /* CONFIG_SMP */
>
> -/*
> - * Update the current task's runtime statistics.
> - */
> -static void update_curr(struct cfs_rq *cfs_rq)
> +static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> {
> - struct sched_entity *curr = cfs_rq->curr;
> - u64 now = rq_clock_task(rq_of(cfs_rq));
> - u64 delta_exec;
> -
> - if (unlikely(!curr))
> - return;
> + u64 now = rq_clock_task(rq);
> + s64 delta_exec;
>
> delta_exec = now - curr->exec_start;
> - if (unlikely((s64)delta_exec <= 0))
> - return;
> + if (unlikely(delta_exec <= 0))
> + return delta_exec;
>
> curr->exec_start = now;
> + curr->sum_exec_runtime += delta_exec;
>
> if (schedstat_enabled()) {
> struct sched_statistics *stats;
> @@ -917,8 +911,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
> max(delta_exec, stats->exec_max));
> }
>
> - curr->sum_exec_runtime += delta_exec;
> - schedstat_add(cfs_rq->exec_clock, delta_exec);
> + return delta_exec;
> +}
> +
> +/*
> + * Used by other classes to account runtime.
> + */
> +s64 update_curr_common(struct rq *rq)
> +{
> + struct task_struct *curr = rq->curr;
> + s64 delta_exec;
> +
> + delta_exec = update_curr_se(rq, &curr->se);
> + if (unlikely(delta_exec <= 0))
> + return delta_exec;
> +
> + trace_sched_stat_runtime(curr, delta_exec, 0);
> +
> + account_group_exec_runtime(curr, delta_exec);
> + cgroup_account_cputime(curr, delta_exec);
> +
> + return delta_exec;
> +}
> +
> +/*
> + * Update the current task's runtime statistics.
> + */
> +static void update_curr(struct cfs_rq *cfs_rq)
> +{
> + struct sched_entity *curr = cfs_rq->curr;
> + s64 delta_exec;
> +
> + if (unlikely(!curr))
> + return;
> +
> + delta_exec = update_curr_se(rq_of(cfs_rq), curr);
> + if (unlikely(delta_exec <= 0))
> + return;
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
> update_min_vruntime(cfs_rq);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 00e0e5074115..efec4f3fef83 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1046,24 +1046,15 @@ static void update_curr_rt(struct rq *rq)
> {
> struct task_struct *curr = rq->curr;
> struct sched_rt_entity *rt_se = &curr->rt;
> - u64 delta_exec;
> - u64 now;
> + s64 delta_exec;
>
> if (curr->sched_class != &rt_sched_class)
> return;
>
> - now = rq_clock_task(rq);
> - delta_exec = now - curr->se.exec_start;
> - if (unlikely((s64)delta_exec <= 0))
> + delta_exec = update_curr_common(rq);
> + if (unlikely(delta_exec <= 0))
> return;
>
> - schedstat_set(curr->stats.exec_max,
> - max(curr->stats.exec_max, delta_exec));
> -
> - trace_sched_stat_runtime(curr, delta_exec, 0);
> -
> - update_current_exec_runtime(curr, now, delta_exec);
> -
> if (!rt_bandwidth_enabled())
> return;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 556496c77dc2..da0cec2fc63a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2176,6 +2176,8 @@ struct affinity_context {
> unsigned int flags;
> };
>
> +extern s64 update_curr_common(struct rq *rq);
> +
> struct sched_class {
>
> #ifdef CONFIG_UCLAMP_TASK
> @@ -3207,16 +3209,6 @@ extern int sched_dynamic_mode(const char *str);
> extern void sched_dynamic_update(int mode);
> #endif
>
> -static inline void update_current_exec_runtime(struct task_struct *curr,
> - u64 now, u64 delta_exec)
> -{
> - curr->se.sum_exec_runtime += delta_exec;
> - account_group_exec_runtime(curr, delta_exec);
> -
> - curr->se.exec_start = now;
> - cgroup_account_cputime(curr, delta_exec);
> -}
> -
> #ifdef CONFIG_SCHED_MM_CID
>
> #define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index 85590599b4d6..7595494ceb6d 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
>
> static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
> {
> - struct task_struct *curr = rq->curr;
> - u64 now, delta_exec;
> -
> - now = rq_clock_task(rq);
> - delta_exec = now - curr->se.exec_start;
> - if (unlikely((s64)delta_exec < 0))
> - delta_exec = 0;
> -
> - schedstat_set(curr->stats.exec_max,
> - max(curr->stats.exec_max, delta_exec));
> -
> - update_current_exec_runtime(curr, now, delta_exec);
> + update_curr_common(rq);
> }
>
> /*
> --
> 2.40.1
>

--


Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On 6/12/23 22:35, Joel Fernandes wrote:
> Hello Daniel!
>
> On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
> <[email protected]> wrote:
> [...]
>>> On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
>>> <[email protected]> wrote:
>>>>
>>>> From: Juri Lelli <[email protected]>
>>>>
>>>> Starting deadline server for lower priority classes right away when
>>>> first task is enqueued might break guarantees, as tasks belonging to
>>>> intermediate priority classes could be uselessly preempted. E.g., a well
>>>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
>>>> there are still CPU cycles available for NORMAL tasks to run, as they'll
>>>> be running inside the fair deadline server for some period of time.
>>>>
>>>> To prevent this issue, implement a starvation monitor mechanism that
>>>> starts the deadline server only if a (fair in this case) task hasn't
>>>> been scheduled for some interval of time after it has been enqueued.
>>>> Use pick/put functions to manage starvation monitor status.
>>>
>>> Me and Vineeth were discussing that another way of resolving this
>>> issue is to use a DL-server for RT as well, and then using a smaller
>>> deadline for RT. That way the RT is more likely to be selected due to
>>> its earlier deadline/period.
>>
>> It would not be that different from what we have now.
>>
>> One of the problems of throttling nowadays is that it accounts for a large window
>> of time, and any "imprecision" can cause the mechanism not to work as expected.
>>
>> For example, we work on a fully-isolated CPU scenario, where some very sporadic
>> workload can be placed on the isolated CPU because of per-cpu kernel activities,
>> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
>> instance, 20 us, to bound the interference.
>>
>> The current mechanism does not give this precision because the IRQ accounting
>> does not account for runtime for the rt throttling (which makes sense).
>
> I lost you here, "Runtime for the rt throttling" does not make much
> sense to me as a statement.

Both update_curr_rt() and update_curr_dl() use rq_clock_task() as their clock.
update_rq_clock_task() subtracts irq_delta from the task clock (inside
#ifdef CONFIG_IRQ_TIME_ACCOUNTING), and this clock is what the throttling
is based on.
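
A trimmed-down sketch of that path (paraphrasing update_rq_clock_task() in
kernel/sched/core.c; steal time, clamping and PELT details omitted):

--------------------- %< ------------------------
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
	s64 irq_delta = 0;

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;		/* the task clock never sees IRQ time */
#endif
	rq->clock_task += delta;	/* what rq_clock_task() returns */
}
--------------------- >% ------------------------

So time spent in IRQs is invisible to the RT/DL runtime accounting, which
is why the interference bound cannot be made as tight as the 20 us in the
example above.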

>> So the
>> RT throttling has the 20 us stolen from IRQs and keeps running. The same will
>> happen if we swap the current mechanism with a DL server for the RT.
>
> I read this about 10 times to learn that *maybe* you mean that IRQs
> stole time from the "Well behaved running time" of the RT class.

I also read emails many times :-)

> I am
> not seeing how that is related to creation of a DL-server for the RT
> class. Maybe describe your point a bit more clearly?

This patch is targeting a better way to avoid SCHED_OTHER starvation.
Having a DL server for the RT class does not help with that. We need to boost
SCHED_OTHER.

>>
>> Also, thinking about short deadlines to fake a fixed priority is... not starting
>> well. A fixed-priority higher instance is not a property of a deadline-based
>> scheduler, and Linux has a fixed-priority hierarchy: STOP -> DL -> RT -> CFS...
>> It is simple, and that is good.
>>
>> That is why it is better to boost CFS instead of throttling RT. By boosting
>> CFS, you do not need a server for RT, and we account for anything on top of CFS
>> for free (IRQ/DL/FIFO...).
>
> I did mention in my last email that it is not ideal. I just brought it
> up as an option. It might reduce the problem being seen and is better
> than not having it.

We have thought about it, but boosting SCHED_OTHER is the way to go.

>
>>> Another approach could be to implement the 0-laxity scheduling as a
>>> general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
>>> tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
>>> And then opt-in the feature for the CFS deadline server task.
>>
>> A 0-laxity scheduler is not as simple as it sounds, as the priority also depends
>> on the "C" (runtime, generally WCET), which is hard to find and embeds
>> pessimism. Also, having such a feature would make other mechanisms harder, as
>> well as debugging things. For example, proxy-execution or a more precise
>> schedulability test...
>
> I think you did not read my email properly, I was saying make the
> 0-laxity default-off and the opt-in for certain DL tasks. That may
> work perfectly well for a system like ChromeOS where likely we will
> use the DL server as the sole deadline task and opt-in for the
> 0-laxity. Then we don't need watchdog hacks at all and it all cleanly
> works within the DL class itself. There are the drawbacks of the
> pessimism/locking etc (I already knew that by the way as the obvious
> drawbacks of 0-laxity) but I am not immediately seeing how this
> CFS-watchdog with 0-laxity is any different from the DL-server itself
> having such a property. If you really have a concrete point on why
> that won't work, and if you could clarify that more clearly why a
> watchdog is better than it, that would be great.


I think you are overloading a term and a goal, and this makes your
thoughts ambiguous.

0-laxity is a point in time. What do you want to do at 0-laxity? Do you
want to run or start/replenish?

In the previous discussions, we mentioned a scheduler that uses it as a
way to prioritize the task (to run). That is overkill, as it would be
another scheduler. That is the first interpretation of 0-laxity in this
thread, mainly because of the word "scheduling" (I was not the only one
who read it that way).

In this patch, Juri's PoC shows that if we defer the DL server start
(the replenishment) to a point in the future, we can keep the fixed-priority
order of the schedulers, boosting SCHED_OTHER only when it actually starves,
without breaking EDF.

As mentioned in the cover letter, the idea is to use the 0-laxity point in
time to activate the DL server under EDF. That way, at the 0-laxity point,
the DL server is replenished with a full runtime and a deadline of
"now" + period. With that implemented...

In the base case:
the server is never activated.

In the busy-loop FIFO case:
a busy-loop FIFO task runs, starving OTHER for (period - runtime):
the SCHED_OTHER server will be started at 0-laxity and gets the
processor for its runtime immediately because there are no DL
tasks.

In the presence of DL & RT tasks:
DL and RT tasks starve OTHER for (period - runtime):
the SCHED_OTHER server will be started and scheduled under EDF, before or
after the other DL tasks, following EDF. Either way, it runs before
the CPU returns to SCHED_RT.

So, in this way, OTHER will be boosted over SCHED_RT without breaking
SCHED_DEADLINE tasks.
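
Sketched in (hypothetical) C, the intended timing is roughly the following;
this is an illustration only, not code from this series, the helper names
are made up, and the PoC in 6/6 simply uses a timer at HZ/2 instead of
computing the exact point:

--------------------- %< ------------------------
/*
 * Defer activation from the enqueue time of the first fair task to the
 * latest instant that still fits a full runtime in the period: the
 * 0-laxity point.
 */
static u64 fair_server_zero_laxity_point(struct sched_dl_entity *dl_se,
					 u64 enqueue_time)
{
	return enqueue_time + dl_se->dl_period - dl_se->dl_runtime;
}

/* At that point, replenish and let the server compete under plain EDF. */
static void fair_server_deferred_start(struct sched_dl_entity *dl_se, u64 now)
{
	dl_se->runtime  = dl_se->dl_runtime;		/* full budget  */
	dl_se->deadline = now + dl_se->dl_period;	/* EDF deadline */
}
--------------------- >% ------------------------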

In a 0-laxity scheduler, the server would run at the 0-laxity point, jumping in
front of DL tasks... and that would break EDF. It would be mixing two
schedulers in one. It is not required and likely not a good idea either.

In the cover, I mentioned improving this patch, so maybe watchdog is not
the appropriate name. 0-laxity server is not a good name either, because
it might lead people to think that the server will RUN at 0-laxity
while it will actually be replenished at 0-laxity. "Deferred server"
might be a better name.
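
To make the replenishment rule concrete, here is a toy user-space sketch
(made-up names and numbers, nothing from the actual patch): the server is
only activated at the deferred point, period - runtime after the fair task
got enqueued, and from there on it competes under plain EDF with a full
period ahead of it.

#include <stdio.h>

/* Toy model of a deferred fair server; units are arbitrary (e.g. us). */
struct toy_server {
	unsigned long long runtime;	/* budget per period */
	unsigned long long period;	/* replenishment period */
	unsigned long long deadline;	/* absolute deadline once activated */
};

/*
 * Deferred activation: replenish at the deferred ("0-laxity") point with a
 * deadline one full period ahead, instead of enqueueing the server as soon
 * as the first fair task wakes up.
 */
static void deferred_replenish(struct toy_server *s, unsigned long long now)
{
	s->deadline = now + s->period;	/* scheduled under EDF from here on */
}

int main(void)
{
	struct toy_server s = { .runtime = 50, .period = 1000 };
	unsigned long long enqueue = 0;	/* fair task wakes up and starts starving */
	unsigned long long defer_to = enqueue + s.period - s.runtime;

	deferred_replenish(&s, defer_to);
	printf("server activated at %llu, absolute deadline %llu\n",
	       defer_to, s.deadline);	/* 950 and 1950 */
	return 0;
}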

>> In a paper, the scheduler alone is the solution. In real life, the solution
>> for problems like locking is as fundamental as the scheduler. We need to keep
>> things simple to expand on these other topics as well.
>>
>> So, I do not think we need all the drawbacks of a mixed solution to just fix
>> the throttling problem, and EDF is more capable and explored for the
>> general case.
>
> Again, I was saying making it opt-in seems like a reasonable approach
> and just enabling such property for the DL server.

Can we have a "deferred DL server?" is that your question?

If so, I think so. But we have other points to look at first. DL servers are per-CPU,
so they break global scheduling. We need to think about an interface... and there are
other points that we need to understand before trying some other more
optimized cases.

>> With this patch's idea (and expansions), we can fix the throttling problem
>> without breaking other behaviors like scheduling order...
>
> I don't mind the watchdog patch as such, of course. I presented its
> mechanics at OSSNA and I know how it works, but I feel the DL server
> opting-in for 0-laxity would be cleaner while keeping such behavior as
> default-off for regular DL uses, that's my opinion -- but what else am
> I missing? Either way, no harm in discussing alternate approaches as
> well even if we are settling for the watchdog.
>
>>> Lastly, if the goal is to remove RT throttling code eventually, are
>>> you also planning to remove RT group scheduling as well? Are there
>>> users of RT group scheduling that might be impacted? On the other
>>> hand, RT throttling / group scheduling code can be left as it is
>>> (perhaps documenting it as deprecated) and the server stuff can be
>>> implemented via a CONFIG option.
>>
>> I think that the idea is to have the DL servers eventually replace the group
>> schedule. But I also believe that it is better to start by solving the
>> throttling and then moving to other constructions on top of the mechanism.
>
> Hmm. For throttling at the root level yes, but I am not seeing how
> you can replace the group scheduling code for existing users of RT
> Cgroups with this. The throttling in the RT group scheduling code is
> not exactly only about "not starving CFS", it is more related to
> letting RT groups run with certain bandwidth. So you cannot really
> delete it if there are real users of that code -- you'll have to
> migrate those users away first (to an alternate implementation like
> DL). If there are no users of RT group scheduling, that's lovely
> though. We don't use it in ChromeOS fwiw.

The idea behind the base patchset from Peter is solid and is the best way we
can start, and avoiding OTHER starvation is an easy first step.
Many people will benefit from it - like all the people who ping me
because of RT_RUNTIME_GREED (including Google in the past)... which is
the starting point of all this work.

Generalizing it requires time, but it will happen... and you know that Juri and I
care about Chromeos' use case, as I have been discussing this with you all and
even participating in Google/chrome focused meetings about sched...
at 6 pm our time ;-).

-- Daniel

>
> - Joel
>


2023-06-13 14:19:31

by Phil Auld

Subject: Re: [RFC PATCH V3 2/6] sched/deadline: Collect sched_dl_entity initialization

On Thu, Jun 08, 2023 at 05:58:14PM +0200 Daniel Bristot de Oliveira wrote:
> From: Peter Zijlstra <[email protected]>
>
> Create a single function that initializes a sched_dl_entity.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>

Reviewed-by: Phil Auld <[email protected]>

> ---
> kernel/sched/core.c | 5 +----
> kernel/sched/deadline.c | 22 +++++++++++++++-------
> kernel/sched/sched.h | 5 +----
> 3 files changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ac38225e6d09..e34b02cbe41f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4511,10 +4511,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> memset(&p->stats, 0, sizeof(p->stats));
> #endif
>
> - RB_CLEAR_NODE(&p->dl.rb_node);
> - init_dl_task_timer(&p->dl);
> - init_dl_inactive_task_timer(&p->dl);
> - __dl_clear_params(p);
> + init_dl_entity(&p->dl);
>
> INIT_LIST_HEAD(&p->rt.run_list);
> p->rt.timeout = 0;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 030e7c11607f..22e5e64812c9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -333,6 +333,8 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> __add_rq_bw(new_bw, &rq->dl);
> }
>
> +static void __dl_clear_params(struct sched_dl_entity *dl_se);
> +
> /*
> * The utilization of a task cannot be immediately removed from
> * the rq active utilization (running_bw) when the task blocks.
> @@ -432,7 +434,7 @@ static void task_non_contending(struct task_struct *p)
> raw_spin_lock(&dl_b->lock);
> __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
> raw_spin_unlock(&dl_b->lock);
> - __dl_clear_params(p);
> + __dl_clear_params(dl_se);
> }
>
> return;
> @@ -1205,7 +1207,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> -void init_dl_task_timer(struct sched_dl_entity *dl_se)
> +static void init_dl_task_timer(struct sched_dl_entity *dl_se)
> {
> struct hrtimer *timer = &dl_se->dl_timer;
>
> @@ -1415,7 +1417,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> raw_spin_lock(&dl_b->lock);
> __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
> raw_spin_unlock(&dl_b->lock);
> - __dl_clear_params(p);
> + __dl_clear_params(dl_se);
>
> goto unlock;
> }
> @@ -1431,7 +1433,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> -void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
> +static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
> {
> struct hrtimer *timer = &dl_se->inactive_timer;
>
> @@ -2974,10 +2976,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
> /*
> * This function clears the sched_dl_entity static params.
> */
> -void __dl_clear_params(struct task_struct *p)
> +static void __dl_clear_params(struct sched_dl_entity *dl_se)
> {
> - struct sched_dl_entity *dl_se = &p->dl;
> -
> dl_se->dl_runtime = 0;
> dl_se->dl_deadline = 0;
> dl_se->dl_period = 0;
> @@ -2995,6 +2995,14 @@ void __dl_clear_params(struct task_struct *p)
> #endif
> }
>
> +void init_dl_entity(struct sched_dl_entity *dl_se)
> +{
> + RB_CLEAR_NODE(&dl_se->rb_node);
> + init_dl_task_timer(dl_se);
> + init_dl_inactive_task_timer(dl_se);
> + __dl_clear_params(dl_se);
> +}
> +
> bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
> {
> struct sched_dl_entity *dl_se = &p->dl;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index da0cec2fc63a..fa6512070fa7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -284,8 +284,6 @@ struct rt_bandwidth {
> unsigned int rt_period_active;
> };
>
> -void __dl_clear_params(struct task_struct *p);
> -
> static inline int dl_bandwidth_enabled(void)
> {
> return sysctl_sched_rt_runtime >= 0;
> @@ -2390,8 +2388,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
> extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
> extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
>
> -extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
> -extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
> +extern void init_dl_entity(struct sched_dl_entity *dl_se);
>
> #define BW_SHIFT 20
> #define BW_UNIT (1 << BW_SHIFT)
> --
> 2.40.1
>

--


2023-06-13 15:59:59

by Joel Fernandes

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On Tue, Jun 13, 2023 at 9:43 AM Daniel Bristot de Oliveira
<[email protected]> wrote:
[...]
> > On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
> > <[email protected]> wrote:
> > [...]
> >>> On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
> >>> <[email protected]> wrote:
> >>>>
> >>>> From: Juri Lelli <[email protected]>
> >>>>
> >>>> Starting deadline server for lower priority classes right away when
> >>>> first task is enqueued might break guarantees, as tasks belonging to
> >>>> intermediate priority classes could be uselessly preempted. E.g., a well
> >>>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> >>>> there are still CPU cycles available for NORMAL tasks to run, as they'll
> >>>> be running inside the fair deadline server for some period of time.
> >>>>
> >>>> To prevent this issue, implement a starvation monitor mechanism that
> >>>> starts the deadline server only if a (fair in this case) task hasn't
> >>>> been scheduled for some interval of time after it has been enqueued.
> >>>> Use pick/put functions to manage starvation monitor status.
> >>>
> >>> Me and Vineeth were discussing that another way of resolving this
> >>> issue is to use a DL-server for RT as well, and then using a smaller
> >>> deadline for RT. That way the RT is more likely to be selected due to
> >>> its earlier deadline/period.
> >>
> >> It would not be that different from what we have now.
> >>
> >> One of the problems of throttling nowadays is that it accounts for a large window
> >> of time, and any "imprecision" can cause the mechanism not to work as expected.
> >>
> >> For example, we work on a fully-isolated CPU scenario, where some very sporadic
> >> workload can be placed on the isolated CPU because of per-cpu kernel activities,
> >> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
> >> instance, 20 us, to bound the interference.
> >>
> >> The current mechanism does not give this precision because the IRQ accounting
> >> does not account for runtime for the rt throttling (which makes sense).
> >
> > I lost you here, "Runtime for the rt throttling" does not make much
> > sense to me as a statement.
>
> Both update_curr_rt() and update_curr_dl() use rq_clock_task() as clock. update_rq_clock_task()
> reduces the irq_delta from task's clock (inside #ifdef CONFIG_IRQ_TIME_ACCOUNTING), and this
> clock is used to throttling.

That was a much better description. You're basically saying that since
the running time of the RT class is not accounted for in the clock, it
affects the throttling and unthrottling times. I recall running into a
similar issue on Android, where the RT time was showing up as
CFS load.
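
To make sure I am following the numbers, here is a toy model of that
accounting (made-up values, heavily simplified, not the kernel's code):
the clock that RT/DL runtime is charged against advances by the elapsed
time minus the IRQ time, so IRQ time never eats into the throttling budget.

#include <stdio.h>

int main(void)
{
	/* made-up numbers, in microseconds */
	unsigned long long delta = 1000;      /* wall-clock time elapsed         */
	unsigned long long irq_delta = 20;    /* time spent handling IRQs        */
	unsigned long long rq_clock = 0;      /* raw runqueue clock              */
	unsigned long long rq_clock_task = 0; /* clock used by update_curr_rt/dl */

	rq_clock      += delta;
	rq_clock_task += delta - irq_delta;   /* IRQ time is subtracted here */

	printf("rq_clock=%llu rq_clock_task=%llu (%llu us of IRQ time not charged)\n",
	       rq_clock, rq_clock_task, irq_delta);
	return 0;
}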

For RT throttling though, in our testing the time scales are large
enough (for our usecase) that such time stealing wasn't an issue. I am
going for something that is practical and that works; it does not have to
be perfect, since it has been several years now with these problems, and
leaving RT throttling broken is probably not a good thing.

> >> So the
> >> RT throttling has the 20 us stolen from IRQs and keeps running. The same will
> >> happen if we swap the current mechanism with a DL server for the RT.
> >
> > I read this about 10 times to learn that *maybe* you mean that IRQs
> > stole time from the "Well behaved running time" of the RT class.
>
> I also read emails many times :-)

Not a problem, ;-) Hopefully my emails are not too painful to read too
many times, but I appreciate that ;-). I was just mentioning that if
you're (or I am) not precise then it's hard to follow what you mean
that's all. Really the person writing the best emails I've seen is
Paul McKenney and that's kind of what I strive for. YMMV.

> > not seeing how that is related to creation of a DL-server for the RT
> > class. Maybe describe your point a bit more clearly?
>
> This patch is targeting a better way to avoid SCHED_OTHER starvation.
> Having a DL server for RT class does not help on that. We need to boost
> SCHED_OTHER.

Oh, actually the problem of boosting SCHED_OTHER is a bit orthogonal
to what I said. I was not saying not to boost SCHED_OTHER, I was
talking more about this particular patch and using a DL-based RT
server to mitigate that issue. The boosting is already handled in
previous patches with the DL-server.

> >> Also, thinking about short deadlines to fake a fixed priority is... not starting
> >> well. A fixed-priority higher instance is not a property of a deadline-based
> >> scheduler, and Linux has a fixed-priority hierarchy: STOP -> DL -> RT -> CFS...
> >> It is simple, and that is good.
> >>
> >> That is why it is better to boost CFS instead of throttling RT. By boosting
> >> CFS, you do not need a server for RT, and we account for anything on top of CFS
> >> for free (IRQ/DL/FIFO...).
> >
> > I did mention in my last email that it is not ideal. I just brought it
> > up as an option. It might reduce the problem being seen and is better
> > than not having it.
>
> We have thought about it, but boosting SCHED_OTHER is the way to go.

As mentioned above, there is no argument from my side on boosting CFS.
That I agree is the goal.

> >>> Another approach could be to implement the 0-laxity scheduling as a
> >>> general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
> >>> tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
> >>> And then opt-in the feature for the CFS deadline server task.
> >>
> >> A 0-laxity scheduler is not as simple as it sounds, as the priority also depends
> >> on the "C" (runtime, generally WCET), which is hard to find and embeds
> >> pessimism. Also, having such a feature would make other mechanisms harder, as
> >> well as debugging things. For example, proxy-execution or a more precise
> >> schedulability test...
> >
> > I think you did not read my email properly, I was saying make the
> > 0-laxity default-off and the opt-in for certain DL tasks. That may
> > work perfectly well for a system like ChromeOS where likely we will
> > use the DL server as the sole deadline task and opt-in for the
> > 0-laxity. Then we don't need watchdog hacks at all and it all cleanly
> > works within the DL class itself. There are the drawbacks of the
> > pessimism/locking etc (I already knew that by the way as the obvious
> > drawbacks of 0-laxity) but I am not immediately seeing how this
> > CFS-watchdog with 0-laxity is any different from the DL-server itself
> > having such a property. If you really have a concrete point on why
> > that won't work, and if you could clarify that more clearly why a
> > watchdog is better than it, that would be great.
>
>
> I think you are overloading a term and a goal, and this makes your
> thoughts ambiguous.

Well the term was mentioned in this series cover letter as well. ;-)

> 0-laxity is a point in time. What do you want to do at 0-laxity? Do you
> want to run or start/replenish?

You could do either. It could be run a bit earlier than 0-laxity. Or
you could just push it out to run in the next period if it has the
flag.

Here is the definition of 0-laxity as I understand it. Please correct
me, as I have not done a PhD in these things like you ;-)

Laxity, also known as slack time, is the amount of time that a task
can be delayed without causing a missed deadline. A 0-laxity task is
one that has no more time to spare and must be executed immediately to
avoid missing its deadline.

And here's where I need your input: If we take all admitted DL tasks
and run them at their respective 0-laxity times or slightly earlier,
then in theory, they should all meet their deadlines correctly?

Now, I don't really mean to run it exactly at 0-laxity. It could be
run a bit earlier to factor in sleep times, locking time, preemptions
etc. I mean with SCHED_DEADLINE, you really can't control those things
anyway -- so even if you run the DL task as soon as possible, you
might still miss your deadline. Or at 0-laxity, push its deadline out
to the next period and consider it "activated". I am just thinking out
loud.

That could break EDF the way it is now. However, it could be an
interesting idea that could be developed further. A DL
task does not have to run immediately to meet its deadline (it could
be run later as well) and that I know for a fact -- so why not add
this flexibility within SCHED_DEADLINE itself rather than inventing a
hack (and by hack I mean only this patch, not the other patches from
Peter or the idea of CFS boosting).

My impression is the other DL tasks (those without the flag) should still have
their promised bandwidth, so it is not mixing 2 schedulers. If this
series gets stalled for some reason, I would probably explore such an
idea in my own time later.

> In the previous discussions, we mentioned using a scheduler that uses
> it as a way to prioritize the task (to run). That is an overkill, as
> it would be another scheduler. That is the first interpretation for
> 0-laxity in this thread, mainly associated with the word "scheduling"
> (not only I read that way).
>
> In this patch, Juri's PoC shows that if we defer the DL server start
> (replenish) for a point in the future, we can keep the fixed-priority
> order of the schedulers, boosting SCHED_OTHER if it starves,
> without breaking EDF.
>
> If you see the cover, I mentioned using the 0-laxity point in time to
> activate the DL server under EDF. In that way, at the 0-laxity point,
> the DL server is replenished with runtime and deadline as
> "now" + period. With that implemented...
>
> In the base case:
> it is never activated.
>
> In the Busy-loop FIFO case:
> Busy-loop FIFO task run starving OTHER for (period - runtime):
> SCHED_OTHER server will be started at 0-laxity and get the
> processor for its runtime immediately because there are no DL
> tasks.
>
> In the presence of DL & RT tasks:
> DL and RT Starving OTHER for (period - runtime):
> SCHED_OTHER server will be started & scheduled under EDF, before or
> after the other DL tasks, following EDF. Anyways, before
> returning to the SCHED_RT.
>
> So, in this way, the OTHER will be boosted over SCHED_RT without breaking
> SCHED_DEADLINE tasks.
>
> In an 0-laxity scheduler, the server would run at 0-laxity, jumping in
> front of DL tasks... that would break EDF. It would be mixing two
> schedulers in one. It is not required and likely not a good idea either.

I am still missing why some tasks cannot be run at close to 0-laxity
time, and as opt-in. And if the DL-server is the sole task running,
then there's nothing else to break.

In fact, I am not immediately seeing how this can break SCHED_DEADLINE
if you allow at most 1-task to run at close to 0-laxity. The others
should still have their promised bandwidth so it is not mixing 2
schedulers, you just delay the DL-server till it's close to the
0-laxity. What am I missing?

Perhaps you mean the algorithm needs to push the new period/deadline
to a later time at the 0-laxity. That's also fine with me. But some
variation of the idea is possible AFAICS (again could be missing
something that mathematically makes this impossible).

> In the cover, I mentioned improving this patch, so maybe watchdog is not
> the appropriate name. 0-laxity server is not a good name either because
> it might induce people to think that the server will RUN at 0-laxity
> while it will actually be replenished at 0-laxity. Maybe a deferred server
> might be a better name.

Yeah maybe a deferred server could be a better name.

> >> In a paper, the scheduler alone is the solution. In real life, the solution
> >> for problems like locking is as fundamental as the scheduler. We need to keep
> >> things simple to expand on these other topics as well.
> >>
> >> So, I do not think we need all the drawbacks of a mixed solution to just fix
> >> the throttling problem, and EDF is more capable and explored for the
> >> general case.
> >
> > Again, I was saying making it opt-in seems like a reasonable approach
> > and just enabling such property for the DL server.
>
> Can we have a "deferred DL server?" is that your question?
>
> If so, I think so. But we have other points to look first. DL servers are per-cpu,
> so they break global. We need to think about an interface... and there are
> other points that we need to understand before trying some other more
> optimized cases.

You mean an interface for the DL server params? Those can be similar
to other sched knobs. And then boot with a CONFIG option and boost CFS
things by default if RT is active. Would that work or did you mean
something else by interface?

> >>> Lastly, if the goal is to remove RT throttling code eventually, are
> >>> you also planning to remove RT group scheduling as well? Are there
> >>> users of RT group scheduling that might be impacted? On the other
> >>> hand, RT throttling / group scheduling code can be left as it is
> >>> (perhaps documenting it as deprecated) and the server stuff can be
> >>> implemented via a CONFIG option.
> >>
> >> I think that the idea is to have the DL servers eventually replace the group
> >> schedule. But I also believe that it is better to start by solving the
> >> throttling and then moving to other constructions on top of the mechanism.
> >
> > Hmm. For throttling at the root level yes, but I am not seeing how
> > you can replace the group scheduling code for existing users of RT
> > Cgroups with this. The throttling in the RT group scheduling code is
> > not exactly only about "not starving CFS", it is more related to
> > letting RT groups run with certain bandwidth. So you cannot really
> > delete it if there are real users of that code -- you'll have to
> > migrate those users away first (to an alternate implementation like
> > DL). If there are no users of RT group scheduling, that's lovely
> > though. We don't use it in ChromeOS fwiw.
>
> The idea behind the base patchset from Peter is solid and is the best way we
> can start, and starting with avoiding OTHER starvation is an easy starting point.
> Many people will benefit from it - like all the people that ping me
> because of the RT_RUNTIME_GREED (including Google in the past)... which is
> the starting point of all this work.

Right, you know I am on the same page about that. I presented exactly
the same stuff at 2 conferences in 2 countries this year.

>
> Generalizing it requires time, but it will happen... and you know that Juri and I
> care about Chromeos' use case, as I have been discussing this with you all and
> even participating in Google/chrome focused meetings about sched...
> at 6 pm our time ;-).

I totally appreciate that, please don't get offended, we go a long way
back as friends ;-) And I really want to help, I am not trying to
prove I am an expert compared to you. I just want to get it *done* and
not have to wait for more years. You can see my 2 presentations this
year on this topic alone -- I travelled to 2 countries leaving my
family behind to discuss these.

Many thanks,

- Joel

2023-06-14 14:28:46

by Juri Lelli

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

Hey,

So Daniel provided the gory details :) .. but please let me highlight
one of his points below, which I think might clarify why we might
want to start with a special case (the improved patch 6 approach) before
possibly moving to more complex implementations.

On 14/06/23 15:45, Daniel Bristot de Oliveira wrote:

...

> By postponing the enqueue/replenishment of the DL server here, we are fixing the
> problem in a practical way, that works without breaking existing useful properties &
> use-cases.

In my understanding, if we simply postpone actual activation of the DL
server up to the point it really needs to boost/run for giving CFS tasks
some breath (the infamous 0-laxity :), we save RT tasks from useless
interruptions and still can keep EDF/CBS working w/o much changes.

It looks like a low-hanging fruit, a small improvement on what we have today
that doesn't prevent us from implementing more complex features (i.e., full
blown hierarchical scheduling, alternative schedulers) in the future if
the need arises.

Thanks!
Juri


by Daniel Bristot de Oliveira

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On 6/13/23 17:32, Joel Fernandes wrote:
> On Tue, Jun 13, 2023 at 9:43 AM Daniel Bristot de Oliveira
> <[email protected]> wrote:
> [...]
>>> On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
>>> <[email protected]> wrote:
>>> [...]
>>>>> On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> From: Juri Lelli <[email protected]>
>>>>>>
>>>>>> Starting deadline server for lower priority classes right away when
>>>>>> first task is enqueued might break guarantees, as tasks belonging to
>>>>>> intermediate priority classes could be uselessly preempted. E.g., a well
>>>>>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
>>>>>> there are still CPU cycles available for NORMAL tasks to run, as they'll
>>>>>> be running inside the fair deadline server for some period of time.
>>>>>>
>>>>>> To prevent this issue, implement a starvation monitor mechanism that
>>>>>> starts the deadline server only if a (fair in this case) task hasn't
>>>>>> been scheduled for some interval of time after it has been enqueued.
>>>>>> Use pick/put functions to manage starvation monitor status.
>>>>>
>>>>> Me and Vineeth were discussing that another way of resolving this
>>>>> issue is to use a DL-server for RT as well, and then using a smaller
>>>>> deadline for RT. That way the RT is more likely to be selected due to
>>>>> its earlier deadline/period.
>>>>
>>>> It would not be that different from what we have now.
>>>>
>>>> One of the problems of throttling nowadays is that it accounts for a large window
>>>> of time, and any "imprecision" can cause the mechanism not to work as expected.
>>>>
>>>> For example, we work on a fully-isolated CPU scenario, where some very sporadic
>>>> workload can be placed on the isolated CPU because of per-cpu kernel activities,
>>>> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
>>>> instance, 20 us, to bound the interference.
>>>>
>>>> The current mechanism does not give this precision because the IRQ accounting
>>>> does not account for runtime for the rt throttling (which makes sense).
>>>
>>> I lost you here, "Runtime for the rt throttling" does not make much
>>> sense to me as a statement.
>>
>> Both update_curr_rt() and update_curr_dl() use rq_clock_task() as clock. update_rq_clock_task()
>> reduces the irq_delta from task's clock (inside #ifdef CONFIG_IRQ_TIME_ACCOUNTING), and this
>> clock is used to throttling.
>
> That was a much better description. You're basically saying that since
> the running time of the RT class is not accounted for in the clock, it
> affects the throttling and unthrottling times. I actually ran into a
> similar issue on Android I recall, where the RT time was showing up as
> CFS load if I recall.
>
> For RT throttling though, in our testing the time scales are large
> enough (for our usecase) that such time stealing wasn't an issue. I am
> going for something that is practical and that works, does not have to
> be perfect since it has been several years now with these problems and
> leaving RT throttling broken is probably not a good thing.

By postponing the enqueue/replenishment of the DL server here, we are fixing the
problem in a practical way that works without breaking existing useful properties &
use-cases.

[...]
>>> not seeing how that is related to creation of a DL-server for the RT
>>> class. Maybe describe your point a bit more clearly?
>>
>> This patch is targeting a better way to avoid SCHED_OTHER starvation.
>> Having a DL server for RT class does not help on that. We need to boost
>> SCHED_OTHER.
>
> Oh, actually the problem of boosting SCHED_OTHER is a bit orthogonal
> to what I said. I was not saying not to boost SCHED_OTHER, I was
> talking more about this particular patch and using an DL-based RT
> server to mitigate that issue. The boosting is already handled in
> previous patches with the DL-server.

The boosting of the previous patch is breaking FIFO priority. This patch fixes that
single point without touching or breaking SCHED_DEADLINE/EDF properties. With
these things in place we do not mitigate, we fix the problem.

[...]
>>>>> Another approach could be to implement the 0-laxity scheduling as a
>>>>> general SCHED_DEADLINE feature, perhaps through a flag. And allow DL
>>>>> tasks to opt-in to 0-laxity scheduling unless there are idle cycles.
>>>>> And then opt-in the feature for the CFS deadline server task.
>>>>
>>>> A 0-laxity scheduler is not as simple as it sounds, as the priority also depends
>>>> on the "C" (runtime, generally WCET), which is hard to find and embeds
>>>> pessimism. Also, having such a feature would make other mechanisms harder, as
>>>> well as debugging things. For example, proxy-execution or a more precise
>>>> schedulability test...
>>>
>>> I think you did not read my email properly, I was saying make the
>>> 0-laxity default-off and the opt-in for certain DL tasks. That may
>>> work perfectly well for a system like ChromeOS where likely we will
>>> use the DL server as the sole deadline task and opt-in for the
>>> 0-laxity. Then we don't need watchdog hacks at all and it all cleanly
>>> works within the DL class itself. There are the drawbacks of the
>>> pessimism/locking etc (I already knew that by the way as the obvious
>>> drawbacks of 0-laxity) but I am not immediately seeing how this
>>> CFS-watchdog with 0-laxity is any different from the DL-server itself
>>> having such a property. If you really have a concrete point on why
>>> that won't work, and if you could clarify that more clearly why a
>>> watchdog is better than it, that would be great.
>>
>>
>> I think you are overloading a term and a goal, and this makes your
>> thoughts ambiguous.
>
> Well the term was mentioned in this series cover letter as well. ;-)

This term was also mentioned on my slides, and also in the other threads, but
not alone...

>
>> 0-laxity is a point in time. What do you want to do at 0-laxity? Do you
>> want to run or start/replenish?
>
> You could do either.

No, these are two different things depending on the current ready queue status.

> It could be run a bit earlier than 0-laxity. Or
> could you just push it out to run in the next period if it has the
> flag.

The improvement on top of this patch is to postpone the enqueue/replenish to the 0-laxity
time. By doing so, the task receives a new period (and so deadline) a period ahead.

>
> Here is the definition of 0-laxity as I understand it. Please correct
> me as I have not done a phD in these things like you ;-)
>
> Laxity, also known as slack time,

laxity = absolute deadline - activation time - runtime.

> is the amount of time that a task
> can be delayed without causing a missed deadline.

If you look at the task alone! e.g., if it does not face preemption!

> A 0-laxity task is
> one that has no more time to spare and must be executed immediately

and not be preempted!

> to
> avoid missing its deadline.

Again, you are looking at the task alone.

> And here's where I need your input: If we take all admitted DL tasks
> and run it at their respective 0-laxity times or slightly earlier,
> then in-theory, they should all meet their deadlines correctly?

For all tasksets, no!

There might be a taskset here or there that creates such a timeline under EDF,
but it is not always true that a task under EDF will wait until the 0-laxity
time to run. EDF is work-conserving.

For a work-conserving scheduler to build such a timeline, it needs to
have no idle time. Then, let's take the classical single-core assumptions
(these servers are per-CPU).

- Assuming a single-core/partitioned scheduler
- Assuming periodic tasks with deadline = period
- Assuming a task set with total utilization = 1
- Assuming all tasks are dispatched at the same time (critical instant)
- Assuming all tasks will run for their entire runtime, without blocking

(so... the conditions under which EDF is optimal... fully loaded...)

Even so, you will not have EDF always creating such a timeline, because the
task with the earliest deadline will run first; still, deadlines will be met.

For example:

t1 = 5/10
t2 = 5/10

Whichever task you pick first will run 5 units of time before the "0-laxity time".

If there is a scheduler that always builds a timeline like you want, it will not
schedule the taskset I mentioned... thus... it will schedule less than EDF.
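
To make that concrete with the same numbers, assuming both tasks are released
at t=0 with absolute deadlines at t=10: EDF picks one of them (say t1) and runs
it over [0, 5), then t2 over [5, 10); both meet their deadline of 10. t1
executed its whole 5 units before its own "0-laxity" instant (t=5), while t2
happened to start exactly there. A scheduler that insisted on starting every
task at its 0-laxity instant would need both tasks to run in the same [5, 10)
window on one CPU, which is impossible.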

>
> Now, I don't really mean to run it exactly at 0-laxity. It could be
> run a bit earlier to factor in sleep times, locking time, preemptions
> etc.

For you to force the delay... you would need to do things like... injecting idle?

Measuring these values is worse than measuring the runtime... you are piling
up complexity.

> I mean with SCHED_DEADLINE, you really can't control those things
> anyway -- so even if you run the DL task as soon as possible, you
> might still miss your deadline.

If the taskset is schedulable... EDF will meet the deadlines. The overheads are part of the problem
for any real implementation, and moving to a more complex scheduling design will
only make those things worse.

> Or at 0-laxity, push its deadline out
> to the next period and consider it "activated". I am just thinking out
> loud.
>
> That could break EDF the way it is now. However, it could be an
> interesting idea that could be developed into a better idea. A DL
> task does not have to run immediately to meet its deadline (it could
> be run later as well) and that I know for a fact -- so why not add
> this flexibility within SCHED_DEADLINE itself rather than inventing a
> hack (and by hack I mean only this patch, not the other patches from
> Peter or the idea of CFS boosting).
This "hack" is not inside the deadline.c because it is a PoC... in the next version
it will not in fair.c anymore, and it will part of deadline. A deferred server start.

I think you are focusing on the code here, not on the idea. We said we will improve
the code.

>
> My impression is the other (DL tasks without flag) should still have
> their promised bandwidth so it is not mixing 2 schedulers. If this
> series gets stalled for some reason, I would probably explore such an
> idea in my own time later.
>
>> In the previous discussions, we mentioned using a scheduler that uses
>> it as a way to prioritize the task (to run). That is an overkill, as
>> it would be another scheduler. That is the first interpretation for
>> 0-laxity in this thread, mainly associated with the word "scheduling"
>> (not only I read that way).
>>
>> In this patch, Juri's PoC shows that if we defer the DL server start
>> (replenish) for a point in the future, we can keep the fixed-priority
>> order of the schedulers, boosting SCHED_OTHER if it starves,
>> without breaking EDF.
>>
>> If you see the cover, I mentioned using the 0-laxity point in time to
>> activate the DL server under EDF. In that way, at the 0-laxity point,
>> the DL server is replenished with runtime and deadline as
>> "now" + period. With that implemented...
>>
>> In the base case:
>> it is never activated.
>>
>> In the Busy-loop FIFO case:
>> Busy-loop FIFO task run starving OTHER for (period - runtime):
>> SCHED_OTHER server will be started at 0-laxity and get the
>> processor for its runtime immediately because there are no DL
>> tasks.
>>
>> In the presence of DL & RT tasks:
>> DL and RT Starving OTHER for (period - runtime):
>> SCHED_OTHER server will be started & scheduled under EDF, before or
>> after the other DL tasks, following EDF. Anyways, before
>> returning to the SCHED_RT.
>>
>> So, in this way, the OTHER will be boosted over SCHED_RT without breaking
>> SCHED_DEADLINE tasks.
>>
>> In an 0-laxity scheduler, the server would run at 0-laxity, jumping in
>> front of DL tasks... that would break EDF. It would be mixing two
>> schedulers in one. It is not required and likely not a good idea either.
>
> I am still missing why some tasks cannot be run at close to 0-laxity
> time, and as opt-in. And if the DL-server is the sole task running,
> then there's nothing else to break.

If it is the sole (DL!) task running, this patch is equivalent to placing the task
to run at 0-laxity. I explained this in the previous email.

> In fact, I am not immediately seeing how this can break SCHED_DEADLINE
> if you allow at most 1-task to run at close to 0-laxity. The others
> should still have their promised bandwidth so it is not mixing 2
> schedulers, you just delay the DL-server till it's close to the
> 0-laxity. What am I missing?

If you want to make a 0-laxity-scheduler-like decision, that is, the task will
start running at 0-laxity, and once the task starts to run it will not be
preempted so it can finish before its deadline, you will break other EDF
tasks' deadlines in a schedulable task set.

It is not hard to create a timeline that breaks it...

server 50/1000 postponed by 950.
task 1/10.

At time 950 the server starts and cannot be preempted for 50 units.
At 951 the 1/10 task arrives... BOOM.

Expanding the idea in this patch, the server will be enqueued at 950,
with a deadline at 1950... so it will not break the EDF scheduler,
while still getting its time on top of SCHED_RT.
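
Spelling out the numbers, assuming the 1/10 task is released at t=951 (so its
absolute deadline is 961): with a "run at 0-laxity, non-preemptable" rule, the
server occupies [950, 1000), so the 1/10 task cannot even start before 1000,
39 units after its deadline has already passed, even though the set is
schedulable. With the deferred replenishment instead, at 950 the server gets
deadline 1950, so under EDF the 1/10 task (deadline 961) preempts it, meets
its deadline, and the server still receives its 50 units of runtime long
before 1950.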

SCHED_RT is the problem we are addressing here, because of its lack of
fairness w.r.t. SCHED_OTHER.

Moreover, DL tasks run when the RT rq is throttled. So, that 1/10 preemption
on top of the deferred server already happens with the current behavior. On second
thought, with this patch in place, SCHED_OTHER will also recover that time, which makes
even more sense.

(Note that here we are only scratching the surface of the consequences, as anything
utilization-based in our design would be broken as soon as the system starts, because
this would be always enabled.)

> Perhaps you mean the algorithm needs to push the new period/deadline
> to a later time at the 0-laxity.

This is the idea behind this patch ^^^^ This is the difference between running
and replenishing I mentioned in previous emails.

> That's also fine with me. But some
> variation of the idea is possible AFAICS (again could be missing
> something that mathematically makes this impossible).

You are looking at a fragment of the information... "0-laxity time," with
a single task in mind - not in the context of a scheduler.

>
>> In the cover, I mentioned improving this patch, so maybe watchdog is not
>> the appropriate name. 0-laxity server is not a good name either because
>> it might induce people to think that the server will RUN at 0-laxity
>> while it will actually be replenished at 0-laxity. Maybe a deferred server
>> might be a better name.
>
> Yeah maybe a deferred server could be a better name.
>
>>>> In a paper, the scheduler alone is the solution. In real life, the solution
>>>> for problems like locking is as fundamental as the scheduler. We need to keep
>>>> things simple to expand on these other topics as well.
>>>>
>>>> So, I do not think we need all the drawbacks of a mixed solution to just fix
>>>> the throttling problem, and EDF is more capable and explored for the
>>>> general case.
>>>
>>> Again, I was saying making it opt-in seems like a reasonable approach
>>> and just enabling such property for the DL server.
>>
>> Can we have a "deferred DL server?" is that your question?
>>
>> If so, I think so. But we have other points to look first. DL servers are per-cpu,
>> so they break global. We need to think about an interface... and there are
>> other points that we need to understand before trying some other more
>> optimized cases.
>
> You mean an interface for the DL server params? Those can be similar
> to other sched knobs. And then boot with a CONFIG option and boost CFS
> things by default if RT is active. Would that work or did you mean
> something else by interface?
>
>>>>> Lastly, if the goal is to remove RT throttling code eventually, are
>>>>> you also planning to remove RT group scheduling as well? Are there
>>>>> users of RT group scheduling that might be impacted? On the other
>>>>> hand, RT throttling / group scheduling code can be left as it is
>>>>> (perhaps documenting it as deprecated) and the server stuff can be
>>>>> implemented via a CONFIG option.
>>>>
>>>> I think that the idea is to have the DL servers eventually replace the group
>>>> schedule. But I also believe that it is better to start by solving the
>>>> throttling and then moving to other constructions on top of the mechanism.
>>>
>>> Hmm. For throttling at the root level yes, but I am not seeing how
>>> you can replace the group scheduling code for existing users of RT
>>> Cgroups with this. The throttling in the RT group scheduling code is
>>> not exactly only about "not starving CFS", it is more related to
>>> letting RT groups run with certain bandwidth. So you cannot really
>>> delete it if there are real users of that code -- you'll have to
>>> migrate those users away first (to an alternate implementation like
>>> DL). If there are no users of RT group scheduling, that's lovely
>>> though. We don't use it in ChromeOS fwiw.
>>
>> The idea behind the base patchset from Peter is solid and is the best way we
>> can start, and starting with avoiding OTHER starvation is an easy starting point.
>> Many people will benefit from it - like all the people that ping me
>> because of the RT_RUNTIME_GREED (including Google in the past)... which is
>> the starting point of all this work.
>
> Right, you know I am on the same page about that. I presented exactly
> the same stuff at 2 conferences in 2 countries this year.
>
>>
>> Generalizing it requires time, but it will happen... and you know that Juri and I
>> care about Chromeos' use case, as I have been discussing this with you all and
>> even participating in Google/chrome focused meetings about sched...
>> at 6 pm our time ;-).
>
> I totally appreciate that, please don't get offended, we go a long way
> back as friends ;-)

I did not get offended, and nothing changes in our friendship :-). I am just
clarifying things we already knew - even before this rebase... We are aware of
Chrome's needs, as well as general RT Linux needs.

The basic idea behind this patch works for all cases and is unlocking this
situation. The code will be improved in the next version.

Thanks
-- Daniel

> And I really want to help, I am not trying to
> prove I am an expert compared to you. I just want to get it *done* and
> not have to wait for more years. You can see my 2 presentations this
> year on this topic alone -- I travelled to 2 countries leaving my
> family behind to discuss these.
>
> Many thanks,
>
> - Joel


2023-06-14 18:45:04

by Joel Fernandes

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

Before I reply, I just want to mention: I am OK with this patch 6/6 as
such, and the ideas I mentioned earlier are just alternatives for
patch 6/6 -- just for discussion's sake. The devil is in the details, as
Daniel and Juri pointed out.

With that said...

On Wed, Jun 14, 2023 at 9:45 AM Daniel Bristot de Oliveira
<[email protected]> wrote:
>
> On 6/13/23 17:32, Joel Fernandes wrote:
> > On Tue, Jun 13, 2023 at 9:43 AM Daniel Bristot de Oliveira
> > <[email protected]> wrote:
> > [...]
> >>> On Mon, Jun 12, 2023 at 1:21 PM Daniel Bristot de Oliveira
> >>> <[email protected]> wrote:
> >>> [...]
> >>>>> On Thu, Jun 8, 2023 at 11:58 AM Daniel Bristot de Oliveira
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> From: Juri Lelli <[email protected]>
> >>>>>>
> >>>>>> Starting deadline server for lower priority classes right away when
> >>>>>> first task is enqueued might break guarantees, as tasks belonging to
> >>>>>> intermediate priority classes could be uselessly preempted. E.g., a well
> >>>>>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> >>>>>> there are still CPU cycles available for NORMAL tasks to run, as they'll
> >>>>>> be running inside the fair deadline server for some period of time.
> >>>>>>
> >>>>>> To prevent this issue, implement a starvation monitor mechanism that
> >>>>>> starts the deadline server only if a (fair in this case) task hasn't
> >>>>>> been scheduled for some interval of time after it has been enqueued.
> >>>>>> Use pick/put functions to manage starvation monitor status.
> >>>>>
> >>>>> Me and Vineeth were discussing that another way of resolving this
> >>>>> issue is to use a DL-server for RT as well, and then using a smaller
> >>>>> deadline for RT. That way the RT is more likely to be selected due to
> >>>>> its earlier deadline/period.
> >>>>
> >>>> It would not be that different from what we have now.
> >>>>
> >>>> One of the problems of throttling nowadays is that it accounts for a large window
> >>>> of time, and any "imprecision" can cause the mechanism not to work as expected.
> >>>>
> >>>> For example, we work on a fully-isolated CPU scenario, where some very sporadic
> >>>> workload can be placed on the isolated CPU because of per-cpu kernel activities,
> >>>> e.g., kworkers... We need to let them run, but for a minimal amount of time, for
> >>>> instance, 20 us, to bound the interference.
> >>>>
> >>>> The current mechanism does not give this precision because the IRQ accounting
> >>>> does not account for runtime for the rt throttling (which makes sense).
> >>>
> >>> I lost you here, "Runtime for the rt throttling" does not make much
> >>> sense to me as a statement.
> >>
> >> Both update_curr_rt() and update_curr_dl() use rq_clock_task() as clock. update_rq_clock_task()
> >> reduces the irq_delta from task's clock (inside #ifdef CONFIG_IRQ_TIME_ACCOUNTING), and this
> >> clock is used to throttling.
> >
> > That was a much better description. You're basically saying that since
> > the running time of the RT class is not accounted for in the clock, it
> > affects the throttling and unthrottling times. I actually ran into a
> > similar issue on Android I recall, where the RT time was showing up as
> > CFS load if I recall.
> >
> > For RT throttling though, in our testing the time scales are large
> > enough (for our usecase) that such time stealing wasn't an issue. I am
> > going for something that is practical and that works, does not have to
> > be perfect since it has been several years now with these problems and
> > leaving RT throttling broken is probably not a good thing.
>
> By postponing the enqueue/replenishment of the DL server here, we are fixing the
> problem in a practical way, that works without breaking existing useful properties &
> use-cases.

Ok, that sounds good to me.

> [...]
> >>> not seeing how that is related to creation of a DL-server for the RT
> >>> class. Maybe describe your point a bit more clearly?
> >>
> >> This patch is targeting a better way to avoid SCHED_OTHER starvation.
> >> Having a DL server for RT class does not help on that. We need to boost
> >> SCHED_OTHER.
> >
> > Oh, actually the problem of boosting SCHED_OTHER is a bit orthogonal
> > to what I said. I was not saying not to boost SCHED_OTHER, I was
> > talking more about this particular patch and using an DL-based RT
> > server to mitigate that issue. The boosting is already handled in
> > previous patches with the DL-server.
>
> The boosting of the previous patch is breaking FIFO priority. This patch fixes that
> single point without touching or breaking SCHED_DEADLINE/EDF properties. With
> these things in place we do not mitigate, we fix the problem.

Sure, that's fine with me.

[..]
> > could you just push it out to run in the next period if it has the
> > flag.
>
> The improvements on top of this patch is to postpone the enqueue/replenish to the 0-laxity
> time. By doing so, the task receives a new period (and so deadline) a period ahead.

Yes, I understand. I was hoping for us to do that from within the DL
class itself as a DL task property, but perhaps that's my wishful
thinking...

> >
> > Here is the definition of 0-laxity as I understand it. Please correct
> > me as I have not done a phD in these things like you ;-)
> >
> > Laxity, also known as slack time,
>
> laxity = absolute deadline - activation time - runtime.

Correct.

> is the amount of time that a task
> > can be delayed without causing a missed deadline.
>
> If you look at the task alone! e.g, if it does not face preemption!
> A 0-laxity task is
> > one that has no more time to spare and must be executed immediately
>
> and not be preempted!
>
> to
> > avoid missing its deadline.
>
> Again, you are looking at the task alone.

Sure, that's why I mentioned running it before 0-laxity itself to
account for that -- not just preemption but also other issues as well,
like I/O, locking, sleeping, etc. I don't have a formula right now and
I need to think more about it, but that was the idea at a high level.
Again, it is all rainbows and ponies and the devil is in the details, so
just consider it an idea/suggestion and not something we must
urgently do. I may find time later to go over the papers, such as those
related to laxity-based scheduling.

> > And here's where I need your input: If we take all admitted DL tasks
> > and run it at their respective 0-laxity times or slightly earlier,
> > then in-theory, they should all meet their deadlines correctly?
> For all tasksets, no!
>
> There might be a taskset here o there that creates such timeline under EDF,
> but it is not always true that a task under EDF will wait until the 0-laxity
> time for them to run. EDF is working conserving.
>
> For a working conserving scheduler to build such a timeline, it needs to
> have no idle time. Then, lets get the classical single core assumptions
> (these servers are per-cpu).
>
> - Assuming single-core/partitioned scheduler.
> - Assuming periodic tasks with deadline = period
> - Assuming a task set with the sum of each task utilization = 1
> - Assuming all tasks are dispatched at the same time (critical instant)
> - Assuming all tasks will run for their entire runtime, without blocking
>
> (so... the thing that EDF is optimum... fully loaded...)
>
> Even so you will not have EDF always creating such timeline because the
> task with the earliest deadline will run first, still deadlines will be met.
>
> For example:
>
> t1 = 5/10
> t2 = 5/10
>
> Each task you pick first will run 5 unities of time before the "0-laxity time".
>
> If there is a scheduler that always build a timeline like you want, it will not
> schedule the taskset I mentioned... thus.. it will schedule less than EDF.

Yes, I think you are kind of saying the same thing that I mentioned,
which is to run it before the 0-laxity time (perhaps depending on the
runtime of the existing admitted tasks).

> > Perhaps you mean the algorithm needs to push the new period/deadline
> > to a later time at the 0-laxity.
>
> This is the idea behind this patch ^^^^ This is the different between running
> and replenishing I mention on previous emails.
>
> That's also fine with me. But some
> > variation of the idea is possible AFAICS (again could be missing
> > something that mathematically makes this impossible).
>
> you are looking for a fragment of the information... "0-laxity time," with
> a single task in mind - not in the context of a scheduler.

I should have probably not used the word 0-laxity, because I did
mention running *before* 0-laxity arrives to account for delays,
that's what I was saying. So like run it before you get to 0 laxity to
account for delays, like that. But hey it is just an idea and sorry if
it sounded like noise. It goes back to the premise I mentioned, which
is: a DL task does not need to run immediately and preempt RT, it can run
later and still meet the deadline. When to run it is a different
question, but if there were a crystal ball, a DL task could still meet its
deadline by running at a later time.

> >> In the cover, I mentioned improving this patch, so maybe watchdog is not
> >> the appropriate name. 0-laxity server is not a good name either because
> >> it might induce people to think that the server will RUN at 0-laxity
> >> while it will actually be replenished at 0-laxity. Maybe a deferred server
> >> might be a better name.
> >
> > Yeah maybe a deferred server could be a better name.
> >
> >>>> In a paper, the scheduler alone is the solution. In real life, the solution
> >>>> for problems like locking is as fundamental as the scheduler. We need to keep
> >>>> things simple to expand on these other topics as well.
> >>>>
> >>>> So, I do not think we need all the drawbacks of a mixed solution to just fix
> >>>> the throttling problem, and EDF is more capable and explored for the
> >>>> general case.
> >>>
> >>> Again, I was saying making it opt-in seems like a reasonable approach
> >>> and just enabling such property for the DL server.
> >>
> >> Can we have a "deferred DL server?" is that your question?
> >>
> >> If so, I think so. But we have other points to look first. DL servers are per-cpu,
> >> so they break global. We need to think about an interface... and there are
> >> other points that we need to understand before trying some other more
> >> optimized cases.
> >
> > You mean an interface for the DL server params? Those can be similar
> > to other sched knobs. And then boot with a CONFIG option and boost CFS
> > things by default if RT is active. Would that work or did you mean
> > something else by interface?

About the interface, perhaps you are referring to using this mechanism
to replace the stalld daemon? That's what I remember from our
conversations in OSPM. In general, I think it is a great idea to
automatically detect "important" starving CFS tasks and boost them
(whether they are starving because of RT, or some other reason).

> >> Generalizing it requires time, but it will happen... and you know that Juri and I
> >> care about Chromeos' use case, as I have been discussing this with you all and
> >> even participating in Google/chrome focused meetings about sched...
> >> at 6 pm our time ;-).
> >
> > I totally appreciate that, please don't get offended, we go a long way
> > back as friends ;-)
>
> I did not get offended, and nothing changes on our friendship :-). I am just
> clarifying you things we know - even before this rebase... We are aware of
> Chrome needs, as well as general RT Linux needs.
>
> The basic idea behind this patch works for all cases and is unlocking this
> situation. The code will be improved in the next version.

Thanks for the discussions! I am looking forward to helping in any way
I can on the series, I am going to be testing it for ChromeOS.

- Joel

2023-06-14 19:00:36

by Joel Fernandes

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On Wed, Jun 14, 2023 at 10:15 AM Juri Lelli <[email protected]> wrote:
>
> Hey,
[..]
> > By postponing the enqueue/replenishment of the DL server here, we are fixing the
> > problem in a practical way, that works without breaking existing useful properties &
> > use-cases.
>
> In my understanding, if we simply postpone actual activation of the DL
> server up to the point it really needs to boost/run for giving CFS tasks
> some breath (the infamous 0-laxity :), we save RT tasks from useless
> interruptions and still can keep EDF/CBS working w/o much changes.
>
> It looks like a low-hanging fruit, a small improvement on what we have today
> that doesn't prevent us from implementing more complex features (i.e., full
> blown hierarchical scheduling, alternative schedulers) in the future if
> the need arises.

Agreed, thanks!

- Joel

2023-06-16 12:19:52

by Peter Zijlstra

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On Thu, Jun 08, 2023 at 05:58:18PM +0200, Daniel Bristot de Oliveira wrote:
> From: Juri Lelli <[email protected]>
>
> Starting deadline server for lower priority classes right away when
> first task is enqueued might break guarantees, as tasks belonging to
> intermediate priority classes could be uselessly preempted. E.g., a well
> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> there are still CPU cycles available for NORMAL tasks to run, as they'll
> be running inside the fair deadline server for some period of time.
>
> To prevent this issue, implement a starvation monitor mechanism that

Why should this be prevented? FIFO can be preempted by a higher
priority FIFO/RR, or in this case by DEADLINE, and always by stop_task.

Just accept this.

Anything that's built around FIFO-99 not getting preempted is just plain
broken. Don't try and pander to broken.

2023-06-16 12:36:05

by Peter Zijlstra

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On Mon, Jun 12, 2023 at 04:45:54PM +0200, Daniel Bristot de Oliveira wrote:
> On 6/12/23 03:57, Joel Fernandes wrote:

> > Lastly, if the goal is to remove RT throttling code eventually, are
> > you also planning to remove RT group scheduling as well? Are there
> > users of RT group scheduling that might be impacted? On the other
> > hand, RT throttling / group scheduling code can be left as it is
> > (perhaps documenting it as deprecated) and the server stuff can be
> > implemented via a CONFIG option.
>
> I think that the idea is to have the DL servers eventually replace group
> scheduling. But I also believe that it is better to start by solving the
> throttling and then move to other constructions on top of the mechanism.

The big problem with the rt group scheduling mess is affinities. Other
than that, yes absolutely, that crap needs to go.

2023-06-16 12:38:33

by Peter Zijlstra

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On Tue, Jun 13, 2023 at 03:41:30PM +0200, Daniel Bristot de Oliveira wrote:

> In an 0-laxity scheduler, the server would run at 0-laxity, jumping in
> front of DL tasks... that would break EDF. It would be mixing two
> schedulers in one. It is not required and likely not a good idea either.

I did consider a hybrid 0-laxity and EDF scheduler for mixed
criticality, as have others like Ted Baker IIRC. IIRC it can be done
using an augmented tree, but none of that solves the problems 0-laxity
has (like over preemption and the general problem of playing chicken by
doing things at the *VERY* last possible moment).

I think I did a talk at OSPERT on this at some point many years ago.
Luckily some bright fellow had this semi-partitioned stuff that would
make life much simpler :-)

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On 6/16/23 14:05, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 03:41:30PM +0200, Daniel Bristot de Oliveira wrote:
>
>> In an 0-laxity scheduler, the server would run at 0-laxity, jumping in
>> front of DL tasks... that would break EDF. It would be mixing two
>> schedulers in one. It is not required and likely not a good idea either.
>
> I did consider a hybrid 0-laxity and EDF scheduler for mixed
> criticality, as have others like Ted Baker IIRC. IIRC it can be done
> using an augmented tree, but none of that solves the problems 0-laxity
> has (like over preemption and the general problem of playing chicken by
> doing things at the *VERY* last possible moment).

There are papers here and there about it, but it is far from being the most explored
way to do mixed criticality because of these side effects. It is more common to have
virtual deadlines for high and low criticalities while using EDF.

Having EDF and being work-conserving makes our life easier on other fronts where
we are way behind, e.g., deadline inheritance.

> I think I did a talk at OSPERT on this at some point many years ago.
> Luckily some bright fellow had this semi-partitioned stuff that would
> make life much simpler :-)

It will. We can have two partitions, one for high and one for low. The general
case runs on the low power CPU, and if a job does not manage to finish there, it
continues on the high power one. Still, always using EDF. Having virtual deadlines
is part of semi-partitioning. Anyway, EDF schedules more, and it is simpler... it is hard to beat.

This patch set is a warm up for that...

-- Daniel

2023-06-16 13:14:52

by Peter Zijlstra

Subject: Re: [RFC PATCH V3 5/6] sched/fair: Add trivial fair server

On Thu, Jun 08, 2023 at 05:58:17PM +0200, Daniel Bristot de Oliveira wrote:
> +void fair_server_init(struct rq *rq)
> +{
> + struct sched_dl_entity *dl_se = &rq->fair_server;
> +
> + init_dl_entity(dl_se);
> +
> + dl_se->dl_runtime = TICK_NSEC;
> + dl_se->dl_deadline = 20 * TICK_NSEC;
> + dl_se->dl_period = 20 * TICK_NSEC;
> +
> + dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
> +}

These here numbers should obviously become a tunable somewhere... :-)
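
For a sense of scale, assuming HZ=1000 (so TICK_NSEC = 1,000,000 ns), these
defaults hand the fair server 1 ms of runtime every 20 ms, i.e. a 5% bandwidth
reservation for SCHED_OTHER, mirroring the 5% left over by today's default RT
throttling values (sched_rt_runtime_us=950000 out of sched_rt_period_us=1000000).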

Subject: Re: [RFC PATCH V3 5/6] sched/fair: Add trivial fair server

On 6/16/23 14:59, Peter Zijlstra wrote:
> On Thu, Jun 08, 2023 at 05:58:17PM +0200, Daniel Bristot de Oliveira wrote:
>> +void fair_server_init(struct rq *rq)
>> +{
>> + struct sched_dl_entity *dl_se = &rq->fair_server;
>> +
>> + init_dl_entity(dl_se);
>> +
>> + dl_se->dl_runtime = TICK_NSEC;
>> + dl_se->dl_deadline = 20 * TICK_NSEC;
>> + dl_se->dl_period = 20 * TICK_NSEC;
>> +
>> + dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
>> +}
>
> These here numbers should obviously become a tunable somewhere... :-)

From sched_rt_runtime and sched_rt_period, no?

-- Daniel

2023-06-16 13:39:10

by Peter Zijlstra

Subject: Re: [RFC PATCH V3 5/6] sched/fair: Add trivial fair server

On Fri, Jun 16, 2023 at 03:00:57PM +0200, Daniel Bristot de Oliveira wrote:
> On 6/16/23 14:59, Peter Zijlstra wrote:
> > On Thu, Jun 08, 2023 at 05:58:17PM +0200, Daniel Bristot de Oliveira wrote:
> >> +void fair_server_init(struct rq *rq)
> >> +{
> >> + struct sched_dl_entity *dl_se = &rq->fair_server;
> >> +
> >> + init_dl_entity(dl_se);
> >> +
> >> + dl_se->dl_runtime = TICK_NSEC;
> >> + dl_se->dl_deadline = 20 * TICK_NSEC;
> >> + dl_se->dl_period = 20 * TICK_NSEC;
> >> +
> >> + dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
> >> +}
> >
> > These here numbers should obviously become a tunable somewhere... :-)
>
> From sched_rt_runtime and sched_rt_period, no?

Well, probably new names. IIRC those are also limits on the whole
rt-cgroup mess, so we can't trivially change them without also ripping all
that out, and that's a little bit more work.

Ripping out the basic throttle is much simpler than replacing all that
and should be done earlier.


Subject: Re: [RFC PATCH V3 5/6] sched/fair: Add trivial fair server

On 6/16/23 15:12, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 03:00:57PM +0200, Daniel Bristot de Oliveira wrote:
>> On 6/16/23 14:59, Peter Zijlstra wrote:
>>> On Thu, Jun 08, 2023 at 05:58:17PM +0200, Daniel Bristot de Oliveira wrote:
>>>> +void fair_server_init(struct rq *rq)
>>>> +{
>>>> + struct sched_dl_entity *dl_se = &rq->fair_server;
>>>> +
>>>> + init_dl_entity(dl_se);
>>>> +
>>>> + dl_se->dl_runtime = TICK_NSEC;
>>>> + dl_se->dl_deadline = 20 * TICK_NSEC;
>>>> + dl_se->dl_period = 20 * TICK_NSEC;
>>>> +
>>>> + dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
>>>> +}
>>>
>>> These here numbers should obviously become a tunable somewhere... :-)
>>
>> From sched_rt_runtime and sched_rt_period, no?
>
> Well, probably new names. IIRC those are also limits on the whole
> rt-cgroup mess, so e can't trivially change it without also ripping all
> that out, and that's a little bit more work.
>
> Ripping out the basic throttle is much simpler than replacing all that
> and should be done earlier.
>

Life is easier then. Should they be sysctls?

Like:
kernel.sched_other_min_runtime_us ?
kernel.sched_other_min_period_us ?

I bet you can come up with better names.

-- Daniel
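
To make the direction above concrete, here is a minimal sketch of how the
fair-server parameters could be fed from tunables instead of the constants
hard-coded in the hunk quoted above. The variable names (borrowing Daniel's
suggested naming) and their wiring are only illustrative assumptions, not
what the series implements:

/* Hypothetical knobs, in microseconds, defaulting to the RFC's constants. */
static unsigned int sysctl_sched_fair_server_runtime_us =
	TICK_NSEC / NSEC_PER_USEC;		/* 1 tick  */
static unsigned int sysctl_sched_fair_server_period_us =
	20 * TICK_NSEC / NSEC_PER_USEC;		/* 20 ticks */

void fair_server_init(struct rq *rq)
{
	struct sched_dl_entity *dl_se = &rq->fair_server;

	init_dl_entity(dl_se);

	/* Convert the us knobs back to ns for the dl entity parameters. */
	dl_se->dl_runtime  = (u64)sysctl_sched_fair_server_runtime_us * NSEC_PER_USEC;
	dl_se->dl_deadline = (u64)sysctl_sched_fair_server_period_us * NSEC_PER_USEC;
	dl_se->dl_period   = (u64)sysctl_sched_fair_server_period_us * NSEC_PER_USEC;

	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
}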




Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On 6/16/23 13:51, Peter Zijlstra wrote:
> On Thu, Jun 08, 2023 at 05:58:18PM +0200, Daniel Bristot de Oliveira wrote:
>> From: Juri Lelli <[email protected]>
>>
>> Starting deadline server for lower priority classes right away when
>> first task is enqueued might break guarantees, as tasks belonging to
>> intermediate priority classes could be uselessly preempted. E.g., a well
>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
>> there are still CPU cycles available for NORMAL tasks to run, as they'll
>> be running inside the fair deadline server for some period of time.
>>
>> To prevent this issue, implement a starvation monitor mechanism that
>
> Why should this be prevented? FIFO can be preempted by a higher
> priority FIFO/RR, or in this case by DEADLINE, and always by stop_task.

Yeah, but in the context of "real-time throttling" the user is not asking to run things
as SCHED_DEADLINE, they are just running FIFO, expecting not to have SCHED_OTHER
jumping in front of them.

If we do not "defer" the starting of the server to a point at which starvation
is imminent, we can create some anomalies... for example:

An interrupt wakes up a FIFO and a CFS task; which will run first? The CFS!

A FIFO task schedules deferred work on a kworker... the kworker will preempt the
FIFO task.

It is counter-intuitive for the majority of people out there... It will
be counter-intuitive for most PREEMPT_RT users :-/. From one day to the next
their tests using cyclictest will... break...

> Just accept this.
>
> Anything that's built around FIFO-99 not getting preempted is just plain
> broken. Don't try and pander to broken.
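
To make the "defer until starvation is imminent" idea concrete, here is a
simplified sketch of a per-rq starvation check in the spirit of patch 6/6.
The HZ/2 threshold comes from the cover letter; the fields fair_starve_ts and
fair_server_started, the dl_server_start() call from the series' server API,
and the call site (e.g. the scheduler tick) are assumptions for illustration,
not the patch itself:

#define FAIR_STARVE_THRESH	(HZ / 2)	/* arbitrarily chosen, as in the PoC */

static void fair_starvation_check(struct rq *rq)
{
	/* No fair work queued, or the server is already running it: disarm. */
	if (!rq->cfs.h_nr_running || rq->fair_server_started) {
		rq->fair_starve_ts = 0;
		return;
	}

	/* Fair tasks are getting CPU time on their own: no starvation, disarm. */
	if (rq->curr->sched_class == &fair_sched_class) {
		rq->fair_starve_ts = 0;
		return;
	}

	/* Fair tasks are waiting behind RT/DL: arm the monitor... */
	if (!rq->fair_starve_ts) {
		rq->fair_starve_ts = jiffies;
		return;
	}

	/* ...and start the DL server only once they have waited long enough. */
	if (time_after(jiffies, rq->fair_starve_ts + FAIR_STARVE_THRESH)) {
		dl_server_start(&rq->fair_server);
		rq->fair_server_started = true;
	}
}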

2023-06-16 14:38:41

by Valentin Schneider

Subject: Re: [RFC PATCH V3 2/6] sched/deadline: Collect sched_dl_entity initialization

On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
> From: Peter Zijlstra <[email protected]>
>
> Create a single function that initializes a sched_dl_entity.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>


2023-06-16 14:58:18

by Valentin Schneider

Subject: Re: [RFC PATCH V3 3/6] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity

On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
> From: Peter Zijlstra <[email protected]>
>
> In preparation of introducing !task sched_dl_entity; move the
> bandwidth accounting into {en.de}queue_dl_entity().
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>

Reviewed-by: Valentin Schneider <[email protected]>

Now onto the server fun :-)


2023-06-16 15:01:01

by Valentin Schneider

Subject: Re: [RFC PATCH V3 1/6] sched: Unify runtime accounting across classes

On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
> From: Peter Zijlstra <[email protected]>
>
> All classes use sched_entity::exec_start to track runtime and have
> copies of the exact same code around to compute runtime.
>
> Collapse all that.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>

This one's been around for a while, John also carries it for PE [1] because it
makes things simpler. We should just get it in :-)

The three-layered if (unlikely(delta_exec <= 0)) is unfortunate, but I think we
can live with it. Tiny factorization appended below, but regardless:

Reviewed-by: Valentin Schneider <[email protected]>

[1]: http://lore.kernel.org/r/[email protected]

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7fcf558dc4bc..e52e609724482 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,6 +914,14 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
return delta_exec;
}

+static inline void
+account_curr_runtime(struct task_struct *curr, s64 runtime, u64 vruntime)
+{
+ trace_sched_stat_runtime(curr, runtime, vruntime);
+ account_group_exec_runtime(curr, runtime);
+ cgroup_account_cputime(curr, runtime);
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -926,10 +934,7 @@ s64 update_curr_common(struct rq *rq)
if (unlikely(delta_exec <= 0))
return delta_exec;

- trace_sched_stat_runtime(curr, delta_exec, 0);
-
- account_group_exec_runtime(curr, delta_exec);
- cgroup_account_cputime(curr, delta_exec);
+ account_curr_runtime(curr, delta_exec, 0);

return delta_exec;
}
@@ -955,9 +960,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
if (entity_is_task(curr)) {
struct task_struct *curtask = task_of(curr);

- trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
- cgroup_account_cputime(curtask, delta_exec);
- account_group_exec_runtime(curtask, delta_exec);
+ account_curr_runtime(curtask, delta_exec, curr->vruntime);
}

account_cfs_rq_runtime(cfs_rq, delta_exec);


2023-06-19 12:22:54

by Peter Zijlstra

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On Fri, Jun 16, 2023 at 02:05:07PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 03:41:30PM +0200, Daniel Bristot de Oliveira wrote:
>
> > In an 0-laxity scheduler, the server would run at 0-laxity, jumping in
> > front of DL tasks... that would break EDF. It would be mixing two
> > schedulers in one. It is not required and likely not a good idea either.
>
> I did consider a hybrid 0-laxity and EDF scheduler for mixed
> criticality, as have others like Ted Baker IIRC. IIRC it can be done
> using an augmented tree, but none of that solves the problems 0-laxity
> has (like over preemption and the general problem of playing chicken by
> doing things at the *VERY* last possible moment).
>
> I think I did a talk at OSPERT on this at some point many years ago.
> Luckily some bright fellow had this semi-partitioned stuff that would
> make life much simpler :-)

I must clarify; I was thinking Least-Laxity-First, which is of course not
the same as a 0-laxity scheduler.

Subject: Re: [RFC PATCH V3 6/6] sched/fair: Implement starvation monitor

On 6/19/23 14:02, Peter Zijlstra wrote:
> On Fri, Jun 16, 2023 at 02:05:07PM +0200, Peter Zijlstra wrote:
>> On Tue, Jun 13, 2023 at 03:41:30PM +0200, Daniel Bristot de Oliveira wrote:
>>
>>> In an 0-laxity scheduler, the server would run at 0-laxity, jumping in
>>> front of DL tasks... that would break EDF. It would be mixing two
>>> schedulers in one. It is not required and likely not a good idea either.
>>
>> I did consider a hybrid 0-laxity and EDF scheduler for mixed
>> criticality, as have others like Ted Baker IIRC. IIRC it can be done
>> using an augmented tree, but none of that solves the problems 0-laxity
>> has (like over preemption and the general problem of playing chicken by
>> doing things at the *VERY* last possible moment).
>>
>> I think I did a talk at OSPERT on this at some point many years ago.
>> Luckily some bright fellow had this semi-partitioned stuff that would
>> make life much simpler :-)
>
> I must clarify; I was thinking Least-Laxity-First, which is of course not
> the same as a 0-laxity scheduler.

ok, least-laxity-first is another thing... I think the 0-laxity point came from the need
to defer the dl server until that point in time, for the throttling case only...
not as a scheduler.

but still, the vast majority of research is concentrated on EDF. The laxity depends
on the runtime: as the task consumes runtime its laxity changes, and so does its priority.
With deadline only, the priority stays fixed during the job (job-level fixed priority).
It is easier to make decisions, there is less overhead & fewer context switches, and we
can explore things with virtual deadlines.

IIUC, EEVDF also uses the virtual deadline abstraction, right?

-- Daniel
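
To make the difference concrete with standard definitions (not from the
thread): a job's laxity is L(t) = d - t - C_rem(t), the absolute deadline
minus the current time minus the remaining runtime. Say task A has d = 10 ms
and C_rem = 4 ms (L = 6 ms) and task B has d = 8 ms and C_rem = 7 ms
(L = 1 ms), both measured from t = 0. LLF picks B. After 3 ms, B has run
(L_B = 8 - 3 - 4 = 1 ms) while A has only waited (L_A = 10 - 3 - 4 = 3 ms),
so the relative priorities keep shifting as runtime is consumed and time
passes. EDF instead just compares d_A = 10 ms against d_B = 8 ms for the
whole job, which is the job-level fixed priority property mentioned above.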



2023-06-23 17:08:26

by Valentin Schneider

Subject: Re: [RFC PATCH V3 4/6] sched/deadline: Introduce deadline servers

On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
> @@ -2033,9 +2147,20 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> struct task_struct *p;
>
> p = pick_task_dl(rq);
> - if (p)
> + if (!p)
> + return p;
> +
> + /*
> + * XXX: re-check !dl_server, changed from v2 because of
> + * pick_next_task_dl change
> + */
> + if (!dl_server(&p->dl))
> set_next_task_dl(rq, p, true);
>

Should this be

if (!p->server)

instead? AFAICT dl_server(&p->dl) can never be true since there's no
pi_se-like link to the server via the dl_se, only via the task_struct, and
the server pick cannot return the server itself (as it's a pure sched_entity).

> + /* XXX not quite right */
> + if (hrtick_enabled(rq))
> + start_hrtick_dl(rq, &p->dl);
> +

IIUC that got hauled out of set_next_task_dl() to cover the case where we
pick the server (+ the server pick) and want to more thoroughly enforce the
server's bandwidth. If so, what's the issue with starting the hrtick here?

> return p;
> }
>


Subject: Re: [RFC PATCH V3 4/6] sched/deadline: Introduce deadline servers


Back from EOSS...

On 6/23/23 18:47, Valentin Schneider wrote:
> On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
>> @@ -2033,9 +2147,20 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>> struct task_struct *p;
>>
>> p = pick_task_dl(rq);
>> - if (p)
>> + if (!p)
>> + return p;
>> +
>> + /*
>> + * XXX: re-check !dl_server, changed from v2 because of
>> + * pick_next_task_dl change
>> + */
>> + if (!dl_server(&p->dl))
>> set_next_task_dl(rq, p, true);
>>
>
> Should this be
>
> if (!p->server)
>
> instead? AFAICT dl_server(&p->dl) can never be true since there's no
> pi_se-like link to the server via the dl_se, only via the task_struct, and
> the server pick cannot return the server itself (as it's a pure sched_entity).

makes sense... I will check that in the v4.

>
>> + /* XXX not quite right */
>> + if (hrtick_enabled(rq))
>> + start_hrtick_dl(rq, &p->dl);
>> +
>
> IIUC that got hauled out of set_next_task_dl() to cover the case where we
> pick the server (+ the server pick) and want to more thoroughly enforce the
> server's bandwidth. If so, what's the issue with starting the hrtick here?

I think that the comment was added more as a check on whether it is correct... it seems it is.

Thanks Vale!
-- Daniel

>
>> return p;
>> }
>>
>


2023-07-04 17:35:13

by Joel Fernandes

Subject: Re: [RFC PATCH V3 4/6] sched/deadline: Introduce deadline servers

On Tue, Jul 4, 2023 at 1:25 PM Joel Fernandes <[email protected]> wrote:
>
> On Tue, Jul 4, 2023 at 11:52 AM Daniel Bristot de Oliveira
> <[email protected]> wrote:
> >
> >
> > Back from EOSS...
> >
> > On 6/23/23 18:47, Valentin Schneider wrote:
> > > On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
> > >> @@ -2033,9 +2147,20 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> > >> struct task_struct *p;
> > >>
> > >> p = pick_task_dl(rq);
> > >> - if (p)
> > >> + if (!p)
> > >> + return p;
> > >> +
> > >> + /*
> > >> + * XXX: re-check !dl_server, changed from v2 because of
> > >> + * pick_next_task_dl change
> > >> + */
> > >> + if (!dl_server(&p->dl))
> > >> set_next_task_dl(rq, p, true);
> > >>
> > >
> > > Should this be
> > >
> > > if (!p->server)
> > >
> > > instead? AFAICT dl_server(&p->dl) can never be true since there's no
> > > pi_se-like link to the server via the dl_se, only via the task_struct, and
> > > the server pick cannot return the server itself (as it's a pure sched_entity).
> >
> > makes sense... I will check that in the v4.
>
> Makes sense to me too. Since p is either a real DL task or a CFS task,
> "if (dl_server(&p->dl))" is incorrect. "if (p->server)" is the right
> check.

Grr, "if (!p->server)" I mean. Which ensures that set_next_task_dl()
is not called on a non-DL task.

- Joel
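
For illustration, the hunk under discussion would then look roughly like the
sketch below, with the check suggested above folded in. This is only a sketch
of how the proposed change reads, not v4 code, and the hrtick placement keeps
the RFC's open XXX as-is:

static struct task_struct *pick_next_task_dl(struct rq *rq)
{
	struct task_struct *p;

	p = pick_task_dl(rq);
	if (!p)
		return p;

	/*
	 * pick_task_dl() returns either a plain DL task or a task picked on
	 * behalf of the fair server; only the former goes through the DL
	 * set_next_task path.
	 */
	if (!p->server)
		set_next_task_dl(rq, p, true);

	/* XXX not quite right */
	if (hrtick_enabled(rq))
		start_hrtick_dl(rq, &p->dl);

	return p;
}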

2023-07-04 17:35:49

by Joel Fernandes

Subject: Re: [RFC PATCH V3 4/6] sched/deadline: Introduce deadline servers

On Tue, Jul 4, 2023 at 11:52 AM Daniel Bristot de Oliveira
<[email protected]> wrote:
>
>
> Back from EOSS...
>
> On 6/23/23 18:47, Valentin Schneider wrote:
> > On 08/06/23 17:58, Daniel Bristot de Oliveira wrote:
> >> @@ -2033,9 +2147,20 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> >> struct task_struct *p;
> >>
> >> p = pick_task_dl(rq);
> >> - if (p)
> >> + if (!p)
> >> + return p;
> >> +
> >> + /*
> >> + * XXX: re-check !dl_server, changed from v2 because of
> >> + * pick_next_task_dl change
> >> + */
> >> + if (!dl_server(&p->dl))
> >> set_next_task_dl(rq, p, true);
> >>
> >
> > Should this be
> >
> > if (!p->server)
> >
> > instead? AFAICT dl_server(&p->dl) can never be true since there's no
> > pi_se-like link to the server via the dl_se, only via the task_struct, and
> > the server pick cannot return the server itself (as it's a pure sched_entity).
>
> makes sense... I will check that in the v4.

Makes sense to me too. Since p is either a real DL task or a CFS task,
"if (dl_server(&p->dl))" is incorrect. "if (p->server)" is the right
check.

Optionally, a BUG_ON() could be added as well, to make sure that
dl_server(&p->dl) is never true for the p returned by pick_task_dl():
BUG_ON(dl_server(&p->dl));

thanks,

- Joel



>
> >
> >> + /* XXX not quite right */
> >> + if (hrtick_enabled(rq))
> >> + start_hrtick_dl(rq, &p->dl);
> >> +
> >
> > IIUC that got hauled out of set_next_task_dl() to cover the case where we
> > pick the server (+ the server pick) and want to more thoroughly enforce the
> > server's bandwidth. If so, what's the issue with starting the hrtick here?
>
> I think that the commend was added more as a check if it is correct... it seems it is.
>
> Thanks Vale!
> -- Daniel
>
> >
> >> return p;
> >> }
> >>
> >
>