This is v5 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].
SCHED_DEADLINE servers can help fixing starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.
In the v1 there was discussion raised about the consequence of using
deadline based servers on the fixed-priority workloads. For a demonstration
here is the baseline of timerlat scheduling latency as-is, with kernel
build background workload:
# rtla timerlat top -u -d 10m
--------------------- %< ------------------------
Timer Latency
0 01:42:24 | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max | cur min avg max
0 #6143559 | 0 0 0 92 | 2 1 3 98 | 4 1 5 100
1 #6143559 | 1 0 0 97 | 7 1 5 101 | 9 1 7 103
2 #6143559 | 0 0 0 88 | 3 1 5 95 | 5 1 7 99
3 #6143559 | 0 0 0 90 | 6 1 5 103 | 10 1 7 126
4 #6143558 | 1 0 0 81 | 7 1 4 86 | 9 1 7 90
5 #6143558 | 0 0 0 74 | 3 1 5 79 | 4 1 7 83
6 #6143558 | 0 0 0 83 | 2 1 5 89 | 3 0 7 108
7 #6143558 | 0 0 0 85 | 3 1 4 126 | 5 1 6 137
--------------------- >% ------------------------
And this is the same tests with DL server activating without any delay:
--------------------- %< ------------------------
0 00:10:01 | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max | cur min avg max
0 #579147 | 0 0 0 54 | 2 1 52 61095 | 2 2 56 61102
1 #578766 | 0 0 0 83 | 2 1 49 55824 | 3 2 53 55831
2 #578559 | 0 0 1 59 | 2 1 50 55760 | 3 2 54 55770
3 #578318 | 0 0 0 76 | 2 1 49 55751 | 3 2 54 55760
4 #578611 | 0 0 0 64 | 2 1 49 55811 | 3 2 53 55820
5 #578347 | 0 0 1 40 | 2 1 50 56121 | 3 2 55 56133
6 #578938 | 0 0 1 75 | 2 1 49 55755 | 3 2 53 55764
7 #578631 | 0 0 1 36 | 3 1 51 55528 | 4 2 55 55541
--------------------- >% ------------------------
The problem with DL server only implementation is that FIFO tasks might
suffer preemption from NORMAL even when spare CPU cycles are available.
In fact, fair deadline server is enqueued right away when NORMAL tasks
wake up and they are first scheduled by the server, thus potentially
preempting a well behaving FIFO task. This is of course not ideal.
We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
an overkill.
Juri and I discussed this and though about delaying the server
activation for the 0-lag time, thus enabling the server only if the
fair scheduler is about to starve.
The patch 6/7 adds the possibility to defer the server start to the
(absolute deadline - runtime) point in time. This is achieved by
enqueuing the dl server throttled, with a next replenishing time
set to activate the server at (absolute deadline - runtime).
Differently from v4, now the server is enqueued with the runtime
replenished. As the fair scheduler runs without boost, its runtime
is consumed. If the fair server has its runtime before the 0-laxity
time, the a new period is set, and the timer armed for the new
(deadline - runtime).
The patch 7/7 add a per_rq interface for the knobs:
fair_server_runtime (950 ms)
fair_server_period (1s)
fair_server_defer (enabled)
With defer enabled on CPUs [0:3], the results get better, having a
behavior similar to the one we have with the rt throttling.
--------------------- %< ------------------------
Timer Latency
0 00:10:01 | IRQ Timer Latency (us) | Thread Timer Latency (us) | Ret user Timer Latency (us)
CPU COUNT | cur min avg max | cur min avg max | cur min avg max
0 #599979 | 0 0 0 64 | 4 1 4 67 | 6 1 5 69
1 #599979 | 0 0 1 17 | 6 1 5 50 | 10 2 7 71
2 #599984 | 1 0 1 22 | 4 1 5 78 | 5 2 7 107
3 #599986 | 0 0 1 72 | 7 1 5 79 | 10 2 7 82
4 #581580 | 1 0 1 37 | 6 1 38 52797 | 10 2 41 52805
5 #583270 | 1 0 1 41 | 9 1 36 52617 | 12 2 38 52623
6 #581240 | 0 0 1 25 | 7 1 39 52870 | 11 2 41 52876
7 #581208 | 0 0 1 69 | 6 1 39 52917 | 9 2 41 52923
--------------------- >% ------------------------
Here are some osnoise measurement, with osnoise threads running as FIFO:1 with
different setups (defer enabled):
- CPU 2 isolated
- CPU 3 isolated shared with a CFS busy loop task
- CPU 8 non-isolated
- CPU 9 non-isolated shared with a CFS busy loop task
--------------------- %< ------------------------
~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done # for RT kernel
~# sysctl kernel.sched_rt_runtime_us=-1
~# tuna isolate -c 2
~# tuna isolate -c 3
~# taskset -c 3 ./f &
~# taskset -c 9 ./f &
~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
Operating System Noise
duration: 0 00:10:00 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI IRQ Softirq Thread
2 #599 599000000 178 99.99997 18 2 0 0 270 0 0
3 #598 598054434 31351553 94.75774 104442 104442 0 0 2837523 0 1794
8 #599 599000001 567456 99.90526 3260 2375 2 89 620490 0 13539
9 #598 598021196 31742537 94.69207 71707 53357 0 90 3411023 0 1762
--------------------- >% ------------------------
the system runs fine!
- no crashes (famous last words)
- FIFO property is kept
- per cpu interface because it is more flexible - and to detach this from
the throttling concept.
Global is broken, but it will > /dev/null.
TODO:
- Move rt throttling code to RT_GROUP_SCHED for now (then send it to the same
place as global then).
Changes from V4:
- Enable the server when nr fair tasks is > 0 (peter)
- Consume runtime if the zerolax server is not boosted (peterz)
- Adjust interface to deal with admission control (peterz)
- Rebased to 6.6
Changes from V3:
- Add the defer server (Daniel)
- Add an per rq interface (Daniel with peter's feedback)
- Add an option not defer the server (for Joel)
- Typos and 1-liner fixes (Valentin, Luca, Peter)
- Fair scheduler running on dl server do not account as RT task (Daniel)
- Changed the condition to enable the server (RT & fair tasks) (Daniel)
Changes from v2:
- Refactor/rephrase/typos changes
- Defferable server using throttling
- The server starts when RT && Fair tasks are enqueued
- Interface with runtime/period/defer option
Changes from v1:
- rebased on 6.4-rc1 tip/sched/core
Daniel Bristot de Oliveira (2):
sched/deadline: Deferrable dl server
sched/fair: Fair server interface
Peter Zijlstra (5):
sched: Unify runtime accounting across classes
sched/deadline: Collect sched_dl_entity initialization
sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
sched/deadline: Introduce deadline servers
sched/fair: Add trivial fair server
include/linux/sched.h | 26 +-
kernel/sched/core.c | 23 +-
kernel/sched/deadline.c | 671 ++++++++++++++++++++++++++++-----------
kernel/sched/debug.c | 202 ++++++++++++
kernel/sched/fair.c | 87 ++++-
kernel/sched/rt.c | 15 +-
kernel/sched/sched.h | 56 +++-
kernel/sched/stop_task.c | 13 +-
8 files changed, 847 insertions(+), 246 deletions(-)
--
2.40.1
Add an interface for fair server setup on debugfs.
Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
- fair_server_runtime: set runtime in ns
- fair_server_period: set period in ns
- fair_server_defer: on/off for the defer mechanism
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/deadline.c | 89 +++++++++++++++---
kernel/sched/debug.c | 202 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 6 --
kernel/sched/sched.h | 2 +
4 files changed, 279 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 69ee1fbd60e4..1092ca8892e0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
__sub_running_bw(dl_se->dl_bw, dl_rq);
}
-static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
{
- struct rq *rq;
-
- WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
-
- if (task_on_rq_queued(p))
- return;
+ if (dl_se->dl_non_contending) {
+ sub_running_bw(dl_se, &rq->dl);
+ dl_se->dl_non_contending = 0;
- rq = task_rq(p);
- if (p->dl.dl_non_contending) {
- sub_running_bw(&p->dl, &rq->dl);
- p->dl.dl_non_contending = 0;
/*
* If the timer handler is currently running and the
* timer cannot be canceled, inactive_task_timer()
@@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
* will not touch the rq's active utilization,
* so we are still safe.
*/
- if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
- put_task_struct(p);
+ if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+ if (!dl_server(dl_se))
+ put_task_struct(dl_task_of(dl_se));
+ }
}
- __sub_rq_bw(p->dl.dl_bw, &rq->dl);
+ __sub_rq_bw(dl_se->dl_bw, &rq->dl);
__add_rq_bw(new_bw, &rq->dl);
}
+static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+ WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
+
+ if (task_on_rq_queued(p))
+ return;
+
+ dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
+}
+
static void __dl_clear_params(struct sched_dl_entity *dl_se);
/*
@@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
void dl_server_start(struct sched_dl_entity *dl_se)
{
+ /*
+ * XXX: the apply do not work fine at the init phase for the
+ * fair server because things are not yet set. We need to improve
+ * this before getting generic.
+ */
if (!dl_server(dl_se)) {
+ u64 runtime = 50 * NSEC_PER_MSEC;
+ u64 period = 1000 * NSEC_PER_MSEC;
+
+ dl_server_apply_params(dl_se, runtime, period, 1);
+
+ dl_se->dl_zerolax = 1;
dl_se->dl_server = 1;
setup_new_dl_entity(dl_se);
}
+
enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
}
@@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_se->server_pick = pick;
}
+int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
+{
+ u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+ u64 new_bw = to_ratio(period, runtime);
+ struct rq *rq = dl_se->rq;
+ int cpu = cpu_of(rq);
+ struct dl_bw *dl_b;
+ unsigned long cap;
+ int retval = 0;
+ int cpus;
+
+ dl_b = dl_bw_of(cpu);
+ raw_spin_lock(&dl_b->lock);
+ cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+
+ if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
+ retval = -EBUSY;
+ goto out;
+ }
+
+ if (init) {
+ __add_rq_bw(new_bw, &rq->dl);
+ __dl_add(dl_b, new_bw, cpus);
+ } else {
+ __dl_sub(dl_b, dl_se->dl_bw, cpus);
+ __dl_add(dl_b, new_bw, cpus);
+
+ dl_rq_change_utilization(rq, dl_se, new_bw);
+ }
+
+ rq->fair_server.dl_runtime = runtime;
+ rq->fair_server.dl_deadline = period;
+ rq->fair_server.dl_period = period;
+
+ dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+ dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
+
+out:
+ raw_spin_unlock(&dl_b->lock);
+
+ return retval;
+}
+
/*
* Update the current task's runtime statistics (provided it is still
* a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4580a450700e..bd7ad6b8d3de 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = {
.release = seq_release,
};
+enum dl_param {
+ DL_RUNTIME = 0,
+ DL_PERIOD,
+ DL_ZEROLAX
+};
+
+static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
+
+static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos, enum dl_param param)
+{
+ long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+ u64 runtime, period, zerolax;
+ struct rq *rq = cpu_rq(cpu);
+ size_t err;
+ int retval;
+ u64 value;
+
+ err = kstrtoull_from_user(ubuf, cnt, 10, &value);
+ if (err)
+ return err;
+
+ scoped_guard (rq_lock_irqsave, rq) {
+
+ runtime = rq->fair_server.dl_runtime;
+ period = rq->fair_server.dl_period;
+ zerolax = rq->fair_server.dl_zerolax;
+
+ switch (param) {
+ case DL_RUNTIME:
+ if (runtime == value)
+ goto out;
+ runtime = value;
+ break;
+ case DL_PERIOD:
+ if (value == period)
+ goto out;
+ period = value;
+ break;
+ case DL_ZEROLAX:
+ if (zerolax == value)
+ goto out;
+ zerolax = value;
+ break;
+ }
+
+ if (runtime > period
+ || period > fair_server_period_max
+ || period < fair_server_period_min
+ || zerolax > 1) {
+ cnt = -EINVAL;
+ goto out;
+ }
+
+ if (rq->cfs.h_nr_running) {
+ update_rq_clock(rq);
+ dl_server_stop(&rq->fair_server);
+ }
+
+ /*
+ * The zerolax does not change utilization, so just
+ * setting it is enough.
+ */
+ if (rq->fair_server.dl_zerolax != zerolax) {
+ rq->fair_server.dl_zerolax = zerolax;
+ } else {
+ retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+ if (retval)
+ cnt = retval;
+ }
+
+ if (rq->cfs.h_nr_running)
+ dl_server_start(&rq->fair_server);
+ }
+
+out:
+ *ppos += cnt;
+ return cnt;
+}
+
+static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+{
+ unsigned long cpu = (unsigned long) m->private;
+ struct rq *rq = cpu_rq(cpu);
+ u64 value;
+
+ switch (param) {
+ case DL_RUNTIME:
+ value = rq->fair_server.dl_runtime;
+ break;
+ case DL_PERIOD:
+ value = rq->fair_server.dl_period;
+ break;
+ case DL_ZEROLAX:
+ value = rq->fair_server.dl_zerolax;
+ }
+
+ seq_printf(m, "%llu\n", value);
+ return 0;
+
+}
+
+static ssize_t
+sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
+{
+ return sched_fair_server_show(m, v, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_runtime_fops = {
+ .open = sched_fair_server_runtime_open,
+ .write = sched_fair_server_runtime_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static ssize_t
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+}
+
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
+{
+ return sched_fair_server_show(m, v, DL_PERIOD);
+}
+
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_fair_server_period_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_period_fops = {
+ .open = sched_fair_server_period_open,
+ .write = sched_fair_server_period_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static ssize_t
+sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX);
+}
+
+static int sched_fair_server_defer_show(struct seq_file *m, void *v)
+{
+ return sched_fair_server_show(m, v, DL_ZEROLAX);
+}
+
+static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_fair_server_defer_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_defer_fops = {
+ .open = sched_fair_server_defer_open,
+ .write = sched_fair_server_defer_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static struct dentry *debugfs_sched;
+static void debugfs_fair_server_init(void)
+{
+ long cpu;
+ struct dentry *rq_dentry;
+
+ rq_dentry = debugfs_create_dir("rq", debugfs_sched);
+ if (!rq_dentry)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ struct dentry *d_cpu;
+ char buf[32];
+
+ snprintf(buf, sizeof(buf), "cpu%ld", cpu);
+ d_cpu = debugfs_create_dir(buf, rq_dentry);
+
+ debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+ debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+ debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+ }
+}
+
static __init int sched_init_debug(void)
{
struct dentry __maybe_unused *numa;
@@ -374,6 +574,8 @@ static __init int sched_init_debug(void)
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
+ debugfs_fair_server_init();
+
return 0;
}
late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 399237cd9f59..5434c52f470d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq)
struct sched_dl_entity *dl_se = &rq->fair_server;
init_dl_entity(dl_se);
-
- dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
- dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
- dl_se->dl_period = 1000 * NSEC_PER_MSEC;
- dl_se->dl_zerolax = 1;
-
dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec0e288c8e06..312b31df5860 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_pick_f pick);
extern void fair_server_init(struct rq *);
+extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
+ u64 runtime, u64 period, bool init);
#ifdef CONFIG_CGROUP_SCHED
--
2.40.1
From: Peter Zijlstra <[email protected]>
Use deadline servers to service fair tasks.
This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).
[ dl_server do not account for rt ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/deadline.c | 7 +++++++
kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
4 files changed, 41 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a721f6776b12..939266d29681 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10019,6 +10019,7 @@ void __init sched_init(void)
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
atomic_set(&rq->nr_iowait, 0);
+ fair_server_init(rq);
#ifdef CONFIG_SCHED_CORE
rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 541d547e1019..1d7b96ca9011 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1382,6 +1382,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
resched_curr(rq);
}
+ /*
+ * The fair server (sole dl_server) does not account for real-time
+ * workload because it is running fair work.
+ */
+ if (dl_server(dl_se))
+ return;
+
/*
* Because -- for now -- we share the rt bandwidth, we need to
* account our runtime there too, otherwise actual rt tasks
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc3a4bc6c438..b15f7f376a67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6600,6 +6600,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
*/
util_est_enqueue(&rq->cfs, p);
+ if (!rq->cfs.h_nr_running)
+ dl_server_start(&rq->fair_server);
+
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6744,6 +6747,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
rq->next_balance = jiffies;
dequeue_throttle:
+ if (!rq->cfs.h_nr_running)
+ dl_server_stop(&rq->fair_server);
+
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}
@@ -8396,6 +8402,29 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
return pick_next_task_fair(rq, NULL, NULL);
}
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+ return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+ return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+ struct sched_dl_entity *dl_se = &rq->fair_server;
+
+ init_dl_entity(dl_se);
+
+ dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
+ dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
+ dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+
+ dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
/*
* Account for a descheduled task:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24a2bc7c453b..ec0e288c8e06 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -340,6 +340,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick);
+extern void fair_server_init(struct rq *);
+
#ifdef CONFIG_CGROUP_SCHED
struct cfs_rq;
@@ -1005,6 +1007,8 @@ struct rq {
struct rt_rq rt;
struct dl_rq dl;
+ struct sched_dl_entity fair_server;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
struct list_head leaf_cfs_rq_list;
--
2.40.1
From: Peter Zijlstra <[email protected]>
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).
Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
include/linux/sched.h | 22 ++-
kernel/sched/core.c | 17 ++
kernel/sched/deadline.c | 332 +++++++++++++++++++++++++++-------------
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 27 ++++
5 files changed, 294 insertions(+), 108 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 31eee8b03dcd..5ac1f252e136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -63,11 +63,13 @@ struct robust_list_head;
struct root_domain;
struct rq;
struct sched_attr;
+struct sched_dl_entity;
struct seq_file;
struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct task_struct;
struct user_event_mm;
/*
@@ -607,6 +609,9 @@ struct sched_rt_entity {
#endif
} __randomize_layout;
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
struct sched_dl_entity {
struct rb_node rb_node;
@@ -654,6 +659,7 @@ struct sched_dl_entity {
unsigned int dl_yielded : 1;
unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1;
+ unsigned int dl_server : 1;
/*
* Bandwidth enforcement timer. Each -deadline task has its
@@ -668,7 +674,20 @@ struct sched_dl_entity {
* timer is needed to decrease the active utilization at the correct
* time.
*/
- struct hrtimer inactive_timer;
+ struct hrtimer inactive_timer;
+
+ /*
+ * Bits for DL-server functionality. Also see the comment near
+ * dl_server_update().
+ *
+ * @rq the runqueue this server is for
+ *
+ * @server_has_tasks() returns true if @server_pick return a
+ * runnable task.
+ */
+ struct rq *rq;
+ dl_server_has_tasks_f server_has_tasks;
+ dl_server_pick_f server_pick;
#ifdef CONFIG_RT_MUTEXES
/*
@@ -795,6 +814,7 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
+ struct sched_dl_entity *server;
const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257369d30303..a721f6776b12 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3795,6 +3795,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
rq->idle_stamp = 0;
}
#endif
+
+ p->server = NULL;
}
/*
@@ -6001,12 +6003,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_next_task_idle(rq);
}
+ /*
+ * This is the fast path; it cannot be a DL server pick;
+ * therefore even if @p == @prev, ->server must be NULL.
+ */
+ if (p->server)
+ p->server = NULL;
+
return p;
}
restart:
put_prev_task_balance(rq, prev, rf);
+ /*
+ * We've updated @prev and no longer need the server link, clear it.
+ * Must be done before ->pick_next_task() because that can (re)set
+ * ->server.
+ */
+ if (prev->server)
+ prev->server = NULL;
+
for_each_class(class) {
p = class->pick_next_task(rq);
if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 81810f67df7a..541d547e1019 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,8 +54,14 @@ static int __init sched_dl_sysctl_init(void)
late_initcall(sched_dl_sysctl_init);
#endif
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+ return dl_se->dl_server;
+}
+
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
+ BUG_ON(dl_server(dl_se));
return container_of(dl_se, struct task_struct, dl);
}
@@ -64,12 +70,19 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
return container_of(dl_rq, struct rq, dl);
}
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = task_rq(p);
+ struct rq *rq = dl_se->rq;
+
+ if (!dl_server(dl_se))
+ rq = task_rq(dl_task_of(dl_se));
- return &rq->dl;
+ return rq;
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+ return &rq_of_dl_se(dl_se)->dl;
}
static inline int on_dl_rq(struct sched_dl_entity *dl_se)
@@ -394,9 +407,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
static void task_non_contending(struct sched_dl_entity *dl_se)
{
struct hrtimer *timer = &dl_se->inactive_timer;
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq);
- struct task_struct *p = dl_task_of(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+ struct dl_rq *dl_rq = &rq->dl;
s64 zerolag_time;
/*
@@ -426,25 +438,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
* utilization now, instead of starting a timer
*/
if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
- if (dl_task(p))
+ if (dl_server(dl_se)) {
sub_running_bw(dl_se, dl_rq);
+ } else {
+ struct task_struct *p = dl_task_of(dl_se);
- if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ if (dl_task(p))
+ sub_running_bw(dl_se, dl_rq);
- if (READ_ONCE(p->__state) == TASK_DEAD)
- sub_rq_bw(dl_se, &rq->dl);
- raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
- raw_spin_unlock(&dl_b->lock);
- __dl_clear_params(dl_se);
+ if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ if (READ_ONCE(p->__state) == TASK_DEAD)
+ sub_rq_bw(dl_se, &rq->dl);
+ raw_spin_lock(&dl_b->lock);
+ __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+ raw_spin_unlock(&dl_b->lock);
+ __dl_clear_params(dl_se);
+ }
}
return;
}
dl_se->dl_non_contending = 1;
- get_task_struct(p);
+ if (!dl_server(dl_se))
+ get_task_struct(dl_task_of(dl_se));
+
hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
}
@@ -471,8 +491,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
* will not touch the rq's active utilization,
* so we are still safe.
*/
- if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
- put_task_struct(dl_task_of(dl_se));
+ if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+ if (!dl_server(dl_se))
+ put_task_struct(dl_task_of(dl_se));
+ }
} else {
/*
* Since "dl_non_contending" is not set, the
@@ -485,10 +507,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
}
}
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- struct sched_dl_entity *dl_se = &p->dl;
-
return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
}
@@ -740,8 +760,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
}
#endif /* CONFIG_SMP */
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags);
static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
@@ -989,8 +1011,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
*/
static void update_dl_entity(struct sched_dl_entity *dl_se)
{
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq);
+ struct rq *rq = rq_of_dl_se(dl_se);
if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
dl_entity_overflow(dl_se, rq_clock(rq))) {
@@ -1021,11 +1042,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
* actually started or not (i.e., the replenishment instant is in
* the future or in the past).
*/
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
{
- struct sched_dl_entity *dl_se = &p->dl;
struct hrtimer *timer = &dl_se->dl_timer;
- struct rq *rq = task_rq(p);
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
ktime_t now, act;
s64 delta;
@@ -1059,13 +1080,33 @@ static int start_dl_timer(struct task_struct *p)
* and observe our state.
*/
if (!hrtimer_is_queued(timer)) {
- get_task_struct(p);
+ if (!dl_server(dl_se))
+ get_task_struct(dl_task_of(dl_se));
hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
}
return 1;
}
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+ /*
+ * Queueing this task back might have overloaded rq, check if we need
+ * to kick someone away.
+ */
+ if (has_pushable_dl_tasks(rq)) {
+ /*
+ * Nothing relies on rq->lock after this, so its safe to drop
+ * rq->lock.
+ */
+ rq_unpin_lock(rq, rf);
+ push_dl_task(rq);
+ rq_repin_lock(rq, rf);
+ }
+#endif
+}
+
/*
* This is the bandwidth enforcement timer callback. If here, we know
* a task is not on its dl_rq, since the fact that the timer was running
@@ -1084,10 +1125,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
struct sched_dl_entity *dl_se = container_of(timer,
struct sched_dl_entity,
dl_timer);
- struct task_struct *p = dl_task_of(dl_se);
+ struct task_struct *p;
struct rq_flags rf;
struct rq *rq;
+ if (dl_server(dl_se)) {
+ struct rq *rq = rq_of_dl_se(dl_se);
+ struct rq_flags rf;
+
+ rq_lock(rq, &rf);
+ if (dl_se->dl_throttled) {
+ sched_clock_tick();
+ update_rq_clock(rq);
+
+ if (dl_se->server_has_tasks(dl_se)) {
+ enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+ resched_curr(rq);
+ __push_dl_task(rq, &rf);
+ } else {
+ replenish_dl_entity(dl_se);
+ }
+
+ }
+ rq_unlock(rq, &rf);
+
+ return HRTIMER_NORESTART;
+ }
+
+ p = dl_task_of(dl_se);
rq = task_rq_lock(p, &rf);
/*
@@ -1158,21 +1223,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
else
resched_curr(rq);
-#ifdef CONFIG_SMP
- /*
- * Queueing this task back might have overloaded rq, check if we need
- * to kick someone away.
- */
- if (has_pushable_dl_tasks(rq)) {
- /*
- * Nothing relies on rq->lock after this, so its safe to drop
- * rq->lock.
- */
- rq_unpin_lock(rq, &rf);
- push_dl_task(rq);
- rq_repin_lock(rq, &rf);
- }
-#endif
+ __push_dl_task(rq, &rf);
unlock:
task_rq_unlock(rq, p, &rf);
@@ -1214,12 +1265,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
*/
static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+ struct rq *rq = rq_of_dl_se(dl_se);
if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
- if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
+ if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
return;
dl_se->dl_throttled = 1;
if (dl_se->runtime > 0)
@@ -1270,29 +1320,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
return (delta * u_act) >> BW_SHIFT;
}
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+ int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
{
- struct task_struct *curr = rq->curr;
- struct sched_dl_entity *dl_se = &curr->dl;
- s64 delta_exec, scaled_delta_exec;
- int cpu = cpu_of(rq);
-
- if (!dl_task(curr) || !on_dl_rq(dl_se))
- return;
+ s64 scaled_delta_exec;
- /*
- * Consumed budget is computed considering the time as
- * observed by schedulable tasks (excluding time spent
- * in hardirq context, etc.). Deadlines are instead
- * computed using hard walltime. This seems to be the more
- * natural solution, but the full ramifications of this
- * approach need further study.
- */
- delta_exec = update_curr_common(rq);
if (unlikely(delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
@@ -1310,10 +1344,9 @@ static void update_curr_dl(struct rq *rq)
* according to current frequency and CPU maximum capacity.
*/
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
- scaled_delta_exec = grub_reclaim(delta_exec,
- rq,
- &curr->dl);
+ scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
} else {
+ int cpu = cpu_of(rq);
unsigned long scale_freq = arch_scale_freq_capacity(cpu);
unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
@@ -1332,11 +1365,20 @@ static void update_curr_dl(struct rq *rq)
(dl_se->flags & SCHED_FLAG_DL_OVERRUN))
dl_se->dl_overrun = 1;
- __dequeue_task_dl(rq, curr, 0);
- if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
- enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+ dequeue_dl_entity(dl_se, 0);
+ if (!dl_server(dl_se)) {
+ update_stats_dequeue_dl(&rq->dl, dl_se, 0);
+ dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+ }
- if (!is_leftmost(curr, &rq->dl))
+ if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+ if (dl_server(dl_se))
+ enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+ else
+ enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+ }
+
+ if (!is_leftmost(dl_se, &rq->dl))
resched_curr(rq);
}
@@ -1366,20 +1408,82 @@ static void update_curr_dl(struct rq *rq)
}
}
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+ update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+ if (!dl_server(dl_se)) {
+ dl_se->dl_server = 1;
+ setup_new_dl_entity(dl_se);
+ }
+ enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+ dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+ dl_server_has_tasks_f has_tasks,
+ dl_server_pick_f pick)
+{
+ dl_se->rq = rq;
+ dl_se->server_has_tasks = has_tasks;
+ dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct sched_dl_entity *dl_se = &curr->dl;
+ s64 delta_exec;
+
+ if (!dl_task(curr) || !on_dl_rq(dl_se))
+ return;
+
+ /*
+ * Consumed budget is computed considering the time as
+ * observed by schedulable tasks (excluding time spent
+ * in hardirq context, etc.). Deadlines are instead
+ * computed using hard walltime. This seems to be the more
+ * natural solution, but the full ramifications of this
+ * approach need further study.
+ */
+ delta_exec = update_curr_common(rq);
+ update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
{
struct sched_dl_entity *dl_se = container_of(timer,
struct sched_dl_entity,
inactive_timer);
- struct task_struct *p = dl_task_of(dl_se);
+ struct task_struct *p = NULL;
struct rq_flags rf;
struct rq *rq;
- rq = task_rq_lock(p, &rf);
+ if (!dl_server(dl_se)) {
+ p = dl_task_of(dl_se);
+ rq = task_rq_lock(p, &rf);
+ } else {
+ rq = dl_se->rq;
+ rq_lock(rq, &rf);
+ }
sched_clock_tick();
update_rq_clock(rq);
+ if (dl_server(dl_se))
+ goto no_task;
+
if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
@@ -1396,14 +1500,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
goto unlock;
}
+
+no_task:
if (dl_se->dl_non_contending == 0)
goto unlock;
sub_running_bw(dl_se, &rq->dl);
dl_se->dl_non_contending = 0;
unlock:
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
+
+ if (!dl_server(dl_se)) {
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ } else {
+ rq_unlock(rq, &rf);
+ }
return HRTIMER_NORESTART;
}
@@ -1466,10 +1577,8 @@ static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
static inline
void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- int prio = dl_task_of(dl_se)->prio;
u64 deadline = dl_se->deadline;
- WARN_ON(!dl_prio(prio));
dl_rq->dl_nr_running++;
add_nr_running(rq_of_dl_rq(dl_rq), 1);
@@ -1479,9 +1588,6 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
static inline
void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- int prio = dl_task_of(dl_se)->prio;
-
- WARN_ON(!dl_prio(prio));
WARN_ON(!dl_rq->dl_nr_running);
dl_rq->dl_nr_running--;
sub_nr_running(rq_of_dl_rq(dl_rq), 1);
@@ -1648,8 +1754,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
} else if (flags & ENQUEUE_REPLENISH) {
replenish_dl_entity(dl_se);
} else if ((flags & ENQUEUE_RESTORE) &&
- dl_time_before(dl_se->deadline,
- rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+ dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
setup_new_dl_entity(dl_se);
}
@@ -1730,19 +1835,13 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
enqueue_dl_entity(&p->dl, flags);
+ if (dl_server(&p->dl))
+ return;
+
if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
- update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
- dequeue_dl_entity(&p->dl, flags);
-
- if (!p->dl.dl_throttled)
- dequeue_pushable_dl_task(rq, p);
-}
-
static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
@@ -1750,7 +1849,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (p->on_rq == TASK_ON_RQ_MIGRATING)
flags |= DEQUEUE_MIGRATING;
- __dequeue_task_dl(rq, p, flags);
+ dequeue_dl_entity(&p->dl, flags);
+ if (!p->dl.dl_throttled && !dl_server(&p->dl))
+ dequeue_pushable_dl_task(rq, p);
}
/*
@@ -1940,12 +2041,12 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
}
#ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
{
- hrtick_start(rq, p->dl.runtime);
+ hrtick_start(rq, dl_se->runtime);
}
#else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
{
}
#endif
@@ -1965,9 +2066,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;
- if (hrtick_enabled_dl(rq))
- start_hrtick_dl(rq, p);
-
if (rq->curr->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
@@ -1990,12 +2088,25 @@ static struct task_struct *pick_task_dl(struct rq *rq)
struct dl_rq *dl_rq = &rq->dl;
struct task_struct *p;
+again:
if (!sched_dl_runnable(rq))
return NULL;
dl_se = pick_next_dl_entity(dl_rq);
WARN_ON_ONCE(!dl_se);
- p = dl_task_of(dl_se);
+
+ if (dl_server(dl_se)) {
+ p = dl_se->server_pick(dl_se);
+ if (!p) {
+ WARN_ON_ONCE(1);
+ dl_se->dl_yielded = 1;
+ update_curr_dl_se(rq, dl_se, 0);
+ goto again;
+ }
+ p->server = dl_se;
+ } else {
+ p = dl_task_of(dl_se);
+ }
return p;
}
@@ -2005,9 +2116,15 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
struct task_struct *p;
p = pick_task_dl(rq);
- if (p)
+ if (!p)
+ return p;
+
+ if (!p->server)
set_next_task_dl(rq, p, true);
+ if (hrtick_enabled(rq))
+ start_hrtick_dl(rq, &p->dl);
+
return p;
}
@@ -2045,8 +2162,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
* be set and schedule() will start a new hrtick for the next task.
*/
if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
- is_leftmost(p, &rq->dl))
- start_hrtick_dl(rq, p);
+ is_leftmost(&p->dl, &rq->dl))
+ start_hrtick_dl(rq, &p->dl);
}
static void task_fork_dl(struct task_struct *p)
@@ -2986,6 +3103,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
dl_se->dl_yielded = 0;
dl_se->dl_non_contending = 0;
dl_se->dl_overrun = 0;
+ dl_se->dl_server = 0;
#ifdef CONFIG_RT_MUTEXES
dl_se->pi_se = dl_se;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2613704a2d2d..bc3a4bc6c438 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1168,6 +1168,8 @@ s64 update_curr_common(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);
cgroup_account_cputime(curr, delta_exec);
+ if (curr->server)
+ dl_server_update(curr->server, delta_exec);
return delta_exec;
}
@@ -1197,6 +1199,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
+ if (curtask->server)
+ dl_server_update(curtask->server, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a0cdc540029c..24a2bc7c453b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_bw_check_overflow(int cpu);
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ * dl_se::rq -- runqueue we belong to.
+ *
+ * dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ * server when it runs out of tasks to run.
+ *
+ * dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ * returns NULL.
+ *
+ * dl_server_update() -- called from update_curr_common(), propagates runtime
+ * to the server.
+ *
+ * dl_server_start()
+ * dl_server_stop() -- start/stop the server when it has (no) tasks.
+ *
+ * dl_server_init() -- initializes the server.
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+ dl_server_has_tasks_f has_tasks,
+ dl_server_pick_f pick);
+
#ifdef CONFIG_CGROUP_SCHED
struct cfs_rq;
--
2.40.1
Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.
The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.
For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.
By setting the zerolax option, the dl_server will dispatch an
SCHED_DEADLINE reservation with replenished runtime, but throttled.
The dl_timer will be set for (period - runtime) ns from start time.
Thus boosting the fair rq on its 0-laxity time with respect to
rt_rq.
If the fair scheduler has the opportunity to run while waiting
for zerolax time, the dl server runtime will be consumed. If
the runtime is completely consumed before the zerolax time, the
server will be replenished while still in a throttled state. Then,
the dl_timer will be reset to the new zerolax time
If the fair server reaches the zerolax time without consuming
its runtime, the server will be boosted, following CBS rules
(thus without breaking SCHED_DEADLINE).
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
include/linux/sched.h | 2 +
kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 3 ++
3 files changed, 103 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5ac1f252e136..56e53e6fd5a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -660,6 +660,8 @@ struct sched_dl_entity {
unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1;
unsigned int dl_server : 1;
+ unsigned int dl_zerolax : 1;
+ unsigned int dl_zerolax_armed : 1;
/*
* Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1d7b96ca9011..69ee1fbd60e4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
/* for non-boosted task, pi_of(dl_se) == dl_se */
dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
dl_se->runtime = pi_of(dl_se)->dl_runtime;
+
+ /*
+ * If it is a zerolax reservation, throttle it.
+ */
+ if (dl_se->dl_zerolax) {
+ dl_se->dl_throttled = 1;
+ dl_se->dl_zerolax_armed = 1;
+ }
}
/*
@@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
* could happen are, typically, a entity voluntarily trying to overcome its
* runtime, or it just underestimated it during sched_setattr().
*/
+static int start_dl_timer(struct sched_dl_entity *dl_se);
static void replenish_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
dl_se->dl_yielded = 0;
if (dl_se->dl_throttled)
dl_se->dl_throttled = 0;
+
+ /*
+ * If this is the replenishment of a zerolax reservation,
+ * clear the flag and return.
+ */
+ if (dl_se->dl_zerolax_armed) {
+ dl_se->dl_zerolax_armed = 0;
+ return;
+ }
+
+ /*
+ * A this point, if the zerolax server is not armed, and the deadline
+ * is in the future, throttle the server and arm the zerolax timer.
+ */
+ if (dl_se->dl_zerolax &&
+ dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
+ if (!is_dl_boosted(dl_se)) {
+ dl_se->dl_zerolax_armed = 1;
+ dl_se->dl_throttled = 1;
+ start_dl_timer(dl_se);
+ }
+ }
}
/*
@@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
}
replenish_dl_new_period(dl_se, rq);
+ } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
+ /*
+ * The server can still use its previous deadline, so throttle
+ * and arm the zero-laxity timer.
+ */
+ dl_se->dl_zerolax_armed = 1;
+ dl_se->dl_throttled = 1;
}
}
@@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
* We want the timer to fire at the deadline, but considering
* that it is actually coming from rq->clock and not from
* hrtimer's time base reading.
+ *
+ * The zerolax reservation will have its timer set to the
+ * deadline - runtime. At that point, the CBS rule will decide
+ * if the current deadline can be used, or if a replenishment
+ * is required to avoid add too much pressure on the system
+ * (current u > U).
*/
- act = ns_to_ktime(dl_next_period(dl_se));
+ if (dl_se->dl_zerolax_armed) {
+ WARN_ON_ONCE(!dl_se->dl_throttled);
+ act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+ } else {
+ act = ns_to_ktime(dl_next_period(dl_se));
+ }
+
now = hrtimer_cb_get_time(timer);
delta = ktime_to_ns(now) - rq_clock(rq);
act = ktime_add_ns(act, delta);
@@ -1333,6 +1383,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
return;
}
+ if (dl_server(dl_se) && dl_se->dl_throttled && !dl_se->dl_zerolax)
+ return;
+
if (dl_entity_is_special(dl_se))
return;
@@ -1356,6 +1409,39 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
dl_se->runtime -= scaled_delta_exec;
+ /*
+ * The fair server can consume its runtime while throttled (not queued).
+ *
+ * If the server consumes its entire runtime in this state. The server
+ * is not required for the current period. Thus, reset the server by
+ * starting a new period, pushing the activation to the zero-lax time.
+ */
+ if (dl_se->dl_zerolax && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
+ s64 runtime_diff = dl_se->runtime + dl_se->dl_runtime;
+
+ /*
+ * If this is a regular throttling case, let it run negative until
+ * the dl_runtime - runtime > 0. The reason being is that the next
+ * replenishment will result in a positive runtime one period ahead.
+ *
+ * Otherwise, the deadline will be pushed more than one period, not
+ * providing runtime/period anymore.
+ *
+ * If the dl_runtime - runtime < 0, then the server was able to get
+ * the runtime/period before the replenishment. So it is safe
+ * to start a new deffered period.
+ */
+ if (!dl_se->dl_zerolax_armed && runtime_diff > 0)
+ return;
+
+ hrtimer_try_to_cancel(&dl_se->dl_timer);
+
+ replenish_dl_new_period(dl_se, dl_se->rq);
+ start_dl_timer(dl_se);
+
+ return;
+ }
+
throttle:
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
dl_se->dl_throttled = 1;
@@ -1432,6 +1518,9 @@ void dl_server_start(struct sched_dl_entity *dl_se)
void dl_server_stop(struct sched_dl_entity *dl_se)
{
dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+ hrtimer_try_to_cancel(&dl_se->dl_timer);
+ dl_se->dl_zerolax_armed = 0;
+ dl_se->dl_throttled = 0;
}
void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -1743,7 +1832,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
* be counted in the active utilization; hence, we need to call
* add_running_bw().
*/
- if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+ if (!dl_se->dl_zerolax && dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
if (flags & ENQUEUE_WAKEUP)
task_contending(dl_se, flags);
@@ -1765,6 +1854,13 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
setup_new_dl_entity(dl_se);
}
+ /*
+ * If we are still throttled, eg. we got replenished but are a
+ * zero-laxity task and still got to wait, don't enqueue.
+ */
+ if (dl_se->dl_throttled && start_dl_timer(dl_se))
+ return;
+
__enqueue_dl_entity(dl_se);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b15f7f376a67..399237cd9f59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
account_group_exec_runtime(curtask, delta_exec);
if (curtask->server)
dl_server_update(curtask->server, delta_exec);
+ else
+ dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -8421,6 +8423,7 @@ void fair_server_init(struct rq *rq)
dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+ dl_se->dl_zerolax = 1;
dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
}
--
2.40.1
Hi Daniel,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base: tip/sched/core
patch link: https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-buildonly-randconfig-001-20231104 (https://download.01.org/0day-ci/archive/20231104/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231104/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
In file included from kernel/sched/build_utility.c:72:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression of type 'long int' results in '-100663296' [-Woverflow]
342 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
| ^
vim +342 kernel/sched/debug.c
341
> 342 static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
343 static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
344
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Daniel,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base: tip/sched/core
patch link: https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-randconfig-013-20231105 (https://download.01.org/0day-ci/archive/20231105/[email protected]/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231105/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
In file included from kernel/sched/build_utility.c:72:0:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression [-Woverflow]
static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
^
vim +342 kernel/sched/debug.c
341
> 342 static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
343 static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
344
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Sat, Nov 04, 2023 at 11:59:22AM +0100, Daniel Bristot de Oliveira wrote:
> [ dl_server do not account for rt ]
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 541d547e1019..1d7b96ca9011 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1382,6 +1382,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
> resched_curr(rq);
> }
>
> + /*
> + * The fair server (sole dl_server) does not account for real-time
> + * workload because it is running fair work.
> + */
> + if (dl_server(dl_se))
> + return;
> +
> /*
> * Because -- for now -- we share the rt bandwidth, we need to
> * account our runtime there too, otherwise actual rt tasks
Should we perhaps write this like so?
if (dl_se == &rq->fair_server)
return;
On 11/6/23 15:24, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:22AM +0100, Daniel Bristot de Oliveira wrote:
>
>> [ dl_server do not account for rt ]
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 541d547e1019..1d7b96ca9011 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -1382,6 +1382,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>> resched_curr(rq);
>> }
>>
>> + /*
>> + * The fair server (sole dl_server) does not account for real-time
>> + * workload because it is running fair work.
>> + */
>> + if (dl_server(dl_se))
>> + return;
>> +
>> /*
>> * Because -- for now -- we share the rt bandwidth, we need to
>> * account our runtime there too, otherwise actual rt tasks
> Should we perhaps write this like so?
>
> if (dl_se == &rq->fair_server)
> return;
right, it is better for the next step (making it generic).
-- Daniel
On Sat, Nov 04, 2023 at 11:59:23AM +0100, Daniel Bristot de Oliveira wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b15f7f376a67..399237cd9f59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
> account_group_exec_runtime(curtask, delta_exec);
> if (curtask->server)
> dl_server_update(curtask->server, delta_exec);
> + else
> + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
> }
>
> account_cfs_rq_runtime(cfs_rq, delta_exec);
I've rewritten that something like so..
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1182,12 +1182,13 @@ s64 update_curr_common(struct rq *rq)
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
+ struct rq *rq = rq_of(cfs_rq);
s64 delta_exec;
if (unlikely(!curr))
return;
- delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+ delta_exec = update_curr_se(rq, curr);
if (unlikely(delta_exec <= 0))
return;
@@ -1195,8 +1196,17 @@ static void update_curr(struct cfs_rq *c
update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
- if (entity_is_task(curr))
- update_curr_task(task_of(curr), delta_exec);
+ if (entity_is_task(curr)) {
+ struct task_struct *p = task_of(curr);
+ update_curr_task(p, delta_exec);
+ /*
+ * Any fair task that runs outside of fair_server should
+ * account against fair_server such that it can account for
+ * this time and possible avoid running this period.
+ */
+ if (p->dl_server != &rq->fair_server)
+ dl_server_update(&rq->fair_server, delta_exec);
+ }
account_cfs_rq_runtime(cfs_rq, delta_exec);
}
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
>
This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
total available bandwidth control, right?
But then shouldn've we also rip out the throttle thingy right quick?
On 11/6/23 16:40, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>> - fair_server_runtime: set runtime in ns
>> - fair_server_period: set period in ns
>> - fair_server_defer: on/off for the defer mechanism
>>
>
> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> total available bandwidth control, right?
right, but thinking aloud... given that the per-cpu files are already allocating the
bandwidth on the dl_rq, the spare time for fair scheduler is granted.
Still, we can have them there as a safeguard to not overloading the deadline
scheduler... (thinking aloud 2) as long as global is a thing... as we get away
from it, that global limitation will make less sense - still better to have a form
of limitation so people are aware of bandwidth until there.
> But then shouldn've we also rip out the throttle thingy right quick?
>
I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
for now, because GROUP_SCHED depends on it, no?
With the next step on moving the dl server as the base for the hierarchical scheduling...
That will rip out the CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
Does it make sense?
On 11/6/23 15:55, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:23AM +0100, Daniel Bristot de Oliveira wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b15f7f376a67..399237cd9f59 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>> account_group_exec_runtime(curtask, delta_exec);
>> if (curtask->server)
>> dl_server_update(curtask->server, delta_exec);
>> + else
>> + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>> }
>>
>> account_cfs_rq_runtime(cfs_rq, delta_exec);
>
> @@ -1195,8 +1196,17 @@ static void update_curr(struct cfs_rq *c
> update_deadline(cfs_rq, curr);
> update_min_vruntime(cfs_rq);
>
> - if (entity_is_task(curr))
> - update_curr_task(task_of(curr), delta_exec);
> + if (entity_is_task(curr)) {
> + struct task_struct *p = task_of(curr);
> + update_curr_task(p, delta_exec);
> + /*
> + * Any fair task that runs outside of fair_server should
> + * account against fair_server such that it can account for
> + * this time and possible avoid running this period.
> + */
> + if (p->dl_server != &rq->fair_server)
> + dl_server_update(&rq->fair_server, delta_exec);
aren't we missing:
else
dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
or am I missing something? :-)
> + }
>
> account_cfs_rq_runtime(cfs_rq, delta_exec);
> }
Hi Daniel,
On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
<[email protected]> wrote:
>
> Among the motivations for the DL servers is the real-time throttling
> mechanism. This mechanism works by throttling the rt_rq after
> running for a long period without leaving space for fair tasks.
>
> The base dl server avoids this problem by boosting fair tasks instead
> of throttling the rt_rq. The point is that it boosts without waiting
> for potential starvation, causing some non-intuitive cases.
>
> For example, an IRQ dispatches two tasks on an idle system, a fair
> and an RT. The DL server will be activated, running the fair task
> before the RT one. This problem can be avoided by deferring the
> dl server activation.
>
> By setting the zerolax option, the dl_server will dispatch an
> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>
> The dl_timer will be set for (period - runtime) ns from start time.
> Thus boosting the fair rq on its 0-laxity time with respect to
> rt_rq.
>
> If the fair scheduler has the opportunity to run while waiting
> for zerolax time, the dl server runtime will be consumed. If
> the runtime is completely consumed before the zerolax time, the
> server will be replenished while still in a throttled state. Then,
> the dl_timer will be reset to the new zerolax time
>
> If the fair server reaches the zerolax time without consuming
> its runtime, the server will be boosted, following CBS rules
> (thus without breaking SCHED_DEADLINE).
>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> include/linux/sched.h | 2 +
> kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
> kernel/sched/fair.c | 3 ++
> 3 files changed, 103 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5ac1f252e136..56e53e6fd5a0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -660,6 +660,8 @@ struct sched_dl_entity {
> unsigned int dl_non_contending : 1;
> unsigned int dl_overrun : 1;
> unsigned int dl_server : 1;
> + unsigned int dl_zerolax : 1;
> + unsigned int dl_zerolax_armed : 1;
>
> /*
> * Bandwidth enforcement timer. Each -deadline task has its
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1d7b96ca9011..69ee1fbd60e4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> /* for non-boosted task, pi_of(dl_se) == dl_se */
> dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +
> + /*
> + * If it is a zerolax reservation, throttle it.
> + */
> + if (dl_se->dl_zerolax) {
> + dl_se->dl_throttled = 1;
> + dl_se->dl_zerolax_armed = 1;
> + }
> }
>
> /*
> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> * could happen are, typically, a entity voluntarily trying to overcome its
> * runtime, or it just underestimated it during sched_setattr().
> */
> +static int start_dl_timer(struct sched_dl_entity *dl_se);
> static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> dl_se->dl_yielded = 0;
> if (dl_se->dl_throttled)
> dl_se->dl_throttled = 0;
> +
> + /*
> + * If this is the replenishment of a zerolax reservation,
> + * clear the flag and return.
> + */
> + if (dl_se->dl_zerolax_armed) {
> + dl_se->dl_zerolax_armed = 0;
> + return;
> + }
> +
> + /*
> + * A this point, if the zerolax server is not armed, and the deadline
> + * is in the future, throttle the server and arm the zerolax timer.
> + */
> + if (dl_se->dl_zerolax &&
> + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> + if (!is_dl_boosted(dl_se)) {
> + dl_se->dl_zerolax_armed = 1;
> + dl_se->dl_throttled = 1;
> + start_dl_timer(dl_se);
> + }
> + }
> }
>
> /*
> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
> }
>
> replenish_dl_new_period(dl_se, rq);
> + } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> + /*
> + * The server can still use its previous deadline, so throttle
> + * and arm the zero-laxity timer.
> + */
> + dl_se->dl_zerolax_armed = 1;
> + dl_se->dl_throttled = 1;
> }
> }
>
> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> * We want the timer to fire at the deadline, but considering
> * that it is actually coming from rq->clock and not from
> * hrtimer's time base reading.
> + *
> + * The zerolax reservation will have its timer set to the
> + * deadline - runtime. At that point, the CBS rule will decide
> + * if the current deadline can be used, or if a replenishment
> + * is required to avoid add too much pressure on the system
> + * (current u > U).
> */
> - act = ns_to_ktime(dl_next_period(dl_se));
> + if (dl_se->dl_zerolax_armed) {
> + WARN_ON_ONCE(!dl_se->dl_throttled);
> + act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
Just a question, here if dl_se->deadline - dl_se->runtime is large,
then does that mean that server activation will be much more into the
future? So say I want to give CFS 30%, then it will take 70% of the
period before CFS preempts RT thus "starving" CFS for this duration. I
think that's Ok for smaller periods and runtimes, though.
I think it does reserve the amount of required CFS bandwidth so it is
probably OK, though it is perhaps letting RT run more initially (say
if CFS tasks are not CPU bound and occasionally wake up, they will
always be hit by the 70% latency AFAICS which may be large for large
periods and small runtimes).
I/we're currently trying these patches on ChromeOS as well.
Just started going over it to understand the patch. Looking nice so
far and thanks,
- Joel
On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <[email protected]> wrote:
>
> Hi Daniel,
>
> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> <[email protected]> wrote:
> >
> > Among the motivations for the DL servers is the real-time throttling
> > mechanism. This mechanism works by throttling the rt_rq after
> > running for a long period without leaving space for fair tasks.
> >
> > The base dl server avoids this problem by boosting fair tasks instead
> > of throttling the rt_rq. The point is that it boosts without waiting
> > for potential starvation, causing some non-intuitive cases.
> >
> > For example, an IRQ dispatches two tasks on an idle system, a fair
> > and an RT. The DL server will be activated, running the fair task
> > before the RT one. This problem can be avoided by deferring the
> > dl server activation.
> >
> > By setting the zerolax option, the dl_server will dispatch an
> > SCHED_DEADLINE reservation with replenished runtime, but throttled.
> >
> > The dl_timer will be set for (period - runtime) ns from start time.
> > Thus boosting the fair rq on its 0-laxity time with respect to
> > rt_rq.
> >
> > If the fair scheduler has the opportunity to run while waiting
> > for zerolax time, the dl server runtime will be consumed. If
> > the runtime is completely consumed before the zerolax time, the
> > server will be replenished while still in a throttled state. Then,
> > the dl_timer will be reset to the new zerolax time
> >
> > If the fair server reaches the zerolax time without consuming
> > its runtime, the server will be boosted, following CBS rules
> > (thus without breaking SCHED_DEADLINE).
> >
> > Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> > ---
> > include/linux/sched.h | 2 +
> > kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/fair.c | 3 ++
> > 3 files changed, 103 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 5ac1f252e136..56e53e6fd5a0 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -660,6 +660,8 @@ struct sched_dl_entity {
> > unsigned int dl_non_contending : 1;
> > unsigned int dl_overrun : 1;
> > unsigned int dl_server : 1;
> > + unsigned int dl_zerolax : 1;
> > + unsigned int dl_zerolax_armed : 1;
> >
> > /*
> > * Bandwidth enforcement timer. Each -deadline task has its
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 1d7b96ca9011..69ee1fbd60e4 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> > /* for non-boosted task, pi_of(dl_se) == dl_se */
> > dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> > dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > +
> > + /*
> > + * If it is a zerolax reservation, throttle it.
> > + */
> > + if (dl_se->dl_zerolax) {
> > + dl_se->dl_throttled = 1;
> > + dl_se->dl_zerolax_armed = 1;
> > + }
> > }
> >
> > /*
> > @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> > * could happen are, typically, a entity voluntarily trying to overcome its
> > * runtime, or it just underestimated it during sched_setattr().
> > */
> > +static int start_dl_timer(struct sched_dl_entity *dl_se);
> > static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > {
> > struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > dl_se->dl_yielded = 0;
> > if (dl_se->dl_throttled)
> > dl_se->dl_throttled = 0;
> > +
> > + /*
> > + * If this is the replenishment of a zerolax reservation,
> > + * clear the flag and return.
> > + */
> > + if (dl_se->dl_zerolax_armed) {
> > + dl_se->dl_zerolax_armed = 0;
> > + return;
> > + }
> > +
> > + /*
> > + * A this point, if the zerolax server is not armed, and the deadline
> > + * is in the future, throttle the server and arm the zerolax timer.
> > + */
> > + if (dl_se->dl_zerolax &&
> > + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> > + if (!is_dl_boosted(dl_se)) {
> > + dl_se->dl_zerolax_armed = 1;
> > + dl_se->dl_throttled = 1;
> > + start_dl_timer(dl_se);
> > + }
> > + }
> > }
> >
> > /*
> > @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
> > }
> >
> > replenish_dl_new_period(dl_se, rq);
> > + } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> > + /*
> > + * The server can still use its previous deadline, so throttle
> > + * and arm the zero-laxity timer.
> > + */
> > + dl_se->dl_zerolax_armed = 1;
> > + dl_se->dl_throttled = 1;
> > }
> > }
> >
> > @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> > * We want the timer to fire at the deadline, but considering
> > * that it is actually coming from rq->clock and not from
> > * hrtimer's time base reading.
> > + *
> > + * The zerolax reservation will have its timer set to the
> > + * deadline - runtime. At that point, the CBS rule will decide
> > + * if the current deadline can be used, or if a replenishment
> > + * is required to avoid add too much pressure on the system
> > + * (current u > U).
> > */
> > - act = ns_to_ktime(dl_next_period(dl_se));
> > + if (dl_se->dl_zerolax_armed) {
> > + WARN_ON_ONCE(!dl_se->dl_throttled);
> > + act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
>
> Just a question, here if dl_se->deadline - dl_se->runtime is large,
> then does that mean that server activation will be much more into the
> future? So say I want to give CFS 30%, then it will take 70% of the
> period before CFS preempts RT thus "starving" CFS for this duration. I
> think that's Ok for smaller periods and runtimes, though.
>
> I think it does reserve the amount of required CFS bandwidth so it is
> probably OK, though it is perhaps letting RT run more initially (say
> if CFS tasks are not CPU bound and occasionally wake up, they will
> always be hit by the 70% latency AFAICS which may be large for large
> periods and small runtimes).
>
One more consideration I guess is, because the server is throttled
till 0-laxity time, it is possible that if CFS sleeps even a bit
(after the DL-server is unthrottled), then it will be pushed out to a
full current deadline + period due to CBS. In such a situation, if
CFS-server is the only DL task running, it might starve RT for a bit
more time.
Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
at 0.295s (its remaining runtime is 0.01s at this point which is < the
"time till deadline" of 0.005s). Now the runtime of the CFS-server
will be replenished to the full 3s (due to CBS) and the deadline
pushed out. The end result is the total runtime that the CFS-server
actually gets is 0.0595s (though yes it did sleep for 5ms in between,
still that's tiny -- say if it briefly blocked on a kernel mutex).
On the other hand, if the CFS server started a bit earlier than the
0-laxity, it would probably not have had CBS pushing it out.
This is likely also not an issue for shorter runtime/period values,
still the throttling till later has a small trade-off (Not saying we
should not do this, this whole series is likely a huge improvement
over the current RT throttling).
There is a chance I am uttering nonsense as I am not a DL expert, so
apologies if so.
Thanks.
On Mon, Nov 6, 2023 at 4:32 PM Joel Fernandes <[email protected]> wrote:
>
> On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <[email protected]> wrote:
> >
> > Hi Daniel,
> >
> > On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> > <[email protected]> wrote:
> > >
> > > Among the motivations for the DL servers is the real-time throttling
> > > mechanism. This mechanism works by throttling the rt_rq after
> > > running for a long period without leaving space for fair tasks.
> > >
> > > The base dl server avoids this problem by boosting fair tasks instead
> > > of throttling the rt_rq. The point is that it boosts without waiting
> > > for potential starvation, causing some non-intuitive cases.
> > >
> > > For example, an IRQ dispatches two tasks on an idle system, a fair
> > > and an RT. The DL server will be activated, running the fair task
> > > before the RT one. This problem can be avoided by deferring the
> > > dl server activation.
> > >
> > > By setting the zerolax option, the dl_server will dispatch an
> > > SCHED_DEADLINE reservation with replenished runtime, but throttled.
> > >
> > > The dl_timer will be set for (period - runtime) ns from start time.
> > > Thus boosting the fair rq on its 0-laxity time with respect to
> > > rt_rq.
> > >
> > > If the fair scheduler has the opportunity to run while waiting
> > > for zerolax time, the dl server runtime will be consumed. If
> > > the runtime is completely consumed before the zerolax time, the
> > > server will be replenished while still in a throttled state. Then,
> > > the dl_timer will be reset to the new zerolax time
> > >
> > > If the fair server reaches the zerolax time without consuming
> > > its runtime, the server will be boosted, following CBS rules
> > > (thus without breaking SCHED_DEADLINE).
> > >
> > > Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> > > ---
> > > include/linux/sched.h | 2 +
> > > kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
> > > kernel/sched/fair.c | 3 ++
> > > 3 files changed, 103 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 5ac1f252e136..56e53e6fd5a0 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -660,6 +660,8 @@ struct sched_dl_entity {
> > > unsigned int dl_non_contending : 1;
> > > unsigned int dl_overrun : 1;
> > > unsigned int dl_server : 1;
> > > + unsigned int dl_zerolax : 1;
> > > + unsigned int dl_zerolax_armed : 1;
> > >
> > > /*
> > > * Bandwidth enforcement timer. Each -deadline task has its
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index 1d7b96ca9011..69ee1fbd60e4 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> > > /* for non-boosted task, pi_of(dl_se) == dl_se */
> > > dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> > > dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > > +
> > > + /*
> > > + * If it is a zerolax reservation, throttle it.
> > > + */
> > > + if (dl_se->dl_zerolax) {
> > > + dl_se->dl_throttled = 1;
> > > + dl_se->dl_zerolax_armed = 1;
> > > + }
> > > }
> > >
> > > /*
> > > @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> > > * could happen are, typically, a entity voluntarily trying to overcome its
> > > * runtime, or it just underestimated it during sched_setattr().
> > > */
> > > +static int start_dl_timer(struct sched_dl_entity *dl_se);
> > > static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > > {
> > > struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > > @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > > dl_se->dl_yielded = 0;
> > > if (dl_se->dl_throttled)
> > > dl_se->dl_throttled = 0;
> > > +
> > > + /*
> > > + * If this is the replenishment of a zerolax reservation,
> > > + * clear the flag and return.
> > > + */
> > > + if (dl_se->dl_zerolax_armed) {
> > > + dl_se->dl_zerolax_armed = 0;
> > > + return;
> > > + }
> > > +
> > > + /*
> > > + * A this point, if the zerolax server is not armed, and the deadline
> > > + * is in the future, throttle the server and arm the zerolax timer.
> > > + */
> > > + if (dl_se->dl_zerolax &&
> > > + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> > > + if (!is_dl_boosted(dl_se)) {
> > > + dl_se->dl_zerolax_armed = 1;
> > > + dl_se->dl_throttled = 1;
> > > + start_dl_timer(dl_se);
> > > + }
> > > + }
> > > }
> > >
> > > /*
> > > @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
> > > }
> > >
> > > replenish_dl_new_period(dl_se, rq);
> > > + } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> > > + /*
> > > + * The server can still use its previous deadline, so throttle
> > > + * and arm the zero-laxity timer.
> > > + */
> > > + dl_se->dl_zerolax_armed = 1;
> > > + dl_se->dl_throttled = 1;
> > > }
> > > }
> > >
> > > @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> > > * We want the timer to fire at the deadline, but considering
> > > * that it is actually coming from rq->clock and not from
> > > * hrtimer's time base reading.
> > > + *
> > > + * The zerolax reservation will have its timer set to the
> > > + * deadline - runtime. At that point, the CBS rule will decide
> > > + * if the current deadline can be used, or if a replenishment
> > > + * is required to avoid add too much pressure on the system
> > > + * (current u > U).
> > > */
> > > - act = ns_to_ktime(dl_next_period(dl_se));
> > > + if (dl_se->dl_zerolax_armed) {
> > > + WARN_ON_ONCE(!dl_se->dl_throttled);
> > > + act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> >
> > Just a question, here if dl_se->deadline - dl_se->runtime is large,
> > then does that mean that server activation will be much more into the
> > future? So say I want to give CFS 30%, then it will take 70% of the
> > period before CFS preempts RT thus "starving" CFS for this duration. I
> > think that's Ok for smaller periods and runtimes, though.
> >
> > I think it does reserve the amount of required CFS bandwidth so it is
> > probably OK, though it is perhaps letting RT run more initially (say
> > if CFS tasks are not CPU bound and occasionally wake up, they will
> > always be hit by the 70% latency AFAICS which may be large for large
> > periods and small runtimes).
> >
>
> One more consideration I guess is, because the server is throttled
> till 0-laxity time, it is possible that if CFS sleeps even a bit
> (after the DL-server is unthrottled), then it will be pushed out to a
> full current deadline + period due to CBS. In such a situation, if
> CFS-server is the only DL task running, it might starve RT for a bit
> more time.
>
> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
> at 0.295s (its remaining runtime is 0.01s at this point which is < the
> "time till deadline" of 0.005s). Now the runtime of the CFS-server
> will be replenished to the full 3s (due to CBS) and the deadline
> pushed out. The end result is the total runtime that the CFS-server
> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
> still that's tiny -- say if it briefly blocked on a kernel mutex).
Blah, I got lost in decimal points. Here's the example again:
Say CFS-server runtime is 0.3s and period is 1s.
At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
point which is < the "time till deadline" of 0.005s)
Now the runtime of the CFS-server will be replenished to the full 0.3s
(due to CBS) and the deadline
pushed out.
The end result is, the total runtime that the CFS-server actually gets
is 0.595s (though yes it did sleep for 5ms in between, still that's
tiny -- say if it briefly blocked on a kernel mutex). That's almost
double the allocated runtime.
This is just theoretical and I have yet to see if it is actually an
issue in practice.
Thanks.
On 11/6/23 20:32, Joel Fernandes wrote:
> Hi Daniel,
>
> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> <[email protected]> wrote:
>>
>> Among the motivations for the DL servers is the real-time throttling
>> mechanism. This mechanism works by throttling the rt_rq after
>> running for a long period without leaving space for fair tasks.
>>
>> The base dl server avoids this problem by boosting fair tasks instead
>> of throttling the rt_rq. The point is that it boosts without waiting
>> for potential starvation, causing some non-intuitive cases.
>>
>> For example, an IRQ dispatches two tasks on an idle system, a fair
>> and an RT. The DL server will be activated, running the fair task
>> before the RT one. This problem can be avoided by deferring the
>> dl server activation.
>>
>> By setting the zerolax option, the dl_server will dispatch an
>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>
>> The dl_timer will be set for (period - runtime) ns from start time.
>> Thus boosting the fair rq on its 0-laxity time with respect to
>> rt_rq.
>>
>> If the fair scheduler has the opportunity to run while waiting
>> for zerolax time, the dl server runtime will be consumed. If
>> the runtime is completely consumed before the zerolax time, the
>> server will be replenished while still in a throttled state. Then,
>> the dl_timer will be reset to the new zerolax time
>>
>> If the fair server reaches the zerolax time without consuming
>> its runtime, the server will be boosted, following CBS rules
>> (thus without breaking SCHED_DEADLINE).
>>
>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>> ---
>> include/linux/sched.h | 2 +
>> kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
>> kernel/sched/fair.c | 3 ++
>> 3 files changed, 103 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5ac1f252e136..56e53e6fd5a0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -660,6 +660,8 @@ struct sched_dl_entity {
>> unsigned int dl_non_contending : 1;
>> unsigned int dl_overrun : 1;
>> unsigned int dl_server : 1;
>> + unsigned int dl_zerolax : 1;
>> + unsigned int dl_zerolax_armed : 1;
>>
>> /*
>> * Bandwidth enforcement timer. Each -deadline task has its
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 1d7b96ca9011..69ee1fbd60e4 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
>> /* for non-boosted task, pi_of(dl_se) == dl_se */
>> dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>> dl_se->runtime = pi_of(dl_se)->dl_runtime;
>> +
>> + /*
>> + * If it is a zerolax reservation, throttle it.
>> + */
>> + if (dl_se->dl_zerolax) {
>> + dl_se->dl_throttled = 1;
>> + dl_se->dl_zerolax_armed = 1;
>> + }
>> }
>>
>> /*
>> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>> * could happen are, typically, a entity voluntarily trying to overcome its
>> * runtime, or it just underestimated it during sched_setattr().
>> */
>> +static int start_dl_timer(struct sched_dl_entity *dl_se);
>> static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>> {
>> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>> dl_se->dl_yielded = 0;
>> if (dl_se->dl_throttled)
>> dl_se->dl_throttled = 0;
>> +
>> + /*
>> + * If this is the replenishment of a zerolax reservation,
>> + * clear the flag and return.
>> + */
>> + if (dl_se->dl_zerolax_armed) {
>> + dl_se->dl_zerolax_armed = 0;
>> + return;
>> + }
>> +
>> + /*
>> + * A this point, if the zerolax server is not armed, and the deadline
>> + * is in the future, throttle the server and arm the zerolax timer.
>> + */
>> + if (dl_se->dl_zerolax &&
>> + dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
>> + if (!is_dl_boosted(dl_se)) {
>> + dl_se->dl_zerolax_armed = 1;
>> + dl_se->dl_throttled = 1;
>> + start_dl_timer(dl_se);
>> + }
>> + }
>> }
>>
>> /*
>> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>> }
>>
>> replenish_dl_new_period(dl_se, rq);
>> + } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
>> + /*
>> + * The server can still use its previous deadline, so throttle
>> + * and arm the zero-laxity timer.
>> + */
>> + dl_se->dl_zerolax_armed = 1;
>> + dl_se->dl_throttled = 1;
>> }
>> }
>>
>> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>> * We want the timer to fire at the deadline, but considering
>> * that it is actually coming from rq->clock and not from
>> * hrtimer's time base reading.
>> + *
>> + * The zerolax reservation will have its timer set to the
>> + * deadline - runtime. At that point, the CBS rule will decide
>> + * if the current deadline can be used, or if a replenishment
>> + * is required to avoid add too much pressure on the system
>> + * (current u > U).
>> */
>> - act = ns_to_ktime(dl_next_period(dl_se));
>> + if (dl_se->dl_zerolax_armed) {
>> + WARN_ON_ONCE(!dl_se->dl_throttled);
>> + act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
>
> Just a question, here if dl_se->deadline - dl_se->runtime is large,
> then does that mean that server activation will be much more into the
> future? So say I want to give CFS 30%, then it will take 70% of the
> period before CFS preempts RT thus "starving" CFS for this duration. I
> think that's Ok for smaller periods and runtimes, though.
I think you are answering yourself here :-)
If the default values are not good, change them o/
The current interface allows you to have more responsive/small chuck of CPU
or less responsive/large chucks of CPU... you can even place RT bellow CFS
for a "bounded amount of time" by disabling the defer option... per CPU.
All at once with different periods patterns on CPUs to increase the
changes of having a cfs rq ready on another CPU... like...
[3/10 - 2/6 - 1.5/5 - 1/3 no defer] in a 4 cpus system :-).
The default setup is based on the throttling to avoid changing
the historical behavior for those that... are happy with them.
-- Daniel
> - Joel
On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
> On 11/6/23 16:40, Peter Zijlstra wrote:
> > On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> >> Add an interface for fair server setup on debugfs.
> >>
> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> >>
> >> - fair_server_runtime: set runtime in ns
> >> - fair_server_period: set period in ns
> >> - fair_server_defer: on/off for the defer mechanism
> >>
> >
> > This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> > total available bandwidth control, right?
>
> right, but thinking aloud... given that the per-cpu files are already allocating the
> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
>
> Still, we can have them there as a safeguard to not overloading the deadline
> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
> from it, that global limitation will make less sense - still better to have a form
> of limitation so people are aware of bandwidth until there.
Yeah, so having a limit on the deadline thing seems prudent as a way to
model system overhead. I mean 100% sounds nice, but then all the models
also assume no interrupts, no scheduler or migration overhead etc.. So
setting a slightly lower max seems far more realistic to me.
That said, the period/bandwidth thing is now slightly odd, as we really
only care about the utilization. But whatever. One thing at a time.
> > But then shouldn've we also rip out the throttle thingy right quick?
> >
>
> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?
Yes. Until we can delete all that code we'll have to keep some of that.
> With the next step on moving the dl server as the base for the
> hierarchical scheduling... That will rip out the
> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
>
> Does it make sense?
I'm still not sure how to deal with affinities and deadline servers for
RT.
There's a bunch of issues and I thing we've only got some of them solved.
The semi-partitioned thing (someone was working on that, I think you
know the guy), solves DL 'entities' having affinities.
But the problem of FIFO is that they don't have inherent bandwidth. This
in turn means that any server for FIFO needs to be minimally concurrent,
otherwise you hand out bandwidth to lower priority tasks that the higher
priority task might want etc.. (Andersson's group has papers here).
Specifically, imagine a server with U=1.5 and 3 tasks, a high prio task
that requires .8 a medium prio task that requires .6 and a low prio task
that soaks up whatever it can get its little grubby paws on.
Then with minimal concurreny this works out nicely, high gets .8, mid
gets .6 and low gets the remaining .1.
If OTOH you don't limit concurrency and let them all run concurrently,
you can end up with the situation where they each get .5. Which is
obviously fail.
Add affinities here though and you're up a creek, how do you distribute
utilization between the slices, what slices, etc.. You say given them a
per-cpu cgroup interface, and have them configure it themselves, but
that's a god-aweful thing to ask userspace to do.
Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
and abysmal failure of a model -- thank you POSIX :-(
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 63ba8422f876e32ee564ea95da9a7313b13ff0a1
Gitweb: https://git.kernel.org/tip/63ba8422f876e32ee564ea95da9a7313b13ff0a1
Author: Peter Zijlstra <[email protected]>
AuthorDate: Sat, 04 Nov 2023 11:59:21 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 15 Nov 2023 09:57:51 +01:00
sched/deadline: Introduce deadline servers
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).
Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
---
include/linux/sched.h | 22 ++-
kernel/sched/core.c | 17 ++-
kernel/sched/deadline.c | 332 ++++++++++++++++++++++++++-------------
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 27 +++-
5 files changed, 292 insertions(+), 108 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 44b46d9..8d25816 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -63,11 +63,13 @@ struct robust_list_head;
struct root_domain;
struct rq;
struct sched_attr;
+struct sched_dl_entity;
struct seq_file;
struct sighand_struct;
struct signal_struct;
struct task_delay_info;
struct task_group;
+struct task_struct;
struct user_event_mm;
/*
@@ -607,6 +609,9 @@ struct sched_rt_entity {
#endif
} __randomize_layout;
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
struct sched_dl_entity {
struct rb_node rb_node;
@@ -654,6 +659,7 @@ struct sched_dl_entity {
unsigned int dl_yielded : 1;
unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1;
+ unsigned int dl_server : 1;
/*
* Bandwidth enforcement timer. Each -deadline task has its
@@ -668,7 +674,20 @@ struct sched_dl_entity {
* timer is needed to decrease the active utilization at the correct
* time.
*/
- struct hrtimer inactive_timer;
+ struct hrtimer inactive_timer;
+
+ /*
+ * Bits for DL-server functionality. Also see the comment near
+ * dl_server_update().
+ *
+ * @rq the runqueue this server is for
+ *
+ * @server_has_tasks() returns true if @server_pick return a
+ * runnable task.
+ */
+ struct rq *rq;
+ dl_server_has_tasks_f server_has_tasks;
+ dl_server_pick_f server_pick;
#ifdef CONFIG_RT_MUTEXES
/*
@@ -795,6 +814,7 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
+ struct sched_dl_entity *dl_server;
const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 966631f..f5f4495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3797,6 +3797,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
rq->idle_stamp = 0;
}
#endif
+
+ p->dl_server = NULL;
}
/*
@@ -6003,12 +6005,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_next_task_idle(rq);
}
+ /*
+ * This is the fast path; it cannot be a DL server pick;
+ * therefore even if @p == @prev, ->dl_server must be NULL.
+ */
+ if (p->dl_server)
+ p->dl_server = NULL;
+
return p;
}
restart:
put_prev_task_balance(rq, prev, rf);
+ /*
+ * We've updated @prev and no longer need the server link, clear it.
+ * Must be done before ->pick_next_task() because that can (re)set
+ * ->dl_server.
+ */
+ if (prev->dl_server)
+ prev->dl_server = NULL;
+
for_each_class(class) {
p = class->pick_next_task(rq);
if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 81810f6..a04a436 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,8 +54,14 @@ static int __init sched_dl_sysctl_init(void)
late_initcall(sched_dl_sysctl_init);
#endif
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+ return dl_se->dl_server;
+}
+
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
+ BUG_ON(dl_server(dl_se));
return container_of(dl_se, struct task_struct, dl);
}
@@ -64,12 +70,19 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
return container_of(dl_rq, struct rq, dl);
}
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = task_rq(p);
+ struct rq *rq = dl_se->rq;
+
+ if (!dl_server(dl_se))
+ rq = task_rq(dl_task_of(dl_se));
- return &rq->dl;
+ return rq;
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+ return &rq_of_dl_se(dl_se)->dl;
}
static inline int on_dl_rq(struct sched_dl_entity *dl_se)
@@ -394,9 +407,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
static void task_non_contending(struct sched_dl_entity *dl_se)
{
struct hrtimer *timer = &dl_se->inactive_timer;
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq);
- struct task_struct *p = dl_task_of(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+ struct dl_rq *dl_rq = &rq->dl;
s64 zerolag_time;
/*
@@ -426,25 +438,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
* utilization now, instead of starting a timer
*/
if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
- if (dl_task(p))
+ if (dl_server(dl_se)) {
sub_running_bw(dl_se, dl_rq);
+ } else {
+ struct task_struct *p = dl_task_of(dl_se);
- if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ if (dl_task(p))
+ sub_running_bw(dl_se, dl_rq);
- if (READ_ONCE(p->__state) == TASK_DEAD)
- sub_rq_bw(dl_se, &rq->dl);
- raw_spin_lock(&dl_b->lock);
- __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
- raw_spin_unlock(&dl_b->lock);
- __dl_clear_params(dl_se);
+ if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
+ struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+ if (READ_ONCE(p->__state) == TASK_DEAD)
+ sub_rq_bw(dl_se, &rq->dl);
+ raw_spin_lock(&dl_b->lock);
+ __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+ raw_spin_unlock(&dl_b->lock);
+ __dl_clear_params(dl_se);
+ }
}
return;
}
dl_se->dl_non_contending = 1;
- get_task_struct(p);
+ if (!dl_server(dl_se))
+ get_task_struct(dl_task_of(dl_se));
+
hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
}
@@ -471,8 +491,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
* will not touch the rq's active utilization,
* so we are still safe.
*/
- if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
- put_task_struct(dl_task_of(dl_se));
+ if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+ if (!dl_server(dl_se))
+ put_task_struct(dl_task_of(dl_se));
+ }
} else {
/*
* Since "dl_non_contending" is not set, the
@@ -485,10 +507,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
}
}
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- struct sched_dl_entity *dl_se = &p->dl;
-
return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
}
@@ -740,8 +760,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
}
#endif /* CONFIG_SMP */
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags);
static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
@@ -989,8 +1011,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
*/
static void update_dl_entity(struct sched_dl_entity *dl_se)
{
- struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq);
+ struct rq *rq = rq_of_dl_se(dl_se);
if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
dl_entity_overflow(dl_se, rq_clock(rq))) {
@@ -1021,11 +1042,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
* actually started or not (i.e., the replenishment instant is in
* the future or in the past).
*/
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
{
- struct sched_dl_entity *dl_se = &p->dl;
struct hrtimer *timer = &dl_se->dl_timer;
- struct rq *rq = task_rq(p);
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
ktime_t now, act;
s64 delta;
@@ -1059,13 +1080,33 @@ static int start_dl_timer(struct task_struct *p)
* and observe our state.
*/
if (!hrtimer_is_queued(timer)) {
- get_task_struct(p);
+ if (!dl_server(dl_se))
+ get_task_struct(dl_task_of(dl_se));
hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
}
return 1;
}
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+ /*
+ * Queueing this task back might have overloaded rq, check if we need
+ * to kick someone away.
+ */
+ if (has_pushable_dl_tasks(rq)) {
+ /*
+ * Nothing relies on rq->lock after this, so its safe to drop
+ * rq->lock.
+ */
+ rq_unpin_lock(rq, rf);
+ push_dl_task(rq);
+ rq_repin_lock(rq, rf);
+ }
+#endif
+}
+
/*
* This is the bandwidth enforcement timer callback. If here, we know
* a task is not on its dl_rq, since the fact that the timer was running
@@ -1084,10 +1125,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
struct sched_dl_entity *dl_se = container_of(timer,
struct sched_dl_entity,
dl_timer);
- struct task_struct *p = dl_task_of(dl_se);
+ struct task_struct *p;
struct rq_flags rf;
struct rq *rq;
+ if (dl_server(dl_se)) {
+ struct rq *rq = rq_of_dl_se(dl_se);
+ struct rq_flags rf;
+
+ rq_lock(rq, &rf);
+ if (dl_se->dl_throttled) {
+ sched_clock_tick();
+ update_rq_clock(rq);
+
+ if (dl_se->server_has_tasks(dl_se)) {
+ enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+ resched_curr(rq);
+ __push_dl_task(rq, &rf);
+ } else {
+ replenish_dl_entity(dl_se);
+ }
+
+ }
+ rq_unlock(rq, &rf);
+
+ return HRTIMER_NORESTART;
+ }
+
+ p = dl_task_of(dl_se);
rq = task_rq_lock(p, &rf);
/*
@@ -1158,21 +1223,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
else
resched_curr(rq);
-#ifdef CONFIG_SMP
- /*
- * Queueing this task back might have overloaded rq, check if we need
- * to kick someone away.
- */
- if (has_pushable_dl_tasks(rq)) {
- /*
- * Nothing relies on rq->lock after this, so its safe to drop
- * rq->lock.
- */
- rq_unpin_lock(rq, &rf);
- push_dl_task(rq);
- rq_repin_lock(rq, &rf);
- }
-#endif
+ __push_dl_task(rq, &rf);
unlock:
task_rq_unlock(rq, p, &rf);
@@ -1214,12 +1265,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
*/
static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
{
- struct task_struct *p = dl_task_of(dl_se);
- struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+ struct rq *rq = rq_of_dl_se(dl_se);
if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
- if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
+ if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
return;
dl_se->dl_throttled = 1;
if (dl_se->runtime > 0)
@@ -1270,29 +1320,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
return (delta * u_act) >> BW_SHIFT;
}
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+ int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
{
- struct task_struct *curr = rq->curr;
- struct sched_dl_entity *dl_se = &curr->dl;
- s64 delta_exec, scaled_delta_exec;
- int cpu = cpu_of(rq);
-
- if (!dl_task(curr) || !on_dl_rq(dl_se))
- return;
+ s64 scaled_delta_exec;
- /*
- * Consumed budget is computed considering the time as
- * observed by schedulable tasks (excluding time spent
- * in hardirq context, etc.). Deadlines are instead
- * computed using hard walltime. This seems to be the more
- * natural solution, but the full ramifications of this
- * approach need further study.
- */
- delta_exec = update_curr_common(rq);
if (unlikely(delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
@@ -1310,10 +1344,9 @@ static void update_curr_dl(struct rq *rq)
* according to current frequency and CPU maximum capacity.
*/
if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
- scaled_delta_exec = grub_reclaim(delta_exec,
- rq,
- &curr->dl);
+ scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
} else {
+ int cpu = cpu_of(rq);
unsigned long scale_freq = arch_scale_freq_capacity(cpu);
unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
@@ -1332,11 +1365,20 @@ throttle:
(dl_se->flags & SCHED_FLAG_DL_OVERRUN))
dl_se->dl_overrun = 1;
- __dequeue_task_dl(rq, curr, 0);
- if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
- enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+ dequeue_dl_entity(dl_se, 0);
+ if (!dl_server(dl_se)) {
+ update_stats_dequeue_dl(&rq->dl, dl_se, 0);
+ dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+ }
- if (!is_leftmost(curr, &rq->dl))
+ if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+ if (dl_server(dl_se))
+ enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+ else
+ enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+ }
+
+ if (!is_leftmost(dl_se, &rq->dl))
resched_curr(rq);
}
@@ -1366,20 +1408,82 @@ throttle:
}
}
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+ update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+ if (!dl_server(dl_se)) {
+ dl_se->dl_server = 1;
+ setup_new_dl_entity(dl_se);
+ }
+ enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+ dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+ dl_server_has_tasks_f has_tasks,
+ dl_server_pick_f pick)
+{
+ dl_se->rq = rq;
+ dl_se->server_has_tasks = has_tasks;
+ dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct sched_dl_entity *dl_se = &curr->dl;
+ s64 delta_exec;
+
+ if (!dl_task(curr) || !on_dl_rq(dl_se))
+ return;
+
+ /*
+ * Consumed budget is computed considering the time as
+ * observed by schedulable tasks (excluding time spent
+ * in hardirq context, etc.). Deadlines are instead
+ * computed using hard walltime. This seems to be the more
+ * natural solution, but the full ramifications of this
+ * approach need further study.
+ */
+ delta_exec = update_curr_common(rq);
+ update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
{
struct sched_dl_entity *dl_se = container_of(timer,
struct sched_dl_entity,
inactive_timer);
- struct task_struct *p = dl_task_of(dl_se);
+ struct task_struct *p = NULL;
struct rq_flags rf;
struct rq *rq;
- rq = task_rq_lock(p, &rf);
+ if (!dl_server(dl_se)) {
+ p = dl_task_of(dl_se);
+ rq = task_rq_lock(p, &rf);
+ } else {
+ rq = dl_se->rq;
+ rq_lock(rq, &rf);
+ }
sched_clock_tick();
update_rq_clock(rq);
+ if (dl_server(dl_se))
+ goto no_task;
+
if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
@@ -1396,14 +1500,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
goto unlock;
}
+
+no_task:
if (dl_se->dl_non_contending == 0)
goto unlock;
sub_running_bw(dl_se, &rq->dl);
dl_se->dl_non_contending = 0;
unlock:
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
+
+ if (!dl_server(dl_se)) {
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ } else {
+ rq_unlock(rq, &rf);
+ }
return HRTIMER_NORESTART;
}
@@ -1466,10 +1577,8 @@ static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
static inline
void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- int prio = dl_task_of(dl_se)->prio;
u64 deadline = dl_se->deadline;
- WARN_ON(!dl_prio(prio));
dl_rq->dl_nr_running++;
add_nr_running(rq_of_dl_rq(dl_rq), 1);
@@ -1479,9 +1588,6 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
static inline
void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
- int prio = dl_task_of(dl_se)->prio;
-
- WARN_ON(!dl_prio(prio));
WARN_ON(!dl_rq->dl_nr_running);
dl_rq->dl_nr_running--;
sub_nr_running(rq_of_dl_rq(dl_rq), 1);
@@ -1648,8 +1754,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
} else if (flags & ENQUEUE_REPLENISH) {
replenish_dl_entity(dl_se);
} else if ((flags & ENQUEUE_RESTORE) &&
- dl_time_before(dl_se->deadline,
- rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+ dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
setup_new_dl_entity(dl_se);
}
@@ -1730,19 +1835,13 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
enqueue_dl_entity(&p->dl, flags);
+ if (dl_server(&p->dl))
+ return;
+
if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
- update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
- dequeue_dl_entity(&p->dl, flags);
-
- if (!p->dl.dl_throttled)
- dequeue_pushable_dl_task(rq, p);
-}
-
static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
@@ -1750,7 +1849,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (p->on_rq == TASK_ON_RQ_MIGRATING)
flags |= DEQUEUE_MIGRATING;
- __dequeue_task_dl(rq, p, flags);
+ dequeue_dl_entity(&p->dl, flags);
+ if (!p->dl.dl_throttled && !dl_server(&p->dl))
+ dequeue_pushable_dl_task(rq, p);
}
/*
@@ -1940,12 +2041,12 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
}
#ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
{
- hrtick_start(rq, p->dl.runtime);
+ hrtick_start(rq, dl_se->runtime);
}
#else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
{
}
#endif
@@ -1965,9 +2066,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first)
return;
- if (hrtick_enabled_dl(rq))
- start_hrtick_dl(rq, p);
-
if (rq->curr->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
@@ -1990,12 +2088,25 @@ static struct task_struct *pick_task_dl(struct rq *rq)
struct dl_rq *dl_rq = &rq->dl;
struct task_struct *p;
+again:
if (!sched_dl_runnable(rq))
return NULL;
dl_se = pick_next_dl_entity(dl_rq);
WARN_ON_ONCE(!dl_se);
- p = dl_task_of(dl_se);
+
+ if (dl_server(dl_se)) {
+ p = dl_se->server_pick(dl_se);
+ if (!p) {
+ WARN_ON_ONCE(1);
+ dl_se->dl_yielded = 1;
+ update_curr_dl_se(rq, dl_se, 0);
+ goto again;
+ }
+ p->dl_server = dl_se;
+ } else {
+ p = dl_task_of(dl_se);
+ }
return p;
}
@@ -2005,9 +2116,15 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
struct task_struct *p;
p = pick_task_dl(rq);
- if (p)
+ if (!p)
+ return p;
+
+ if (!p->dl_server)
set_next_task_dl(rq, p, true);
+ if (hrtick_enabled(rq))
+ start_hrtick_dl(rq, &p->dl);
+
return p;
}
@@ -2045,8 +2162,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
* be set and schedule() will start a new hrtick for the next task.
*/
if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
- is_leftmost(p, &rq->dl))
- start_hrtick_dl(rq, p);
+ is_leftmost(&p->dl, &rq->dl))
+ start_hrtick_dl(rq, &p->dl);
}
static void task_fork_dl(struct task_struct *p)
@@ -2986,6 +3103,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
dl_se->dl_yielded = 0;
dl_se->dl_non_contending = 0;
dl_se->dl_overrun = 0;
+ dl_se->dl_server = 0;
#ifdef CONFIG_RT_MUTEXES
dl_se->pi_se = dl_se;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cd92b1..07f5558 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1131,6 +1131,8 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
trace_sched_stat_runtime(p, delta_exec);
account_group_exec_runtime(p, delta_exec);
cgroup_account_cputime(p, delta_exec);
+ if (p->dl_server)
+ dl_server_update(p->dl_server, delta_exec);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1cda787..8a70d51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_bw_check_overflow(int cpu);
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ * dl_se::rq -- runqueue we belong to.
+ *
+ * dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ * server when it runs out of tasks to run.
+ *
+ * dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ * returns NULL.
+ *
+ * dl_server_update() -- called from update_curr_common(), propagates runtime
+ * to the server.
+ *
+ * dl_server_start()
+ * dl_server_stop() -- start/stop the server when it has (no) tasks.
+ *
+ * dl_server_init() -- initializes the server.
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+ dl_server_has_tasks_f has_tasks,
+ dl_server_pick_f pick);
+
#ifdef CONFIG_CGROUP_SCHED
struct cfs_rq;
On Sat, Nov 04, 2023 at 11:59:17AM +0100, Daniel Bristot de Oliveira wrote:
> This is v5 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
>
> SCHED_DEADLINE servers can help fixing starvation issues of low priority
> tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
> cycles. Today we have RT Throttling; DEADLINE servers should be able to
> replace and improve that.
Hello!
Just wanted to provide some ChromeOS data on these patches. There is
great improvement when using DL-sever along with RT for foreground Chrome's
display, audio and main threads. Once they are in the background, we set them
back to CFS (except audio). I think these patches are ready to move forward
as the data looks good to me. I see Peter picked up some of them already
which is nice.
One of the key metrics for us is event latency. We have a test that measures
various latency metrics with typing happening on a Google docs in one window
and a 16-person Google meet call happening on the other. This is a very
complex test but gets us close to what the user experiences (as is typical -
meeting attendees in a Google meet call take notes in a Google doc). As a
result, getting stable numbers requires a lot of care which is why I used
P-value to measure the statistical significance of the results. The P-value
for some metrics show lower significance, so we can ignore those but I still
provided it in the table.
The test is run on a Chromebook with 4 cores (Intel(R) Celeron(R) N4100 CPU @
1.10GHz) and 16GB of RAM. No Hyperthreading.
All units are microseconds. The average is calculated as the average of 20
runs with and without "Chrome using RT + DL-server". The 5% every 1 second
default does not work for us, so I changed the DL server parameters to 5ms
every 30ms. This allows CFS to run more often.
This test runs for 6 hours. Total test time for both before and after is 12 hours:
---------------------------------------------------------------------------------------------------------
| MetricName | Average Before | Average After | Change % | P-value |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.Core.TotalLatency | 90.19 | 78.22 | 13.27% | 0.03 |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.KeyReleased.TotalLatency | 90733.76 | 78602.72 | 13.37% | 0.03 |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.TotalLatency | 90.19 | 78.22 | 13.27% | 0.03 |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.KeyPressed.TotalLatency | 68269.21 | 63310.99 | 7.26% | 0.00 |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.MousePressed.TotalLatency | 192080.44 | 179264.31 | 6.67% | 0.26 |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.TotalLatency | 68795.99 | 63860.04 | 7.17% | 0.00 |
---------------------------------------------------------------------------------------------------------
| EventLatency.GestureScrollUpdt.Wheel.TotalLat | 63420.88 | 59394.18 | 6.35% | 0.02 |
---------------------------------------------------------------------------------------------------------
| EventLatency.KeyPressed.TotalLatency | 68269.21 | 63310.99 | 7.26% | 0.00 |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseDragged.TotalLatency | 106393.09 | 104152.50 | 2.11% | 0.57 |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseMoved.TotalLatency | 129225.65 | 113268.48 | 12.35% | 0.01 |
---------------------------------------------------------------------------------------------------------
| EventLatency.MousePressed.TotalLatency | 192080.44 | 179264.31 | 6.67% | 0.26 |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseReleased.TotalLatency | 152366.33 | 140309.50 | 7.91% | 0.44 |
---------------------------------------------------------------------------------------------------------
| EventLatency.TotalLatency | 68795.99 | 63862.45 | 7.17% | 0.00 |
---------------------------------------------------------------------------------------------------------
| EventLatency.TotalLatency_ash-Chrome | 68795.99 | 63862.45 | 7.17% | 0.00 |
---------------------------------------------------------------------------------------------------------
I also did another test where I measure the CFS maximum latency (using perf
sched) while a YouTube video is playing, and the CFS max latency looks great
too. In fact, with the vanilla RT throttling, our CFS tasks are doing really
badly (perhaps because of depending on RT tasks due to locks or such). So we
definitely need the DL-server to use RT properly!
We are testing dlserver with 5ms/50ms and 5ms/100ms as well to see the
impact. But at the moment, 5ms/30ms is looking good.
Thanks for all of your work, here's to better Linux and better Chromebooks ;)
- Joel
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Hi Daniel, Peter,
I am writing on behalf of the ChromeOS scheduler team.
We had to revert the last 3 patches in this series because of a syzkaller
reported bug, this happens on the sched/more branch in Peter's tree:
WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
place_entity+0x240/0x290 kernel/sched/fair.c:5147
Call Trace:
<TASK>
enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
activate_task+0x60/0xc0 kernel/sched/core.c:2147
ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
ttwu_queue kernel/sched/core.c:4047 [inline]
try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
__queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
queue_delayed_work include/linux/workqueue.h:577 [inline]
srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
which is basically this warning in place_entity:
if (WARN_ON_ONCE(!load))
load = 1;
Full log (scroll to the bottom as there is console/lockdep side effects which
are likely not relevant to this issue): https://paste.debian.net/1304579/
Side note, we are also looking into a KASAN nullptr deref but this happens
only on our backport of the patches to a 5.15 kernel, as far as we know.
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline]
RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
[...]
Call Trace:
<TASK>
set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
set_next_task kernel/sched/sched.h:2241 [inline]
pick_next_task kernel/sched/core.c:6014 [inline]
__schedule+0x36fb/0x402d kernel/sched/core.c:6378
preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
Full splat: https://paste.debian.net/1304573/
Investigation is on going but could you also please take a look at these? It
is hard to reproduce and only syzkaller's syzbot has luck so for reproducing
these.
Also I had a comment below:
> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> + u64 new_bw = to_ratio(period, runtime);
> + struct rq *rq = dl_se->rq;
> + int cpu = cpu_of(rq);
> + struct dl_bw *dl_b;
> + unsigned long cap;
> + int retval = 0;
> + int cpus;
> +
> + dl_b = dl_bw_of(cpu);
> + raw_spin_lock(&dl_b->lock);
> + cpus = dl_bw_cpus(cpu);
> + cap = dl_bw_capacity(cpu);
> +
> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
The dl_overflow() call here seems introducing an issue with our conceptual
understanding of how the dl server is supposed to work.
Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
Suppose the DL server params are 50ms runtime in 100ms period (basically we
want to dedicate 50% of the bandwidth of each CPU to CFS).
In such a situation, __dl_overflow() will return an error right? Because
total bandwidth will exceed 100% (4 times 50% is 200%).
Further, this complicates the setting of the parameters since it means we
have to check the number of CPUs in advance and then set the parameters to
prevent dl_overflow(). As an example of this, 30% (runtime / period) for each
CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work
because __dl_overflow() will fail.
How do you suggest we remedy this? Can we make the dlserver calculate how
much bandwidth is allowed on a per-CPU basis? My understanding is the fake
dl_se are all pinned to their respective CPUs, so we don't have the same
requirement as real DL tasks which may migrate freely within the root domain.
thanks,
- Joel
On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Hi Daniel, Peter,
I am writing on behalf of the ChromeOS scheduler team.
We had to revert the last 3 patches in this series because of a syzkaller
reported bug, this happens on the sched/more branch in Peter's tree:
WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
place_entity+0x240/0x290 kernel/sched/fair.c:5147
Call Trace:
<TASK>
enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
activate_task+0x60/0xc0 kernel/sched/core.c:2147
ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
ttwu_queue kernel/sched/core.c:4047 [inline]
try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
__queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
queue_delayed_work include/linux/workqueue.h:577 [inline]
srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
which is basically this warning in place_entity:
if (WARN_ON_ONCE(!load))
load = 1;
Full log (scroll to the bottom as there is console/lockdep side effects which
are likely not relevant to this issue): https://paste.debian.net/1304579/
Side note, we are also looking into a KASAN nullptr deref but this happens
only on our backport of the patches to a 5.15 kernel, as far as we know.
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline]
RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
[...]
Call Trace:
<TASK>
set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
set_next_task kernel/sched/sched.h:2241 [inline]
pick_next_task kernel/sched/core.c:6014 [inline]
__schedule+0x36fb/0x402d kernel/sched/core.c:6378
preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
Full splat: https://paste.debian.net/1304573/
Investigation is on going but could you also please take a look at these? It
is hard to reproduce and only syzbot has luck reproducing these.
Also I had a comment below:
> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> + u64 new_bw = to_ratio(period, runtime);
> + struct rq *rq = dl_se->rq;
> + int cpu = cpu_of(rq);
> + struct dl_bw *dl_b;
> + unsigned long cap;
> + int retval = 0;
> + int cpus;
> +
> + dl_b = dl_bw_of(cpu);
> + raw_spin_lock(&dl_b->lock);
> + cpus = dl_bw_cpus(cpu);
> + cap = dl_bw_capacity(cpu);
> +
> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
The dl_overflow() call here seems introducing an issue with our conceptual
understanding of how the dl server is supposed to work.
Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
Suppose the DL server params are 50ms runtime in 100ms period (basically we
want to dedicate 50% of the bandwidth of each CPU to CFS).
In such a situation, __dl_overflow() will return an error right? Because
total bandwidth will exceed 100% (4 times 50% is 200%).
Further, this complicates the setting of the parameters since it means we
have to check the number of CPUs in advance and then set the parameters to
prevent dl_overflow(). As an example of this, 30% (runtime / period) for each
CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work
because __dl_overflow() will fail.
How do you suggest we remedy this? Can we make the dlserver calculate how
much bandwidth is allowed on a per-CPU basis? My understanding is the fake
dl_se are all pinned to their respective CPUs, so we don't have the same
requirement as real DL tasks which may migrate freely within the root domain.
thanks,
- Joel
On 1/19/24 02:55, Joel Fernandes wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>> - fair_server_runtime: set runtime in ns
>> - fair_server_period: set period in ns
>> - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>
> Hi Daniel, Peter,
> I am writing on behalf of the ChromeOS scheduler team.
>
> We had to revert the last 3 patches in this series because of a syzkaller
> reported bug, this happens on the sched/more branch in Peter's tree:
>
> WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
> place_entity+0x240/0x290 kernel/sched/fair.c:5147
> Call Trace:
> <TASK>
> enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
> enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
> enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
> activate_task+0x60/0xc0 kernel/sched/core.c:2147
> ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
> ttwu_queue kernel/sched/core.c:4047 [inline]
> try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
> kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
> __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
> queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
> queue_delayed_work include/linux/workqueue.h:577 [inline]
> srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
>
> which is basically this warning in place_entity:
> if (WARN_ON_ONCE(!load))
> load = 1;
>
> Full log (scroll to the bottom as there is console/lockdep side effects which
> are likely not relevant to this issue): https://paste.debian.net/1304579/
>
> Side note, we are also looking into a KASAN nullptr deref but this happens
> only on our backport of the patches to a 5.15 kernel, as far as we know.
>
> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
> RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline]
> RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
> [...]
> Call Trace:
> <TASK>
> set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
> set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
> set_next_task kernel/sched/sched.h:2241 [inline]
> pick_next_task kernel/sched/core.c:6014 [inline]
> __schedule+0x36fb/0x402d kernel/sched/core.c:6378
> preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
> preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
>
> Full splat: https://paste.debian.net/1304573/
Interesting, does it keep any task hung? I am having a case where I see
a hung task, but I do not get the splat because the system freezes (printk
with rq_lock I guess)...
It might be the same problem.
> Investigation is on going but could you also please take a look at these? It
> is hard to reproduce and only syzbot has luck reproducing these.
>
> Also I had a comment below:
>
>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>> +{
>> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>> + u64 new_bw = to_ratio(period, runtime);
>> + struct rq *rq = dl_se->rq;
>> + int cpu = cpu_of(rq);
>> + struct dl_bw *dl_b;
>> + unsigned long cap;
>> + int retval = 0;
>> + int cpus;
>> +
>> + dl_b = dl_bw_of(cpu);
>> + raw_spin_lock(&dl_b->lock);
>> + cpus = dl_bw_cpus(cpu);
>> + cap = dl_bw_capacity(cpu);
>> +
>> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
>
> The dl_overflow() call here seems introducing an issue with our conceptual
> understanding of how the dl server is supposed to work.
>
> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
> Suppose the DL server params are 50ms runtime in 100ms period (basically we
> want to dedicate 50% of the bandwidth of each CPU to CFS).
>
> In such a situation, __dl_overflow() will return an error right? Because
> total bandwidth will exceed 100% (4 times 50% is 200%).
I might be missing something in your case, but, it accepts:
root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000
your system accepts 400%... the percentage is "global".
is it failing in your system?
-- Daniel
Hi Daniel,
On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> On 1/19/24 02:55, Joel Fernandes wrote:
>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>> - fair_server_runtime: set runtime in ns
>>> - fair_server_period: set period in ns
>>> - fair_server_defer: on/off for the defer mechanism
>>>
>>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>>
>> Hi Daniel, Peter,
>> I am writing on behalf of the ChromeOS scheduler team.
>>
>> We had to revert the last 3 patches in this series because of a syzkaller
>> reported bug, this happens on the sched/more branch in Peter's tree:
>>
>> WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>> place_entity+0x240/0x290 kernel/sched/fair.c:5147
>> Call Trace:
>> <TASK>
>> enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>> enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>> enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>> activate_task+0x60/0xc0 kernel/sched/core.c:2147
>> ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>> ttwu_queue kernel/sched/core.c:4047 [inline]
>> try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>> kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>> __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>> queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>> queue_delayed_work include/linux/workqueue.h:577 [inline]
>> srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
>>
>> which is basically this warning in place_entity:
>> if (WARN_ON_ONCE(!load))
>> load = 1;
>>
>> Full log (scroll to the bottom as there is console/lockdep side effects which
>> are likely not relevant to this issue): https://paste.debian.net/1304579/
>>
>> Side note, we are also looking into a KASAN nullptr deref but this happens
>> only on our backport of the patches to a 5.15 kernel, as far as we know.
>>
>> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>> RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline]
>> RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>> [...]
>> Call Trace:
>> <TASK>
>> set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>> set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>> set_next_task kernel/sched/sched.h:2241 [inline]
>> pick_next_task kernel/sched/core.c:6014 [inline]
>> __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>> preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>> preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
>>
>> Full splat: https://paste.debian.net/1304573/
>
> Interesting, does it keep any task hung? I am having a case where I see
> a hung task, but I do not get the splat because the system freezes (printk
> with rq_lock I guess)...
>
> It might be the same problem.
Ah, we have an update. Suleiman found this is happening because of core
scheduling's pick logic. I have some patches to fix it, there's also more fixes
we have on other issues. Will coordinate with the team to send these out soon.
We are currently testing them more.
>> Investigation is on going but could you also please take a look at these? It
>> is hard to reproduce and only syzbot has luck reproducing these.
>>
>> Also I had a comment below:
>>
>>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>>> +{
>>> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>>> + u64 new_bw = to_ratio(period, runtime);
>>> + struct rq *rq = dl_se->rq;
>>> + int cpu = cpu_of(rq);
>>> + struct dl_bw *dl_b;
>>> + unsigned long cap;
>>> + int retval = 0;
>>> + int cpus;
>>> +
>>> + dl_b = dl_bw_of(cpu);
>>> + raw_spin_lock(&dl_b->lock);
>>> + cpus = dl_bw_cpus(cpu);
>>> + cap = dl_bw_capacity(cpu);
>>> +
>>> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
>>
>> The dl_overflow() call here seems introducing an issue with our conceptual
>> understanding of how the dl server is supposed to work.
>>
>> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
>> Suppose the DL server params are 50ms runtime in 100ms period (basically we
>> want to dedicate 50% of the bandwidth of each CPU to CFS).
>>
>> In such a situation, __dl_overflow() will return an error right? Because
>> total bandwidth will exceed 100% (4 times 50% is 200%).
>
> I might be missing something in your case, but, it accepts:
>
> root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
>
> your system accepts 400%... the percentage is "global".
>
> is it failing in your system?
You are right, I was actually trying to change it manually in my kernel in
dl_server_start(). In this case dlserver_apply_server_params() gets init=1 and
old_bw is 0.
I tried using the debugfs, and that works. So I think we will just use the
debugfs. I was being lazy and setting it in my kernel manually for testing like
this:
@@ -1475,7 +1475,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
* this before getting generic.
*/
if (!dl_server(dl_se)) {
u64 runtime = 12 * NSEC_PER_MSEC;
u64 period = 15 * NSEC_PER_MSEC;
That doesn't work but I tried debugfs and it works. But for production, we will
set it from userspace so it should not be an issue.
I feel so much better now :) Thanks Daniel.
By the way, what's the plan on remaining patches in sched/more branch, are you
planning to resend those later? If so, we can just post our fixes on top of
that, and if you don't mind you could include it in your next series posting
(sched/more + our fixes + your fixes).
Thanks!
- Joel
On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> Interesting, does it keep any task hung? I am having a case where I see
> a hung task, but I do not get the splat because the system freezes (printk
> with rq_lock I guess)...
I missed replying to this part of your email. We see it as a syzkaller report
with splats, so it is not clear if a task was hung at the time of a splat.
syzkaller runs in its own separate VM instance.
The fun part is though, I found a way to do ftrace dump on it without having to
pass any boot parameters (that's another issue that we can't pass it boot
parameters). So we have that tool in our debug arsenal for these issues, thankfully.
> It might be the same problem.
True, possibly.
thanks,
- Joel
On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
>
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>
> - fair_server_runtime: set runtime in ns
> - fair_server_period: set period in ns
> - fair_server_defer: on/off for the defer mechanism
Btw Daniel, there is an interesting side-effect of this interface having runtime
and period in 2 separate files :)
Say I want to set a CPU to 5ms / 10ms.
I cannot set either period or runtime to 5ms or 10ms directly.
I have to first set period to 100ms, then set runtime to 50ms, then set period
to 50ms, then set runtime to 5ms, then finally set period to 10ms.
The reason seems to be because otherwise runtime / period will not be
accomodated and will cause dl_overflow issues.
I'd suggest providing both runtime and period in the same interface to make it
more easier to use. However, for the testing I am going with what we have.
Also a request:
I was wondering if a new version of the last 3 patches could be posted to
LKML or shared in a tree somewhere. I am trying to sync to mainline and
rebase our latest fixes on top of that, however it is difficult to do because
these 3 patches are in bit of a flux (example the discussion between you and
Peter about update_curr()). What's the best way to move forward with rebasing
our fix contributions? I am going with the sched/more in Peter's queue.git
unless you/Peter prefer something else. And I added your update_curr()
suggestion onto that, let me know if you disagree with it:
@@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
if (entity_is_task(curr))
update_curr_task(task_of(curr), delta_exec);
+ else
+ dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
account_cfs_rq_runtime(cfs_rq, delta_exec);
}
thanks,
- Joel
>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> kernel/sched/deadline.c | 89 +++++++++++++++---
> kernel/sched/debug.c | 202 ++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 6 --
> kernel/sched/sched.h | 2 +
> 4 files changed, 279 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 69ee1fbd60e4..1092ca8892e0 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> __sub_running_bw(dl_se->dl_bw, dl_rq);
> }
>
> -static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
> {
> - struct rq *rq;
> -
> - WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
> -
> - if (task_on_rq_queued(p))
> - return;
> + if (dl_se->dl_non_contending) {
> + sub_running_bw(dl_se, &rq->dl);
> + dl_se->dl_non_contending = 0;
>
> - rq = task_rq(p);
> - if (p->dl.dl_non_contending) {
> - sub_running_bw(&p->dl, &rq->dl);
> - p->dl.dl_non_contending = 0;
> /*
> * If the timer handler is currently running and the
> * timer cannot be canceled, inactive_task_timer()
> @@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> * will not touch the rq's active utilization,
> * so we are still safe.
> */
> - if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> - put_task_struct(p);
> + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
> + if (!dl_server(dl_se))
> + put_task_struct(dl_task_of(dl_se));
> + }
> }
> - __sub_rq_bw(p->dl.dl_bw, &rq->dl);
> + __sub_rq_bw(dl_se->dl_bw, &rq->dl);
> __add_rq_bw(new_bw, &rq->dl);
> }
>
> +static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +{
> + WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
> +
> + if (task_on_rq_queued(p))
> + return;
> +
> + dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
> +}
> +
> static void __dl_clear_params(struct sched_dl_entity *dl_se);
>
> /*
> @@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
>
> void dl_server_start(struct sched_dl_entity *dl_se)
> {
> + /*
> + * XXX: the apply do not work fine at the init phase for the
> + * fair server because things are not yet set. We need to improve
> + * this before getting generic.
> + */
> if (!dl_server(dl_se)) {
> + u64 runtime = 50 * NSEC_PER_MSEC;
> + u64 period = 1000 * NSEC_PER_MSEC;
> +
> + dl_server_apply_params(dl_se, runtime, period, 1);
> +
> + dl_se->dl_zerolax = 1;
> dl_se->dl_server = 1;
> setup_new_dl_entity(dl_se);
> }
> +
> enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
> }
>
> @@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> dl_se->server_pick = pick;
> }
>
> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> + u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> + u64 new_bw = to_ratio(period, runtime);
> + struct rq *rq = dl_se->rq;
> + int cpu = cpu_of(rq);
> + struct dl_bw *dl_b;
> + unsigned long cap;
> + int retval = 0;
> + int cpus;
> +
> + dl_b = dl_bw_of(cpu);
> + raw_spin_lock(&dl_b->lock);
> + cpus = dl_bw_cpus(cpu);
> + cap = dl_bw_capacity(cpu);
> +
> + if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
> + retval = -EBUSY;
> + goto out;
> + }
> +
> + if (init) {
> + __add_rq_bw(new_bw, &rq->dl);
> + __dl_add(dl_b, new_bw, cpus);
> + } else {
> + __dl_sub(dl_b, dl_se->dl_bw, cpus);
> + __dl_add(dl_b, new_bw, cpus);
> +
> + dl_rq_change_utilization(rq, dl_se, new_bw);
> + }
> +
> + rq->fair_server.dl_runtime = runtime;
> + rq->fair_server.dl_deadline = period;
> + rq->fair_server.dl_period = period;
> +
> + dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> + dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
> +
> +out:
> + raw_spin_unlock(&dl_b->lock);
> +
> + return retval;
> +}
> +
> /*
> * Update the current task's runtime statistics (provided it is still
> * a -deadline task and has not been removed from the dl_rq).
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 4580a450700e..bd7ad6b8d3de 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = {
> .release = seq_release,
> };
>
> +enum dl_param {
> + DL_RUNTIME = 0,
> + DL_PERIOD,
> + DL_ZEROLAX
> +};
> +
> +static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
> +static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC; /* 100 us */
> +
> +static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos, enum dl_param param)
> +{
> + long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> + u64 runtime, period, zerolax;
> + struct rq *rq = cpu_rq(cpu);
> + size_t err;
> + int retval;
> + u64 value;
> +
> + err = kstrtoull_from_user(ubuf, cnt, 10, &value);
> + if (err)
> + return err;
> +
> + scoped_guard (rq_lock_irqsave, rq) {
> +
> + runtime = rq->fair_server.dl_runtime;
> + period = rq->fair_server.dl_period;
> + zerolax = rq->fair_server.dl_zerolax;
> +
> + switch (param) {
> + case DL_RUNTIME:
> + if (runtime == value)
> + goto out;
> + runtime = value;
> + break;
> + case DL_PERIOD:
> + if (value == period)
> + goto out;
> + period = value;
> + break;
> + case DL_ZEROLAX:
> + if (zerolax == value)
> + goto out;
> + zerolax = value;
> + break;
> + }
> +
> + if (runtime > period
> + || period > fair_server_period_max
> + || period < fair_server_period_min
> + || zerolax > 1) {
> + cnt = -EINVAL;
> + goto out;
> + }
> +
> + if (rq->cfs.h_nr_running) {
> + update_rq_clock(rq);
> + dl_server_stop(&rq->fair_server);
> + }
> +
> + /*
> + * The zerolax does not change utilization, so just
> + * setting it is enough.
> + */
> + if (rq->fair_server.dl_zerolax != zerolax) {
> + rq->fair_server.dl_zerolax = zerolax;
> + } else {
> + retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> + if (retval)
> + cnt = retval;
> + }
> +
> + if (rq->cfs.h_nr_running)
> + dl_server_start(&rq->fair_server);
> + }
> +
> +out:
> + *ppos += cnt;
> + return cnt;
> +}
> +
> +static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
> +{
> + unsigned long cpu = (unsigned long) m->private;
> + struct rq *rq = cpu_rq(cpu);
> + u64 value;
> +
> + switch (param) {
> + case DL_RUNTIME:
> + value = rq->fair_server.dl_runtime;
> + break;
> + case DL_PERIOD:
> + value = rq->fair_server.dl_period;
> + break;
> + case DL_ZEROLAX:
> + value = rq->fair_server.dl_zerolax;
> + }
> +
> + seq_printf(m, "%llu\n", value);
> + return 0;
> +
> +}
> +
> +static ssize_t
> +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
> +}
> +
> +static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
> +{
> + return sched_fair_server_show(m, v, DL_RUNTIME);
> +}
> +
> +static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
> +{
> + return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_runtime_fops = {
> + .open = sched_fair_server_runtime_open,
> + .write = sched_fair_server_runtime_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> +static ssize_t
> +sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
> +}
> +
> +static int sched_fair_server_period_show(struct seq_file *m, void *v)
> +{
> + return sched_fair_server_show(m, v, DL_PERIOD);
> +}
> +
> +static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
> +{
> + return single_open(filp, sched_fair_server_period_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_period_fops = {
> + .open = sched_fair_server_period_open,
> + .write = sched_fair_server_period_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> +static ssize_t
> +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX);
> +}
> +
> +static int sched_fair_server_defer_show(struct seq_file *m, void *v)
> +{
> + return sched_fair_server_show(m, v, DL_ZEROLAX);
> +}
> +
> +static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
> +{
> + return single_open(filp, sched_fair_server_defer_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_defer_fops = {
> + .open = sched_fair_server_defer_open,
> + .write = sched_fair_server_defer_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> static struct dentry *debugfs_sched;
>
> +static void debugfs_fair_server_init(void)
> +{
> + long cpu;
> + struct dentry *rq_dentry;
> +
> + rq_dentry = debugfs_create_dir("rq", debugfs_sched);
> + if (!rq_dentry)
> + return;
> +
> + for_each_possible_cpu(cpu) {
> + struct dentry *d_cpu;
> + char buf[32];
> +
> + snprintf(buf, sizeof(buf), "cpu%ld", cpu);
> + d_cpu = debugfs_create_dir(buf, rq_dentry);
> +
> + debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
> + debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
> + debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
> + }
> +}
> +
> static __init int sched_init_debug(void)
> {
> struct dentry __maybe_unused *numa;
> @@ -374,6 +574,8 @@ static __init int sched_init_debug(void)
>
> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>
> + debugfs_fair_server_init();
> +
> return 0;
> }
> late_initcall(sched_init_debug);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 399237cd9f59..5434c52f470d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq)
> struct sched_dl_entity *dl_se = &rq->fair_server;
>
> init_dl_entity(dl_se);
> -
> - dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
> - dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
> - dl_se->dl_period = 1000 * NSEC_PER_MSEC;
> - dl_se->dl_zerolax = 1;
> -
> dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec0e288c8e06..312b31df5860 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> dl_server_pick_f pick);
>
> extern void fair_server_init(struct rq *);
> +extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
> + u64 runtime, u64 period, bool init);
>
> #ifdef CONFIG_CGROUP_SCHED
>
On 2/12/2024 9:13 PM, Joel Fernandes wrote:
>
>
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>> - fair_server_runtime: set runtime in ns
>> - fair_server_period: set period in ns
>> - fair_server_defer: on/off for the defer mechanism
>
> Btw Daniel, there is an interesting side-effect of this interface having runtime
> and period in 2 separate files :)
>
> Say I want to set a CPU to 5ms / 10ms.
>
Sorry let me try again, say I want to set 500us / 1ms.
Then I have to do the following to get it to work:
# echo 100000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 5000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 10000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 1000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
IOW, if I boot and do the following, it fails:
# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
bash: echo: write error: Device or resource busy
- Joel
On 2/13/24 03:13, Joel Fernandes wrote:
>
>
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>> - fair_server_runtime: set runtime in ns
>> - fair_server_period: set period in ns
>> - fair_server_defer: on/off for the defer mechanism
>
> Btw Daniel, there is an interesting side-effect of this interface having runtime
> and period in 2 separate files :)
>
> Say I want to set a CPU to 5ms / 10ms.
>
> I cannot set either period or runtime to 5ms or 10ms directly.
>
> I have to first set period to 100ms, then set runtime to 50ms, then set period
> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
Hummm yeah I could reproduce that, it seems that it is not even a problem of having
two files, but a bug in the logic, I will have a look.
> The reason seems to be because otherwise runtime / period will not be
> accomodated and will cause dl_overflow issues.
>
> I'd suggest providing both runtime and period in the same interface to make it
> more easier to use. However, for the testing I am going with what we have.
>
> Also a request:
>
> I was wondering if a new version of the last 3 patches could be posted to
> LKML or shared in a tree somewhere. I am trying to sync to mainline and
> rebase our latest fixes on top of that, however it is difficult to do because
> these 3 patches are in bit of a flux (example the discussion between you and
> Peter about update_curr()). What's the best way to move forward with rebasing
> our fix contributions?
Juri and I chat about, and we think it is a good thing to re-send this patch set,
including a fix I have to it (to avoid regression wrt rt throttling), explaining
these things in the mailing list so peter will be able to follow the discussion.
I still need to finish testing, and to make a proper cover page with all updates, the
latest thing is here (tm):
https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
It is based on peter's sched/more. I will probably re-send it today or tomorrow,
but at least you can have a look at it.
Another reason to send it is to get the regression test machinery running....
I am going with the sched/more in Peter's queue.git
> unless you/Peter prefer something else. And I added your update_curr()
> suggestion onto that, let me know if you disagree with it:
>
> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>
> if (entity_is_task(curr))
> update_curr_task(task_of(curr), delta_exec);
> + else
> + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>
> account_cfs_rq_runtime(cfs_rq, delta_exec);
> }
That part of the code was optimized by peter during the last round of discussions.
It is like this now:
------------ %< -----------
- if (entity_is_task(curr))
- update_curr_task(task_of(curr), delta_exec);
+ if (entity_is_task(curr)) {
+ struct task_struct *p = task_of(curr);
+ update_curr_task(p, delta_exec);
+ /*
+ * Any fair task that runs outside of fair_server should
+ * account against fair_server such that it can account for
+ * this time and possibly avoid running this period.
+ */
+ if (p->dl_server != &rq->fair_server)
+ dl_server_update(&rq->fair_server, delta_exec);
+ }
------------ >% -----------
It is not straightforward to understand... but the ideia is:
if it is a task, and the server is ! of the fair server, discount time
directly from the fair server. This also means that if dl_server is NULL
(the server is not enabled) it will discount time from the fair server.
-- Daniel
> thanks,
>
> - Joel
Hello, Daniel,
On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
> On 2/13/24 03:13, Joel Fernandes wrote:
>>
>>
>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>> - fair_server_runtime: set runtime in ns
>>> - fair_server_period: set period in ns
>>> - fair_server_defer: on/off for the defer mechanism
>>
>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>> and period in 2 separate files :)
>>
>> Say I want to set a CPU to 5ms / 10ms.
>>
>> I cannot set either period or runtime to 5ms or 10ms directly.
>>
>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>
> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
> two files, but a bug in the logic, I will have a look.
Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
not set 45ms/50ms for instance.
Also just want to mention, if you could please CC my colleagues Suleiman and
Youssef on the patches who are also working on / reviewing these:
Suleiman Souhlal <[email protected]>
Youssef Esmat <[email protected]>
>> The reason seems to be because otherwise runtime / period will not be
>> accomodated and will cause dl_overflow issues.
>>
>> I'd suggest providing both runtime and period in the same interface to make it
>> more easier to use. However, for the testing I am going with what we have.
>>
>> Also a request:
>>
>> I was wondering if a new version of the last 3 patches could be posted to
>> LKML or shared in a tree somewhere. I am trying to sync to mainline and
>> rebase our latest fixes on top of that, however it is difficult to do because
>> these 3 patches are in bit of a flux (example the discussion between you and
>> Peter about update_curr()). What's the best way to move forward with rebasing
>> our fix contributions?
>
> Juri and I chat about, and we think it is a good thing to re-send this patch set,
> including a fix I have to it (to avoid regression wrt rt throttling), explaining
> these things in the mailing list so peter will be able to follow the discussion.
>
> I still need to finish testing, and to make a proper cover page with all updates, the
> latest thing is here (tm):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
>
> It is based on peter's sched/more. I will probably re-send it today or tomorrow,
> but at least you can have a look at it.
>> Another reason to send it is to get the regression test machinery running....
Sure, looking forward to it. I rebased on above tree and it applied cleanly.
What I'll do is I will send our patches today (not those in sched/more) after a
bit more testing and tweaks.
There are 2 reasons for this:
1. Can get the build robot do its thing.
2. Our internal system checks whether patches backported were posted upstream to
list.
Hope that sounds good to you and we can start reviewing as well.
> I am going with the sched/more in Peter's queue.git
>> unless you/Peter prefer something else. And I added your update_curr()
>> suggestion onto that, let me know if you disagree with it:
>>
>> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>
>> if (entity_is_task(curr))
>> update_curr_task(task_of(curr), delta_exec);
>> + else
>> + dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>>
>> account_cfs_rq_runtime(cfs_rq, delta_exec);
>> }
>
> That part of the code was optimized by peter during the last round of discussions.
>
> It is like this now:
>
> ------------ %< -----------
> - if (entity_is_task(curr))
> - update_curr_task(task_of(curr), delta_exec);
> + if (entity_is_task(curr)) {
> + struct task_struct *p = task_of(curr);
> + update_curr_task(p, delta_exec);
> + /*
> + * Any fair task that runs outside of fair_server should
> + * account against fair_server such that it can account for
> + * this time and possibly avoid running this period.
> + */
> + if (p->dl_server != &rq->fair_server)
> + dl_server_update(&rq->fair_server, delta_exec);
> + }
> ------------ >% -----------
>
> It is not straightforward to understand... but the ideia is:
>
> if it is a task, and the server is ! of the fair server, discount time
> directly from the fair server. This also means that if dl_server is NULL
> (the server is not enabled) it will discount time from the fair server.
Yes, that makes sense. We certainly want to debit from the server even when DL
is not picking the task indirectly. I guess Peter's optimization also handles
the case where multiple servers are in play. That will help us when/if we make
RT as a server as well, right?
thanks,
- Joel
On 2/15/24 14:57, Joel Fernandes wrote:
> Hello, Daniel,
>
> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
>> On 2/13/24 03:13, Joel Fernandes wrote:
>>>
>>>
>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>> Add an interface for fair server setup on debugfs.
>>>>
>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>
>>>> - fair_server_runtime: set runtime in ns
>>>> - fair_server_period: set period in ns
>>>> - fair_server_defer: on/off for the defer mechanism
>>>
>>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>>> and period in 2 separate files :)
>>>
>>> Say I want to set a CPU to 5ms / 10ms.
>>>
>>> I cannot set either period or runtime to 5ms or 10ms directly.
>>>
>>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>>
>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
>> two files, but a bug in the logic, I will have a look.
>
> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
> not set 45ms/50ms for instance.
I isolated the problem. It is not an interface problem.
Long story short, the servers are initialized at the defrootdomain, but
the dl_bw info is not being carried over to the new domain because the
servers are not a task.
I am discussing this with Valentin (topology) & Juri. We will try to find a
solution, or at least an presentable XXX: solution... in the next days.
You can work around it by disabling the admission control via:
# sysctl kernel.sched_rt_runtime_us=-1
the the values will be accepted. For the best of my knowledge, you guys are
only using SCHED_RR/FIFO so the admission control for DL is not an issue.
>> I still need to finish testing, and to make a proper cover page with all updates, the
>> latest thing is here (tm):
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
>>
>> It is based on peter's sched/more. I will probably re-send it today or tomorrow,
>> but at least you can have a look at it.
>>> Another reason to send it is to get the regression test machinery running....
>
> Sure, looking forward to it. I rebased on above tree and it applied cleanly.
> What I'll do is I will send our patches today (not those in sched/more) after a
> bit more testing and tweaks.
>
> There are 2 reasons for this:
> 1. Can get the build robot do its thing.
> 2. Our internal system checks whether patches backported were posted upstream to
> list.
>
> Hope that sounds good to you and we can start reviewing as well.
If it helps downstream for you guys, it is not a problem for me. Still, peter is
the person that has more comments to give so...
-- Daniel
On 2/15/2024 12:27 PM, Daniel Bristot de Oliveira wrote:
> On 2/15/24 14:57, Joel Fernandes wrote:
>> Hello, Daniel,
>>
>> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
>>> On 2/13/24 03:13, Joel Fernandes wrote:
>>>>
>>>>
>>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>>> Add an interface for fair server setup on debugfs.
>>>>>
>>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>>
>>>>> - fair_server_runtime: set runtime in ns
>>>>> - fair_server_period: set period in ns
>>>>> - fair_server_defer: on/off for the defer mechanism
>>>>
>>>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>>>> and period in 2 separate files :)
>>>>
>>>> Say I want to set a CPU to 5ms / 10ms.
>>>>
>>>> I cannot set either period or runtime to 5ms or 10ms directly.
>>>>
>>>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>>>
>>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
>>> two files, but a bug in the logic, I will have a look.
>>
>> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
>> not set 45ms/50ms for instance.
>
> I isolated the problem. It is not an interface problem.
>
> Long story short, the servers are initialized at the defrootdomain, but
> the dl_bw info is not being carried over to the new domain because the
> servers are not a task.
Nice work on nailing the issue.
> I am discussing this with Valentin (topology) & Juri. We will try to find a
> solution, or at least an presentable XXX: solution... in the next days.
>
> You can work around it by disabling the admission control via:
>
> # sysctl kernel.sched_rt_runtime_us=-1
>
> the the values will be accepted. For the best of my knowledge, you guys are
> only using SCHED_RR/FIFO so the admission control for DL is not an issue.
That's right, we only use deadline for the server. However, on some devices,
schedutil is used and AFAIR its kthread uses SCHED_DEADLINE. I don't anticipate
problems related to admission control and that kthread, so I think your proposed
workaround sounds good to me.
thanks,
- Joel
Hi, Daniel,
Thanks a lot for your great patchset!
We have a similar starvation issue in mm subsystem too. Details are in
the patch description of the below commit. In short, task A is busy
looping on some event, while task B will signal the event after some
work. If the priority of task A is higher than that of task B, task B
may be starved.
IIUC, if task A is RT task while task B is fair task, then your patchset
will solve the issue. If both task A and task B is RT tasks, is there
some way to solve the issue?
Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
resolve the issue, is there something better?
Best Regards,
Huang, Ying
--------------------------8<---------------------------------------
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
Author: Guo Ziliang <[email protected]>
Date: Wed Mar 16 16:15:03 2022 -0700
mm: swap: get rid of livelock in swapin readahead
In our testing, a livelock task was found. Through sysrq printing, same
stack was found every time, as follows:
__swap_duplicate+0x58/0x1a0
swapcache_prepare+0x24/0x30
__read_swap_cache_async+0xac/0x220
read_swap_cache_async+0x58/0xa0
swapin_readahead+0x24c/0x628
do_swap_page+0x374/0x8a0
__handle_mm_fault+0x598/0xd60
handle_mm_fault+0x114/0x200
do_page_fault+0x148/0x4d0
do_translation_fault+0xb0/0xd4
do_mem_abort+0x50/0xb0
The reason for the livelock is that swapcache_prepare() always returns
EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
cannot jump out of the loop. We suspect that the task that clears the
SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
priority of the task stuck in a livelock so that the task that clears
the SWAP_HAS_CACHE flag will run. The results show that the system
returns to normal after the priority is lowered.
In our testing, multiple real-time tasks are bound to the same core, and
the task in the livelock is the highest priority task of the core, so
the livelocked task cannot be preempted.
Although cond_resched() is used by __read_swap_cache_async, it is an
empty function in the preemptive system and cannot achieve the purpose
of releasing the CPU. A high-priority task cannot release the CPU
unless preempted by a higher-priority task. But when this task is
already the highest priority task on this core, other tasks will not be
able to be scheduled. So we think we should replace cond_resched() with
schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
call set_current_state first to set the task state, so the task will be
removed from the running queue, so as to achieve the purpose of giving
up the CPU and prevent it from running in kernel mode for too long.
(akpm: ugly hack becomes uglier. But it fixes the issue in a
backportable-to-stable fashion while we hopefully work on something
better)
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Guo Ziliang <[email protected]>
Reported-by: Zeal Robot <[email protected]>
Reviewed-by: Ran Xiaokai <[email protected]>
Reviewed-by: Jiang Xuexin <[email protected]>
Reviewed-by: Yang Yang <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roger Quadros <[email protected]>
Cc: Ziliang Guo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* __read_swap_cache_async(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its page to swap cache.
*/
- cond_resched();
+ schedule_timeout_uninterruptible(1);
}
/*
Hi
On 2/19/24 08:33, Huang, Ying wrote:
> Hi, Daniel,
>
> Thanks a lot for your great patchset!
>
> We have a similar starvation issue in mm subsystem too. Details are in
> the patch description of the below commit. In short, task A is busy
> looping on some event, while task B will signal the event after some
> work. If the priority of task A is higher than that of task B, task B
> may be starved.
ok...
>
> IIUC, if task A is RT task while task B is fair task, then your patchset
> will solve the issue.
This patch set will not solve the issue. It will mitigate the effect of the
problem. Still the system will perform very poorly...
If both task A and task B is RT tasks, is there
> some way to solve the issue?
I would say reworking the swap algorithm, as it is not meant to be used when
real-time tasks are in place.
As an exercise, let's say that we add a server per priority on FIFO, with a default
50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
busy loop in vain.
Then one would say, let's lower the parameters, so the granularity of
the server would provide lower latencies. The same problem would still
exist, as it exists with sched fair....
So, the solution is not on schedule. Busy loop waiting is not right when you
have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
spin_locks that do busy waiting. spin_locks do not have this problem you
show because they disable preemption... but disabling preemption is not
the best solution either.
So, a first try of duct tape would using (local_?) locks like in
preempt rt to make things sleepable...
AFAICS, this was already discussed in the previous link, right?
>
> Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
> resolve the issue, is there something better?
I am not a swap/mm expert.. my guesses would be all on sleepable locking.
But I know there are many smart people on the mm side with better guesses...
It is just that the DL server or any type of starvation avoidance does not
seem to be a solution for your problem.
-- Daniel
> Best Regards,
> Huang, Ying
>
> --------------------------8<---------------------------------------
> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
> Author: Guo Ziliang <[email protected]>
> Date: Wed Mar 16 16:15:03 2022 -0700
>
> mm: swap: get rid of livelock in swapin readahead
>
> In our testing, a livelock task was found. Through sysrq printing, same
> stack was found every time, as follows:
>
> __swap_duplicate+0x58/0x1a0
> swapcache_prepare+0x24/0x30
> __read_swap_cache_async+0xac/0x220
> read_swap_cache_async+0x58/0xa0
> swapin_readahead+0x24c/0x628
> do_swap_page+0x374/0x8a0
> __handle_mm_fault+0x598/0xd60
> handle_mm_fault+0x114/0x200
> do_page_fault+0x148/0x4d0
> do_translation_fault+0xb0/0xd4
> do_mem_abort+0x50/0xb0
>
> The reason for the livelock is that swapcache_prepare() always returns
> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
> cannot jump out of the loop. We suspect that the task that clears the
> SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
> priority of the task stuck in a livelock so that the task that clears
> the SWAP_HAS_CACHE flag will run. The results show that the system
> returns to normal after the priority is lowered.
>
> In our testing, multiple real-time tasks are bound to the same core, and
> the task in the livelock is the highest priority task of the core, so
> the livelocked task cannot be preempted.
>
> Although cond_resched() is used by __read_swap_cache_async, it is an
> empty function in the preemptive system and cannot achieve the purpose
> of releasing the CPU. A high-priority task cannot release the CPU
> unless preempted by a higher-priority task. But when this task is
> already the highest priority task on this core, other tasks will not be
> able to be scheduled. So we think we should replace cond_resched() with
> schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
> call set_current_state first to set the task state, so the task will be
> removed from the running queue, so as to achieve the purpose of giving
> up the CPU and prevent it from running in kernel mode for too long.
>
> (akpm: ugly hack becomes uglier. But it fixes the issue in a
> backportable-to-stable fashion while we hopefully work on something
> better)
>
> Link: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: Guo Ziliang <[email protected]>
> Reported-by: Zeal Robot <[email protected]>
> Reviewed-by: Ran Xiaokai <[email protected]>
> Reviewed-by: Jiang Xuexin <[email protected]>
> Reviewed-by: Yang Yang <[email protected]>
> Acked-by: Hugh Dickins <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Roger Quadros <[email protected]>
> Cc: Ziliang Guo <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8d4104242100..ee67164531c0 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> * in swap_map, but not yet added its page to swap cache.
> */
> - cond_resched();
> + schedule_timeout_uninterruptible(1);
> }
>
> /*
Daniel Bristot de Oliveira <[email protected]> writes:
> Hi
>
> On 2/19/24 08:33, Huang, Ying wrote:
>> Hi, Daniel,
>>
>> Thanks a lot for your great patchset!
>>
>> We have a similar starvation issue in mm subsystem too. Details are in
>> the patch description of the below commit. In short, task A is busy
>> looping on some event, while task B will signal the event after some
>> work. If the priority of task A is higher than that of task B, task B
>> may be starved.
>
> ok...
>
>>
>> IIUC, if task A is RT task while task B is fair task, then your patchset
>> will solve the issue.
>
> This patch set will not solve the issue. It will mitigate the effect of the
> problem. Still the system will perform very poorly...
I don't think that it's common (or even reasonable) for real-time tasks
to use swap. So, IMHO, performance isn't very important here. But, we
need to avoid live-lock anyway. I think that your patchset solves the
live-lock issue.
>> If both task A and task B is RT tasks, is there
>> some way to solve the issue?
>
> I would say reworking the swap algorithm, as it is not meant to be used when
> real-time tasks are in place.
>
> As an exercise, let's say that we add a server per priority on FIFO, with a default
> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
> busy loop in vain.
If the target is only the live-lock avoidance, is it possible to run
lower priority runnable tasks for a short while if we run long enough in
the busy loop?
> Then one would say, let's lower the parameters, so the granularity of
> the server would provide lower latencies. The same problem would still
> exist, as it exists with sched fair....
>
> So, the solution is not on schedule. Busy loop waiting is not right when you
> have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
> spin_locks that do busy waiting. spin_locks do not have this problem you
> show because they disable preemption... but disabling preemption is not
> the best solution either.
>
> So, a first try of duct tape would using (local_?) locks like in
> preempt rt to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>>
>> Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue, is there something better?
>
> I am not a swap/mm expert.. my guesses would be all on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server or any type of starvation avoidance does not
> seem to be a solution for your problem.
Yes. To improve the performance, we need something else.
--
Best Regards,
Huang, Ying
> -- Daniel
>
>
>> Best Regards,
>> Huang, Ying
>>
>> --------------------------8<---------------------------------------
>> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
>> Author: Guo Ziliang <[email protected]>
>> Date: Wed Mar 16 16:15:03 2022 -0700
>>
>> mm: swap: get rid of livelock in swapin readahead
>>
>> In our testing, a livelock task was found. Through sysrq printing, same
>> stack was found every time, as follows:
>>
>> __swap_duplicate+0x58/0x1a0
>> swapcache_prepare+0x24/0x30
>> __read_swap_cache_async+0xac/0x220
>> read_swap_cache_async+0x58/0xa0
>> swapin_readahead+0x24c/0x628
>> do_swap_page+0x374/0x8a0
>> __handle_mm_fault+0x598/0xd60
>> handle_mm_fault+0x114/0x200
>> do_page_fault+0x148/0x4d0
>> do_translation_fault+0xb0/0xd4
>> do_mem_abort+0x50/0xb0
>>
>> The reason for the livelock is that swapcache_prepare() always returns
>> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>> cannot jump out of the loop. We suspect that the task that clears the
>> SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
>> priority of the task stuck in a livelock so that the task that clears
>> the SWAP_HAS_CACHE flag will run. The results show that the system
>> returns to normal after the priority is lowered.
>>
>> In our testing, multiple real-time tasks are bound to the same core, and
>> the task in the livelock is the highest priority task of the core, so
>> the livelocked task cannot be preempted.
>>
>> Although cond_resched() is used by __read_swap_cache_async, it is an
>> empty function in the preemptive system and cannot achieve the purpose
>> of releasing the CPU. A high-priority task cannot release the CPU
>> unless preempted by a higher-priority task. But when this task is
>> already the highest priority task on this core, other tasks will not be
>> able to be scheduled. So we think we should replace cond_resched() with
>> schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
>> call set_current_state first to set the task state, so the task will be
>> removed from the running queue, so as to achieve the purpose of giving
>> up the CPU and prevent it from running in kernel mode for too long.
>>
>> (akpm: ugly hack becomes uglier. But it fixes the issue in a
>> backportable-to-stable fashion while we hopefully work on something
>> better)
>>
>> Link: https://lkml.kernel.org/r/[email protected]
>> Signed-off-by: Guo Ziliang <[email protected]>
>> Reported-by: Zeal Robot <[email protected]>
>> Reviewed-by: Ran Xiaokai <[email protected]>
>> Reviewed-by: Jiang Xuexin <[email protected]>
>> Reviewed-by: Yang Yang <[email protected]>
>> Acked-by: Hugh Dickins <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Roger Quadros <[email protected]>
>> Cc: Ziliang Guo <[email protected]>
>> Cc: <[email protected]>
>> Signed-off-by: Andrew Morton <[email protected]>
>> Signed-off-by: Linus Torvalds <[email protected]>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 8d4104242100..ee67164531c0 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>> * in swap_map, but not yet added its page to swap cache.
>> */
>> - cond_resched();
>> + schedule_timeout_uninterruptible(1);
>> }
>>
>> /*
On 2/20/24 04:28, Huang, Ying wrote:
> Daniel Bristot de Oliveira <[email protected]> writes:
>
>> Hi
>>
>> On 2/19/24 08:33, Huang, Ying wrote:
>>> Hi, Daniel,
>>>
>>> Thanks a lot for your great patchset!
>>>
>>> We have a similar starvation issue in mm subsystem too. Details are in
>>> the patch description of the below commit. In short, task A is busy
>>> looping on some event, while task B will signal the event after some
>>> work. If the priority of task A is higher than that of task B, task B
>>> may be starved.
>>
>> ok...
>>
>>>
>>> IIUC, if task A is RT task while task B is fair task, then your patchset
>>> will solve the issue.
>>
>> This patch set will not solve the issue. It will mitigate the effect of the
>> problem. Still the system will perform very poorly...
>
> I don't think that it's common (or even reasonable) for real-time tasks
> to use swap. So, IMHO, performance isn't very important here. But, we
> need to avoid live-lock anyway. I think that your patchset solves the
> live-lock issue.
I mean, if for you this is solving your user problem, be happy :-) Play with parameters...
find a way to tune your system as a user... use it :)
But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE (the
default since... two years ago, I think). So there is not much news here.
IMHO, it is not a solution. As a developer, there is a synchronization problem
in swap code, and pushing a workaround to the scheduling side is not the way to go...
>
>>> If both task A and task B is RT tasks, is there
>>> some way to solve the issue?
>>
>> I would say reworking the swap algorithm, as it is not meant to be used when
>> real-time tasks are in place.
>>
>> As an exercise, let's say that we add a server per priority on FIFO, with a default
>> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
>> busy loop in vain.
>
> If the target is only the live-lock avoidance, is it possible to run
> lower priority runnable tasks for a short while if we run long enough in
> the busy loop?
If you do it in the algorithm side (instead of relying on scheduling), it could be a
thing.
I think NAPI still uses something like this: Busy-loop for two jiffies in the softirq
context (a priority higher than all threads on the !rt kernel), then move to thread
the thread context to avoid starvation. In the swap case, it could run for two jiffies
and then go to sleep for a while. How well will swap people receive this as a solution...
I do not know :) I would first try something better than this using synchronization
primitives.
This patch set is for things outside of kernel control. For example, people running
poll mode DPDK in user-space with FIFO priority; FIFO tasks in user-space for too long...
with a better design than rt throttling.
Will this patch help in misbehaving kernel activities: yes. Is it a reason not to
fix kernel problems? I do not think so, and I bet many other people do not believe as
well.
-- Daniel
Daniel Bristot de Oliveira <[email protected]> writes:
> On 2/20/24 04:28, Huang, Ying wrote:
>> Daniel Bristot de Oliveira <[email protected]> writes:
>>
>>> Hi
>>>
>>> On 2/19/24 08:33, Huang, Ying wrote:
>>>> Hi, Daniel,
>>>>
>>>> Thanks a lot for your great patchset!
>>>>
>>>> We have a similar starvation issue in mm subsystem too. Details are in
>>>> the patch description of the below commit. In short, task A is busy
>>>> looping on some event, while task B will signal the event after some
>>>> work. If the priority of task A is higher than that of task B, task B
>>>> may be starved.
>>>
>>> ok...
>>>
>>>>
>>>> IIUC, if task A is RT task while task B is fair task, then your patchset
>>>> will solve the issue.
>>>
>>> This patch set will not solve the issue. It will mitigate the effect of the
>>> problem. Still the system will perform very poorly...
>>
>> I don't think that it's common (or even reasonable) for real-time tasks
>> to use swap. So, IMHO, performance isn't very important here. But, we
>> need to avoid live-lock anyway. I think that your patchset solves the
>> live-lock issue.
>
> I mean, if for you this is solving your user problem, be happy :-) Play with parameters...
> find a way to tune your system as a user... use it :)
>
> But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE (the
> default since... two years ago, I think). So there is not much news here.
>
> IMHO, it is not a solution. As a developer, there is a synchronization problem
> in swap code, and pushing a workaround to the scheduling side is not the way to go...
>
>>
>>>> If both task A and task B is RT tasks, is there
>>>> some way to solve the issue?
>>>
>>> I would say reworking the swap algorithm, as it is not meant to be used when
>>> real-time tasks are in place.
>>>
>>> As an exercise, let's say that we add a server per priority on FIFO, with a default
>>> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
>>> busy loop in vain.
>>
>> If the target is only the live-lock avoidance, is it possible to run
>> lower priority runnable tasks for a short while if we run long enough in
>> the busy loop?
>
> If you do it in the algorithm side (instead of relying on scheduling), it could be a
> thing.
>
> I think NAPI still uses something like this: Busy-loop for two jiffies in the softirq
> context (a priority higher than all threads on the !rt kernel), then move to thread
> the thread context to avoid starvation. In the swap case, it could run for two jiffies
> and then go to sleep for a while. How well will swap people receive this as a solution...
> I do not know :) I would first try something better than this using synchronization
> primitives.
>
> This patch set is for things outside of kernel control. For example, people running
> poll mode DPDK in user-space with FIFO priority; FIFO tasks in user-space for too long...
> with a better design than rt throttling.
>
> Will this patch help in misbehaving kernel activities: yes. Is it a reason not to
> fix kernel problems? I do not think so, and I bet many other people do not believe as
> well.
I totally agree with you that we need to fix the kernel problems. And,
Thanks for your information!
--
Best Regards,
Huang, Ying
On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
> Among the motivations for the DL servers is the real-time throttling
> mechanism. This mechanism works by throttling the rt_rq after
> running for a long period without leaving space for fair tasks.
>
> The base dl server avoids this problem by boosting fair tasks instead
> of throttling the rt_rq. The point is that it boosts without waiting
> for potential starvation, causing some non-intuitive cases.
>
> For example, an IRQ dispatches two tasks on an idle system, a fair
> and an RT. The DL server will be activated, running the fair task
> before the RT one. This problem can be avoided by deferring the
> dl server activation.
>
> By setting the zerolax option, the dl_server will dispatch an
> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>
> The dl_timer will be set for (period - runtime) ns from start time.
> Thus boosting the fair rq on its 0-laxity time with respect to
> rt_rq.
>
> If the fair scheduler has the opportunity to run while waiting
> for zerolax time, the dl server runtime will be consumed. If
> the runtime is completely consumed before the zerolax time, the
> server will be replenished while still in a throttled state. Then,
> the dl_timer will be reset to the new zerolax time
>
> If the fair server reaches the zerolax time without consuming
> its runtime, the server will be boosted, following CBS rules
> (thus without breaking SCHED_DEADLINE).
>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Hi, Daniel,
We have one additional patch (other than the 15 I just sent). Since I have just
3 more working days for the next 3 weeks, I thought I might as well reply inline
here since it might be unnecessary to resend all 15 patches so soon just for the
one new addition below. I am replying to this patch here, because the new patch
is related (to 0-laxity). But once I am back from holiday, I can resend it with
the set I have unless you've applied it.
So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
time (max cap is default off keeping the regular behavior). This is needed to
guarantee bandwidth for periodic CFS runners/sleepers.
The example usecase is:
Consider DL server params 25ms / 50ms.
Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).
run 25ms run 25ms
_______ _______
| | sleep 51 | | sleep 51
-|------|-------|---------|---------|-------|----------|--------|------> t
0 25 50 101 126 151 202 227
\ 0-lax / \ 0-lax /
Here the 0-lax addition in the original v5's zero-lax patch causes lesser bandwidth.
So the task runs 50ms every 227ms, instead of 50ms every 152ms.
A simple unit test confirms the issue, and it is fixed by Vineeth's patch below:
Please take a look at the patch below (applies only to v5.15 but Vineeth is
rebase on mainline as we speak), thanks.
-----8<--------
From: Vineeth Pillai (Google) <[email protected]>
Subject: [PATCH] sched/deadline/dlserver: sysctl for dlserver maxdefer time
Inorder to avoid dlserver preempting RT tasks when it wakes up, dlserver
is throttled(deferred) until zero lax time. This is the farthest time
before deadline where dlserver can meet its deadline.
Zero lax time causes cfs tasks with sleep/run pattern where the cfs
tasks doesn't get the bandwidth promised by dlserver. So introduce a
sysctl for limiting the defer time of dlserver.
Suggested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Vineeth Pillai (Google) <[email protected]>
---
include/linux/sched/sysctl.h | 2 ++
kernel/sched/deadline.c | 6 ++++++
kernel/sysctl.c | 7 +++++++
3 files changed, 15 insertions(+)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 4939e6128840..a27fba6fe0ab 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_iowait_apply_ticks;
extern unsigned int sysctl_sched_dl_period_max;
extern unsigned int sysctl_sched_dl_period_min;
+extern unsigned int sysctl_sched_dlserver_maxdefer_ms;
+
#ifdef CONFIG_UCLAMP_TASK
extern unsigned int sysctl_sched_uclamp_util_min;
extern unsigned int sysctl_sched_uclamp_util_max;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d638cc5b45c7..69c9fd80a67d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1071,6 +1071,11 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
if (dl_se->dl_defer_armed) {
WARN_ON_ONCE(!dl_se->dl_throttled);
act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+ if (sysctl_sched_dlserver_maxdefer_ms) {
+ ktime_t dlserver_maxdefer = rq_clock(rq) +
ms_to_ktime(sysctl_sched_dlserver_maxdefer_ms);
+ if (ktime_after(act, dlserver_maxdefer))
+ act = dlserver_maxdefer;
+ }
} else {
act = ns_to_ktime(dl_next_period(dl_se));
}
@@ -3099,6 +3104,7 @@ void __getparam_dl(struct task_struct *p, struct
sched_attr *attr)
*/
unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
unsigned int sysctl_sched_dl_period_min = 100; /* 100 us */
+unsigned int sysctl_sched_dlserver_maxdefer_ms = 2;
/*
* This function validates the new parameters of a -deadline task.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 39f47a871fb4..027193302e7e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "sched_dlserver_maxdefer_ms",
+ .data = &sysctl_sched_dlserver_maxdefer_ms,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
{
.procname = "sched_rr_timeslice_ms",
.data = &sysctl_sched_rr_timeslice,
--
2.40.1
On 3/20/24 01:03, Joel Fernandes wrote:
>
>
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Among the motivations for the DL servers is the real-time throttling
>> mechanism. This mechanism works by throttling the rt_rq after
>> running for a long period without leaving space for fair tasks.
>>
>> The base dl server avoids this problem by boosting fair tasks instead
>> of throttling the rt_rq. The point is that it boosts without waiting
>> for potential starvation, causing some non-intuitive cases.
>>
>> For example, an IRQ dispatches two tasks on an idle system, a fair
>> and an RT. The DL server will be activated, running the fair task
>> before the RT one. This problem can be avoided by deferring the
>> dl server activation.
>>
>> By setting the zerolax option, the dl_server will dispatch an
>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>
>> The dl_timer will be set for (period - runtime) ns from start time.
>> Thus boosting the fair rq on its 0-laxity time with respect to
>> rt_rq.
>>
>> If the fair scheduler has the opportunity to run while waiting
>> for zerolax time, the dl server runtime will be consumed. If
>> the runtime is completely consumed before the zerolax time, the
>> server will be replenished while still in a throttled state. Then,
>> the dl_timer will be reset to the new zerolax time
>>
>> If the fair server reaches the zerolax time without consuming
>> its runtime, the server will be boosted, following CBS rules
>> (thus without breaking SCHED_DEADLINE).
notice: at this point in history, the term zero-laxity was removed from
the latest code we have, the term was moved to defer server... I will
remove from the long in the next time I send it.
>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>
> Hi, Daniel,
> We have one additional patch (other than the 15 I just sent).
Those 15 are the next thing I will review... I was working in the merge window.
Since I have just
> 3 more working days for the next 3 weeks, I thought I might as well reply inline
> here since it might be unnecessary to resend all 15 patches so soon just for the
> one new addition below. I am replying to this patch here, because the new patch
> is related (to 0-laxity). But once I am back from holiday, I can resend it with
> the set I have unless you've applied it.
before starting... in three weeks, we will be very far still from any attempt to
ping Peter & ingo to ask if they could think about putting us into a queue.
Have fun :-)
as I explained at LPC.. and in chat... there is no "0-laxity", and repeating it
only creates more confusion. Let's use a deferred server... a regular deadline
reservation with deferred starting time at (now + period - runtime), and at that point
il will receive a new deadline one period away - (not one runtime away).
There will always be a person reading these emails and echoing the wrong things...
using 0-lax/0-laxity term here is a lose-lose.
> So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
> time (max cap is default off keeping the regular behavior). This is needed to
> guarantee bandwidth for periodic CFS runners/sleepers.
Another point... "guarantee bandwidth"... the bandwith is provided under certain conditions.
If the conditions are not respected, the guarantee a dl reservation will provide is that
the task will not put a utilization higher than the one requested, so yes, a dl reservation
can and will enforce a lower utilization if the task does not respect the conditions.
Also, if the reservation is ready, but no task is ready...
>
> The example usecase is:
>
> Consider DL server params 25ms / 50ms.
>
> Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).
define duty... like, runtime 25, period 76? sleeps for 51 relative to a starting time
or not?
there are some holes in your explanation, it is tricky to reply inline for these cases...
I am continuing but....
>
> run 25ms run 25ms
> _______ _______
> | | sleep 51 | | sleep 51
> -|------|-------|---------|---------|-------|----------|--------|------> t
> 0 25 50 101 126 151 202 227
> \ 0-lax / \ 0-lax /
trying to understand...
at time 0 the task is activated... RT tasks are spinning... assuming that so the
server was deferred to 25?
At 25, it becomes DL, with a new deadline of 75. If there is no other DL
task, run [25..50].
at 75 the task would have another 25 ms... but it decided not to run, throwing
away 25 ms until 100. At this point, I would say: is not an odd setup....
At point 100, the deferred server assumes that the starvation condition is gone,
and goes to the initial state.
now, that 0-lax means what? the zero-laxity time for the task... but not
from its start time, but from the beginning of the deferred server = 25 + 76 = 101
.. also 0-lax is a range?
and here, the system seems to start the same cycle with a 1 shift to repeat the
case that the task and the reservation did not match.
looking back, for the first cycle... with defer at 75, without at point 50, there
would be time for the server to run, but the task is just ready, so runtime is
thrown away.
So, this miss match between the configuration and the task setup is... clearly
causing runtime to be wasted... but one can change the parameter for them
to be better in sync...
and here, I assume that there is something missing in the explanation because...
> Here the 0-lax addition in the original v5's zero-lax patch causes lesser bandwidth.
which addition?
> So the task runs 50ms every 227ms, instead of 50ms every 152ms.
..
>
> A simple unit test confirms the issue, and it is fixed by Vineeth's patch below:
>
> Please take a look at the patch below (applies only to v5.15 but Vineeth is
> rebase on mainline as we speak), thanks.
>
> -----8<--------
> From: Vineeth Pillai (Google) <[email protected]>
> Subject: [PATCH] sched/deadline/dlserver: sysctl for dlserver maxdefer time
>
> Inorder to avoid dlserver preempting RT tasks when it wakes up, dlserver
> is throttled(deferred) until zero lax time. This is the farthest time
> before deadline where dlserver can meet its deadline.
>
> Zero lax time causes cfs tasks with sleep/run pattern where the cfs
> tasks doesn't get the bandwidth promised by dlserver. So introduce a
> sysctl for limiting the defer time of dlserver.
so... that explanation before to reach to the conclusion that limiting the
amount of time to defer the server is a fix? there is a huge gap here.
>
> Suggested-by: Joel Fernandes (Google) <[email protected]>
> Signed-off-by: Vineeth Pillai (Google) <[email protected]>
> ---
> include/linux/sched/sysctl.h | 2 ++
> kernel/sched/deadline.c | 6 ++++++
> kernel/sysctl.c | 7 +++++++
> 3 files changed, 15 insertions(+)
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 4939e6128840..a27fba6fe0ab 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -41,6 +41,8 @@ extern unsigned int sysctl_iowait_apply_ticks;
> extern unsigned int sysctl_sched_dl_period_max;
> extern unsigned int sysctl_sched_dl_period_min;
> +extern unsigned int sysctl_sched_dlserver_maxdefer_ms;
> +
> #ifdef CONFIG_UCLAMP_TASK
> extern unsigned int sysctl_sched_uclamp_util_min;
> extern unsigned int sysctl_sched_uclamp_util_max;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d638cc5b45c7..69c9fd80a67d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1071,6 +1071,11 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> if (dl_se->dl_defer_armed) {
> WARN_ON_ONCE(!dl_se->dl_throttled);
> act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> + if (sysctl_sched_dlserver_maxdefer_ms) {
> + ktime_t dlserver_maxdefer = rq_clock(rq) +
> ms_to_ktime(sysctl_sched_dlserver_maxdefer_ms);
> + if (ktime_after(act, dlserver_maxdefer))
> + act = dlserver_maxdefer;
that is, having a global limit... we have a per-cpu set of variables,
that is bounded by a global limit.
<joking>
It is already hard to put in sync with two parameters. Now we need a
third one that is global :-)
<joking>
reading the code was actually more instructive than reading the comments.
The good point is that this puts a nail in the coffin of "zerolax" :-)
that is, you all want also to control for how long the defer happens. Now
it is fixed... (period - runtime). You all would like to have the ability
to set it so something closer, so to defer less.
That phrase is simple :-)
It is already possible! how? This can be done by adjusting runtime/period.
Which are already per CPU. The DL server does not need to give an entire
chuck all at once; one can split it into smaller runtime/period slices,
as EEVDF does. And here we get back to the things we talked about when
trying to use EDF... but now it is only for one task :-) It is easier.
This also reduces the amount of time thrown away because there is no
task ready. It is the main cause of time wasted anyway. And the CFS
scheduler is really aperiodic, in practice, these more corner
case timelines will hardly happen... unless you try to force with
a simple test case.
now... could I change the defer per-cpu option to have a value between
0 and (period - runtime) so we could defer less than (period - runtime),
with proper check and granularity (per cpu) in the next version I send
for the interface.... maybe... it is less prone to have a no no from
people who care a lot about the interface.
But I am asking myself: is it worth the complexity? I would try first
getting used with the runtime/period setup only...
I will start reviewing the other patches as soon as the worries
of the merge window passes away. Hopefully tomorrow.
-- Daniel
Hello Daniel,
Thank you for the reply and I replied below.
On 3/20/2024 3:24 PM, Daniel Bristot de Oliveira wrote:
> On 3/20/24 01:03, Joel Fernandes wrote:
>>
>>
>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>> Among the motivations for the DL servers is the real-time throttling
>>> mechanism. This mechanism works by throttling the rt_rq after
>>> running for a long period without leaving space for fair tasks.
>>>
>>> The base dl server avoids this problem by boosting fair tasks instead
>>> of throttling the rt_rq. The point is that it boosts without waiting
>>> for potential starvation, causing some non-intuitive cases.
>>>
>>> For example, an IRQ dispatches two tasks on an idle system, a fair
>>> and an RT. The DL server will be activated, running the fair task
>>> before the RT one. This problem can be avoided by deferring the
>>> dl server activation.
>>>
>>> By setting the zerolax option, the dl_server will dispatch an
>>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>>
>>> The dl_timer will be set for (period - runtime) ns from start time.
>>> Thus boosting the fair rq on its 0-laxity time with respect to
Note your patch changelog. --- (1)
>>> rt_rq.
>>>
>>> If the fair scheduler has the opportunity to run while waiting
>>> for zerolax time, the dl server runtime will be consumed. If
>>> the runtime is completely consumed before the zerolax time, the
Note your patch changelog. --- (2)
>>> server will be replenished while still in a throttled state. Then,
>>> the dl_timer will be reset to the new zerolax time
>>>
>>> If the fair server reaches the zerolax time without consuming
>>> its runtime, the server will be boosted, following CBS rules
>>> (thus without breaking SCHED_DEADLINE).
>
> notice: at this point in history, the term zero-laxity was removed from
> the latest code we have, the term was moved to defer server... I will
> remove from the long in the next time I send it.
I am confused because your change log in your v5 (and the v6 in kernel.org)
mentions "zero-laxity" all over the place (example see above for (1) and (2)).
Further, the terminology is the least of the problems unfortunately (more
below). Call it defer or whatever but it has to work. :)
>>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>>
>> Hi, Daniel,
>> We have one additional patch (other than the 15 I just sent).
>
> Those 15 are the next thing I will review... I was working in the merge window.
Me too ;-) It has been a fun merge window.
> Since I have just
>> 3 more working days for the next 3 weeks, I thought I might as well reply inline
>> here since it might be unnecessary to resend all 15 patches so soon just for the
>> one new addition below. I am replying to this patch here, because the new patch
>> is related (to 0-laxity). But once I am back from holiday, I can resend it with
>> the set I have unless you've applied it.
>
> before starting... in three weeks, we will be very far still from any attempt to
> ping Peter & ingo to ask if they could think about putting us into a queue.
I wasn't really expecting an imminent "ping for merge" to be honest. My main
motivation with posting it before my 3 week off time was to trigger some review
and discussion here (which we desperately need more off IMO ;)). I am not in a
hurry to request for a merge without it working correctly either.
> Have fun :-)
Thanks!
> as I explained at LPC.. and in chat... there is no "0-laxity", and repeating it
> only creates more confusion.
We can stick to defer but could you please update your changelog as well. I
don't think you posted an update to the patch since November either so maybe you
should do that as well, with the patches adjusted with the bugs we found. I do
agree that using zero-laxity creates a source of confusion and prefer "defer".
> Let's use a deferred server... a regular deadline
> reservation with deferred starting time at (now + period - runtime), and at that point
> il will receive a new deadline one period away - (not one runtime away).
>
> There will always be a person reading these emails and echoing the wrong things...
> using 0-lax/0-laxity term here is a lose-lose.
Agreed, so why not update your patch changelog to correct that (or post a new
revision)?
>> So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
>> time (max cap is default off keeping the regular behavior). This is needed to
>> guarantee bandwidth for periodic CFS runners/sleepers.
>
> Another point... "guarantee bandwidth"... the bandwith is provided under certain conditions.
> If the conditions are not respected, the guarantee a dl reservation will provide is that
> the task will not put a utilization higher than the one requested, so yes, a dl reservation
> can and will enforce a lower utilization if the task does not respect the conditions.
> Also, if the reservation is ready, but no task is ready...
Please clarify what conditions you mean? The conditions I am looking for are
those given by RT throttling (see below). i.e., if RT takes up all of the CPU in
a certain amount of time, then there is a certain amount reserved for CFS.
>> The example usecase is:
>>
>> Consider DL server params 25ms / 50ms.
>>
>> Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).
>
> define duty... like, runtime 25, period 76? sleeps for 51 relative to a starting time
> or not?
There is no starting time. The CFS task in the quoted example only does run +
sleep. Run for 25ms and sleep for 51ms, then run again for 25ms, etc.
>
> there are some holes in your explanation, it is tricky to reply inline for these cases...
> I am continuing but....
>
>>
>> run 25ms run 25ms
>> _______ _______
>> | | sleep 51 | | sleep 51
>> -|------|-------|---------|---------|-------|----------|--------|------> t
>> 0 25 50 101 126 151 202 227
>> \ 0-lax / \ 0-lax /
So before going into the discussion below, I just want to mention that if the
DL-server cannot provide the same bandwidth that the RT throttling provided,
then its broken by definition. And the breakage comes specifically because of
this patch and nothing else. There are many breakages with this patch and I will
go over some of them with unit tests below. Basically, in my view -- if a test
case shows it works with RT throttling, but not with the DL-server, then its
broken. Period. And most of those functional "breakages" come about because of
this patch (6/7) and not the initial series actually.
Here are some cases to shed some light:
Case 1. Consider a CFS task with runtime 15ms and period 50ms. With the
parameters set to 25ms runtime and 50ms period.
The test fails with DL server (because of 6/7), and passes with RT throttling.
See results below. For this test's code, see: https://shorturl.at/rwW07
Specifically, it breaks because of this patch (6/7). If you revert the patch,
the issue goes away.
With the patch 6/7:
# ./dlserver_test
# Runtime of PID 85 is 0.430000 seconds
Bail out! Runtime of PID 85 is not within 10% of expected runtime 0.900000
Without the patch 6/7:
# ./dlserver_test
# Runtime of PID 87 is 0.900000 seconds
ok 1 PASS
So basically, because of defer (or whatever you want to call it ;)), it gives
less than 50% of the bandwidth that it gave without the defer.
And here it is with vanilla RT throttling:
# ./dlserver_test
[ 44.968900] sched: RT throttling activated
# Runtime of PID 87 is 0.880000 seconds
Vineeth's patch I shared fixes the issue.
Case 2:
For this case, please run the core scheduling test I provided with
CONFIG_SCHED_CORE=y. It is basically like Case 1 but with some slightly changes.
You will see that because of patch 6/7, that test also breaks and gets lesser
bandwidth. And it happens of the negative runtime stuff you added to
update_curr_dl_se(). Deleting that fixes this, but is indication of yet another
problem with this patch.
The patch fixing this issue (by deleting that block) is also included in the set
of 15 I posted earlier.
-------
Further, my impression is this patch (6/7) does not even solve all the issue it
intended. For example, consider that a CFS task is in the boosted phase, and now
an RT task wakes up. That RT task *will wait* for possibly the whole runtime
granted to CFS, so it might not always help. Contrasting that with RT
throttling, if an RT task is very well behaved (well behaved defined as not
running to the limit that RT throttling should kick in), and it wakes up, it
will run right away without any wait time, regardless of what CFS was or was not
doing.
thanks,
- Joel
> On 3/20/2024 3:24 PM, Daniel Bristot de Oliveira wrote:
>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>> Among the motivations for the DL servers is the real-time throttling
>>>> mechanism. This mechanism works by throttling the rt_rq after
>>>> running for a long period without leaving space for fair tasks.
>>>>
>>>> The base dl server avoids this problem by boosting fair tasks instead
>>>> of throttling the rt_rq. The point is that it boosts without waiting
>>>> for potential starvation, causing some non-intuitive cases.
>>>>
>>>> For example, an IRQ dispatches two tasks on an idle system, a fair
>>>> and an RT. The DL server will be activated, running the fair task
>>>> before the RT one. This problem can be avoided by deferring the
>>>> dl server activation.
>>>>
>>>> By setting the zerolax option, the dl_server will dispatch an
>>>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>>>
>>>> The dl_timer will be set for (period - runtime) ns from start time.
>>>> Thus boosting the fair rq on its 0-laxity time with respect to
>
Hi,
Upon reflection, might we simplify the solution by treating RT as a deadline
reservation as well?
The RT deadline reservation can have shorter deadline so it will be interrupted
less immediately by CFS due to EDF. Would that work, or was that already tried
and has other dragons?
If we could pull that off, then we do not need all the deferral/timer stuff and
could considering dropping this patch. Yes it is more code, but this 6th patch
is also big and non trivial.
Juri, Daniel, all, what do you think?
(On the other hand if we want to keep this patch as a first step, and
incrementally improve that is Ok, but I believe we do need to make a decision..)
By the way, is there a still a slot in OSPM available to discuss these? If so
that would be great. I can put up some slides.
cheers,
- Joel
On 2/15/24 18:41, Joel Fernandes wrote:
>> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
>> not set 45ms/50ms for instance.
> I isolated the problem. It is not an interface problem.
>
> Long story short, the servers are initialized at the defrootdomain, but
> the dl_bw info is not being carried over to the new domain because the
> servers are not a task.
Fixed on v6. When an rq is attached to a new rd, at rq_attach_root(), the bw of
from fair server is being added to the new rd.
Without this, the RD has no bandwidth registered (no-dl task) and so the
admission control becomes broken.
-- Daniel
>> There will always be a person reading these emails and echoing the wrong things...
>> using 0-lax/0-laxity term here is a lose-lose.
>
> Agreed, so why not update your patch changelog to correct that (or post a new
> revision)?
That v6 is the repo was a partial update, that I sent there to sync our work. The v6 that I will
send already removed that.
>>> So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
>>> time (max cap is default off keeping the regular behavior). This is needed to
>>> guarantee bandwidth for periodic CFS runners/sleepers.
>>
>> Another point... "guarantee bandwidth"... the bandwith is provided under certain conditions.
>> If the conditions are not respected, the guarantee a dl reservation will provide is that
>> the task will not put a utilization higher than the one requested, so yes, a dl reservation
>> can and will enforce a lower utilization if the task does not respect the conditions.
>> Also, if the reservation is ready, but no task is ready...
>
> Please clarify what conditions you mean? The conditions I am looking for are
> those given by RT throttling (see below). i.e., if RT takes up all of the CPU in
> a certain amount of time, then there is a certain amount reserved for CFS.
>
>>> The example usecase is:
>>>
>>> Consider DL server params 25ms / 50ms.
>>>
>>> Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).
>>
>> define duty... like, runtime 25, period 76? sleeps for 51 relative to a starting time
>> or not?
>
> There is no starting time. The CFS task in the quoted example only does run +
> sleep. Run for 25ms and sleep for 51ms, then run again for 25ms, etc.
>
>>
>> there are some holes in your explanation, it is tricky to reply inline for these cases...
>> I am continuing but....
>>
>>>
>>> run 25ms run 25ms
>>> _______ _______
>>> | | sleep 51 | | sleep 51
>>> -|------|-------|---------|---------|-------|----------|--------|------> t
>>> 0 25 50 101 126 151 202 227
>>> \ 0-lax / \ 0-lax /
>
> So before going into the discussion below, I just want to mention that if the
> DL-server cannot provide the same bandwidth that the RT throttling provided,
> then its broken by definition. And the breakage comes specifically because of
> this patch and nothing else. There are many breakages with this patch and I will
> go over some of them with unit tests below. Basically, in my view -- if a test
> case shows it works with RT throttling, but not with the DL-server, then its
> broken. Period. And most of those functional "breakages" come about because of
> this patch (6/7) and not the initial series actually.
>
> Here are some cases to shed some light:
>
> Case 1. Consider a CFS task with runtime 15ms and period 50ms. With the
> parameters set to 25ms runtime and 50ms period.
>
> The test fails with DL server (because of 6/7), and passes with RT throttling.
> See results below. For this test's code, see: https://shorturl.at/rwW07
A reproducer always helps. So, your task there is not a periodic task... it is
a sporadic task because it sleeps for a fixed amount of time after the runtime.
A periodic task with period 76 would wake at 0, 76, 152 - like cyclictest...
so consuming at a fixed time rate if the scheduler allows it.
In the case of a fixed sleep time at the end of the execution, it will end up
"throwing away bandwidth" if the runtime is not given at the beginning of the
period because it will run slower... accumulating error. But that was not the
main point here...
The problem here was more like: if a fair task goes to sleep in the middle of
the server activation (for a lock?), and then wakes up again, the code in v5 is
forcing it to defer... again. Thus, it is getting less bandwidth... notice that
it does not even need to be at the start of the period. It is the middle of the
execution.
Intuitively, reducing the deferred time would help there. But the best thing to do is:
If the fair task waited for the defer, and the real-time tasks are still using all
CPU time, do not defer the activation again, and keep the defer mechanism disabled
until the real-time tasks allow the fair scheduler to run in the background. So,
making the defer mode equivalent to the non-defer mode until the RT tasks start
to behave again.
For that, in the v6, there is a variable (dl_defer_running), once the dl_server
is enqueued after the defer time, the variable dl_defer_running is set.
If the fair task sleeps in the middle of the period, that variable do not change.
If the fair task wakes up and the dl_defer_running is still set, do not defer.
Keep running until you consume the reservation.
The variable dl_defer_running is set to 0 only after the fair tasks consume
its runtime without being in a dl_server... IOW, when the RT tasks start to
behave.
No interface change.
With that in place, your reproducers are working. I have a periodic version
of your reproducer, also improving how the task consumes the runtime,.. I
will send it to you so you can have a look.
>
> Specifically, it breaks because of this patch (6/7). If you revert the patch,
> the issue goes away.
>
> With the patch 6/7:
> # ./dlserver_test
> # Runtime of PID 85 is 0.430000 seconds
> Bail out! Runtime of PID 85 is not within 10% of expected runtime 0.900000
>
> Without the patch 6/7:
> # ./dlserver_test
> # Runtime of PID 87 is 0.900000 seconds
> ok 1 PASS
>
> So basically, because of defer (or whatever you want to call it ;)), it gives
> less than 50% of the bandwidth that it gave without the defer.
There was a problem with the non-defer mode as well, the dl_server_start() was
missing a set need resched. Fixed that in v6.
> Further, my impression is this patch (6/7) does not even solve all the issue it
> intended. For example, consider that a CFS task is in the boosted phase, and now
> an RT task wakes up. That RT task *will wait* for possibly the whole runtime
> granted to CFS, so it might not always help. Contrasting that with RT
> throttling, if an RT task is very well behaved (well behaved defined as not
> running to the limit that RT throttling should kick in), and it wakes up, it
> will run right away without any wait time, regardless of what CFS was or was not
> doing.
I fixed that as well.
The problem happens when a DL server has a large runtime (>=~ 50%).
Let's say 25 ms runtime, 50 ms period.
At time 0, the defer timer will be set at 25 ms (50 - 25).
From 0 to 25, the RT task would consume, for instance, only 2 ms... so
it is behaving...
At time 25, the defer timer fires... and as the fair task ran for 23 ms
(25 - 2 ms taken by RT) it still has 2 ms runtime to run... so the server
is activated... it is not correct.
The change I made in v6 is:
Same case...
At time 25, the defer timer fires...
Then, the timer will re-compute the defer time:
If the RT tasks are behaving, forward the timer for the
new (deadline - runtime).
return;
For instance, in the previous case, the new defer timer would be: 50 ms - 2 ms.
CFS will continue working, consuming runtime and resetting the period to avoid
activating the dl server.
The idea of forwarding the timer was taken from the cfs period timer. It is also
possible to forward the timer on other points... if necessary...
I did more testing, with different task sets, including tasks that goes to sleep...
it is working as expected.
-- Daniel
> thanks,
>
> - Joel
>
On Fri, 5 Apr 2024 16:35:49 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:
> A reproducer always helps. So, your task there is not a periodic task... it is
> a sporadic task because it sleeps for a fixed amount of time after the runtime.
>
> A periodic task with period 76 would wake at 0, 76, 152 - like cyclictest...
> so consuming at a fixed time rate if the scheduler allows it.
>
> In the case of a fixed sleep time at the end of the execution, it will end up
> "throwing away bandwidth" if the runtime is not given at the beginning of the
> period because it will run slower... accumulating error. But that was not the
> main point here...
>
> The problem here was more like: if a fair task goes to sleep in the middle of
> the server activation (for a lock?), and then wakes up again, the code in v5 is
> forcing it to defer... again. Thus, it is getting less bandwidth... notice that
> it does not even need to be at the start of the period. It is the middle of the
> execution.
>
> Intuitively, reducing the deferred time would help there. But the best thing to do is:
>
> If the fair task waited for the defer, and the real-time tasks are still using all
> CPU time, do not defer the activation again, and keep the defer mechanism disabled
> until the real-time tasks allow the fair scheduler to run in the background. So,
> making the defer mode equivalent to the non-defer mode until the RT tasks start
> to behave again.
>
> For that, in the v6, there is a variable (dl_defer_running), once the dl_server
> is enqueued after the defer time, the variable dl_defer_running is set.
>
> If the fair task sleeps in the middle of the period, that variable do not change.
>
> If the fair task wakes up and the dl_defer_running is still set, do not defer.
> Keep running until you consume the reservation.
>
> The variable dl_defer_running is set to 0 only after the fair tasks consume
> its runtime without being in a dl_server... IOW, when the RT tasks start to
> behave.
Very nice explanation! Thanks Daniel.
>
> No interface change.
>
> With that in place, your reproducers are working. I have a periodic version
> of your reproducer, also improving how the task consumes the runtime,.. I
> will send it to you so you can have a look.
Looking forward to reviewing your patches when I'm back from PTO.
-- Steve