Hi,
Glenn reported a bug concerning SCHED_DEADLINE nested (cross-class)
priority inheritance. In his words:
"""
Hello maintainers,
My application produces a BUG in deadline.c when a SCHED_DEADLINE task
contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I
believe the bug is triggered when a CFS task that was boosted by a
SCHED_DEADLINE task boosts another CFS task (nested priority
inheritance).
I am able to reproduce the bug on 4.15, 4.19-rt, and 5.3.0-19-generic
(an Ubuntu kernel) kernels.
I have a small test program that can reproduce the issue. Please find
the source code for this test at the end of this email. I have tested
on an Intel Xeon 8276 CPU, as well as within an x86_64 VM. I can try
this test with a 4.19-rt kernel on aarch64, if desired.
Here is the BUG output on a 4.19-rt kernel:
------------[ cut here ]------------
kernel BUG at kernel/sched/deadline.c:1462!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: P O 4.19.72-rt25-appaloosa-v1.5 #1
Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
RIP: 0010:enqueue_task_dl+0x335/0x910
Code: ...
RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
? intel_pstate_update_util_hwp+0x13/0x170
rt_mutex_setprio+0x1cc/0x4b0
task_blocks_on_rt_mutex+0x225/0x260
rt_spin_lock_slowlock_locked+0xab/0x2d0
rt_spin_lock_slowlock+0x50/0x80
hrtimer_grab_expiry_lock+0x20/0x30
hrtimer_cancel+0x13/0x30
do_nanosleep+0xa0/0x150
hrtimer_nanosleep+0xe1/0x230
? __hrtimer_init_sleeper+0x60/0x60
__x64_sys_nanosleep+0x8d/0xa0
do_syscall_64+0x4a/0x100
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fa58b52330d
Code: ...
RSP: 002b:00007fa58ac54ef0 EFLAGS: 00000293 ORIG_RAX: 0000000000000023
RAX: ffffffffffffffda RBX: ffffffffffffff98 RCX: 00007fa58b52330d
RDX: 00007fa58b81d780 RSI: 00007fa58ac54f00 RDI: 00007fa58ac54f00
RBP: 0000000000000000 R08: 00007fa58ac55700 R09: 0000000000000000
R10: 0000000000000735 R11: 0000000000000293 R12: 0000000000000000
R13: 00007ffc807fd43f R14: 00007fa58ac559c0 R15: 0000000000000000
Modules linked in: ...
---[ end trace 0000000000000002 ]---
"""
He also provided a reproducer (appended to this cover) with which I was
able to confirm that tip/sched/core is affected as well.
The proposed fix is composed of 2 patches:
1 - Trigger priority inheritance (boost) for "indirect" cases as well,
where I (improperly?) consider "indirect" a non-SCHED_DEADLINE waiter
currently boosted from a different lock chain containing a
SCHED_DEADLINE task
2 - Temporarily copy static parameters to non-SCHED_DEADLINE boosted
tasks, so that they can be used in "indirect" cases
Now, this is pretty far from ideal, I know. Consider it a stopgap
solution (a hack, if you like) until we get proper proxy execution (on
which I'm still working :-/). Better ideas are more than welcome!
Best,
Juri
Juri Lelli (2):
sched/deadline: Fix nested priority inheritance at enqueue time
sched/deadline: Temporarily copy static parameters to boosted
non-DEADLINE entities
kernel/sched/core.c | 6 ++++--
kernel/sched/deadline.c | 19 ++++++++++++++++++-
kernel/sched/sched.h | 1 +
3 files changed, 23 insertions(+), 3 deletions(-)
--
2.17.2
--->8---
--- test case ---
/* gcc main.c -lpthread */
/*
* This program reproduces a kernel bug at
* https://elixir.bootlin.com/linux/v4.15/source/kernel/sched/deadline.c#L1405, but it is not
* limited to version v4.15.
*
* This bug is triggered when a non-deadline task that was boosted by a deadline task boosts
* another non-deadline task.
*
* So the execution order of locking steps is the following
* (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled
* with priority inheritance.)
*
* Time moves forward as this timeline goes down:
*
* N1 N2 D1
* | | |
* | | |
* Lock(M1) | |
* | | |
* | Lock(M2) |
* | | |
* | | Lock(M2)
* | | |
* | Lock(M1) |
* | (!!bug triggered!) |
*
*/
#define _GNU_SOURCE
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/unistd.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>
#define gettid() syscall(__NR_gettid)
#define SCHED_DEADLINE 6
#ifdef __x86_64__
#define __NR_sched_setattr 314
#define __NR_sched_getattr 315
#endif
#ifdef __i386__
#define __NR_sched_setattr 351
#define __NR_sched_getattr 352
#endif
#ifdef __arm__
#define __NR_sched_setattr 380
#define __NR_sched_getattr 381
#endif
static volatile int step;
pthread_mutex_t m1;
pthread_mutex_t m2;
struct sched_attr {
__u32 size;
__u32 sched_policy;
__u64 sched_flags;
/* SCHED_NORMAL, SCHED_BATCH */
__s32 sched_nice;
/* SCHED_FIFO, SCHED_RR */
__u32 sched_priority;
/* SCHED_DEADLINE (nsec) */
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
};
int sched_setattr(pid_t pid, const struct sched_attr *attr,
unsigned int flags) {
return syscall(__NR_sched_setattr, pid, attr, flags);
}
int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size,
unsigned int flags) {
return syscall(__NR_sched_getattr, pid, attr, size, flags);
}
void *run_normal_1(void *data) {
int x = 0;
printf("normal thread started [%ld]\n", gettid());
// N1 locks M1
printf("N1 is locking M1\n");
pthread_mutex_lock(&m1);
printf("N1 locked M1\n");
// Notify N2
step = 1;
// Wait to be boosted by N2 (on M1)
sleep(10);
// Won't be able to reach here because of the rt_mutex + sched_deadline bug.
printf("N1 is unlocking M1\n");
pthread_mutex_unlock(&m1);
printf("N1 unlocked M1\n");
printf("normal thread dies [%ld]\n", gettid());
return NULL;
}
void *run_normal_2(void *data) {
int x = 0;
printf("normal thread started [%ld]\n", gettid());
// N2 locks M2
printf("N2 is locking M2\n");
pthread_mutex_lock(&m2);
printf("N2 locked M2\n");
// Wait until N1 locked M1
while (step < 1) {
x++;
}
// Notify D1
step = 2;
// Wait to be boosted by D1 (on M2)
sleep(5);
printf("N2 is locking M1\n");
// This will boost N1 and trigger the bug.
pthread_mutex_lock(&m1);
// Won't reach here because of the bug
printf("N2 locked M1\n");
printf("N2 is unlocking M1\n");
pthread_mutex_unlock(&m1);
printf("N2 unlocked M1\n");
printf("N2 is unlocking M2\n");
pthread_mutex_unlock(&m2);
printf("N2 unlocked M2\n");
printf("normal thread dies [%ld]\n", gettid());
return NULL;
}
void *run_deadline(void *data) {
struct sched_attr attr;
int x = 0;
int ret = 0;
unsigned int flags = 0;
printf("deadline thread started [%ld]\n", gettid());
attr.size = sizeof(attr);
attr.sched_flags = 0;
attr.sched_nice = 0;
attr.sched_priority = 0;
/* This creates a 10ms/30ms reservation */
attr.sched_policy = SCHED_DEADLINE;
attr.sched_runtime = 10 * 1000 * 1000;
attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;
ret = sched_setattr(0, &attr, flags);
if (ret < 0) {
step = 0;
perror("sched_setattr");
exit(-1);
}
// Wait until N2 locked M2
while (step < 2) {
x++;
}
printf("D1 is locking M2\n");
// This will boost N2
pthread_mutex_lock(&m2);
printf("D1 locked M2\n");
sleep(10);
// Won't reach here because of the bug.
printf("D1 is unlocking M2\n");
pthread_mutex_unlock(&m2);
printf("D1 unlocked M2\n");
printf("deadline thread dies [%ld]\n", gettid());
return NULL;
}
int main(int argc, char **argv) {
pthread_t thread[3];
printf("main thread [%ld]\n", gettid());
int rtn;
pthread_mutexattr_t mutexattr;
if ((rtn = pthread_mutexattr_init(&mutexattr)) != 0) {
fprintf(stderr, "pthread_mutexattr_init: %s", strerror(rtn));
exit(1);
}
if ((rtn = pthread_mutexattr_setprotocol(&mutexattr,
PTHREAD_PRIO_INHERIT)) != 0) {
fprintf(stderr, "pthread_mutexattr_setprotocol: %s", strerror(rtn));
exit(1);
}
if ((rtn = pthread_mutex_init(&m1, &mutexattr)) != 0) {
fprintf(stderr, "pthread_mutex_init: %s", strerror(rtn));
exit(1);
}
if ((rtn = pthread_mutex_init(&m2, &mutexattr)) != 0) {
fprintf(stderr, "pthread_mutex_init: %s", strerror(rtn));
exit(1);
}
// use this volatile variable to coordinate execution order between threads
step = 0;
pthread_create(thread, NULL, run_normal_1, NULL);
pthread_create(thread + 1, NULL, run_normal_2, NULL);
pthread_create(thread + 2, NULL, run_deadline, NULL);
pthread_join(thread[0], NULL);
pthread_join(thread[1], NULL);
pthread_join(thread[2], NULL);
printf("main dies [%ld]\n", gettid());
return 0;
}
Boosted entities (Priority Inheritance) use static DEADLINE parameters
of the top priority waiter. However, there might be cases where top
waiter could be a non-DEADLINE entity that is currently boosted by a
DEADLINE entity from a different lock chain (i.e., nested priority
chains involving entities of non-DEADLINE classes). In this case, top
waiter static DEADLINE parameters could be null (initialized to 0 at
fork()) and replenish_dl_entity() would hit a BUG().
Fix this by temporarily copying static DEADLINE parameters of top
DEADLINE waiter (there must be at least one in the chain(s) for the
problem above to happen) into boosted entities. Parameters are reset
during deboost.
Reported-by: Glenn Elliott <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/core.c | 6 ++++--
kernel/sched/deadline.c | 17 +++++++++++++++++
kernel/sched/sched.h | 1 +
3 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f64d0e..a3eb57cfcfb4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4441,19 +4441,21 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
if (!dl_prio(p->normal_prio) ||
(pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
p->dl.dl_boosted = 1;
+ if (!dl_prio(p->normal_prio))
+ __dl_copy_static(p, pi_task);
queue_flag |= ENQUEUE_REPLENISH;
} else
p->dl.dl_boosted = 0;
p->sched_class = &dl_sched_class;
} else if (rt_prio(prio)) {
if (dl_prio(oldprio))
- p->dl.dl_boosted = 0;
+ __dl_clear_params(p);
if (oldprio < prio)
queue_flag |= ENQUEUE_HEAD;
p->sched_class = &rt_sched_class;
} else {
if (dl_prio(oldprio))
- p->dl.dl_boosted = 0;
+ __dl_clear_params(p);
if (rt_prio(oldprio))
p->rt.timeout = 0;
p->sched_class = &fair_sched_class;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 951a7b44156f..a823391b245e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2676,6 +2676,22 @@ bool __checkparam_dl(const struct sched_attr *attr)
return true;
}
+/*
+ * This function copies the sched_dl_entity static params from one task to another.
+ */
+void __dl_copy_static(struct task_struct *to, struct task_struct *from)
+{
+ struct sched_dl_entity *to_se = &to->dl;
+ struct sched_dl_entity *from_se = &from->dl;
+
+ to_se->dl_runtime = from_se->dl_runtime;
+ to_se->dl_deadline = from_se->dl_deadline;
+ to_se->dl_period = from_se->dl_period;
+ to_se->flags = from_se->flags;
+ to_se->dl_bw = from_se->dl_bw;
+ to_se->dl_density = from_se->dl_density;
+}
+
/*
* This function clears the sched_dl_entity static params.
*/
@@ -2690,6 +2706,7 @@ void __dl_clear_params(struct task_struct *p)
dl_se->dl_bw = 0;
dl_se->dl_density = 0;
+ dl_se->dl_boosted = 0;
dl_se->dl_throttled = 0;
dl_se->dl_yielded = 0;
dl_se->dl_non_contending = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b3361e..92444306fff7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -239,6 +239,7 @@ struct rt_bandwidth {
unsigned int rt_period_active;
};
+void __dl_copy_static(struct task_struct *to, struct task_struct *from);
void __dl_clear_params(struct task_struct *p);
/*
--
2.17.2
Glenn reported that "an application [he developed] produces a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).
Here is the BUG output on a 4.19-rt kernel:
------------[ cut here ]------------
kernel BUG at kernel/sched/deadline.c:1462!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: P O 4.19.72-rt25-appaloosa-v1.5 #1
Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
RIP: 0010:enqueue_task_dl+0x335/0x910
Code: ...
RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
? intel_pstate_update_util_hwp+0x13/0x170
rt_mutex_setprio+0x1cc/0x4b0
task_blocks_on_rt_mutex+0x225/0x260
rt_spin_lock_slowlock_locked+0xab/0x2d0
rt_spin_lock_slowlock+0x50/0x80
hrtimer_grab_expiry_lock+0x20/0x30
hrtimer_cancel+0x13/0x30
do_nanosleep+0xa0/0x150
hrtimer_nanosleep+0xe1/0x230
? __hrtimer_init_sleeper+0x60/0x60
__x64_sys_nanosleep+0x8d/0xa0
do_syscall_64+0x4a/0x100
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fa58b52330d
...
---[ end trace 0000000000000002 ]---
He also provided a simple reproducer creating the situation below:
So the execution order of locking steps is the following
(N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
are mutexes that are enabled with priority inheritance.)
Time moves forward as this timeline goes down:
N1 N2 D1
| | |
| | |
Lock(M1) | |
| | |
| Lock(M2) |
| | |
| | Lock(M2)
| | |
| Lock(M1) |
| (!!bug triggered!) |
This patch (of a 2-patch series) fixes one part of the problem by
correctly triggering priority inheritance in cases where the top lock
waiter is a boosted non-DEADLINE entity (like N2 and N1 in the example
above).
Reported-by: Glenn Elliott <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/deadline.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2dc48720f189..951a7b44156f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1482,7 +1482,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
* boosted due to a SCHED_DEADLINE pi-waiter).
* Otherwise we keep our runtime and deadline.
*/
- if (pi_task && dl_prio(pi_task->normal_prio) && p->dl.dl_boosted) {
+ if (pi_task && dl_prio(pi_task->prio) && p->dl.dl_boosted) {
pi_se = &pi_task->dl;
} else if (!dl_prio(p->normal_prio)) {
/*
--
2.17.2
On Tue, Nov 12, 2019 at 08:50:56AM +0100, Juri Lelli wrote:
> Boosted entities (Priority Inheritance) use static DEADLINE parameters
> of the top priority waiter. However, there might be cases where top
> waiter could be a non-DEADLINE entity that is currently boosted by a
> DEADLINE entity from a different lock chain (i.e., nested priority
> chains involving entities of non-DEADLINE classes). In this case, top
> waiter static DEADLINE parameters could be null (initialized to 0 at
> fork()) and replenish_dl_entity() would hit a BUG().
Argh!
> Fix this by temporarily copying static DEADLINE parameters of top
> DEADLINE waiter (there must be at least one in the chain(s) for the
> problem above to happen) into boosted entities. Parameters are reset
> during deboost.
Also, yuck!
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4441,19 +4441,21 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> if (!dl_prio(p->normal_prio) ||
> (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
> p->dl.dl_boosted = 1;
> + if (!dl_prio(p->normal_prio))
> + __dl_copy_static(p, pi_task);
> queue_flag |= ENQUEUE_REPLENISH;
> } else
> p->dl.dl_boosted = 0;
> p->sched_class = &dl_sched_class;
So I thought our basic approach was deadline inheritance and screw
runtime accounting.
Given that, I don't quite understand the REPLENISH hack there. Should we
not simply copy dl->deadline around (and restore on unboost)?
That is, should we not do something 'simple' like this:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 84b26d38c929..1579c571cb83 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -522,6 +522,7 @@ struct sched_dl_entity {
*/
s64 runtime; /* Remaining runtime for this instance */
u64 deadline; /* Absolute deadline for this instance */
+ u64 normal_deadline;
unsigned int flags; /* Specifying the scheduler behaviour */
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26e4ffa01e7a..16164b0ba80b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4452,9 +4452,11 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
if (!dl_prio(p->normal_prio) ||
(pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
p->dl.dl_boosted = 1;
- queue_flag |= ENQUEUE_REPLENISH;
- } else
+ p->dl.deadline = pi_task->dl.deadline;
+ } else {
p->dl.dl_boosted = 0;
+ p->dl.deadline = p->dl.normal_deadline;
+ }
p->sched_class = &dl_sched_class;
} else if (rt_prio(prio)) {
if (dl_prio(oldprio))
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43323f875cb9..0ad7c2797f11 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -674,6 +674,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
* spent on hardirq context, etc.).
*/
dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+ dl_se->normal_deadline = dl_se->deadline;
dl_se->runtime = dl_se->dl_runtime;
}
@@ -709,6 +710,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
*/
if (dl_se->dl_deadline == 0) {
dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+ dl_se->normal_deadline = dl_se->deadline;
dl_se->runtime = pi_se->dl_runtime;
}
@@ -723,6 +725,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
*/
while (dl_se->runtime <= 0) {
dl_se->deadline += pi_se->dl_period;
+ dl_se->normal_deadline = dl_se->deadline;
dl_se->runtime += pi_se->dl_runtime;
}
@@ -738,6 +741,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
printk_deferred_once("sched: DL replenish lagged too much\n");
dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+ dl_se->normal_deadline = dl_se->deadline;
dl_se->runtime = pi_se->dl_runtime;
}
@@ -898,6 +902,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
}
dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+ dl_se->normal_deadline = dl_se->deadline;
dl_se->runtime = pi_se->dl_runtime;
}
}
On Tue, Nov 12, 2019 at 11:51:30AM +0100, Peter Zijlstra wrote:
> dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> + dl_se->normal_deadline = dl_se->deadline;
Or rather something like:
static inline void dl_set_deadline(struct sched_dl_entity *dl_se, u64 deadline)
{
dl_se->normal_deadline = deadline;
/*
* We should never update the deadline while boosted,
* but if we do, make sure to not change the effective
* deadline until deboost.
*/
if (WARN_ON_ONCE(dl_se->dl_boosted))
return;
dl_se->deadline = dl_se->normal_deadline;
}
Hi,
On 12/11/19 11:51, Peter Zijlstra wrote:
> On Tue, Nov 12, 2019 at 08:50:56AM +0100, Juri Lelli wrote:
> > Boosted entities (Priority Inheritance) use static DEADLINE parameters
> > of the top priority waiter. However, there might be cases where top
> > waiter could be a non-DEADLINE entity that is currently boosted by a
> > DEADLINE entity from a different lock chain (i.e., nested priority
> > chains involving entities of non-DEADLINE classes). In this case, top
> > waiter static DEADLINE parameters could be null (initialized to 0 at
> > fork()) and replenish_dl_entity() would hit a BUG().
>
> Argh!
>
> > Fix this by temporarily copying static DEADLINE parameters of top
> > DEADLINE waiter (there must be at least one in the chain(s) for the
> > problem above to happen) into boosted entities. Parameters are reset
> > during deboost.
>
> Also, yuck!
Indeed. :-(
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4441,19 +4441,21 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> > if (!dl_prio(p->normal_prio) ||
> > (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
> > p->dl.dl_boosted = 1;
> > + if (!dl_prio(p->normal_prio))
> > + __dl_copy_static(p, pi_task);
> > queue_flag |= ENQUEUE_REPLENISH;
> > } else
> > p->dl.dl_boosted = 0;
> > p->sched_class = &dl_sched_class;
>
> So I thought our basic approach was deadline inheritance and screw
> runtime accounting.
>
> Given that, I don't quite understand the REPLENISH hack there. Should we
> not simply copy dl->deadline around (and restore on unboost)?
>
> That is, should we not do something 'simple' like this:
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84b26d38c929..1579c571cb83 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -522,6 +522,7 @@ struct sched_dl_entity {
> */
> s64 runtime; /* Remaining runtime for this instance */
> u64 deadline; /* Absolute deadline for this instance */
> + u64 normal_deadline;
> unsigned int flags; /* Specifying the scheduler behaviour */
>
> /*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26e4ffa01e7a..16164b0ba80b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4452,9 +4452,11 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> if (!dl_prio(p->normal_prio) ||
> (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
> p->dl.dl_boosted = 1;
> - queue_flag |= ENQUEUE_REPLENISH;
> - } else
> + p->dl.deadline = pi_task->dl.deadline;
> + } else {
> p->dl.dl_boosted = 0;
> + p->dl.deadline = p->dl.normal_deadline;
> + }
> p->sched_class = &dl_sched_class;
> } else if (rt_prio(prio)) {
> if (dl_prio(oldprio))
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 43323f875cb9..0ad7c2797f11 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -674,6 +674,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> * spent on hardirq context, etc.).
> */
> dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> + dl_se->normal_deadline = dl_se->deadline;
> dl_se->runtime = dl_se->dl_runtime;
> }
>
> @@ -709,6 +710,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
> */
> if (dl_se->dl_deadline == 0) {
> dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
> + dl_se->normal_deadline = dl_se->deadline;
> dl_se->runtime = pi_se->dl_runtime;
> }
>
> @@ -723,6 +725,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
> */
> while (dl_se->runtime <= 0) {
> dl_se->deadline += pi_se->dl_period;
> + dl_se->normal_deadline = dl_se->deadline;
> dl_se->runtime += pi_se->dl_runtime;
So, the problem is more related to pi_se->dl_runtime than its deadline.
Even if we don't replenish at the instant in time when boosting happens,
the boosted task might still deplete its runtime while being boosted and
that would cause update_curr_dl() to eventually call
enqueue_task_dl(..., ENQUEUE_REPLENISH) - we don't perform runtime
enforcement on boosted tasks, but still do accounting and 'instant'
replenishment with deadline postponement ('soft CBS'). This in turn will
hit BUG_ON(pi_se->dl_runtime <= 0), as, in a case like Glenn's, N2 and N1
are non-deadline tasks and N1 would be replenishing from N2's (pi_se's)
dl_runtime, finding it to be 0.
Does it make any sense?
On Wed, Nov 13, 2019 at 10:22:41AM +0100, Juri Lelli wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 26e4ffa01e7a..16164b0ba80b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4452,9 +4452,11 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> > if (!dl_prio(p->normal_prio) ||
> > (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
> > p->dl.dl_boosted = 1;
> > - queue_flag |= ENQUEUE_REPLENISH;
> > - } else
> > + p->dl.deadline = pi_task->dl.deadline;
> > + } else {
> > p->dl.dl_boosted = 0;
> > + p->dl.deadline = p->dl.normal_deadline;
> > + }
> > p->sched_class = &dl_sched_class;
> > } else if (rt_prio(prio)) {
> > if (dl_prio(oldprio))
> So, the problem is more related to pi_se->dl_runtime than its deadline.
> Even if we don't replenish at the instant in time when boosting happens,
> the boosted task might still deplete its runtime while being boosted and
I thought we ignored all runtime checks when we were boosted? Yes, that
is all sorts of broken, but IIRC we figured that barring something like
proxy-execution there really wasn't anything sane we could do wrt
bandwidth anyway.
Seeing how proper bandwidth handling would have the boosted task consume
the boostee's budget etc.. And blocking the entire boost chain when it
collectively runs out.
On 13/11/19 10:36, Peter Zijlstra wrote:
> On Wed, Nov 13, 2019 at 10:22:41AM +0100, Juri Lelli wrote:
>
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 26e4ffa01e7a..16164b0ba80b 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4452,9 +4452,11 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
> > > if (!dl_prio(p->normal_prio) ||
> > > (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
> > > p->dl.dl_boosted = 1;
> > > - queue_flag |= ENQUEUE_REPLENISH;
> > > - } else
> > > + p->dl.deadline = pi_task->dl.deadline;
> > > + } else {
> > > p->dl.dl_boosted = 0;
> > > + p->dl.deadline = p->dl.normal_deadline;
> > > + }
> > > p->sched_class = &dl_sched_class;
> > > } else if (rt_prio(prio)) {
> > > if (dl_prio(oldprio))
>
> > So, the problem is more related to pi_se->dl_runtime than its deadline.
> > Even if we don't replenish at the instant in time when boosting happens,
> > the boosted task might still deplete its runtime while being boosted and
>
> I thought we ignored all runtime checks when we were boosted? Yes, that
We don't throttle and replenish instantly when runtime is depleted, but
we still account runtime. See update_curr_dl(), dl_runtime_exceeded()
and the if (unlikely(dl_se->dl_boosted ...)) case.
Mmm, maybe we should stop accounting as well and only postpone
deadlines; is this what you had in mind?
> is all sorts of broken, but IIRC we figured that barring something like
> proxy-execution there really wasn't anything sane we could do wrt
> bandwidth anyway.
>
> Seeing how proper bandwidth handling would have the boosted task consume
> the boostee's budget etc.. And blocking the entire boost chain when it
> collectively runs out.
Yep, this is how it should eventually all work.