2021-12-18 14:25:46

by John Keeping

Subject: [RT] BUG in sched/cpupri.c

Hi,

On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
occasional BUG at cpupri.c:151 (full trace below).

Having added extra logging, it seems that p->prio == 120 which isn't
handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").

This happens maybe half the time as userspace is starting up, but if the
system boots to a login prompt I haven't seen any issues after that.
The process isn't always the same, I've seen systemd-udevd as well as
pr/ttyS2.

I can easily "fix" this by handling normal priority tasks in
convert_prio() but I guess there's some wider reason why that's not an
expected value there, so perhaps the real problem lies elsewhere.
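
For reference, convert_prio() after that commit looks roughly like this
(paraphrased from v5.15 kernel/sched/cpupri.c); a normal task's p->prio of
120 matches none of the cases:

static int convert_prio(int prio)
{
        int cpupri;

        switch (prio) {
        case CPUPRI_INVALID:
                cpupri = CPUPRI_INVALID;        /* -1 */
                break;

        case 0 ... 98:
                cpupri = MAX_RT_PRIO-1 - prio;  /* 1 ... 99 */
                break;

        case MAX_RT_PRIO-1:
                cpupri = CPUPRI_NORMAL;         /*  0 */
                break;

        case MAX_RT_PRIO:
                cpupri = CPUPRI_HIGHER;         /* 100 */
                break;

        default:
                BUG();  /* anything else, e.g. a CFS prio, dies here */
        }

        return cpupri;
}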


Thanks,
John

------------[ cut here ]------------
kernel BUG at kernel/sched/cpupri.c:151!
Internal error: Oops - BUG: 0 [#1] PREEMPT_RT SMP ARM
Modules linked in:
CPU: 1 PID: 117 Comm: pr/ttyS2 Tainted: G B 5.15.10-rt24 #1
Hardware name: Rockchip (Device Tree)
PC is at cpupri_find_fitness+0x78/0x1a4
LR is at cpupri_find_fitness+0x28/0x1a4
pc : [<c0183be8>] lr : [<c0183b98>] psr: 20030193
sp : c3ea38f8 ip : c38461bc fp : 00000001
r10: e7db8b00 r9 : 00000001 r8 : c1ccd880
r7 : c3846180 r6 : c3846180 r5 : 00000000 r4 : e7db13dc
r3 : c0183b90 r2 : 00000000 r1 : 00000007 r0 : 00000078
Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 0296406a DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: non-slab/vmalloc memory
Register r4 information: non-slab/vmalloc memory
Register r5 information: NULL pointer
Register r6 information: slab task_struct start c3846180 pointer offset 0
Register r7 information: slab task_struct start c3846180 pointer offset 0
Register r8 information: slab kmalloc-1k start c1ccd800 pointer offset 128 size 1024
Register r9 information: non-paged memory
Register r10 information: non-slab/vmalloc memory
Register r11 information: non-paged memory
Register r12 information: slab task_struct start c3846180 pointer offset 60
Process pr/ttyS2 (pid: 117, stack limit = 0x(ptrval))
Stack: (0xc3ea38f8 to 0xc3ea4000)
38e0: 187d4724 00000078
3900: 41b58ab3 c11a824a c016ffcc 00000001 e7db13dc c1388b00 c3846180 c1408e04
3920: 00000001 e7db8b00 00000001 c017a378 c017bcd0 e7db9098 e7db8b00 c1d84b00
3940: 00000001 e7db9290 e7db9098 c1ccd858 c1ccd858 c017bcd8 e7db8b00 c1ccd858
3960: e7db9008 c1ccd800 c1408e04 00000008 c1ccd858 c017cfc8 c1ccd85c c1ccd858
3980: 0000002b 0000002a c1ccd85c c026718c 00000000 00000000 c1ccd858 00000000
39a0: c1ccd85c 00000020 c1ccd858 c01e7300 c17a5524 e7db8b40 00000000 0181b72f
39c0: c3846288 00000003 c17a4fe0 00000001 00000013 c0d0ab00 f0802000 26a30000
39e0: c15c4dc0 c01114e0 c1cc7000 c1c3db80 c1409064 00000013 c0111664 f0802000
3a00: 26a30000 c0111678 c1cc7000 c01a05d4 c1cc7000 c3ea0000 00000003 c1409064
3a20: 00000003 f0802000 26a30000 c0198080 00000000 00000000 00000003 c019894c
3a40: c1387850 00000003 c3ea3a80 c06417a0 c3ea3a80 40030193 00000000 c0cda8ec
3a60: 60030013 ffffffff c3ea3ab4 c40893f8 c3ea0000 c1408e04 c3ea3b8c c0100b00
3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001
3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec
3ac0: 60030013 ffffffff 00000051 c0cda8e8 c4088f00 c0167624 c122da30 187d4760
3ae0: c1388b00 80030013 e7db8b00 e7db8b00 c17a5dc0 c02353f0 00000000 00000000
3b00: 41b58ab3 c11a63b8 c01673fc c016580c 00000001 c3ea0004 26a30000 c3ea0000
3b20: c3ea3b84 c023523c c1d84b00 c0cda838 c17a5dc0 c02353f0 00000000 00000000
3b40: 00000000 c3ea0000 c1387020 c0cda8e8 00000001 c01640b4 00000001 c3ea0000
3b60: c3ea3c20 f6251f9e c3ea3b84 c181a338 b77d4774 c3ea3c20 c3ea0000 c3ea0000
3b80: c3ea3be0 c4073d18 c181a344 c0cd5ba8 80030013 c3ea000c 00000001 c3ea0000
3ba0: 41b58ab3 c11a889d c0cd59a8 00000003 c17a5dc0 c02353f0 ffffc000 c0235580
3bc0: c01641c8 c01640c4 00000000 00000001 c3ea0000 40030013 c38463c4 c14b7044
3be0: 00000001 c3ea3be0 c4088f00 c01641dc 40030013 c38463c4 40030013 c3846180
3c00: c14b7098 c0cda91c c14b7040 c01b5cc4 c01b5f2c 00000000 00000000 00000000
3c20: 00000001 00000000 00000000 00000001 c0d39640 f6251f9e 00000000 b77d4790
3c40: c181a338 c3ea3ce0 c3ea3ca0 c181a344 00000000 00000007 c181a4b8 c0cd5dac
3c60: 41b58ab3 c11e5383 c061d930 c023523c c17a5dc0 c02353f0 00000000 00000000
3c80: 41b58ab3 c11a8997 c0cd5cf0 c02353f0 00000000 00000000 00000000 c3ea0000
3ca0: c3846180 c06d95c4 00000001 c3ea0004 26a30000 00000007 c181a4b8 c02340c4
3cc0: c181a338 60030013 00000001 00000007 00000036 00000000 00000007 f6251f9e
3ce0: c181a338 b77d47a4 c181a4c4 c3f59000 00000036 c06ddb94 00000035 00000035
3d00: c3f59000 c3f59000 00000035 c0192a98 00000035 187d47a8 c3ea3e44 000003fe
3d20: 41b58ab3 c11eea53 c06dd92c c01b5eec b77d47ac c1448740 c3ea3dc0 c3ea3d80
3d40: 41b58ab3 c11a8f2c c0192900 c0cd5d9c c161de70 c0197ba4 26a20000 0000003e
3d60: 41b58ab3 c023523c c0cd5cf0 c0cda838 ffffffff c0cda838 00000000 c3ea0000
3d80: c17a5dc0 c02353f0 ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001
3da0: c3ea0000 c3ea3f40 00000000 00000000 00000243 00000036 c3ea3dd4 f6251f9e
3dc0: c3ea3f40 c161de40 c3efed00 c3ea3f40 00000000 00000000 00000243 00000036
3de0: c161de70 c0194fd8 c3ea3e40 000016ca c14d1250 00000000 c3f59000 00000000
3e00: 187d47c4 c3f59000 00000017 c3ea0000 c1387020 c016580c 00000001 c3ea0004
3e20: 41b58ab3 c11a9104 c0194bb0 c02340c4 c1d80000 e7db8b00 00000000 c3ea0000
3e40: c3ea3ea0 c3f59000 00000400 c0165830 00000002 c01b5eec 00000000 00000002
3e60: 00000000 c3846180 c01828b4 c3ea3e6c c3ea3e6c c1d80000 c1512280 c3846188
3e80: c384640c 00000001 c3ea3f54 c0cd01b0 c1d64000 b77d47d8 c3ea3f54 c0167508
3ea0: 00000242 00000000 8bb3f10e 00000001 c2030035 00000001 00000000 00000000
3ec0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3ee0: 00000000 00000000 00000000 00000000 00000000 00000000 c17a5dc0 c02353f0
3f00: ffffc000 c0235580 c01641c8 c01640c4 00000000 00000001 c3ea0000 c3efec80
3f20: 60000013 c3ea000c 000004f8 c3846180 c3ea3f4c c01641dc c3efec80 60000013
3f40: 60000013 ffffc000 c3846180 f6251f9e c3ea0000 c3846180 c3efec80 00000000
3f60: c3846180 c0194bb0 c161de40 c1d67d20 c3846180 c0157eb0 00000000 c3846180
3f80: c3efeca0 c3efec0c 00000000 c3efec00 c0157c74 00000000 00000000 00000000
3fa0: 00000000 00000000 00000000 c01000fc 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[<c0183be8>] (cpupri_find_fitness) from [<c017a378>] (find_lowest_rq+0x1bc/0x258)
[<c017a378>] (find_lowest_rq) from [<c017bcd8>] (push_rt_task.part.0+0xe8/0x23c)
[<c017bcd8>] (push_rt_task.part.0) from [<c017cfc8>] (rto_push_irq_work_func+0x7c/0xd8)
[<c017cfc8>] (rto_push_irq_work_func) from [<c026718c>] (irq_work_single+0x8c/0x140)
[<c026718c>] (irq_work_single) from [<c01e7300>] (flush_smp_call_function_queue+0x238/0x31c)
[<c01e7300>] (flush_smp_call_function_queue) from [<c01114e0>] (do_handle_IPI+0x29c/0x420)
[<c01114e0>] (do_handle_IPI) from [<c0111678>] (ipi_handler+0x14/0x20)
[<c0111678>] (ipi_handler) from [<c01a05d4>] (handle_percpu_devid_irq+0x8c/0x140)
[<c01a05d4>] (handle_percpu_devid_irq) from [<c0198080>] (handle_irq_desc+0x38/0x48)
[<c0198080>] (handle_irq_desc) from [<c019894c>] (handle_domain_irq+0x40/0x54)
[<c019894c>] (handle_domain_irq) from [<c06417a0>] (gic_handle_irq+0x88/0xa0)
[<c06417a0>] (gic_handle_irq) from [<c0100b00>] (__irq_svc+0x60/0xac)
Exception stack(0xc3ea3a80 to 0xc3ea3ac8)
3a80: e7db7020 00000003 00000000 00000000 80030013 b77d4760 00000000 00000001
3aa0: c40893f8 c3ea3b60 c1408e04 c3ea3b8c e7db7020 c3ea3ad0 c02340c4 c0cda8ec
3ac0: 60030013 ffffffff
[<c0100b00>] (__irq_svc) from [<c0cda8ec>] (_raw_spin_unlock_irqrestore+0x1c/0x70)
[<c0cda8ec>] (_raw_spin_unlock_irqrestore) from [<c0167624>] (try_to_wake_up+0x228/0x468)
[<c0167624>] (try_to_wake_up) from [<c0cd5ba8>] (rt_mutex_slowunlock+0x200/0x348)
[<c0cd5ba8>] (rt_mutex_slowunlock) from [<c0cd5dac>] (rt_spin_unlock+0xbc/0x104)
[<c0cd5dac>] (rt_spin_unlock) from [<c06ddb94>] (serial8250_console_write+0x268/0x39c)
[<c06ddb94>] (serial8250_console_write) from [<c0194fd8>] (printk_kthread_func+0x428/0x4c4)
[<c0194fd8>] (printk_kthread_func) from [<c0157eb0>] (kthread+0x23c/0x25c)
[<c0157eb0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc3ea3fb0 to 0xc3ea3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
Code: e1a00008 e28dd014 e8bd4ff0 ea00004a (e7f001f2)
---[ end trace 0000000000000002 ]---


2021-12-20 17:35:28

by Dietmar Eggemann

Subject: Re: [RT] BUG in sched/cpupri.c

On 18.12.21 15:25, John Keeping wrote:
> Hi,
>
> On v5.15.10-rt24 (and earlier v5.15 series RT kernels) I'm seeing an
> occasional BUG at cpupri.c:151 (full trace below).
>
> Having added extra logging, it seems that p->prio == 120 which isn't
> handled by convert_prio() following commit 934fc3314b39 ("sched/cpupri:
> Remap CPUPRI_NORMAL to MAX_RT_PRIO-1").
>
> This happens maybe half the time as userspace is starting up, but if the
> system boots to a login prompt I haven't seen any issues after that.
> The process isn't always the same, I've seen systemd-udevd as well as
> pr/ttyS2.
>
> I can easily "fix" this by handling normal priority tasks in
> convert_prio() but I guess there's some wider reason why that's not an
> expected value there, so perhaps the real problem lies elsewhere.

find_lowest_rq() -> [ cpupri_find() ] -> cpupri_find_fitness() ->
convert_prio(p->prio) can only be called for an RT task (p->prio=[0..98])

I can recreate this on my Juno-r0 Arm64 machine with v5.15.10-rt24,
CONFIG_PREEMPT_RT=y.

------------[ cut here ]------------
[ 15.256482] WARNING: CPU: 3 PID: 35 at kernel/sched/rt.c:1898 push_rt_task.part.0+0x190/0x364
[ 15.256521] Modules linked in:
[ 15.256532] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #14
[ 15.256546] Hardware name: ARM Juno development board (r0) (DT)
[ 15.256552] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 15.256567] pc : push_rt_task.part.0+0x190/0x364
[ 15.256582] lr : push_rt_task.part.0+0x20/0x364
[ 15.256597] sp : ffff800011ff3e20
[ 15.256602] x29: ffff800011ff3e20 x28: ffff0008001ca940 x27: 0000000000000000
[ 15.256622] x26: 0000000000000000 x25: ffff800011f61da0 x24: 0000000000000001
[ 15.256639] x23: 0000000000000000 x22: ffff00097ef923c0 x21: ffff0008000e0dc0
[ 15.256657] x20: ffff0008000e0dc0 x19: ffff00097ef923c0 x18: 0000000000000002
[ 15.256674] x17: ffff80096d7e7000 x16: ffff800011ff4000 x15: 0000000000004000
[ 15.256692] x14: 00000000000000c1 x13: 0000000000000001 x12: 0000000000000000
[ 15.256709] x11: 0000000000000001 x10: 0000000000000001 x9 : 0000000000000400
[ 15.256725] x8 : ffff00097ef924c0 x7 : ffff00097ef92440 x6 : 00000000356c5f2b
[ 15.256743] x5 : ffff80096d7e7000 x4 : 0000000000000000 x3 : 0000000000000003
[ 15.256760] x2 : ffff00097ef923c0 x1 : 0000000000000062 x0 : 0000000000000078
[ 15.256777] Call trace:
[ 15.256783] push_rt_task.part.0+0x190/0x364
[ 15.256799] rto_push_irq_work_func+0x180/0x190
[ 15.256815] irq_work_single+0x34/0xa0
[ 15.256829] flush_smp_call_function_queue+0x138/0x244
[ 15.256845] generic_smp_call_function_single_interrupt+0x18/0x24
[ 15.256860] ipi_handler+0xb0/0x15c
[ 15.256875] handle_percpu_devid_irq+0x88/0x140
[ 15.256890] handle_domain_irq+0x64/0x94
[ 15.256907] gic_handle_irq+0x50/0xf0
[ 15.256924] call_on_irq_stack+0x2c/0x60
[ 15.256937] do_interrupt_handler+0x54/0x60
[ 15.256951] el1_interrupt+0x30/0x80
[ 15.256965] el1h_64_irq_handler+0x1c/0x30
[ 15.256978] el1h_64_irq+0x78/0x7c
[ 15.256989] preempt_schedule_irq+0x40/0x150
[ 15.257004] el1_interrupt+0x60/0x80
[ 15.257016] el1h_64_irq_handler+0x1c/0x30
[ 15.257029] el1h_64_irq+0x78/0x7c
[ 15.257039] run_ksoftirqd+0x94/0xc0
[ 15.257053] smpboot_thread_fn+0x2d8/0x320
[ 15.257066] kthread+0x18c/0x1a0
[ 15.257081] ret_from_fork+0x10/0x20
[ 15.257094] ---[ end trace 0000000000000002 ]---
[ 16.257349] next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/3 35]

root@juno:~# ps -eTo comm,pid,lwp,pri,rtprio,nice,class | more
COMMAND PID LWP PRI RTPRIO NI CLS
..
rcu_preempt 11 11 41 1 - FF
...
ksoftirqd/3 35 35 19 - 0 TS
...

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..798887f1eeff 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
        struct task_struct *push_task = NULL;
        int cpu;
 
+       if (WARN_ON_ONCE(!rt_task(rq->curr))) {
+               printk("next_task=[%s %d] rq->curr=[%s %d]\n",
+                      next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
+       }
+
        if (!pull || rq->push_busy)
                return 0;
 
+       if (!rt_task(rq->curr))
+               return 0;
+
        cpu = find_lowest_rq(rq->curr);
        if (cpu == -1 || cpu == rq->cpu)
                return 0;

The second condition prevents the BUG_ON() in cpupri_find_fitness().
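
For context, rt_task() here is just a priority-range check; roughly, from
include/linux/sched/rt.h:

static inline int rt_prio(int prio)
{
        if (unlikely(prio < MAX_RT_PRIO))
                return 1;

        return 0;
}

static inline int rt_task(struct task_struct *p)
{
        return rt_prio(p->prio);
}

Note it also matches DL tasks (their p->prio is -1), so only a task with a
non-RT prio (>= MAX_RT_PRIO), like the CFS tasks seen here, takes the new
early return.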

[...]

2021-12-21 16:11:44

by Valentin Schneider

Subject: Re: [RT] BUG in sched/cpupri.c

On 20/12/21 18:35, Dietmar Eggemann wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..798887f1eeff 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> struct task_struct *push_task = NULL;
> int cpu;
>
> + if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> + printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> + }
> +
> if (!pull || rq->push_busy)
> return 0;
>
> + if (!rt_task(rq->curr))
> + return 0;
> +

If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
case), that's buggered: we shouldn't be trying to pull RT tasks when we
have queued RT tasks and a less-than-RT current, we should be rescheduling
right now.

I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
CFS task (or straight up sched_setscheduler()):
check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
so I suspect we get to the push/pull callback before getting a
resched (I actually don't see where we'd get a resched in that case other
than at the next tick).
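
For reference, check_class_changed() is roughly this (kernel/sched/core.c);
note that nothing on the switched_from() leg issues a resched_curr():

static inline void check_class_changed(struct rq *rq, struct task_struct *p,
                                       const struct sched_class *prev_class,
                                       int oldprio)
{
        if (prev_class != p->sched_class) {
                if (prev_class->switched_from)
                        prev_class->switched_from(rq, p);

                p->sched_class->switched_to(rq, p);
        } else if (oldprio != p->prio || dl_task(p))
                p->sched_class->prio_changed(rq, p, oldprio);
}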

IOW, feels like we want the below. Unfortunately I can't reproduce the
issue locally (yet), so that's untested.

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
         * this is the right place to try to pull some other one
         * from an overloaded CPU, if any.
         */
-       if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+       if (!task_on_rq_queued(p))
                return;
 
-       deadline_queue_pull_task(rq);
+       if (!rq->dl.dl_nr_running)
+               deadline_queue_pull_task(rq);
+       else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+               resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..1ea2567612fb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
         * we may need to handle the pulling of RT tasks
         * now.
         */
-       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+       if (!task_on_rq_queued(p))
                return;
 
-       rt_queue_pull_task(rq);
+       if (!rq->rt.rt_nr_running)
+               rt_queue_pull_task(rq);
+       else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+               resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)

2021-12-21 16:45:44

by John Keeping

Subject: Re: [RT] BUG in sched/cpupri.c

On Tue, 21 Dec 2021 16:11:34 +0000
Valentin Schneider <[email protected]> wrote:

> On 20/12/21 18:35, Dietmar Eggemann wrote:
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ef8228d19382..798887f1eeff 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> > struct task_struct *push_task = NULL;
> > int cpu;
> >
> > + if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> > + printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> > + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> > + }
> > +
> > if (!pull || rq->push_busy)
> > return 0;
> >
> > + if (!rt_task(rq->curr))
> > + return 0;
> > +
>
> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
> case), that's buggered: we shouldn't be trying to pull RT tasks when we
> have queued RT tasks and a less-than-RT current, we should be rescheduling
> right now.
>
> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
> CFS task (or straight up sched_setscheduler()):
> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
> so I suspect we get to the push/pull callback before getting a
> resched (I actually don't see where we'd get a resched in that case other
> than at the next tick).
>
> IOW, feels like we want the below. Unfortunately I can't reproduce the
> issue locally (yet), so that's untested.

This patch doesn't make any difference for me - I hit the BUG on the
first boot with this applied.

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> * this is the right place to try to pull some other one
> * from an overloaded CPU, if any.
> */
> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> + if (!task_on_rq_queued(p))
> return;
>
> - deadline_queue_pull_task(rq);
> + if (!rq->dl.dl_nr_running)
> + deadline_queue_pull_task(rq);
> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> + resched_curr(rq);
> }
>
> /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..1ea2567612fb 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> * we may need to handle the pulling of RT tasks
> * now.
> */
> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> + if (!task_on_rq_queued(p))
> return;
>
> - rt_queue_pull_task(rq);
> + if (!rq->rt.rt_nr_running)
> + rt_queue_pull_task(rq);
> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> + resched_curr(rq);
> }
>
> void __init init_sched_rt_class(void)


2021-12-21 17:22:46

by Valentin Schneider

Subject: Re: [RT] BUG in sched/cpupri.c

On 21/12/21 16:45, John Keeping wrote:
> On Tue, 21 Dec 2021 16:11:34 +0000
> Valentin Schneider <[email protected]> wrote:
>
>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> > index ef8228d19382..798887f1eeff 100644
>> > --- a/kernel/sched/rt.c
>> > +++ b/kernel/sched/rt.c
>> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
>> > struct task_struct *push_task = NULL;
>> > int cpu;
>> >
>> > + if (WARN_ON_ONCE(!rt_task(rq->curr))) {
>> > + printk("next_task=[%s %d] rq->curr=[%s %d]\n",
>> > + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
>> > + }
>> > +
>> > if (!pull || rq->push_busy)
>> > return 0;
>> >
>> > + if (!rt_task(rq->curr))
>> > + return 0;
>> > +
>>
>> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
>> case), that's buggered: we shouldn't be trying to pull RT tasks when we
>> have queued RT tasks and a less-than-RT current, we should be rescheduling
>> right now.
>>
>> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
>> CFS task (or straight up sched_setscheduler()):
>> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
>> so I suspect we get to the push/pull callback before getting a
>> resched (I actually don't see where we'd get a resched in that case other
>> than at the next tick).
>>
>> IOW, feels like we want the below. Unfortunately I can't reproduce the
>> issue locally (yet), so that's untested.
>
> This patch doesn't make any difference for me - I hit the BUG on the
> first boot with this applied.
>

Thanks for the swift testing!

Did you give Dietmar's patch a try? ISTM it lacks a resched_curr(), but if
we can somehow get to the push IRQ work before rescheduling (which I think
might happen if we try to resched_curr(this_rq)), then we need his
bailout.

2021-12-21 17:42:41

by John Keeping

Subject: Re: [RT] BUG in sched/cpupri.c

On Tue, 21 Dec 2021 17:22:34 +0000
Valentin Schneider <[email protected]> wrote:

> On 21/12/21 16:45, John Keeping wrote:
> > On Tue, 21 Dec 2021 16:11:34 +0000
> > Valentin Schneider <[email protected]> wrote:
> >
> >> On 20/12/21 18:35, Dietmar Eggemann wrote:
> >> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> > index ef8228d19382..798887f1eeff 100644
> >> > --- a/kernel/sched/rt.c
> >> > +++ b/kernel/sched/rt.c
> >> > @@ -1895,9 +1895,17 @@ static int push_rt_task(struct rq *rq, bool pull)
> >> > struct task_struct *push_task = NULL;
> >> > int cpu;
> >> >
> >> > + if (WARN_ON_ONCE(!rt_task(rq->curr))) {
> >> > + printk("next_task=[%s %d] rq->curr=[%s %d]\n",
> >> > + next_task->comm, next_task->pid, rq->curr->comm, rq->curr->pid);
> >> > + }
> >> > +
> >> > if (!pull || rq->push_busy)
> >> > return 0;
> >> >
> >> > + if (!rt_task(rq->curr))
> >> > + return 0;
> >> > +
> >>
> >> If current is a DL/stopper task, why not; if that's CFS (which IIUC is your
> >> case), that's buggered: we shouldn't be trying to pull RT tasks when we
> >> have queued RT tasks and a less-than-RT current, we should be rescheduling
> >> right now.
> >>
> >> I'm thinking this can happen via rt_mutex_setprio() when we demote an RT-boosted
> >> CFS task (or straight up sched_setscheduler()):
> >> check_class_changed()->switched_from_rt() doesn't trigger a resched_curr(),
> >> so I suspect we get to the push/pull callback before getting a
> >> resched (I actually don't see where we'd get a resched in that case other
> >> than at the next tick).
> >>
> >> IOW, feels like we want the below. Unfortunately I can't reproduce the
> >> issue locally (yet), so that's untested.
> >
> > This patch doesn't make any difference for me - I hit the BUG on the
> > first boot with this applied.
> >
>
> Thanks for the swift testing!
>
> Did you give Dietmar's patch a try? ISTM it lacks a resched_curr(), but if
> we can somehow get to the push IRQ work before rescheduling (which I think
> might happen if we try to resched_curr(this_rq)), then we need his
> bailout.

With Dietmar's patch I hit the added WARN_ON_ONCE with:

next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]
next_task=[rcu_preempt 11] rq->curr=[ksoftirqd/1 21]

# ps -eTo comm,pid,lwp,pri,rtprio,nice,class
...
rcu_preempt 11 11 41 1 - FF
...
ksoftirqd/1 21 21 19 - 0 TS

Out of three reproductions, rcu_preempt as next_task is consistent, but
I've seen three different tasks in rq->curr (although all with
SCHED_OTHER).

And as expected, the added early return does stop the BUG.

2021-12-22 17:47:15

by Dietmar Eggemann

Subject: Re: [RT] BUG in sched/cpupri.c

On 21.12.21 17:45, John Keeping wrote:
> On Tue, 21 Dec 2021 16:11:34 +0000
> Valentin Schneider <[email protected]> wrote:
>
>> On 20/12/21 18:35, Dietmar Eggemann wrote:

[...]

>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index fd7c4f972aaf..7d61ceec1a3b 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>> * this is the right place to try to pull some other one
>> * from an overloaded CPU, if any.
>> */
>> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
>> + if (!task_on_rq_queued(p))
>> return;
>>
>> - deadline_queue_pull_task(rq);
>> + if (!rq->dl.dl_nr_running)
>> + deadline_queue_pull_task(rq);
>> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
>> + resched_curr(rq);
>> }
>>
>> /*
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef8228d19382..1ea2567612fb 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>> * we may need to handle the pulling of RT tasks
>> * now.
>> */
>> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> + if (!task_on_rq_queued(p))
>> return;
>>
>> - rt_queue_pull_task(rq);
>> + if (!rq->rt.rt_nr_running)
>> + rt_queue_pull_task(rq);
>> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>> + resched_curr(rq);

switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
rto_push_irq_work_func() -> push_rt_task(rq, true)

seems to be the only way with pull=true.
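
For reference, rto_push_irq_work_func() boils down to this (abridged from
kernel/sched/rt.c, with the rto_lock and rd refcounting elided):

void rto_push_irq_work_func(struct irq_work *work)
{
        struct root_domain *rd =
                container_of(work, struct root_domain, rto_push_work);
        struct rq *rq = this_rq();
        int cpu;

        /* Push away anything pushable on the CPU this IRQ work landed on. */
        if (has_pushable_tasks(rq)) {
                raw_spin_rq_lock(rq);
                while (push_rt_task(rq, true))  /* <-- pull == true */
                        ;
                raw_spin_rq_unlock(rq);
        }

        /* Pass the IPI along to the next RT overloaded queue. */
        cpu = rto_next_cpu(rd);
        if (cpu >= 0)
                irq_work_queue_on(&rd->rto_push_work, cpu);
}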

In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.

[ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
[ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
[ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
^^^^^^^^^^^^^^^^^^^^^^
[ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
[ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
^^^^^^^^^^ ^^^^^^
[ 22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
[ 22.288711] Hardware name: ARM Juno development board (r0) (DT)
[ 22.288718] Call trace:
[ 22.288722] dump_backtrace+0x0/0x1ac
[ 22.288747] show_stack+0x1c/0x70
[ 22.288763] dump_stack_lvl+0x68/0x84
[ 22.288777] dump_stack+0x1c/0x38
[ 22.288788] push_rt_task.part.0+0x364/0x370
[ 22.288805] rto_push_irq_work_func+0x180/0x190
[ 22.288821] irq_work_single+0x34/0xa0
[ 22.288836] flush_smp_call_function_queue+0x138/0x244
[ 22.288852] generic_smp_call_function_single_interrupt+0x18/0x24
[ 22.288867] ipi_handler+0xb0/0x15c
...

What about slightly changing the layout in switched_from_rt() (only lightly tested):


@@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
         * we may need to handle the pulling of RT tasks
         * now.
         */
-       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+       if (!task_on_rq_queued(p))
+               return;
+
+       if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
+               resched_curr(rq);
+               return;
+       }
+
+       if (rq->rt.rt_nr_running)
                return;
 
        rt_queue_pull_task(rq);

2021-12-22 18:45:44

by John Keeping

Subject: Re: [RT] BUG in sched/cpupri.c

On Wed, 22 Dec 2021 18:46:57 +0100
Dietmar Eggemann <[email protected]> wrote:

> On 21.12.21 17:45, John Keeping wrote:
> > On Tue, 21 Dec 2021 16:11:34 +0000
> > Valentin Schneider <[email protected]> wrote:
> >
> >> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index fd7c4f972aaf..7d61ceec1a3b 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >> * this is the right place to try to pull some other one
> >> * from an overloaded CPU, if any.
> >> */
> >> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> >> + if (!task_on_rq_queued(p))
> >> return;
> >>
> >> - deadline_queue_pull_task(rq);
> >> + if (!rq->dl.dl_nr_running)
> >> + deadline_queue_pull_task(rq);
> >> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> >> + resched_curr(rq);
> >> }
> >>
> >> /*
> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index ef8228d19382..1ea2567612fb 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> >> * we may need to handle the pulling of RT tasks
> >> * now.
> >> */
> >> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> >> + if (!task_on_rq_queued(p))
> >> return;
> >>
> >> - rt_queue_pull_task(rq);
> >> + if (!rq->rt.rt_nr_running)
> >> + rt_queue_pull_task(rq);
> >> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> >> + resched_curr(rq);
>
> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
> pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
> rto_push_irq_work_func() -> push_rt_task(rq, true)
>
> seems to be the only way with pull=true.
>
> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>
> [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
> ^^^^^^^^^^^^^^^^^^^^^^
> [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
> ^^^^^^^^^^ ^^^^^^
> [ 22.288698] CPU: 3 PID: 35 Comm: ksoftirqd/3 Not tainted 5.15.10-rt24-dirty #36
> [ 22.288711] Hardware name: ARM Juno development board (r0) (DT)
> [ 22.288718] Call trace:
> [ 22.288722] dump_backtrace+0x0/0x1ac
> [ 22.288747] show_stack+0x1c/0x70
> [ 22.288763] dump_stack_lvl+0x68/0x84
> [ 22.288777] dump_stack+0x1c/0x38
> [ 22.288788] push_rt_task.part.0+0x364/0x370
> [ 22.288805] rto_push_irq_work_func+0x180/0x190
> [ 22.288821] irq_work_single+0x34/0xa0
> [ 22.288836] flush_smp_call_function_queue+0x138/0x244
> [ 22.288852] generic_smp_call_function_single_interrupt+0x18/0x24
> [ 22.288867] ipi_handler+0xb0/0x15c
> ...
>
> What about slightly changing the layout in switched_from_rt() (only lightly tested):

I still see the BUG splat with the patch below applied :-(

> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> * we may need to handle the pulling of RT tasks
> * now.
> */
> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> + if (!task_on_rq_queued(p))
> + return;
> +
> + if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
> + resched_curr(rq);
> + return;
> + }
> +
> + if (rq->rt.rt_nr_running)
> return;
>
> rt_queue_pull_task(rq);


2021-12-22 19:48:45

by Valentin Schneider

Subject: Re: [RT] BUG in sched/cpupri.c

On 22/12/21 18:46, Dietmar Eggemann wrote:
> On 21.12.21 17:45, John Keeping wrote:
>> On Tue, 21 Dec 2021 16:11:34 +0000
>> Valentin Schneider <[email protected]> wrote:
>>
>>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index fd7c4f972aaf..7d61ceec1a3b 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>> * this is the right place to try to pull some other one
>>> * from an overloaded CPU, if any.
>>> */
>>> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
>>> + if (!task_on_rq_queued(p))
>>> return;
>>>
>>> - deadline_queue_pull_task(rq);
>>> + if (!rq->dl.dl_nr_running)
>>> + deadline_queue_pull_task(rq);
>>> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
>>> + resched_curr(rq);
>>> }
>>>
>>> /*
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ef8228d19382..1ea2567612fb 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>>> * we may need to handle the pulling of RT tasks
>>> * now.
>>> */
>>> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>>> + if (!task_on_rq_queued(p))
>>> return;
>>>
>>> - rt_queue_pull_task(rq);
>>> + if (!rq->rt.rt_nr_running)
>>> + rt_queue_pull_task(rq);
>>> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
>>> + resched_curr(rq);
>
> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
> pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
> rto_push_irq_work_func() -> push_rt_task(rq, true)
>
> seems to be the only way with pull=true.
>
> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>
> [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
> ^^^^^^^^^^^^^^^^^^^^^^
> [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
> ^^^^^^^^^^ ^^^^^^

mark_wakeup_next_waiter() first deboosts the previous owner and then wakes
up the next top waiter. Seems like you somehow have the wakeup happen
before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
only pick a CPU that is in rq->rd->rto_mask, which requires having at least
2 RT tasks there...
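
Roughly, from v5.15 kernel/locking/rtmutex.c (abridged), showing that
deboost-then-wakeup ordering:

static void __sched mark_wakeup_next_waiter(struct rt_wake_q_head *wqh,
                                            struct rt_mutex_base *lock)
{
        struct rt_mutex_waiter *waiter;

        raw_spin_lock(&current->pi_lock);

        waiter = rt_mutex_top_waiter(lock);

        /* Remove it from current->pi_waiters and deboost. */
        rt_mutex_dequeue_pi(current, waiter);
        rt_mutex_adjust_prio(current);          /* (1) deboost old owner */
        ...
        rt_mutex_wake_q_add(wqh, waiter);       /* (2) wake the top waiter */
        raw_spin_unlock(&current->pi_lock);
}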

Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
check_preempt_curr() if required, which is good, though I think we are
still missing some for sched_setscheduler() (there are no wakeups
there). So if we just have to live with an IRQ work popping in before we
get to preempt_schedule_irq() (or somesuch), then perhaps the below would
be sufficient.

> What about slightly changing the layout in switched_from_rt() (only lightly tested):
>
>
> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> * we may need to handle the pulling of RT tasks
> * now.
> */
> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> + if (!task_on_rq_queued(p))
> + return;
> +
> + if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
> + resched_curr(rq);
> + return;
> + }
> +
> + if (rq->rt.rt_nr_running)
> return;
>
> rt_queue_pull_task(rq);

If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at
least from RT's perspective; p->sched_class->switched_to() takes care of
the rest).

---

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fd7c4f972aaf..7d61ceec1a3b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
         * this is the right place to try to pull some other one
         * from an overloaded CPU, if any.
         */
-       if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+       if (!task_on_rq_queued(p))
                return;
 
-       deadline_queue_pull_task(rq);
+       if (!rq->dl.dl_nr_running)
+               deadline_queue_pull_task(rq);
+       else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
+               resched_curr(rq);
 }
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ef8228d19382..8f3e3a1367b6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
        if (!next_task)
                return 0;
 
+       /*
+        * It's possible that the next_task slipped in of higher priority than
+        * current, or current has *just* changed priority. If that's the case
+        * just reschedule current.
+        */
+       if (unlikely(next_task->prio < rq->curr->prio)) {
+               resched_curr(rq);
+               return 0;
+       }
+
 retry:
        if (is_migration_disabled(next_task)) {
                struct task_struct *push_task = NULL;
@@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull)
        if (WARN_ON(next_task == rq->curr))
                return 0;
 
-       /*
-        * It's possible that the next_task slipped in of
-        * higher priority than current. If that's the case
-        * just reschedule current.
-        */
-       if (unlikely(next_task->prio < rq->curr->prio)) {
-               resched_curr(rq);
-               return 0;
-       }
-
        /* We might release rq lock */
        get_task_struct(next_task);

@@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
         * we may need to handle the pulling of RT tasks
         * now.
         */
-       if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+       if (!task_on_rq_queued(p))
                return;
 
-       rt_queue_pull_task(rq);
+       if (!rq->rt.rt_nr_running)
+               rt_queue_pull_task(rq);
+       else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
+               resched_curr(rq);
 }
 
 void __init init_sched_rt_class(void)

2021-12-23 11:58:19

by John Keeping

Subject: Re: [RT] BUG in sched/cpupri.c

On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote:
> On 22/12/21 18:46, Dietmar Eggemann wrote:
> > On 21.12.21 17:45, John Keeping wrote:
> >> On Tue, 21 Dec 2021 16:11:34 +0000
> >> Valentin Schneider <[email protected]> wrote:
> >>
> >>> On 20/12/21 18:35, Dietmar Eggemann wrote:
> >
> > [...]
> >
> >>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >>> index fd7c4f972aaf..7d61ceec1a3b 100644
> >>> --- a/kernel/sched/deadline.c
> >>> +++ b/kernel/sched/deadline.c
> >>> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>> * this is the right place to try to pull some other one
> >>> * from an overloaded CPU, if any.
> >>> */
> >>> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> >>> + if (!task_on_rq_queued(p))
> >>> return;
> >>>
> >>> - deadline_queue_pull_task(rq);
> >>> + if (!rq->dl.dl_nr_running)
> >>> + deadline_queue_pull_task(rq);
> >>> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> >>> + resched_curr(rq);
> >>> }
> >>>
> >>> /*
> >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>> index ef8228d19382..1ea2567612fb 100644
> >>> --- a/kernel/sched/rt.c
> >>> +++ b/kernel/sched/rt.c
> >>> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> >>> * we may need to handle the pulling of RT tasks
> >>> * now.
> >>> */
> >>> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> >>> + if (!task_on_rq_queued(p))
> >>> return;
> >>>
> >>> - rt_queue_pull_task(rq);
> >>> + if (!rq->rt.rt_nr_running)
> >>> + rt_queue_pull_task(rq);
> >>> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> >>> + resched_curr(rq);
> >
> > switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
> > pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
> > rto_push_irq_work_func() -> push_rt_task(rq, true)
> >
> > seems to be the only way with pull=true.
> >
> > In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
> >
> > [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
> > [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
> > [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
> > ^^^^^^^^^^^^^^^^^^^^^^
> > [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
> > [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
> > ^^^^^^^^^^ ^^^^^^
>
> mark_wakeup_next_waiter() first deboosts the previous owner and then wakes
> up the next top waiter. Seems like you somehow have the wakeup happen
> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
> 2 RT tasks there...
>
> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
> check_preempt_curr() if required, which is good, though I think we are
> still missing some for sched_setscheduler() (there are no wakeups
> there). So if we just have to live with an IRQ work popping in before we
> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
> be sufficient.

With this patch I ran 100 reboots without hitting the BUG, so it looks
like this is the solution!

If you post this as a patch, feel free to add:

Tested-by: John Keeping <[email protected]>

> ---
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fd7c4f972aaf..7d61ceec1a3b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2467,10 +2467,13 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> * this is the right place to try to pull some other one
> * from an overloaded CPU, if any.
> */
> - if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
> + if (!task_on_rq_queued(p))
> return;
>
> - deadline_queue_pull_task(rq);
> + if (!rq->dl.dl_nr_running)
> + deadline_queue_pull_task(rq);
> + else if (task_current(rq, p) && (p->sched_class < &dl_sched_class))
> + resched_curr(rq);
> }
>
> /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> if (!next_task)
> return 0;
>
> + /*
> + * It's possible that the next_task slipped in of higher priority than
> + * current, or current has *just* changed priority. If that's the case
> + * just reschedule current.
> + */
> + if (unlikely(next_task->prio < rq->curr->prio)) {
> + resched_curr(rq);
> + return 0;
> + }
> +
> retry:
> if (is_migration_disabled(next_task)) {
> struct task_struct *push_task = NULL;
> @@ -1922,16 +1932,6 @@ static int push_rt_task(struct rq *rq, bool pull)
> if (WARN_ON(next_task == rq->curr))
> return 0;
>
> - /*
> - * It's possible that the next_task slipped in of
> - * higher priority than current. If that's the case
> - * just reschedule current.
> - */
> - if (unlikely(next_task->prio < rq->curr->prio)) {
> - resched_curr(rq);
> - return 0;
> - }
> -
> /* We might release rq lock */
> get_task_struct(next_task);
>
> @@ -2322,10 +2322,13 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
> * we may need to handle the pulling of RT tasks
> * now.
> */
> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> + if (!task_on_rq_queued(p))
> return;
>
> - rt_queue_pull_task(rq);
> + if (!rq->rt.rt_nr_running)
> + rt_queue_pull_task(rq);
> + else if (task_current(rq, p) && (p->sched_class < &rt_sched_class))
> + resched_curr(rq);
> }
>
> void __init init_sched_rt_class(void)

2021-12-23 14:05:23

by Valentin Schneider

Subject: Re: [RT] BUG in sched/cpupri.c

On 23/12/21 11:58, John Keeping wrote:
> On Wed, Dec 22, 2021 at 07:48:33PM +0000, Valentin Schneider wrote:
>> mark_wakeup_next_waiter() first deboosts the previous owner and then wakes
>> up the next top waiter. Seems like you somehow have the wakeup happen
>> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
>> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
>> 2 RT tasks there...
>>
>> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
>> check_preempt_curr() if required, which is good, though I think we are
>> still missing some for sched_setscheduler() (there are no wakeups
>> there). So if we just have to live with an IRQ work popping in before we
>> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
>> be sufficient.
>
> With this patch I ran 100 reboots without hitting the BUG, so it looks
> like this is the solution!
>
> If you post this as a patch, feel free to add:
>
> Tested-by: John Keeping <[email protected]>

Thanks!

I still need to convince myself beyond reasonable doubt that this is really
what is happening (esp. the sched_setscheduler() part), so the next episode
will probably air after the break :-)

2022-01-07 10:46:57

by Dietmar Eggemann

Subject: Re: [RT] BUG in sched/cpupri.c

On 22/12/2021 20:48, Valentin Schneider wrote:
> On 22/12/21 18:46, Dietmar Eggemann wrote:
>> On 21.12.21 17:45, John Keeping wrote:
>>> On Tue, 21 Dec 2021 16:11:34 +0000
>>> Valentin Schneider <[email protected]> wrote:
>>>
>>>> On 20/12/21 18:35, Dietmar Eggemann wrote:

[...]

>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>> pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>> rto_push_irq_work_func() -> push_rt_task(rq, true)
>>
>> seems to be the only way with pull=true.
>>
>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>>
>> [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
>> [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
>> [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>> ^^^^^^^^^^^^^^^^^^^^^^
>> [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
>> [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>> ^^^^^^^^^^ ^^^^^^
>
> mark_wakeup_next_waiter() first deboosts the previous owner and then wakes
> up the next top waiter. Seems like you somehow have the wakeup happen
> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
> 2 RT tasks there...

True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2:

mark_wakeup_next_waiter()

(1) /* deboost */
rt_mutex_adjust_prio()

rt_mutex_setprio(current, ...)

rq = __task_rq_lock(current, );
check_class_changed(rq, p, prev_class, oldprio)

switched_from_rt()

if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
return;

rt_queue_pull_task(rq)

queue_balance_callback(rq, ..., pull_rt_task);

pull_rt_task()

tell_cpu_to_push()

*cpu* = rto_next_cpu(rq->rd)
irq_work_queue_on(&rq->rd->rto_push_work, *cpu*)

rto_push_irq_work_func()
push_rt_task(rq, true) <-- !!!

(2) /* waking the top waiter */
rt_mutex_wake_q_add(wqh, waiter);

> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
> check_preempt_curr() if required, which is good, though I think we are
> still missing some for sched_setscheduler() (there are no wakeups
> there). So if we just have to live with an IRQ work popping in before we
> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
> be sufficient.

I think that's the case here but we are on the RT overloaded CPU (*cpu*).

>
>> What about slightly changing the layout in switched_from_rt() (only lightly tested):
>>
>>
>> @@ -2322,7 +2338,15 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
>> * we may need to handle the pulling of RT tasks
>> * now.
>> */
>> - if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
>> + if (!task_on_rq_queued(p))
>> + return;
>> +
>> + if (task_current(rq, p) && (p->sched_class < &rt_sched_class)) {
>> + resched_curr(rq);
>> + return;
>> + }
>> +
>> + if (rq->rt.rt_nr_running)
>> return;
>>
>> rt_queue_pull_task(rq);
>
> If !rq->rt.rt_nr_running then there's no point in issuing a reschedule (at
> least from RT's perspective; p->sched_class->switched_to() takes care of
> the rest)

Right.

[...]

> /*
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ef8228d19382..8f3e3a1367b6 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> if (!next_task)
> return 0;
>
> + /*
> + * It's possible that the next_task slipped in of higher priority than
> + * current, or current has *just* changed priority. If that's the case
> + * just reschedule current.
> + */
> + if (unlikely(next_task->prio < rq->curr->prio)) {
> + resched_curr(rq);
> + return 0;
> + }

IMHO, that's the bit which prevents the BUG.

But this would also prevent the case in which rq->curr is an RT task
with lower prio than next_task.

Also `rq->curr = migration/X` still goes through, which is somewhat fine
since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
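
(Abridged, the relevant bit of find_lowest_rq() in kernel/sched/rt.c; the
fitness variant is only taken on asymmetric CPU capacity systems:)

static int find_lowest_rq(struct task_struct *task)
{
        ...
        if (task->nr_cpus_allowed == 1)
                return -1; /* No other targets possible */

        if (static_branch_unlikely(&sched_asym_cpucapacity)) {
                ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
                                          task, lowest_mask,
                                          rt_task_fits_capacity);
        } else {
                ret = cpupri_find(&task_rq(task)->rd->cpupri,
                                  task, lowest_mask);
        }
        ...
}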

And DL tasks (like sugov:X) go through as well, and they can have
task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
frequency domains under schedutil). cpupri_find_fitness()->convert_prio()
can handle their task_pri, i.e. p->prio = -1 (CPUPRI_INVALID), although
that's somewhat by coincidence.

So maybe something like this:

@@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
        if (!pull || rq->push_busy)
                return 0;
 
+       if (rq->curr->sched_class != &rt_sched_class) {
+               resched_curr(rq);
+               return 0;
+       }
+
        cpu = find_lowest_rq(rq->curr);

[...]

2022-01-07 11:49:55

by John Keeping

Subject: Re: [RT] BUG in sched/cpupri.c

On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
> On 22/12/2021 20:48, Valentin Schneider wrote:
> > /*
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index ef8228d19382..8f3e3a1367b6 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> > if (!next_task)
> > return 0;
> >
> > + /*
> > + * It's possible that the next_task slipped in of higher priority than
> > + * current, or current has *just* changed priority. If that's the case
> > + * just reschedule current.
> > + */
> > + if (unlikely(next_task->prio < rq->curr->prio)) {
> > + resched_curr(rq);
> > + return 0;
> > + }
>
> IMHO, that's the bit which prevents the BUG.
>
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.
>
> Also `rq->curr = migration/X` still goes through, which is somewhat fine
> since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
>
> And DL tasks (like sugov:X) go through as well, and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> frequency domains under schedutil). cpupri_find_fitness()->convert_prio()
> can handle their task_pri, i.e. p->prio = -1 (CPUPRI_INVALID), although
> that's somewhat by coincidence.
>
> So maybe something like this:

Do you mean to replace just the one hunk from Valentin's patch with the
change below (keeping the rest), or are you saying that only the change
below is needed?

> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
> if (!pull || rq->push_busy)
> return 0;
>
> + if (rq->curr->sched_class != &rt_sched_class) {
> + resched_curr(rq);
> + return 0;
> + }
> +
> cpu = find_lowest_rq(rq->curr);
>
> [...]

2022-01-07 14:25:32

by Dietmar Eggemann

Subject: Re: [RT] BUG in sched/cpupri.c

On 07/01/2022 12:49, John Keeping wrote:
> On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
>> On 22/12/2021 20:48, Valentin Schneider wrote:
>>> /*
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index ef8228d19382..8f3e3a1367b6 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>>> if (!next_task)
>>> return 0;
>>>
>>> + /*
>>> + * It's possible that the next_task slipped in of higher priority than
>>> + * current, or current has *just* changed priority. If that's the case
>>> + * just reschedule current.
>>> + */
>>> + if (unlikely(next_task->prio < rq->curr->prio)) {
>>> + resched_curr(rq);
>>> + return 0;
>>> + }
>>
>> IMHO, that's the bit which prevents the BUG.
>>
>> But this would also prevent the case in which rq->curr is an RT task
>> with lower prio than next_task.
>>
>> Also `rq->curr = migration/X` still goes through, which is somewhat fine
>> since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
>>
>> And DL tasks (like sugov:X) go through as well, and they can have
>> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
>> frequency domains under schedutil). cpupri_find_fitness()->convert_prio()
>> can handle their task_pri, i.e. p->prio = -1 (CPUPRI_INVALID), although
>> that's somewhat by coincidence.
>>
>> So maybe something like this:
>
> Do you mean to replace just the one hunk from Valentin's patch with the
> change below (keeping the rest), or are you saying that only the change
> below is needed?

The latter.

I think Valentin wanted to see if something like this can also occur via
sched_setscheduler(), and maybe for this, changes in switched_from_[rt/dl]
will be necessary.

>> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
>> if (!pull || rq->push_busy)
>> return 0;
>>
>> + if (rq->curr->sched_class != &rt_sched_class) {
>> + resched_curr(rq);
>> + return 0;
>> + }
>> +
>> cpu = find_lowest_rq(rq->curr);
>>
>> [...]

2022-01-07 18:36:20

by John Keeping

Subject: Re: [RT] BUG in sched/cpupri.c

On Fri, Jan 07, 2022 at 03:25:21PM +0100, Dietmar Eggemann wrote:
> On 07/01/2022 12:49, John Keeping wrote:
> > On Fri, Jan 07, 2022 at 11:46:45AM +0100, Dietmar Eggemann wrote:
> >> On 22/12/2021 20:48, Valentin Schneider wrote:
> >>> /*
> >>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>> index ef8228d19382..8f3e3a1367b6 100644
> >>> --- a/kernel/sched/rt.c
> >>> +++ b/kernel/sched/rt.c
> >>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
> >>> if (!next_task)
> >>> return 0;
> >>>
> >>> + /*
> >>> + * It's possible that the next_task slipped in of higher priority than
> >>> + * current, or current has *just* changed priority. If that's the case
> >>> + * just reschedule current.
> >>> + */
> >>> + if (unlikely(next_task->prio < rq->curr->prio)) {
> >>> + resched_curr(rq);
> >>> + return 0;
> >>> + }
> >>
> >> IMHO, that's the bit which prevents the BUG.
> >>
> >> But this would also prevent the case in which rq->curr is an RT task
> >> with lower prio than next_task.
> >>
> >> Also `rq->curr = migration/X` still goes through, which is somewhat fine
> >> since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
> >>
> >> And DL tasks (like sugov:X) go through as well, and they can have
> >> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> >> frequency domains under schedutil). cpupri_find_fitness()->convert_prio()
> >> can handle their task_pri, i.e. p->prio = -1 (CPUPRI_INVALID), although
> >> that's somewhat by coincidence.
> >>
> >> So maybe something like this:
> >
> > Do you mean to replace just the one hunk from Valentin's patch with the
> > change below (keeping the rest), or are you saying that only the change
> > below is needed?
>
> The latter.

Thanks! I tested the patch below and can confirm that I no longer see
any BUGs with this applied.

Tested-by: John Keeping <[email protected]>

--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
        if (!pull || rq->push_busy)
                return 0;
 
+       if (rq->curr->sched_class != &rt_sched_class) {
+               resched_curr(rq);
+               return 0;
+       }
+
        cpu = find_lowest_rq(rq->curr);
        if (cpu == -1 || cpu == rq->cpu)
                return 0;

2022-01-14 23:01:32

by Valentin Schneider

Subject: Re: [RT] BUG in sched/cpupri.c


Trying to page this back in...

On 07/01/22 11:46, Dietmar Eggemann wrote:
> On 22/12/2021 20:48, Valentin Schneider wrote:
>> On 22/12/21 18:46, Dietmar Eggemann wrote:
>>> On 21.12.21 17:45, John Keeping wrote:
>>>> On Tue, 21 Dec 2021 16:11:34 +0000
>>>> Valentin Schneider <[email protected]> wrote:
>>>>
>>>>> On 20/12/21 18:35, Dietmar Eggemann wrote:
>
> [...]
>
>>> switched_from_rt() -> rt_queue_pull_task(, pull_rt_task)
>>> pull_rt_task()->tell_cpu_to_push()->irq_work_queue_on(&rq->rd->rto_push_work,)
>>> rto_push_irq_work_func() -> push_rt_task(rq, true)
>>>
>>> seems to be the only way with pull=true.
>>>
>>> In my tests, rq->rt.rt_nr_running seems to be 0 when it happens.
>>>
>>> [ 22.288537] CPU3 switched_to_rt: p=[ksoftirqd/3 35]
>>> [ 22.288554] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] pi_task=[rcu_preempt 11] queued=1 running=0 prio=98 oldprio=120
>>> [ 22.288636] CPU3 switched_from_rt: p=[ksoftirqd/3 35] rq->rt.rt_nr_running=0
>>> ^^^^^^^^^^^^^^^^^^^^^^
>>> [ 22.288649] rt_mutex_setprio: CPU3 p=[ksoftirqd/3 35] queued=1 running=1 prio=120 oldprio=98
>>> [ 22.288681] CPU3 push_rt_task: next_task=[rcu_preempt 11] migr_dis=1 rq->curr=[ksoftirqd/3 35] pull=1
>>> ^^^^^^^^^^ ^^^^^^
>>
>> mark_wakeup_next_waiter() first deboosts the previous owner and then wakes
>> up the next top waiter. Seems like you somehow have the wakeup happen
>> before the push_rt_task IRQ work is run. Also, tell_cpu_to_push() should
>> only pick a CPU that is in rq->rd->rto_mask, which requires having at least
>> 2 RT tasks there...
>
> True, this_cpu has rt_nr_running = 0 and *cpu* has rt_nr_running >= 2:
>
> mark_wakeup_next_waiter()
>
> (1) /* deboost */
> rt_mutex_adjust_prio()
>
> rt_mutex_setprio(current, ...)
>
> rq = __task_rq_lock(current, );
> check_class_changed(rq, p, prev_class, oldprio)
>
> switched_from_rt()
>
> if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
> return;
>
> rt_queue_pull_task(rq)
>
> queue_balance_callback(rq, ..., pull_rt_task);
>
> pull_rt_task()
>
> tell_cpu_to_push()
>
> *cpu* = rto_next_cpu(rq->rd)
> irq_work_queue_on(&rq->rd->rto_push_work, *cpu*)
>
> rto_push_irq_work_func()
> push_rt_task(rq, true) <-- !!!
>
> (2) /* waking the top waiter */
> rt_mutex_wake_q_add(wqh, waiter);
>
>> Now, that wakeup from the rtmutex unlock would give us a resched_curr() via
>> check_preempt_curr() if required, which is good, though I think we are
>> still missing some for sched_setscheduler() (there are no wakeups
>> there). So if we just have to live with an IRQ work popping in before we
>> get to preempt_schedule_irq() (or somesuch), then perhaps the below would
>> be sufficient.
>
> I think that's the case here but we are on the RT overloaded CPU (*cpu*).
>

So one thing I wasn't entirely clear on (and holidays didn't fix that) is
that the rt_queue_pull_task() from switched_from_rt() only happens if that
rq has no other runnable RT tasks, so I don't quite see how the
irq_work_queue_on() can end up as a self-IPI...

>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index ef8228d19382..8f3e3a1367b6 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1890,6 +1890,16 @@ static int push_rt_task(struct rq *rq, bool pull)
>> if (!next_task)
>> return 0;
>>
>> + /*
>> + * It's possible that the next_task slipped in of higher priority than
>> + * current, or current has *just* changed priority. If that's the case
>> + * just reschedule current.
>> + */
>> + if (unlikely(next_task->prio < rq->curr->prio)) {
>> + resched_curr(rq);
>> + return 0;
>> + }
>
> IMHO, that's the bit which prevents the BUG.
>
> But this would also prevent the case in which rq->curr is an RT task
> with lower prio than next_task.
>

I think that's what we want - if current isn't the (logical) highest
priority task on the runqueue, we should forgo push/pull and reschedule
ASAP.

> Also `rq->curr = migration/X` still goes through, which is somewhat fine
> since find_lowest_rq() bails out if (task->nr_cpus_allowed == 1).
>
> And DL tasks (like sugov:X) go through as well, and they can have
> task->nr_cpus_allowed > 1 (arm64 slow-switching boards with shared
> frequency domains under schedutil). cpupri_find_fitness()->convert_prio()
> can handle their task_pri, i.e. p->prio = -1 (CPUPRI_INVALID), although
> that's somewhat by coincidence.
>

Right. This reminds me of:
https://lore.kernel.org/lkml/[email protected]/


> So maybe something like this:
>

Ah, so you explicitly prevent rt.c::find_lowest_rq() invocations with a
non-RT task... But what if current is an RT task that just got deboosted,
so that next_task->prio < rq->curr->prio? IMO we should reschedule ASAP (as
I already blabbered about above). If next_task is migration_disabled but
higher (logical) prio than current, we don't need to do any of the
migration_disabled specific crud, we just reschedule.

> @@ -1898,6 +1898,11 @@ static int push_rt_task(struct rq *rq, bool pull)
> if (!pull || rq->push_busy)
> return 0;
>
> + if (rq->curr->sched_class != &rt_sched_class) {
> + resched_curr(rq);
> + return 0;
> + }
> +
> cpu = find_lowest_rq(rq->curr);
>
> [...]