2020-09-21 17:02:42

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 0/9] sched: Migrate disable support

Hi,

Here's my take on migrate_disable(). It avoids growing a second means of
changing the affinity, and documents how the thing violates locking rules but
still mostly works.

It also avoids blocking completely, so no more futex band-aids required.

Also, no more atomics/locks on the fast path.

I also put in a comment that explains how the whole concept is fundamentally
flawed but a necessary evil to get PREEMPT_RT going -- for now.

Somewhat tested with PREEMPT_RT.
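
To illustrate the lock-free fast path: migrate_disable() is essentially a
per-task nesting counter, and only the 0 -> 1 transition touches the
runqueue. A rough sketch of the idea (not the literal patch):

static inline void migrate_disable(void)
{
	struct task_struct *p = current;

	if (p->migration_disabled) {
		p->migration_disabled++;	/* nested: pure fast path */
		return;
	}

	preempt_disable();
	this_rq()->nr_pinned++;		/* read by rq_has_pinned_tasks() */
	p->migration_disabled = 1;
	preempt_enable();
}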

---
include/linux/cpuhotplug.h | 1
include/linux/preempt.h | 60 +++
include/linux/sched.h | 4
include/linux/sched/hotplug.h | 2
include/linux/stop_machine.h | 5
kernel/cpu.c | 9
kernel/sched/core.c | 673 +++++++++++++++++++++++++++++++-----------
kernel/sched/deadline.c | 5
kernel/sched/sched.h | 24 +
kernel/stop_machine.c | 23 +
lib/dump_stack.c | 2
lib/smp_processor_id.c | 5
12 files changed, 635 insertions(+), 178 deletions(-)


2020-09-25 09:16:25

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support

On 21/09/2020 18:35, Peter Zijlstra wrote:
> Hi,
>
> Here's my take on migrate_disable(). It avoids growing a second means of
> changing the affinity, and documents how the thing violates locking rules but
> still mostly works.
>
> It also avoids blocking completely, so no more futex band-aids required.
>
> Also, no more atomics/locks on the fast path.
>
> I also put in a comment that explains how the whole concept is fundamentally
> flawed but a necessary evil to get PREEMPT_RT going -- for now.
>
> Somewhat tested with PREEMPT_RT.
>
> ---
> include/linux/cpuhotplug.h | 1
> include/linux/preempt.h | 60 +++
> include/linux/sched.h | 4
> include/linux/sched/hotplug.h | 2
> include/linux/stop_machine.h | 5
> kernel/cpu.c | 9
> kernel/sched/core.c | 673 +++++++++++++++++++++++++++++++-----------
> kernel/sched/deadline.c | 5
> kernel/sched/sched.h | 24 +
> kernel/stop_machine.c | 23 +
> lib/dump_stack.c | 2
> lib/smp_processor_id.c | 5
> 12 files changed, 635 insertions(+), 178 deletions(-)
>

I get this when running 6 (periodic) RT50 tasks with CPU hp stress on my
6 CPU JUNO board (!CONFIG_PREEMPT_RT).

[ 55.490263] ------------[ cut here ]------------
[ 55.505261] Modules linked in:
[ 55.508322] CPU: 3 PID: 24 Comm: migration/3 Not tainted 5.9.0-rc1-00132-gc096e6406c50-dirty #90
[ 55.517119] Hardware name: ARM Juno development board (r0) (DT)
[ 55.523058] Stopper: multi_cpu_stop+0x0/0x170 <- 0x0
[ 55.528029] pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
[ 55.533612] pc : sched_cpu_dying+0x124/0x130
[ 55.537887] lr : sched_cpu_dying+0xd8/0x130
[ 55.542071] sp : ffff800011f0bca0
[ 55.545385] x29: ffff800011f0bca0 x28: 0000000000000002
[ 55.550703] x27: 0000000000000000 x26: 0000000000000060
[ 55.556022] x25: 0000000000000000 x24: 0000000000000001
[ 55.561340] x23: 0000000000000000 x22: 0000000000000003
[ 55.566659] x21: 0000000000000080 x20: 0000000000000003
[ 55.571977] x19: ffff00097ef9e1c0 x18: 0000000000000010
[ 55.577295] x17: 0000000000000000 x16: 0000000000000000
[ 55.582613] x15: 0000000000000000 x14: 000000000000015c
[ 55.587932] x13: 0000000000000000 x12: 00000000000006f1
[ 55.593250] x11: 0000000000000080 x10: 0000000000000000
[ 55.598567] x9 : 0000000000000003 x8 : ffff0009743f5900
[ 55.603886] x7 : 0000000000000003 x6 : 0000000000000000
[ 55.609204] x5 : 0000000000000001 x4 : 0000000000000002
[ 55.614521] x3 : 0000000000000000 x2 : 0000000000000013
[ 55.619839] x1 : 0000000000000008 x0 : 0000000000000003
[ 55.625158] Call trace:
[ 55.627607] sched_cpu_dying+0x124/0x130
[ 55.631535] cpuhp_invoke_callback+0x88/0x210
[ 55.635897] take_cpu_down+0x7c/0xd8
[ 55.639475] multi_cpu_stop+0xac/0x170
[ 55.643227] cpu_stopper_thread+0x98/0x130
[ 55.647327] smpboot_thread_fn+0x1c4/0x280
[ 55.651427] kthread+0x140/0x160
[ 55.654658] ret_from_fork+0x10/0x34
[ 55.658239] Code: f000e1c1 913fc021 1400034a 17ffffde (d4210000)
[ 55.664342] ---[ end trace c5b8988b7b701e56 ]---
[ 55.668963] note: migration/3[24] exited with preempt_count 3

7309 int sched_cpu_dying(unsigned int cpu)
...
BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
...

rq->nr_running is always 2 here in these cases.

balance_hotplug_wait() and sched_cpu_wait_empty() run in cpuhp/X (CFS),
whereas sched_cpu_dying() runs in migration/X?
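
For reference, the second half of that BUG_ON() checks the pin count that
migrate_disable() maintains; roughly (a sketch following the series):

static inline bool rq_has_pinned_tasks(struct rq *rq)
{
	/* non-zero while any migrate_disable()d task lives on this rq */
	return rq->nr_pinned;
}

Here it is the first half, rq->nr_running != 1, that fires.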

2020-09-25 10:13:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support

On Fri, Sep 25, 2020 at 11:12:09AM +0200, Dietmar Eggemann wrote:

> I get this when running 6 (periodic) RT50 tasks with CPU hp stress on my
> 6 CPU JUNO board (!CONFIG_PREEMPT_RT).
>
> [ 55.490263] ------------[ cut here ]------------
> [ 55.505261] Modules linked in:
> [ 55.508322] CPU: 3 PID: 24 Comm: migration/3 Not tainted 5.9.0-rc1-00132-gc096e6406c50-dirty #90
> [ 55.517119] Hardware name: ARM Juno development board (r0) (DT)
> [ 55.523058] Stopper: multi_cpu_stop+0x0/0x170 <- 0x0
> [ 55.528029] pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
> [ 55.533612] pc : sched_cpu_dying+0x124/0x130
> [ 55.537887] lr : sched_cpu_dying+0xd8/0x130
> [ 55.542071] sp : ffff800011f0bca0
> [ 55.545385] x29: ffff800011f0bca0 x28: 0000000000000002
> [ 55.550703] x27: 0000000000000000 x26: 0000000000000060
> [ 55.556022] x25: 0000000000000000 x24: 0000000000000001
> [ 55.561340] x23: 0000000000000000 x22: 0000000000000003
> [ 55.566659] x21: 0000000000000080 x20: 0000000000000003
> [ 55.571977] x19: ffff00097ef9e1c0 x18: 0000000000000010
> [ 55.577295] x17: 0000000000000000 x16: 0000000000000000
> [ 55.582613] x15: 0000000000000000 x14: 000000000000015c
> [ 55.587932] x13: 0000000000000000 x12: 00000000000006f1
> [ 55.593250] x11: 0000000000000080 x10: 0000000000000000
> [ 55.598567] x9 : 0000000000000003 x8 : ffff0009743f5900
> [ 55.603886] x7 : 0000000000000003 x6 : 0000000000000000
> [ 55.609204] x5 : 0000000000000001 x4 : 0000000000000002
> [ 55.614521] x3 : 0000000000000000 x2 : 0000000000000013
> [ 55.619839] x1 : 0000000000000008 x0 : 0000000000000003
> [ 55.625158] Call trace:
> [ 55.627607] sched_cpu_dying+0x124/0x130
> [ 55.631535] cpuhp_invoke_callback+0x88/0x210
> [ 55.635897] take_cpu_down+0x7c/0xd8
> [ 55.639475] multi_cpu_stop+0xac/0x170
> [ 55.643227] cpu_stopper_thread+0x98/0x130
> [ 55.647327] smpboot_thread_fn+0x1c4/0x280
> [ 55.651427] kthread+0x140/0x160
> [ 55.654658] ret_from_fork+0x10/0x34
> [ 55.658239] Code: f000e1c1 913fc021 1400034a 17ffffde (d4210000)
> [ 55.664342] ---[ end trace c5b8988b7b701e56 ]---
> [ 55.668963] note: migration/3[24] exited with preempt_count 3
>
> 7309 int sched_cpu_dying(unsigned int cpu)
> ...
> BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> ...
>
> rq->nr_running is always 2 here in these cases.
>
> balance_hotplug_wait() and sched_cpu_wait_empty() run in cpuhp/X (CFS),
> whereas sched_cpu_dying() runs in migration/X?

takedown_cpu() has:

kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);

before calling:

err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));

So when we get to sched_cpu_dying(), the only task that _should_ still
be there is migration/X.

Do you have any idea what thread, other than migration/X, is still
active on that CPU? per sched_cpu_wait_empty() we should've pushed out
all userspace tasks, and the cpu hotplug machinery should've put all the
per-cpu kthreads to sleep at this point.
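
For reference, the wait step mentioned above looks roughly like this in the
series (a sketch; sched_cpu_wait_empty() runs from the CFS-class cpuhp/X
thread, long before the stopper runs sched_cpu_dying()):

static void balance_hotplug_wait(void)
{
	struct rq *rq = this_rq();

	/* block until only the stopper remains and nothing is pinned */
	rcuwait_wait_event(&rq->hotplug_wait,
			   rq->nr_running == 1 && !rq_has_pinned_tasks(rq),
			   TASK_UNINTERRUPTIBLE);
}

int sched_cpu_wait_empty(unsigned int cpu)
{
	balance_hotplug_wait();
	return 0;
}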

2020-09-25 11:59:31

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support

On 25/09/2020 12:10, Peter Zijlstra wrote:
> On Fri, Sep 25, 2020 at 11:12:09AM +0200, Dietmar Eggemann wrote:
>
>> I get this when running 6 (periodic) RT50 tasks with CPU hp stress on my
>> 6 CPU JUNO board (!CONFIG_PREEMPT_RT).
>>
>> [ 55.490263] ------------[ cut here ]------------
>> [ 55.505261] Modules linked in:
>> [ 55.508322] CPU: 3 PID: 24 Comm: migration/3 Not tainted 5.9.0-rc1-00132-gc096e6406c50-dirty #90
>> [ 55.517119] Hardware name: ARM Juno development board (r0) (DT)
>> [ 55.523058] Stopper: multi_cpu_stop+0x0/0x170 <- 0x0
>> [ 55.528029] pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
>> [ 55.533612] pc : sched_cpu_dying+0x124/0x130
>> [ 55.537887] lr : sched_cpu_dying+0xd8/0x130
>> [ 55.542071] sp : ffff800011f0bca0
>> [ 55.545385] x29: ffff800011f0bca0 x28: 0000000000000002
>> [ 55.550703] x27: 0000000000000000 x26: 0000000000000060
>> [ 55.556022] x25: 0000000000000000 x24: 0000000000000001
>> [ 55.561340] x23: 0000000000000000 x22: 0000000000000003
>> [ 55.566659] x21: 0000000000000080 x20: 0000000000000003
>> [ 55.571977] x19: ffff00097ef9e1c0 x18: 0000000000000010
>> [ 55.577295] x17: 0000000000000000 x16: 0000000000000000
>> [ 55.582613] x15: 0000000000000000 x14: 000000000000015c
>> [ 55.587932] x13: 0000000000000000 x12: 00000000000006f1
>> [ 55.593250] x11: 0000000000000080 x10: 0000000000000000
>> [ 55.598567] x9 : 0000000000000003 x8 : ffff0009743f5900
>> [ 55.603886] x7 : 0000000000000003 x6 : 0000000000000000
>> [ 55.609204] x5 : 0000000000000001 x4 : 0000000000000002
>> [ 55.614521] x3 : 0000000000000000 x2 : 0000000000000013
>> [ 55.619839] x1 : 0000000000000008 x0 : 0000000000000003
>> [ 55.625158] Call trace:
>> [ 55.627607] sched_cpu_dying+0x124/0x130
>> [ 55.631535] cpuhp_invoke_callback+0x88/0x210
>> [ 55.635897] take_cpu_down+0x7c/0xd8
>> [ 55.639475] multi_cpu_stop+0xac/0x170
>> [ 55.643227] cpu_stopper_thread+0x98/0x130
>> [ 55.647327] smpboot_thread_fn+0x1c4/0x280
>> [ 55.651427] kthread+0x140/0x160
>> [ 55.654658] ret_from_fork+0x10/0x34
>> [ 55.658239] Code: f000e1c1 913fc021 1400034a 17ffffde (d4210000)
>> [ 55.664342] ---[ end trace c5b8988b7b701e56 ]---
>> [ 55.668963] note: migration/3[24] exited with preempt_count 3
>>
>> 7309 int sched_cpu_dying(unsigned int cpu)
>> ...
>> BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
>> ...
>>
>> rq->nr_running is always 2 here in these cases.
>>
>> balance_hotplug_wait() and sched_cpu_wait_empty() run in cpuhp/X (CFS),
>> whereas sched_cpu_dying() runs in migration/X?
>
> takedown_cpu() has:
>
> kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
>
> before calling:
>
> err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
>
> So when we get to sched_cpu_dying(), the only task that _should_ still
> be there is migration/X.
>
> Do you have any idea what thread, other than migration/X, is still
> active on that CPU? per sched_cpu_wait_empty() we should've pushed out
> all userspace tasks, and the cpu hotplug machinery should've put all the
> per-cpu kthreads to sleep at this point.

With Valentin's print_rq()-inspired test snippet I always see one of the
RT user tasks as the second guy. BTW, it has to be RT tasks; this never
triggered with CFS tasks.

[ 57.849268] CPU2 nr_running=2
[ 57.852241] p=migration/2
[ 57.854967] p=task0-0
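
The snippet itself was shared off-list; a minimal reconstruction of what
such a helper could look like (names and placement hypothetical), called
from sched_cpu_dying() just before the BUG_ON():

static void print_rq(struct rq *rq)
{
	struct task_struct *g, *p;

	pr_crit("CPU%d nr_running=%d\n", cpu_of(rq), rq->nr_running);

	/* walk all threads, print whatever is still queued on this rq;
	 * a real version would want rcu_read_lock() around the walk */
	for_each_process_thread(g, p) {
		if (task_cpu(p) == cpu_of(rq) && task_on_rq_queued(p))
			pr_crit("\tp=%s\n", p->comm);
	}
}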

2020-09-25 12:21:13

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support


On 25/09/20 12:58, Dietmar Eggemann wrote:
> On 25/09/2020 12:10, Peter Zijlstra wrote:
>> On Fri, Sep 25, 2020 at 11:12:09AM +0200, Dietmar Eggemann wrote:
>>
>>> I get this when running 6 (periodic) RT50 tasks with CPU hp stress on my
>>> 6 CPU JUNO board (!CONFIG_PREEMPT_RT).
>>>
>>> [ 55.490263] ------------[ cut here ]------------
>>> [ 55.505261] Modules linked in:
>>> [ 55.508322] CPU: 3 PID: 24 Comm: migration/3 Not tainted 5.9.0-rc1-00132-gc096e6406c50-dirty #90
>>> [ 55.517119] Hardware name: ARM Juno development board (r0) (DT)
>>> [ 55.523058] Stopper: multi_cpu_stop+0x0/0x170 <- 0x0
>>> [ 55.528029] pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
>>> [ 55.533612] pc : sched_cpu_dying+0x124/0x130
>>> [ 55.537887] lr : sched_cpu_dying+0xd8/0x130
>>> [ 55.542071] sp : ffff800011f0bca0
>>> [ 55.545385] x29: ffff800011f0bca0 x28: 0000000000000002
>>> [ 55.550703] x27: 0000000000000000 x26: 0000000000000060
>>> [ 55.556022] x25: 0000000000000000 x24: 0000000000000001
>>> [ 55.561340] x23: 0000000000000000 x22: 0000000000000003
>>> [ 55.566659] x21: 0000000000000080 x20: 0000000000000003
>>> [ 55.571977] x19: ffff00097ef9e1c0 x18: 0000000000000010
>>> [ 55.577295] x17: 0000000000000000 x16: 0000000000000000
>>> [ 55.582613] x15: 0000000000000000 x14: 000000000000015c
>>> [ 55.587932] x13: 0000000000000000 x12: 00000000000006f1
>>> [ 55.593250] x11: 0000000000000080 x10: 0000000000000000
>>> [ 55.598567] x9 : 0000000000000003 x8 : ffff0009743f5900
>>> [ 55.603886] x7 : 0000000000000003 x6 : 0000000000000000
>>> [ 55.609204] x5 : 0000000000000001 x4 : 0000000000000002
>>> [ 55.614521] x3 : 0000000000000000 x2 : 0000000000000013
>>> [ 55.619839] x1 : 0000000000000008 x0 : 0000000000000003
>>> [ 55.625158] Call trace:
>>> [ 55.627607] sched_cpu_dying+0x124/0x130
>>> [ 55.631535] cpuhp_invoke_callback+0x88/0x210
>>> [ 55.635897] take_cpu_down+0x7c/0xd8
>>> [ 55.639475] multi_cpu_stop+0xac/0x170
>>> [ 55.643227] cpu_stopper_thread+0x98/0x130
>>> [ 55.647327] smpboot_thread_fn+0x1c4/0x280
>>> [ 55.651427] kthread+0x140/0x160
>>> [ 55.654658] ret_from_fork+0x10/0x34
>>> [ 55.658239] Code: f000e1c1 913fc021 1400034a 17ffffde (d4210000)
>>> [ 55.664342] ---[ end trace c5b8988b7b701e56 ]---
>>> [ 55.668963] note: migration/3[24] exited with preempt_count 3
>>>
>>> 7309 int sched_cpu_dying(unsigned int cpu)
>>> ...
>>> BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
>>> ...
>>>
>>> rq->nr_running is always 2 here in these cases.
>>>
>>> balance_hotplug_wait() and sched_cpu_wait_empty() run in cpuhp/X (CFS),
>>> whereas sched_cpu_dying() runs in migration/X?
>>
>> takedown_cpu() has:
>>
>> kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
>>
>> before calling:
>>
>> err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
>>
>> So when we get to sched_cpu_dying(), the only task that _should_ still
>> be there is migration/X.
>>
>> Do you have any idea what thread, other than migration/X, is still
>> active on that CPU? per sched_cpu_wait_empty() we should've pushed out
>> all userspace tasks, and the cpu hotplug machinery should've put all the
>> per-cpu kthreads to sleep at this point.
>
> With Valentin's print_rq()-inspired test snippet I always see one of the
> RT user tasks as the second guy. BTW, it has to be RT tasks; this never
> triggered with CFS tasks.
>
> [ 57.849268] CPU2 nr_running=2
> [ 57.852241] p=migration/2
> [ 57.854967] p=task0-0

I can also trigger the BUG_ON() using the built-in locktorture module
(+enabling hotplug torture), and it happens very early on. I can't trigger
it under qemu sadly :/ Also, in my case it's always a kworker:

[ 0.830462] CPU3 nr_running=2
[ 0.833443] p=migration/3
[ 0.836150] p=kworker/3:0

I'm looking into what workqueue.c is doing about hotplug...

2020-09-25 17:53:25

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support


On 25/09/20 13:19, Valentin Schneider wrote:
> On 25/09/20 12:58, Dietmar Eggemann wrote:
>> With Valentin's print_rq()-inspired test snippet I always see one of the
>> RT user tasks as the second guy. BTW, it has to be RT tasks; this never
>> triggered with CFS tasks.
>>
>> [ 57.849268] CPU2 nr_running=2
>> [ 57.852241] p=migration/2
>> [ 57.854967] p=task0-0
>
> I can also trigger the BUG_ON() using the built-in locktorture module
> (+enabling hotplug torture), and it happens very early on. I can't trigger
> it under qemu sadly :/ Also, in my case it's always a kworker:
>
> [ 0.830462] CPU3 nr_running=2
> [ 0.833443] p=migration/3
> [ 0.836150] p=kworker/3:0
>
> I'm looking into what workqueue.c is doing about hotplug...

So with
- The pending migration fixup ([email protected])
- The workqueue set_cpus_allowed_ptr() change (from IRC)
- The set_rq_offline() move + DL/RT pull && rq->online (also from IRC)

my Juno survives rtmutex + hotplug locktorture, where it would previously
explode < 1s after boot (mostly due to the workqueue thing).

I stared a bit more at the rq_offline() + DL/RT bits and they look fine to
me.

The one thing I'm not entirely sure about is that while you plugged the
class->balance() hole, AIUI we might still get RT (DL?) pull callbacks
enqueued - say, if we just unthrottled an RT RQ and something changes the
priority of one of the freshly-released tasks (user or rtmutex
interaction) - I don't see any stopgap preventing a pull from happening.

I slapped the following on top of my kernel and it didn't die, although I'm
not sure I'm correctly stressing this path. Perhaps we could limit that to
the pull paths, since technically we're okay with pushing out of an !online
RQ.

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50aac5b6db26..00d1a7b85e97 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1403,7 +1403,7 @@ queue_balance_callback(struct rq *rq,
 {
 	lockdep_assert_held(&rq->lock);
 
-	if (unlikely(head->next))
+	if (unlikely(head->next || !rq->online))
 		return;
 
 	head->func = (void (*)(struct callback_head *))func;
---

by Sebastian Andrzej Siewior

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support

On 2020-09-21 18:35:57 [+0200], Peter Zijlstra wrote:
> Hi,
Hi,

> Here's my take on migrate_disable(). It avoids growing a second means of

I have here:

|005: numa_remove_cpu cpu 5 node 0: mask now 0,3-4,6-7
|007: smpboot: CPU 5 is now offline
|006: ------------[ cut here ]------------
|006: rq->balance_callback
|006: WARNING: CPU: 6 PID: 8392 at kernel/sched/sched.h:1234 try_to_wake_up+0x696/0x860
|006: Modules linked in:
|006:
|006: CPU: 6 PID: 8392 Comm: hackbench Not tainted 5.9.0-rc6-rt9+ #60
|006: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1 04/01/2014
|006: RIP: 0010:try_to_wake_up+0x696/0x860
|006: Code: c0 01 00 00 01 e9 d9 fb ff ff 80 3d 90 ef 6d 01 00 0f 85 6c fb ff ff 48 c7 c7 d4 4a 2c 82 c6 05 7c ef 6d 01 01 e8 dd 21 fc ff <0f> 0b e9 52 fb ff ff 0f 0b e9 b2
|006: RSP: 0018:ffffc90005b978f8 EFLAGS: 00010082
|006:
|006: RAX: 0000000000000000 RBX: ffff8882755cca40 RCX: 0000000000000000
|006: RDX: ffffffff8247aab8 RSI: 00000000ffffffff RDI: 00000000ffffffff
|006: RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffff8247a9a0
|006: R10: ffffc90005b97838 R11: 332e39313320205b R12: ffff888276da8600
|006: R13: 0000000000000093 R14: ffff8882755cd7a0 R15: ffff888276da8618
|006: FS: 00007f6fa7805740(0000) GS:ffff888276d80000(0000) knlGS:0000000000000000
|006: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
|006: CR2: 00007f6fa796af90 CR3: 0000000262588000 CR4: 00000000003506e0
|006: Call Trace:
|006: ? cpu_stop_queue_work+0x8e/0x150
|006: __wake_up_q+0x96/0xc0
|006: cpu_stop_queue_work+0x9a/0x150
|006: finish_task_switch.isra.0+0x2f1/0x460
|006: __schedule+0x3bd/0xb20
|006: schedule+0x4a/0x100
|006: schedule_hrtimeout_range_clock+0x14f/0x160
|006: ? rt_spin_unlock+0x39/0x90
|006: ? rt_mutex_futex_unlock+0xcb/0xe0
|006: poll_schedule_timeout.constprop.0+0x4d/0x90
|006: do_sys_poll+0x314/0x430
|006: ? __lock_acquire+0x39b/0x2010
|006: ? poll_schedule_timeout.constprop.0+0x90/0x90
|006: ? mark_held_locks+0x49/0x70
|006: ? find_held_lock+0x2b/0x80
|006: ? rt_spin_unlock+0x39/0x90
|006: ? rt_mutex_futex_unlock+0xcb/0xe0
|006: ? rt_spin_unlock+0x51/0x90
|006: ? handle_mm_fault+0xfbd/0x1510
|006: ? find_held_lock+0x2b/0x80
|006: ? do_user_addr_fault+0x214/0x420
|006: __x64_sys_poll+0x37/0x130
|006: do_syscall_64+0x33/0x40
|006: entry_SYSCALL_64_after_hwframe+0x44/0xa9
|006: RIP: 0033:0x7f6fa78fb483

Is this somewhere among the fixes Valentin received?
It's the SCHED_WARN_ON(rq->balance_callback); in rq_pin_lock().

Sebastian
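
For readers following along: the assertion in question lives in
rq_pin_lock(), roughly like so in the RT tree at this point (sketch):

static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
	rf->cookie = lockdep_pin_lock(&rq->lock);

#ifdef CONFIG_SCHED_DEBUG
	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
	rf->clock_update_flags = 0;
	/* a balance callback queued on a CPU going down may never run;
	 * catching a stale one here is what fires in the splat above */
	SCHED_WARN_ON(rq->balance_callback);
#endif
}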

2020-09-25 20:15:46

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support


On 25/09/20 19:17, Sebastian Andrzej Siewior wrote:
> On 2020-09-21 18:35:57 [+0200], Peter Zijlstra wrote:
>> Hi,
> Hi,
>
>> Here's my take on migrate_disable(). It avoids growing a second means of
>
> I have here:
>
> |005: numa_remove_cpu cpu 5 node 0: mask now 0,3-4,6-7
> |007: smpboot: CPU 5 is now offline
> |006: ------------[ cut here ]------------
> |006: rq->balance_callback
> |006: WARNING: CPU: 6 PID: 8392 at kernel/sched/sched.h:1234 try_to_wake_up+0x696/0x860
> |006: Modules linked in:
> |006:
> |006: CPU: 6 PID: 8392 Comm: hackbench Not tainted 5.9.0-rc6-rt9+ #60
> |006: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1 04/01/2014
> |006: RIP: 0010:try_to_wake_up+0x696/0x860
> |006: Code: c0 01 00 00 01 e9 d9 fb ff ff 80 3d 90 ef 6d 01 00 0f 85 6c fb ff ff 48 c7 c7 d4 4a 2c 82 c6 05 7c ef 6d 01 01 e8 dd 21 fc ff <0f> 0b e9 52 fb ff ff 0f 0b e9 b2
> |006: RSP: 0018:ffffc90005b978f8 EFLAGS: 00010082
> |006:
> |006: RAX: 0000000000000000 RBX: ffff8882755cca40 RCX: 0000000000000000
> |006: RDX: ffffffff8247aab8 RSI: 00000000ffffffff RDI: 00000000ffffffff
> |006: RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffff8247a9a0
> |006: R10: ffffc90005b97838 R11: 332e39313320205b R12: ffff888276da8600
> |006: R13: 0000000000000093 R14: ffff8882755cd7a0 R15: ffff888276da8618
> |006: FS: 00007f6fa7805740(0000) GS:ffff888276d80000(0000) knlGS:0000000000000000
> |006: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> |006: CR2: 00007f6fa796af90 CR3: 0000000262588000 CR4: 00000000003506e0
> |006: Call Trace:
> |006: ? cpu_stop_queue_work+0x8e/0x150
> |006: __wake_up_q+0x96/0xc0
> |006: cpu_stop_queue_work+0x9a/0x150
> |006: finish_task_switch.isra.0+0x2f1/0x460
> |006: __schedule+0x3bd/0xb20
> |006: schedule+0x4a/0x100
> |006: schedule_hrtimeout_range_clock+0x14f/0x160
> |006: ? rt_spin_unlock+0x39/0x90
> |006: ? rt_mutex_futex_unlock+0xcb/0xe0
> |006: poll_schedule_timeout.constprop.0+0x4d/0x90
> |006: do_sys_poll+0x314/0x430
> |006: ? __lock_acquire+0x39b/0x2010
> |006: ? poll_schedule_timeout.constprop.0+0x90/0x90
> |006: ? mark_held_locks+0x49/0x70
> |006: ? find_held_lock+0x2b/0x80
> |006: ? rt_spin_unlock+0x39/0x90
> |006: ? rt_mutex_futex_unlock+0xcb/0xe0
> |006: ? rt_spin_unlock+0x51/0x90
> |006: ? handle_mm_fault+0xfbd/0x1510
> |006: ? find_held_lock+0x2b/0x80
> |006: ? do_user_addr_fault+0x214/0x420
> |006: __x64_sys_poll+0x37/0x130
> |006: do_syscall_64+0x33/0x40
> |006: entry_SYSCALL_64_after_hwframe+0x44/0xa9
> |006: RIP: 0033:0x7f6fa78fb483
>
> Is this somewhere among the fixes Valentin received?
> It's the SCHED_WARN_ON(rq->balance_callback); in rq_pin_lock().
>

The IRC handout so far is:
https://paste.debian.net/1164646/
https://paste.debian.net/1164656/

As for your splat, I think this is what I was worrying about wrt
suppressing callbacks in the switch but not preventing them from being
queued. Perhaps the below is "better" than what I previously sent.

Technically this should be doable with a cpu_active() check instead, given
this all gets flipped in sched_cpu_deactivate(), but at least this makes it
obvious that PUSH suppresses any other callback.

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50aac5b6db26..40d78a20fbcb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1403,7 +1403,7 @@ queue_balance_callback(struct rq *rq,
 {
 	lockdep_assert_held(&rq->lock);
 
-	if (unlikely(head->next))
+	if (unlikely(head->next) || (rq->balance_flags & BALANCE_PUSH))
 		return;
 
 	head->func = (void (*)(struct callback_head *))func;
---

> Sebastian

2020-09-29 09:19:23

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support

On 25/09/2020 19:49, Valentin Schneider wrote:
>
> On 25/09/20 13:19, Valentin Schneider wrote:
>> On 25/09/20 12:58, Dietmar Eggemann wrote:
>>> With Valentin's print_rq()-inspired test snippet I always see one of the
>>> RT user tasks as the second guy. BTW, it has to be RT tasks; this never
>>> triggered with CFS tasks.
>>>
>>> [ 57.849268] CPU2 nr_running=2
>>> [ 57.852241] p=migration/2
>>> [ 57.854967] p=task0-0
>>
>> I can also trigger the BUG_ON() using the built-in locktorture module
>> (+enabling hotplug torture), and it happens very early on. I can't trigger
>> it under qemu sadly :/ Also, in my case it's always a kworker:
>>
>> [ 0.830462] CPU3 nr_running=2
>> [ 0.833443] p=migration/3
>> [ 0.836150] p=kworker/3:0
>>
>> I'm looking into what workqueue.c is doing about hotplug...
>
> So with
> - The pending migration fixup ([email protected])
> - The workqueue set_cpus_allowed_ptr() change (from IRC)
> - The set_rq_offline() move + DL/RT pull && rq->online (also from IRC)
>
> my Juno survives rtmutex + hotplug locktorture, where it would previously
> explode < 1s after boot (mostly due to the workqueue thing).
>
> I stared a bit more at the rq_offline() + DL/RT bits and they look fine to
> me.
>
> The one thing I'm not entirely sure about is that while you plugged the
> class->balance() hole, AIUI we might still get RT (DL?) pull callbacks
> enqueued - say, if we just unthrottled an RT RQ and something changes the
> priority of one of the freshly-released tasks (user or rtmutex
> interaction) - I don't see any stopgap preventing a pull from happening.
>
> I slapped the following on top of my kernel and it didn't die, although I'm
> not sure I'm correctly stressing this path. Perhaps we could limit that to
> the pull paths, since technically we're okay with pushing out of an !online
> RQ.
>
> ---
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 50aac5b6db26..00d1a7b85e97 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1403,7 +1403,7 @@ queue_balance_callback(struct rq *rq,
>  {
>  	lockdep_assert_held(&rq->lock);
>  
> -	if (unlikely(head->next))
> +	if (unlikely(head->next || !rq->online))
>  		return;
>  
>  	head->func = (void (*)(struct callback_head *))func;
> ---

When I use the original patch-set (i.e. without the pending migration
fixup and the two changes from IRC), it looks like the RT task is
already on the rq before the rq_offline_rt() -> __disable_runtime() call.

pr_crit("CPU%d X: %d %d %lu %lu %d %d %d %llu %llu\n",
	cpu_of(rq), rq->nr_running,
	rt_rq->rt_nr_running, rt_rq->rt_nr_migratory,
	rt_rq->rt_nr_total, rt_rq->overloaded,
	rt_rq->rt_queued, rt_rq->rt_throttled,
	rt_rq->rt_time, rt_rq->rt_runtime);

X = 1 : in rq_offline_rt()    before __disable_runtime()
    2 : in rq_offline_rt()    after  __disable_runtime()
    3 : in rq_online_rt()     before __enable_runtime()
    4 : in rq_online_rt()     after  __enable_runtime()
    5 : in sched_cpu_dying()  if (rq->nr_running > 1)

*[ 70.369719] CPU0 1: 1 1 1 1 0 1 0 36093160 950000000
*[ 70.374689] CPU0 2: 1 1 1 1 0 1 0 36093160 18446744073709551615
*[ 70.380615] CPU0 3: 1 1 1 1 0 1 0 36093160 18446744073709551615
*[ 70.386540] CPU0 4: 1 1 1 1 0 1 0 0 950000000
[ 70.395637] CPU1 1: 1 0 0 0 0 0 0 31033300 950000000
[ 70.400606] CPU1 2: 1 0 0 0 0 0 0 31033300 18446744073709551615
[ 70.406532] CPU1 3: 1 0 0 0 0 0 0 31033300 18446744073709551615
[ 70.412457] CPU1 4: 1 0 0 0 0 0 0 0 950000000
[ 70.421609] CPU4 1: 0 0 0 0 0 0 0 19397300 950000000
[ 70.426577] CPU4 2: 0 0 0 0 0 0 0 19397300 18446744073709551615
[ 70.432503] CPU4 3: 0 0 0 0 0 0 0 19397300 18446744073709551615
[ 70.438428] CPU4 4: 0 0 0 0 0 0 0 0 950000000
[ 70.484133] CPU3 3: 2 0 0 0 0 0 0 3907020 18446744073709551615
[ 70.489984] CPU3 4: 2 0 0 0 0 0 0 0 950000000
[ 70.540112] CPU2 3: 1 0 0 0 0 0 0 3605180 18446744073709551615
[ 70.545953] CPU2 4: 1 0 0 0 0 0 0 0 950000000
*[ 70.647548] CPU0 1: 2 1 1 1 0 1 0 5150760 950000000
*[ 70.652441] CPU0 2: 2 1 1 1 0 1 0 5150760 18446744073709551615
*[ 70.658281] CPU0 nr_running=2
[ 70.661255] p=migration/0
[ 70.664022] p=task0-4
*[ 70.666384] CPU0 5: 2 1 1 1 0 1 0 5150760 18446744073709551615
[ 70.672230] ------------[ cut here ]------------
[ 70.676850] kernel BUG at kernel/sched/core.c:7346!
[ 70.681733] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 70.687223] Modules linked in:
[ 70.690284] CPU: 0 PID: 11 Comm: migration/0 Not tainted 5.9.0-rc1-00134-g7104613975b6-dirty #173
[ 70.699168] Hardware name: ARM Juno development board (r0) (DT)
[ 70.705107] Stopper: multi_cpu_stop+0x0/0x170 <- 0x0
[ 70.710078] pstate: 20000085 (nzCv daIf -PAN -UAO BTYPE=--)
[ 70.715661] pc : sched_cpu_dying+0x210/0x250
[ 70.719936] lr : sched_cpu_dying+0x204/0x250
[ 70.724207] sp : ffff800011e7bc60
[ 70.727521] x29: ffff800011e7bc80 x28: 0000000000000002
[ 70.732840] x27: 0000000000000000 x26: ffff800011ab30c0
[ 70.738159] x25: ffff8000112d37e0 x24: ffff800011ab30c0
[ 70.743477] x23: ffff800011ab3440 x22: ffff000975e40790
[ 70.748796] x21: 0000000000000080 x20: 0000000000000000
[ 70.754115] x19: ffff00097ef591c0 x18: 0000000000000010
[ 70.759433] x17: 0000000000000000 x16: 0000000000000000
[ 70.764752] x15: ffff000975cf2108 x14: ffffffffffffffff
[ 70.770070] x13: ffff800091e7b9e7 x12: ffff800011e7b9ef
[ 70.775388] x11: ffff800011ac2000 x10: ffff800011ce86d0
[ 70.780707] x9 : 0000000000000001 x8 : ffff800011ce9000
[ 70.786026] x7 : ffff8000106edad8 x6 : 000000000000131c
[ 70.791344] x5 : ffff00097ef4f230 x4 : 0000000000000000
[ 70.796662] x3 : 0000000000000027 x2 : 414431aad459c700
[ 70.801981] x1 : 0000000000000000 x0 : 0000000000000002
[ 70.807299] Call trace:
[ 70.809747] sched_cpu_dying+0x210/0x250
[ 70.813676] cpuhp_invoke_callback+0x88/0x210
[ 70.818038] take_cpu_down+0x7c/0xd8
[ 70.821617] multi_cpu_stop+0xac/0x170
[ 70.825369] cpu_stopper_thread+0x98/0x130
[ 70.829469] smpboot_thread_fn+0x1c4/0x280
[ 70.833570] kthread+0x140/0x160
[ 70.836801] ret_from_fork+0x10/0x34
[ 70.840384] Code: 94011fd8 b9400660 7100041f 54000040 (d4210000)
[ 70.846487] ---[ end trace 7eb0e0efe803dcfe ]---
[ 70.851109] note: migration/0[11] exited with preempt_count 3
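
A note for reading the table above: the huge rt_runtime value is not
garbage but RUNTIME_INF; __disable_runtime() hands the rt_rq's bandwidth
back to the domain and marks it unthrottleable:

/* kernel/sched/sched.h */
#define RUNTIME_INF	((u64)~0ULL)	/* == 18446744073709551615 */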

2020-10-02 14:34:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Migrate disable support

On Fri, Sep 25, 2020 at 08:32:24PM +0100, Valentin Schneider wrote:

> The IRC handout so far is:

I meant to post a new version today, but of course cleanup took
longer than expected.

In any case, there's a slightly updated version here:

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/wip2

it has push/pull balancer changes that sorta work. There's still
something funny with the NO_RT_PUSH_IPI case (and consequently also the
deadline code, which is an exact mirror of that code).

(it looks like it ends up migrating things in circles due to the for
loop in pull_rt_task)
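
For context, pull_rt_task() scans every overloaded CPU in the root domain
and steals its next pushable task; a heavily condensed sketch of that shape
(details elided) shows how concurrent pulls can shuttle tasks around:

static void pull_rt_task(struct rq *this_rq)
{
	int this_cpu = this_rq->cpu, cpu;
	struct task_struct *p;
	struct rq *src_rq;

	for_each_cpu(cpu, this_rq->rd->rto_mask) {
		if (this_cpu == cpu)
			continue;

		src_rq = cpu_rq(cpu);
		double_lock_balance(this_rq, src_rq);

		/* every CPU running this loop eyes the same victims */
		p = pick_highest_pushable_task(src_rq, this_cpu);
		if (p && p->prio < this_rq->rt.highest_prio.curr) {
			deactivate_task(src_rq, p, 0);
			set_task_cpu(p, this_cpu);
			activate_task(this_rq, p, 0);
		}

		double_unlock_balance(this_rq, src_rq);
	}
}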

> As for your splat, I think this is what I was worrying about wrt
> suppressing callbacks in the switch but not preventing them from being
> queued. Perhaps the below is "better" than what I previously sent.
>
> Technically should be doable with a cpu_active() check instead given this
> all gets flipped in sched_cpu_deactivate(), but at least this makes it
> obvious that PUSH suppresses any other callback.
>
> ---
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 50aac5b6db26..40d78a20fbcb 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1403,7 +1403,7 @@ queue_balance_callback(struct rq *rq,
>  {
>  	lockdep_assert_held(&rq->lock);
>  
> -	if (unlikely(head->next))
> +	if (unlikely(head->next) || (rq->balance_flags & BALANCE_PUSH))
>  		return;

Yeah, I like that. Let me go add it.
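
For the record, what eventually landed in mainline expresses the same
suppression with a sentinel callback instead of a flags test; roughly:

static inline void
queue_balance_callback(struct rq *rq,
		       struct callback_head *head,
		       void (*func)(struct rq *))
{
	lockdep_assert_held(&rq->lock);

	/* On a deactivated CPU, rq->balance_callback holds the
	 * balance_push_callback sentinel, which suppresses queueing
	 * anything else; the head->next test additionally keeps a
	 * callback from being queued twice. */
	if (unlikely(head->next || rq->balance_callback == &balance_push_callback))
		return;

	head->func = (void (*)(struct callback_head *))func;
	head->next = rq->balance_callback;
	rq->balance_callback = head;
}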