2019-05-09 19:35:52

by Corey Minyard

[permalink] [raw]
Subject: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

From: Corey Minyard <[email protected]>

The function call do_wait_for_common() has a race condition that
can result in lockups waiting for completions. Adding the thread
to (and removing the thread from) the wait queue for the completion
is done outside the do loop in that function. However, if the thread
is woken up, the swake_up_locked() function will delete the entry
from the wait queue. If that happens and another thread sneaks
in and decrements the done count in the completion to zero, the
loop will go around again, but the thread will no longer be in the
wait queue, so there is no way to wake it up.

Visually, here's a diagram from Sebastian Andrzej Siewior:
T0 T1 T2
wait_for_completion()
do_wait_for_common()
__prepare_to_swait()
schedule()
complete()
x->done++ (0 -> 1)
raw_spin_lock_irqsave()
swake_up_locked() wait_for_completion()
wake_up_process(T0)
list_del_init()
raw_spin_unlock_irqrestore()
raw_spin_lock_irq(&x->wait.lock)
raw_spin_lock_irq(&x->wait.lock) x->done != UINT_MAX, 1 -> 0
raw_spin_unlock_irq(&x->wait.lock)
return 1
while (!x->done && timeout),
continue loop, not enqueued
on &x->wait

Basically, the problem is that the original wait queues used in
completions did not remove the item from the queue in the wakeup
function, but swake_up_locked() does.

Fix it by adding the thread to the wait queue inside the do loop.
The design of swait detects if it is already in the list and doesn't
do the list add again.

Fixes: a04ff6b4ec4ee7e ("completion: Use simple wait queues")
Signed-off-by: Corey Minyard <[email protected]>
---
Changes since v1:
* Only move __prepare_to_swait() into the loop. __prepare_to_swait()
handles if called when already in the list, and of course
__finish_swait() handles if the item is not queued on the
list.

kernel/sched/completion.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 755a58084978..49c14137988e 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -72,12 +72,12 @@ do_wait_for_common(struct completion *x,
if (!x->done) {
DECLARE_SWAITQUEUE(wait);

- __prepare_to_swait(&x->wait, &wait);
do {
if (signal_pending_state(state, current)) {
timeout = -ERESTARTSYS;
break;
}
+ __prepare_to_swait(&x->wait, &wait);
__set_current_state(state);
raw_spin_unlock_irq(&x->wait.lock);
timeout = action(timeout);
--
2.17.1


2019-05-09 19:52:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Thu, 9 May 2019 14:33:20 -0500
[email protected] wrote:

> From: Corey Minyard <[email protected]>
>
> The function call do_wait_for_common() has a race condition that
> can result in lockups waiting for completions. Adding the thread
> to (and removing the thread from) the wait queue for the completion
> is done outside the do loop in that function. However, if the thread
> is woken up, the swake_up_locked() function will delete the entry
> from the wait queue. If that happens and another thread sneaks
> in and decrements the done count in the completion to zero, the
> loop will go around again, but the thread will no longer be in the
> wait queue, so there is no way to wake it up.
>
> Visually, here's a diagram from Sebastian Andrzej Siewior:
> T0 T1 T2
> wait_for_completion()
> do_wait_for_common()
> __prepare_to_swait()
> schedule()
> complete()
> x->done++ (0 -> 1)
> raw_spin_lock_irqsave()
> swake_up_locked() wait_for_completion()
> wake_up_process(T0)
> list_del_init()
> raw_spin_unlock_irqrestore()
> raw_spin_lock_irq(&x->wait.lock)
> raw_spin_lock_irq(&x->wait.lock) x->done != UINT_MAX, 1 -> 0
> raw_spin_unlock_irq(&x->wait.lock)
> return 1
> while (!x->done && timeout),
> continue loop, not enqueued
> on &x->wait
>
> Basically, the problem is that the original wait queues used in
> completions did not remove the item from the queue in the wakeup
> function, but swake_up_locked() does.
>
> Fix it by adding the thread to the wait queue inside the do loop.
> The design of swait detects if it is already in the list and doesn't
> do the list add again.
>
> Fixes: a04ff6b4ec4ee7e ("completion: Use simple wait queues")
> Signed-off-by: Corey Minyard <[email protected]>

Thanks!

Acked-by: Steven Rostedt (VMware) <[email protected]>

-- Steve

> ---
> Changes since v1:
> * Only move __prepare_to_swait() into the loop. __prepare_to_swait()
> handles if called when already in the list, and of course
> __finish_swait() handles if the item is not queued on the
> list.
>
> kernel/sched/completion.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
> index 755a58084978..49c14137988e 100644
> --- a/kernel/sched/completion.c
> +++ b/kernel/sched/completion.c
> @@ -72,12 +72,12 @@ do_wait_for_common(struct completion *x,
> if (!x->done) {
> DECLARE_SWAITQUEUE(wait);
>
> - __prepare_to_swait(&x->wait, &wait);
> do {
> if (signal_pending_state(state, current)) {
> timeout = -ERESTARTSYS;
> break;
> }
> + __prepare_to_swait(&x->wait, &wait);
> __set_current_state(state);
> raw_spin_unlock_irq(&x->wait.lock);
> timeout = action(timeout);

Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On 2019-05-09 14:33:20 [-0500], [email protected] wrote:
> From: Corey Minyard <[email protected]>
>
> The function call do_wait_for_common() has a race condition that
> can result in lockups waiting for completions. Adding the thread
> to (and removing the thread from) the wait queue for the completion
> is done outside the do loop in that function. However, if the thread
> is woken up, the swake_up_locked() function will delete the entry
> from the wait queue. If that happens and another thread sneaks
> in and decrements the done count in the completion to zero, the
> loop will go around again, but the thread will no longer be in the
> wait queue, so there is no way to wake it up.

applied, thank you.

Sebastian

2019-05-10 12:13:28

by Corey Minyard

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Fri, May 10, 2019 at 12:33:18PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-05-09 14:33:20 [-0500], [email protected] wrote:
> > From: Corey Minyard <[email protected]>
> >
> > The function call do_wait_for_common() has a race condition that
> > can result in lockups waiting for completions. Adding the thread
> > to (and removing the thread from) the wait queue for the completion
> > is done outside the do loop in that function. However, if the thread
> > is woken up, the swake_up_locked() function will delete the entry
> > from the wait queue. If that happens and another thread sneaks
> > in and decrements the done count in the completion to zero, the
> > loop will go around again, but the thread will no longer be in the
> > wait queue, so there is no way to wake it up.
>
> applied, thank you.
>
> Sebastian

Thanks a bunch. Do I need to do anything for backports? I'm pretty
sure this goes back to 4.4.

-corey

Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On 2019-05-10 07:08:24 [-0500], Corey Minyard wrote:
> Thanks a bunch. Do I need to do anything for backports? I'm pretty
> sure this goes back to 4.4.

No, you don't. I plan to release a new 5.0-RT today. Then the stable
maintainer will pick it at once they get to it.
4.4 is maintained by Daniel Wagner.

> -corey

Sebastian

2019-06-29 01:49:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Fri, 10 May 2019 12:33:18 +0200
Sebastian Andrzej Siewior <[email protected]> wrote:

> On 2019-05-09 14:33:20 [-0500], [email protected] wrote:
> > From: Corey Minyard <[email protected]>
> >
> > The function call do_wait_for_common() has a race condition that
> > can result in lockups waiting for completions. Adding the thread
> > to (and removing the thread from) the wait queue for the completion
> > is done outside the do loop in that function. However, if the thread
> > is woken up, the swake_up_locked() function will delete the entry
> > from the wait queue. If that happens and another thread sneaks
> > in and decrements the done count in the completion to zero, the
> > loop will go around again, but the thread will no longer be in the
> > wait queue, so there is no way to wake it up.
>
> applied, thank you.
>

When I applied this patch to 4.19-rt, I get the following lock up:

watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [sh:745]
Modules linked in: floppy i915 drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect syscopyarea iosf_mbi i2c_algo_bit video
CPU: 2 PID: 745 Comm: sh Not tainted 4.19.56-test-rt23+ #16
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
RIP: 0010:_raw_spin_unlock_irq+0x17/0x4d
Code: 48 8b 12 0f ba e2 12 73 07 e8 f1 4a 92 ff 31 c0 5b 5d c3 66 66 66 66 90 55 48 89 e5 c6 07 00 e8 de 3d a3 ff fb bf 01 00 00 00 <e8> a7 27 9a ff 65 8b 05 c8 7f 93 7e 85 c0 74 1f a9 ff ff
ff 7f 75
RSP: 0018:ffffc90000c8bbb8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffffc90000c8bd58 RCX: 0000000000000003
RDX: 0000000000000000 RSI: ffffffff8108ffab RDI: 0000000000000001
RBP: ffffc90000c8bbb8 R08: ffffffff816dcd76 R09: 0000000000020600
R10: 0000000000000400 R11: 0000001c0eef1808 R12: ffffc90000c8bbc8
R13: ffffc90000f13ca0 R14: ffff888074b2d7d8 R15: ffff8880789efe10
FS: 0000000000000000(0000) GS:ffff88807b300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000030662001b8 CR3: 00000000376ac000 CR4: 00000000000006e0
Call Trace:
swake_up_all+0xa6/0xde
__d_lookup_done+0x7c/0xc7
__d_add+0x44/0xf7
d_splice_alias+0x208/0x218
ext4_lookup+0x1a6/0x1c5
path_openat+0x63a/0xb15
? preempt_latency_stop+0x25/0x27
do_filp_open+0x51/0xae
? trace_preempt_on+0xde/0xe7
? rt_spin_unlock+0x13/0x24
? __alloc_fd+0x145/0x155
do_sys_open+0x81/0x125
__x64_sys_open+0x21/0x23
do_syscall_64+0x5c/0x6e
entry_SYSCALL_64_after_hwframe+0x49/0xbe

I haven't really looked too much into it though. I ran out of time :-/

-- Steve

2019-07-01 19:12:16

by Corey Minyard

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Fri, Jun 28, 2019 at 09:49:03PM -0400, Steven Rostedt wrote:
> On Fri, 10 May 2019 12:33:18 +0200
> Sebastian Andrzej Siewior <[email protected]> wrote:
>
> > On 2019-05-09 14:33:20 [-0500], [email protected] wrote:
> > > From: Corey Minyard <[email protected]>
> > >
> > > The function call do_wait_for_common() has a race condition that
> > > can result in lockups waiting for completions. Adding the thread
> > > to (and removing the thread from) the wait queue for the completion
> > > is done outside the do loop in that function. However, if the thread
> > > is woken up, the swake_up_locked() function will delete the entry
> > > from the wait queue. If that happens and another thread sneaks
> > > in and decrements the done count in the completion to zero, the
> > > loop will go around again, but the thread will no longer be in the
> > > wait queue, so there is no way to wake it up.
> >
> > applied, thank you.
> >
>
> When I applied this patch to 4.19-rt, I get the following lock up:

I was unable to reproduce, and I looked at the code and I can't really
see a connection between this change and this crash.

Can you reproduce at will? If so, can you send a testcase?

-corey

>
> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [sh:745]
> Modules linked in: floppy i915 drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect syscopyarea iosf_mbi i2c_algo_bit video
> CPU: 2 PID: 745 Comm: sh Not tainted 4.19.56-test-rt23+ #16
> Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
> RIP: 0010:_raw_spin_unlock_irq+0x17/0x4d
> Code: 48 8b 12 0f ba e2 12 73 07 e8 f1 4a 92 ff 31 c0 5b 5d c3 66 66 66 66 90 55 48 89 e5 c6 07 00 e8 de 3d a3 ff fb bf 01 00 00 00 <e8> a7 27 9a ff 65 8b 05 c8 7f 93 7e 85 c0 74 1f a9 ff ff
> ff 7f 75
> RSP: 0018:ffffc90000c8bbb8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000000000 RBX: ffffc90000c8bd58 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: ffffffff8108ffab RDI: 0000000000000001
> RBP: ffffc90000c8bbb8 R08: ffffffff816dcd76 R09: 0000000000020600
> R10: 0000000000000400 R11: 0000001c0eef1808 R12: ffffc90000c8bbc8
> R13: ffffc90000f13ca0 R14: ffff888074b2d7d8 R15: ffff8880789efe10
> FS: 0000000000000000(0000) GS:ffff88807b300000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000030662001b8 CR3: 00000000376ac000 CR4: 00000000000006e0
> Call Trace:
> swake_up_all+0xa6/0xde
> __d_lookup_done+0x7c/0xc7
> __d_add+0x44/0xf7
> d_splice_alias+0x208/0x218
> ext4_lookup+0x1a6/0x1c5
> path_openat+0x63a/0xb15
> ? preempt_latency_stop+0x25/0x27
> do_filp_open+0x51/0xae
> ? trace_preempt_on+0xde/0xe7
> ? rt_spin_unlock+0x13/0x24
> ? __alloc_fd+0x145/0x155
> do_sys_open+0x81/0x125
> __x64_sys_open+0x21/0x23
> do_syscall_64+0x5c/0x6e
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> I haven't really looked too much into it though. I ran out of time :-/
>
> -- Steve

2019-07-01 20:19:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Mon, 1 Jul 2019 14:09:49 -0500
Corey Minyard <[email protected]> wrote:

> On Fri, Jun 28, 2019 at 09:49:03PM -0400, Steven Rostedt wrote:
> > On Fri, 10 May 2019 12:33:18 +0200
> > Sebastian Andrzej Siewior <[email protected]> wrote:

> >
> > When I applied this patch to 4.19-rt, I get the following lock up:
>
> I was unable to reproduce, and I looked at the code and I can't really
> see a connection between this change and this crash.
>
> Can you reproduce at will? If so, can you send a testcase?
>

Yes, it wont boot. There is no testcase as I don't even make it to a
boot prompt. I applied the patch and it crashes, I remove the patch and
it boots without issue.

Attached is the full dmesg and the config used. I applied it to

ae97a0ba0197fb424008a317b79bebacd6a50213
Linux 4.19.56-rt23

It works fine for 5.0.14-rt9 where it was added.

-- Steve


Attachments:
(No filename) (902.00 B)
config (126.93 kB)
dmesg (45.61 kB)
Download all attachments

2019-07-01 20:45:27

by Corey Minyard

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Mon, Jul 01, 2019 at 04:18:40PM -0400, Steven Rostedt wrote:
> On Mon, 1 Jul 2019 14:09:49 -0500
> Corey Minyard <[email protected]> wrote:
>
> > On Fri, Jun 28, 2019 at 09:49:03PM -0400, Steven Rostedt wrote:
> > > On Fri, 10 May 2019 12:33:18 +0200
> > > Sebastian Andrzej Siewior <[email protected]> wrote:
>
> > >
> > > When I applied this patch to 4.19-rt, I get the following lock up:
> >
> > I was unable to reproduce, and I looked at the code and I can't really
> > see a connection between this change and this crash.
> >
> > Can you reproduce at will? If so, can you send a testcase?
> >
>
> Yes, it wont boot. There is no testcase as I don't even make it to a
> boot prompt. I applied the patch and it crashes, I remove the patch and
> it boots without issue.
>
> Attached is the full dmesg and the config used. I applied it to
>
> ae97a0ba0197fb424008a317b79bebacd6a50213
> Linux 4.19.56-rt23
>
> It works fine for 5.0.14-rt9 where it was added.
>
> -- Steve

I show that patch is already applied at

1921ea799b7dc561c97185538100271d88ee47db
sched/completion: Fix a lockup in wait_for_completion()

git describe --contains 1921ea799b7dc561c97185538100271d88ee47db
v4.19.37-rt20~1

So I'm not sure what is going on.

-corey

2019-07-01 21:07:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Mon, 1 Jul 2019 15:43:25 -0500
Corey Minyard <[email protected]> wrote:


> I show that patch is already applied at
>
> 1921ea799b7dc561c97185538100271d88ee47db
> sched/completion: Fix a lockup in wait_for_completion()
>
> git describe --contains 1921ea799b7dc561c97185538100271d88ee47db
> v4.19.37-rt20~1
>
> So I'm not sure what is going on.

Bah, I'm replying to the wrong commit that I'm having issues with.

I searched your name to find the patch that is of trouble, and picked
this one.

I'll go find the problem patch, sorry for the noise on this one.

-- Steve

2019-07-01 21:14:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Mon, 1 Jul 2019 17:06:02 -0400
Steven Rostedt <[email protected]> wrote:

> On Mon, 1 Jul 2019 15:43:25 -0500
> Corey Minyard <[email protected]> wrote:
>
>
> > I show that patch is already applied at
> >
> > 1921ea799b7dc561c97185538100271d88ee47db
> > sched/completion: Fix a lockup in wait_for_completion()
> >
> > git describe --contains 1921ea799b7dc561c97185538100271d88ee47db
> > v4.19.37-rt20~1
> >
> > So I'm not sure what is going on.
>
> Bah, I'm replying to the wrong commit that I'm having issues with.
>
> I searched your name to find the patch that is of trouble, and picked
> this one.
>
> I'll go find the problem patch, sorry for the noise on this one.
>

No, I did reply to the right email, but it wasn't the top patch I was
having issues with. It was the patch I replied to:

This change below that Sebastian marked as stable-rt is what is causing
me an issue. Not the patch that started the thread.

-- Steve


> Now.. that will fix it, but I think it is also wrong.
>
> The problem being that it violates FIFO, something that might be more
> important on -RT than elsewhere.
>
> The regular wait API seems confused/inconsistent when it uses
> autoremove_wake_function and default_wake_function, which doesn't help,
> but we can easily support this with swait -- the problematic thing is
> the custom wake functions, we musn't do that.
>
> (also, mingo went and renamed a whole bunch of wait_* crap and didn't do
> the same to swait_ so now its named all different :/)
>
> Something like the below perhaps.
>
> ---
> diff --git a/include/linux/swait.h b/include/linux/swait.h
> index 73e06e9986d4..f194437ae7d2 100644
> --- a/include/linux/swait.h
> +++ b/include/linux/swait.h
> @@ -61,11 +61,13 @@ struct swait_queue_head {
> struct swait_queue {
> struct task_struct *task;
> struct list_head task_list;
> + unsigned int remove;
> };
>
> #define __SWAITQUEUE_INITIALIZER(name) { \
> .task = current, \
> .task_list = LIST_HEAD_INIT((name).task_list), \
> + .remove = 1, \
> }
>
> #define DECLARE_SWAITQUEUE(name) \
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index e83a3f8449f6..86974ecbabfc 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -28,7 +28,8 @@ void swake_up_locked(struct swait_queue_head *q)
>
> curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
> wake_up_process(curr->task);
> - list_del_init(&curr->task_list);
> + if (curr->remove)
> + list_del_init(&curr->task_list);
> }
> EXPORT_SYMBOL(swake_up_locked);
>
> @@ -57,7 +58,8 @@ void swake_up_all(struct swait_queue_head *q)
> curr = list_first_entry(&tmp, typeof(*curr), task_list);
>
> wake_up_state(curr->task, TASK_NORMAL);
> - list_del_init(&curr->task_list);
> + if (curr->remove)
> + list_del_init(&curr->task_list);
>
> if (list_empty(&tmp))
> break;

2019-07-01 21:29:10

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Mon, 1 Jul 2019 17:13:33 -0400
Steven Rostedt <[email protected]> wrote:

> On Mon, 1 Jul 2019 17:06:02 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > On Mon, 1 Jul 2019 15:43:25 -0500
> > Corey Minyard <[email protected]> wrote:
> >
> >
> > > I show that patch is already applied at
> > >
> > > 1921ea799b7dc561c97185538100271d88ee47db
> > > sched/completion: Fix a lockup in wait_for_completion()
> > >
> > > git describe --contains 1921ea799b7dc561c97185538100271d88ee47db
> > > v4.19.37-rt20~1
> > >
> > > So I'm not sure what is going on.
> >
> > Bah, I'm replying to the wrong commit that I'm having issues with.
> >
> > I searched your name to find the patch that is of trouble, and picked
> > this one.
> >
> > I'll go find the problem patch, sorry for the noise on this one.
> >
>
> No, I did reply to the right email, but it wasn't the top patch I was
> having issues with. It was the patch I replied to:
>
> This change below that Sebastian marked as stable-rt is what is causing
> me an issue. Not the patch that started the thread.
>

In fact, my system doesn't boot with this commit in 5.0-rt.

If I revert 90e1b18eba2ae4a729 ("swait: Delete the task from after a
wakeup occured") the machine boots again.

Sebastian, I think that's a bad commit, please revert it.

Thanks!

-- Steve

>
>
> > Now.. that will fix it, but I think it is also wrong.
> >
> > The problem being that it violates FIFO, something that might be more
> > important on -RT than elsewhere.
> >
> > The regular wait API seems confused/inconsistent when it uses
> > autoremove_wake_function and default_wake_function, which doesn't help,
> > but we can easily support this with swait -- the problematic thing is
> > the custom wake functions, we musn't do that.
> >
> > (also, mingo went and renamed a whole bunch of wait_* crap and didn't do
> > the same to swait_ so now its named all different :/)
> >
> > Something like the below perhaps.
> >
> > ---
> > diff --git a/include/linux/swait.h b/include/linux/swait.h
> > index 73e06e9986d4..f194437ae7d2 100644
> > --- a/include/linux/swait.h
> > +++ b/include/linux/swait.h
> > @@ -61,11 +61,13 @@ struct swait_queue_head {
> > struct swait_queue {
> > struct task_struct *task;
> > struct list_head task_list;
> > + unsigned int remove;
> > };
> >
> > #define __SWAITQUEUE_INITIALIZER(name) { \
> > .task = current, \
> > .task_list = LIST_HEAD_INIT((name).task_list), \
> > + .remove = 1, \
> > }
> >
> > #define DECLARE_SWAITQUEUE(name) \
> > diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> > index e83a3f8449f6..86974ecbabfc 100644
> > --- a/kernel/sched/swait.c
> > +++ b/kernel/sched/swait.c
> > @@ -28,7 +28,8 @@ void swake_up_locked(struct swait_queue_head *q)
> >
> > curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
> > wake_up_process(curr->task);
> > - list_del_init(&curr->task_list);
> > + if (curr->remove)
> > + list_del_init(&curr->task_list);
> > }
> > EXPORT_SYMBOL(swake_up_locked);
> >
> > @@ -57,7 +58,8 @@ void swake_up_all(struct swait_queue_head *q)
> > curr = list_first_entry(&tmp, typeof(*curr), task_list);
> >
> > wake_up_state(curr->task, TASK_NORMAL);
> > - list_del_init(&curr->task_list);
> > + if (curr->remove)
> > + list_del_init(&curr->task_list);
> >
> > if (list_empty(&tmp))
> > break;
>

2019-07-01 21:37:05

by Corey Minyard

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Mon, Jul 01, 2019 at 05:28:25PM -0400, Steven Rostedt wrote:
> On Mon, 1 Jul 2019 17:13:33 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > On Mon, 1 Jul 2019 17:06:02 -0400
> > Steven Rostedt <[email protected]> wrote:
> >
> > > On Mon, 1 Jul 2019 15:43:25 -0500
> > > Corey Minyard <[email protected]> wrote:
> > >
> > >
> > > > I show that patch is already applied at
> > > >
> > > > 1921ea799b7dc561c97185538100271d88ee47db
> > > > sched/completion: Fix a lockup in wait_for_completion()
> > > >
> > > > git describe --contains 1921ea799b7dc561c97185538100271d88ee47db
> > > > v4.19.37-rt20~1
> > > >
> > > > So I'm not sure what is going on.
> > >
> > > Bah, I'm replying to the wrong commit that I'm having issues with.
> > >
> > > I searched your name to find the patch that is of trouble, and picked
> > > this one.
> > >
> > > I'll go find the problem patch, sorry for the noise on this one.
> > >
> >
> > No, I did reply to the right email, but it wasn't the top patch I was
> > having issues with. It was the patch I replied to:
> >
> > This change below that Sebastian marked as stable-rt is what is causing
> > me an issue. Not the patch that started the thread.
> >
>
> In fact, my system doesn't boot with this commit in 5.0-rt.
>
> If I revert 90e1b18eba2ae4a729 ("swait: Delete the task from after a
> wakeup occured") the machine boots again.
>
> Sebastian, I think that's a bad commit, please revert it.

Yeah. d_wait_lookup() does not use __SWAITQUEUE_INITIALIZER() to
intitialize it's queue item, but uses swake_up_all(), so it goes
into an infinite loop since it won't remove the item because remove
isn't set.

I'd suspect there are other places this is the case.

-corey

>
> Thanks!
>
> -- Steve
>
> >
> >
> > > Now.. that will fix it, but I think it is also wrong.
> > >
> > > The problem being that it violates FIFO, something that might be more
> > > important on -RT than elsewhere.
> > >
> > > The regular wait API seems confused/inconsistent when it uses
> > > autoremove_wake_function and default_wake_function, which doesn't help,
> > > but we can easily support this with swait -- the problematic thing is
> > > the custom wake functions, we musn't do that.
> > >
> > > (also, mingo went and renamed a whole bunch of wait_* crap and didn't do
> > > the same to swait_ so now its named all different :/)
> > >
> > > Something like the below perhaps.
> > >
> > > ---
> > > diff --git a/include/linux/swait.h b/include/linux/swait.h
> > > index 73e06e9986d4..f194437ae7d2 100644
> > > --- a/include/linux/swait.h
> > > +++ b/include/linux/swait.h
> > > @@ -61,11 +61,13 @@ struct swait_queue_head {
> > > struct swait_queue {
> > > struct task_struct *task;
> > > struct list_head task_list;
> > > + unsigned int remove;
> > > };
> > >
> > > #define __SWAITQUEUE_INITIALIZER(name) { \
> > > .task = current, \
> > > .task_list = LIST_HEAD_INIT((name).task_list), \
> > > + .remove = 1, \
> > > }
> > >
> > > #define DECLARE_SWAITQUEUE(name) \
> > > diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> > > index e83a3f8449f6..86974ecbabfc 100644
> > > --- a/kernel/sched/swait.c
> > > +++ b/kernel/sched/swait.c
> > > @@ -28,7 +28,8 @@ void swake_up_locked(struct swait_queue_head *q)
> > >
> > > curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
> > > wake_up_process(curr->task);
> > > - list_del_init(&curr->task_list);
> > > + if (curr->remove)
> > > + list_del_init(&curr->task_list);
> > > }
> > > EXPORT_SYMBOL(swake_up_locked);
> > >
> > > @@ -57,7 +58,8 @@ void swake_up_all(struct swait_queue_head *q)
> > > curr = list_first_entry(&tmp, typeof(*curr), task_list);
> > >
> > > wake_up_state(curr->task, TASK_NORMAL);
> > > - list_del_init(&curr->task_list);
> > > + if (curr->remove)
> > > + list_del_init(&curr->task_list);
> > >
> > > if (list_empty(&tmp))
> > > break;
> >
>

2019-07-02 07:05:20

by Kurt Kanzenbach

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

Hi,

On Mon, Jul 01, 2019 at 05:28:25PM -0400, Steven Rostedt wrote:
> On Mon, 1 Jul 2019 17:13:33 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > On Mon, 1 Jul 2019 17:06:02 -0400
> > Steven Rostedt <[email protected]> wrote:
> >
> > > On Mon, 1 Jul 2019 15:43:25 -0500
> > > Corey Minyard <[email protected]> wrote:
> > >
> > >
> > > > I show that patch is already applied at
> > > >
> > > > 1921ea799b7dc561c97185538100271d88ee47db
> > > > sched/completion: Fix a lockup in wait_for_completion()
> > > >
> > > > git describe --contains 1921ea799b7dc561c97185538100271d88ee47db
> > > > v4.19.37-rt20~1
> > > >
> > > > So I'm not sure what is going on.
> > >
> > > Bah, I'm replying to the wrong commit that I'm having issues with.
> > >
> > > I searched your name to find the patch that is of trouble, and picked
> > > this one.
> > >
> > > I'll go find the problem patch, sorry for the noise on this one.
> > >
> >
> > No, I did reply to the right email, but it wasn't the top patch I was
> > having issues with. It was the patch I replied to:
> >
> > This change below that Sebastian marked as stable-rt is what is causing
> > me an issue. Not the patch that started the thread.
> >
>
> In fact, my system doesn't boot with this commit in 5.0-rt.
>
> If I revert 90e1b18eba2ae4a729 ("swait: Delete the task from after a
> wakeup occured") the machine boots again.
>
> Sebastian, I think that's a bad commit, please revert it.

I'm having the same problem on a Cyclone V based ARM board. Reverting
this commit solves the boot issue for me as well.

Thanks,
Kurt


Attachments:
(No filename) (1.60 kB)
signature.asc (849.00 B)
Download all attachments
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On 2019-07-02 09:04:18 [+0200], Kurt Kanzenbach wrote:
> > In fact, my system doesn't boot with this commit in 5.0-rt.
> >
> > If I revert 90e1b18eba2ae4a729 ("swait: Delete the task from after a
> > wakeup occured") the machine boots again.
> >
> > Sebastian, I think that's a bad commit, please revert it.
>
> I'm having the same problem on a Cyclone V based ARM board. Reverting
> this commit solves the boot issue for me as well.

Okay. So the original Corey fix as in v5.0.14-rt9 works for everyone.
Peter's version as I picked it up for v5.0.21-rt14 is causing problems
for two persons now.

I'm leaning towards reverting it back to old version for now…

> Thanks,
> Kurt

Sebastian

2019-07-02 11:42:08

by Corey Minyard

[permalink] [raw]
Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On Tue, Jul 02, 2019 at 10:35:36AM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-02 09:04:18 [+0200], Kurt Kanzenbach wrote:
> > > In fact, my system doesn't boot with this commit in 5.0-rt.
> > >
> > > If I revert 90e1b18eba2ae4a729 ("swait: Delete the task from after a
> > > wakeup occured") the machine boots again.
> > >
> > > Sebastian, I think that's a bad commit, please revert it.
> >
> > I'm having the same problem on a Cyclone V based ARM board. Reverting
> > this commit solves the boot issue for me as well.
>
> Okay. So the original Corey fix as in v5.0.14-rt9 works for everyone.
> Peter's version as I picked it up for v5.0.21-rt14 is causing problems
> for two persons now.
>
> I'm leaning towards reverting it back to old version for now…

Just to avoid confusion... it wasn't my patch 1921ea799b7dc56
(sched/completion: Fix a lockup in wait_for_completion()) that caused
the issue, nor was it Peter's version of it. Instead, it was the patch
mentioned above, 90e1b18eba2ae4a729 ("swait: Delete the task from after a
wakeup occured"), which came from someone else. I can verify by visual
inspection that that patch is broken and it should definitely be removed.
Just don't want someone to be confused and remove the wrong patch.

-corey


>
> > Thanks,
> > Kurt
>
> Sebastian

Subject: Re: [PATCH RT v2] Fix a lockup in wait_for_completion() and friends

On 2019-07-02 06:40:28 [-0500], Corey Minyard wrote:
> On Tue, Jul 02, 2019 at 10:35:36AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-07-02 09:04:18 [+0200], Kurt Kanzenbach wrote:
> > > > In fact, my system doesn't boot with this commit in 5.0-rt.
> > > >
> > > > If I revert 90e1b18eba2ae4a729 ("swait: Delete the task from after a
> > > > wakeup occured") the machine boots again.
> > > >
> > > > Sebastian, I think that's a bad commit, please revert it.
> > >
> > > I'm having the same problem on a Cyclone V based ARM board. Reverting
> > > this commit solves the boot issue for me as well.
> >
> > Okay. So the original Corey fix as in v5.0.14-rt9 works for everyone.
> > Peter's version as I picked it up for v5.0.21-rt14 is causing problems
> > for two persons now.
> >
> > I'm leaning towards reverting it back to old version for now…
>
> Just to avoid confusion... it wasn't my patch 1921ea799b7dc56
> (sched/completion: Fix a lockup in wait_for_completion()) that caused
> the issue, nor was it Peter's version of it. Instead, it was the patch
> mentioned above, 90e1b18eba2ae4a729 ("swait: Delete the task from after a
> wakeup occured"), which came from someone else. I can verify by visual
> inspection that that patch is broken and it should definitely be removed.
> Just don't want someone to be confused and remove the wrong patch.

The commit 90e1b18eba2ae4a729 is delta of reverting your patch and
adding Peter's patch instead. If you look into the queue
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/diff/patches/swait-Delete-the-task-from-after-a-wakeup-occured.patch?h=linux-5.0.y-rt-patches&id=8ef6644ae2ac8fc18f157c3deb70fa9acb95a486

is what Peter suggested in the thread and this
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/diff/patches/sched-completion-Fix-a-lockup-in-wait_for_completion.patch?h=linux-5.0.y-rt-patches&id=8ef6644ae2ac8fc18f157c3deb70fa9acb95a486

is the removal.

> -corey
Sebastian